Re: [PATCH 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Ingo Molnar

* Asias He asias.he...@gmail.com wrote:

 As virtio spec says:
 
 
  Because this is high importance and low bandwidth, the current Linux
  implementation polls for the buffer to be used, rather than waiting
 for an interrupt, simplifying the implementation significantly.
 
 
 drivers/char/virtio_console.c
  send_buf() {
  ...
   /* Tell Host to go! */
   virtqueue_kick(out_vq);
  ...
 while (!virtqueue_get_buf(out_vq, &len))
 cpu_relax();
  ...
  }
 
 The console hang can easily be reproduced with the yes command, which
 generates a tremendous amount of console IO and IRQs.
 
 [   16.786440] irq 4: nobody cared (try booting with the irqpoll option)
 [   16.786440] Pid: 1437, comm: yes Tainted: G        W 2.6.39-rc6+ #56
 [   16.786440] Call Trace:
 [   16.786440]  [<c16578eb>] __report_bad_irq+0x30/0x89
 [   16.786440]  [<c10980e6>] note_interrupt+0x118/0x17a
 [   16.786440]  [<c1096e7d>] handle_irq_event_percpu+0x168/0x179
 [   16.786440]  [<c1096eba>] handle_irq_event+0x2c/0x46
 [   16.786440]  [<c1098516>] ? unmask_irq+0x1e/0x1e
 [   16.786440]  [<c1098566>] handle_level_irq+0x50/0x6e
 [   16.786440]  <IRQ>  [<c102fa69>] ? do_IRQ+0x35/0x7f
 [   16.786440]  [<c1665ea9>] ? common_interrupt+0x29/0x30
 [   16.786440]  [<c16610d6>] ? _raw_spin_unlock_irqrestore+0x7/0x28
 [   16.786440]  [<c1364f65>] ? hvc_write+0x88/0x9e
 [   16.786440]  [<c1355500>] ? do_output_char+0x88/0x18a
 [   16.786440]  [<c1355631>] ? process_output+0x2f/0x42
 [   16.786440]  [<c1355af6>] ? n_tty_write+0x211/0x2dc
 [   16.786440]  [<c1059d77>] ? try_to_wake_up+0x226/0x226
 [   16.786440]  [<c13534a4>] ? tty_write+0x15e/0x1d1
 [   16.786440]  [<c12c1644>] ? security_file_permission+0x22/0x26
 [   16.786440]  [<c13558e5>] ? process_echoes+0x241/0x241
 [   16.786440]  [<c10dd9d2>] ? vfs_write+0x84/0xd7
 [   16.786440]  [<c1353346>] ? tty_write_lock+0x3d/0x3d
 [   16.786440]  [<c10ddb92>] ? sys_write+0x3b/0x5d
 [   16.786440]  [<c166594c>] ? sysenter_do_call+0x12/0x22
 [   16.786440] handlers:
 [   16.786440] [<c1351397>] (vp_interrupt+0x0/0x3a)
 [   16.786440] Disabling IRQ #4

Hm, why is irq #4 active if the guest-side virtio console driver does not 
handle it?

 Signed-off-by: Asias He asias.he...@gmail.com
 ---
  tools/kvm/virtio/console.c |2 --
  1 files changed, 0 insertions(+), 2 deletions(-)
 
 diff --git a/tools/kvm/virtio/console.c b/tools/kvm/virtio/console.c
 index f5449ba..1fecf37 100644
 --- a/tools/kvm/virtio/console.c
 +++ b/tools/kvm/virtio/console.c
 @@ -171,8 +171,6 @@ static void virtio_console_handle_callback(struct kvm 
 *self, void *param)
   len = term_putc_iov(CONSOLE_VIRTIO, iov, out);
   virt_queue__set_used_elem(vq, head, len);
   }
 -
 - virt_queue__trigger_irq(vq, virtio_console_pci_device.irq_line,
 &cdev.isr, self);
  }

I think this at least requires a comment at that place, that we intentionally 
skip notifying the guest, because Linux guests do not use the console IRQ.
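Something like this, perhaps (just a sketch of the kind of comment I mean):

	/*
	 * The Linux virtio_console driver polls for tx completion in
	 * send_buf() instead of waiting for an interrupt, so we
	 * intentionally do not inject an IRQ for the tx path here.
	 */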

Does the guest-side virtio driver *ever* use the irq?

Thanks,

Ingo


Re: Windows Page File

2011-05-08 Thread Yaniv Kaul

On 05/07/2011 10:34 AM, --[ UxBoD ]-- wrote:

Hello all,

Am about to build a new KVM server which will host a number of W2K8 Remote 
Desktop servers. We present storage to the KVM server via an iSCSI LUN, upon 
which the virtual machines are then built.

The question is whether we should use local disks for the virtual machines' 
swap/paging file? Windows always appears to be using its pagefile, so I am 
wondering if that would be better on locally attached storage.

Any help gratefully appreciated.


I don't see the benefit of using local disks - unless your main storage 
is slow and your local disks are fast (RAID, SAS, etc.).

Y.



[PATCH] kvm tools: Enable earlyprintk=serial by default

2011-05-08 Thread Ingo Molnar

Enable the earlyprintk console on the serial port, to allow the debugging of 
very early hangs/crashes.

Since we already enable the serial console by default, this is a natural 
extension of it.

I have tested that it indeed works, by provoking an early hang that triggers 
after the early console is enabled but before the real console is registered. 
In that case, before the patch we get:

  $ ./kvm run --cpus 2
  [ silent hang ]

With this patch applied I got the early output:

 $ ./kvm run --cpus 60
 [0.00] console [earlyser0] enabled
 [0.00] Initializing cgroup subsys cpu
 [0.00] Linux version 2.6.39-rc6-tip-02944-g87b0bcf-dirty 
(mingo@aldebaran) (gcc version 4.6.0 20110419 (Red Hat 4.6.0-5) (GCC) ) #84 SMP 
Mon May 9 02:34:26 CEST 2011
 [0.00] Command line: notsc noapic noacpi pci=conf1 console=ttyS0 
earlyprintk=serialroot=/dev/vda1 rw 
 [0.00] locking up the box!

Signed-off-by: Ingo Molnar mi...@elte.hu
---
 tools/kvm/kvm-run.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
index 764a242..eb50b6a 100644
--- a/tools/kvm/kvm-run.c
+++ b/tools/kvm/kvm-run.c
@@ -409,7 +409,7 @@ int kvm_cmd_run(int argc, const char **argv, const char 
*prefix)
kvm-nrcpus = nrcpus;
 
memset(real_cmdline, 0, sizeof(real_cmdline));
-   strcpy(real_cmdline, "notsc noapic noacpi pci=conf1 console=ttyS0 ");
+   strcpy(real_cmdline, "notsc noapic noacpi pci=conf1 console=ttyS0 earlyprintk=serial");
if (kernel_cmdline)
strlcat(real_cmdline, kernel_cmdline, sizeof(real_cmdline));
 


[PATCH 0/30] nVMX: Nested VMX, v9

2011-05-08 Thread Nadav Har'El
Hi,

This is the ninth iteration of the nested VMX patch set. This iteration
addresses all of the comments and requests that were raised by reviewers in
the previous rounds, with only a few exceptions, listed below.

Some of the issues which were solved in this version include:

 * Overhauled the hardware VMCS (vmcs02) allocation. Previously we had up to
   256 vmcs02s, one for each L2. Now we only have one, which is reused.
   We also have a compile-time option VMCS02_POOL_SIZE to keep a bigger pool
   of vmcs02s. This option will be useful in the future if vmcs02 won't be
   filled from scratch on each entry from L1 to L2 (currently, it is).

 * The vmcs01 structure, containing a copy of all fields from L1's VMCS, was
   unnecessary, as all the necessary values are either known to KVM or appear
   in vmcs12. This structure is now gone for good.

 * There is no longer a vmcs_fields sub-structure that everyone disliked.
   All the VMCS fields appear directly in the vmcs12 structure, which makes
   the code simpler and more readable.

 * Make sure that the vmcs12 fields have fixed sizes and location, and add
   some extra padding, to support live migration and improve future-proofing.

 * For some fields, nested exit used to fail to return the host-state as set
   by L1. Fixed that.

 * nested_vmx_exit_handled (deciding whether to let L1 handle an exit, or handle it
   in L0 and return to L2) is now more correct, and handles more exit reasons.

 * Complete overhaul of the cr0, exception bitmap, cr3 and cr4 handling code.
   The code is now shorter (uses existing functions like kvm_set_cr3, etc.),
   more readable, and more uniform (no pieces of code for enable_ept and not,
   less special code for cr0.TS, and none of that ugly cr0.PG monkey-business).

 * Use kvm_register_write(), kvm_rip_read(), etc. Got rid of new and now
   unneeded function sync_cached_regs_to_vcms().

 * Fix return value of the VMX msrs to be more correct, and more constant
   (not to needlessly vary on different hosts).

 * Added some more missing verifications to vmcs12's fields (cleanly failing
   the nested entry if these verifications fail).

 * Expose the MSR-bitmap feature to L1. Every MSR access still exits to L0,
   but slow exits to L1 are avoided when L1's MSR bitmap doesn't want it.

 * Removed or rate limited printouts which could be exploited by guests.

 * Fix VM_ENTRY_LOAD_IA32_PAT feature handling.

 * Fixed potential bug and verified that nested vmx now works with both
   CONFIG_PREEMPT and CONFIG_SMP enabled.

 * Dozens of other code cleanups and bug fixes.

Only a few issues from previous reviews remain unaddressed. These are:

 * The interrupt injection and IDT_VECTORING_INFO_FIELD handling code was
   still not rewritten. It works, though ;-)

 * No KVM autotests for nested VMX yet.

 * Merging of L0's and L1's MSR bitmaps (and IO bitmaps) is still not
   supported. As explained above, the current code uses L1's MSR bitmap
   to avoid costly exits to L1, but still suffers exits to L0 on each
   MSR access in L2.

 * Still no option for disabling some capabilities advertised to L1.

 * No support for TPR_SHADOW feature for L1.

This new set of patches applies to the current KVM trunk (I checked with
082f9eced53d50c136e42d072598da4be4b9ba23).
If you wish, you can also check out an already-patched version of KVM from
branch nvmx9 of the repository:
 git://github.com/nyh/kvm-nested-vmx.git


About nested VMX:
-----------------

The following 30 patches implement nested VMX support. This feature enables
a guest to use the VMX APIs in order to run its own nested guests.
In other words, it allows running hypervisors (that use VMX) under KVM.
Multiple guest hypervisors can be run concurrently, and each of those can
in turn host multiple guests.
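Schematically (with L0 denoting bare-metal KVM):

  L0 (KVM on bare-metal hardware)
   `-- L1 (a guest hypervisor, itself using VMX)
        `-- L2 (a nested guest, run by L1)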

The theory behind this work, our implementation, and its performance
characteristics were presented in OSDI 2010 (the USENIX Symposium on
Operating Systems Design and Implementation). Our paper was titled
The Turtles Project: Design and Implementation of Nested Virtualization,
and was awarded Jay Lepreau Best Paper. The paper is available online, at:

http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf

This patch set does not include all the features described in the paper.
In particular, this patch set is missing nested EPT (L1 can't use EPT and
must use shadow page tables). It is also missing some features required to
run VMWare hypervisors as a guest. These missing features will be sent as
follow-on patches.

Running nested VMX:
-------------------

The nested VMX feature is currently disabled by default. It must be
explicitly enabled with the nested=1 option to the kvm-intel module.
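For example (assuming kvm-intel is built as a module):

  modprobe kvm-intel nested=1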

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the VMX CPU feature, so it must be
explicitly enabled, by giving qemu one of the following options:

 -cpu host  (emulated CPU has all 

[PATCH 01/30] nVMX: Add nested module option to kvm_intel

2011-05-08 Thread Nadav Har'El
This patch adds to kvm_intel a module option nested. This option controls
whether the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. When nested VMX matures, the default
should probably be changed to enable nested VMX by default - just like
nested SVM is currently enabled by default.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   25 +
 1 file changed, 25 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:17.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:17.0 +0300
@@ -72,6 +72,14 @@ module_param(vmm_exclusive, bool, S_IRUG
 static int __read_mostly yield_on_hlt = 1;
 module_param(yield_on_hlt, bool, S_IRUGO);
 
+/*
+ * If nested=1, nested virtualization is supported, i.e., guests may use
+ * VMX and be a hypervisor for its own guests. If nested=0, guests may not
+ * use VMX instructions.
+ */
+static int __read_mostly nested = 0;
+module_param(nested, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST  \
(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK \
@@ -1261,6 +1269,23 @@ static u64 vmx_compute_tsc_offset(struct
return target_tsc - native_read_tsc();
 }
 
+static bool guest_cpuid_has_vmx(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry(vcpu, 1, 0);
+   return best && (best->ecx & (1 << (X86_FEATURE_VMX & 31)));
+}
+
+/*
+ * nested_vmx_allowed() checks whether a guest should be allowed to use VMX
+ * instructions and MSRs (i.e., nested VMX). Nested VMX is disabled for
+ * all guests if the nested module option is off, and can also be disabled
+ * for a single guest by disabling its VMX cpuid bit.
+ */
+static inline bool nested_vmx_allowed(struct kvm_vcpu *vcpu)
+{
+   return nested && guest_cpuid_has_vmx(vcpu);
+}
+
 /*
  * Reads an msr value (of 'msr_index') into 'pdata'.
  * Returns 0 on success, non-0 otherwise.


[PATCH 02/30] nVMX: Implement VMXON and VMXOFF

2011-05-08 Thread Nadav Har'El
This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  110 ++-
 1 file changed, 108 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:17.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:17.0 +0300
@@ -130,6 +130,15 @@ struct shared_msr_entry {
u64 mask;
 };
 
+/*
+ * The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
+ */
+struct nested_vmx {
+   /* Has the level1 guest done vmxon? */
+   bool vmxon;
+};
+
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
struct list_head  local_vcpus_link;
@@ -184,6 +193,9 @@ struct vcpu_vmx {
u32 exit_reason;
 
bool rdtscp_enabled;
+
+   /* Support for a guest hypervisor (nested VMX) */
+   struct nested_vmx nested;
 };
 
 enum segment_cache_field {
@@ -3890,6 +3902,99 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called VMXON pointer) because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+   struct kvm_segment cs;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   /* The Intel VMX Instruction Reference lists a bunch of bits that
+* are prerequisite to running VMXON, most notably cr4.VMXE must be
+* set to 1 (see vmx_set_cr4() for when we allow the guest to set this).
+* Otherwise, we should fail with #UD. We test these now:
+*/
+   if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
+   !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
+   (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+   if (is_long_mode(vcpu) && !cs.l) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (vmx_get_cpl(vcpu)) {
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+   }
+
+   vmx->nested.vmxon = true;
+
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
+/*
+ * Intel's VMX Instruction Reference specifies a common set of prerequisites
+ * for running VMX instructions (except VMXON, whose prerequisites are
+ * slightly different). It also specifies what exception to inject otherwise.
+ */
+static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
+{
+   struct kvm_segment cs;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   if (!vmx->nested.vmxon) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 0;
+   }
+
+   vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+   if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
+   (is_long_mode(vcpu) && !cs.l)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 0;
+   }
+
+   if (vmx_get_cpl(vcpu)) {
+   kvm_inject_gp(vcpu, 0);
+   return 0;
+   }
+
+   return 1;
+}
+
+/*
+ * Free whatever needs to be freed from vmx->nested when L1 goes down, or
+ * just stops using VMX.
+ */
+static void free_nested(struct vcpu_vmx *vmx)
+{
+   if (!vmx->nested.vmxon)
+   return;
+   vmx->nested.vmxon = false;
+}
+
+/* Emulate the VMXOFF instruction */
+static int handle_vmoff(struct kvm_vcpu *vcpu)
+{
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+   free_nested(to_vmx(vcpu));
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3917,8 +4022,8 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_VMREAD]  = handle_vmx_insn,
[EXIT_REASON_VMRESUME]= handle_vmx_insn,
[EXIT_REASON_VMWRITE] = handle_vmx_insn,
-   [EXIT_REASON_VMOFF]   = handle_vmx_insn,
-   [EXIT_REASON_VMON]= handle_vmx_insn,
+   [EXIT_REASON_VMOFF]   = handle_vmoff,
+   [EXIT_REASON_VMON]= handle_vmon,
[EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold,
[EXIT_REASON_APIC_ACCESS] 

[PATCH 03/30] nVMX: Allow setting the VMXE bit in CR4

2011-05-08 Thread Nadav Har'El
This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so its checking has moved to kvm_x86_ops->set_cr4(). This function
now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() to throw a #GP.

Turning on the VMXE bit is allowed only when the nested VMX feature is
enabled, and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/include/asm/kvm_host.h |2 +-
 arch/x86/kvm/svm.c  |6 +-
 arch/x86/kvm/vmx.c  |   17 +++--
 arch/x86/kvm/x86.c  |4 +---
 4 files changed, 22 insertions(+), 7 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h 2011-05-08 10:43:17.0 
+0300
+++ .after/arch/x86/include/asm/kvm_host.h  2011-05-08 10:43:17.0 
+0300
@@ -559,7 +559,7 @@ struct kvm_x86_ops {
void (*decache_cr4_guest_bits)(struct kvm_vcpu *vcpu);
void (*set_cr0)(struct kvm_vcpu *vcpu, unsigned long cr0);
void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);
-   void (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
+   int (*set_cr4)(struct kvm_vcpu *vcpu, unsigned long cr4);
void (*set_efer)(struct kvm_vcpu *vcpu, u64 efer);
void (*get_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
void (*set_idt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
--- .before/arch/x86/kvm/svm.c  2011-05-08 10:43:17.0 +0300
+++ .after/arch/x86/kvm/svm.c   2011-05-08 10:43:17.0 +0300
@@ -1496,11 +1496,14 @@ static void svm_set_cr0(struct kvm_vcpu 
update_cr0_intercept(svm);
 }
 
-static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 unsigned long host_cr4_mce = read_cr4() & X86_CR4_MCE;
 unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
+   if (cr4 & X86_CR4_VMXE)
+   return 1;
+
 if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
svm_flush_tlb(vcpu);
 
@@ -1510,6 +1513,7 @@ static void svm_set_cr4(struct kvm_vcpu 
cr4 |= host_cr4_mce;
 to_svm(vcpu)->vmcb->save.cr4 = cr4;
 mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
+   return 0;
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
--- .before/arch/x86/kvm/x86.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/x86.c   2011-05-08 10:43:18.0 +0300
@@ -615,11 +615,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
   kvm_read_cr3(vcpu)))
return 1;
 
-   if (cr4 & X86_CR4_VMXE)
+   if (kvm_x86_ops->set_cr4(vcpu, cr4))
return 1;
 
-   kvm_x86_ops->set_cr4(vcpu, cr4);
-
 if ((cr4 ^ old_cr4) & pdptr_bits)
kvm_mmu_reset_context(vcpu);
 
--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:18.0 +0300
@@ -2078,7 +2078,7 @@ static void ept_save_pdptrs(struct kvm_v
   (unsigned long *)&vcpu->arch.regs_dirty);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 
 static void ept_update_paging_mode_cr0(unsigned long *hw_cr0,
unsigned long cr0,
@@ -2175,11 +2175,23 @@ static void vmx_set_cr3(struct kvm_vcpu 
vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-static void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
 
+   if (cr4 & X86_CR4_VMXE) {
+   /*
+* To use VMXON (and later other VMX instructions), a guest
+* must first be able to turn on cr4.VMXE (see handle_vmon()).
+* So basically the check on whether to allow nested VMX
+* is here.
+*/
+   if (!nested_vmx_allowed(vcpu))
+   return 1;
+   } else if (to_vmx(vcpu)->nested.vmxon)
+   return 1;
+
 vcpu->arch.cr4 = cr4;
if (enable_ept) {
if (!is_paging(vcpu)) {
@@ -2192,6 +2204,7 @@ static void vmx_set_cr4(struct kvm_vcpu 
 
vmcs_writel(CR4_READ_SHADOW, cr4);
vmcs_writel(GUEST_CR4, hw_cr4);
+   return 0;
 }
 
 static void vmx_get_segment(struct kvm_vcpu *vcpu,


[PATCH 04/30] nVMX: Introduce vmcs12: a VMCS structure for L1

2011-05-08 Thread Nadav Har'El
An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (which can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it vmcs12, as it is the VMCS
that L1 keeps for its L2 guest. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's current
VMCS, and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   75 +++
 1 file changed, 75 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:18.0 +0300
@@ -131,12 +131,53 @@ struct shared_msr_entry {
 };
 
 /*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure, and vmcs12 is our emulated VMX's VMCS. This structure is
+ * stored in guest memory specified by VMPTRLD, but is opaque to the guest,
+ * which must access it using VMREAD/VMWRITE/VMCLEAR instructions.
+ * More than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a vmcs02: a VMCS for the
+ * underlying hardware which will be used to run L2.
+ * This structure is packed to ensure that its layout is identical across
+ * machines (necessary for live migration).
+ * If there are changes in this struct, VMCS12_REVISION must be changed.
+ */
+struct __packed vmcs12 {
+   /* According to the Intel spec, a VMCS region must start with the
+* following two fields. Then follow implementation-specific data.
+*/
+   u32 revision_id;
+   u32 abort;
+};
+
+/*
+ * VMCS12_REVISION is an arbitrary id that should be changed if the content or
+ * layout of struct vmcs12 is changed. MSR_IA32_VMX_BASIC returns this id, and
+ * VMPTRLD verifies that the VMCS region that L1 is loading contains this id.
+ */
+#define VMCS12_REVISION 0x11e57ed0
+
+/*
+ * VMCS12_SIZE is the number of bytes L1 should allocate for the VMXON region
+ * and any VMCS region. Although only sizeof(struct vmcs12) are used by the
+ * current implementation, 4K are reserved to avoid future complications.
+ */
+#define VMCS12_SIZE 0x1000
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
  */
 struct nested_vmx {
/* Has the level1 guest done vmxon? */
bool vmxon;
+
+   /* The guest-physical address of the current VMCS L1 keeps for L2 */
+   gpa_t current_vmptr;
+   /* The host-usable pointer to the above */
+   struct page *current_vmcs12_page;
+   struct vmcs12 *current_vmcs12;
 };
 
 struct vcpu_vmx {
@@ -212,6 +253,31 @@ static inline struct vcpu_vmx *to_vmx(st
return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+static inline struct vmcs12 *get_vmcs12(struct kvm_vcpu *vcpu)
+{
+   return to_vmx(vcpu)->nested.current_vmcs12;
+}
+
+static struct page *nested_get_page(struct kvm_vcpu *vcpu, gpa_t addr)
+{
+   struct page *page = gfn_to_page(vcpu->kvm, addr >> PAGE_SHIFT);
+   if (is_error_page(page)) {
+   kvm_release_page_clean(page);
+   return NULL;
+   }
+   return page;
+}
+
+static void nested_release_page(struct page *page)
+{
+   kvm_release_page_dirty(page);
+}
+
+static void nested_release_page_clean(struct page *page)
+{
+   kvm_release_page_clean(page);
+}
+
 static u64 construct_eptp(unsigned long root_hpa);
 static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
@@ -3995,6 +4061,12 @@ static void free_nested(struct vcpu_vmx 
 if (!vmx->nested.vmxon)
 return;
 vmx->nested.vmxon = false;
+   if (vmx->nested.current_vmptr != -1ull) {
+   kunmap(vmx->nested.current_vmcs12_page);
+   nested_release_page(vmx->nested.current_vmcs12_page);
+   vmx->nested.current_vmptr = -1ull;
+   vmx->nested.current_vmcs12 = NULL;
+   }
 }
 
 /* Emulate the VMXOFF instruction */
@@ -4518,6 +4590,9 @@ static struct kvm_vcpu *vmx_create_vcpu(
goto free_vmcs;
}
 
+   vmx->nested.current_vmptr = -1ull;
+   vmx->nested.current_vmcs12 = NULL;
+
 return &vmx->vcpu;
 
 free_vmcs:


[PATCH 05/30] nVMX: Implement reading and writing of VMX MSRs

2011-05-08 Thread Nadav Har'El
When the guest can use VMX instructions (when the nested module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.
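For example, an L1 guest can query the allowed processor-based controls with a
plain rdmsr (a sketch using the standard Linux rdmsr() macro; not part of this
patch):

	u32 low, high;

	rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, low, high);
	/* bits set in low are mandatory-1; bits clear in high must stay 0 */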

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/include/asm/msr-index.h |   12 ++
 arch/x86/kvm/vmx.c   |  174 +
 2 files changed, 186 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:18.0 +0300
@@ -1365,6 +1365,176 @@ static inline bool nested_vmx_allowed(st
 }
 
 /*
+ * nested_vmx_pinbased_ctls() returns the value which is to be returned
+ * for MSR_IA32_VMX_PINBASED_CTLS, and also determines the legal setting of
+ * vmcs12->pin_based_vm_exec_control. See the spec and vmx_control_verify() for
+ * the meaning of the low and high halves of this MSR.
+ * TODO: allow the return value to be modified (downgraded) by module options
+ * or other means.
+ */
+static inline void nested_vmx_pinbased_ctls(u32 *low, u32 *high)
+{
+   /*
+* According to the Intel spec, if bit 55 of VMX_BASIC is off (as it is
+* in our case), bits 1, 2 and 4 (i.e., 0x16) must be 1 in this MSR.
+*/
+   *low = 0x16;
+   /* Allow only these bits to be 1 */
+   *high = 0x16 | PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING
+| PIN_BASED_VIRTUAL_NMIS;
+}
+
+static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
+{
+   /*
+* Bits that are 0 in high must be 0 in control, and bits that are 1
+* in low must be 1 in control.
+*/
+   return ((control & high) | low) == control;
+}
+
+/*
+ * If we allow our guest to use VMX instructions (i.e., nested VMX), we should
+ * also let it use VMX-specific MSRs.
+ * vmx_get_vmx_msr() and vmx_set_vmx_msr() return 1 when we handled a
+ * VMX-specific MSR, or 0 when we haven't (and the caller should handle it
+ * like all other MSRs).
+ */
+static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+   u32 vmx_msr_high, vmx_msr_low;
+
+   if (!nested_vmx_allowed(vcpu) && msr_index >= MSR_IA32_VMX_BASIC &&
+msr_index <= MSR_IA32_VMX_TRUE_ENTRY_CTLS) {
+   /*
+* According to the spec, processors which do not support VMX
+* should throw a #GP(0) when VMX capability MSRs are read.
+*/
+   kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
+   return 1;
+   }
+
+   switch (msr_index) {
+   case MSR_IA32_FEATURE_CONTROL:
+   *pdata = 0;
+   break;
+   case MSR_IA32_VMX_BASIC:
+   /*
+* This MSR reports some information about VMX support. We
+* should return information about the VMX we emulate for the
+* guest, and the VMCS structure we give it - not about the
+* VMX support of the underlying hardware.
+*/
+   *pdata = VMCS12_REVISION |
+  ((u64)VMCS12_SIZE << VMX_BASIC_VMCS_SIZE_SHIFT) |
+  (VMX_BASIC_MEM_TYPE_WB << VMX_BASIC_MEM_TYPE_SHIFT);
+   break;
+   case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+   case MSR_IA32_VMX_PINBASED_CTLS:
+   nested_vmx_pinbased_ctls(vmx_msr_low, vmx_msr_high);
+   *pdata = vmx_msr_low | ((u64)vmx_msr_high << 32);
+   break;
+   case MSR_IA32_VMX_TRUE_PROCBASED_CTLS:
+   case MSR_IA32_VMX_PROCBASED_CTLS:
+   /* This MSR determines which vm-execution controls the L1
+* hypervisor may ask, or may not ask, to enable. Normally we
+* can only allow enabling features which the hardware can
+* support, but we limit ourselves to allowing only known
+* features that were tested nested. We can allow disabling any
+* feature (even if the hardware can't disable it) - we just
+* need to enable this feature and hide the extra exits from L1
+*/
+   rdmsr(MSR_IA32_VMX_PROCBASED_CTLS, vmx_msr_low, vmx_msr_high);
+   vmx_msr_low = 0; /* allow disabling any feature */
+   vmx_msr_high &= /* do not expose new untested features */
+   CPU_BASED_HLT_EXITING | CPU_BASED_CR3_LOAD_EXITING |
+   CPU_BASED_CR3_STORE_EXITING | CPU_BASED_USE_IO_BITMAPS |
+   CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING |
+   CPU_BASED_MWAIT_EXITING | CPU_BASED_MONITOR_EXITING |
+   CPU_BASED_INVLPG_EXITING |
+#ifdef CONFIG_X86_64
+   CPU_BASED_CR8_LOAD_EXITING |
+   CPU_BASED_CR8_STORE_EXITING |
+#endif
+   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
+   /*
+* We can allow some features even when not supported by the
+* 

[PATCH 06/30] nVMX: Decoding memory operands of VMX instructions

2011-05-08 Thread Nadav Har'El
This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor).
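A worked example (hypothetical value, not from a real trace):
vmx_instruction_info = 0x198102 decodes to scaling = 2 (scale factor 4),
addr_size = 2 (64-bit), a memory operand (bit 10 clear), seg_reg = 3 (DS),
index_reg = 6 (RSI, valid) and base_reg = 0 (RAX, valid), so the operand
address is DS.base + RAX + (RSI << 2) + the displacement taken from the
exit qualification.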

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   53 +++
 arch/x86/kvm/x86.c |3 +-
 arch/x86/kvm/x86.h |4 +++
 3 files changed, 59 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/x86.c   2011-05-08 10:43:18.0 +0300
@@ -3815,7 +3815,7 @@ static int kvm_fetch_guest_virt(struct x
  exception);
 }
 
-static int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
+int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
   gva_t addr, void *val, unsigned int bytes,
   struct x86_exception *exception)
 {
@@ -3825,6 +3825,7 @@ static int kvm_read_guest_virt(struct x8
return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
  exception);
 }
+EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
 
 static int kvm_read_guest_virt_system(struct x86_emulate_ctxt *ctxt,
  gva_t addr, void *val, unsigned int bytes,
--- .before/arch/x86/kvm/x86.h  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/x86.h   2011-05-08 10:43:18.0 +0300
@@ -81,4 +81,8 @@ int kvm_inject_realmode_interrupt(struct
 
 void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data);
 
+int kvm_read_guest_virt(struct x86_emulate_ctxt *ctxt,
+   gva_t addr, void *val, unsigned int bytes,
+   struct x86_exception *exception);
+
 #endif
--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:18.0 +0300
@@ -4254,6 +4254,59 @@ static int handle_vmoff(struct kvm_vcpu 
 }
 
 /*
+ * Decode the memory-address operand of a vmx instruction, as recorded on an
+ * exit caused by such an instruction (run by a guest hypervisor).
+ * On success, returns 0. When the operand is invalid, returns 1 and throws
+ * #UD or #GP.
+ */
+static int get_vmx_mem_address(struct kvm_vcpu *vcpu,
+unsigned long exit_qualification,
+u32 vmx_instruction_info, gva_t *ret)
+{
+   /*
+* According to Vol. 3B, Information for VM Exits Due to Instruction
+* Execution, on an exit, vmx_instruction_info holds most of the
+* addressing components of the operand. Only the displacement part
+* is put in exit_qualification (see 3B, Basic VM-Exit Information).
+* For how an actual address is calculated from all these components,
+* refer to Vol. 1, Operand Addressing.
+*/
+   int  scaling = vmx_instruction_info & 3;
+   int  addr_size = (vmx_instruction_info >> 7) & 7;
+   bool is_reg = vmx_instruction_info & (1u << 10);
+   int  seg_reg = (vmx_instruction_info >> 15) & 7;
+   int  index_reg = (vmx_instruction_info >> 18) & 0xf;
+   bool index_is_valid = !(vmx_instruction_info & (1u << 22));
+   int  base_reg   = (vmx_instruction_info >> 23) & 0xf;
+   bool base_is_valid  = !(vmx_instruction_info & (1u << 27));
+
+   if (is_reg) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   /* Addr = segment_base + offset */
+   /* offset = base + [index * scale] + displacement */
+   *ret = vmx_get_segment_base(vcpu, seg_reg);
+   if (base_is_valid)
+   *ret += kvm_register_read(vcpu, base_reg);
+   if (index_is_valid)
+   *ret += kvm_register_read(vcpu, index_reg)<<scaling;
+   *ret += exit_qualification; /* holds the displacement */
+
+   if (addr_size == 1) /* 32 bit */
+   *ret &= 0xffffffff;
+
+   /*
+* TODO: throw #GP (and return 1) in various cases that the VM*
+* instructions require it - e.g., offset beyond segment limit,
+* unusable or unreadable/unwritable segment, non-canonical 64-bit
+* address, and so on. Currently these are not checked.
+*/
+   return 0;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.


[PATCH 07/30] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-08 Thread Nadav Har'El
We saw in a previous patch that L1 controls its L2 guest with a vmcs12.
L0 needs to create a real VMCS for running L2. We call that vmcs02.
A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
fields. This patch only contains code for allocating vmcs02.

In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
be reused even when L1 runs multiple L2 guests. However, in future versions
we'll probably want to add an optimization where vmcs02 fields that rarely
change will not be set each time. For that, we may want to keep around several
vmcs02s of L2 guests that have recently run, so that potentially we could run
these L2s again more quickly because fewer vmwrites to vmcs02 will be needed.

This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
i.e., one vmcs02 is allocated (and loaded onto the processor), and it is
reused to enter any L2 guest. In the future, when prepare_vmcs02() is
optimized not to set all fields every time, VMCS02_POOL_SIZE should be
increased.
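To illustrate, a pool lookup could go roughly like this (a sketch only, using
the structures introduced below; the helper name is hypothetical):

	/* Find a previously saved vmcs02 for the current vmcs12, if any */
	static struct saved_vmcs *nested_find_saved_vmcs02(struct vcpu_vmx *vmx)
	{
		struct vmcs02_list *item;

		list_for_each_entry(item, &vmx->nested.vmcs02_pool, list)
			if (item->vmcs12_addr == vmx->nested.current_vmptr)
				return &item->vmcs02;
		return NULL; /* caller must allocate or recycle a vmcs02 */
	}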

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  134 +++
 1 file changed, 134 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:18.0 +0300
@@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
 module_param(ple_window, int, S_IRUGO);
 
 #define NR_AUTOLOAD_MSRS 1
+#define VMCS02_POOL_SIZE 1
 
 struct vmcs {
u32 revision_id;
@@ -166,6 +167,30 @@ struct __packed vmcs12 {
 #define VMCS12_SIZE 0x1000
 
 /*
+ * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
+ * while we use L2's VMCS), and we wish to save the previous VMCS, we must also
+ * remember on which CPU it was last loaded (vcpu->cpu), so when we return to
+ * using this VMCS we'll know if we're now running on a different CPU and need
+ * to clear the VMCS on the old CPU, and load it on the new one. Additionally,
+ * we need to remember whether this VMCS was launched (vmx->launched), so when
+ * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot deduce
+ * this from other state, because it's possible that this VMCS had once been
+ * launched, but has since been cleared after a CPU switch).
+ */
+struct saved_vmcs {
+   struct vmcs *vmcs;
+   int cpu;
+   int launched;
+};
+
+/* Used to remember the last vmcs02 used for some recently used vmcs12s */
+struct vmcs02_list {
+   struct list_head list;
+   gpa_t vmcs12_addr;
+   struct saved_vmcs vmcs02;
+};
+
+/*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
  */
@@ -178,6 +203,10 @@ struct nested_vmx {
/* The host-usable pointer to the above */
struct page *current_vmcs12_page;
struct vmcs12 *current_vmcs12;
+
+   /* vmcs02_list cache of VMCSs recently used to run L2 guests */
+   struct list_head vmcs02_pool;
+   int vmcs02_num;
 };
 
 struct vcpu_vmx {
@@ -4155,6 +4184,106 @@ static int handle_invalid_op(struct kvm_
 }
 
 /*
+ * To run an L2 guest, we need a vmcs02 based on the L1-specified vmcs12.
+ * We could reuse a single VMCS for all the L2 guests, but we also want the
+ * option to allocate a separate vmcs02 for each separate loaded vmcs12 - this
+ * allows keeping them loaded on the processor, and in the future will allow
+ * optimizations where prepare_vmcs02 doesn't need to set all the fields on
+ * every entry if they never change.
+ * So we keep, in vmx->nested.vmcs02_pool, a cache of size VMCS02_POOL_SIZE
+ * (>=0) with a vmcs02 for each recently loaded vmcs12, most recent first.
+ *
+ * The following functions allocate and free a vmcs02 in this pool.
+ */
+
+static void __nested_free_saved_vmcs(void *arg)
+{
+   struct saved_vmcs *saved_vmcs = arg;
+
+   vmcs_clear(saved_vmcs->vmcs);
+   if (per_cpu(current_vmcs, saved_vmcs->cpu) == saved_vmcs->vmcs)
+   per_cpu(current_vmcs, saved_vmcs->cpu) = NULL;
+}
+
+/*
+ * Free a VMCS, but before that VMCLEAR it on the CPU where it was last loaded
+ * (the necessary information is in the saved_vmcs structure).
+ * See also vcpu_clear() (with different parameters and side-effects)
+ */
+static void nested_free_saved_vmcs(struct vcpu_vmx *vmx,
+   struct saved_vmcs *saved_vmcs)
+{
+   if (saved_vmcs->cpu != -1)
+   smp_call_function_single(saved_vmcs->cpu,
+   __nested_free_saved_vmcs, saved_vmcs, 1);
+
+   free_vmcs(saved_vmcs->vmcs);
+}
+
+/* Free and remove from pool a vmcs02 saved for a vmcs12 (if there is one) */
+static void nested_free_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)

[PATCH 08/30] nVMX: Fix local_vcpus_link handling

2011-05-08 Thread Nadav Har'El
In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu vcpus_on_cpu linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, others for
each L2), and each of those may have been last loaded on a different cpu.

This trivial patch changes the code to keep on vcpus_on_cpu only L1 VMCSs.
This fixes crashes on L1 shutdown caused by incorrectly maintaining the linked
lists.

It is not a complete solution, though. It doesn't flush the inactive L1 or L2
VMCSs loaded on a CPU which is being shutdown. Doing this correctly will
probably require replacing the vcpu linked list by a linked list of saved_vmcs
objects (VMCS, cpu and launched), and it is left as a TODO.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:18.0 +0300
@@ -638,7 +638,9 @@ static void __vcpu_clear(void *arg)
 vmcs_clear(vmx->vmcs);
 if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
 per_cpu(current_vmcs, cpu) = NULL;
-   list_del(&vmx->local_vcpus_link);
+   /* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+   if (!is_guest_mode(&vmx->vcpu))
+   list_del(&vmx->local_vcpus_link);
 vmx->vcpu.cpu = -1;
 vmx->launched = 0;
 }
@@ -1100,8 +1102,10 @@ static void vmx_vcpu_load(struct kvm_vcp
 
kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
local_irq_disable();
-   list_add(&vmx->local_vcpus_link,
-&per_cpu(vcpus_on_cpu, cpu));
+   /* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+   if (!is_guest_mode(&vmx->vcpu))
+   list_add(&vmx->local_vcpus_link,
+&per_cpu(vcpus_on_cpu, cpu));
local_irq_enable();
 
/*
@@ -1806,7 +1810,9 @@ static void vmclear_local_vcpus(void)
 
 list_for_each_entry_safe(vmx, n, &per_cpu(vcpus_on_cpu, cpu),
  local_vcpus_link)
-   __vcpu_clear(vmx);
+   /* TODO: currently, local_vcpus_link is just for L1 VMCSs */
+   if (!is_guest_mode(&vmx->vcpu))
+   __vcpu_clear(vmx);
 }
 
 


[PATCH 09/30] nVMX: Add VMCS fields to the vmcs12

2011-05-08 Thread Nadav Har'El
In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
a hardware VMCS for running L2.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  275 +++
 1 file changed, 275 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:18.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:18.0 +0300
@@ -144,12 +144,148 @@ struct shared_msr_entry {
  * machines (necessary for live migration).
  * If there are changes in this struct, VMCS12_REVISION must be changed.
  */
+typedef u64 natural_width;
 struct __packed vmcs12 {
/* According to the Intel spec, a VMCS region must start with the
 * following two fields. Then follow implementation-specific data.
 */
u32 revision_id;
u32 abort;
+
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+   u64 vm_exit_msr_store_addr;
+   u64 vm_exit_msr_load_addr;
+   u64 vm_entry_msr_load_addr;
+   u64 tsc_offset;
+   u64 virtual_apic_page_addr;
+   u64 apic_access_addr;
+   u64 ept_pointer;
+   u64 guest_physical_address;
+   u64 vmcs_link_pointer;
+   u64 guest_ia32_debugctl;
+   u64 guest_ia32_pat;
+   u64 guest_ia32_efer;
+   u64 guest_pdptr0;
+   u64 guest_pdptr1;
+   u64 guest_pdptr2;
+   u64 guest_pdptr3;
+   u64 host_ia32_pat;
+   u64 host_ia32_efer;
+   u64 padding64[8]; /* room for future expansion */
+   /*
+* To allow migration of L1 (complete with its L2 guests) between
+* machines of different natural widths (32 or 64 bit), we cannot have
+* unsigned long fields with no explicit size. We use u64 (aliased
+* natural_width) instead. Luckily, x86 is little-endian.
+*/
+   natural_width cr0_guest_host_mask;
+   natural_width cr4_guest_host_mask;
+   natural_width cr0_read_shadow;
+   natural_width cr4_read_shadow;
+   natural_width cr3_target_value0;
+   natural_width cr3_target_value1;
+   natural_width cr3_target_value2;
+   natural_width cr3_target_value3;
+   natural_width exit_qualification;
+   natural_width guest_linear_address;
+   natural_width guest_cr0;
+   natural_width guest_cr3;
+   natural_width guest_cr4;
+   natural_width guest_es_base;
+   natural_width guest_cs_base;
+   natural_width guest_ss_base;
+   natural_width guest_ds_base;
+   natural_width guest_fs_base;
+   natural_width guest_gs_base;
+   natural_width guest_ldtr_base;
+   natural_width guest_tr_base;
+   natural_width guest_gdtr_base;
+   natural_width guest_idtr_base;
+   natural_width guest_dr7;
+   natural_width guest_rsp;
+   natural_width guest_rip;
+   natural_width guest_rflags;
+   natural_width guest_pending_dbg_exceptions;
+   natural_width guest_sysenter_esp;
+   natural_width guest_sysenter_eip;
+   natural_width host_cr0;
+   natural_width host_cr3;
+   natural_width host_cr4;
+   natural_width host_fs_base;
+   natural_width host_gs_base;
+   natural_width host_tr_base;
+   natural_width host_gdtr_base;
+   natural_width host_idtr_base;
+   natural_width host_ia32_sysenter_esp;
+   natural_width host_ia32_sysenter_eip;
+   natural_width host_rsp;
+   natural_width host_rip;
+   natural_width paddingl[8]; /* room for future expansion */
+   u32 pin_based_vm_exec_control;
+   u32 cpu_based_vm_exec_control;
+   u32 exception_bitmap;
+   u32 page_fault_error_code_mask;
+   u32 page_fault_error_code_match;
+   u32 cr3_target_count;
+   u32 vm_exit_controls;
+   u32 vm_exit_msr_store_count;
+   u32 vm_exit_msr_load_count;
+   u32 vm_entry_controls;
+   u32 vm_entry_msr_load_count;
+   u32 vm_entry_intr_info_field;
+   u32 vm_entry_exception_error_code;
+   u32 vm_entry_instruction_len;
+   u32 tpr_threshold;
+   u32 secondary_vm_exec_control;
+   u32 vm_instruction_error;
+   u32 vm_exit_reason;
+   u32 vm_exit_intr_info;
+   u32 vm_exit_intr_error_code;
+   u32 idt_vectoring_info_field;
+   u32 idt_vectoring_error_code;
+   u32 vm_exit_instruction_len;
+   u32 vmx_instruction_info;
+   u32 guest_es_limit;
+   u32 guest_cs_limit;
+   u32 guest_ss_limit;
+   u32 guest_ds_limit;
+   u32 guest_fs_limit;
+   u32 guest_gs_limit;
+   u32 guest_ldtr_limit;
+   u32 guest_tr_limit;
+   u32 guest_gdtr_limit;
+   u32 guest_idtr_limit;
+   u32 guest_es_ar_bytes;
+   u32 guest_cs_ar_bytes;
+   u32 guest_ss_ar_bytes;
+   u32 guest_ds_ar_bytes;
+   u32 guest_fs_ar_bytes;
+   u32 guest_gs_ar_bytes;
+   u32 

[PATCH 10/30] nVMX: Success/failure of VMX instructions.

2011-05-08 Thread Nadav Har'El
VMX instructions specify success or failure by setting certain RFLAGS bits.
This patch contains common functions to do this, and they will be used in
the following patches which emulate the various VMX instructions.
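For reference, the convention (Vol 2B, Conventions) is: CF=1 means
VMfailInvalid, ZF=1 means VMfailValid (with an error code in the
VM-instruction error field), and both clear means VMsucceed. An L1 guest
might consume it like this (a hypothetical L1-side sketch using GCC inline
asm, not part of this patch):

	static inline int vmclear_checked(u64 vmcs_pa)
	{
		u8 cf, zf;

		asm volatile("vmclear %[pa]; setc %[cf]; setz %[zf]"
			     : [cf] "=rm" (cf), [zf] "=rm" (zf)
			     : [pa] "m" (vmcs_pa)
			     : "cc", "memory");
		if (cf)
			return -1;	/* VMfailInvalid */
		if (zf)
			return -2;	/* VMfailValid, see VM-instruction error */
		return 0;		/* VMsucceed */
	}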

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/include/asm/vmx.h |   31 +++
 arch/x86/kvm/vmx.c |   30 ++
 2 files changed, 61 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:19.0 +0300
@@ -4722,6 +4722,36 @@ static int get_vmx_mem_address(struct kv
 }
 
 /*
+ * The following 3 functions, nested_vmx_succeed()/failValid()/failInvalid(),
+ * set the success or error code of an emulated VMX instruction, as specified
+ * by Vol 2B, VMX Instruction Reference, Conventions.
+ */
+static void nested_vmx_succeed(struct kvm_vcpu *vcpu)
+{
+   vmx_set_rflags(vcpu, vmx_get_rflags(vcpu)
+& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+   X86_EFLAGS_ZF | X86_EFLAGS_SF | X86_EFLAGS_OF));
+}
+
+static void nested_vmx_failInvalid(struct kvm_vcpu *vcpu)
+{
+   vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+& ~(X86_EFLAGS_PF | X86_EFLAGS_AF | X86_EFLAGS_ZF |
+   X86_EFLAGS_SF | X86_EFLAGS_OF))
+   | X86_EFLAGS_CF);
+}
+
+static void nested_vmx_failValid(struct kvm_vcpu *vcpu,
+   u32 vm_instruction_error)
+{
+   vmx_set_rflags(vcpu, (vmx_get_rflags(vcpu)
+& ~(X86_EFLAGS_CF | X86_EFLAGS_PF | X86_EFLAGS_AF |
+   X86_EFLAGS_SF | X86_EFLAGS_OF))
+   | X86_EFLAGS_ZF);
+   get_vmcs12(vcpu)->vm_instruction_error = vm_instruction_error;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
--- .before/arch/x86/include/asm/vmx.h  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/include/asm/vmx.h   2011-05-08 10:43:19.0 +0300
@@ -426,4 +426,35 @@ struct vmx_msr_entry {
u64 value;
 } __aligned(16);
 
+/*
+ * VM-instruction error numbers
+ */
+enum vm_instruction_error_number {
+   VMXERR_VMCALL_IN_VMX_ROOT_OPERATION = 1,
+   VMXERR_VMCLEAR_INVALID_ADDRESS = 2,
+   VMXERR_VMCLEAR_VMXON_POINTER = 3,
+   VMXERR_VMLAUNCH_NONCLEAR_VMCS = 4,
+   VMXERR_VMRESUME_NONLAUNCHED_VMCS = 5,
+   VMXERR_VMRESUME_AFTER_VMXOFF = 6,
+   VMXERR_ENTRY_INVALID_CONTROL_FIELD = 7,
+   VMXERR_ENTRY_INVALID_HOST_STATE_FIELD = 8,
+   VMXERR_VMPTRLD_INVALID_ADDRESS = 9,
+   VMXERR_VMPTRLD_VMXON_POINTER = 10,
+   VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID = 11,
+   VMXERR_UNSUPPORTED_VMCS_COMPONENT = 12,
+   VMXERR_VMWRITE_READ_ONLY_VMCS_COMPONENT = 13,
+   VMXERR_VMXON_IN_VMX_ROOT_OPERATION = 15,
+   VMXERR_ENTRY_INVALID_EXECUTIVE_VMCS_POINTER = 16,
+   VMXERR_ENTRY_NONLAUNCHED_EXECUTIVE_VMCS = 17,
+   VMXERR_ENTRY_EXECUTIVE_VMCS_POINTER_NOT_VMXON_POINTER = 18,
+   VMXERR_VMCALL_NONCLEAR_VMCS = 19,
+   VMXERR_VMCALL_INVALID_VM_EXIT_CONTROL_FIELDS = 20,
+   VMXERR_VMCALL_INCORRECT_MSEG_REVISION_ID = 22,
+   VMXERR_VMXOFF_UNDER_DUAL_MONITOR_TREATMENT_OF_SMIS_AND_SMM = 23,
+   VMXERR_VMCALL_INVALID_SMM_MONITOR_FEATURES = 24,
+   VMXERR_ENTRY_INVALID_VM_EXECUTION_CONTROL_FIELDS_IN_EXECUTIVE_VMCS = 25,
+   VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS = 26,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID = 28,
+};
+
 #endif


[PATCH 11/30] nVMX: Implement VMCLEAR

2011-05-08 Thread Nadav Har'El
This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   65 ++-
 arch/x86/kvm/x86.c |1 
 2 files changed, 65 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/x86.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/x86.c   2011-05-08 10:43:19.0 +0300
@@ -347,6 +347,7 @@ void kvm_inject_page_fault(struct kvm_vc
 vcpu->arch.cr2 = fault->address;
 kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
 }
+EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
 
 void kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
 {
--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:19.0 +0300
@@ -152,6 +152,9 @@ struct __packed vmcs12 {
u32 revision_id;
u32 abort;
 
+   u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+   u32 padding[7]; /* room for future expansion */
+
u64 io_bitmap_a;
u64 io_bitmap_b;
u64 msr_bitmap;
@@ -4751,6 +4754,66 @@ static void nested_vmx_failValid(struct 
get_vmcs12(vcpu)-vm_instruction_error = vm_instruction_error;
 }
 
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   gva_t gva;
+   gpa_t vmcs12_addr;
+   struct vmcs12 *vmcs12;
+   struct page *page;
+   struct x86_exception e;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+   return 1;
+
+   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmcs12_addr,
+   sizeof(vmcs12_addr), &e)) {
+   kvm_inject_page_fault(vcpu, &e);
+   return 1;
+   }
+
+   if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+   nested_vmx_failValid(vcpu, VMXERR_VMCLEAR_INVALID_ADDRESS);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+
+   if (vmcs12_addr == vmx->nested.current_vmptr) {
+   kunmap(vmx->nested.current_vmcs12_page);
+   nested_release_page(vmx->nested.current_vmcs12_page);
+   vmx->nested.current_vmptr = -1ull;
+   vmx->nested.current_vmcs12 = NULL;
+   }
+
+   page = nested_get_page(vcpu, vmcs12_addr);
+   if (page == NULL) {
+   /*
+* For accurate processor emulation, VMCLEAR beyond available
+* physical memory should do nothing at all. However, it is
+* possible that a nested vmx bug, not a guest hypervisor bug,
+* resulted in this case, so let's shut down before doing any
+* more damage:
+*/
+   kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+   return 1;
+   }
+   vmcs12 = kmap(page);
+   vmcs12->launch_state = 0;
+   kunmap(page);
+   nested_release_page(page);
+
+   nested_free_vmcs02(vmx, vmcs12_addr);
+
+   skip_emulated_instruction(vcpu);
+   nested_vmx_succeed(vcpu);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4772,7 +4835,7 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_INVD]= handle_invd,
[EXIT_REASON_INVLPG]  = handle_invlpg,
[EXIT_REASON_VMCALL]  = handle_vmcall,
-   [EXIT_REASON_VMCLEAR] = handle_vmx_insn,
+   [EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
[EXIT_REASON_VMPTRLD] = handle_vmx_insn,
[EXIT_REASON_VMPTRST] = handle_vmx_insn,


[PATCH 13/30] nVMX: Implement VMPTRST

2011-05-08 Thread Nadav Har'El
This patch implements the VMPTRST instruction. 

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   28 +++-
 arch/x86/kvm/x86.c |3 ++-
 arch/x86/kvm/x86.h |4 
 3 files changed, 33 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/x86.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/x86.c   2011-05-08 10:43:19.0 +0300
@@ -3836,7 +3836,7 @@ static int kvm_read_guest_virt_system(st
return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, exception);
 }
 
-static int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
+int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
   gva_t addr, void *val,
   unsigned int bytes,
   struct x86_exception *exception)
@@ -3868,6 +3868,7 @@ static int kvm_write_guest_virt_system(s
 out:
return r;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
 
 static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
  unsigned long addr,
--- .before/arch/x86/kvm/x86.h  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/x86.h   2011-05-08 10:43:19.0 +0300
@@ -85,4 +85,8 @@ int kvm_read_guest_virt(struct x86_emula
gva_t addr, void *val, unsigned int bytes,
struct x86_exception *exception);
 
+int kvm_write_guest_virt_system(struct x86_emulate_ctxt *ctxt,
+   gva_t addr, void *val, unsigned int bytes,
+   struct x86_exception *exception);
+
 #endif
--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:19.0 +0300
@@ -4874,6 +4874,32 @@ static int handle_vmptrld(struct kvm_vcp
return 1;
 }
 
+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   gva_t vmcs_gva;
+   struct x86_exception e;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (get_vmx_mem_address(vcpu, exit_qualification,
+   vmx_instruction_info, &vmcs_gva))
+   return 1;
+   /* ok to use *_system, as nested_vmx_check_permission verified cpl=0 */
+   if (kvm_write_guest_virt_system(&vcpu->arch.emulate_ctxt, vmcs_gva,
+(void *)&to_vmx(vcpu)->nested.current_vmptr,
+sizeof(u64), &e)) {
+   kvm_inject_page_fault(vcpu, &e);
+   return 1;
+   }
+   nested_vmx_succeed(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4898,7 +4924,7 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
[EXIT_REASON_VMPTRLD] = handle_vmptrld,
-   [EXIT_REASON_VMPTRST] = handle_vmx_insn,
+   [EXIT_REASON_VMPTRST] = handle_vmptrst,
[EXIT_REASON_VMREAD]  = handle_vmx_insn,
[EXIT_REASON_VMRESUME]= handle_vmx_insn,
[EXIT_REASON_VMWRITE] = handle_vmx_insn,


[PATCH 12/30] nVMX: Implement VMPTRLD

2011-05-08 Thread Nadav Har'El
This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   62 ++-
 1 file changed, 61 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:19.0 +0300
@@ -4814,6 +4814,66 @@ static int handle_vmclear(struct kvm_vcp
return 1;
 }
 
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   gva_t gva;
+   gpa_t vmcs12_addr;
+   struct x86_exception e;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmcs_read32(VMX_INSTRUCTION_INFO), &gva))
+   return 1;
+
+   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vmcs12_addr,
+   sizeof(vmcs12_addr), &e)) {
+   kvm_inject_page_fault(vcpu, &e);
+   return 1;
+   }
+
+   if (!IS_ALIGNED(vmcs12_addr, PAGE_SIZE)) {
+   nested_vmx_failValid(vcpu, VMXERR_VMPTRLD_INVALID_ADDRESS);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+
+   if (vmx->nested.current_vmptr != vmcs12_addr) {
+   struct vmcs12 *new_vmcs12;
+   struct page *page;
+   page = nested_get_page(vcpu, vmcs12_addr);
+   if (page == NULL) {
+   nested_vmx_failInvalid(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+   new_vmcs12 = kmap(page);
+   if (new_vmcs12->revision_id != VMCS12_REVISION) {
+   kunmap(page);
+   nested_release_page_clean(page);
+   nested_vmx_failValid(vcpu,
+   VMXERR_VMPTRLD_INCORRECT_VMCS_REVISION_ID);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+   if (vmx->nested.current_vmptr != -1ull) {
+   kunmap(vmx->nested.current_vmcs12_page);
+   nested_release_page(vmx->nested.current_vmcs12_page);
+   }
+
+   vmx->nested.current_vmptr = vmcs12_addr;
+   vmx->nested.current_vmcs12 = new_vmcs12;
+   vmx->nested.current_vmcs12_page = page;
+   }
+
+   nested_vmx_succeed(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -4837,7 +4897,7 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_VMCALL]  = handle_vmcall,
[EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
-   [EXIT_REASON_VMPTRLD] = handle_vmx_insn,
+   [EXIT_REASON_VMPTRLD] = handle_vmptrld,
[EXIT_REASON_VMPTRST] = handle_vmx_insn,
[EXIT_REASON_VMREAD]  = handle_vmx_insn,
[EXIT_REASON_VMRESUME]= handle_vmx_insn,


[PATCH 14/30] nVMX: Implement VMREAD and VMWRITE

2011-05-08 Thread Nadav Har'El
Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read and write to the VMCS it is holding. The values are read or written
to the fields of the vmcs12 structure introduced in a previous patch.
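
For reference, the field encoding these helpers decode is Intel's (SDM volume
3B): bit 0 marks the 32-bit *_HIGH half of a 64-bit field, bits 14:13 select
the width, and bits 11:10 == 1 mark a read-only (exit-information) field. A
minimal standalone sketch of that decoding, with the field number taken from
the SDM rather than from this patch:

    #include <stdio.h>

    int main(void)
    {
            unsigned long field = 0x6820;   /* GUEST_RFLAGS, per the SDM */

            /* same rule as vmcs_field_type()/vmcs_field_readonly() below */
            int width = (0x1 & field) ? 2 : (int)((field >> 13) & 0x3);
            int readonly = ((field >> 10) & 0x3) == 1;

            /* prints: width code 3 (natural width), read-only 0 */
            printf("width code %d, read-only %d\n", width, readonly);
            return 0;
    }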

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  176 ++-
 1 file changed, 174 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:19.0 +0300
@@ -4814,6 +4814,178 @@ static int handle_vmclear(struct kvm_vcp
return 1;
 }
 
+enum vmcs_field_type {
+   VMCS_FIELD_TYPE_U16 = 0,
+   VMCS_FIELD_TYPE_U64 = 1,
+   VMCS_FIELD_TYPE_U32 = 2,
+   VMCS_FIELD_TYPE_NATURAL_WIDTH = 3
+};
+
+static inline int vmcs_field_type(unsigned long field)
+{
+   if (0x1 & field)/* the *_HIGH fields are all 32 bit */
+   return VMCS_FIELD_TYPE_U32;
+   return (field >> 13) & 0x3;
+}
+
+static inline int vmcs_field_readonly(unsigned long field)
+{
+   return (((field >> 10) & 0x3) == 1);
+}
+
+/*
+ * Read a vmcs12 field. Since these can have varying lengths and we return
+ * one type, we chose the biggest type (u64) and zero-extend the return value
+ * to that size. Note that the caller, handle_vmread, might need to use only
+ * some of the bits we return here (e.g., on 32-bit guests, only 32 bits of
+ * 64-bit fields are to be returned).
+ */
+static inline bool vmcs12_read_any(struct kvm_vcpu *vcpu,
+   unsigned long field, u64 *ret)
+{
+   short offset = vmcs_field_to_offset(field);
+   char *p;
+
+   if (offset < 0)
+   return 0;
+
+   p = ((char *)(get_vmcs12(vcpu))) + offset;
+
+   switch (vmcs_field_type(field)) {
+   case VMCS_FIELD_TYPE_NATURAL_WIDTH:
+   *ret = *((natural_width *)p);
+   return 1;
+   case VMCS_FIELD_TYPE_U16:
+   *ret = *((u16 *)p);
+   return 1;
+   case VMCS_FIELD_TYPE_U32:
+   *ret = *((u32 *)p);
+   return 1;
+   case VMCS_FIELD_TYPE_U64:
+   *ret = *((u64 *)p);
+   return 1;
+   default:
+   return 0; /* can never happen. */
+   }
+}
+
+static int handle_vmread(struct kvm_vcpu *vcpu)
+{
+   unsigned long field;
+   u64 field_value;
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   gva_t gva = 0;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   /* Decode instruction info and find the field to read */
+   field = kvm_register_read(vcpu, (((vmx_instruction_info) >> 28) & 0xf));
+   /* Read the field, zero-extended to a u64 field_value */
+   if (!vmcs12_read_any(vcpu, field, &field_value)) {
+   nested_vmx_failValid(vcpu, VMXERR_UNSUPPORTED_VMCS_COMPONENT);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+   /*
+* Now copy part of this value to register or memory, as requested.
+* Note that the number of bits actually copied is 32 or 64 depending
+* on the guest's mode (32 or 64 bit), not on the given field's length.
+*/
+   if (vmx_instruction_info & (1u << 10)) {
+   kvm_register_write(vcpu, (((vmx_instruction_info) >> 3) & 0xf),
+   field_value);
+   } else {
+   if (get_vmx_mem_address(vcpu, exit_qualification,
+   vmx_instruction_info, &gva))
+   return 1;
+   /* _system ok, as nested_vmx_check_permission verified cpl=0 */
+   kvm_write_guest_virt_system(&vcpu->arch.emulate_ctxt, gva,
+    &field_value, (is_long_mode(vcpu) ? 8 : 4), NULL);
+   }
+
+   nested_vmx_succeed(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
+
+static int handle_vmwrite(struct kvm_vcpu *vcpu)
+{
+   unsigned long field;
+   gva_t gva;
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   char *p;
+   short offset;
+   /* The value to write might be 32 or 64 bits, depending on L1's long
+* mode, and eventually we need to write that into a field of several
+* possible lengths. The code below first zero-extends the value to 64
+* bit (field_value), and then copies only the appropriate number of
+* bits into the vmcs12 field.
+*/
+   u64 field_value = 0;
+   struct x86_exception e;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (vmx_instruction_info & (1u << 10))
+   field_value = kvm_register_read(vcpu,
+   (((vmx_instruction_info) >> 3) & 0xf));
+   else {
+ 

[PATCH 15/30] nVMX: Move host-state field setup to a function

2011-05-08 Thread Nadav Har'El
Move the setting of constant host-state fields (fields that do not change
throughout the life of the guest) from vmx_vcpu_setup to a new common function
vmx_set_constant_host_state(). This function will also be used to set the
host state when running L2 guests.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   72 ---
 1 file changed, 41 insertions(+), 31 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:19.0 +0300
@@ -3323,17 +3323,53 @@ static void vmx_disable_intercept_for_ms
 }
 
 /*
+ * Set up the vmcs's constant host-state fields, i.e., host-state fields that
+ * will not change in the lifetime of the guest.
+ * Note that host-state that does change is set elsewhere. E.g., host-state
+ * that is set differently for each CPU is set in vmx_vcpu_load(), not here.
+ */
+static void vmx_set_constant_host_state(void)
+{
+   u32 low32, high32;
+   unsigned long tmpl;
+   struct desc_ptr dt;
+
+   vmcs_writel(HOST_CR0, read_cr0() | X86_CR0_TS);  /* 22.2.3 */
+   vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
+   vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
+
+   vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);  /* 22.2.4 */
+   vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+   vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+   vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+   vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
+
+   native_store_idt(dt);
+   vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
+
+   asm("mov $.Lkvm_vmx_return, %0" : "=r"(tmpl));
+   vmcs_writel(HOST_RIP, tmpl); /* 22.2.5 */
+
+   rdmsr(MSR_IA32_SYSENTER_CS, low32, high32);
+   vmcs_write32(HOST_IA32_SYSENTER_CS, low32);
+   rdmsrl(MSR_IA32_SYSENTER_EIP, tmpl);
+   vmcs_writel(HOST_IA32_SYSENTER_EIP, tmpl);   /* 22.2.3 */
+
+   if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT) {
+   rdmsr(MSR_IA32_CR_PAT, low32, high32);
+   vmcs_write64(HOST_IA32_PAT, low32 | ((u64) high32 << 32));
+   }
+}
+
+/*
  * Sets up the vmcs for emulated real mode.
  */
 static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 {
-   u32 host_sysenter_cs, msr_low, msr_high;
-   u32 junk;
+   u32 msr_low, msr_high;
u64 host_pat;
unsigned long a;
-   struct desc_ptr dt;
int i;
-   unsigned long kvm_vmx_return;
u32 exec_control;
 
/* I/O */
@@ -3390,16 +3426,9 @@ static int vmx_vcpu_setup(struct vcpu_vm
vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
vmcs_write32(CR3_TARGET_COUNT, 0);   /* 22.2.1 */
 
-   vmcs_writel(HOST_CR0, read_cr0() | X86_CR0_TS);  /* 22.2.3 */
-   vmcs_writel(HOST_CR4, read_cr4());  /* 22.2.3, 22.2.5 */
-   vmcs_writel(HOST_CR3, read_cr3());  /* 22.2.3  FIXME: shadow tables */
-
-   vmcs_write16(HOST_CS_SELECTOR, __KERNEL_CS);  /* 22.2.4 */
-   vmcs_write16(HOST_DS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
-   vmcs_write16(HOST_ES_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
vmcs_write16(HOST_FS_SELECTOR, 0);/* 22.2.4 */
vmcs_write16(HOST_GS_SELECTOR, 0);/* 22.2.4 */
-   vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
+   vmx_set_constant_host_state();
 #ifdef CONFIG_X86_64
rdmsrl(MSR_FS_BASE, a);
vmcs_writel(HOST_FS_BASE, a); /* 22.2.4 */
@@ -3410,31 +3439,12 @@ static int vmx_vcpu_setup(struct vcpu_vm
vmcs_writel(HOST_GS_BASE, 0); /* 22.2.4 */
 #endif
 
-   vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
-
-   native_store_idt(dt);
-   vmcs_writel(HOST_IDTR_BASE, dt.address);   /* 22.2.4 */
-
-   asm("mov $.Lkvm_vmx_return, %0" : "=r"(kvm_vmx_return));
-   vmcs_writel(HOST_RIP, kvm_vmx_return); /* 22.2.5 */
vmcs_write32(VM_EXIT_MSR_STORE_COUNT, 0);
vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_EXIT_MSR_LOAD_ADDR, __pa(vmx-msr_autoload.host));
vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR, __pa(vmx-msr_autoload.guest));
 
-   rdmsr(MSR_IA32_SYSENTER_CS, host_sysenter_cs, junk);
-   vmcs_write32(HOST_IA32_SYSENTER_CS, host_sysenter_cs);
-   rdmsrl(MSR_IA32_SYSENTER_ESP, a);
-   vmcs_writel(HOST_IA32_SYSENTER_ESP, a);   /* 22.2.3 */
-   rdmsrl(MSR_IA32_SYSENTER_EIP, a);
-   vmcs_writel(HOST_IA32_SYSENTER_EIP, a);   /* 22.2.3 */
-
-   if (vmcs_config.vmexit_ctrl & VM_EXIT_LOAD_IA32_PAT) {
-   rdmsr(MSR_IA32_CR_PAT, msr_low, msr_high);
-   host_pat = msr_low | ((u64) msr_high << 32);
-   vmcs_write64(HOST_IA32_PAT, host_pat);
-   }
if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
rdmsr(MSR_IA32_CR_PAT, msr_low, 

[PATCH 16/30] nVMX: Move control field setup to functions

2011-05-08 Thread Nadav Har'El
Move some of the control field setup to common functions. These functions will
also be needed for running L2 guests - L0's desires (expressed in these
functions) will be appropriately merged with L1's desires.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   80 +--
 1 file changed, 47 insertions(+), 33 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:19.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:19.0 +0300
@@ -3361,6 +3361,49 @@ static void vmx_set_constant_host_state(
}
 }
 
+static void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
+{
+   vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
+   if (enable_ept)
+   vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
+   vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
+}
+
+static u32 vmx_exec_control(struct vcpu_vmx *vmx)
+{
+   u32 exec_control = vmcs_config.cpu_based_exec_ctrl;
+   if (!vm_need_tpr_shadow(vmx->vcpu.kvm)) {
+   exec_control &= ~CPU_BASED_TPR_SHADOW;
+#ifdef CONFIG_X86_64
+   exec_control |= CPU_BASED_CR8_STORE_EXITING |
+   CPU_BASED_CR8_LOAD_EXITING;
+#endif
+   }
+   if (!enable_ept)
+   exec_control |= CPU_BASED_CR3_STORE_EXITING |
+   CPU_BASED_CR3_LOAD_EXITING  |
+   CPU_BASED_INVLPG_EXITING;
+   return exec_control;
+}
+
+static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
+{
+   u32 exec_control = vmcs_config.cpu_based_2nd_exec_ctrl;
+   if (!vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
+   exec_control &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+   if (vmx->vpid == 0)
+   exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
+   if (!enable_ept) {
+   exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+   enable_unrestricted_guest = 0;
+   }
+   if (!enable_unrestricted_guest)
+   exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+   if (!ple_gap)
+   exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+   return exec_control;
+}
+
 /*
  * Sets up the vmcs for emulated real mode.
  */
@@ -3370,7 +3413,6 @@ static int vmx_vcpu_setup(struct vcpu_vm
u64 host_pat;
unsigned long a;
int i;
-   u32 exec_control;
 
/* I/O */
vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
@@ -3385,36 +3427,11 @@ static int vmx_vcpu_setup(struct vcpu_vm
vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
vmcs_config.pin_based_exec_ctrl);
 
-   exec_control = vmcs_config.cpu_based_exec_ctrl;
-   if (!vm_need_tpr_shadow(vmx->vcpu.kvm)) {
-   exec_control &= ~CPU_BASED_TPR_SHADOW;
-#ifdef CONFIG_X86_64
-   exec_control |= CPU_BASED_CR8_STORE_EXITING |
-   CPU_BASED_CR8_LOAD_EXITING;
-#endif
-   }
-   if (!enable_ept)
-   exec_control |= CPU_BASED_CR3_STORE_EXITING |
-   CPU_BASED_CR3_LOAD_EXITING  |
-   CPU_BASED_INVLPG_EXITING;
-   vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);
+   vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, vmx_exec_control(vmx));
 
if (cpu_has_secondary_exec_ctrls()) {
-   exec_control = vmcs_config.cpu_based_2nd_exec_ctrl;
-   if (!vm_need_virtualize_apic_accesses(vmx->vcpu.kvm))
-   exec_control &=
-   ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
-   if (vmx->vpid == 0)
-   exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
-   if (!enable_ept) {
-   exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
-   enable_unrestricted_guest = 0;
-   }
-   if (!enable_unrestricted_guest)
-   exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
-   if (!ple_gap)
-   exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
-   vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
+   vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
+   vmx_secondary_exec_control(vmx));
}
 
if (ple_gap) {
@@ -3475,10 +3492,7 @@ static int vmx_vcpu_setup(struct vcpu_vm
vmcs_write32(VM_ENTRY_CONTROLS, vmcs_config.vmentry_ctrl);
 
vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL);
-   vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
-   if (enable_ept)
-   vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
-   vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
+   set_cr4_guest_host_mask(vmx);
 
kvm_write_tsc(&vmx->vcpu, 0);
 

[PATCH 17/30] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-08 Thread Nadav Har'El
This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
own guests).
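
To make the merge rule concrete: for trap-style controls, L2 must exit
whenever either L0 or L1 wants an exit, so vmcs02 typically gets the OR of
the two values. A standalone sketch of just that rule (the control bits are
taken from the SDM layout; the values are assumed for the example):

    #include <stdio.h>

    #define PIN_BASED_EXT_INTR_MASK (1u << 0)
    #define PIN_BASED_NMI_EXITING   (1u << 3)

    int main(void)
    {
            unsigned int vmcs01_pin = PIN_BASED_EXT_INTR_MASK |
                                      PIN_BASED_NMI_EXITING;   /* L0's needs */
            unsigned int vmcs12_pin = PIN_BASED_EXT_INTR_MASK; /* L1's wishes */

            /* the merged value used while L2 runs */
            unsigned int vmcs02_pin = vmcs01_pin | vmcs12_pin;

            printf("vmcs02 pin-based controls: 0x%x\n", vmcs02_pin); /* 0x9 */
            return 0;
    }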

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  272 +++
 1 file changed, 272 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:20.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:20.0 +0300
@@ -346,6 +346,12 @@ struct nested_vmx {
/* vmcs02_list cache of VMCSs recently used to run L2 guests */
struct list_head vmcs02_pool;
int vmcs02_num;
+   u64 vmcs01_tsc_offset;
+   /*
+* Guest pages referred to in vmcs02 with host-physical pointers, so
+* we must keep them pinned while L2 runs.
+*/
+   struct page *apic_access_page;
 };
 
 struct vcpu_vmx {
@@ -835,6 +841,18 @@ static inline bool report_flexpriority(v
return flexpriority_enabled;
 }
 
+static inline bool nested_cpu_has(struct vmcs12 *vmcs12, u32 bit)
+{
+   return vmcs12->cpu_based_vm_exec_control & bit;
+}
+
+static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
+{
+   return (vmcs12->cpu_based_vm_exec_control &
+   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
+   (vmcs12->secondary_vm_exec_control & bit);
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
int i;
@@ -1425,6 +1443,22 @@ static void vmx_fpu_activate(struct kvm_
 
 static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu);
 
+/*
+ * Return the cr0 value that a nested guest would read. This is a combination
+ * of the real cr0 used to run the guest (guest_cr0), and the bits shadowed by
+ * its hypervisor (cr0_read_shadow).
+ */
+static inline unsigned long guest_readable_cr0(struct vmcs12 *fields)
+{
+   return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
+   (fields->cr0_read_shadow & fields->cr0_guest_host_mask);
+}
+static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
+{
+   return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
+   (fields->cr4_read_shadow & fields->cr4_guest_host_mask);
+}
+
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
vmx_decache_cr0_guest_bits(vcpu);
@@ -3366,6 +3400,9 @@ static void set_cr4_guest_host_mask(stru
vmx->vcpu.arch.cr4_guest_owned_bits = KVM_CR4_GUEST_OWNED_BITS;
if (enable_ept)
vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
+   if (is_guest_mode(&vmx->vcpu))
+   vmx->vcpu.arch.cr4_guest_owned_bits &=
+   ~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;
vmcs_writel(CR4_GUEST_HOST_MASK, ~vmx->vcpu.arch.cr4_guest_owned_bits);
 }
 
@@ -4681,6 +4718,11 @@ static void free_nested(struct vcpu_vmx 
vmx->nested.current_vmptr = -1ull;
vmx->nested.current_vmcs12 = NULL;
}
+   /* Unpin physical memory we referred to in current vmcs02 */
+   if (vmx->nested.apic_access_page) {
+   nested_release_page(vmx->nested.apic_access_page);
+   vmx->nested.apic_access_page = 0;
+   }
 
nested_free_all_vmcs02(vmx);
 }
@@ -5749,6 +5791,236 @@ static void vmx_set_supported_cpuid(u32 
 {
 }
 
+/*
+ * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
+ * with L0's requirements for its guest (a.k.a. vmcs01), so we can run the L2
+ * guest in a way that will both be appropriate to L1's requests, and our
+ * needs. In addition to modifying the active vmcs (which is vmcs02), this
+ * function also has additional necessary side-effects, like setting various
+ * vcpu->arch fields.
+ */
+static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   u32 exec_control;
+
+   vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
+   vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
+   vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
+   vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
+   vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
+   vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
+   vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
+   vmcs_write16(GUEST_TR_SELECTOR, vmcs12->guest_tr_selector);
+
+   vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
+
+   vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+   vmcs12->vm_entry_intr_info_field);
+   vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+   vmcs12->vm_entry_exception_error_code);
+   vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+   vmcs12->vm_entry_instruction_len);
+
+   vmcs_write32(GUEST_ES_LIMIT, vmcs12->guest_es_limit);
+   

Is KVM_GET_SREGS safe from other threads?

2011-05-08 Thread Pekka Enberg
Hi!

We've noticed that sometimes KVM_GET_SREGS from a signal handler
hangs. We use it like this:

static void handle_sigquit(int sig)
{
 int i;

 for (i = 0; i < nrcpus; i++) {
 struct kvm_cpu *cpu = kvm_cpus[i];

 kvm_cpu__show_registers(cpu); <-- here
 kvm_cpu__show_code(cpu);
 kvm_cpu__show_page_tables(cpu);
 }

and

void kvm_cpu__show_registers(struct kvm_cpu *self)
{
[...]
if (ioctl(self->vcpu_fd, KVM_GET_SREGS, &sregs) < 0)
die("KVM_GET_REGS failed");

is it not OK to call KVM_GET_SREGS from other threads than the one
that's doing KVM_RUN?

Pekka


[PATCH 18/30] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-08 Thread Nadav Har'El
Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  139 ++-
 1 file changed, 137 insertions(+), 2 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:20.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:20.0 +0300
@@ -346,6 +346,9 @@ struct nested_vmx {
/* vmcs02_list cache of VMCSs recently used to run L2 guests */
struct list_head vmcs02_pool;
int vmcs02_num;
+
+   /* Saving the VMCS that we used for running L1 */
+   struct saved_vmcs saved_vmcs01;
u64 vmcs01_tsc_offset;
/*
 * Guest pages referred to in vmcs02 with host-physical pointers, so
@@ -4880,6 +4883,21 @@ static int handle_vmclear(struct kvm_vcp
return 1;
 }
 
+static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch);
+
+/* Emulate the VMLAUNCH instruction */
+static int handle_vmlaunch(struct kvm_vcpu *vcpu)
+{
+   return nested_vmx_run(vcpu, true);
+}
+
+/* Emulate the VMRESUME instruction */
+static int handle_vmresume(struct kvm_vcpu *vcpu)
+{
+
+   return nested_vmx_run(vcpu, false);
+}
+
 enum vmcs_field_type {
VMCS_FIELD_TYPE_U16 = 0,
VMCS_FIELD_TYPE_U64 = 1,
@@ -5160,11 +5178,11 @@ static int (*kvm_vmx_exit_handlers[])(st
[EXIT_REASON_INVLPG]  = handle_invlpg,
[EXIT_REASON_VMCALL]  = handle_vmcall,
[EXIT_REASON_VMCLEAR] = handle_vmclear,
-   [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
+   [EXIT_REASON_VMLAUNCH]= handle_vmlaunch,
[EXIT_REASON_VMPTRLD] = handle_vmptrld,
[EXIT_REASON_VMPTRST] = handle_vmptrst,
[EXIT_REASON_VMREAD]  = handle_vmread,
-   [EXIT_REASON_VMRESUME]= handle_vmx_insn,
+   [EXIT_REASON_VMRESUME]= handle_vmresume,
[EXIT_REASON_VMWRITE] = handle_vmwrite,
[EXIT_REASON_VMOFF]   = handle_vmoff,
[EXIT_REASON_VMON]= handle_vmon,
@@ -6021,6 +6039,123 @@ static int prepare_vmcs02(struct kvm_vcp
return 0;
 }
 
+/*
+ * nested_vmx_run() handles a nested entry, i.e., a VMLAUNCH or VMRESUME on L1
+ * for running an L2 nested guest.
+ */
+static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
+{
+   struct vmcs12 *vmcs12;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   int cpu;
+   struct saved_vmcs *saved_vmcs02;
+   u32 low, high;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+   skip_emulated_instruction(vcpu);
+
+   /*
+* The nested entry process starts with enforcing various prerequisites
+* on vmcs12 as required by the Intel SDM, and act appropriately when
+* they fail: As the SDM explains, some conditions should cause the
+* instruction to fail, while others will cause the instruction to seem
+* to succeed, but return an EXIT_REASON_INVALID_STATE.
+* To speed up the normal (success) code path, we should avoid checking
+* for misconfigurations which will anyway be caught by the processor
+* when using the merged vmcs02.
+*/
+
+   vmcs12 = get_vmcs12(vcpu);
+   if (vmcs12->launch_state == launch) {
+   nested_vmx_failValid(vcpu,
+   launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
+  : VMXERR_VMRESUME_NONLAUNCHED_VMCS);
+   return 1;
+   }
+
+   if (vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_MOV_SS) {
+   nested_vmx_failValid(vcpu,
+   VMXERR_ENTRY_EVENTS_BLOCKED_BY_MOV_SS);
+   return 1;
+   }
+
+   if ((vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_MSR_BITMAPS) &&
+   !IS_ALIGNED(vmcs12->msr_bitmap, PAGE_SIZE)) {
+   /*TODO: Also verify bits beyond physical address width are 0*/
+   nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+   return 1;
+   }
+
+   if (vmcs12->vm_entry_msr_load_count > 0 ||
+   vmcs12->vm_exit_msr_load_count > 0 ||
+   vmcs12->vm_exit_msr_store_count > 0) {
+   if (printk_ratelimit())
+   printk(KERN_WARNING
+ "%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
+   nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+   return 1;
+   }
+
+   nested_vmx_pinbased_ctls(&low, &high);
+   if (!vmx_control_verify(vmcs12->pin_based_vm_exec_control, low, high)) {
+   nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+   return 1;
+   }
+
+   if (((vmcs12->host_cr0 & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON) ||
+   

[PATCH 19/30] nVMX: No need for handle_vmx_insn function any more

2011-05-08 Thread Nadav Har'El
Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmresume,
vmwrite, vmon, vmoff), was handle_vmx_insn(). This handler simply threw a #UD
exception. Now that all these exit reasons are properly handled (and emulate
the respective VMX instruction), nothing calls this dummy handler and it can
be removed.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |6 --
 1 file changed, 6 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:20.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:20.0 +0300
@@ -4240,12 +4240,6 @@ static int handle_vmcall(struct kvm_vcpu
return 1;
 }
 
-static int handle_vmx_insn(struct kvm_vcpu *vcpu)
-{
-   kvm_queue_exception(vcpu, UD_VECTOR);
-   return 1;
-}
-
 static int handle_invd(struct kvm_vcpu *vcpu)
 {
return emulate_instruction(vcpu, 0) == EMULATE_DONE;


[PATCH 20/30] nVMX: Exiting from L2 to L1

2011-05-08 Thread Nadav Har'El
This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in the next patch.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  288 +++
 1 file changed, 288 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:20.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:20.0 +0300
@@ -6150,6 +6150,294 @@ static int nested_vmx_run(struct kvm_vcp
return 1;
 }
 
+/*
+ * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
+ * because L2 may have changed some cr0 bits directly (see CR0_GUEST_HOST_MASK)
+ * without L0 trapping the change and updating vmcs12.
+ * This function returns the value we should put in vmcs12.guest_cr0. It's not
+ * enough to just return the current (vmcs02) GUEST_CR0 - that may not be the
+ * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
+ * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) asked
+ * to trap this change and instead set just the read shadow bit. If this is the
+ * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0, where
+ * L1 believes they already are.
+ */
+static inline unsigned long
+vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+   /*
+* As explained above, we take a bit from GUEST_CR0 if we allowed the
+* guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
+* if we did trap it - if we did so because L1 asked to trap this bit
+* (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
+* didn't expect us to trap) we read from CR0_READ_SHADOW.
+*/
+   unsigned long guest_cr0_bits =
+   vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
+   return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
+  (vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
+}
+
+static inline unsigned long
+vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+   unsigned long guest_cr4_bits =
+   vcpu->arch.cr4_guest_owned_bits | vmcs12->cr4_guest_host_mask;
+   return (vmcs_readl(GUEST_CR4) & guest_cr4_bits) |
+  (vmcs_readl(CR4_READ_SHADOW) & ~guest_cr4_bits);
+}
+
+/*
+ * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
+ * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
+ * and this function updates it to reflect the changes to the guest state while
+ * L2 was running (and perhaps made some exits which were handled directly by L0
+ * without going back to L1), and to reflect the exit reason.
+ * Note that we do not have to copy here all VMCS fields, just those that
+ * could have changed by the L2 guest or the exit - i.e., the guest-state and
+ * exit-information fields only. Other fields are modified by L1 with VMWRITE,
+ * which already writes to vmcs12 directly.
+ */
+void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
+{
+   /* update guest state fields: */
+   vmcs12->guest_cr0 = vmcs12_guest_cr0(vcpu, vmcs12);
+   vmcs12->guest_cr4 = vmcs12_guest_cr4(vcpu, vmcs12);
+
+   kvm_get_dr(vcpu, 7, (unsigned long *)&vmcs12->guest_dr7);
+   vmcs12->guest_rsp = kvm_register_read(vcpu, VCPU_REGS_RSP);
+   vmcs12->guest_rip = kvm_register_read(vcpu, VCPU_REGS_RIP);
+   vmcs12->guest_rflags = vmcs_readl(GUEST_RFLAGS);
+
+   vmcs12->guest_es_selector = vmcs_read16(GUEST_ES_SELECTOR);
+   vmcs12->guest_cs_selector = vmcs_read16(GUEST_CS_SELECTOR);
+   vmcs12->guest_ss_selector = vmcs_read16(GUEST_SS_SELECTOR);
+   vmcs12->guest_ds_selector = vmcs_read16(GUEST_DS_SELECTOR);
+   vmcs12->guest_fs_selector = vmcs_read16(GUEST_FS_SELECTOR);
+   vmcs12->guest_gs_selector = vmcs_read16(GUEST_GS_SELECTOR);
+   vmcs12->guest_ldtr_selector = vmcs_read16(GUEST_LDTR_SELECTOR);
+   vmcs12->guest_tr_selector = vmcs_read16(GUEST_TR_SELECTOR);
+   vmcs12->guest_es_limit = vmcs_read32(GUEST_ES_LIMIT);
+   vmcs12->guest_cs_limit = vmcs_read32(GUEST_CS_LIMIT);
+   vmcs12->guest_ss_limit = vmcs_read32(GUEST_SS_LIMIT);
+   vmcs12->guest_ds_limit = vmcs_read32(GUEST_DS_LIMIT);
+   vmcs12->guest_fs_limit = vmcs_read32(GUEST_FS_LIMIT);
+   vmcs12->guest_gs_limit = vmcs_read32(GUEST_GS_LIMIT);
+   vmcs12->guest_ldtr_limit = vmcs_read32(GUEST_LDTR_LIMIT);
+   vmcs12->guest_tr_limit = vmcs_read32(GUEST_TR_LIMIT);
+   vmcs12->guest_gdtr_limit = 

[PATCH 21/30] nVMX: Deciding if L0 or L1 should handle an L2 exit

2011-05-08 Thread Nadav Har'El
This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; If it did, we exit to L1. But if it didn't it means that
it is we (L0) that wished to trap this event, so we handle it ourselves.

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, nested_run_pending, which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
and therefore expects L2 to be run (and perhaps be injected with an event it
specified, etc.). Nested_run_pending is especially intended to avoid switching
to L1 in the injection decision-point described above.
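
The CR-access case of this decision can be shown in a standalone sketch
(simplified relative to the real nested_vmx_exit_handled_cr(), with the mask
value assumed for the example):

    #include <stdbool.h>
    #include <stdio.h>

    /* exit to L1 only if a bit L1 asked to shadow would change */
    static bool l1_wants_cr0_exit(unsigned long old_cr0, unsigned long new_cr0,
                                  unsigned long cr0_guest_host_mask)
    {
            return ((old_cr0 ^ new_cr0) & cr0_guest_host_mask) != 0;
    }

    int main(void)
    {
            unsigned long mask = 1UL << 3;  /* L1 shadows only CR0.TS */

            /* TS flips: L1 handles the exit */
            printf("%d\n", l1_wants_cr0_exit(0x31, 0x39, mask));    /* 1 */
            /* only MP (bit 1) flips: L0 handles it and resumes L2 */
            printf("%d\n", l1_wants_cr0_exit(0x31, 0x33, mask));    /* 0 */
            return 0;
    }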

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |  265 ++-
 1 file changed, 264 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:20.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:20.0 +0300
@@ -350,6 +350,8 @@ struct nested_vmx {
/* Saving the VMCS that we used for running L1 */
struct saved_vmcs saved_vmcs01;
u64 vmcs01_tsc_offset;
+   /* L2 must run next, and mustn't decide to exit to L1. */
+   bool nested_run_pending;
/*
 * Guest pages referred to in vmcs02 with host-physical pointers, so
 * we must keep them pinned while L2 runs.
@@ -856,6 +858,23 @@ static inline bool nested_cpu_has2(struc
(vmcs12->secondary_vm_exec_control & bit);
 }
 
+static inline bool nested_cpu_has_virtual_nmis(struct kvm_vcpu *vcpu)
+{
+   return is_guest_mode(vcpu) &&
+   (get_vmcs12(vcpu)->pin_based_vm_exec_control &
+   PIN_BASED_VIRTUAL_NMIS);
+}
+
+static inline bool is_exception(u32 intr_info)
+{
+   return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
+   == (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
+}
+
+static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, bool is_interrupt);
+static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
+   struct vmcs12 *vmcs12);
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
int i;
@@ -5196,6 +5215,232 @@ static int (*kvm_vmx_exit_handlers[])(st
 static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);
 
+/*
+ * Return 1 if we should exit from L2 to L1 to handle an MSR access,
+ * rather than handle it ourselves in L0. I.e., check whether L1 expressed
+ * disinterest in the current event (read or write a specific MSR) by using an
+ * MSR bitmap. This may be the case even when L0 doesn't use MSR bitmaps.
+ */
+static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
+   struct vmcs12 *vmcs12, u32 exit_reason)
+{
+   u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
+   gpa_t bitmap;
+
+   if (!nested_cpu_has(get_vmcs12(vcpu), CPU_BASED_USE_MSR_BITMAPS))
+   return 1;
+
+   /*
+* The MSR_BITMAP page is divided into four 1024-byte bitmaps,
+* for the four combinations of read/write and low/high MSR numbers.
+* First we need to figure out which of the four to use:
+*/
+   bitmap = vmcs12->msr_bitmap;
+   if (exit_reason == EXIT_REASON_MSR_WRITE)
+   bitmap += 2048;
+   if (msr_index >= 0xc0000000) {
+   msr_index -= 0xc0000000;
+   bitmap += 1024;
+   }
+
+   /* Then read the msr_index'th bit from this bitmap: */
+   if (msr_index < 1024*8) {
+   unsigned char b;
+   kvm_read_guest(vcpu->kvm, bitmap + msr_index/8, &b, 1);
+   return 1 & (b >> (msr_index & 7));
+   } else
+   return 1; /* let L1 handle the wrong parameter */
+}
+
+/*
+ * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+   struct vmcs12 *vmcs12)
+{
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int cr = exit_qualification & 15;
+   int reg = (exit_qualification >> 8) & 15;
+   unsigned long val = kvm_register_read(vcpu, reg);
+
+   switch ((exit_qualification >> 4) & 3) {
+   case 0: /* mov to cr */
+ 

[PATCH 22/30] nVMX: Correct handling of interrupt injection

2011-05-08 Thread Nadav Har'El
When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, inject when it doesn't - using
the interrupt window VMX mechanism), and setting up the appropriate VMCS
fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, the one we were injecting; Only when L1 asked not to be notified
of interrupts, we should inject directly to the running L2 guest (i.e.,
the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to do in this stage.
Instead we do something more simplistic and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing the normal code. The normal kvm
code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).
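
In outline, the modified check behaves as in this simplified standalone
sketch (the vmx/vmcs state is reduced to booleans; it mirrors, but is not,
the real vmx_interrupt_allowed()):

    #include <stdbool.h>
    #include <stdio.h>

    static bool interrupt_allowed(bool guest_mode, bool l1_exits_on_intr,
                                  bool nested_run_pending, bool if_flag)
    {
            if (guest_mode && l1_exits_on_intr) {
                    if (nested_run_pending)
                            return false;   /* must run L2 once first */
                    /* the real code exits from L2 to L1 here, then falls
                     * through to the normal check, now in L1 */
            }
            return if_flag;         /* the usual non-nested test */
    }

    int main(void)
    {
            printf("%d\n", interrupt_allowed(true, true, true, true));  /* 0 */
            printf("%d\n", interrupt_allowed(true, true, false, true)); /* 1 */
            return 0;
    }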

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   35 +++
 1 file changed, 35 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:20.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:20.0 +0300
@@ -3675,9 +3675,25 @@ out:
return ret;
 }
 
+/*
+ * In nested virtualization, check if L1 asked to exit on external interrupts.
+ * For most existing hypervisors, this will always return true.
+ */
+static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
+{
+   return get_vmcs12(vcpu)->pin_based_vm_exec_control &
+   PIN_BASED_EXT_INTR_MASK;
+}
+
 static void enable_irq_window(struct kvm_vcpu *vcpu)
 {
u32 cpu_based_vm_exec_control;
+   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
+   /* We can get here when nested_run_pending caused
+* vmx_interrupt_allowed() to return false. In this case, do
+* nothing - the interrupt will be injected later.
+*/
+   return;
 
cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3800,6 +3816,13 @@ static void vmx_set_nmi_mask(struct kvm_
 
 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
+   if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
+   if (to_vmx(vcpu)->nested.nested_run_pending)
+   return 0;
+   nested_vmx_vmexit(vcpu, true);
+   /* fall through to normal code, but now in L1, not L2 */
+   }
+
return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
@@ -5463,6 +5486,14 @@ static int vmx_handle_exit(struct kvm_vc
if (vmx->emulation_required && emulate_invalid_guest_state)
return handle_invalid_guest_state(vcpu);
 
+   /*
+* the KVM_REQ_EVENT optimization bit is only on for one entry, and if
+* we did not inject a still-pending event to L1 now because of
+* nested_run_pending, we need to re-enable this bit.
+*/
+   if (vmx->nested.nested_run_pending)
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
+
if (exit_reason == EXIT_REASON_VMLAUNCH ||
exit_reason == EXIT_REASON_VMRESUME)
vmx->nested.nested_run_pending = 1;
@@ -5660,6 +5691,8 @@ static void __vmx_complete_interrupts(st
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
+   if (is_guest_mode(&vmx->vcpu))
+   return;
__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
  VM_EXIT_INSTRUCTION_LEN,
  IDT_VECTORING_ERROR_CODE);
@@ -5667,6 +5700,8 @@ static void vmx_complete_interrupts(stru
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
+   if (is_guest_mode(vcpu))
+   return;
__vmx_complete_interrupts(to_vmx(vcpu),
  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
  VM_ENTRY_INSTRUCTION_LEN,


[PATCH 23/30] nVMX: Correct handling of exception injection

2011-05-08 Thread Nadav Har'El
Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   26 ++
 1 file changed, 26 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:21.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:21.0 +0300
@@ -1572,6 +1572,25 @@ static void vmx_clear_hlt(struct kvm_vcp
vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }
 
+/*
+ * KVM wants to inject page-faults which it got to the guest. This function
+ * checks whether in a nested guest, we need to inject them to L1 or L2.
+ * This function assumes it is called with the exit reason in vmcs02 being
+ * a #PF exception (this is the only case in which KVM injects a #PF when L2
+ * is running).
+ */
+static int nested_pf_handled(struct kvm_vcpu *vcpu)
+{
+   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+
+   /* TODO: also check PFEC_MATCH/MASK, not just EB.PF. */
+   if (!(vmcs12->exception_bitmap & PF_VECTOR))
+   return 0;
+
+   nested_vmx_vmexit(vcpu, false);
+   return 1;
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
bool has_error_code, u32 error_code,
bool reinject)
@@ -1579,6 +1598,10 @@ static void vmx_queue_exception(struct k
struct vcpu_vmx *vmx = to_vmx(vcpu);
u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
+   if (nr == PF_VECTOR && is_guest_mode(vcpu) &&
+   nested_pf_handled(vcpu))
+   return;
+
if (has_error_code) {
vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
intr_info |= INTR_INFO_DELIVER_CODE_MASK;
@@ -3750,6 +3773,9 @@ static void vmx_inject_nmi(struct kvm_vc
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+   if (is_guest_mode(vcpu))
+   return;
+
if (!cpu_has_virtual_nmis()) {
/*
 * Tracking the NMI-blocked state in software is built upon


[PATCH 24/30] nVMX: Correct handling of idt vectoring info

2011-05-08 Thread Nadav Har'El
This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while handling an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info case is discovered after
the exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case a decision of whether to return
to L2 or L1 also happens during the injection phase (see the previous
patches), so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., in the beginning of
vmx_vcpu_run, which is the first time we know for sure if we're staying in
L2 (i.e., nested_mode is true).
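
The re-injection itself is a repack of the saved vectoring data into the
VM-entry interruption-information format. A standalone sketch (mask values
assumed from the SDM layout: vector in bits 7:0, type in bits 10:8,
error-code-delivery bit 11, valid bit 31):

    #include <stdio.h>

    #define VECTOR_MASK             0xffu
    #define TYPE_MASK               (7u << 8)
    #define DELIVER_CODE_MASK       (1u << 11)
    #define VALID_MASK              (1u << 31)

    int main(void)
    {
            /* valid, hardware exception, vector 14 (#PF), with error code */
            unsigned int idt_vectoring_info = 0x80000b0e;

            unsigned int entry_intr_info =
                    (idt_vectoring_info &
                     (VECTOR_MASK | TYPE_MASK | DELIVER_CODE_MASK)) | VALID_MASK;

            printf("0x%x\n", entry_intr_info);      /* 0x80000b0e */
            return 0;
    }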

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   32 
 1 file changed, 32 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:21.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:21.0 +0300
@@ -352,6 +352,10 @@ struct nested_vmx {
u64 vmcs01_tsc_offset;
/* L2 must run next, and mustn't decide to exit to L1. */
bool nested_run_pending;
+   /* true if last exit was of L2, and had a valid idt_vectoring_info */
+   bool valid_idt_vectoring_info;
+   /* These are saved if valid_idt_vectoring_info */
+   u32 vm_exit_instruction_len, idt_vectoring_error_code;
/*
 * Guest pages referred to in vmcs02 with host-physical pointers, so
 * we must keep them pinned while L2 runs.
@@ -5736,6 +5740,22 @@ static void vmx_cancel_injection(struct 
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
+static void nested_handle_valid_idt_vectoring_info(struct vcpu_vmx *vmx)
+{
+   int irq  = vmx->idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
+   int type = vmx->idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
+   int errCodeValid = vmx->idt_vectoring_info &
+   VECTORING_INFO_DELIVER_CODE_MASK;
+   vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
+   irq | type | INTR_INFO_VALID_MASK | errCodeValid);
+
+   vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
+   vmx->nested.vm_exit_instruction_len);
+   if (errCodeValid)
+   vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
+   vmx->nested.idt_vectoring_error_code);
+}
+
 #ifdef CONFIG_X86_64
#define R "r"
#define Q "q"
@@ -5748,6 +5768,9 @@ static void __noclone vmx_vcpu_run(struc
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+   if (is_guest_mode(vcpu) && vmx->nested.valid_idt_vectoring_info)
+   nested_handle_valid_idt_vectoring_info(vmx);
+
/* Record the guest's net vcpu time for enforced NMI injections. */
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
vmx->entry_time = ktime_get();
@@ -5879,6 +5902,15 @@ static void __noclone vmx_vcpu_run(struc
 
vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
+   vmx->nested.valid_idt_vectoring_info = is_guest_mode(vcpu) &&
+   (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK);
+   if (vmx->nested.valid_idt_vectoring_info) {
+   vmx->nested.vm_exit_instruction_len =
+   vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
+   vmx->nested.idt_vectoring_error_code =
+   vmcs_read32(IDT_VECTORING_ERROR_CODE);
+   }
+
asm("mov %0, %%ds; mov %0, %%es" : : "r"(__USER_DS));
vmx->launched = 1;
 


[PATCH 25/30] nVMX: Handling of CR0 and CR4 modifying instructions

2011-05-08 Thread Nadav Har'El
When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
previous patch).
When L2 modifies bits that L1 doesn't care about, we let it think (via
CR[04]_READ_SHADOW) that it did these modifications, while only changing
(in GUEST_CR[04]) the bits that L0 doesn't shadow.

This is needed for correct handling of CR0.TS for lazy FPU loading: L0 may
want to leave TS on, while pretending to allow the guest to change it.
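
The read-shadow rule this relies on fits in a few lines; a standalone sketch
(the same formula as guest_readable_cr0() from an earlier patch, with example
values assumed):

    #include <stdio.h>

    /* what the guest sees on a CR0 read: shadowed bits come from the read
     * shadow, the rest from the real guest CR0 */
    static unsigned long guest_visible_cr0(unsigned long guest_cr0,
                                           unsigned long read_shadow,
                                           unsigned long mask)
    {
            return (guest_cr0 & ~mask) | (read_shadow & mask);
    }

    int main(void)
    {
            unsigned long mask = 1UL << 3;  /* L0 shadows CR0.TS */

            /* hardware TS stays set (lazy FPU); the guest reads it as clear */
            printf("0x%lx\n", guest_visible_cr0(0x3b, 0x33, mask)); /* 0x33 */
            return 0;
    }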

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   58 ---
 1 file changed, 55 insertions(+), 3 deletions(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:21.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:21.0 +0300
@@ -4094,6 +4094,58 @@ vmx_patch_hypercall(struct kvm_vcpu *vcp
hypercall[2] = 0xc1;
 }
 
+/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
+static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   if (to_vmx(vcpu)->nested.vmxon &&
+   ((val & VMXON_CR0_ALWAYSON) != VMXON_CR0_ALWAYSON))
+   return 1;
+
+   if (is_guest_mode(vcpu)) {
+   /*
+* We get here when L2 changed cr0 in a way that did not change
+* any of L1's shadowed bits (see nested_vmx_exit_handled_cr),
+* but did change L0 shadowed bits. This can currently happen
+* with the TS bit: L0 may want to leave TS on (for lazy fpu
+* loading) while pretending to allow the guest to change it.
+*/
+   if (kvm_set_cr0(vcpu, (val & vcpu->arch.cr0_guest_owned_bits) |
+    (vcpu->arch.cr0 & ~vcpu->arch.cr0_guest_owned_bits)))
+   return 1;
+   vmcs_writel(CR0_READ_SHADOW, val);
+   return 0;
+   } else
+   return kvm_set_cr0(vcpu, val);
+}
+
+static int handle_set_cr4(struct kvm_vcpu *vcpu, unsigned long val)
+{
+   if (is_guest_mode(vcpu)) {
+   if (kvm_set_cr4(vcpu, (val & vcpu->arch.cr4_guest_owned_bits) |
+    (vcpu->arch.cr4 & ~vcpu->arch.cr4_guest_owned_bits)))
+   return 1;
+   vmcs_writel(CR4_READ_SHADOW, val);
+   return 0;
+   } else
+   return kvm_set_cr4(vcpu, val);
+}
+
+/* called to set cr0 as appropriate for clts instruction exit. */
+static void handle_clts(struct kvm_vcpu *vcpu)
+{
+   if (is_guest_mode(vcpu)) {
+   /*
+* We get here when L2 did CLTS, and L1 didn't shadow CR0.TS
+* but we did (!fpu_active). We need to keep GUEST_CR0.TS on,
+* just pretend it's off (also in arch.cr0 for fpu_activate).
+*/
+   vmcs_writel(CR0_READ_SHADOW,
+   vmcs_readl(CR0_READ_SHADOW) & ~X86_CR0_TS);
+   vcpu->arch.cr0 &= ~X86_CR0_TS;
+   } else
+   vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+}
+
 static int handle_cr(struct kvm_vcpu *vcpu)
 {
unsigned long exit_qualification, val;
@@ -4110,7 +4162,7 @@ static int handle_cr(struct kvm_vcpu *vc
trace_kvm_cr_write(cr, val);
switch (cr) {
case 0:
-   err = kvm_set_cr0(vcpu, val);
+   err = handle_set_cr0(vcpu, val);
kvm_complete_insn_gp(vcpu, err);
return 1;
case 3:
@@ -4118,7 +4170,7 @@ static int handle_cr(struct kvm_vcpu *vc
kvm_complete_insn_gp(vcpu, err);
return 1;
case 4:
-   err = kvm_set_cr4(vcpu, val);
+   err = handle_set_cr4(vcpu, val);
kvm_complete_insn_gp(vcpu, err);
return 1;
case 8: {
@@ -4136,7 +4188,7 @@ static int handle_cr(struct kvm_vcpu *vc
};
break;
case 2: /* clts */
-   vmx_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~X86_CR0_TS));
+   handle_clts(vcpu);
trace_kvm_cr_write(0, kvm_read_cr0(vcpu));
skip_emulated_instruction(vcpu);
vmx_fpu_activate(vcpu);


[PATCH 26/30] nVMX: Further fixes for lazy FPU loading

2011-05-08 Thread Nadav Har'El
KVM's Lazy FPU loading means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
and L1's needs. Note that handle_cr() was already fixed in the above patch,
and that new code introduced in previous patches already handles CR0
correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).
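
The exception-bitmap merge itself is a union, as in this standalone sketch
(vector numbers are architectural; the values are assumed for the example):

    #include <stdio.h>

    #define NM_VECTOR 7
    #define PF_VECTOR 14

    int main(void)
    {
            unsigned int l0_eb = 1u << NM_VECTOR;   /* L0: #NM for lazy FPU */
            unsigned int l1_eb = 1u << PF_VECTOR;   /* L1's requested bitmap */

            /* while L2 runs, hardware must trap for either party; on an
             * exit, L1 is consulted only for bits set in l1_eb */
            unsigned int vmcs02_eb = l0_eb | l1_eb;

            printf("0x%x\n", vmcs02_eb);    /* 0x4080 */
            return 0;
    }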

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:21.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:21.0 +0300
@@ -1170,6 +1170,15 @@ static void update_exception_bitmap(stru
eb &= ~(1u << PF_VECTOR); /* bypass_guest_pf = 0 */
if (vcpu->fpu_active)
eb &= ~(1u << NM_VECTOR);
+
+   /* When we are running a nested L2 guest and L1 specified for it a
+* certain exception bitmap, we must trap the same exceptions and pass
+* them to L1. When running L2, we will only handle the exceptions
+* specified above if L1 did not want them.
+*/
+   if (is_guest_mode(vcpu))
+   eb |= get_vmcs12(vcpu)->exception_bitmap;
+
vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
@@ -1464,6 +1473,9 @@ static void vmx_fpu_activate(struct kvm_
vmcs_writel(GUEST_CR0, cr0);
update_exception_bitmap(vcpu);
vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+   if (is_guest_mode(vcpu))
+   vcpu->arch.cr0_guest_owned_bits &=
+   ~get_vmcs12(vcpu)->cr0_guest_host_mask;
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
 }
 
@@ -1487,12 +1499,29 @@ static inline unsigned long guest_readab
 
 static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
 {
+   /* Note that there is no vcpu->fpu_active = 0 here. The caller must
+* set this *before* calling this function.
+*/
vmx_decache_cr0_guest_bits(vcpu);
vmcs_set_bits(GUEST_CR0, X86_CR0_TS | X86_CR0_MP);
update_exception_bitmap(vcpu);
vcpu->arch.cr0_guest_owned_bits = 0;
vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
-   vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
+   if (is_guest_mode(vcpu)) {
+   /*
+* L1's specified read shadow might not contain the TS bit,
+* so now that we turned on shadowing of this bit, we need to
+* set this bit of the shadow. Like in nested_vmx_run we need
+* guest_readable_cr0(vmcs12), but vmcs12->guest_cr0 is not
+* yet up-to-date here because we just decached cr0.TS (and
+* we'll only update vmcs12->guest_cr0 on nested exit).
+*/
+   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+   vmcs12->guest_cr0 = (vmcs12->guest_cr0 & ~X86_CR0_TS) |
+   (vcpu->arch.cr0 & X86_CR0_TS);
+   vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
+   } else
+   vmcs_writel(CR0_READ_SHADOW, vcpu->arch.cr0);
 }
 
 static unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu)


[PATCH 27/30] nVMX: Additional TSC-offset handling

2011-05-08 Thread Nadav Har'El
In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
set vmcs12.tsc_offset, for this change to survive the next nested entry (see
prepare_vmcs02()).
Additionally, we also need to modify vmx_adjust_tsc_offset: The semantics
of this function is that the TSC of all guests on this vcpu, L1 and possibly
several L2s, need to be adjusted. To do this, we need to adjust vmcs01's
tsc_offset (this offset will also apply to each L2s we enter). We can't set
vmcs01 now, so we have to remember this adjustment and apply it when we
later exit to L1.
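
Numerically the bookkeeping looks like this (a standalone sketch with assumed
example values; while L2 runs, the hardware offset is the sum of L0's and
L1's offsets):

    #include <stdio.h>
    typedef unsigned long long u64;

    int main(void)
    {
            u64 vmcs01_tsc_offset = 1000;   /* L0's offset for L1 */
            u64 vmcs12_tsc_offset = 50;     /* L1's offset for L2 */
            u64 vmcs02_tsc_offset = vmcs01_tsc_offset + vmcs12_tsc_offset;

            /* L2 writes the TSC MSR, untrapped by L1: keep the delta in
             * vmcs12 so the next prepare_vmcs02() reproduces it */
            u64 new_hw_offset = 1100;
            vmcs12_tsc_offset = new_hw_offset - vmcs01_tsc_offset;  /* 100 */

            /* a host-side TSC adjustment must also reach L1's world */
            vmcs01_tsc_offset += 25;

            printf("%llu %llu %llu\n", vmcs01_tsc_offset,
                   vmcs12_tsc_offset, vmcs02_tsc_offset);   /* 1025 100 1050 */
            return 0;
    }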

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |   12 
 1 file changed, 12 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:21.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:21.0 +0300
@@ -1757,12 +1757,24 @@ static void vmx_set_tsc_khz(struct kvm_v
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
vmcs_write64(TSC_OFFSET, offset);
+   if (is_guest_mode(vcpu))
+   /*
+* We're here if L1 chose not to trap the TSC MSR. Since
+* prepare_vmcs12() does not copy tsc_offset, we need to also
+* set the vmcs12 field here.
+*/
+   get_vmcs12(vcpu)->tsc_offset = offset -
+   to_vmx(vcpu)->nested.vmcs01_tsc_offset;
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 {
u64 offset = vmcs_read64(TSC_OFFSET);
vmcs_write64(TSC_OFFSET, offset + adjustment);
+   if (is_guest_mode(vcpu)) {
+   /* Even when running L2, the adjustment needs to apply to L1 */
+   to_vmx(vcpu)->nested.vmcs01_tsc_offset += adjustment;
+   }
 }
 
 static u64 vmx_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)


[PATCH 28/30] nVMX: Add VMX to list of supported cpuid features

2011-05-08 Thread Nadav Har'El
If the nested module option is enabled, add the VMX CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.
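
So the guest-visible feature set is effectively an intersection, as in this
minimal sketch (the bit position is CPUID.1:ECX.VMX; the surrounding values
are assumed):

    #include <stdio.h>

    #define X86_FEATURE_VMX_BIT 5   /* CPUID.1:ECX.VMX */

    int main(void)
    {
            unsigned int kvm_ecx  = 1u << X86_FEATURE_VMX_BIT; /* KVM's list */
            unsigned int qemu_ecx = 1u << X86_FEATURE_VMX_BIT; /* e.g. -cpu host */

            unsigned int guest_ecx = kvm_ecx & qemu_ecx;

            printf("guest sees VMX: %d\n",
                   !!(guest_ecx & (1u << X86_FEATURE_VMX_BIT)));    /* 1 */
            return 0;
    }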

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |2 ++
 1 file changed, 2 insertions(+)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:21.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:21.0 +0300
@@ -6244,6 +6244,8 @@ static void vmx_cpuid_update(struct kvm_
 
 static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
+   if (func == 1 && nested)
+   entry->ecx |= bit(X86_FEATURE_VMX);
 }
 
 /*


[PATCH 29/30] nVMX: Miscellaneous small corrections

2011-05-08 Thread Nadav Har'El
Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 arch/x86/kvm/vmx.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- .before/arch/x86/kvm/vmx.c  2011-05-08 10:43:21.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2011-05-08 10:43:21.0 +0300
@@ -947,7 +947,7 @@ static void vmcs_load(struct vmcs *vmcs)
		: "=qm"(error) : "a"(&phys_addr), "m"(phys_addr)
		: "cc", "memory");
if (error)
-		printk(KERN_ERR "kvm: vmptrld %p/%llx fail\n",
+		printk(KERN_ERR "kvm: vmptrld %p/%llx failed\n",
   vmcs, phys_addr);
 }
 


[PATCH 30/30] nVMX: Documentation

2011-05-08 Thread Nadav Har'El
This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
 Documentation/kvm/nested-vmx.txt |  243 +
 1 file changed, 243 insertions(+)

--- .before/Documentation/kvm/nested-vmx.txt	2011-05-08 10:43:22.000000000 +0300
+++ .after/Documentation/kvm/nested-vmx.txt	2011-05-08 10:43:22.000000000 +0300
@@ -0,0 +1,243 @@
+Nested VMX
+==========
+
+Overview
+--------
+
+On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
+to easily and efficiently run guest operating systems. Normally, these guests
+*cannot* themselves be hypervisors running their own guests, because in VMX,
+guests cannot use VMX instructions.
+
+The Nested VMX feature adds this missing capability - of running guest
+hypervisors (which use VMX) with their own nested guests. It does so by
+allowing a guest to use VMX instructions, and correctly and efficiently
+emulating them using the single level of VMX available in the hardware.
+
+We describe in much greater detail the theory behind the nested VMX feature,
+its implementation and its performance characteristics, in the OSDI 2010 paper
+The Turtles Project: Design and Implementation of Nested Virtualization,
+available at:
+
+   http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
+
+
+Terminology
+-----------
+
+Single-level virtualization has two levels - the host (KVM) and the guests.
+In nested virtualization, we have three levels: The host (KVM), which we call
+L0, the guest hypervisor, which we call L1, and its nested guest, which we
+call L2.
+
+
+Known limitations
+-----------------
+
+The current code supports running Linux guests under KVM guests.
+Only 64-bit guest hypervisors are supported.
+
+Additional patches for running Windows under guest KVM, and Linux under
+guest VMware server, and support for nested EPT, are currently running in
+the lab, and will be sent as follow-on patchsets.
+
+
+Running nested VMX
+------------------
+
+The nested VMX feature is disabled by default. It can be enabled by giving
+the nested=1 option to the kvm-intel module.
+
+No modifications are required to user space (qemu). However, qemu's default
+emulated CPU type (qemu64) does not list the VMX CPU feature, so it must be
+explicitly enabled, by giving qemu one of the following options:
+
+ -cpu host  (emulated CPU has all features of the real CPU)
+
+ -cpu qemu64,+vmx   (add just the vmx feature to a named CPU type)
+
+
+ABIs
+----
+
+Nested VMX aims to present a standard and (eventually) fully-functional VMX
+implementation for a guest hypervisor to use. As such, the official
+specification of the ABI that it provides is Intel's VMX specification,
+namely volume 3B of their Intel 64 and IA-32 Architectures Software
+Developer's Manual. Not all of VMX's features are currently fully supported,
+but the goal is to eventually support them all, starting with the VMX features
+which are used in practice by popular hypervisors (KVM and others).
+
+As a VMX implementation, nested VMX presents a VMCS structure to L1.
+As mandated by the spec, other than the two fields revision_id and abort,
+this structure is *opaque* to its user, who is not supposed to know or care
+about its internal structure. Rather, the structure is accessed through the
+VMREAD and VMWRITE instructions.
+Still, for debugging purposes, KVM developers might be interested to know the
+internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.
+For convenience, we repeat its content here. If the internals of this structure
+change, this can break live migration across KVM versions. VMCS12_REVISION
+(from vmx.c) should be changed if struct vmcs12 or its inner struct shadow_vmcs
+is ever changed.
+
+   typedef u64 natural_width;
+   struct __packed vmcs12 {
+   /* According to the Intel spec, a VMCS region must start with
+* these two user-visible fields */
+   u32 revision_id;
+   u32 abort;
+
+   u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+   u32 padding[7]; /* room for future expansion */
+
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+   u64 vm_exit_msr_store_addr;
+   u64 vm_exit_msr_load_addr;
+   u64 vm_entry_msr_load_addr;
+   u64 tsc_offset;
+   u64 virtual_apic_page_addr;
+   u64 apic_access_addr;
+   u64 ept_pointer;
+   u64 guest_physical_address;
+   u64 vmcs_link_pointer;
+   u64 guest_ia32_debugctl;
+   u64 guest_ia32_pat;
+   u64 guest_ia32_efer;
+   u64 guest_pdptr0;
+   u64 guest_pdptr1;
+	u64 guest_pdptr2;

Re: Is KVM_GET_SREGS safe from other threads?

2011-05-08 Thread Avi Kivity

On 05/08/2011 11:24 AM, Pekka Enberg wrote:

Hi!

We've noticed that sometimes KVM_GET_SREGS from a signal handler
hangs. We use it like this:

static void handle_sigquit(int sig)
{
  int i;

  for (i = 0; i < nrcpus; i++) {
  struct kvm_cpu *cpu = kvm_cpus[i];

  kvm_cpu__show_registers(cpu);   <-- here
  kvm_cpu__show_code(cpu);
  kvm_cpu__show_page_tables(cpu);
  }

and

void kvm_cpu__show_registers(struct kvm_cpu *self)
{
[...]
    if (ioctl(self->vcpu_fd, KVM_GET_SREGS, &sregs) < 0)
        die("KVM_GET_REGS failed");

is it not OK to call KVM_GET_SREGS from other threads than the one
that's doing KVM_RUN?


From Documentation/kvm/api.txt:

 - vcpu ioctls: These query and set attributes that control the operation
   of a single virtual cpu.

   Only run vcpu ioctls from the same thread that was used to create the
   vcpu.


So no, it is not okay (nor is it meaningful, you get a register snapshot 
that is disconnected from all other vcpu state).


--
error compiling committee.c: too many arguments to function



Re: KVM_GET_SUPPORTED_CPUID returning -E2BIG

2011-05-08 Thread Avi Kivity

On 05/08/2011 08:15 AM, Sasha Levin wrote:

Hello,

I'm seeing a case where KVM tools occasionally fails with the following
error message: "KVM_GET_SUPPORTED_CPUID failed: Argument list too long",
which means that we get -E2BIG back from KVM_GET_SUPPORTED_CPUID.

Why would it happen if we pass KVM_MAX_CPUID_ENTRIES as the max number
of entries (in the nent field of struct kvm_cpuid2)?


KVM_MAX_CPUID_ENTRIES is a private define, not exported to userspace 
(since it can change).



Also, Why would it happen randomly and not each time the code is run?


Probably you have some bug.  E2BIG is returned when nent is smaller than 
the number of entries returned (which only depends on the host cpu type).


--
error compiling committee.c: too many arguments to function
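
For what it's worth, a hedged userspace sketch of the usual pattern: grow
nent and retry on E2BIG rather than guessing the host's maximum (kvm_fd is
an open /dev/kvm descriptor; error handling is trimmed):

	struct kvm_cpuid2 *cpuid;
	int nent = 64;

	for (;;) {
		cpuid = calloc(1, sizeof(*cpuid) +
			       nent * sizeof(struct kvm_cpuid_entry2));
		cpuid->nent = nent;
		if (ioctl(kvm_fd, KVM_GET_SUPPORTED_CPUID, cpuid) == 0)
			break;		/* cpuid->nent now holds the real count */
		if (errno != E2BIG)
			die("KVM_GET_SUPPORTED_CPUID failed");
		free(cpuid);		/* buffer too small: double and retry */
		nent *= 2;
	}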



Re: Is KVM_GET_SREGS safe from other threads?

2011-05-08 Thread Pekka Enberg
On Sun, May 8, 2011 at 11:50 AM, Avi Kivity a...@redhat.com wrote:
 On 05/08/2011 11:24 AM, Pekka Enberg wrote:

 Hi!

 We've noticed that sometimes KVM_GET_SREGS from a signal handler
 hangs. We use it like this:

 static void handle_sigquit(int sig)
 {
          int i;

           for (i = 0; i < nrcpus; i++) {
                  struct kvm_cpu *cpu = kvm_cpus[i];

                   kvm_cpu__show_registers(cpu);   <-- here
                  kvm_cpu__show_code(cpu);
                  kvm_cpu__show_page_tables(cpu);
          }

 and

 void kvm_cpu__show_registers(struct kvm_cpu *self)
 {
 [...]
         if (ioctl(self->vcpu_fd, KVM_GET_SREGS, &sregs) < 0)
                 die("KVM_GET_REGS failed");

 is it not OK to call KVM_GET_SREGS from other threads than the one
 that's doing KVM_RUN?

 From Documentation/kvm/api.txt:

  - vcpu ioctls: These query and set attributes that control the operation
   of a single virtual cpu.

   Only run vcpu ioctls from the same thread that was used to create the
   vcpu.


 So no, it is not okay (nor is it meaningful, you get a register snapshot
 that is disconnected from all other vcpu state).

Aah, I've read that part at some point but forgot all about it. Thanks, Avi!


[PATCH] kvm tools: Fix 'kill -3' hangs

2011-05-08 Thread Pekka Enberg
Ingo Molnar reported that 'kill -3' didn't work on his machine:

  * Ingo Molnar mi...@elte.hu wrote:

   This is really cumbersome to debug - is there some good way to get to the
   RIP that the guest is hanging in? If kvm would print that out to the host
   console (even if it's just the raw RIP initially) on a kill -3 that would
   help enormously.

  Looks like the code should be doing that already - but the
  ioctl(KVM_GET_SREGS) hangs:

[pid   748] ioctl(6, KVM_GET_SREGS

Avi Kivity pointed out that it's not safe to call KVM_GET_SREGS (or other vcpu
related ioctls) from other threads:

   is it not OK to call KVM_GET_SREGS from other threads than the one
   that's doing KVM_RUN?

  From Documentation/kvm/api.txt:

   - vcpu ioctls: These query and set attributes that control the operation
 of a single virtual cpu.

 Only run vcpu ioctls from the same thread that was used to create the
 vcpu.

Fix that up by using pthread_kill() to force the threads that are doing KVM_RUN
to do the register dumps.

Reported-by: Ingo Molnar mi...@elte.hu
Cc: Asias He asias.he...@gmail.com
Cc: Avi Kivity a...@redhat.com
Cc: Cyrill Gorcunov gorcu...@gmail.com
Cc: Ingo Molnar mi...@elte.hu
Cc: Prasad Joshi prasadjoshi...@gmail.com
Cc: Sasha Levin levinsasha...@gmail.com
Signed-off-by: Pekka Enberg penb...@kernel.org
---
 tools/kvm/kvm-run.c |   20 +---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
index eb50b6a..58e2977 100644
--- a/tools/kvm/kvm-run.c
+++ b/tools/kvm/kvm-run.c
@@ -127,6 +127,18 @@ static const struct option options[] = {
OPT_END()
 };
 
+static void handle_sigusr1(int sig)
+{
+   struct kvm_cpu *cpu = current_kvm_cpu;
+
+   if (!cpu)
+   return;
+
+   kvm_cpu__show_registers(cpu);
+   kvm_cpu__show_code(cpu);
+   kvm_cpu__show_page_tables(cpu);
+}
+
 static void handle_sigquit(int sig)
 {
int i;
@@ -134,9 +146,10 @@ static void handle_sigquit(int sig)
	for (i = 0; i < nrcpus; i++) {
struct kvm_cpu *cpu = kvm_cpus[i];
 
-   kvm_cpu__show_registers(cpu);
-   kvm_cpu__show_code(cpu);
-   kvm_cpu__show_page_tables(cpu);
+   if (!cpu)
+   continue;
+
+		pthread_kill(cpu->thread, SIGUSR1);
}
 
serial8250__inject_sysrq(kvm);
@@ -332,6 +345,7 @@ int kvm_cmd_run(int argc, const char **argv, const char *prefix)
 
signal(SIGALRM, handle_sigalrm);
signal(SIGQUIT, handle_sigquit);
+   signal(SIGUSR1, handle_sigusr1);
 
while (argc != 0) {
argc = parse_options(argc, argv, options, run_usage,
-- 
1.7.0.4
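
The patch relies on each vcpu thread being able to find its own struct from
the signal handler; a hedged sketch of that plumbing (the details in kvm
tools may differ):

	/* Each vcpu thread publishes itself in thread-local storage so the
	 * SIGUSR1 handler, which runs on that same thread, can find it. */
	static __thread struct kvm_cpu *current_kvm_cpu;

	static void *kvm_cpu_thread(void *arg)
	{
		current_kvm_cpu = arg;	/* set before entering the KVM_RUN loop */
		kvm_cpu__run(current_kvm_cpu);
		return NULL;
	}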



Re: [PATCH 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Asias He
On 05/08/2011 02:22 PM, Ingo Molnar wrote:
 
 * Asias He asias.he...@gmail.com wrote:
 
 As virtio spec says:

 
  Because this is high importance and low bandwidth, the current Linux
  implementation polls for the buffer to be used, rather than waiting
  for an interrupt, simplifying the implementation significantly.
 

 drivers/char/virtio_console.c
  send_buf() {
  ...
  /* Tell Host to go! */
  virtqueue_kick(out_vq);
  ...
 while (!virtqueue_get_buf(out_vq, &len))
 cpu_relax();
  ...
  }

 The console hangs can simply be reproduced by yes command which
 gives tremendous console IOs and IRQs.

 [   16.786440] irq 4: nobody cared (try booting with the irqpoll option)
 [   16.786440] Pid: 1437, comm: yes Tainted: GW 2.6.39-rc6+ #56
 [   16.786440] Call Trace:
 [   16.786440]  [<c16578eb>] __report_bad_irq+0x30/0x89
 [   16.786440]  [<c10980e6>] note_interrupt+0x118/0x17a
 [   16.786440]  [<c1096e7d>] handle_irq_event_percpu+0x168/0x179
 [   16.786440]  [<c1096eba>] handle_irq_event+0x2c/0x46
 [   16.786440]  [<c1098516>] ? unmask_irq+0x1e/0x1e
 [   16.786440]  [<c1098566>] handle_level_irq+0x50/0x6e
 [   16.786440]  <IRQ>  [<c102fa69>] ? do_IRQ+0x35/0x7f
 [   16.786440]  [<c1665ea9>] ? common_interrupt+0x29/0x30
 [   16.786440]  [<c16610d6>] ? _raw_spin_unlock_irqrestore+0x7/0x28
 [   16.786440]  [<c1364f65>] ? hvc_write+0x88/0x9e
 [   16.786440]  [<c1355500>] ? do_output_char+0x88/0x18a
 [   16.786440]  [<c1355631>] ? process_output+0x2f/0x42
 [   16.786440]  [<c1355af6>] ? n_tty_write+0x211/0x2dc
 [   16.786440]  [<c1059d77>] ? try_to_wake_up+0x226/0x226
 [   16.786440]  [<c13534a4>] ? tty_write+0x15e/0x1d1
 [   16.786440]  [<c12c1644>] ? security_file_permission+0x22/0x26
 [   16.786440]  [<c13558e5>] ? process_echoes+0x241/0x241
 [   16.786440]  [<c10dd9d2>] ? vfs_write+0x84/0xd7
 [   16.786440]  [<c1353346>] ? tty_write_lock+0x3d/0x3d
 [   16.786440]  [<c10ddb92>] ? sys_write+0x3b/0x5d
 [   16.786440]  [<c166594c>] ? sysenter_do_call+0x12/0x22
 [   16.786440] handlers:
 [   16.786440] [<c1351397>] (vp_interrupt+0x0/0x3a)
 [   16.786440] Disabling IRQ #4
 
 Hm, why is irq #4 active if the guest-side virtio console driver does not 
 handle it?
 
 Signed-off-by: Asias He asias.he...@gmail.com
 ---
  tools/kvm/virtio/console.c |2 --
  1 files changed, 0 insertions(+), 2 deletions(-)

 diff --git a/tools/kvm/virtio/console.c b/tools/kvm/virtio/console.c
 index f5449ba..1fecf37 100644
 --- a/tools/kvm/virtio/console.c
 +++ b/tools/kvm/virtio/console.c
 @@ -171,8 +171,6 @@ static void virtio_console_handle_callback(struct kvm *self, void *param)
  len = term_putc_iov(CONSOLE_VIRTIO, iov, out);
  virt_queue__set_used_elem(vq, head, len);
  }
 -
 -	virt_queue__trigger_irq(vq, virtio_console_pci_device.irq_line, &cdev.isr, self);
  }
 
 I think this at least requires a comment at that place, that we intentionally 
 skip notifying the guest, because Linux guests do not use the console IRQ.

Will do.

 Does the guest-side virtio driver *ever* use the irq?

Yes. They use IRQ at least for the RX path.

 Thanks,
 
   Ingo
 


-- 
Best Regards,
Asias He


[PATCH v2 0/4] Resend this patch series.

2011-05-08 Thread Asias He
Added a comment in console.c, as suggested by Ingo.

Asias He (4):
  kvm tools: Use virt_queue__trigger_irq() to trigger IRQ for virtio
console
  kvm tools: Use virt_queue__trigger_irq() to trigger IRQ for virtio
blk
  kvm tools: Use virt_queue__trigger_irq() to trigger IRQ for virtio
rng
  kvm tools: Fix virtio console hangs by removing IRQ injection for tx
path

 tools/kvm/virtio/blk.c |8 +---
 tools/kvm/virtio/console.c |   15 +++
 tools/kvm/virtio/rng.c |8 +---
 3 files changed, 21 insertions(+), 10 deletions(-)

-- 
1.7.5.1



[PATCH v2 1/4] kvm tools: Use virt_queue__trigger_irq() to trigger IRQ for virtio console

2011-05-08 Thread Asias He
This patch uses the IRQ injection mechanism introduced by
virt_queue__trigger_irq(), which respects the virtio IRQ status
and VRING_AVAIL_F_NO_INTERRUPT.

Signed-off-by: Asias He asias.he...@gmail.com
---
 tools/kvm/virtio/console.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/kvm/virtio/console.c b/tools/kvm/virtio/console.c
index 188227b..f5449ba 100644
--- a/tools/kvm/virtio/console.c
+++ b/tools/kvm/virtio/console.c
@@ -49,6 +49,7 @@ struct con_dev {
u32 guest_features;
u16 config_vector;
u8  status;
+   u8  isr;
u16 queue_selector;
 
void*jobs[VIRTIO_CONSOLE_NUM_QUEUES];
@@ -85,7 +86,7 @@ static void virtio_console__inject_interrupt_callback(struct kvm *self, void *pa
	head = virt_queue__get_iov(vq, iov, &out, &in, self);
len = term_getc_iov(CONSOLE_VIRTIO, iov, in);
virt_queue__set_used_elem(vq, head, len);
-   kvm__irq_line(self, virtio_console_pci_device.irq_line, 1);
+	virt_queue__trigger_irq(vq, virtio_console_pci_device.irq_line, &cdev.isr, self);
}
 
	mutex_unlock(&cdev.mutex);
@@ -139,8 +140,9 @@ static bool virtio_console_pci_io_in(struct kvm *self, u16 port, void *data, int
ioport__write8(data, cdev.status);
break;
case VIRTIO_PCI_ISR:
-   ioport__write8(data, 0x1);
-   kvm__irq_line(self, virtio_console_pci_device.irq_line, 0);
+   ioport__write8(data, cdev.isr);
+	kvm__irq_line(self, virtio_console_pci_device.irq_line, VIRTIO_IRQ_LOW);
+	cdev.isr = VIRTIO_IRQ_LOW;
break;
case VIRTIO_MSI_CONFIG_VECTOR:
ioport__write16(data, cdev.config_vector);
@@ -170,7 +172,7 @@ static void virtio_console_handle_callback(struct kvm *self, void *param)
virt_queue__set_used_elem(vq, head, len);
}
 
-   kvm__irq_line(self, virtio_console_pci_device.irq_line, 1);
+	virt_queue__trigger_irq(vq, virtio_console_pci_device.irq_line, &cdev.isr, self);
 }
 
 static bool virtio_console_pci_io_out(struct kvm *self, u16 port, void *data, int size, u32 count)
-- 
1.7.5.1
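
For context, a hedged sketch of what a helper like virt_queue__trigger_irq()
plausibly does -- honor VRING_AVAIL_F_NO_INTERRUPT and latch the ISR bit
(field names here are assumptions, not the tool's exact code):

	static void toy_trigger_irq(struct virt_queue *vq, int irq_line,
				    u8 *isr, struct kvm *kvm)
	{
		/* The guest sets this flag when it does not want completion
		 * interrupts, e.g. while it polls the used ring. */
		if (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
			return;

		*isr |= 0x1;	/* queue-interrupt bit in the virtio ISR */
		kvm__irq_line(kvm, irq_line, VIRTIO_IRQ_HIGH);
		/* The guest's read of VIRTIO_PCI_ISR lowers the line again,
		 * as the VIRTIO_PCI_ISR handlers in these patches show. */
	}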



[PATCH v2 2/4] kvm tools: Use virt_queue__trigger_irq() to trigger IRQ for virtio blk

2011-05-08 Thread Asias He
This patch uses the IRQ injection mechanism introduced by
virt_queue__trigger_irq(), which respects the virtio IRQ status
and VRING_AVAIL_F_NO_INTERRUPT.

Signed-off-by: Asias He asias.he...@gmail.com
---
 tools/kvm/virtio/blk.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/kvm/virtio/blk.c b/tools/kvm/virtio/blk.c
index cc3dc78..12c7029 100644
--- a/tools/kvm/virtio/blk.c
+++ b/tools/kvm/virtio/blk.c
@@ -38,6 +38,7 @@ struct blk_dev {
u32 guest_features;
u16 config_vector;
u8  status;
+   u8  isr;
u8  idx;
 
/* virtio queue */
@@ -102,8 +103,9 @@ static bool virtio_blk_pci_io_in(struct kvm *self, u16 port, void *data, int siz
	ioport__write8(data, bdev->status);
break;
case VIRTIO_PCI_ISR:
-	ioport__write8(data, 0x1);
-	kvm__irq_line(self, bdev->pci_hdr.irq_line, 0);
+	ioport__write8(data, bdev->isr);
+	kvm__irq_line(self, bdev->pci_hdr.irq_line, VIRTIO_IRQ_LOW);
+	bdev->isr = VIRTIO_IRQ_LOW;
break;
case VIRTIO_MSI_CONFIG_VECTOR:
ioport__write16(data, bdev-config_vector);
@@ -167,7 +169,7 @@ static void virtio_blk_do_io(struct kvm *kvm, void *param)
while (virt_queue__available(vq))
virtio_blk_do_io_request(kvm, bdev, vq);
 
-	kvm__irq_line(kvm, bdev->pci_hdr.irq_line, 1);
+	virt_queue__trigger_irq(vq, bdev->pci_hdr.irq_line, &bdev->isr, kvm);
 }
 
 static bool virtio_blk_pci_io_out(struct kvm *self, u16 port, void *data, int size, u32 count)
-- 
1.7.5.1



[PATCH v2 3/4] kvm tools: Use virt_queue__trigger_irq() to trigger IRQ for virtio rng

2011-05-08 Thread Asias He
This patch uses the IRQ injection mechanism introduced by
virt_queue__trigger_irq(), which respects the virtio IRQ status
and VRING_AVAIL_F_NO_INTERRUPT.

Signed-off-by: Asias He asias.he...@gmail.com
---
 tools/kvm/virtio/rng.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/kvm/virtio/rng.c b/tools/kvm/virtio/rng.c
index d355cf8..f692dfd 100644
--- a/tools/kvm/virtio/rng.c
+++ b/tools/kvm/virtio/rng.c
@@ -37,6 +37,7 @@ static struct pci_device_header virtio_rng_pci_device = {
 
 struct rng_dev {
u8  status;
+   u8  isr;
u16 config_vector;
int fd;
 
@@ -72,8 +73,9 @@ static bool virtio_rng_pci_io_in(struct kvm *kvm, u16 port, void *data, int size
ioport__write8(data, rdev.status);
break;
case VIRTIO_PCI_ISR:
-	ioport__write8(data, 0x1);
-	kvm__irq_line(kvm, virtio_rng_pci_device.irq_line, 0);
+	ioport__write8(data, rdev.isr);
+	kvm__irq_line(kvm, virtio_rng_pci_device.irq_line, VIRTIO_IRQ_LOW);
+	rdev.isr = VIRTIO_IRQ_LOW;
break;
case VIRTIO_MSI_CONFIG_VECTOR:
ioport__write16(data, rdev.config_vector);
@@ -106,7 +108,7 @@ static void virtio_rng_do_io(struct kvm *kvm, void *param)
 
while (virt_queue__available(vq)) {
virtio_rng_do_io_request(kvm, vq);
-		kvm__irq_line(kvm, virtio_rng_pci_device.irq_line, 1);
+		virt_queue__trigger_irq(vq, virtio_rng_pci_device.irq_line, &rdev.isr, kvm);
}
 }
 
-- 
1.7.5.1



[PATCH v2 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Asias He
As virtio spec says:


 Because this is high importance and low bandwidth, the current Linux
 implementation polls for the buffer to be used, rather than waiting
 for an interrupt, simplifying the implementation significantly.


drivers/char/virtio_console.c
 send_buf() {
 ...
/* Tell Host to go! */
virtqueue_kick(out_vq);
 ...
while (!virtqueue_get_buf(out_vq, &len))
cpu_relax();
 ...
 }

The console hangs can simply be reproduced by yes command which
gives tremendous console IOs and IRQs.

[   16.786440] irq 4: nobody cared (try booting with the irqpoll option)
[   16.786440] Pid: 1437, comm: yes Tainted: GW 2.6.39-rc6+ #56
[   16.786440] Call Trace:
[   16.786440]  [<c16578eb>] __report_bad_irq+0x30/0x89
[   16.786440]  [<c10980e6>] note_interrupt+0x118/0x17a
[   16.786440]  [<c1096e7d>] handle_irq_event_percpu+0x168/0x179
[   16.786440]  [<c1096eba>] handle_irq_event+0x2c/0x46
[   16.786440]  [<c1098516>] ? unmask_irq+0x1e/0x1e
[   16.786440]  [<c1098566>] handle_level_irq+0x50/0x6e
[   16.786440]  <IRQ>  [<c102fa69>] ? do_IRQ+0x35/0x7f
[   16.786440]  [<c1665ea9>] ? common_interrupt+0x29/0x30
[   16.786440]  [<c16610d6>] ? _raw_spin_unlock_irqrestore+0x7/0x28
[   16.786440]  [<c1364f65>] ? hvc_write+0x88/0x9e
[   16.786440]  [<c1355500>] ? do_output_char+0x88/0x18a
[   16.786440]  [<c1355631>] ? process_output+0x2f/0x42
[   16.786440]  [<c1355af6>] ? n_tty_write+0x211/0x2dc
[   16.786440]  [<c1059d77>] ? try_to_wake_up+0x226/0x226
[   16.786440]  [<c13534a4>] ? tty_write+0x15e/0x1d1
[   16.786440]  [<c12c1644>] ? security_file_permission+0x22/0x26
[   16.786440]  [<c13558e5>] ? process_echoes+0x241/0x241
[   16.786440]  [<c10dd9d2>] ? vfs_write+0x84/0xd7
[   16.786440]  [<c1353346>] ? tty_write_lock+0x3d/0x3d
[   16.786440]  [<c10ddb92>] ? sys_write+0x3b/0x5d
[   16.786440]  [<c166594c>] ? sysenter_do_call+0x12/0x22
[   16.786440] handlers:
[   16.786440] [<c1351397>] (vp_interrupt+0x0/0x3a)
[   16.786440] Disabling IRQ #4

Signed-off-by: Asias He asias.he...@gmail.com
---
 tools/kvm/virtio/console.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/tools/kvm/virtio/console.c b/tools/kvm/virtio/console.c
index f5449ba..f9031cb 100644
--- a/tools/kvm/virtio/console.c
+++ b/tools/kvm/virtio/console.c
@@ -166,13 +166,18 @@ static void virtio_console_handle_callback(struct kvm *self, void *param)
 
vq = param;
 
+	/*
+	 * The current Linux implementation polls for the buffer
+	 * to be used, rather than waiting for an interrupt.
+	 * So there is no need to inject an interrupt for the tx path.
+	 */
+
while (virt_queue__available(vq)) {
	head = virt_queue__get_iov(vq, iov, &out, &in, self);
len = term_putc_iov(CONSOLE_VIRTIO, iov, out);
virt_queue__set_used_elem(vq, head, len);
}
 
-	virt_queue__trigger_irq(vq, virtio_console_pci_device.irq_line, &cdev.isr, self);
 }
 
 static bool virtio_console_pci_io_out(struct kvm *self, u16 port, void *data, int size, u32 count)
-- 
1.7.5.1



Re: [PATCH] kvm tools: Enable earlyprintk=serial by default

2011-05-08 Thread Rodrigo Campos
On Sun, May 08, 2011 at 09:39:34AM +0200, Ingo Molnar wrote:
 
 diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
 index 764a242..eb50b6a 100644
 --- a/tools/kvm/kvm-run.c
 +++ b/tools/kvm/kvm-run.c
  @@ -409,7 +409,7 @@ int kvm_cmd_run(int argc, const char **argv, const char *prefix)
   kvm-nrcpus = nrcpus;
  
   memset(real_cmdline, 0, sizeof(real_cmdline));
 - strcpy(real_cmdline, "notsc noapic noacpi pci=conf1 console=ttyS0 ");
 + strcpy(real_cmdline, "notsc noapic noacpi pci=conf1 console=ttyS0 earlyprintk=serial");

I think the space at the end of the string that you delete squashes the
earlyprintk option with the next one (root= in your case), as in the output
you show:

  [    0.000000] Command line: notsc noapic noacpi pci=conf1 console=ttyS0 earlyprintk=serialroot=/dev/vda1 rw
I mean here: "earlyprintk=serial" and "root=" run together.





Thanks,
Rodrigo


Re: [PATCH v2 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Pekka Enberg
On Sun, 2011-05-08 at 21:09 +0800, Asias He wrote:
 As virtio spec says:
 
 
  Because this is high importance and low bandwidth, the current Linux
  implementation polls for the buffer to be used, rather than waiting
   for an interrupt, simplifying the implementation significantly.
 
 
 drivers/char/virtio_console.c
  send_buf() {
  ...
   /* Tell Host to go! */
   virtqueue_kick(out_vq);
  ...
  while (!virtqueue_get_buf(out_vq, &len))
 cpu_relax();
  ...
  }
 
 The console hangs can simply be reproduced by yes command which
 gives tremendous console IOs and IRQs.

Sasha, does this fix the hangs you were seeing? We should re-enable
virtio console unconditionally if this does - that increases test
coverage for virtio console.

Pekka



Re: [PATCH v2 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Sasha Levin
On Sun, 2011-05-08 at 20:05 +0300, Pekka Enberg wrote:
 On Sun, 2011-05-08 at 21:09 +0800, Asias He wrote:
  As virtio spec says:
  
  
   Because this is high importance and low bandwidth, the current Linux
   implementation polls for the buffer to be used, rather than waiting
    for an interrupt, simplifying the implementation significantly.
  
  
  drivers/char/virtio_console.c
   send_buf() {
   ...
  /* Tell Host to go! */
  virtqueue_kick(out_vq);
   ...
   while (!virtqueue_get_buf(out_vq, &len))
  cpu_relax();
   ...
   }
  
  The console hangs can simply be reproduced by yes command which
  gives tremendous console IOs and IRQs.
 
 Sasha, does this fix the hangs you were seeing? We should re-enable
 virtio console unconditionally if this does - that increases test
 coverage for virtio console.

I'm seeing no more hangs, but why enable it unconditionally?
Maybe enable it by default, but we shouldn't force the activation of
virtio modules if the user doesn't want them.

-- 

Sasha.



Re: [PATCH v2 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Pekka Enberg
On Sun, May 8, 2011 at 8:21 PM, Sasha Levin levinsasha...@gmail.com wrote:
 I'm seeing no more hangs, but why enable it unconditionally?
 Maybe enable it by default, but we shouldn't force the activation of
 virtio modules if the user doesn't want them.

I meant enabling the device on PCI bus like we did before.


Re: [PATCH v2 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Sasha Levin
On Sun, 2011-05-08 at 20:28 +0300, Pekka Enberg wrote:
 On Sun, May 8, 2011 at 8:21 PM, Sasha Levin levinsasha...@gmail.com wrote:
  I'm seeing no more hangs, but why enable it unconditionally?
  Maybe enable it by default, but we shouldn't force the activation of
  virtio modules if the user doesn't want them.
 
 I meant enabling the device on PCI bus like we did before.

That's what I meant too. virtio-console is the only device which got
initialized even if it wasn't requested (even when '-c serial' was
passed specifically).

-- 

Sasha.



Re: [PATCH v2 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Pekka Enberg
On Sun, May 8, 2011 at 8:29 PM, Sasha Levin levinsasha...@gmail.com wrote:
 On Sun, 2011-05-08 at 20:28 +0300, Pekka Enberg wrote:
 On Sun, May 8, 2011 at 8:21 PM, Sasha Levin levinsasha...@gmail.com wrote:
  I'm seeing no more hangs, but why enable it unconditionally?
  Maybe enable it by default, but we shouldn't force the activation of
  virtio modules if the user doesn't want them.

 I meant enabling the device on PCI bus like we did before.

  That's what I meant too. virtio-console is the only device which got
 initialized even if it wasn't requested (even when '-c serial' was
 passed specifically).

The more options we have, the more combinations we need to test.
What's the downside of enabling virtio console by default? The upside
is that it's less likely to break. Btw, we should probably do that for
virtio rng as well.


Re: [PATCH v2 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Sasha Levin
On Sun, 2011-05-08 at 20:34 +0300, Pekka Enberg wrote:
 On Sun, May 8, 2011 at 8:29 PM, Sasha Levin levinsasha...@gmail.com wrote:
  On Sun, 2011-05-08 at 20:28 +0300, Pekka Enberg wrote:
  On Sun, May 8, 2011 at 8:21 PM, Sasha Levin levinsasha...@gmail.com 
  wrote:
   I'm seeing no more hangs, but why enable it unconditionally?
   Maybe enable it by default, but we shouldn't force the activation of
   virtio modules if the user doesn't want them.
 
  I meant enabling the device on PCI bus like we did before.
 
  Thats what I've meant too. virtio-console is the only device which got
  initialized even if it wasn't requested (even when '-c serial' was
  passed specifically).
 
 The more options we have, the more combinations we need to test.
 What's the downside of enabling virtio console by default? The upside
 is that it's less likely to break. Btw, we should probably do that for
 virtio rng as well.

I fully support enabling it by default, I'm against enabling it even if
the user asked to have it disabled.

-- 

Sasha.



Re: [PATCH v2 4/4] kvm tools: Fix virtio console hangs by removing IRQ injection for tx path

2011-05-08 Thread Pekka Enberg
On Sun, 2011-05-08 at 20:35 +0300, Sasha Levin wrote:
 I fully support enabling it by default, I'm against enabling it even if
 the user asked to have it disabled.

Sure.




[PATCH] kvm tools: PCI -- Make PCI device numbers being unique

2011-05-08 Thread Cyrill Gorcunov
PCI device numbers must be unique on a bus (as part
of the Bus/Device/Function tuple). Make it so. Note the patch
is rather a quick fix, since we need a smarter pci device
manager (in particular, multiple virtio block devices most
probably should live on a separate pci bus).

Signed-off-by: Cyrill Gorcunov gorcu...@gmail.com
---
 tools/kvm/include/kvm/virtio-pci-dev.h |5 +
 tools/kvm/virtio/blk.c |2 +-
 tools/kvm/virtio/console.c |2 +-
 tools/kvm/virtio/net.c |2 +-
 tools/kvm/virtio/rng.c |2 +-
 5 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6.git/tools/kvm/include/kvm/virtio-pci-dev.h
===================================================================
--- linux-2.6.git.orig/tools/kvm/include/kvm/virtio-pci-dev.h
+++ linux-2.6.git/tools/kvm/include/kvm/virtio-pci-dev.h
@@ -16,4 +16,9 @@
 #define PCI_SUBSYSTEM_ID_VIRTIO_CONSOLE0x0003
 #define PCI_SUBSYSTEM_ID_VIRTIO_RNG0x0004

+#define PCI_DEVICE_VIRTIO_NET  0x2
+#define PCI_DEVICE_VIRTIO_BLK  0x1
+#define PCI_DEVICE_VIRTIO_CONSOLE  0x3
+#define PCI_DEVICE_VIRTIO_RNG  0x4
+
 #endif /* VIRTIO_PCI_DEV_H_ */
Index: linux-2.6.git/tools/kvm/virtio/blk.c
===================================================================
--- linux-2.6.git.orig/tools/kvm/virtio/blk.c
+++ linux-2.6.git/tools/kvm/virtio/blk.c
@@ -295,7 +295,7 @@ void virtio_blk__init(struct kvm *self,
},
};

-	if (irq__register_device(PCI_DEVICE_ID_VIRTIO_BLK, &dev, &pin, &line) < 0)
+	if (irq__register_device(PCI_DEVICE_VIRTIO_BLK, &dev, &pin, &line) < 0)
return;

	bdev->pci_hdr.irq_pin   = pin;
Index: linux-2.6.git/tools/kvm/virtio/console.c
===================================================================
--- linux-2.6.git.orig/tools/kvm/virtio/console.c
+++ linux-2.6.git/tools/kvm/virtio/console.c
@@ -238,7 +238,7 @@ void virtio_console__init(struct kvm *se
 {
u8 dev, line, pin;

-	if (irq__register_device(PCI_DEVICE_ID_VIRTIO_CONSOLE, &dev, &pin, &line) < 0)
+	if (irq__register_device(PCI_DEVICE_VIRTIO_CONSOLE, &dev, &pin, &line) < 0)
return;

virtio_console_pci_device.irq_pin   = pin;
Index: linux-2.6.git/tools/kvm/virtio/net.c
===================================================================
--- linux-2.6.git.orig/tools/kvm/virtio/net.c
+++ linux-2.6.git/tools/kvm/virtio/net.c
@@ -385,7 +385,7 @@ void virtio_net__init(const struct virti
if (virtio_net__tap_init(params)) {
u8 dev, line, pin;

-		if (irq__register_device(PCI_DEVICE_ID_VIRTIO_NET, &dev, &pin, &line) < 0)
+		if (irq__register_device(PCI_DEVICE_VIRTIO_NET, &dev, &pin, &line) < 0)
return;

virtio_net_pci_device.irq_pin   = pin;
Index: linux-2.6.git/tools/kvm/virtio/rng.c
===================================================================
--- linux-2.6.git.orig/tools/kvm/virtio/rng.c
+++ linux-2.6.git/tools/kvm/virtio/rng.c
@@ -178,5 +178,5 @@ void virtio_rng__init(struct kvm *kvm)
virtio_rng_pci_device.irq_line  = line;
	pci__register(&virtio_rng_pci_device, dev);

-	ioport__register(IOPORT_VIRTIO_RNG, &virtio_rng_io_ops, IOPORT_VIRTIO_RNG_SIZE);
+	ioport__register(PCI_DEVICE_VIRTIO_RNG, &virtio_rng_io_ops, IOPORT_VIRTIO_RNG_SIZE);
 }


2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-08 Thread Nikola Ciprich
Hello everyboy,
while installing new virt machine today, I noticed that 2.6.32 x86_64 SMP 
guests are hanging if they have paravirt-clock enabled...
Either they don't finish booting at all, or boot but hang soon after..
Such a hanged guest fully loads all host cpus..
The host is 6core x86_64 runnig 2.6.37.6 with 24GB RAM.

kvm_stat:

 kvm_exit(EXTERNAL_INTERRUPT)  15682    1643
 kvm_exit                      15514    1643
 kvm_entry                     15416    1643
 kvm_set_irq                       1       0
 kvm_msi_set_irq                   1       0
 kvm_apic_accept_irq               1       0
 kvm_exit(VMCLEAR)                 6       0
 kvm_exit(VMON)                    6       0
 kvm_exit(PAUSE_INSTRUCTION)       5       0
 kvm_exit(MCE_DURING_VMENTRY)      5       0
 kvm_exit(MWAIT_INSTRUCTION)       5       0
 kvm_exit(DR_ACCESS)               5       0
 kvm_exit(EPT_VIOLATION)           5       0
 kvm_exit(NMI_WINDOW)              5       0
 kvm_exit(VMPTRLD)                 5       0
 kvm_exit(TASK_SWITCH)             5       0
 kvm_exit(VMREAD)                  5       0
 kvm_exit(VMLAUNCH)                5       0
 kvm_exit(RDPMC)                   5       0

perf top:
   16.00 10.9% add_preempt_count    [kernel.kallsyms]
   16.00 10.9% do_raw_spin_lock     [kernel.kallsyms]
   15.00 10.2% sub_preempt_count    [kernel.kallsyms]
    8.00  5.4% irq_exit             [kernel.kallsyms]
    7.00  4.8% vmx_vcpu_run         /lib/modules/2.6.37lb.09/kernel/arch/x86/kvm/kvm-intel.ko
    7.00  4.8% page_fault           [kernel.kallsyms]
    5.00  3.4% mempool_free         [kernel.kallsyms]

info registers:
RAX=00f42400 RBX=81533f00 RCX=0016 
RDX=00077358f500
RSI=1dcd6500 RDI=0001 RBP=880009a03ee8 
RSP=880009a03ee8
R8 =0016 R9 =000a R10= 
R11=
R12=2a4d17d38f3303c1 R13=815fd000 R14=81592140 
R15=00093510
RIP=810767cb RFL=0006 [-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018   00c09300 DPL=0 DS   [-WA]
CS =0010   00a09b00 DPL=0 CS64 [-RA]
SS =0018   00c09300 DPL=0 DS   [-WA]
DS =0018   00c09300 DPL=0 DS   [-WA]
FS =  000f 
GS = 880009a0 000f 
LDT=  000f 
TR =0040 880009a11880 2087 8b00 DPL=0 TSS64-busy
GDT= 880009a04000 007f
IDT= 815fd000 0fff
CR0=8005003b CR2=7f424e540700 CR3=00021690c000 CR4=06f0
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=0d01
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=  FPR7= 
XMM00= XMM01=3c23d70a
XMM02= XMM03=
XMM04= XMM05=
XMM06= XMM07=
XMM08= XMM09=
XMM10= XMM11=
XMM12= XMM13=
XMM14= XMM15=

info cpus:
* CPU #0: pc=0x8105d4a0 thread_id=19639 
  CPU #1: pc=0x81013140 thread_id=19640 
  CPU #2: pc=0x8102a1b6 (halted) thread_id=19641 
  CPU #3: pc=0x81341521 thread_id=19642 
  CPU #4: pc=0x810415d8 thread_id=19643 
  CPU #5: pc=0x811ca521 thread_id=19644 
  CPU #6: pc=0x81013140 thread_id=19646 
  CPU #7: pc=0x8102a1b6 (halted) thread_id=19647 


and here are trace-cmds for all cpus:
http://nik.lbox.cz/public/trace-cmd.tar.bz2

Could somebody please have a look at this?

I also tried 2.6.38.5, but the result is the same...

cheers
nik




-- 
-
Ing. Nikola CIPRICH

Re: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique

2011-05-08 Thread Sasha Levin
On Sun, 2011-05-08 at 22:29 +0400, Cyrill Gorcunov wrote:
 Index: linux-2.6.git/tools/kvm/virtio/rng.c
 ===================================================================
 --- linux-2.6.git.orig/tools/kvm/virtio/rng.c
 +++ linux-2.6.git/tools/kvm/virtio/rng.c
 @@ -178,5 +178,5 @@ void virtio_rng__init(struct kvm *kvm)
   virtio_rng_pci_device.irq_line  = line;
   pci__register(virtio_rng_pci_device, dev);
 
  - ioport__register(IOPORT_VIRTIO_RNG, &virtio_rng_io_ops, IOPORT_VIRTIO_RNG_SIZE);
  + ioport__register(PCI_DEVICE_VIRTIO_RNG, &virtio_rng_io_ops, IOPORT_VIRTIO_RNG_SIZE);
  }

I think you wanted to change irq__register_device, not ioport__register
in virtio-rng.

-- 

Sasha.



Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-08 Thread Michael Tokarev
08.05.2011 22:33, Nikola Ciprich wrote:
 Hello everyboy,
 while installing new virt machine today, I noticed that 2.6.32 x86_64 SMP 
 guests are hanging if they have paravirt-clock enabled...

There were about 10 bugfixes pushed to the 2.6.32.y stable series;
some of them were for kvm-clock, and some were for problems
which manifested themselves like you described.  You may actually
take a look at which guest kernels you're booting.  FWIW.

/mjt


Re: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique

2011-05-08 Thread Cyrill Gorcunov
On 05/08/2011 10:48 PM, Sasha Levin wrote:
 On Sun, 2011-05-08 at 22:29 +0400, Cyrill Gorcunov wrote:
 Index: linux-2.6.git/tools/kvm/virtio/rng.c
 ===================================================================
 --- linux-2.6.git.orig/tools/kvm/virtio/rng.c
 +++ linux-2.6.git/tools/kvm/virtio/rng.c
 @@ -178,5 +178,5 @@ void virtio_rng__init(struct kvm *kvm)
  virtio_rng_pci_device.irq_line  = line;
  pci__register(virtio_rng_pci_device, dev);

 -	ioport__register(IOPORT_VIRTIO_RNG, &virtio_rng_io_ops, IOPORT_VIRTIO_RNG_SIZE);
 +	ioport__register(PCI_DEVICE_VIRTIO_RNG, &virtio_rng_io_ops, IOPORT_VIRTIO_RNG_SIZE);
  }
 
 I think you wanted to change irq__register_device, not ioport__register
 in virtio-rng.
 

Good catch Sasha, thanks!

-- 
Cyrill


[PATCH] kvm tools: Add missing space after kernel params

2011-05-08 Thread Sasha Levin
Add missing space so that user-provided kernel params
will be properly concatenated to default params.

Instead of just adding a space at the end, add it with
a separate strcat(), since it's not the first (and wouldn't
have been the last) time a space wasn't added.

Reported-by: Rodrigo Campos rodr...@sdfg.com.ar
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/kvm-run.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/tools/kvm/kvm-run.c b/tools/kvm/kvm-run.c
index 58e2977..d8ef4c9 100644
--- a/tools/kvm/kvm-run.c
+++ b/tools/kvm/kvm-run.c
@@ -424,6 +424,7 @@ int kvm_cmd_run(int argc, const char **argv, const char *prefix)
 
memset(real_cmdline, 0, sizeof(real_cmdline));
	strcpy(real_cmdline, "notsc noapic noacpi pci=conf1 console=ttyS0 earlyprintk=serial");
+	strcat(real_cmdline, " ");
if (kernel_cmdline)
strlcat(real_cmdline, kernel_cmdline, sizeof(real_cmdline));
 
-- 
1.7.5.rc3
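
A hedged aside: since the very next line already bounds the user-supplied
part with strlcat(), the separator could be appended the same way:

	strlcat(real_cmdline, " ", sizeof(real_cmdline));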



Re: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique

2011-05-08 Thread Cyrill Gorcunov
On 05/08/2011 10:54 PM, Cyrill Gorcunov wrote:
 On 05/08/2011 10:48 PM, Sasha Levin wrote:
 On Sun, 2011-05-08 at 22:29 +0400, Cyrill Gorcunov wrote:
 Index: linux-2.6.git/tools/kvm/virtio/rng.c
 ===================================================================
 --- linux-2.6.git.orig/tools/kvm/virtio/rng.c
 +++ linux-2.6.git/tools/kvm/virtio/rng.c
 @@ -178,5 +178,5 @@ void virtio_rng__init(struct kvm *kvm)
 virtio_rng_pci_device.irq_line  = line;
 pci__register(virtio_rng_pci_device, dev);

  -	ioport__register(IOPORT_VIRTIO_RNG, &virtio_rng_io_ops, IOPORT_VIRTIO_RNG_SIZE);
  +	ioport__register(PCI_DEVICE_VIRTIO_RNG, &virtio_rng_io_ops, IOPORT_VIRTIO_RNG_SIZE);
  }

 I think you wanted to change irq__register_device, not ioport__register
 in virtio-rng.

 
 Good catch Sasha, thanks!
 

This one should go better.
---
From: Cyrill Gorcunov gorcu...@gmail.com
Subject: [PATCH] kvm tools: PCI -- Make PCI device numbers being unique v2

PCI device numbers must be unique on a bus (as part
of the Bus/Device/Function tuple). Make it so. Note the patch
is rather a quick fix, since we need a smarter pci device
manager (in particular, multiple virtio block devices most
probably should live on a separate pci bus).

v2: Sasha spotted a nit in virtio_rng__init: the ioport
function was touched instead of irq__register_device.

Signed-off-by: Cyrill Gorcunov gorcu...@gmail.com
CC: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/include/kvm/virtio-pci-dev.h |5 +
 tools/kvm/virtio/blk.c |2 +-
 tools/kvm/virtio/console.c |2 +-
 tools/kvm/virtio/net.c |2 +-
 tools/kvm/virtio/rng.c |2 +-
 5 files changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6.git/tools/kvm/include/kvm/virtio-pci-dev.h
===================================================================
--- linux-2.6.git.orig/tools/kvm/include/kvm/virtio-pci-dev.h
+++ linux-2.6.git/tools/kvm/include/kvm/virtio-pci-dev.h
@@ -16,4 +16,9 @@
 #define PCI_SUBSYSTEM_ID_VIRTIO_CONSOLE0x0003
 #define PCI_SUBSYSTEM_ID_VIRTIO_RNG0x0004

+#define PCI_DEVICE_VIRTIO_NET  0x2
+#define PCI_DEVICE_VIRTIO_BLK  0x1
+#define PCI_DEVICE_VIRTIO_CONSOLE  0x3
+#define PCI_DEVICE_VIRTIO_RNG  0x4
+
 #endif /* VIRTIO_PCI_DEV_H_ */
Index: linux-2.6.git/tools/kvm/virtio/blk.c
===================================================================
--- linux-2.6.git.orig/tools/kvm/virtio/blk.c
+++ linux-2.6.git/tools/kvm/virtio/blk.c
@@ -295,7 +295,7 @@ void virtio_blk__init(struct kvm *self,
},
};

-	if (irq__register_device(PCI_DEVICE_ID_VIRTIO_BLK, &dev, &pin, &line) < 0)
+	if (irq__register_device(PCI_DEVICE_VIRTIO_BLK, &dev, &pin, &line) < 0)
return;

	bdev->pci_hdr.irq_pin   = pin;
Index: linux-2.6.git/tools/kvm/virtio/console.c
===================================================================
--- linux-2.6.git.orig/tools/kvm/virtio/console.c
+++ linux-2.6.git/tools/kvm/virtio/console.c
@@ -238,7 +238,7 @@ void virtio_console__init(struct kvm *se
 {
u8 dev, line, pin;

-	if (irq__register_device(PCI_DEVICE_ID_VIRTIO_CONSOLE, &dev, &pin, &line) < 0)
+	if (irq__register_device(PCI_DEVICE_VIRTIO_CONSOLE, &dev, &pin, &line) < 0)
return;

virtio_console_pci_device.irq_pin   = pin;
Index: linux-2.6.git/tools/kvm/virtio/net.c
===================================================================
--- linux-2.6.git.orig/tools/kvm/virtio/net.c
+++ linux-2.6.git/tools/kvm/virtio/net.c
@@ -385,7 +385,7 @@ void virtio_net__init(const struct virti
if (virtio_net__tap_init(params)) {
u8 dev, line, pin;

-		if (irq__register_device(PCI_DEVICE_ID_VIRTIO_NET, &dev, &pin, &line) < 0)
+		if (irq__register_device(PCI_DEVICE_VIRTIO_NET, &dev, &pin, &line) < 0)
return;

virtio_net_pci_device.irq_pin   = pin;
Index: linux-2.6.git/tools/kvm/virtio/rng.c
===================================================================
--- linux-2.6.git.orig/tools/kvm/virtio/rng.c
+++ linux-2.6.git/tools/kvm/virtio/rng.c
@@ -171,7 +171,7 @@ void virtio_rng__init(struct kvm *kvm)
if (rdev.fd  0)
die(Failed initializing RNG);

-	if (irq__register_device(PCI_DEVICE_ID_VIRTIO_RNG, &dev, &pin, &line) < 0)
+	if (irq__register_device(PCI_DEVICE_VIRTIO_RNG, &dev, &pin, &line) < 0)
return;

virtio_rng_pci_device.irq_pin   = pin;
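
A hedged sketch of the "smarter pci device manager" the changelog mentions --
allocate device numbers at registration time instead of hard-coding them
(purely illustrative):

	static u8 next_pci_dev_num = 1;	/* 0 is left for the host bridge */

	static u8 pci__alloc_dev_num(void)
	{
		/* Bus 0, Device N, Function 0: unique per registered device. */
		return next_pci_dev_num++;
	}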


Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-08 Thread Nikola Ciprich
OK,
I see.. the problem is that I'm trying to hunt down a bug causing hangs
when 2.6.32 guests try to run tcpdump - this seems to be reproducible even on
the latest 2.6.32.x, and seems like it depends on kvm-clock..
So I was thinking about bisecting between 2.6.32 and latest git, which doesn't
seem to suffer from this problem, but hitting another (different) problem in 2.6.32
complicates things a bit :(
If somebody has some hint on how to proceed, I'd be more than grateful..
cheers
n.

On Sun, May 08, 2011 at 10:53:56PM +0400, Michael Tokarev wrote:
 08.05.2011 22:33, Nikola Ciprich wrote:
  Hello everyboy,
  while installing new virt machine today, I noticed that 2.6.32 x86_64 SMP 
  guests are hanging if they have paravirt-clock enabled...
 
 There were about 10 bugfixes pushed to 2.6.32.y stable series,
 some of them were for kvm-clock, and some were for problems
 which manifested itself like you described.  You may actually
 take a look which guests you're booting.  FWIW.
 
 /mjt
 

-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-




Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-08 Thread Nikola Ciprich
(CC Zachary)
well, I should also note that while testing the 2.6.37 host, I had Zach's
patch fixing the guest clock regression applied...
n.

On Sun, May 08, 2011 at 08:33:04PM +0200, Nikola Ciprich wrote:
 Hello everyboy,
 while installing new virt machine today, I noticed that 2.6.32 x86_64 SMP 
 guests are hanging if they have paravirt-clock enabled...
 Either they don't finish booting at all, or boot but hang soon after..
 Such a hanged guest fully loads all host cpus..
 The host is 6core x86_64 runnig 2.6.37.6 with 24GB RAM.
 
 kvm_stat:
 
  kvm_exit(EXTERNAL_INTERRUPT)  15682    1643
  kvm_exit                      15514    1643
  kvm_entry                     15416    1643
  kvm_set_irq                       1       0
  kvm_msi_set_irq                   1       0
  kvm_apic_accept_irq               1       0
  kvm_exit(VMCLEAR)                 6       0
  kvm_exit(VMON)                    6       0
  kvm_exit(PAUSE_INSTRUCTION)       5       0
  kvm_exit(MCE_DURING_VMENTRY)      5       0
  kvm_exit(MWAIT_INSTRUCTION)       5       0
  kvm_exit(DR_ACCESS)               5       0
  kvm_exit(EPT_VIOLATION)           5       0
  kvm_exit(NMI_WINDOW)              5       0
  kvm_exit(VMPTRLD)                 5       0
  kvm_exit(TASK_SWITCH)             5       0
  kvm_exit(VMREAD)                  5       0
  kvm_exit(VMLAUNCH)                5       0
  kvm_exit(RDPMC)                   5       0
 
 perf top:
    16.00 10.9% add_preempt_count    [kernel.kallsyms]
    16.00 10.9% do_raw_spin_lock     [kernel.kallsyms]
    15.00 10.2% sub_preempt_count    [kernel.kallsyms]
     8.00  5.4% irq_exit             [kernel.kallsyms]
     7.00  4.8% vmx_vcpu_run         /lib/modules/2.6.37lb.09/kernel/arch/x86/kvm/kvm-intel.ko
     7.00  4.8% page_fault           [kernel.kallsyms]
     5.00  3.4% mempool_free         [kernel.kallsyms]
 
 info registers:
 RAX=00f42400 RBX=81533f00 RCX=0016 
 RDX=00077358f500
 RSI=1dcd6500 RDI=0001 RBP=880009a03ee8 
 RSP=880009a03ee8
 R8 =0016 R9 =000a R10= 
 R11=
 R12=2a4d17d38f3303c1 R13=815fd000 R14=81592140 
 R15=00093510
 RIP=810767cb RFL=0006 [-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0018   00c09300 DPL=0 DS   [-WA]
 CS =0010   00a09b00 DPL=0 CS64 [-RA]
 SS =0018   00c09300 DPL=0 DS   [-WA]
 DS =0018   00c09300 DPL=0 DS   [-WA]
 FS =  000f 
 GS = 880009a0 000f 
 LDT=  000f 
 TR =0040 880009a11880 2087 8b00 DPL=0 TSS64-busy
 GDT= 880009a04000 007f
 IDT= 815fd000 0fff
 CR0=8005003b CR2=7f424e540700 CR3=00021690c000 CR4=06f0
 DR0= DR1= DR2= 
 DR3= 
 DR6=0ff0 DR7=0400
 EFER=0d01
 FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
 FPR0=  FPR1= 
 FPR2=  FPR3= 
 FPR4=  FPR5= 
 FPR6=  FPR7= 
 XMM00= XMM01=3c23d70a
 XMM02= XMM03=
 XMM04= XMM05=
 XMM06= XMM07=
 XMM08= XMM09=
 XMM10= XMM11=
 XMM12= XMM13=
 XMM14= XMM15=
 
 info cpus:
 * CPU #0: pc=0x8105d4a0 thread_id=19639 
   CPU #1: pc=0x81013140 thread_id=19640 
   CPU #2: pc=0x8102a1b6 (halted) thread_id=19641 
   CPU #3: pc=0x81341521 thread_id=19642 
   CPU #4: pc=0x810415d8 thread_id=19643 
   CPU #5: pc=0x811ca521 thread_id=19644 
   CPU #6: pc=0x81013140 thread_id=19646 
   CPU #7: pc=0x8102a1b6 (halted) 

Re: 2.6.32 guest with paravirt clock enabled hangs on 2.6.37.6 host (w qemu-kvm-0.13.0)

2011-05-08 Thread David Ahern


On 05/08/11 13:06, Nikola Ciprich wrote:
 OK,
 I see.. the problem is that I'm trying to hunt down a bug causing hangs
 when 2.6.32 guests try to run tcpdump - this seems to be reproducible even on
 the latest 2.6.32.x, and seems like it depends on kvm-clock..
 So I was thinking about bisecting between 2.6.32 and latest git, which doesn't
 seem to suffer from this problem, but hitting another (different) problem in 2.6.32
 complicates things a bit :(
 If somebody has some hint on how to proceed, I'd be more than grateful..
 cheers

Have you tried enabling gdbserver in the qemu monitor and then attaching
gdb to the guest once it hangs?

David
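
For reference, that workflow looks roughly like this (the port is qemu's
default; vmlinux must match the guest kernel exactly):

	(qemu) gdbserver tcp::1234
	$ gdb /path/to/guest/vmlinux
	(gdb) target remote :1234
	(gdb) info threads
	(gdb) bt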


RHEL6 host / CentOS5.6 guest periodically very sluggish

2011-05-08 Thread T Johnson
Hello,

I have a perplexing problem I'm hoping I might be able to get some
help with. This is a RHEL6 kvm host, with 4 idle CentOS5.6 guests. 3
guests have 1 vcpu, 1 guest has 4 vcpus. The host is an 8 core/16 thread
Nehalem-class machine and is idle other than the KVM guests.

Probably every 5-10 minutes both the host and guests will become very
sluggish to respond to input and likely any running workload. This
lasts for maybe 4-5 minutes, then everything returns to normal until it
happens again 5-10 minutes later, repeating indefinitely. CPU usage on the
host is mostly idle during these sluggish periods. I've noticed a big
drop in interrupts on the host during these periods and "missed X
ticks" messages in dstat:

normal (responsive) dstat output on host:

usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   2  97   0   0   0|   0 0 |  53k  809k|   0 0 |  37k   37k
  0   1  99   0   0   0|   0    47k|  46B  346B|   0 0 |  37k   35k
  0   0 100   0   0   0|   0 0 | 394B 1038B|   0 0 |  30k   34k
  0   2  98   0   0   0|   0 0 |  46B  346B|   0 0 |  36k   34k
  1   3  96   0   0   0|   0 0 | 394B 1038B|   0 0 |  39k   34k
  1   2  98   0   0   0|   0 0 |  46B  346B|   0 0 |  37k   35k
  1   3  96   0   0   0|   0  1024B| 514B 1038B|   0 0 |  36k   39k
  1   1  98   0   0   0|   0 0 |  46B  346B|   0 0 |  35k   42k
  0   2  98   0   0   0|   0 0 | 394B 1038B|   0 0 |  38k   42k
  0   2  98   0   0   0|   0 0 |  46B  346B|   0 0 |  37k   42k
  1   2  97   0   0   0|   0 0 | 394B 1038B|   0 0 |  35k   41k
  1   1  98   0   0   0|   0 0 |  46B  346B|   0 0 |  31k   39k


example sluggish dstat output on host:
---
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   0 100   0   0   0|   0 0 |5902B   71k|   0 0 | 681  2657
  0   1  99   0   0   0|   0  1024B| 652B  692B|   0 0 |5387    41k  missed 2 ticks
  0   1  99   0   0   0|   0 0 | 546B  788B|   0 0 |5741    43k  missed 2 ticks
  0   1  99   0   0   0|   0 0 | 546B  756B|   0 0 |5770    43k  missed 2 ticks
  1   1  98   0   0   0|   0  1024B| 184B  378B|   0 0 |8890    66k  missed 2 ticks
  0   0  99   0   0   0|   0 0 |1062B 1166B|   0 0 |4631    34k  missed 2 ticks
  0   2  98   0   0   0|   0 0 | 100B  378B|   0 0 |2680    24k



On the guests (which are idle) there is also some interesting dstat
output. During the sluggish periods, user, system, and interrupt cpu
usage increases greatly and the number of interrupts doubles or triples. I
can also usually count on dstat crashing on every guest as soon as I
notice the problem starting on the host:

Traceback (most recent call last):
  File /usr/bin/dstat, line 1974, in ?
main()
  File /usr/bin/dstat, line 1919, in main
o.extract()
  File /usr/bin/dstat, line 509, in extract
self.val[name][i] = 100.0 * (self.cn2[name][i] -
self.cn1[name][i]) / (sum(self.cn2[name]) - sum(self.cn1[name]))
ZeroDivisionError: float division


example normal (responsive) dstat output on guest:
--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |1004    11
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |1005    12
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |1002    11
  1   0  99   0   0   0|   0 0 |  60B  314B|   0 0 |1003     9
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |1004    11
  0   0 100   0   0   0|   0 0 | 106B  368B|   0 0 |1004    11
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |1004    15
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |1003     9
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |1004    11
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |1003     9


example sluggish dstat output on guest:

usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 20  20  60   0   0   0|   0 0 |  60B  404B|   0 0 |1840     8
  0   0 100   0   0   0|   0 0 |  60B  314B|   0 0 |2341     8
 17   0  50   0   0  33|   0 0 |  60B  404B|   0 0 |3374     8
  0   0  33   0   0  67|   0 0 |  60B  314B|   0 0 |2002    11
  0   0   0   0   0 100|   0 0 |  60B  314B|   0 0 |1943     5
  0  50  50   0   0   0|   0    32k|  60B  420B|   0 0 |  922   18
 33   0  67   0   0   0|   0 0 |  60B  404B|   0 0 |1563     9


I'd guess some sort of timer/clock issue, but I'm unsure of where to
go from here. Any help would be appreciated.

Thanks,
TJ


RE: [PATCH] qemu-kvm: Add CPUID support for VIA CPU

2011-05-08 Thread BrillyWu
  I have submit a patch into upstream's KVM for supporting
 these features before, and the patch has been applied in kvm-git.
 
 As far as I can see, nothing has been applied to any tree, neither 
 qemu.git nor qemu-kvm.git.

Sorry, I submitted it to kvm.git, not qemu.git or qemu-kvm.git.

  Do you mean that I should not submit this patch until the
 KVM's patch is merged back?
 
 You still need to submit a fixed version of this patch against 
 qemu-kvm.git, uq/master branch (which is qemu.git effectively). Then 
 Marcelo or Avi can pick it up and push it to upstream. Once it's 
 merged there, qemu-kvm.git will update from upstream, and you will 
 have your patch applied to both trees.

 
Thanks very much for the helpful guidance. I know what to do now.



Re: [Qemu-devel] [PATCH v2] Import Linux headers for KVM and vhost

2011-05-08 Thread Rusty Russell
On Wed, 04 May 2011 10:55:26 +0200, Jan Kiszka jan.kis...@siemens.com wrote:
 But one point of all this is to make life easier for those poor people
 who have to check a qemu source package for included licenses before
 redistribution. The clearer the terms are expressed, the easier these
 unproductive processes become. You can't imagine what activities that
 vague BSD license of the virtio headers triggered here...

Yes, so much so that I have been known to recommend that reimplementors
go for the headers out of the virtio spec file, which are definitively
uncontaminated.

However, I have prepared a patch, will send now...

Thanks for the prod,
Rusty.


Re: [PATCH 06/18] virtio_ring: avail event index interface

2011-05-08 Thread Rusty Russell
On Wed, 4 May 2011 23:51:19 +0300, Michael S. Tsirkin m...@redhat.com wrote:
 Define a new feature bit for the host to
 declare that it uses an avail_event index
 (like Xen) instead of a feature bit
 to enable/disable interrupts.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  include/linux/virtio_ring.h |   11 ---
  1 files changed, 8 insertions(+), 3 deletions(-)
 
 diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
 index f5c1b75..f791772 100644
 --- a/include/linux/virtio_ring.h
 +++ b/include/linux/virtio_ring.h
 @@ -32,6 +32,9 @@
  /* The Guest publishes the used index for which it expects an interrupt
  * at the end of the avail ring. Host should ignore the avail->flags field. */
  #define VIRTIO_RING_F_USED_EVENT_IDX 29
 +/* The Host publishes the avail index for which it expects a kick
 + * at the end of the used ring. Guest should ignore the used->flags field. */
 +#define VIRTIO_RING_F_AVAIL_EVENT_IDX 32

Are you really sure we want to separate the two?  It seems a little
simpler to have one bit that means we're publishing our threshold,
and for someone implementing this from scratch it's simpler too.

Or are there cases where the old style makes more sense?
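
For reference, whichever way the bits are split, both sides end up
doing the same wrap-safe threshold test against the index the other
side published. A sketch, not part of the patch under review (the
vring_need_event() name is the one this helper eventually took
upstream):

#include <stdint.h>

/* Returns nonzero iff event_idx lies in the window (old_idx, new_idx]
 * modulo 2^16, i.e. the other side's published threshold was crossed
 * since the last notification, so a kick (or interrupt) is due.  The
 * 16-bit arithmetic makes index wrap-around harmless as long as fewer
 * than 2^16 entries are added between notifications. */
static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx,
				   uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}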

Thanks,
Rusty.


Re: [PATCH 09/18] virtio: use avail_event index

2011-05-08 Thread Rusty Russell
On Wed, 4 May 2011 23:51:47 +0300, Michael S. Tsirkin m...@redhat.com wrote:
 Use the new avail_event feature to reduce the number
 of exits from the guest.

Figures here would be nice :)

 @@ -228,6 +237,12 @@ add_head:
    * new available array entries. */
   virtio_wmb();
   vq->vring.avail->idx++;
 + /* If the driver never bothers to kick in a very long while,
 +  * avail index might wrap around. If that happens, invalidate
 +  * the kicked_avail index we stored. TODO: make sure all drivers
 +  * kick at least once in 2^16 and remove this. */
 + if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
 +	vq->kicked_avail_valid = false;

If they don't, they're already buggy.  Simply do:
WARN_ON(vq->vring.avail->idx == vq->kicked_avail);

 +static bool vring_notify(struct vring_virtqueue *vq)
 +{
 +	u16 old, new;
 +	bool v;
 +	if (!vq->event)
 +		return !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY);
 +
 +	v = vq->kicked_avail_valid;
 +	old = vq->kicked_avail;
 +	new = vq->kicked_avail = vq->vring.avail->idx;
 +	vq->kicked_avail_valid = true;
 +	if (unlikely(!v))
 +		return true;

This is the only place you actually used kicked_avail_valid.  Is it
possible to initialize it in such a way that you can remove this?
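
One way, sketched here as a suggestion rather than code from the
series: seed the shadow index from the ring when the virtqueue is
created, so the very first comparison in vring_notify() is already
meaningful and the validity flag can be dropped entirely. Field names
follow the patch above.

static void vring_init_kicked_avail(struct vring_virtqueue *vq)
{
	/* Both indices are 0 at virtqueue setup, so the first
	 * old/new comparison in vring_notify() just works. */
	vq->kicked_avail = vq->vring.avail->idx;
}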

 @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device *vdev)
   break;
   case VIRTIO_RING_F_USED_EVENT_IDX:
   break;
 + case VIRTIO_RING_F_AVAIL_EVENT_IDX:
 + break;
   default:
   /* We don't understand this bit. */
   clear_bit(i, vdev->features);

Does this belong in a prior patch?

Thanks,
Rusty.