RE: [Intel-gfx] [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2014-12-10 Thread Tian, Kevin
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Thursday, December 11, 2014 12:59 AM
> 
> On 09/12/2014 03:49, Tian, Kevin wrote:
> > - XenGT and KVMGT are currently maintained separately, and KVMGT lags
> > behind XenGT in features and quality. You'll likely continue to see stale
> > code (like the Xen instruction decoder) for some time. In the future we
> > plan to maintain a single kernel repo for both, so KVMGT can reach the
> > same quality as XenGT once the KVM in-kernel dm framework is stable.
> >
> > - Regarding the Qemu hacks, KVMGT really doesn't have any requirements
> > beyond what has already been discussed for GPU pass-through, e.g. around
> > the ISA bridge. Our implementation is based on an old Qemu repo and,
> > honestly speaking, not cleanly developed, because we know we can leverage
> > the GPU pass-through support once it's in Qemu. At that point we'll reuse
> > the same logic with minimal changes to hook up the KVMGT mgmt. APIs
> > (e.g. create/destroy a vGPU instance). So we can ignore this area for now. :-)
> 
> Could the virtual device model introduce new registers in order to avoid
> poking at the ISA bridge?  I'm not sure that you "can leverage from GPU
> pass-through support once it's in Qemu", since the Xen IGD passthrough
> support is being added to a separate machine that is specific to Xen IGD
> passthrough; no ISA bridge hacking will probably be allowed on the "-M
> pc" and "-M q35" machine types.
> 

My point is that KVMGT doesn't introduce any requirements beyond what the
IGD passthrough case already has, because all the hacks you see now exist
to satisfy the guest graphics driver's expectations. I haven't followed the
KVM IGD passthrough progress, but if it doesn't require ISA bridge hacking,
the same trick can be adopted by KVMGT too. As you may know, Allen is
working on driver changes to avoid the need for those hacks on the Qemu
side. That effort will benefit us too. So I don't think this is a
KVMGT-specific issue; we need a common solution to close this gap instead
of hacking the vGPU device model alone.

Thanks
Kevin


RE: [Intel-gfx] [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2014-12-09 Thread Tian, Kevin
> From: Song, Jike
> Sent: Wednesday, December 10, 2014 2:34 PM
> 
> CC Kevin.
> 
> 
> On 12/09/2014 05:54 PM, Jan Kiszka wrote:
> > On 2014-12-04 03:24, Jike Song wrote:
> >> Hi all,
> >>
> >>   We are pleased to announce the first release of KVMGT project. KVMGT
> is
> >> the implementation of Intel GVT-g technology, a full GPU virtualization
> >> solution. Under Intel GVT-g, a virtual GPU instance is maintained for
> >> each VM, with part of performance critical resources directly assigned.
> >> The capability of running native graphics driver inside a VM, without
> >> hypervisor intervention in performance critical paths, achieves a good
> >> balance of performance, feature, and sharing capability.
> >>
> >>
> >>   KVMGT is still in the early stage:
> >>
> >>- Basic functions of full GPU virtualization works, guest can see a
> >> full-featured vGPU.
> >>  We ran several 3D workloads such as lightsmark, nexuiz, urbanterror
> >> and warsow.
> >>
> >>- Only Linux guest supported so far, and PPGTT must be disabled in
> >> guest through a
> >>  kernel parameter(see README.kvmgt in QEMU).
> >>
> >>- This drop also includes some Xen specific changes, which will be
> >> cleaned up later.
> >>
> >>- Our end goal is to upstream both XenGT and KVMGT, which shares
> ~90%
> >> logic for vGPU
> >>  device model (will be part of i915 driver), with only difference in
> >> hypervisor
> >>  specific services
> >>
> >>- insufficient test coverage, so please bear with stability issues :)
> >>
> >>
> >>
> >>   There are things that need to be improved, esp. the KVM interfacing part:
> >>
> >>  1. a domid was added to each KVMGT guest
> >>
> >>  An ID is needed for foreground OS switching, e.g.
> >>
> >>  # echo <domid> > /sys/kernel/vgt/control/foreground_vm
> >>
> >>  domid 0 is reserved for host OS.
> >>
> >>
> >>   2. SRCU workarounds.
> >>
> >>  Some KVM functions, such as:
> >>
> >>  kvm_io_bus_register_dev
> >>  install_new_memslots
> >>
> >>  must be called *without* &kvm->srcu read-locked. Otherwise it
> >> hangs.
> >>
> >>  In KVMGT, we need to register an iodev only *after* the BAR
> >>  registers are written by the guest. That means we already hold
> >>  &kvm->srcu - trapping/emulating PIO (BAR registers) puts us in
> >>  such a condition. That will make kvm_io_bus_register_dev hang.
> >>
> >>  Currently we have to disable rcu_assign_pointer() in such functions.
> >>
> >>  These are dirty workarounds; your suggestions are highly welcome!
> >>
> >>
> >>  3. syscalls were called to access "/dev/mem" from the kernel
> >>
> >>  An in-kernel memslot was added for the aperture, but syscalls like
> >>  open and mmap are used to open and access the character device
> >>  "/dev/mem" for pass-through.
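
A minimal sketch of the hang described in item 2 above, assuming the usual
kvm->srcu usage of that time (illustrative only, not code from the release):

	idx = srcu_read_lock(&kvm->srcu);	/* taken by the MMIO/PIO exit-handling path */
	...
	/* kvm_io_bus_register_dev() ends with a synchronize_srcu_expedited()
	 * on &kvm->srcu, which waits for all SRCU readers to finish.  Called
	 * from inside this read-side critical section (as happens when a BAR
	 * write is being trapped), it waits on ourselves and never returns. */
	r = kvm_io_bus_register_dev(kvm, KVM_MMIO_BUS, addr, len, dev);
	...
	srcu_read_unlock(&kvm->srcu, idx);	/* never reached */
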
> >>
> >>
> >>
> >>
> >> The source codes(kernel, qemu as well as seabios) are available at github:
> >>
> >>  git://github.com/01org/KVMGT-kernel
> >>  git://github.com/01org/KVMGT-qemu
> >>  git://github.com/01org/KVMGT-seabios
> >>
> >> In the KVMGT-qemu repository, there is a "README.kvmgt" to be referred.
> >>
> >>
> >>
> >> More information about Intel GVT-g and KVMGT can be found at:
> >>
> >>
> https://www.usenix.org/conference/atc14/technical-sessions/presentation/tia
> n
> >>
> >>
> http://events.linuxfoundation.org/sites/events/files/slides/KVMGT-a%20Full%2
> 0GPU%20Virtualization%20Solution_1.pdf
> >>
> >>
> >>
> >> Appreciate your comments, BUG reports, and contributions!
> >>
> >
> > There is an ever-increasing interest in keeping KVM's in-kernel guest
> > interface as small as possible, specifically for security reasons. I'm
> > sure there are some good performance reasons to create a new in-kernel
> > device model, but I suppose those will need good evidence for why things
> > are done the way they finally should be - and not via a user-space
> > device model. This is likely not a binary decision (all userspace vs. no
> > userspace); it is more about the size and robustness of the in-kernel
> > model vs. its performance.

Thanks for explaining the background. We're not against a userspace model in
principle, but based on our analysis the in-kernel model is the best fit,
not just for performance but also because of the tight coupling to i915
functionality (scheduling, interrupts, security, etc.) and hypervisor
functionality (GPU shadow page tables, etc.), which are best handled
directly in the kernel. We definitely don't want to split it for performance
reasons alone, without a functionally clear separation, because that just
creates unnecessary, messy user/kernel interfaces. And we now have a signal
from the i915 community that they're willing to pick the core code into the
i915 driver, which we're currently working on.

So, without ruling out the possibility of a user/kernel split, how about we
first loo

RE: [Intel-gfx] [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2014-12-08 Thread Tian, Kevin
> From: Daniel Vetter
> Sent: Monday, December 08, 2014 6:21 PM
> 
> On Mon, Dec 08, 2014 at 10:55:01AM +0100, Gerd Hoffmann wrote:
> > On Sa, 2014-12-06 at 12:17 +0800, Jike Song wrote:
> > > I don't know that is exactly needed, we also need to have Windows
> > > driver considered.  However, I'm quite confident that, if things gonna
> > > work for IGD passthrough, it gonna work for GVT-g.
> >
> > I'd suggest to focus on q35 emulation.  q35 is new enough that a version
> > with integrated graphics exists, so the gap we have to close is *much*
> > smaller.
> >
> > In case guests expect a northbridge matching the chipset generation of
> > the graphics device (which I'd expect is the case, after digging a bit
> > in the igd and agpgart linux driver code) I think we should add proper
> > device emulation for them, i.e. comply q35-pcihost with
> > sandybridge-pcihost + ivybridge-pcihost + haswell-pcihost instead of
> > just copying over the pci ids from the host.  Most likely all those
> > variants can share most of the emulation code.
> 
> I don't think i915.ko should care about either northbridge nor pch on
> para-virtualized platforms. We do noodle around in there for the oddball
> memory controller setting and for some display stuff. But neither of that
> really applies to paravirtualized hw. And if there's any case like that we
> should  patch it out (like we do with some of the runtime pm code
> already).

Agreed. Allen is now working on how to avoid that tricky platform
stickiness in the Windows gfx driver. We should do the same in the Linux
driver too.

Thanks
Kevin


RE: [Intel-gfx] [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2014-12-08 Thread Tian, Kevin
Here is some background on this KVMGT release:

- The major purpose is early experimentation with this technique in KVM, and
to surface the issues around adding an in-kernel device model (or mediated
pass-through framework) to KVM.

- KVMGT shares ~90% of its code with XenGT as far as the vGPU device model
is concerned. The only difference is the in-kernel dm interface. The vGPU
device model will be split out and integrated into the i915 driver. It will
register with the in-kernel dm framework provided by either Xen or KVM at
boot time. Upstreaming of the vGPU device model is already in progress, with
valuable comments received from the i915 community. However, the refactoring
mostly happens in the XenGT repo for now.

- XenGT and KVMGT are currently maintained separately, and KVMGT lags
behind XenGT in features and quality. You'll likely continue to see stale
code (like the Xen instruction decoder) for some time. In the future we
plan to maintain a single kernel repo for both, so KVMGT can reach the
same quality as XenGT once the KVM in-kernel dm framework is stable.

- Regarding the Qemu hacks, KVMGT really doesn't have any requirements
beyond what has already been discussed for GPU pass-through, e.g. around
the ISA bridge. Our implementation is based on an old Qemu repo and,
honestly speaking, not cleanly developed, because we know we can leverage
the GPU pass-through support once it's in Qemu. At that point we'll reuse
the same logic with minimal changes to hook up the KVMGT mgmt. APIs
(e.g. create/destroy a vGPU instance). So we can ignore this area for now. :-)

Thanks
Kevin

> From: Paolo Bonzini
> Sent: Friday, December 05, 2014 9:04 PM
> 
> 
> 
> On 05/12/2014 09:50, Gerd Hoffmann wrote:
> > A few comments on the kernel stuff (brief look so far, also
> > compile-tested only, intel gfx on my test machine is too old).
> >
> >  * Noticed the kernel bits don't even compile when configured as
> >module.  Everything (vgt, i915, kvm) must be compiled into the
> >kernel.
> 
> I'll add that the patch is basically impossible to review with all the
> XenGT bits still in.  For example, the x86 emulator seems to be
> unnecessary for KVMGT, but I am not 100% sure.
> 
> I would like a clear understanding of why/how Andrew Barnes was able to
> do i915 passthrough (GVT-d) without hacking the ISA bridge, and why this
> does not apply to GVT-g.
> 
> Paolo
> 
> >  * Design approach still seems to be i915 on vgt not the other way
> >around.
> >
> > Qemu/SeaBIOS bits:
> >
> > I've seen the host bridge change identity from i440fx to
> > copy-pci-ids-from-host.  Guess the reason for this is that seabios uses
> > this device to figure out whether it is running on i440fx or q35.  Correct?
> >
> > What are the exact requirements for the device?  Must it match the host
> > exactly, to not confuse the guest intel graphics driver?  Or would
> > something more recent -- such as the q35 emulation qemu has -- be good
> > enough to make things work (assuming we add support for the
> > graphic-related pci config space registers there)?
> >
> > The patch also adds a dummy isa bridge at 0x1f.  Similar question here:
> > What exactly is needed here?  Would things work if we simply use the q35
> > lpc device here?
> >
> > more to come after I've read the paper linked above ...
> >
> > cheers,
> >   Gerd
> >
> >


RE: [PATCH] KVM: Split up MSI-X assigned device IRQ handler

2011-09-13 Thread Tian, Kevin
> From: Jan Kiszka [mailto:jan.kis...@siemens.com]
> Sent: Tuesday, September 13, 2011 3:30 PM
> 
> On 2011-09-13 08:40, Tian, Kevin wrote:
> >> From: Jan Kiszka
> >> Sent: Tuesday, September 13, 2011 12:58 AM
> >>
> >> The threaded IRQ handler for MSI-X has almost nothing in common with the
> >> INTx/MSI handler. Move its code into a dedicated handler.
> >
> > if it's desired to go further down this cleanup path, there's also no need to
> > register an ack notifier for MSI-X type devices since all the logic there is
> > simply a nop:
> 
> You mean
> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/79075? :)
> 

yes, and sorry that I didn't catch that thread. :-)

Thanks
Kevin


RE: [PATCH] KVM: Split up MSI-X assigned device IRQ handler

2011-09-12 Thread Tian, Kevin
> From: Jan Kiszka
> Sent: Tuesday, September 13, 2011 12:58 AM
> 
> The threaded IRQ handler for MSI-X has almost nothing in common with the
> INTx/MSI handler. Move its code into a dedicated handler.

If it's desired to go further down this cleanup path, there's also no need to
register an ack notifier for MSI-X type devices, since all the logic there is
simply a nop:

static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
{
	struct kvm_assigned_dev_kernel *dev;

	/* MSI-X assigned devices register with gsi == -1, so for them the
	 * rest of this function is never reached */
	if (kian->gsi == -1)
		return;

	dev = container_of(kian, struct kvm_assigned_dev_kernel,
			   ack_notifier);

	kvm_set_irq(dev->kvm, dev->irq_source_id, dev->guest_irq, 0);

	/* The guest irq may be shared so this ack may be
	 * from another device.
	 */
	spin_lock(&dev->intx_lock);
	if (dev->host_irq_disabled) {
		enable_irq(dev->host_irq);
		dev->host_irq_disabled = false;
	}
	spin_unlock(&dev->intx_lock);
}

Thanks
Kevin

> 
> Signed-off-by: Jan Kiszka 
> ---
>  virt/kvm/assigned-dev.c |   32 +++-
>  1 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
> index 84ead54..7debe8c 100644
> --- a/virt/kvm/assigned-dev.c
> +++ b/virt/kvm/assigned-dev.c
> @@ -58,8 +58,6 @@ static int find_index_from_host_irq(struct
> kvm_assigned_dev_kernel
>  static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
>  {
>   struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
> - u32 vector;
> - int index;
> 
>   if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
>   spin_lock(&assigned_dev->intx_lock);
> @@ -68,20 +66,28 @@ static irqreturn_t kvm_assigned_dev_thread(int irq,
> void *dev_id)
>   spin_unlock(&assigned_dev->intx_lock);
>   }
> 
> - if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
> - index = find_index_from_host_irq(assigned_dev, irq);
> - if (index >= 0) {
> - vector = assigned_dev->
> - guest_msix_entries[index].vector;
> - kvm_set_irq(assigned_dev->kvm,
> - assigned_dev->irq_source_id, vector, 1);
> - }
> - } else
> + kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> + assigned_dev->guest_irq, 1);
> +
> + return IRQ_HANDLED;
> +}
> +
> +#ifdef __KVM_HAVE_MSIX
> +static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
> +{
> + struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
> + int index = find_index_from_host_irq(assigned_dev, irq);
> + u32 vector;
> +
> + if (index >= 0) {
> + vector = assigned_dev->guest_msix_entries[index].vector;
>   kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
> - assigned_dev->guest_irq, 1);
> + vector, 1);
> + }
> 
>   return IRQ_HANDLED;
>  }
> +#endif
> 
>  /* Ack the irq line for an assigned device */
>  static void kvm_assigned_dev_ack_irq(struct kvm_irq_ack_notifier *kian)
> @@ -279,7 +285,7 @@ static int assigned_device_enable_host_msix(struct
> kvm *kvm,
> 
>   for (i = 0; i < dev->entries_nr; i++) {
>   r = request_threaded_irq(dev->host_msix_entries[i].vector,
> -  NULL, kvm_assigned_dev_thread,
> +  NULL, kvm_assigned_dev_thread_msix,
>0, dev->irq_name, dev);
>   if (r)
>   goto err;
> --
> 1.7.3.4


RE: [PATCH v2] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-30 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Tuesday, August 30, 2011 6:57 PM
> 
> On 08/30/2011 04:15 AM, Tian, Kevin wrote:
> > v2 changes:
> > define exit qualification fields for APIC-Access in vmx.h
> > use apic_reg_write instead of apic_set_eoi, to avoid breaking tracing
> > add fasteoi option to allow disabling this acceleration
> >
> > 
> >
> > commit 2a66a12cb6928c962d24907e6d39b6eb9ac94b4b
> > Author: Kevin Tian
> > Date:   Mon Aug 29 13:08:28 2011 +0800
> >
> >  KVM: APIC: avoid instruction emulation for EOI writes
> >
> >  Instruction emulation for EOI writes can be skipped, since sane
> >  guest simply uses MOV instead of string operations. This is a nice
> >  improvement when guest doesn't support x2apic or hyper-V EOI
> >  support.
> >
> >  a single VM bandwidth is observed with ~8% bandwidth improvement
> >  (7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.
> >
> >
> 
> Thanks, applied.  Please use git format-patch (and git send-email)
> instead of git show in the future.
> 

OK, and sorry for the formatting issue.

Thanks
Kevin


[PATCH v2] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Tian, Kevin
v2 changes:
define exit qualification fields for APIC-Access in vmx.h
use apic_reg_write instead of apic_set_eoi, to avoid breaking tracing
add fasteoi option to allow disabling this acceleration



commit 2a66a12cb6928c962d24907e6d39b6eb9ac94b4b
Author: Kevin Tian 
Date:   Mon Aug 29 13:08:28 2011 +0800

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian 
:
Signed-off-by: Eddie Dong 
Signed-off-by: Marcelo Tosatti 

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 2caf290..31f180c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -350,6 +350,18 @@ enum vmcs_field {
 #define DEBUG_REG_ACCESS_REG(eq)	(((eq) >> 8) & 0xf)	/* 11:8, general purpose reg. */
 
 
+/*
+ * Exit Qualifications for APIC-Access
+ */
+#define APIC_ACCESS_OFFSET		0xfff	/* 11:0, offset within the APIC page */
+#define APIC_ACCESS_TYPE		0xf000	/* 15:12, access type */
+#define TYPE_LINEAR_APIC_INST_READ	(0 << 12)
+#define TYPE_LINEAR_APIC_INST_WRITE	(1 << 12)
+#define TYPE_LINEAR_APIC_INST_FETCH	(2 << 12)
+#define TYPE_LINEAR_APIC_EVENT		(3 << 12)
+#define TYPE_PHYSICAL_APIC_EVENT	(10 << 12)
+#define TYPE_PHYSICAL_APIC_INST		(15 << 12)
+
 /* segment AR */
 #define SEGMENT_AR_L_MASK (1 << 13)
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..52645f2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (apic)
+   apic_reg_write(vcpu->arch.apic, APIC_EOI, 0);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..9d0364b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -71,6 +71,9 @@ module_param(vmm_exclusive, bool, S_IRUGO);
 static int __read_mostly yield_on_hlt = 1;
 module_param(yield_on_hlt, bool, S_IRUGO);
 
+static int __read_mostly fasteoi = 1;
+module_param(fasteoi, bool, S_IRUGO);
+
 /*
  * If nested=1, nested virtualization is supported, i.e., guests may use
  * VMX and be a hypervisor for its own guests. If nested=0, guests may not
@@ -4540,6 +4543,24 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   if (likely(fasteoi)) {
+		unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+   access_type = exit_qualification & APIC_ACCESS_TYPE;
+   offset = exit_qualification & APIC_ACCESS_OFFSET;
+   /*
+* Sane guest uses MOV to write EOI, with written value 
+* not cared. So make a short-circuit here by avoiding 
+* heavy instruction emulation.
+*/
+   if ((access_type == TYPE_LINEAR_APIC_INST_WRITE) &&
+   (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }


Thanks
Kevin


20110829_kvm_eoi_opt.patch
Description: 20110829_kvm_eoi_opt.patch


RE: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-29 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Monday, August 29, 2011 3:24 PM
> 
> On 08/29/2011 09:09 AM, Tian, Kevin wrote:
> >
> >   static int handle_apic_access(struct kvm_vcpu *vcpu)
> >   {
> > +   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> > +   int access_type, offset;
> > +
> > +   access_type = (exit_qualification>>  12)&  0xf;
> > +   offset = exit_qualification&  0xfff;
> > +   /*
> > +* Sane guest uses MOV instead of string operations to
> > +* write EOI, with written value not cared. So make a
> > +* short-circuit here by avoiding heavy instruction
> > +* emulation.
> > +*/
> > +   if ((access_type == 1)&&  (offset == APIC_EOI)) {
> 
> Please replace this 1 with a #define.
> 

updated.

--

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian 
:
Signed-off-by: Eddie Dong 
Signed-off-by: Marcelo Tosatti 

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (apic)
+   apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..34c094f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4538,8 +4538,25 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
return 1;
 }
 
+#define APIC_ACCESS_TYPE_W 1 /* Linear write access during inst execution */
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+   access_type = (exit_qualification >> 12) & 0xf;
+   offset = exit_qualification & 0xfff;
+   /*
+* Sane guest uses MOV instead of string operations to
+* write EOI, with written value not cared. So make a
+* short-circuit here by avoiding heavy instruction 
+* emulation.
+*/
+   if ((access_type == APIC_ACCESS_TYPE_W) && (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }

Thanks
Kevin


20110829_kvm_eoi_opt.patch
Description: 20110829_kvm_eoi_opt.patch


RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-08-28 Thread Tian, Kevin
> From: Marcelo Tosatti
> Sent: Tuesday, August 23, 2011 6:47 PM
> 
> > >+  if (!apic->lapic_timer.tscdeadline)
> > >+  return;
> > >+
> > >+  tsc_target = kvm_x86_ops->
> > >+  guest_to_host_tsc(apic->lapic_timer.tscdeadline);
> > >+  rdtscll(tsc_now);
> > >+  tsc_delta = tsc_target - tsc_now;
> >
> > This only works if we have a constant tsc, that's not true for large

That type of machine exposes tricky issues for time virtualization in
general.

> > multiboard machines.  Need to do this with irqs disabled as well
> > (reading both 'now' and 'tsc_now' in the same critical section).
> 
> Should look like this:
> 
> local_irq_disable();
> u64 guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
> if (guest_tsc <= tscdeadline)
> hrtimer_start(now);
> else {
>   ns = convert_to_ns(guest_tsc - tscdeadline);
>   hrtimer_start(now + ns);
> }
> local_irq_enable();

The above is overkill; only the guest TSC delta calculation needs to be
protected.

On the other hand, I don't think masking irqs here adds an obvious benefit.
Virtualization doesn't ensure micro-level timing accuracy, since many events
can delay virtual timer interrupt injection even when the timer emulation
logic fires on time. That's the case even on bare metal, though with smaller
scope, and thus most guests are able to handle such situations. Masking irqs
in non-critical regions simply slows down the system's response to other
events.
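
To make the narrower critical section concrete, here is a sketch (names
follow the quoted pseudo-code; convert_to_ns() is the same hypothetical
helper, so this is illustrative rather than the final patch):

	u64 guest_tsc, ns = 0;
	ktime_t now;

	/* only sampling 'now' and the guest TSC together needs irqs off,
	 * so that both values describe the same instant */
	local_irq_disable();
	now = ktime_get();
	guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
	local_irq_enable();

	if (guest_tsc < tscdeadline)
		/* convert remaining guest-TSC ticks using the vcpu's
		 * frequency (vcpu_tsc_khz()), not the host tsc_khz */
		ns = convert_to_ns(tscdeadline - guest_tsc);

	/* arm the hrtimer with irqs enabled again; ns == 0 means the
	 * deadline already passed and the timer fires immediately */
	hrtimer_start(&apic->lapic_timer.timer,
		      ktime_add_ns(now, ns), HRTIMER_MODE_ABS);
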

> 
> Note the vcpus tsc can have different frequency than the hosts, so
> vcpu_tsc_khz() should be used to convert to nanoseconds, not tsc_khz.
> 

Yes, that's a good suggestion.

Btw, Jinsong is on vacation this week. He will update the patch according to
your comments when he's back.

Thanks
Kevin


[PATCH] KVM: APIC: avoid instruction emulation for EOI writes

2011-08-28 Thread Tian, Kevin
Hi, Avi,

Here comes the patch:

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since sane
guest simply uses MOV instead of string operations. This is a nice
improvement when guest doesn't support x2apic or hyper-V EOI
support.

a single VM bandwidth is observed with ~8% bandwidth improvement
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian 
:
Signed-off-by: Eddie Dong 
Signed-off-by: Marcelo Tosatti 

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (apic)
+   apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..35e4af7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4540,6 +4540,22 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+   access_type = (exit_qualification >> 12) & 0xf;
+   offset = exit_qualification & 0xfff;
+   /*
+* Sane guest uses MOV instead of string operations to
+* write EOI, with written value not cared. So make a
+* short-circuit here by avoiding heavy instruction 
+* emulation.
+*/
+   if ((access_type == 1) && (offset == APIC_EOI)) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }

Thanks
Kevin

> -Original Message-
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Thursday, August 25, 2011 12:39 PM
> To: Tian, Kevin
> Cc: kvm@vger.kernel.org; Nakajima, Jun
> Subject: Re: about vEOI optimization
> 
> On 08/25/2011 05:24 AM, Tian, Kevin wrote:
> > >
> > >  Another option is the hyper-V EOI support, which can also eliminate the
> > >  EOI exit when no additional interrupt is pending.  This can improve EOI
> > >  performance even more.
> > >
> >
> > yes, and this is an orthogonal option.
> >
> > So if you agree, I'll send out an updated patch atop their work.
> >
> >
> 
> Thanks.
> 
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.



20110829_kvm_eoi_opt.patch
Description: 20110829_kvm_eoi_opt.patch


RE: about vEOI optimization

2011-08-24 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Wednesday, August 24, 2011 6:00 PM
> 
> On 08/23/2011 11:09 AM, Tian, Kevin wrote:
> > Hi, Avi,
> >
> > Both Eddie and Marcello once suggested vEOI optimization by skipping
> > heavy-weight instruction decode, which reduces vEOI overhead greatly:
> >
> > http://www.mail-archive.com/kvm@vger.kernel.org/msg18619.html
> > http://www.spinics.net/lists/kvm/msg36691.html
> >
> > Though virtual x2apic serves similar purpose, it depends on guest OS
> > to support x2apic. Many Windows versions don't support x2apic though,
> > including Win7, Windows server before 2008 R2, etc. Given that 
> > virtualization
> > need support various OS versions, any chance to incorporate above vEOI
> > optimization in KVM as an alternative to boost performance when guest
> > doesn't support x2apic?
> >
> 
> Yes.  There was a problem with the guest using MOVSD or STOSD to write
> the EOI; if we don't emulate, then registers don't get updated.  I guess
> we can ignore it since no sane guest will use those instructions for EOI.

Yes, sane guests all use MOV for EOI. Btw, Xen has included a similar
acceleration for several releases. When measuring 10G SR-IOV network
performance, vEOI access overhead can reach 6%-8% depending on the
interrupt rate, which is one factor causing KVM to lag behind.

> 
> Another option is the hyper-V EOI support, which can also eliminate the
> EOI exit when no additional interrupt is pending.  This can improve EOI
> performance even more.
> 

yes, and this is an orthogonal option.

So if you agree, I'll send out an updated patch atop their work.

Thanks
Kevin


about vEOI optimization

2011-08-23 Thread Tian, Kevin
Hi, Avi,

Both Eddie and Marcelo once suggested a vEOI optimization that skips the
heavyweight instruction decode, which reduces vEOI overhead greatly:

http://www.mail-archive.com/kvm@vger.kernel.org/msg18619.html 
http://www.spinics.net/lists/kvm/msg36691.html

Though virtual x2apic serves a similar purpose, it depends on the guest OS
supporting x2apic. Many Windows versions don't support x2apic, including
Win7, Windows Server before 2008 R2, etc. Given that virtualization needs to
support various OS versions, is there any chance of incorporating the above
vEOI optimization in KVM as an alternative, to boost performance when the
guest doesn't support x2apic?

Thanks
Kevin


RE: large amount of NMI_INTERRUPT disgrade winxp VM performance much.

2011-08-10 Thread Tian, Kevin
> From: ya su
> Sent: Thursday, August 11, 2011 11:57 AM
> 
>  When I run winxp guest on one server, copy one file about 4G will
> take a time of 40-50 min; if I run a FC14 guest, it will take about
> 2-3 min;
> 
>  I copy and run the winxp image on another server, it works well, take
> about 3min.
> 
>  I run trace-cmd while copying files, the main difference of  the two
> outputs is that: the slow one's output have many NMI_INTERRUPT
> vm_exit, while the fast output has no such vm_exit. both of the two
> servers have NMI enabled default. the slow one's output as the
> following:
>  qemu-system-x86-4454  [004]   549.958147: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958172: kvm_exit:
> reason EXCEPTION_NMI rip 0x8051d5e1
>  qemu-system-x86-4454  [004]   549.958172: kvm_page_fault:
> address c8f8a000 error_code b
>  qemu-system-x86-4454  [004]   549.958177: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958202: kvm_exit:
> reason EXCEPTION_NMI rip 0x8051d5e1
>  qemu-system-x86-4454  [004]   549.958204: kvm_page_fault:
> address c8f8b000 error_code b
>  qemu-system-x86-4454  [004]   549.958209: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958234: kvm_exit:
> reason EXCEPTION_NMI rip 0x8051d5e1
>  qemu-system-x86-4454  [004]   549.958234: kvm_page_fault:
> address c8f8c000 error_code b
>  qemu-system-x86-4454  [004]   549.958239: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958264: kvm_exit:
> reason EXCEPTION_NMI rip 0x8051d5e1
>  qemu-system-x86-4454  [004]   549.958264: kvm_page_fault:
> address c8f8d000 error_code b
>  qemu-system-x86-4454  [004]   549.958267: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958292: kvm_exit:
> reason EXCEPTION_NMI rip 0x8051d5e1
>  qemu-system-x86-4454  [004]   549.958294: kvm_page_fault:
> address c8f8e000 error_code b
>  qemu-system-x86-4454  [004]   549.958299: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958324: kvm_exit:
> reason EXCEPTION_NMI rip 0x8051d5e1
>  qemu-system-x86-4454  [004]   549.958324: kvm_page_fault:
> address c8f8f000 error_code b
>  qemu-system-x86-4454  [004]   549.958329: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958447: kvm_exit:
> reason EXTERNAL_INTERRUPT rip 0x80547ac8
>  qemu-system-x86-4454  [004]   549.958450: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958461: kvm_exit:
> reason CR_ACCESS rip 0x8054428c
>  qemu-system-x86-4454  [004]   549.958461: kvm_cr:
> cr_write 0 = 0x80010031
>  qemu-system-x86-4454  [004]   549.958541: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958573: kvm_exit:
> reason CR_ACCESS rip 0x80546beb
>  qemu-system-x86-4454  [004]   549.958575: kvm_cr:
> cr_write 0 = 0x8001003b
>  qemu-system-x86-4454  [004]   549.958585: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958610: kvm_exit:
> reason CR_ACCESS rip 0x80546b6c
>  qemu-system-x86-4454  [004]   549.958610: kvm_cr:
> cr_write 3 = 0x6e00020
>  qemu-system-x86-4454  [004]   549.958621: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958645: kvm_exit:
> reason EXCEPTION_NMI rip 0x8051d7f4
>  qemu-system-x86-4454  [004]   549.958645: kvm_page_fault:
> address c0648200 error_code 3
>  qemu-system-x86-4454  [004]   549.958653: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958725: kvm_exit:
> reason EXCEPTION_NMI rip 0x8050a26a
>  qemu-system-x86-4454  [004]   549.958726: kvm_page_fault:
> address c0796994 error_code 3
>  qemu-system-x86-4454  [004]   549.958738: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958750: kvm_exit:
> reason IO_INSTRUCTION rip 0x806edad0
>  qemu-system-x86-4454  [004]   549.958750: kvm_pio:
> pio_write at 0xc050 size 2 count 1
>  qemu-system-x86-4454  [004]   549.958838: kvm_entry:
> vcpu 0
>  qemu-system-x86-4454  [004]   549.958844: kvm_exit:
> reason APIC_ACCESS rip 0x806e7b85
>  qemu-system-x86-4454  [004]   549.958852: kvm_apic:
> apic_read APIC_ICR = 0x40041
>  qemu-system-x86-4454  [004]   549.958855: kvm_mmio:
> mmio
> read len 4 gpa 0xfee00300 val 0x40041
>  qemu-system-x86-4454  [004]   549.958857: kvm_mmio:
> mmio
> write len 4 gpa 0xfee00300 val 0x40041
>  qemu-system-x86-4454  [004]   549.958858: kvm_apic:
> apic_write APIC_ICR = 0x40041
>  qemu-system-x86-4454  [004]   549.958860: kvm_apic_ipi: dst 1
> vec 65 (Fixed|physical|de-assert|edge|self)
>  qemu-system-x86-4454  [004]   549.958860: kvm_apic_accept_irq:
> apicid 0 vec 65 (Fixed|edge)
> 
>  Even I disable the NMI when I boot the kernel with nmi_watchdog=0,
> the trace-cmd output still show there are many NMI_INTERRUPT. I find
> that in /proc/interrupts, the amount of NMI is 0. Does this mean that
> NMI is produced in winxp guest OS, or this setting can not hinder kvm
> to catch NMI interrupt?
> 
>   I think the difference between FC14 and winxp is that: fc14 process
> the NMI interrupt correctly, but winxp can not, is this right?
> 
>   I run qemu-kvm with version 0.14.0, kernel version 2.6.32-131.6.4. I
> change kvm

RE: kvm upstream build error..

2011-07-12 Thread Tian, Kevin
It works on my side; the difference comes from the kernel config. The error
is caused by the recent steal time feature.

int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
	case MSR_KVM_STEAL_TIME:

		if (unlikely(!sched_info_on()))
			return 1;

static inline int sched_info_on(void)
{
#ifdef CONFIG_SCHEDSTATS
	return 1;
#elif defined(CONFIG_TASK_DELAY_ACCT)
	extern int delayacct_on;
	return delayacct_on;
#else
	return 0;
#endif
}

I have CONFIG_SCHEDSTATS enabled, while your config has CONFIG_SCHEDSTATS=n
and CONFIG_TASK_DELAY_ACCT=y. However, delayacct_on is not a symbol exported
to modules.
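
If kvm.ko really needs to keep using sched_info_on() with such a config, one
conceivable workaround - purely a sketch of an assumption, not necessarily
what upstream ended up doing - is to export the symbol so the module can
resolve it:

	/* kernel/delayacct.c (delayacct_on is already defined there) */
	int delayacct_on __read_mostly = 1;	/* Delay accounting turned on/off */
	EXPORT_SYMBOL_GPL(delayacct_on);	/* new: make it visible to modules */

The other obvious way out is simply to enable CONFIG_SCHEDSTATS in the
failing config, in which case sched_info_on() compiles to a constant 1 and
delayacct_on is never referenced.
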

Thanks
Kevin

> From: Ren, Yongjie
> Sent: Wednesday, July 13, 2011 10:05 AM
> 
> Hi Avi,
> KVM upstream have a build error in our build system. Do you have some
> comments on this issue? my kvm commit is
> 0af9df4e80a3a620c52c3a9777191c54e615d068
> build error info:
>   LD  arch/x86/boot/setup.elf
>   OBJCOPY arch/x86/boot/setup.bin
>   BUILD   arch/x86/boot/bzImage
> Root device is (8, 1)
> Setup is 15672 bytes (padded to 15872 bytes).
> System is 2556 kB
> CRC c5b53854
> Kernel: arch/x86/boot/bzImage is ready  (#1)
> ERROR: "delayacct_on" [arch/x86/kvm/kvm.ko] undefined!
> make[3]: *** [__modpost] Error 1
> make[2]: *** [modules] Error 2
> error: Bad exit status from /var/tmp/rpm-tmp.81681 (%build)
> 
> Best Regards,
>  Yongjie Ren  (Jay)


RE: [PATCH 2/2] vmx,svm: Print errors if SVM or VMX were already set

2011-07-04 Thread Tian, Kevin
> From: Alexander Graf
> Sent: Tuesday, July 05, 2011 8:42 AM
> 
> 
> On 05.07.2011, at 01:09, Sasha Levin wrote:
> 
> > Instead of exiting quietly, print an error if the VMX or the SVM bits
> > were already set when loading the module.
> >
> > Having VMX/SVM bits set means that either there is someone else doing
> > hardware virtualization, or that the BIOS is buggy and sets it on
> > by default.
> >
> > Cc: Avi Kivity 
> > Cc: Marcelo Tosatti 
> > Signed-off-by: Sasha Levin 
> > ---
> > arch/x86/kvm/svm.c |5 -
> > arch/x86/kvm/vmx.c |4 +++-
> > 2 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 5ca76e3..2a1df2e 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -590,8 +590,11 @@ static int svm_hardware_enable(void *garbage)
> > int me = raw_smp_processor_id();
> >
> > rdmsrl(MSR_EFER, efer);
> > -   if (check_inuse && (efer & EFER_SVME))
> > +   if (check_inuse && (efer & EFER_SVME)) {
> > +   printk(KERN_ERR "svm_hardware_enable: SVM already set
> on %d\n",
> 
> CPU%d
> 
> Also I'd rephrase it as "already in use on". Otherwise looks good :)
> 

A more elaborate message sounds better, e.g. including the explanation of
the possible cause from the commit message. Also, advertising the
"check_inuse" option in the message would be a good thing. :-)

Thanks
Kevin


RE: [PATCH 2/2] vmx,svm: Print errors if SVM or VMX were already set

2011-07-04 Thread Tian, Kevin
> From: Sasha Levin
> Sent: Tuesday, July 05, 2011 7:09 AM
> 
> Instead of exiting quietly, print an error if the VMX or the SVM bits
> were already set when loading the module.
> 
> Having VMX/SVM bits set means that either there is someone else doing
> hardware virtualization, or that the BIOS is buggy and sets it on
> by default.
> 
> Cc: Avi Kivity 
> Cc: Marcelo Tosatti 
> Signed-off-by: Sasha Levin 
> ---
>  arch/x86/kvm/svm.c |5 -
>  arch/x86/kvm/vmx.c |4 +++-
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 5ca76e3..2a1df2e 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -590,8 +590,11 @@ static int svm_hardware_enable(void *garbage)
>   int me = raw_smp_processor_id();
> 
>   rdmsrl(MSR_EFER, efer);
> - if (check_inuse && (efer & EFER_SVME))
> + if (check_inuse && (efer & EFER_SVME)) {
> + printk(KERN_ERR "svm_hardware_enable: SVM already set
> on %d\n",
> +me);
>   return -EBUSY;
> + }
> 
>   if (!has_svm()) {
>   printk(KERN_ERR "svm_hardware_enable: err EOPNOTSUPP
> on %d\n",
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 3046b07..df69b1d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2233,8 +2233,10 @@ static int hardware_enable(void *garbage)
>   u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
>   u64 old, test_bits;
> 
> - if (check_inuse && (read_cr4() & X86_CR4_VMXE))
> + if (check_inuse && (read_cr4() & X86_CR4_VMXE)) {
> + printk(KERN_ERR "hardware_enable: VMX already set\n");
>   return -EBUSY;
> + }
> 

Make the error message consistent between vmx and svm, i.e. add the
cpu id for vmx too.

Thanks
Kevin


RE: [Patch v5 1/4] Remove SMEP bit from CR4_RESERVED_BITS

2011-06-01 Thread Tian, Kevin
> From: Ingo Molnar
> Sent: Monday, May 30, 2011 3:41 PM
> 
> 
> * Yang, Wei Y  wrote:
> 
> > This patch removes SMEP bit from CR4_RESERVED_BITS.
> 
> I'm wondering, what is the best-practice way for tools/kvm/ to set
> SMEP for the guest kernel automatically, even if the guest kernel
> itsef has not requested SMEP?
> 

Enabling SMEP without the guest's knowledge can be problematic if the guest
changes a U/S bit from 0 to 1 without TLB invalidation, which is required
for SMEP protection to work correctly. Linux versions known so far don't
have this behavior, because the TLB invalidation done for the P bit change
also covers the U/S 0->1 change. But given that end users may deploy various
OSes in the guest, enabling SMEP this way requires a solid understanding of
the internals of those OSes; otherwise it's uncertain whether SMEP
protection fully works on such guests.
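
A hypothetical guest-side sequence (illustrative only, not taken from this
thread) of the problematic pattern:

	/* guest makes a formerly supervisor-only page user-accessible
	 * without flushing the TLB entry for it */
	set_pte(ptep, __pte(pte_val(*ptep) | _PAGE_USER));	/* U/S: 0 -> 1, no flush */

	/* A stale TLB entry may still report U/S=0 for that page.  If the
	 * host silently turned on CR4.SMEP, supervisor-mode fetches through
	 * the stale entry are still permitted, so SMEP does not actually
	 * protect against executing the now-user page until the entry is
	 * flushed or evicted. */
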

Thanks
Kevin


RE: [Patch v5 0/4] Enable SMEP feature support for kvm

2011-05-30 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Monday, May 30, 2011 6:00 PM
> 
> On 05/30/2011 12:18 PM, Tian, Kevin wrote:
> > >  From: Avi Kivity [mailto:a...@redhat.com]
> > >  Sent: Monday, May 30, 2011 5:14 PM
> > >
> > >  On 05/30/2011 12:08 PM, Tian, Kevin wrote:
> > >  >  >   From: Avi Kivity
> > >  >  >   Sent: Monday, May 30, 2011 4:52 PM
> > >  >  >
> > >  >  >   On 05/30/2011 06:01 AM, Yang, Wei Y wrote:
> > >  >  >   >   This patchset enables a new CPU feature SMEP (Supervisor
> Mode
> > >  Execution
> > >  >  >   >   Protection) in KVM. SMEP prevents kernel from executing
> code in
> > >  application.
> > >  >  >   >   Updated Intel SDM describes this CPU feature. The
> document will be
> > >  >  >   >   published soon.
> > >  >  >   >
> > >  >  >   >   This patchset is based on Fenghua's SMEP patch series, as
> referred
> > >  by:
> > >  >  >   >   https://lkml.org/lkml/2011/5/17/523
> > >  >  >
> > >  >  >   Looks good.  I'll post the cr0.wp=0 fixup soon.
> > >  >  >
> > >  >
> > >  >  what's your planned fix? through NX bit? :-)
> > >
> > >  Yes.
> > >
> > >  >  btw, why is current scheme used to emulate cr0.wp=0 case instead of
> simply
> > >  >  emulating it?
> > >
> > >  How would you simply emulate it?
> > >
> > >  We have to force cr0.wp=1, otherwise we cannot write-protect guest
> page
> > >  tables.  Once we do that, we have to set U=1 to allow user reads or U=0
> > >  to allow kernel writes.
> > >
> >
> > I mean using instruction emulation instead of changing permission to
> re-execute
> > faulting instruction. Or is current KVM instruction emulator not complete
> enough
> > to handle various memory access instructions (just designed for page table
> access
> > and real mode instructions?)?
> 
> I think by now it's complete enough (it wasn't when the shadow mmu was
> written).  But emulation will be slow if the guest writes a lot of data
> to the page.

OK, got it.

Thanks
Kevin


RE: [Patch v5 0/4] Enable SMEP feature support for kvm

2011-05-30 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Monday, May 30, 2011 5:14 PM
> 
> On 05/30/2011 12:08 PM, Tian, Kevin wrote:
> > >  From: Avi Kivity
> > >  Sent: Monday, May 30, 2011 4:52 PM
> > >
> > >  On 05/30/2011 06:01 AM, Yang, Wei Y wrote:
> > >  >  This patchset enables a new CPU feature SMEP (Supervisor Mode
> Execution
> > >  >  Protection) in KVM. SMEP prevents kernel from executing code in
> application.
> > >  >  Updated Intel SDM describes this CPU feature. The document will be
> > >  >  published soon.
> > >  >
> > >  >  This patchset is based on Fenghua's SMEP patch series, as referred
> by:
> > >  >  https://lkml.org/lkml/2011/5/17/523
> > >
> > >  Looks good.  I'll post the cr0.wp=0 fixup soon.
> > >
> >
> > what's your planned fix? through NX bit? :-)
> 
> Yes.
> 
> > btw, why is current scheme used to emulate cr0.wp=0 case instead of simply
> > emulating it?
> 
> How would you simply emulate it?
> 
> We have to force cr0.wp=1, otherwise we cannot write-protect guest page
> tables.  Once we do that, we have to set U=1 to allow user reads or U=0
> to allow kernel writes.
> 

I mean using instruction emulation instead of changing permissions and
re-executing the faulting instruction. Or is the current KVM instruction
emulator not complete enough to handle various memory access instructions
(i.e. was it designed just for page table accesses and real-mode
instructions)?

Thanks
Kevin


RE: [Patch v5 0/4] Enable SMEP feature support for kvm

2011-05-30 Thread Tian, Kevin
> From: Avi Kivity
> Sent: Monday, May 30, 2011 4:52 PM
> 
> On 05/30/2011 06:01 AM, Yang, Wei Y wrote:
> > This patchset enables a new CPU feature SMEP (Supervisor Mode Execution
> > Protection) in KVM. SMEP prevents kernel from executing code in application.
> > Updated Intel SDM describes this CPU feature. The document will be
> > published soon.
> >
> > This patchset is based on Fenghua's SMEP patch series, as referred by:
> > https://lkml.org/lkml/2011/5/17/523
> 
> Looks good.  I'll post the cr0.wp=0 fixup soon.
> 

What's your planned fix? Through the NX bit? :-)

Btw, why is the current scheme used to emulate the cr0.wp=0 case instead of
simply emulating it?

Thanks
Kevin


RE: [Patch v3] Enable CPU SMEP feature for KVM

2011-05-26 Thread Tian, Kevin
> From: Yang, Wei Y
> Sent: Thursday, May 26, 2011 9:29 PM
> 
> This patchset enables a new CPU feature SMEP (Supervisor Mode Execution
> Protection) in KVM. SMEP prevents kernel from executing code in application.
> Updated Intel SDM describes this CPU feature. The document will be published
> soon.
> 
> This patchset is based on Fenghua's SMEP patch series, as referred by:
> https://lkml.org/lkml/2011/5/17/523
> 
> Changes since v2: enable SMEP for spt mode.

Another change in this version is to avoid adding SMEP to
cr4_guest_owned_bits, because architecturally a TLB flush is required when
CR4.SMEP is changed, and that has to be emulated.

Also, based on your comment, SMEP is now permitted based on the cpuid
setting. One pending issue, though, is check_cr_write in the emulation path.
We changed cr4_reserved_bits to be vcpu-specific instead of global, but
check_cr_write doesn't take a vcpu parameter. It looks like the whole
emulation logic deliberately avoids using the vcpu context in order to stay
neutral. Avi, do you have any suggestion for a clean change here? Add a
cr_reserved_bits array to the emulation context instead of hard-coding it?
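
For illustration, a sketch of that last idea (assuming the 3.0-era emulator
where check_cr_write() consults a static cr_reserved_bits[] table; field and
helper names are approximate, not a tested patch):

	/* in struct x86_emulate_ctxt: per-context reserved-bit masks, filled
	 * from the vcpu's cr4_reserved_bits when the context is initialized,
	 * so the emulator itself stays vcpu-agnostic */
	u64 cr_reserved_bits[9];

	static int check_cr_write(struct x86_emulate_ctxt *ctxt)
	{
		u64 new_val = ctxt->src.val64;
		int cr = ctxt->modrm_reg;

		if (!valid_cr(cr))
			return emulate_ud(ctxt);

		if (new_val & ctxt->cr_reserved_bits[cr])	/* was: static table */
			return emulate_gp(ctxt, 0);

		/* ... remaining per-CR checks unchanged ... */
		return X86EMUL_CONTINUE;
	}
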

Thanks
Kevin

> 
>  Signed-off-by: Yang Wei 
>  Signed-off-by: Shan Haitao 
> 
> ---
>  arch/x86/include/asm/kvm_host.h |1 +
>  arch/x86/kvm/paging_tmpl.h  |   15 +--
>  arch/x86/kvm/vmx.c  |9 +
>  arch/x86/kvm/x86.c  |7 +--
>  4 files changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h
> index d2ac8e2..154287b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -306,6 +306,7 @@ struct kvm_vcpu_arch {
>   unsigned long cr3;
>   unsigned long cr4;
>   unsigned long cr4_guest_owned_bits;
> + unsigned long cr4_reserved_bits;
>   unsigned long cr8;
>   u32 hflags;
>   u64 efer;
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 6c4dc01..7e0b2f8 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -120,7 +120,7 @@ static int FNAME(walk_addr_generic)(struct
> guest_walker *walker,
>   struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
>   gva_t addr, u32 access)
>  {
> - pt_element_t pte;
> + pt_element_t pte, pte_smep;
>   pt_element_t __user *ptep_user;
>   gfn_t table_gfn;
>   unsigned index, pt_access, uninitialized_var(pte_access);
> @@ -150,7 +150,11 @@ walk:
>   }
>   --walker->level;
>   }
> + pte_smep = ~0ULL;
> +#else
> + pte_smep = ~0U;
>  #endif
> +
>   ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
>  (mmu->get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);
> 
> @@ -234,6 +238,8 @@ walk:
> 
>   walker->ptes[walker->level - 1] = pte;
> 
> + pte_smep &= pte;
> +
>   if ((walker->level == PT_PAGE_TABLE_LEVEL) ||
>   ((walker->level == PT_DIRECTORY_LEVEL) &&
>   is_large_pte(pte) &&
> @@ -246,6 +252,11 @@ walk:
>   gfn_t gfn;
>   u32 ac;
> 
> + if (unlikely(fetch_fault && !user_fault))
> + if ((vcpu->arch.cr4 & X86_CR4_SMEP)
> + && (pte_smep & PT_USER_MASK))
> + eperm = true;
> +
>   gfn = gpte_to_gfn_lvl(pte, lvl);
>   gfn += (addr & PT_LVL_OFFSET_MASK(lvl)) >> PAGE_SHIFT;
> 
> @@ -305,7 +316,7 @@ error:
> 
>   walker->fault.error_code |= write_fault | user_fault;
> 
> - if (fetch_fault && mmu->nx)
> + if (fetch_fault && (mmu->nx || (vcpu->arch.cr4 & X86_CR4_SMEP)))
>   walker->fault.error_code |= PFERR_FETCH_MASK;
>   if (rsvd_fault)
>   walker->fault.error_code |= PFERR_RSVD_MASK;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4c3fa0f..7ad24fd 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4507,6 +4507,15 @@ static void vmx_cpuid_update(struct kvm_vcpu
> *vcpu)
>   }
>   }
>   }
> +
> + best = kvm_find_cpuid_entry(vcpu, 7, 0);
> + if (best && (best->ebx & bit(X86_FEATURE_SMEP))) {
> + if (boot_cpu_has(X86_FEATURE_SMEP))
> + vcpu->arch.cr4_reserved_bits &=
> + ~((unsigned long)X86_CR4_SMEP);
> + else
> + best->ebx &= ~(bit(X86_FEATURE_SMEP));
> + }
>  }
> 
>  static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2
> *entry)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 77c9d86..6ead39e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -598,9 +598,10 @@ static void update_cpuid(struct kvm_vcpu *vcpu)
>  int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  {
>   unsigned long 

RE: [PATCH 0/31] nVMX: Nested VMX, v11

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Thursday, May 26, 2011 4:01 AM
> 
> Hi,
> 
> This is the eleventh iteration of the nested VMX patch set, and hopefully the
> last in this format.
> 
> Improvements in this version over the previous one include:
> 
>  * Overhauled vmcs, cpu, and launched handling (new loaded_vmcs structure
> and
>per-cpu linked-list replacing old vcpu linked-list).
> 
>  * Added missing checks that VMPTRLD was done, in emulating VMLAUNCH
> and some
>other VMX instructions.
> 
>  * prepare_vmcs02 is now void, and cannot fail. Correctly handle the case
>of apic-access page with an invalid GPA.
> 
>  * A bunch of corrections to errors and omissions discovered by Kevin Tian.
> 
>  * Miscellenous function name changes, and other cosmetic improvements.
> 
> This new set of patches applies to the current KVM trunk (I checked with
> 5b186b275c0288c8576d26baebe9019875d68a2b).
> 
> Avi, please apply.

This version looks good to me; just please make sure the unhandled comments
are recorded in your private bugzilla and eventually filed in a public
place. :-)

Acked-by: Kevin Tian 

Thanks
Kevin

> 
> Nadav.
> 
> 
> About nested VMX:
> -
> 
> The following 31 patches implement nested VMX support. This feature enables
> a guest to use the VMX APIs in order to run its own nested guests.
> In other words, it allows running hypervisors (that use VMX) under KVM.
> Multiple guest hypervisors can be run concurrently, and each of those can
> in turn host multiple guests.
> 
> The theory behind this work, our implementation, and its performance
> characteristics were presented in OSDI 2010 (the USENIX Symposium on
> Operating Systems Design and Implementation). Our paper was titled
> "The Turtles Project: Design and Implementation of Nested Virtualization",
> and was awarded "Jay Lepreau Best Paper". The paper is available online, at:
> 
>   http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
> 
> This patch set does not include all the features described in the paper.
> In particular, this patch set is missing nested EPT (L1 can't use EPT and
> must use shadow page tables). It is also missing some features required to
> run VMWare hypervisors as a guest. These missing features will be sent as
> follow-on patches.
> 
> Running nested VMX:
> --
> 
> The nested VMX feature is currently disabled by default. It must be
> explicitly enabled with the "nested=1" option to the kvm-intel module.
> 
> No modifications are required to user space (qemu). However, qemu's default
> emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must
> be
> explicitly enabled, by giving qemu one of the following options:
> 
>  -cpu host  (emulated CPU has all features of the real
> CPU)
> 
>  -cpu qemu64,+vmx   (add just the vmx feature to a named CPU
> type)
> 
> 
> This version was only tested with KVM (64-bit) as a guest hypervisor, and
> Linux as a nested guest.
> 
> 
> Patch statistics:
> -
> 
>  Documentation/kvm/nested-vmx.txt |  251 ++
>  arch/x86/include/asm/kvm_host.h  |2
>  arch/x86/include/asm/msr-index.h |   12
>  arch/x86/include/asm/vmx.h   |   43
>  arch/x86/kvm/svm.c   |6
>  arch/x86/kvm/vmx.c   | 2733 +++--
>  arch/x86/kvm/x86.c   |   11
>  arch/x86/kvm/x86.h   |8
>  8 files changed, 2907 insertions(+), 159 deletions(-)
> 
> --
> Nadav Har'El
> IBM Haifa Research Lab


RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 9:26 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 18/31] nVMX:
> Implement VMLAUNCH and VMRESUME":
> > > + if (!saved_vmcs02)
> > > + return -ENOMEM;
> > > +
> >
> > we shouldn't return error after the guest mode is updated. Or else move
> > enter_guest_mode to a later place...
> 
> I moved things around, but I don't think it matters anyway: If we return
> ENOMEM, the KVM ioctl fails, and the whole L1 guest dies - it doesn't matter
> at this point if we were in the middle of updating its state.
> 

Yes, but the code is cleaner this way: guest mode is set only once we know
we'll certainly enter L2 next.

Thanks
Kevin


RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Wednesday, May 25, 2011 9:21 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX:
> Exiting from L2 to L1":
> > How about SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO
> comment
> > here make the whole load process complete. :-)
> >
> > Also isn't it more sane to update vmcs01's guest segment info based on
> vmcs12's
> > host segment info? Though you can assume the environment in L1 doesn't
> change
> > from VMLAUNCH/VMRESUME to VMEXIT handler, it's more architectural
> clear
> > to load those segments fields according to L1's desire.
> 
> Right... One of these days, I (or some other volunteer ;-)) would need to
> print out the relevant sections of the SDM, sit down with a marker, and read
> it line by line, marking lines, fields, capabilities, and so on, which we
> forgot to implement...

You've done a great job.

> 
> How about these additions:
> 
>   vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs);
>   vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp);
>   vmcs_writel(GUEST_SYSENTER_EIP, vmcs12->host_ia32_sysenter_eip);
>   vmcs_writel(GUEST_IDTR_BASE, vmcs12->host_idtr_base);
>   vmcs_writel(GUEST_GDTR_BASE, vmcs12->host_gdtr_base);
>   vmcs_writel(GUEST_TR_BASE, vmcs12->host_tr_base);
>   vmcs_writel(GUEST_GS_BASE, vmcs12->host_gs_base);
>   vmcs_writel(GUEST_FS_BASE, vmcs12->host_fs_base);
>   vmcs_write16(GUEST_ES_SELECTOR, vmcs12->host_es_selector);
>   vmcs_write16(GUEST_CS_SELECTOR, vmcs12->host_cs_selector);
>   vmcs_write16(GUEST_SS_SELECTOR, vmcs12->host_ss_selector);
>   vmcs_write16(GUEST_DS_SELECTOR, vmcs12->host_ds_selector);
>   vmcs_write16(GUEST_FS_SELECTOR, vmcs12->host_fs_selector);
>   vmcs_write16(GUEST_GS_SELECTOR, vmcs12->host_gs_selector);
>   vmcs_write16(GUEST_TR_SELECTOR, vmcs12->host_tr_selector);
> 
>   if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
>   vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
>   if (vmcs12->vm_exit_controls &
> VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL)
>   vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL,
>   vmcs12->host_ia32_perf_global_ctrl);
> 

looks good.

Thanks
Kevin


RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 8:34 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 23/31] nVMX:
> Correct handling of interrupt injection":
> > >  static void enable_irq_window(struct kvm_vcpu *vcpu)
> > >  {
> > >   u32 cpu_based_vm_exec_control;
> > > + if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> > > + /* We can get here when nested_run_pending caused
> > > +  * vmx_interrupt_allowed() to return false. In this case, do
> > > +  * nothing - the interrupt will be injected later.
> > > +  */
> >
> > I think this is not a rare path? when vcpu is in guest mode with L2 as 
> > current
> > vmx context, this function could be invoked multiple times since kvm thread
> > can be scheduled out/in randomly.
> 
> As I wrote in this comment, this can only happen on nested_run_pending
> (i.e., VMLAUNCH/VMRESUME emulation), because if !nested_run_pending,
> and nested_exit_on_intr(), vmx_interrupt_allowed() would have already
> exited L2, and we wouldn't be in this case.
> 
> I don't know if to classify this as a "rare" path - it's definitely not
> very common. But what does it matter if it's rare or common?

It doesn't matter much. I just tried to understand your comment.

> 
> 
> > > + if (to_vmx(vcpu)->nested.nested_run_pending)
> > > + return 0;
> >
> > Well, now I can see why you require this special 'nested_run_pending' flag
> > because there're places where L0 injects virtual interrupts right after
> > VMLAUNCH/VMRESUME emulation and before entering L2. :-)
> 
> Indeed. I tried to explain that in the patch description, where I wrote
> 
>  We keep a new flag, "nested_run_pending", which can override the decision
> of
>  which should run next, L1 or L2. nested_run_pending=1 means that we
> *must* run
>  L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
>  and therefore expects L2 to be run (and perhaps be injected with an event it
>  specified, etc.). Nested_run_pending is especially intended to avoid 
> switching
>  to L1 in the injection decision-point described above.

At the point where nested_run_pending is first introduced, its usage is simple,
which made me think this field may not be required. But several later key patches
do depend on this flag for correctness. :-)
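As a minimal sketch of how this flag ends up gating the injection decision
(pieced together from the quoted patches and simplified; not the literal
patch code):

	/* inside the nested-aware interrupt_allowed check (simplified) */
	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
		if (to_vmx(vcpu)->nested.nested_run_pending)
			return 0;	/* must enter L2 first; inject later */
		nested_vmx_vmexit(vcpu);	/* otherwise switch to L1 ... */
		/* ... and let the normal injection path run, now in L1 */
	}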

> 
> > > + nested_vmx_vmexit(vcpu);
> > > + vmcs12 = get_vmcs12(vcpu);
> > > + vmcs12->vm_exit_reason =
> EXIT_REASON_EXTERNAL_INTERRUPT;
> > > + vmcs12->vm_exit_intr_info = 0;
> > > + /* fall through to normal code, but now in L1, not L2 */
> > > + }
> > > +
> >
> > This is a bad place to add this logic. vmx_interrupt_allowed is simply a
> > query function but you make it an absolute trigger point for switching from
> > L2 to L1. This is fine as now only point calling vmx_interrupt_allowed is
> > when there's vNMI pending. But it's dangerous to have such assumption
> > for pending events inside vmx_interrupt_allowed.
> 
> Now you're beating a dead horse ;-)
> 
> Gleb, and to some degree Avi, already argued that this is the wrong place
> to do this exit, and if anything the exit should be done (or just decided on)
> in enable_irq_window().
> 
> My counter-argument was that the right way is *neither* of these approaches
> -
> any attempt to "commandeer" one of the existing x86 ops, be they
> vmx_interrupt_allowed() or enable_irq_window() to do in the L2 case things
> they were never designed to do is both ugly, and dangerous if the call sites
> change at some time in the future.
> 
> So rather than changing one ugly abuse of one function, to the (arguably
> also ugly) abuse of another function, what I'd like to see is a better overall
> design, where the call sites in x86.c know about the possibility of a nested
> guest (they already do - like we previously discussed, an is_guest_mode()
> function was recently added), and when they need, *they* will call an
> exit-to-L1 function, rather than calling a function called "enable_irq_window"
> or "vmx_interrupt_allowed" which mysteriously will do the exit.
> 

I agree with your point here.

> 
> > On the other hand, I think there's one area which is not handled timely.
> > I think you need to kick a L2->L1 transition when L0 wants to inject
> > virtual interrupt. Consider your current logic:
> >
> > a) L2 is running on cpu1
> >

RE: [PATCH 31/31] nVMX: Documentation

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Wednesday, May 25, 2011 7:55 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 31/31] nVMX:
> Documentation":
> > > +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
> > > +to easily and efficiently run guest operating systems. Normally, these
> guests
> > > +*cannot* themselves be hypervisors running their own guests, because in
> > > VMX,
> > > +guests cannot use VMX instructions.
> >
> > "because in VMX, guests cannot use VMX instructions" looks not correct or
> else
> > you can't add nVMX support. :-) It's just because currently KVM doesn't
> emulate
> > those VMX instructions.
> 
> It depends on whether you look on the half-empty or half-full part of the
> glass ;-)
> 
> The VMX instructions, when used in L1, do trap - as mandated by Popek and
> Goldberg's theorem (that sensitive instructions must trap) - but they
> don't "just work" like, for example, arithmetic instructions just work -
> they need to be emulated by the VMM.
> 
> > > +Terminology
> > > +---
> > > +
> > > +Single-level virtualization has two levels - the host (KVM) and the 
> > > guests.
> > > +In nested virtualization, we have three levels: The host (KVM), which we
> call
> > > +L0, the guest hypervisor, which we call L1, and its nested guest, which 
> > > we
> > > +call L2.
> >
> > Add a brief introduction about vmcs01/vmcs02/vmcs12 is also helpful here,
> given
> > that this doc is a centralized place to gain quick picture of the nested 
> > VMX.
> 
> I'm adding now a short mention. However, I think this file should be viewed
> as a user's guide, not a developer's guide. Developers should probably read
> our full paper, where this terminology is explained, as well as how vmcs02
> is related to the two others.

I agree with the purpose of this doc. 

> 
> > > +Additional patches for running Windows under guest KVM, and Linux under
> > > +guest VMware server, and support for nested EPT, are currently running in
> > > +the lab, and will be sent as follow-on patchsets.
> >
> > any plan on nested VTD?
> 
> Yes, for some definition of Yes ;-)
> 
> We do have an experimental nested IOMMU implementation: In our nested
> VMX
> paper we showed how giving L1 an IOMMU allows for efficient nested device
> assignment (L0 assigns a PCI device to L1, and L1 does the same to L2).
> In that work we used a very simplistic "paravirtual" IOMMU instead of fully
> emulating an IOMMU for L1.
> Later, we did develop a full emulation of an IOMMU for L1, although we didn't
> test it in the context of nested VMX (we used it to allow L1 to use an IOMMU
> for better DMA protection inside the guest).
> 
> The IOMMU emulation work was done by Nadav Amit, Muli Ben-Yehuda, et al.,
> and will be described in the upcoming Usenix ATC conference
> (http://www.usenix.org/event/atc11/tech/techAbstracts.html#Amit).
> After the conference in June, the paper will be available at this URL:
> http://www.usenix.org/event/atc11/tech/final_files/Amit.pdf
> 
> If there is interest, they can perhaps contribute their work to
> KVM (and QEMU) - if you're interested, please get in touch with them directly.

Thanks, and good to know that information.

> 
> > It'd be good to provide a list of known supported features. In your current
> code,
> > people have to look at code to understand current status. If you can keep a
> > supported and verified feature list here, it'd be great.
> 
> It will be even better to support all features ;-)
> 
> But seriously, the VMX spec is hundreds of pages long, with hundreds of
> features, sub-features, and sub-sub-features and myriads of subcase-of-
> subfeature and combinations thereof, so I don't think such a list would be
> practical - or ever be accurate.

No need to cover all sub-features; a list of perhaps a dozen major features that
people enabled one-by-one would be very welcome, especially for things which
may accelerate L2 perf, such as virtual NMI, TPR shadow, virtual x2APIC, ...

> 
> In the "Known Limitations" section of this document, I'd like to list major
> features which are missing, and perhaps more importantly - L1 and L2
> guests which are known NOT to work.

Yes, that info is also important, so that people can easily reproduce your
success.

> 
> By the way, it appears that you've been going over the patches in increasing
> numerical order, and this is the last patch ;-) Have you finished your
> review iteration?
> 

yes, I've finished my review on all of your v10 patches. :-)

Thanks
Kevin


RE: [PATCH] new version of loaded_vmcs patch

2011-05-25 Thread Tian, Kevin
> From: Avi Kivity
> Sent: Tuesday, May 24, 2011 9:57 PM
> 
> On 05/24/2011 04:14 PM, Avi Kivity wrote:
> > On 05/24/2011 03:26 PM, Nadav Har'El wrote:
> >> Hi Avi, here is a updated version of the loaded_vmcs patch which you
> >> asked me
> >> to send before the rest of the nvmx patches.
> >>
> >> Please let me know when you'd like me to send you an updated version of
> >> the rest of the patches.
> >
> > Do you have a list of unresolved issues?
> >
> > If not, please repost the series.
> >
> 
> btw, since Kevin is now plowing through the patchset, please wait for
> the rest of his review.
> 

I have finished my current round of review of nVMX-v10 with all comments
sent out.

Thanks
Kevin


RE: [PATCH 31/31] nVMX: Documentation

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 4:00 AM
> 
> This patch includes a brief introduction to the nested vmx feature in the
> Documentation/kvm directory. The document also includes a copy of the
> vmcs12 structure, as requested by Avi Kivity.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  Documentation/kvm/nested-vmx.txt |  243 +
>  1 file changed, 243 insertions(+)
> 
> --- .before/Documentation/kvm/nested-vmx.txt  2011-05-16
> 22:36:51.0 +0300
> +++ .after/Documentation/kvm/nested-vmx.txt   2011-05-16
> 22:36:51.0 +0300
> @@ -0,0 +1,243 @@
> +Nested VMX
> +==
> +
> +Overview
> +-
> +
> +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
> +to easily and efficiently run guest operating systems. Normally, these guests
> +*cannot* themselves be hypervisors running their own guests, because in
> VMX,
> +guests cannot use VMX instructions.

"because in VMX, guests cannot use VMX instructions" looks not correct or else
you can't add nVMX support. :-) It's just because currently KVM doesn't emulate
those VMX instructions.

> +
> +The "Nested VMX" feature adds this missing capability - of running guest
> +hypervisors (which use VMX) with their own nested guests. It does so by
> +allowing a guest to use VMX instructions, and correctly and efficiently
> +emulating them using the single level of VMX available in the hardware.
> +
> +We describe in much greater detail the theory behind the nested VMX
> feature,
> +its implementation and its performance characteristics, in the OSDI 2010
> paper
> +"The Turtles Project: Design and Implementation of Nested Virtualization",
> +available at:
> +
> + http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf
> +
> +
> +Terminology
> +---
> +
> +Single-level virtualization has two levels - the host (KVM) and the guests.
> +In nested virtualization, we have three levels: The host (KVM), which we call
> +L0, the guest hypervisor, which we call L1, and its nested guest, which we
> +call L2.

Adding a brief introduction to vmcs01/vmcs02/vmcs12 would also be helpful here, given
that this doc is a centralized place to gain a quick picture of nested VMX.

> +
> +
> +Known limitations
> +-
> +
> +The current code supports running Linux guests under KVM guests.
> +Only 64-bit guest hypervisors are supported.
> +
> +Additional patches for running Windows under guest KVM, and Linux under
> +guest VMware server, and support for nested EPT, are currently running in
> +the lab, and will be sent as follow-on patchsets.

Any plan for nested VT-d?

> +
> +
> +Running nested VMX
> +--
> +
> +The nested VMX feature is disabled by default. It can be enabled by giving
> +the "nested=1" option to the kvm-intel module.
> +
> +No modifications are required to user space (qemu). However, qemu's default
> +emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must
> be
> +explicitly enabled, by giving qemu one of the following options:
> +
> + -cpu host  (emulated CPU has all features of the real
> CPU)
> +
> + -cpu qemu64,+vmx   (add just the vmx feature to a named CPU
> type)
> +
> +
> +ABIs
> +
> +
> +Nested VMX aims to present a standard and (eventually) fully-functional VMX
> +implementation for the a guest hypervisor to use. As such, the official
> +specification of the ABI that it provides is Intel's VMX specification,
> +namely volume 3B of their "Intel 64 and IA-32 Architectures Software
> +Developer's Manual". Not all of VMX's features are currently fully supported,
> +but the goal is to eventually support them all, starting with the VMX 
> features
> +which are used in practice by popular hypervisors (KVM and others).

It'd be good to provide a list of known supported features. In your current code,
people have to look at the code to understand the current status. If you can keep a
supported and verified feature list here, it'd be great.

Thanks
Kevin


RE: [PATCH 25/31] nVMX: Correct handling of idt vectoring info

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Wednesday, May 25, 2011 6:14 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 25/31] nVMX:
> Correct handling of idt vectoring info":
> > Here got one question. How about L2 has interrupt exiting disabled? That
> way
> 
> You are right, this is a bug in my code. It's a known bug - listed on my
> private bugzilla - but I didn't get around to fixing it.
> 
> As I said, most hypervisors "in the wild" do not have interrupt exiting
> disabled, so this bug will not appear in practice, but definitely it should
> be fixed, and I will.
> 
> When the nvmx patches are merged, I can can copy my private bug tracker's
> open issues to an open bug tracker (I'm hoping KVM has one?).
> 

That's fine, or at least convert your private bug list into TODO comments in
the right places in the code.

Thanks
Kevin


RE: [PATCH 25/31] nVMX: Correct handling of idt vectoring info

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:57 AM
> 
> This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the
> nested
> case.
> 
> When a guest exits while delivering an interrupt or exception, we get this
> information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
> there's nothing we need to do, because L1 will see this field in vmcs12, and
> handle it itself. However, when L2 exits and L0 handles the exit itself and
> plans to return to L2, L0 must inject this event to L2.
> 
> In the normal non-nested case, the idt_vectoring_info case is discovered after
> the exit, and the decision to inject (though not the injection itself) is made
> at that point. However, in the nested case a decision of whether to return
> to L2 or L1 also happens during the injection phase (see the previous
> patches), so in the nested case we can only decide what to do about the
> idt_vectoring_info right after the injection, i.e., in the beginning of
> vmx_vcpu_run, which is the first time we know for sure if we're staying in
> L2.
> 
> Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular
> vmx_complete_interrupts() code which queues the idt_vectoring_info for
> injection on next entry - because such injection would not be appropriate
> if we will decide to exit to L1. Rather, we just save the idt_vectoring_info
> and related fields in vmcs12 (which is a convenient place to save these
> fields). On the next entry in vmx_vcpu_run (*after* the injection phase,
> potentially exiting to L1 to inject an event requested by user space), if
> we find ourselves in L1 we don't need to do anything with those values
> we saved (as explained above). But if we find that we're in L2, or rather
> *still* at L2 (it's not nested_run_pending, meaning that this is the first
> round of L2 running after L1 having just launched it), we need to inject
> the event saved in those fields - by writing the appropriate VMCS fields.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |   30 ++
>  1 file changed, 30 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:50.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:50.0 +0300
> @@ -5804,6 +5804,8 @@ static void __vmx_complete_interrupts(st
> 
>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  {
> + if (is_guest_mode(&vmx->vcpu))
> + return;
>   __vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
> VM_EXIT_INSTRUCTION_LEN,
> IDT_VECTORING_ERROR_CODE);
> @@ -5811,6 +5813,8 @@ static void vmx_complete_interrupts(stru
> 
>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>  {
> + if (is_guest_mode(vcpu))
> + return;
>   __vmx_complete_interrupts(to_vmx(vcpu),
> vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> VM_ENTRY_INSTRUCTION_LEN,
> @@ -5831,6 +5835,21 @@ static void __noclone vmx_vcpu_run(struc
>  {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
> 
> + if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> + struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> + if (vmcs12->idt_vectoring_info_field &
> + VECTORING_INFO_VALID_MASK) {
> + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> + vmcs12->idt_vectoring_info_field);
> + vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> + vmcs12->vm_exit_instruction_len);
> + if (vmcs12->idt_vectoring_info_field &
> + VECTORING_INFO_DELIVER_CODE_MASK)
> + vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> + vmcs12->idt_vectoring_error_code);
> + }
> + }

I have one question here. What if L2 has interrupt exiting disabled? In that case
L0 is expected to inject virtual interrupts directly into L2, and simply
overwriting the interrupt info field here looks incorrect. As you said, a typical
hypervisor doesn't turn interrupt exiting off, but handling it would still be the
architecturally correct thing to do. I think you should compare the current
INTR_INFO_FIELD with the saved IDT_VECTORING info and choose the higher-priority
event when L2 has interrupt exiting disabled.
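A rough sketch of the kind of merge I have in mind (hypothetical, not from
the posted patches; the actual priority comparison is only indicated):

	/*
	 * Hypothetical: before overwriting VM_ENTRY_INTR_INFO_FIELD with the
	 * saved idt_vectoring event, check whether L0 already queued a
	 * virtual interrupt for direct injection into L2.
	 */
	u32 cur = vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
	u32 saved = vmcs12->idt_vectoring_info_field;

	if (!(cur & INTR_INFO_VALID_MASK)) {
		/* nothing queued by L0: re-inject the saved event as before */
		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, saved);
	} else if (saved & VECTORING_INFO_VALID_MASK) {
		/*
		 * Both valid: a real implementation would compare the event
		 * types/vectors, inject the higher-priority one now and keep
		 * the other pending for the next entry.
		 */
	}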

Thanks
Kevin


RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:56 AM
> 
> The code in this patch correctly emulates external-interrupt injection
> while a nested guest L2 is running.
> 
> Because of this code's relative un-obviousness, I include here a longer-than-
> usual justification for what it does - much longer than the code itself ;-)
> 
> To understand how to correctly emulate interrupt injection while L2 is
> running, let's look first at what we need to emulate: How would things look
> like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
> an interrupt, we had hardware delivering an interrupt?
> 
> Now we have L1 running on bare metal with a guest L2, and the hardware
> generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to
> 1, and
> VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below),
> what
> happens now is this: The processor exits from L2 to L1, with an external-
> interrupt exit reason but without an interrupt vector. L1 runs, with
> interrupts disabled, and it doesn't yet know what the interrupt was. Soon
> after, it enables interrupts and only at that moment, it gets the interrupt
> from the processor. when L1 is KVM, Linux handles this interrupt.
> 
> Now we need exactly the same thing to happen when that L1->L2 system runs
> on top of L0, instead of real hardware. This is how we do this:
> 
> When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
> external-interrupt exit reason (with an invalid interrupt vector), and run L1.
> Just like in the bare metal case, it likely can't deliver the interrupt to
> L1 now because L1 is running with interrupts disabled, in which case it turns
> on the interrupt window when running L1 after the exit. L1 will soon enable
> interrupts, and at that point L0 will gain control again and inject the
> interrupt to L1.
> 
> Finally, there is an extra complication in the code: when nested_run_pending,
> we cannot return to L1 now, and must launch L2. We need to remember the
> interrupt we wanted to inject (and not clear it now), and do it on the
> next exit.
> 
> The above explanation shows that the relative strangeness of the nested
> interrupt injection code in this patch, and the extra interrupt-window
> exit incurred, are in fact necessary for accurate emulation, and are not
> just an unoptimized implementation.
> 
> Let's revisit now the two assumptions made above:
> 
> If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
> does, by the way), things are simple: L0 may inject the interrupt directly
> to the L2 guest - using the normal code path that injects to any guest.
> We support this case in the code below.
> 
> If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> does), things look very different from the description above: L1 expects
> to see an exit from L2 with the interrupt vector already filled in the exit
> information, and does not expect to be interrupted again with this interrupt.
> The current code does not (yet) support this case, so we do not allow the
> VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |   36 
>  1 file changed, 36 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
> @@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
> 
>   /* exit controls */
>   nested_vmx_exit_ctls_low = 0;
> + /* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported.
> */
>  #ifdef CONFIG_X86_64
>   nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #else
> @@ -3733,9 +3734,25 @@ out:
>   return ret;
>  }
> 
> +/*
> + * In nested virtualization, check if L1 asked to exit on external 
> interrupts.
> + * For most existing hypervisors, this will always return true.
> + */
> +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> +{
> + return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> + PIN_BASED_EXT_INTR_MASK;
> +}
> +
>  static void enable_irq_window(struct kvm_vcpu *vcpu)
>  {
>   u32 cpu_based_vm_exec_control;
> + if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu))
> + /* We can get here when nested_run_pending caused
> +  * vmx_interrupt_allowed() to return false. In this case, do
> +  * nothing - the interrupt will be injected later.
> +  */

I think this is not a rare path: when the vcpu is in guest mode with L2 as the
current vmx context, this function could be invoked multiple times, since the kvm
thread can be scheduled out/in randomly.

> + return;
> 
>   cpu_based_vm_exec_control =
> vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>   cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
> @@ -3858,6 +3875,17 @@ static void vmx_set_nmi_mask(struct kvm_
> 
>  static int vmx_interrupt

RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection

2011-05-25 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Wednesday, May 25, 2011 4:40 PM
> > If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> > does), things look very different from the description above: L1 expects
> 
> Type-1 bare metal hypervisor may enable this bit, such as Xen. This bit is
> really prepared for L2 hypervisor since normally L2 hypervisor is tricky to
> touch generic interrupt logic, and thus better to not ack it until interrupt
> is enabled and then hardware will gear to the kernel interrupt handler
> automatically.
> 
> > to see an exit from L2 with the interrupt vector already filled in the exit
> > information, and does not expect to be interrupted again with this 
> > interrupt.
> > The current code does not (yet) support this case, so we do not allow the
> > VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.
> 
> Then just fill the interrupt vector field with the highest unmasked bit
> from pending vIRR.
> 

And also ack the virtual irqchip accordingly...
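A hedged sketch of what honoring that could look like on the L2->L1 exit path
(hypothetical, not from the posted patches; it leans on kvm_cpu_get_interrupt(),
which acks the virtual irqchip and returns the highest-priority pending vector):

	/* hypothetical handling of VM_EXIT_ACK_INTR_ON_EXIT */
	if (vmcs12->vm_exit_controls & VM_EXIT_ACK_INTR_ON_EXIT) {
		int vector = kvm_cpu_get_interrupt(vcpu);  /* acks irqchip */
		if (vector != -1)
			vmcs12->vm_exit_intr_info = vector |
				INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK;
	}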

Thanks
Kevin


RE: [PATCH 23/31] nVMX: Correct handling of interrupt injection

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:56 AM
> 
> The code in this patch correctly emulates external-interrupt injection
> while a nested guest L2 is running.
> 
> Because of this code's relative un-obviousness, I include here a longer-than-
> usual justification for what it does - much longer than the code itself ;-)
> 
> To understand how to correctly emulate interrupt injection while L2 is
> running, let's look first at what we need to emulate: How would things look
> like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
> an interrupt, we had hardware delivering an interrupt?
> 
> Now we have L1 running on bare metal with a guest L2, and the hardware
> generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to
> 1, and
> VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below),
> what
> happens now is this: The processor exits from L2 to L1, with an external-
> interrupt exit reason but without an interrupt vector. L1 runs, with
> interrupts disabled, and it doesn't yet know what the interrupt was. Soon
> after, it enables interrupts and only at that moment, it gets the interrupt
> from the processor. when L1 is KVM, Linux handles this interrupt.
> 
> Now we need exactly the same thing to happen when that L1->L2 system runs
> on top of L0, instead of real hardware. This is how we do this:
> 
> When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
> external-interrupt exit reason (with an invalid interrupt vector), and run L1.
> Just like in the bare metal case, it likely can't deliver the interrupt to
> L1 now because L1 is running with interrupts disabled, in which case it turns
> on the interrupt window when running L1 after the exit. L1 will soon enable
> interrupts, and at that point L0 will gain control again and inject the
> interrupt to L1.
> 
> Finally, there is an extra complication in the code: when nested_run_pending,
> we cannot return to L1 now, and must launch L2. We need to remember the
> interrupt we wanted to inject (and not clear it now), and do it on the
> next exit.
> 
> The above explanation shows that the relative strangeness of the nested
> interrupt injection code in this patch, and the extra interrupt-window
> exit incurred, are in fact necessary for accurate emulation, and are not
> just an unoptimized implementation.
> 
> Let's revisit now the two assumptions made above:
> 
> If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
> does, by the way), things are simple: L0 may inject the interrupt directly
> to the L2 guest - using the normal code path that injects to any guest.
> We support this case in the code below.
> 
> If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT (again, no hypervisor that I know
> does), things look very different from the description above: L1 expects

A type-1 bare-metal hypervisor such as Xen may enable this bit. The option to
leave it clear really exists for a type-2 (hosted) hypervisor, since it's normally
tricky for such a hypervisor to touch the generic interrupt logic; it's better not
to ack the interrupt until interrupts are enabled, at which point hardware will
steer it to the kernel interrupt handler automatically.

> to see an exit from L2 with the interrupt vector already filled in the exit
> information, and does not expect to be interrupted again with this interrupt.
> The current code does not (yet) support this case, so we do not allow the
> VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on by L1.

Then just fill the interrupt vector field with the highest unmasked bit
from pending vIRR.

> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |   36 
>  1 file changed, 36 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
> @@ -1788,6 +1788,7 @@ static __init void nested_vmx_setup_ctls
> 
>   /* exit controls */
>   nested_vmx_exit_ctls_low = 0;
> + /* Note that guest use of VM_EXIT_ACK_INTR_ON_EXIT is not supported.
> */
>  #ifdef CONFIG_X86_64
>   nested_vmx_exit_ctls_high = VM_EXIT_HOST_ADDR_SPACE_SIZE;
>  #else
> @@ -3733,9 +3734,25 @@ out:
>   return ret;
>  }
> 
> +/*
> + * In nested virtualization, check if L1 asked to exit on external 
> interrupts.
> + * For most existing hypervisors, this will always return true.
> + */
> +static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
> +{
> + return get_vmcs12(vcpu)->pin_based_vm_exec_control &
> + PIN_BASED_EXT_INTR_MASK;
> +}
> +

This could use a common wrapper similar to nested_cpu_has()...
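For instance, a wrapper in the same style could look like this (a sketch of the
suggestion only; the name is made up, not code from the patches):

	static inline bool nested_cpu_has_pin(struct vmcs12 *vmcs12, u32 bit)
	{
		/* mirrors nested_cpu_has()/nested_cpu_has2(), but for the
		 * pin-based VM-execution controls */
		return vmcs12->pin_based_vm_exec_control & bit;
	}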

Thanks,
Kevin


RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 4:06 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX:
> Exiting from L2 to L1":
> > IOW, I disagree that if L0/L1 set same bit in cr0_guest_host_mask, then
> > the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0 because L1
> > has no permission to set its bit effectively in this case.
> >...
> > this is the problem:
> >
> > (vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) !=
> > (vmcs02->guest_cr0 & vmcs12->cr0_guest_host_mask)
> 
> Sorry for arguing previously, this is a very good, and correct, point, which
> I missed. When both L0 and L1 are KVM, this didn't cause problems because
> the
> only problematic bit has been the TS bit, and when KVM wants to override this
> bit it always does it to 1.
> 
> So I've rewritten this function, based on my new understanding following your
> insights. I believe it now implements your formula *exactly*. Please take a
> look at the comments and the code, and see if you now agree with them:
> 
> /*
>  * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
>  * because L2 may have changed some cr0 bits directly
> (CRO_GUEST_HOST_MASK).
>  * This function returns the new value we should put in vmcs12.guest_cr0.
>  * It's not enough to just return the vmcs02 GUEST_CR0. Rather,
>  *  1. Bits that neither L0 nor L1 trapped, were set directly by L2 and are 
> now
>  * available in vmcs02 GUEST_CR0. (Note: It's enough to check that L0
>  * didn't trap the bit, because if L1 did, so would L0).
>  *  2. Bits that L1 asked to trap (and therefore L0 also did) could not have
>  * been modified by L2, and L1 knows it. So just leave the old value of
>  * the bit from vmcs12.guest_cr0. Note that the bit from vmcs02
> GUEST_CR0
>  * isn't relevant, because if L0 traps this bit it can set it to anything.
>  *  3. Bits that L1 didn't trap, but L0 did. L1 believes the guest could have
>  * changed these bits, and therefore they need to be updated, but L0
>  * didn't necessarily allow them to be changed in GUEST_CR0 - and
> rather
>  * put them in vmcs02 CR0_READ_SHADOW. So take these bits from
> there.
>  */
> static inline unsigned long
> vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> {
>   return
>   /*1*/   (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits)
> |
>   /*2*/   (vmcs12->guest_cr0 & vmcs12->cr0_guest_host_mask) |
>   /*3*/   (vmcs_readl(CR0_READ_SHADOW) &
> ~(vmcs12->cr0_guest_host_mask |
>   vcpu->arch.cr0_guest_owned_bits));
> }
> 
> 

This looks good to me. :-)

Thanks
Kevin


RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:53 AM
> 
> Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
> hypervisor to run its own guests.
> 
> This patch does not include some of the necessary validity checks on
> vmcs12 fields before the entry. These will appear in a separate patch
> below.
> 
> Signed-off-by: Nadav Har'El 
> ---

[...]

> +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> +{
> + struct vmcs12 *vmcs12;
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + int cpu;
> + struct saved_vmcs *saved_vmcs02;
> +
> + if (!nested_vmx_check_permission(vcpu))
> + return 1;
> + skip_emulated_instruction(vcpu);
> +
> + vmcs12 = get_vmcs12(vcpu);
> +
> + enter_guest_mode(vcpu);
> +
> + vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
> +
> + /*
> +  * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
> +  * vmcs01, on which CPU it was last loaded, and whether it was launched
> +  * (we need all these values next time we will use L1). Then recall
> +  * these values from the last time vmcs02 was used.
> +  */
> + saved_vmcs02 = nested_get_current_vmcs02(vmx);
> + if (!saved_vmcs02)
> + return -ENOMEM;
> +

We shouldn't return an error after guest mode has already been updated. Otherwise,
move enter_guest_mode to a later point...
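Something along these lines, as a rough sketch of the suggested reordering
(abbreviated from the function quoted above, not the actual patch):

	/*
	 * Sketch: do the allocation that can fail before flipping the vcpu
	 * into guest mode, so an error return leaves no half-updated state.
	 */
	saved_vmcs02 = nested_get_current_vmcs02(vmx);
	if (!saved_vmcs02)
		return -ENOMEM;		/* still in L1 mode, nothing to undo */

	enter_guest_mode(vcpu);
	vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);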

Thanks
Kevin


RE: [PATCH 22/31] nVMX: Deciding if L0 or L1 should handle an L2 exit

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:55 AM
> 
> This patch contains the logic of whether an L2 exit should be handled by L0
> and then L2 should be resumed, or whether L1 should be run to handle this
> exit (using the nested_vmx_vmexit() function of the previous patch).
> 
> The basic idea is to let L1 handle the exit only if it actually asked to
> trap this sort of event. For example, when L2 exits on a change to CR0,
> we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
> bit which changed; If it did, we exit to L1. But if it didn't it means that
> it is we (L0) that wished to trap this event, so we handle it ourselves.
> 
> The next two patches add additional logic of what to do when an interrupt or
> exception is injected: Does L0 need to do it, should we exit to L1 to do it,
> or should we resume L2 and keep the exception to be injected later.
> 
> We keep a new flag, "nested_run_pending", which can override the decision of
> which should run next, L1 or L2. nested_run_pending=1 means that we *must*
> run
> L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
> and therefore expects L2 to be run (and perhaps be injected with an event it
> specified, etc.). Nested_run_pending is especially intended to avoid switching
> to L1 in the injection decision-point described above.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |  256 ++-
>  1 file changed, 255 insertions(+), 1 deletion(-)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
> @@ -351,6 +351,8 @@ struct nested_vmx {
>   /* Saving the VMCS that we used for running L1 */
>   struct saved_vmcs saved_vmcs01;
>   u64 vmcs01_tsc_offset;
> + /* L2 must run next, and mustn't decide to exit to L1. */
> + bool nested_run_pending;
>   /*
>* Guest pages referred to in vmcs02 with host-physical pointers, so
>* we must keep them pinned while L2 runs.
> @@ -870,6 +872,20 @@ static inline bool nested_cpu_has2(struc
>   (vmcs12->secondary_vm_exec_control & bit);
>  }
> 
> +static inline bool nested_cpu_has_virtual_nmis(struct kvm_vcpu *vcpu)
> +{
> + return is_guest_mode(vcpu) &&
> + (get_vmcs12(vcpu)->pin_based_vm_exec_control &
> + PIN_BASED_VIRTUAL_NMIS);
> +}

Any reason to add the guest mode check here? I didn't see such a check in your
earlier nested_cpu_has_xxx helpers. It would be clearer to use the existing
nested_cpu_has_xxx along with an explicit is_guest_mode check, which keeps such
usage consistent.

> +
> +static inline bool is_exception(u32 intr_info)
> +{
> + return (intr_info & (INTR_INFO_INTR_TYPE_MASK |
> INTR_INFO_VALID_MASK))
> + == (INTR_TYPE_HARD_EXCEPTION | INTR_INFO_VALID_MASK);
> +}
> +
> +static void nested_vmx_vmexit(struct kvm_vcpu *vcpu);
>  static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
>   struct vmcs12 *vmcs12,
>   u32 reason, unsigned long qualification);
> @@ -5281,6 +5297,232 @@ static int (*kvm_vmx_exit_handlers[])(st
>  static const int kvm_vmx_max_exit_handlers =
>   ARRAY_SIZE(kvm_vmx_exit_handlers);
> 
> +/*
> + * Return 1 if we should exit from L2 to L1 to handle an MSR access access,
> + * rather than handle it ourselves in L0. I.e., check whether L1 expressed
> + * disinterest in the current event (read or write a specific MSR) by using 
> an
> + * MSR bitmap. This may be the case even when L0 doesn't use MSR bitmaps.
> + */
> +static bool nested_vmx_exit_handled_msr(struct kvm_vcpu *vcpu,
> + struct vmcs12 *vmcs12, u32 exit_reason)
> +{
> + u32 msr_index = vcpu->arch.regs[VCPU_REGS_RCX];
> + gpa_t bitmap;
> +
> + if (!nested_cpu_has(get_vmcs12(vcpu),
> CPU_BASED_USE_MSR_BITMAPS))
> + return 1;
> +
> + /*
> +  * The MSR_BITMAP page is divided into four 1024-byte bitmaps,
> +  * for the four combinations of read/write and low/high MSR numbers.
> +  * First we need to figure out which of the four to use:
> +  */
> + bitmap = vmcs12->msr_bitmap;
> + if (exit_reason == EXIT_REASON_MSR_WRITE)
> + bitmap += 2048;
> + if (msr_index >= 0xc000) {
> + msr_index -= 0xc000;
> + bitmap += 1024;
> + }
> +
> + /* Then read the msr_index'th bit from this bitmap: */
> + if (msr_index < 1024*8) {
> + unsigned char b;
> + kvm_read_guest(vcpu->kvm, bitmap + msr_index/8, &b, 1);
> + return 1 & (b >> (msr_index & 7));
> + } else
> + return 1; /* let L1 handle the wrong parameter */
> +}
> +
> +/*
> + * Return 1 if we should exit from L2 to L1 to handle a CR access exit,
> + * rather than handle it ourselves in L0. I.e., check if L1 wanted to
> + * intercept (via guest_host_mask etc.) the current event.
> + */
> +static bool nested_v

RE: [PATCH 21/31] nVMX: vmcs12 checks on nested entry

2011-05-25 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Wednesday, May 25, 2011 1:38 PM
> 
> On Wed, May 25, 2011, Tian, Kevin wrote about "RE: [PATCH 21/31] nVMX:
> vmcs12 checks on nested entry":
> > > + if (vmcs12->launch_state == launch) {
> > > + nested_vmx_failValid(vcpu,
> > > + launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
> > > +: VMXERR_VMRESUME_NONLAUNCHED_VMCS);
> > > + return 1;
> > > + }
> >
> > from SDM:
> > ELSIF (VMLAUNCH and launch state of current VMCS is not "clear")
> > THEN VMfailValid(VMLAUNCH with non-clear VMCS);
> > ELSIF (VMRESUME and launch state of current VMCS is not "launched")
> > THEN VMfailValid(VMRESUME with non-launched VMCS);
> >
> > So it's legal to use VMLAUNCH on a launched VMCS. However here you
> > changes this behavior. On the other hand, do you want to add a 'clear' state
> > along with L1 VMCLEAR to catch the failure here?
> 
> I don't understand: I always understood the spec to mean that "clear" and
> "launched" the two opposite states of the "launch state" bit? If it isn't,
> what does "clear" mean?
> 
> Is it really "legal to use a VMLAUNCH on a launched VMCS"?
> If it is, why does KVM, for example, go to great lengths to VMLAUNCH the
> first time, and VMRESUME all subsequent times?
> 

You're correct. I got my head messed up on this point. :-)

Thanks
Kevin


RE: [PATCH 21/31] nVMX: vmcs12 checks on nested entry

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:55 AM
> 
> This patch adds a bunch of tests of the validity of the vmcs12 fields,
> according to what the VMX spec and our implementation allows. If fields
> we cannot (or don't want to) honor are discovered, an entry failure is
> emulated.
> 
> According to the spec, there are two types of entry failures: If the problem
> was in vmcs12's host state or control fields, the VMLAUNCH instruction simply
> fails. But a problem is found in the guest state, the behavior is more
> similar to that of an exit.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/include/asm/vmx.h |8 ++
>  arch/x86/kvm/vmx.c |   94 +++
>  2 files changed, 102 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
> @@ -870,6 +870,10 @@ static inline bool nested_cpu_has2(struc
>   (vmcs12->secondary_vm_exec_control & bit);
>  }
> 
> +static void nested_vmx_entry_failure(struct kvm_vcpu *vcpu,
> + struct vmcs12 *vmcs12,
> + u32 reason, unsigned long qualification);
> +
>  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
>  {
>   int i;
> @@ -6160,6 +6164,79 @@ static int nested_vmx_run(struct kvm_vcp
> 
>   vmcs12 = get_vmcs12(vcpu);
> 
> + /*
> +  * The nested entry process starts with enforcing various prerequisites
> +  * on vmcs12 as required by the Intel SDM, and act appropriately when
> +  * they fail: As the SDM explains, some conditions should cause the
> +  * instruction to fail, while others will cause the instruction to seem
> +  * to succeed, but return an EXIT_REASON_INVALID_STATE.
> +  * To speed up the normal (success) code path, we should avoid checking
> +  * for misconfigurations which will anyway be caught by the processor
> +  * when using the merged vmcs02.
> +  */
> + if (vmcs12->launch_state == launch) {
> + nested_vmx_failValid(vcpu,
> + launch ? VMXERR_VMLAUNCH_NONCLEAR_VMCS
> +: VMXERR_VMRESUME_NONLAUNCHED_VMCS);
> + return 1;
> + }

from SDM:
ELSIF (VMLAUNCH and launch state of current VMCS is not "clear")
THEN VMfailValid(VMLAUNCH with non-clear VMCS);
ELSIF (VMRESUME and launch state of current VMCS is not "launched")
THEN VMfailValid(VMRESUME with non-launched VMCS);

So it's legal to use VMLAUNCH on a launched VMCS. However, here you change this
behavior. On the other hand, do you want to add a 'clear' state along with L1
VMCLEAR to catch the failure here?

Thanks
Kevin


RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:54 AM
> 
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
> 
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in a separate patch below.
> 
> Signed-off-by: Nadav Har'El 

> +/*
> + * A part of what we need to when the nested L2 guest exits and we want to
> + * run its L1 parent, is to reset L1's guest state to the host state 
> specified
> + * in vmcs12.
> + * This function is to be called not only on normal nested exit, but also on
> + * a nested entry failure, as explained in Intel's spec, 3B.23.7 ("VM-Entry
> + * Failures During or After Loading Guest State").
> + * This function should be called when the active VMCS is L1's (vmcs01).
> + */
> +void load_vmcs12_host_state(struct kvm_vcpu *vcpu, struct vmcs12
> *vmcs12)
> +{
> + if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
> + vcpu->arch.efer = vmcs12->host_ia32_efer;
> + if (vmcs12->vm_exit_controls & VM_EXIT_HOST_ADDR_SPACE_SIZE)
> + vcpu->arch.efer |= (EFER_LMA | EFER_LME);
> + else
> + vcpu->arch.efer &= ~(EFER_LMA | EFER_LME);
> + vmx_set_efer(vcpu, vcpu->arch.efer);
> +
> + if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_PAT)
> + vmcs_write64(GUEST_IA32_PAT, vmcs12->host_ia32_pat);
> +
> + kvm_register_write(vcpu, VCPU_REGS_RSP, vmcs12->host_rsp);
> + kvm_register_write(vcpu, VCPU_REGS_RIP, vmcs12->host_rip);
> + /*
> +  * Note that calling vmx_set_cr0 is important, even if cr0 hasn't
> +  * actually changed, because it depends on the current state of
> +  * fpu_active (which may have changed).
> +  * Note that vmx_set_cr0 refers to efer set above.
> +  */
> + kvm_set_cr0(vcpu, vmcs12->host_cr0);
> + /*
> +  * If we did fpu_activate()/fpu_deactivate() during L2's run, we need
> +  * to apply the same changes to L1's vmcs. We just set cr0 correctly,
> +  * but we also need to update cr0_guest_host_mask and
> exception_bitmap.
> +  */
> + update_exception_bitmap(vcpu);
> + vcpu->arch.cr0_guest_owned_bits = (vcpu->fpu_active ? X86_CR0_TS : 0);
> + vmcs_writel(CR0_GUEST_HOST_MASK,
> ~vcpu->arch.cr0_guest_owned_bits);
> +
> + /*
> +  * Note that CR4_GUEST_HOST_MASK is already set in the original
> vmcs01
> +  * (KVM doesn't change it)- no reason to call set_cr4_guest_host_mask();
> +  */
> + vcpu->arch.cr4_guest_owned_bits =
> ~vmcs_readl(CR4_GUEST_HOST_MASK);
> + kvm_set_cr4(vcpu, vmcs12->host_cr4);
> +
> + /* shadow page tables on either EPT or shadow page tables */
> + kvm_set_cr3(vcpu, vmcs12->host_cr3);
> + kvm_mmu_reset_context(vcpu);
> +
> + if (enable_vpid) {
> + /*
> +  * Trivially support vpid by letting L2s share their parent
> +  * L1's vpid. TODO: move to a more elaborate solution, giving
> +  * each L2 its own vpid and exposing the vpid feature to L1.
> +  */
> + vmx_flush_tlb(vcpu);
> + }

How about the SYSENTER and PERF_GLOBAL_CTRL MSRs? At least a TODO comment
here would make the whole load process complete. :-)

Also, isn't it saner to update vmcs01's guest segment info based on vmcs12's
host segment info? Though you can assume the environment in L1 doesn't change
from VMLAUNCH/VMRESUME to the VMEXIT handler, it's architecturally cleaner
to load those segment fields according to L1's wishes.

Thanks
Kevin


RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Tuesday, May 24, 2011 9:43 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 20/31] nVMX:
> Exiting from L2 to L1":
> > > +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> > > +{
> > > + /*
> > > +  * As explained above, we take a bit from GUEST_CR0 if we allowed
> the
> > > +  * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits),
> or
> > > +  * if we did trap it - if we did so because L1 asked to trap this bit
> > > +  * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but
> L1
> > > +  * didn't expect us to trap) we read from CR0_READ_SHADOW.
> > > +  */
> > > + unsigned long guest_cr0_bits =
> > > + vcpu->arch.cr0_guest_owned_bits |
> vmcs12->cr0_guest_host_mask;
> > > + return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> > > +(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> > > +}
> >
> > Hi, Nadav,
> >
> > Not sure whether I get above operation wrong.
> 
> This is one of the trickiest functions in nested VMX, which is why I added
> 15 lines of comments (!) on just two statements of code.

I read the comment carefully, and the scenario I described is not covered there.

> 
> > But it looks not exactly correct to me
> > in a glimpse. Say a bit set both in L0/L1's cr0_guest_host_mask. In such 
> > case
> that
> > bit from vmcs12_GUEST_CR0 resides in vmcs02_CR0_READ_SHADOW,
> however above
> > operation will make vmcs02_GUEST_CR0 bit returned instead.
> 
> This behavior is correct: If a bit is set in L1's cr0_guest_host_mask (and
> in particular, if it is set in both L0's and L1's), we always exit to L1 when
> L2 changes this bit, and this bit cannot change while L2 is running, so
> naturally after the run vmcs02.guest_cr0 and vmcs12.guest_cr0 are still
> identical in that be.

Are you sure this is the case? vmcs12.guest_cr0 is effectively the GUEST_CR0 value
that L1 tried to establish, which is why you use vmx_set_cr0(vcpu, vmcs12->guest_cr0)
when preparing vmcs02. If L0 has a bit set in its own cr0_guest_host_mask, the
corresponding bit of vmcs12.guest_cr0 will be cached in vmcs02.cr0_read_shadow
anyway. This is not related to whether L2 changes that bit.

IOW, I disagree that if L0 and L1 set the same bit in cr0_guest_host_mask, then
the bit is identical in vmcs02.guest_cr0 and vmcs12.guest_cr0, because L1 has no
permission to set its bit effectively in this case.
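To make the disagreement concrete, a hypothetical illustration (mine, not from
the patches), using the TS bit that comes up elsewhere in this thread: suppose
both L0 and L1 trap CR0.TS, L1 wrote vmcs12.guest_cr0 with TS clear, but L0
deactivated the FPU while running L2 and therefore forced TS set:

	vmcs12.cr0_guest_host_mask : TS set    (L1 traps TS)
	vmcs12.guest_cr0           : TS clear  (what L1 asked for)
	vmcs02.GUEST_CR0           : TS set    (what L0 actually ran with)

For such a bit, vmcs02.GUEST_CR0 is not a valid source for vmcs12.guest_cr0 on
exit; the old vmcs12.guest_cr0 value has to be kept, which is what the formula
I suggest below does explicitly.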

> Copying that bit from vmcs02_CR0_READ_SHADOW, like you suggested, would
> be
> completely wrong in this case: When L1 set a bit in cr0_guest_host_mask,
> the vmcs02->cr0_read_shadow is vmcs12->cr0_read_shadow (see
> nested_read_cr0),
> and is just a pretense that L1 set up for L2 - it is NOT the real bit of
> guest_cr0, so copying it into guest_cr0 would be wrong.

So I'm talking about preserving that bit from vmcs12.guest_cr0 when it's set
in vmcs12.cr0_guest_host_mask, which is the natural outcome.

> 
> Note that this function is completely different from nested_read_cr0 (the
> new name), which behaves similar to what you suggested but serves a
> completely
> different (and in some respect, opposite) function.
> 
> I think my comments in the code are clearer than what I just wrote here, so
> please take a look at them again, and let me know if you find any errors.
> 
> > Instead of constructing vmcs12_GUEST_CR0 completely from
> vmcs02_GUEST_CR0,
> > why not just updating bits which can be altered while keeping the rest bits
> from
> > vmcs12_GUEST_CR0? Say something like:
> >
> > vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged
> bits */
> > vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) &
> vcpu->arch.cr0_guest_owned_bits) |
> > (vmcs_readl(CR0_READ_SHADOW) &
> ~( vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask))
> 
> I guess I could do something like this, but do you think it's clearer?
> I don't. Behind all the details, my formula emphasises that MOST cr0 bits
> can be just copied from vmcs02 to vmcs12 as is - and we only have to do
> something strange for special bits - where L0 wanted to trap but L1 didn't.
> In your formula, it looks like there are 3 different cases instead of 2.

But my formula is clearer, because it sticks to what cr0_guest_host_mask
implies: you only need to update the cr0 bits that L2 can modify without a
trap, and simply keep the rest.
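For concreteness, wrapped up as a function my suggestion would look roughly
like the sketch below (untested; the function name is mine and only for
illustration, not a request to add a second helper):

/*
 * Sketch: keep the bits L1 traps (cr0_guest_host_mask) exactly as L1 last
 * wrote them in vmcs12->guest_cr0, and refresh only the bits L2 could have
 * changed - either directly (cr0_guest_owned_bits), or with L0 trapping the
 * write and reflecting it in CR0_READ_SHADOW on L1's behalf.
 */
static inline unsigned long
vmcs12_guest_cr0_alt(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
{
        unsigned long l2_owned = vcpu->arch.cr0_guest_owned_bits;
        unsigned long l1_trapped = vmcs12->cr0_guest_host_mask;

        return (vmcs12->guest_cr0 & l1_trapped) |
               (vmcs_readl(GUEST_CR0) & l2_owned) |
               (vmcs_readl(CR0_READ_SHADOW) & ~(l2_owned | l1_trapped));
}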

> 
> In any case, your formula is definitely not more correct, because 

RE: [PATCH] new version of loaded_vmcs patch

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 24, 2011 8:26 PM
> 
> Hi Avi, here is a updated version of the loaded_vmcs patch which you asked me
> to send before the rest of the nvmx patches.
> 
> Please let me know when you'd like me to send you an updated version of
> the rest of the patches.
> 
> 
> Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.
> 
> In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
> because (at least in theory) the processor might not have written all of its
> content back to memory. Since a patch from June 26, 2008, this is done using
> a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.
> 
> The problem is that with nested VMX, we no longer have the concept of a
> vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
> L2s), and each of those may be have been last loaded on a different cpu.
> 
> So instead of linking the vcpus, we link the VMCSs, using a new structure
> loaded_vmcs. This structure contains the VMCS, and the information
> pertaining
> to its loading on a specific cpu (namely, the cpu number, and whether it
> was already launched on this cpu once). In nested we will also use the same
> structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
> currently active VMCS.
> 
> Signed-off-by: Nadav Har'El 

Acked-by: Kevin Tian 

Thanks
Kevin

> ---
>  arch/x86/kvm/vmx.c |  150 ---
>  1 file changed, 86 insertions(+), 64 deletions(-)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-24 15:12:22.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-24 15:12:22.0 +0300
> @@ -116,6 +116,18 @@ struct vmcs {
>   char data[0];
>  };
> 
> +/*
> + * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
> + * remember whether it was VMLAUNCHed, and maintain a linked list of all
> VMCSs
> + * loaded on this CPU (so we can clear them if the CPU goes down).
> + */
> +struct loaded_vmcs {
> + struct vmcs *vmcs;
> + int cpu;
> + int launched;
> + struct list_head loaded_vmcss_on_cpu_link;
> +};
> +
>  struct shared_msr_entry {
>   unsigned index;
>   u64 data;
> @@ -124,9 +136,7 @@ struct shared_msr_entry {
> 
>  struct vcpu_vmx {
>   struct kvm_vcpu   vcpu;
> - struct list_head  local_vcpus_link;
>   unsigned long host_rsp;
> - int   launched;
>   u8fail;
>   u8cpl;
>   bool  nmi_known_unmasked;
> @@ -140,7 +150,14 @@ struct vcpu_vmx {
>   u64   msr_host_kernel_gs_base;
>   u64   msr_guest_kernel_gs_base;
>  #endif
> - struct vmcs  *vmcs;
> + /*
> +  * loaded_vmcs points to the VMCS currently used in this vcpu. For a
> +  * non-nested (L1) guest, it always points to vmcs01. For a nested
> +  * guest (L2), it points to a different VMCS.
> +  */
> + struct loaded_vmcsvmcs01;
> + struct loaded_vmcs   *loaded_vmcs;
> + bool  __launched; /* temporary, used in
> vmx_vcpu_run */
>   struct msr_autoload {
>   unsigned nr;
>   struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
> @@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
> 
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> -static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
> +/*
> + * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is
> needed
> + * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded
> on it.
> + */
> +static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
>  static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
> 
>  static unsigned long *vmx_io_bitmap_a;
> @@ -501,6 +522,13 @@ static void vmcs_clear(struct vmcs *vmcs
>  vmcs, phys_addr);
>  }
> 
> +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
> +{
> + vmcs_clear(loaded_vmcs->vmcs);
> + loaded_vmcs->cpu = -1;
> + loaded_vmcs->launched = 0;
> +}
> +
>  static void vmcs_load(struct vmcs *vmcs)
>  {
>   u64 phys_addr = __pa(vmcs);
> @@ -514,25 +542,24 @@ static void vmcs_load(struct vmcs *vmcs)
>  vmcs, phys_addr);
>  }
> 
> -static void __vcpu_clear(void *arg)
> +static void __loaded_vmcs_clear(void *arg)
>  {
> - struct vcpu_vmx *vmx = arg;
> + struct loaded_vmcs *loaded_vmcs = arg;
>   int cpu = raw_smp_processor_id();
> 
> - if (vmx->vcpu.cpu == cpu)
> - vmcs_clear(vmx->vmcs);
> - if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
> + if (loaded_vmcs->cpu != cpu)
> + return; /* vcpu migration can race with cpu offline */
> + if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
>   per_cpu(current_vmcs, cpu) = NULL;
> -

RE: [PATCH 20/31] nVMX: Exiting from L2 to L1

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:54 AM
> 
> This patch implements nested_vmx_vmexit(), called when the nested L2 guest
> exits and we want to run its L1 parent and let it handle this exit.
> 
> Note that this will not necessarily be called on every L2 exit. L0 may decide
> to handle a particular exit on its own, without L1's involvement; In that
> case, L0 will handle the exit, and resume running L2, without running L1 and
> without calling nested_vmx_vmexit(). The logic for deciding whether to handle
> a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
> will appear in a separate patch below.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |  257
> +++
>  1 file changed, 257 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
> @@ -6203,6 +6203,263 @@ static int nested_vmx_run(struct kvm_vcp
>   return 1;
>  }
> 
> +/*
> + * On a nested exit from L2 to L1, vmcs12.guest_cr0 might not be up-to-date
> + * because L2 may have changed some cr0 bits directly (see
> CRO_GUEST_HOST_MASK)
> + * without L0 trapping the change and updating vmcs12.
> + * This function returns the value we should put in vmcs12.guest_cr0. It's 
> not
> + * enough to just return the current (vmcs02) GUEST_CR0 - that may not be
> the
> + * guest cr0 that L1 thought it was giving its L2 guest; It is possible that
> + * L1 wished to allow its guest to set some cr0 bit directly, but we (L0) 
> asked
> + * to trap this change and instead set just the read shadow bit. If this is 
> the
> + * case, we need to copy these read-shadow bits back to vmcs12.guest_cr0,
> where
> + * L1 believes they already are.
> + */
> +static inline unsigned long
> +vmcs12_guest_cr0(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{
> + /*
> +  * As explained above, we take a bit from GUEST_CR0 if we allowed the
> +  * guest to modify it untrapped (vcpu->arch.cr0_guest_owned_bits), or
> +  * if we did trap it - if we did so because L1 asked to trap this bit
> +  * (vmcs12->cr0_guest_host_mask). Otherwise (bits we trapped but L1
> +  * didn't expect us to trap) we read from CR0_READ_SHADOW.
> +  */
> + unsigned long guest_cr0_bits =
> + vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask;
> + return (vmcs_readl(GUEST_CR0) & guest_cr0_bits) |
> +(vmcs_readl(CR0_READ_SHADOW) & ~guest_cr0_bits);
> +}

Hi, Nadav,

Not sure whether I've misread the operation above, but at a glance it does
not look exactly correct to me. Say a bit is set in both L0's and L1's
cr0_guest_host_mask. In that case the bit L1 wrote into vmcs12_GUEST_CR0
resides in vmcs02_CR0_READ_SHADOW, yet the operation above returns the
vmcs02_GUEST_CR0 bit instead.

Instead of constructing vmcs12_GUEST_CR0 entirely from vmcs02_GUEST_CR0, why
not update only the bits that can be altered and keep the remaining bits from
vmcs12_GUEST_CR0? Something like:

vmcs12->guest_cr0 &= vmcs12->cr0_guest_host_mask; /* keep unchanged bits */
vmcs12->guest_cr0 |= (vmcs_readl(GUEST_CR0) & vcpu->arch.cr0_guest_owned_bits) |
	(vmcs_readl(CR0_READ_SHADOW) &
	 ~(vcpu->arch.cr0_guest_owned_bits | vmcs12->cr0_guest_host_mask));

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Tuesday, May 24, 2011 7:37 PM
> 
> On 05/24/2011 02:30 PM, Tian, Kevin wrote:
> > >
> > >  We don't do that.  vcpu migration calls vcpu_clear() with interrupts
> > >  enabled, which then calls smp_call_function_single(), which calls
> > >  __vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
> > >  called from interrupts disabled (and calls __vcpu_clear() directly).
> > >
> >
> > OK, that's clear to me now.
> 
> Are there still open issues about the patch?
> 
> (Nadav, please post patches in the future in new threads so they're
> easier to find)
> 

I'm fine with this patch, except that Nadav needs to clarify the comment in
__loaded_vmcs_clear (regarding the 'cpu migration' and 'cpu offline' part,
which I replied to in another mail).

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Tuesday, May 24, 2011 7:27 PM
> 
> On 05/24/2011 02:20 PM, Tian, Kevin wrote:
> > >  I don't think it's possible.  Both calls are done with interrupts 
> > > disabled.
> >
> > If that's the case then there's another potential issue. Deadlock may happen
> > when calling smp_call_function_single with interrupt disabled.
> 
> We don't do that.  vcpu migration calls vcpu_clear() with interrupts
> enabled, which then calls smp_call_function_single(), which calls
> __vcpu_clear() with interrupts disabled.  vmclear_local_vcpus() is
> called from interrupts disabled (and calls __vcpu_clear() directly).
> 

OK, that's clear to me now. 

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Tuesday, May 24, 2011 7:06 PM
> 
> On 05/24/2011 11:20 AM, Tian, Kevin wrote:
> > >
> > >  The (vmx->cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally
> never
> > >  happen: In the cpu offline path, we only call it for the loaded_vmcss 
> > > which
> > >  we know for sure are loaded on the current cpu. In the cpu migration
> path,
> > >  loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which
> ensures
> > >  that
> > >  equality.
> > >
> > >  But, there can be a race condition (this was actually explained to me a
> while
> > >  back by Avi - I never seen this happening in practice): Imagine that cpu
> > >  migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
> > >  VMCLEAR this vmcs. But before that old CPU gets a chance to act on that
> IPI,
> > >  a decision is made to take it offline, and all loaded_vmcs loaded on it
> > >  (including the one in question) are cleared. When that CPU acts on this
> IPI,
> > >  it notices that vmx->cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
> > >  anything (in the new version of the code, I made this more explicit, by
> > >  returning immediately in this case).
> >
> > the reverse also holds true. Right between the point where cpu_offline hits
> > a loaded_vmcs and the point where it calls __loaded_vmcs_clear, it's 
> > possible
> > that the vcpu is migrated to another cpu, and it's likely that migration 
> > path
> > (vmx_vcpu_load) has invoked loaded_vmcs_clear but hasn't delete this vmcs
> > from old cpu's linked list. This way later when __loaded_vmcs_clear is
> > invoked on the offlined cpu, there's still chance to observe cpu as -1.
> 
> I don't think it's possible.  Both calls are done with interrupts disabled.

If that's the case, then there's another potential issue: a deadlock may
happen when calling smp_call_function_single with interrupts disabled.

Thanks
Kevin


RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Tuesday, May 24, 2011 5:45 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 18/31] nVMX:
> Implement VMLAUNCH and VMRESUME":
> > > + /*
> > > +  * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02).
> Remember
> > > +  * vmcs01, on which CPU it was last loaded, and whether it was
> launched
> > > +  * (we need all these values next time we will use L1). Then recall
> > > +  * these values from the last time vmcs02 was used.
> > > +  */
> > > + saved_vmcs02 = nested_get_current_vmcs02(vmx);
> > > + if (!saved_vmcs02)
> > > + return -ENOMEM;
> > > +
> > > + cpu = get_cpu();
> > > + vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
> > > + vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
> > > + vmx->nested.saved_vmcs01.launched = vmx->launched;
> > > + vmx->vmcs = saved_vmcs02->vmcs;
> > > + vcpu->cpu = saved_vmcs02->cpu;
> >
> > this may be another valid reason for your check on cpu_online in your
> > latest [08/31] local_vcpus_link fix, since cpu may be offlined after
> > this assignment. :-)
> 
> I believe that wrapping this part of the code with get_cpu()/put_cpu()
> protected me from these kinds of race conditions.

you're right.

> 
> By the way, please note that this part of the code was changed after my
> latest loaded_vmcs overhaul. It now looks like this:
> 
>   vmcs02 = nested_get_current_vmcs02(vmx);
>   if (!vmcs02)
>   return -ENOMEM;
> 
>   cpu = get_cpu();
>   vmx->loaded_vmcs = vmcs02;
>   vmx_vcpu_put(vcpu);
>   vmx_vcpu_load(vcpu, cpu);
>   vcpu->cpu = cpu;
>   put_cpu();
> 
> (if Avi gives me the green light, I'll send the entire, up-to-date, patch set
> again).

Generally your new patch looks good.

> 
> > > + vmcs12->launch_state = 1;
> > > +
> > > + prepare_vmcs02(vcpu, vmcs12);
> >
> > Since prepare_vmcs may fail, add a check here and move launch_state
> > assignment after its success?
> 
> prepare_vmcs02() cannot fail. All the checks that need to be done on vmcs12
> are done before calling it, in nested_vmx_run().
> 
> Currently, there's a single case where prepare_vmcs02 "fails" when it fails
> to access apic_access_addr memory. This is wrong - the check should have
> been
> done earlier. I'll fix that, and make prepare_vmcs02() void.
> 

then no problem, as long as you keep this choice clear.

Thanks
Kevin


RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 24, 2011 5:19 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 17/31] nVMX:
> Prepare vmcs02 from vmcs01 and vmcs12":
> > > +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
> > > +{
> > > + return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
> > > + (fields->cr4_read_shadow &
> fields->cr4_guest_host_mask);
> > > +}
> > > +
> >
> > will guest_ prefix look confusing here? The 'guest' has a broad range which
> makes
> > above two functions look like they can be used in non-nested case. Should we
> stick
> > to nested_prefix for nested specific facilities?
> 
> I don't know, I thought it made calls like
> 
>   vmcs_writel(CR0_READ_SHADOW, guest_readable_cr0(vmcs12));
> 
> readable, and the comments (and the parameters) make it obvious it's for
> nested only.
> 
> I now renamed these functions nested_read_cr0(), nested_read_cr4() - I hope
> you like these names better.

yes.

> 
> > > + if (is_guest_mode(&vmx->vcpu))
> > > + vmx->vcpu.arch.cr4_guest_owned_bits &=
> > > +
> ~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;
> >
> > why not is_nested_mode()? :-P
> 
> I assume you're wondering why the function is called is_guest_mode(), and
> not is_nested_mode()?

yes

> 
> This name was chosen by Avi Kivity in November last year, for the function
> previously introduced by Joerg Roedel. My original code (before Joerg added
> this function to x86.c) indeed used the term "nested_mode", not
> "guest_mode".
> 
> In January, I pointed to the possibility of confusion between the new
> is_guest_mode() and other things called "guest mode", and Avi Kivity said
> he will rename it to is_nested_guest() - see
> http://lkml.indiana.edu/hypermail/linux/kernel/1101.1/01418.html
> But as you can see, he never did this renaming.
> 
> That being said, after half a year, I got used to the name is_guest_mode(),
> and am no longer convinced it should be changed. It checks whether the vcpu
> (not the underlying CPU) is in Intel-SDM-terminology "guest mode". Just like
> is_long_mode() checks if the vcpu is in long mode. So I'm fine with leaving
> its current name.

well, it's a small issue, and I'm fine with leaving it as is, though I don't
like 'guest' here. :-)

> 
> > > +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12
> *vmcs12)
> > > +{
> >...
> > > + if (!vmx->rdtscp_enabled)
> > > + exec_control &= ~SECONDARY_EXEC_RDTSCP;
> > > + /* Take the following fields only from vmcs12 */
> > > + exec_control &=
> ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
> > > + if (nested_cpu_has(vmcs12,
> > > +
> CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
> > > + exec_control |=
> vmcs12->secondary_vm_exec_control;
> >
> > should this 2nd exec_control be merged in clear case-by-case flavor?
> >
> > what about L0 sets "virtualize x2APIC" bit while L1 doesn't?
> >
> > Or what about L0 disables EPT while L1 sets it?
> >
> > I think it's better to scrutinize every 2nd exec_control feature with a
> > clear policy:
> > - whether we want to use the stricter policy which is only set when both L0
> and
> > L1 set it
> > - whether we want to use L1 setting absolutely regardless of L0 setting like
> > what you did for virtualize APIC access
> 
> Please note that most of the examples you give cannot happen in practice,
> because we tell L1 (via MSR) which features it is allowed to use, and we
> fail entry if it tries to use disallowed features (before ever reaching
> the merge code you're commenting on). So we don't allow L1, for example,
> to use the EPT feature (and when nested-EPT support is added, we won't
> allow L1 to use EPT if L0 didn't). The general thinking was that for most
> fields that we do explicitly allow, "OR" is the right choice.

This really depends on the meaning of each control bit. To achieve the
strictest setting between L0 and L1, sometimes you want AND and sometimes you
want OR.

From a design point of view, it's better not to rely on such implicit
assumptions made elsewhere; just make the merge clean and correct here. Also,
your argument doesn't cover the case where L0 sets bits that are not exposed
to L1 via the MSR. For example, as I said earlier, what if L0 sets "virtualize
x2APIC mode" while it is not enabled by, or not exposed to, L1? With OR you
then unconditionally enable that mode for L2, while L1 has no logic to handle
it.

I'd like to see a clean per-bit policy for the known control bits here, even
a strict one that incurs more VM exits, which can be optimized in the future.
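To make the suggestion concrete, a case-by-case merge could look roughly like
the sketch below. This is only an illustration of the policy, not a patch:
which bit belongs in which bucket is exactly the open question, and the
x2APIC constant name is just a stand-in for whatever the control is called in
the tree.

/*
 * Illustration only: merge the secondary exec controls bit by bit instead
 * of OR-ing vmcs12's value wholesale on top of what L0 already set.
 */
static u32 nested_merge_secondary_exec(u32 l0_exec, struct vmcs12 *vmcs12)
{
        u32 l1_exec = nested_cpu_has(vmcs12,
                        CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) ?
                vmcs12->secondary_vm_exec_control : 0;
        u32 exec = l0_exec;

        /* "AND" bit: only enabled for L2 if both L0 and L1 enabled it */
        if (!(l1_exec & SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE))
                exec &= ~SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;

        /* vmcs12-only bit, as prepare_vmcs02 already does for this one */
        exec &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
        exec |= l1_exec & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;

        return exec;
}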

> 
> I'll add this to my bugzilla, and think about it again later.
>

Thanks
Kevin


RE: [PATCH 18/31] nVMX: Implement VMLAUNCH and VMRESUME

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:53 AM
> 
> Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
> hypervisor to run its own guests.
> 
> This patch does not include some of the necessary validity checks on
> vmcs12 fields before the entry. These will appear in a separate patch
> below.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |   84
> +--
>  1 file changed, 82 insertions(+), 2 deletions(-)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:49.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:49.0 +0300
> @@ -347,6 +347,9 @@ struct nested_vmx {
>   /* vmcs02_list cache of VMCSs recently used to run L2 guests */
>   struct list_head vmcs02_pool;
>   int vmcs02_num;
> +
> + /* Saving the VMCS that we used for running L1 */
> + struct saved_vmcs saved_vmcs01;
>   u64 vmcs01_tsc_offset;
>   /*
>* Guest pages referred to in vmcs02 with host-physical pointers, so
> @@ -4668,6 +4671,8 @@ static void nested_free_all_saved_vmcss(
>   kfree(item);
>   }
>   vmx->nested.vmcs02_num = 0;
> + if (is_guest_mode(&vmx->vcpu))
> + nested_free_saved_vmcs(vmx, &vmx->nested.saved_vmcs01);
>  }
> 
>  /* Get a vmcs02 for the current vmcs12. */
> @@ -4959,6 +4964,21 @@ static int handle_vmclear(struct kvm_vcp
>   return 1;
>  }
> 
> +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch);
> +
> +/* Emulate the VMLAUNCH instruction */
> +static int handle_vmlaunch(struct kvm_vcpu *vcpu)
> +{
> + return nested_vmx_run(vcpu, true);
> +}
> +
> +/* Emulate the VMRESUME instruction */
> +static int handle_vmresume(struct kvm_vcpu *vcpu)
> +{
> +
> + return nested_vmx_run(vcpu, false);
> +}
> +
>  enum vmcs_field_type {
>   VMCS_FIELD_TYPE_U16 = 0,
>   VMCS_FIELD_TYPE_U64 = 1,
> @@ -5239,11 +5259,11 @@ static int (*kvm_vmx_exit_handlers[])(st
>   [EXIT_REASON_INVLPG]  = handle_invlpg,
>   [EXIT_REASON_VMCALL]  = handle_vmcall,
>   [EXIT_REASON_VMCLEAR] = handle_vmclear,
> - [EXIT_REASON_VMLAUNCH]= handle_vmx_insn,
> + [EXIT_REASON_VMLAUNCH]= handle_vmlaunch,
>   [EXIT_REASON_VMPTRLD] = handle_vmptrld,
>   [EXIT_REASON_VMPTRST] = handle_vmptrst,
>   [EXIT_REASON_VMREAD]  = handle_vmread,
> - [EXIT_REASON_VMRESUME]= handle_vmx_insn,
> + [EXIT_REASON_VMRESUME]= handle_vmresume,
>   [EXIT_REASON_VMWRITE] = handle_vmwrite,
>   [EXIT_REASON_VMOFF]   = handle_vmoff,
>   [EXIT_REASON_VMON]= handle_vmon,
> @@ -6129,6 +6149,66 @@ static void nested_maintain_per_cpu_list
>   }
>  }
> 
> +/*
> + * nested_vmx_run() handles a nested entry, i.e., a VMLAUNCH or
> VMRESUME on L1
> + * for running an L2 nested guest.
> + */
> +static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
> +{
> + struct vmcs12 *vmcs12;
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + int cpu;
> + struct saved_vmcs *saved_vmcs02;
> +
> + if (!nested_vmx_check_permission(vcpu))
> + return 1;
> + skip_emulated_instruction(vcpu);
> +
> + vmcs12 = get_vmcs12(vcpu);
> +
> + enter_guest_mode(vcpu);
> +
> + vmx->nested.vmcs01_tsc_offset = vmcs_read64(TSC_OFFSET);
> +
> + /*
> +  * Switch from L1's VMCS (vmcs01), to L2's VMCS (vmcs02). Remember
> +  * vmcs01, on which CPU it was last loaded, and whether it was launched
> +  * (we need all these values next time we will use L1). Then recall
> +  * these values from the last time vmcs02 was used.
> +  */
> + saved_vmcs02 = nested_get_current_vmcs02(vmx);
> + if (!saved_vmcs02)
> + return -ENOMEM;
> +
> + cpu = get_cpu();
> + vmx->nested.saved_vmcs01.vmcs = vmx->vmcs;
> + vmx->nested.saved_vmcs01.cpu = vcpu->cpu;
> + vmx->nested.saved_vmcs01.launched = vmx->launched;
> + vmx->vmcs = saved_vmcs02->vmcs;
> + vcpu->cpu = saved_vmcs02->cpu;

this may be another valid reason for your check on cpu_online in your
latest [08/31] local_vcpus_link fix, since cpu may be offlined after
this assignment. :-)

> + vmx->launched = saved_vmcs02->launched;
> +
> + nested_maintain_per_cpu_lists(vmx,
> + saved_vmcs02, &vmx->nested.saved_vmcs01);
> +
> + vmx_vcpu_put(vcpu);
> + vmx_vcpu_load(vcpu, cpu);
> + vcpu->cpu = cpu;
> + put_cpu();
> +
> + vmcs12->launch_state = 1;
> +
> + prepare_vmcs02(vcpu, vmcs12);

Since prepare_vmcs02 may fail, add a check here and move the launch_state
assignment after it succeeds?

Thanks
Kevin

> +
> + /*
> +  * Note no nested_vmx_succeed or nested_vmx_fail here. At this point
> +  * we are no longer running L1, and VMLAUNCH/VMRESUME has not yet
> +  * returned as far as L1

RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Tuesday, May 24, 2011 3:57 PM
> 
> On Tue, May 24, 2011, Tian, Kevin wrote about "RE: [PATCH 08/31] nVMX: Fix
> local_vcpus_link handling":
> > > +static inline void loaded_vmcs_init(struct loaded_vmcs *loaded_vmcs)
> > > +{
> > > + vmcs_clear(loaded_vmcs->vmcs);
> > > + loaded_vmcs->cpu = -1;
> > > + loaded_vmcs->launched = 0;
> > > +}
> > > +
> >
> > call it vmcs_init instead since you now remove original vmcs_init 
> > invocation,
> > which more reflect the necessity of adding VMCLEAR for a new vmcs?
> 
> The best name for this function, I think, would have been loaded_vmcs_clear,
> because this function isn't necessarily used to "init" - it's also called to
> VMCLEAR an old vmcs (and flush its content back to memory) - in that sense
> it is definitely not a "vmcs_init".
> 
> Unfortunately, I already have a whole chain of functions with this name :(
> the existing loaded_vmcs_clear() does an IPI to the CPU which has this VMCS
> loaded, and causes it to run __loaded_vmcs_clear(), which in turn calls
> the above loaded_vmcs_init(). I wish I could call all three functions
> loaded_vmcs_clear(), but of course I can't. If anyone reading this has a good
> suggestion on how to name these three functions, please let me know.

how about loaded_vmcs_reset?

> 
> > > +static void __loaded_vmcs_clear(void *arg)
> > >  {
> > > - struct vcpu_vmx *vmx = arg;
> > > + struct loaded_vmcs *loaded_vmcs = arg;
> > >   int cpu = raw_smp_processor_id();
> > >
> > > - if (vmx->vcpu.cpu == cpu)
> > > - vmcs_clear(vmx->vmcs);
> > > - if (per_cpu(current_vmcs, cpu) == vmx->vmcs)
> > > + if (loaded_vmcs->cpu != cpu)
> > > + return; /* cpu migration can race with cpu offline */
> >
> > what do you mean by "cpu migration" here? why does 'cpu offline'
> > matter here regarding to the cpu change?
> 
> __loaded_vmcs_clear() is typically called in one of two cases: "cpu migration"
> means that a guest that used to run on one CPU, and had its VMCS loaded
> there,
> suddenly needs to run on a different CPU, so we need to clear the VMCS on
> the old CPU. "cpu offline" means that we want to take a certain CPU offline,
> and before we do that we should VMCLEAR all VMCSs which were loaded on it.

So here you need to explicitly differentiate between a vcpu and a physical
cpu. The first case is really 'vcpu migration', and only the latter is a real
'cpu offline'. 'cpu migration' usually refers to a RAS feature in the
mission-critical world. :-)

> 
> The (vmx->cpu.cpu != cpu) case in __loaded_vmcs_clear should ideally never
> happen: In the cpu offline path, we only call it for the loaded_vmcss which
> we know for sure are loaded on the current cpu. In the cpu migration path,
> loaded_vmcs_clear runs __loaded_vmcs_clear on the right CPU, which ensures
> that
> equality.
> 
> But, there can be a race condition (this was actually explained to me a while
> back by Avi - I never seen this happening in practice): Imagine that cpu
> migration calls loaded_vmcs_clear, which tells the old cpu (via IPI) to
> VMCLEAR this vmcs. But before that old CPU gets a chance to act on that IPI,
> a decision is made to take it offline, and all loaded_vmcs loaded on it
> (including the one in question) are cleared. When that CPU acts on this IPI,
> it notices that vmx->cpu.cpu==-1, i.e., != cpu, so it doesn't need to do
> anything (in the new version of the code, I made this more explicit, by
> returning immediately in this case).

The reverse also holds true. Right between the point where the cpu-offline
path picks up a loaded_vmcs and the point where it calls __loaded_vmcs_clear,
it's possible that the vcpu is migrated to another cpu, and the migration
path (vmx_vcpu_load) may have invoked loaded_vmcs_clear but not yet deleted
this vmcs from the old cpu's linked list. In that case, when
__loaded_vmcs_clear is later invoked on the offlined cpu, it can still
observe cpu as -1.

> 
> At least this is the theory. As I said, I didn't see this problem in practice
> (unsurprising, since I never offlined any CPU). Maybe Avi or someone else can
> comment more about this (vmx->cpu.cpu == cpu) check, which existed before
> my patch - in __vcpu_clear().

I agree this check is necessary; I just want the comment to describe it
clearly, using the right terms.
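Concretely, comment wording along these lines would address my concern (text
only - the code is exactly what is already in your patch, I'm just rephrasing
the comment):

static void __loaded_vmcs_clear(void *arg)
{
        struct loaded_vmcs *loaded_vmcs = arg;
        int cpu = raw_smp_processor_id();

        /*
         * vcpu migration (loaded_vmcs_clear() sending us an IPI) and cpu
         * offline can race to clear this VMCS: whichever path ran first
         * already did the VMCLEAR and reset loaded_vmcs->cpu to -1, so the
         * loser sees the mismatch below and must do nothing.
         */
        if (loaded_vmcs->cpu != cpu)
                return;
        /* ... rest unchanged ... */
}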

> 
> > > +static void free_loaded_vmcs(struct loaded_vmcs *loaded_vmcs)
> > > +{
> > > + if (!loaded_vmcs->vmcs)

RE: [PATCH 17/31] nVMX: Prepare vmcs02 from vmcs01 and vmcs12

2011-05-24 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:53 AM
>
> This patch contains code to prepare the VMCS which can be used to actually
> run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the
> information
> in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
> own guests).
>
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |  269
> +++
>  1 file changed, 269 insertions(+)
>
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:48.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:48.0 +0300
> @@ -347,6 +347,12 @@ struct nested_vmx {
>   /* vmcs02_list cache of VMCSs recently used to run L2 guests */
>   struct list_head vmcs02_pool;
>   int vmcs02_num;
> + u64 vmcs01_tsc_offset;
> + /*
> +  * Guest pages referred to in vmcs02 with host-physical pointers, so
> +  * we must keep them pinned while L2 runs.
> +  */
> + struct page *apic_access_page;
>  };
>
>  struct vcpu_vmx {
> @@ -849,6 +855,18 @@ static inline bool report_flexpriority(v
>   return flexpriority_enabled;
>  }
>
> +static inline bool nested_cpu_has(struct vmcs12 *vmcs12, u32 bit)
> +{
> + return vmcs12->cpu_based_vm_exec_control & bit;
> +}
> +
> +static inline bool nested_cpu_has2(struct vmcs12 *vmcs12, u32 bit)
> +{
> + return (vmcs12->cpu_based_vm_exec_control &
> + CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) &&
> + (vmcs12->secondary_vm_exec_control & bit);
> +}
> +
>  static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
>  {
>   int i;
> @@ -1435,6 +1453,22 @@ static void vmx_fpu_activate(struct kvm_
>
>  static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu);
>
> +/*
> + * Return the cr0 value that a nested guest would read. This is a combination
> + * of the real cr0 used to run the guest (guest_cr0), and the bits shadowed 
> by
> + * its hypervisor (cr0_read_shadow).
> + */
> +static inline unsigned long guest_readable_cr0(struct vmcs12 *fields)
> +{
> + return (fields->guest_cr0 & ~fields->cr0_guest_host_mask) |
> + (fields->cr0_read_shadow & fields->cr0_guest_host_mask);
> +}
> +static inline unsigned long guest_readable_cr4(struct vmcs12 *fields)
> +{
> + return (fields->guest_cr4 & ~fields->cr4_guest_host_mask) |
> + (fields->cr4_read_shadow & fields->cr4_guest_host_mask);
> +}
> +

Will the guest_ prefix look confusing here? 'guest' has a broad meaning,
which makes the two functions above look like they could also be used in the
non-nested case. Should we stick to a nested_ prefix for nested-specific
facilities?

>  static void vmx_fpu_deactivate(struct kvm_vcpu *vcpu)
>  {
>   vmx_decache_cr0_guest_bits(vcpu);
> @@ -3423,6 +3457,9 @@ static void set_cr4_guest_host_mask(stru
>   vmx->vcpu.arch.cr4_guest_owned_bits =
> KVM_CR4_GUEST_OWNED_BITS;
>   if (enable_ept)
>   vmx->vcpu.arch.cr4_guest_owned_bits |= X86_CR4_PGE;
> + if (is_guest_mode(&vmx->vcpu))
> + vmx->vcpu.arch.cr4_guest_owned_bits &=
> + ~get_vmcs12(&vmx->vcpu)->cr4_guest_host_mask;

why not is_nested_mode()? :-P

>   vmcs_writel(CR4_GUEST_HOST_MASK,
> ~vmx->vcpu.arch.cr4_guest_owned_bits);
>  }
>
> @@ -4760,6 +4797,11 @@ static void free_nested(struct vcpu_vmx
>   vmx->nested.current_vmptr = -1ull;
>   vmx->nested.current_vmcs12 = NULL;
>   }
> + /* Unpin physical memory we referred to in current vmcs02 */
> + if (vmx->nested.apic_access_page) {
> + nested_release_page(vmx->nested.apic_access_page);
> + vmx->nested.apic_access_page = 0;
> + }
>
>   nested_free_all_saved_vmcss(vmx);
>  }
> @@ -5829,6 +5871,233 @@ static void vmx_set_supported_cpuid(u32
>  }
>
>  /*
> + * prepare_vmcs02 is called when the L1 guest hypervisor runs its nested
> + * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function "merges" it
> + * with L0's requirements for its guest (a.k.a. vmsc01), so we can run the L2
> + * guest in a way that will both be appropriate to L1's requests, and our
> + * needs. In addition to modifying the active vmcs (which is vmcs02), this
> + * function also has additional necessary side-effects, like setting various
> + * vcpu->arch fields.
> + */
> +static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> +{
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> + u32 exec_control;
> +
> + vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
> + vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
> + vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
> + vmcs_write16(GUEST_DS_SELECTOR, vmcs12->guest_ds_selector);
> + vmcs_write16(GUEST_FS_SELECTOR, vmcs12->guest_fs_selector);
> + vmcs_write16(GUEST_GS_SELECTOR, vmcs12->guest_gs_selector);
> + vmcs_write16(GUEST_LDTR_SELECTOR, vmcs12->guest_ldtr_selector);
> + vmcs_write16(GUEST_TR_SELECTOR, vmcs1

RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-23 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 24, 2011 2:51 AM
> 
> > >+  vmcs_init(vmx->loaded_vmcs->vmcs);
> > >+  vmx->loaded_vmcs->cpu = -1;
> > >+  vmx->loaded_vmcs->launched = 0;
> >
> > Perhaps a loaded_vmcs_init() to encapsulate initialization of these
> > three fields, you'll probably reuse it later.
> 
> It's good you pointed this out, because it made me suddenly realise that I
> forgot to VMCLEAR the new vmcs02's I allocate. In practice it never made a
> difference, but better safe than sorry.

Yes, that's what the spec requires: you need a VMCLEAR on any new VMCS
region, which performs the implementation-specific initialization of that
region.
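i.e. on allocation, something like the sketch below (the wrapper name is made
up; alloc_vmcs()/loaded_vmcs_init() are the existing helpers from your patch):

/* Sketch: every freshly allocated vmcs02 gets a VMCLEAR before first use */
static struct loaded_vmcs *alloc_vmcs02(void)
{
        struct loaded_vmcs *lv = kzalloc(sizeof(*lv), GFP_KERNEL);

        if (!lv)
                return NULL;
        lv->vmcs = alloc_vmcs();
        if (!lv->vmcs) {
                kfree(lv);
                return NULL;
        }
        loaded_vmcs_init(lv);   /* VMCLEAR + cpu = -1, launched = 0 */
        return lv;
}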

> 
> I had to restructure some of the code a bit to be able to properly use this
> new function (in 3 places - __loaded_vmcs_clear, nested_get_current_vmcs02,
> vmx_create_cpu).
> 
> > Please repost separately after the fix, I'd like to apply it before the
> > rest of the series.
> 
> I am adding a new version of this patch at the end of this mail.
> 
> > (regarding interrupts, I think we can do that work post-merge.  But I'd
> > like to see Kevin's comments addressed)
> 
> I replied to his comments. Done some of the things he asked, and asked for
> more info on why/where he believes the current code is incorrect where I
> didn't understand what problems he pointed to, and am now waiting for him
> to reply.

As I replied in another thread, I believe this has been explained clearly by 
Nadav.

> 
> 
> --- 8< -- 8< -- 8< -- 8< --- 8< ---
> 
> Subject: [PATCH 01/31] nVMX: Keep list of loaded VMCSs, instead of vcpus.
> 
> In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
> because (at least in theory) the processor might not have written all of its
> content back to memory. Since a patch from June 26, 2008, this is done using
> a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.
> 
> The problem is that with nested VMX, we no longer have the concept of a
> vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
> L2s), and each of those may be have been last loaded on a different cpu.
> 
> So instead of linking the vcpus, we link the VMCSs, using a new structure
> loaded_vmcs. This structure contains the VMCS, and the information
> pertaining
> to its loading on a specific cpu (namely, the cpu number, and whether it
> was already launched on this cpu once). In nested we will also use the same
> structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
> currently active VMCS.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |  150 ---
>  1 file changed, 86 insertions(+), 64 deletions(-)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-23 21:46:14.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-23 21:46:14.0 +0300
> @@ -116,6 +116,18 @@ struct vmcs {
>   char data[0];
>  };
> 
> +/*
> + * Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
> + * remember whether it was VMLAUNCHed, and maintain a linked list of all
> VMCSs
> + * loaded on this CPU (so we can clear them if the CPU goes down).
> + */
> +struct loaded_vmcs {
> + struct vmcs *vmcs;
> + int cpu;
> + int launched;
> + struct list_head loaded_vmcss_on_cpu_link;
> +};
> +
>  struct shared_msr_entry {
>   unsigned index;
>   u64 data;
> @@ -124,9 +136,7 @@ struct shared_msr_entry {
> 
>  struct vcpu_vmx {
>   struct kvm_vcpu   vcpu;
> - struct list_head  local_vcpus_link;
>   unsigned long host_rsp;
> - int   launched;
>   u8fail;
>   u8cpl;
>   bool  nmi_known_unmasked;
> @@ -140,7 +150,14 @@ struct vcpu_vmx {
>   u64   msr_host_kernel_gs_base;
>   u64   msr_guest_kernel_gs_base;
>  #endif
> - struct vmcs  *vmcs;
> + /*
> +  * loaded_vmcs points to the VMCS currently used in this vcpu. For a
> +  * non-nested (L1) guest, it always points to vmcs01. For a nested
> +  * guest (L2), it points to a different VMCS.
> +  */
> + struct loaded_vmcsvmcs01;
> + struct loaded_vmcs   *loaded_vmcs;
> + bool  __launched; /* temporary, used in
> vmx_vcpu_run */
>   struct msr_autoload {
>   unsigned nr;
>   struct vmx_msr_entry guest[NR_AUTOLOAD_MSRS];
> @@ -200,7 +217,11 @@ static int vmx_set_tss_addr(struct kvm *
> 
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> -static DEFINE_PER_CPU(struct list_head, vcpus_on_cpu);
> +/*
> + * We maintain a per-CPU linked-list of VMCS loaded on that CPU. This is
> needed
> + * when a CPU is brought down, and we need to VMCLEAR all VMCSs loaded
> on it.
> + */
> +static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
>  static DEFINE_PER_CPU(struct desc_ptr, host

RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-23 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Sunday, May 22, 2011 4:30 PM
> 
> Hi,
> 
> On Fri, May 20, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX:
> Introduce vmcs02: VMCS used to run L2":
> > Possibly we can maintain the vmcs02 pool along with L1 VMCLEAR ops, which
> > is similar to the hardware behavior regarding to cleared and launched state.
> 
> If you set VMCS02_POOL_SIZE to a large size, and L1, like typical hypervisors,
> only keeps around a few VMCSs (and VMCLEARs the ones it will not use again),
> then we'll only have a few vmcs02: handle_vmclear() removes from the pool the
> vmcs02 that L1 explicitly told us it won't need again.

yes

> 
> > > +struct saved_vmcs {
> > > + struct vmcs *vmcs;
> > > + int cpu;
> > > + int launched;
> > > +};
> >
> > "saved" looks a bit misleading here. It's simply a list of all active vmcs02
> tracked
> > by kvm, isn't it?
> 
> I have rewritten this part of the code, based on Avi's and Marcelo's requests,
> and the new name for this structure is "loaded_vmcs", i.e., a structure
> describing where a VMCS was loaded.

great, I'll take a look at your new code.

Thanks
Kevin


RE: [PATCH 08/31] nVMX: Fix local_vcpus_link handling

2011-05-23 Thread Tian, Kevin
> From: Avi Kivity
> Sent: Monday, May 23, 2011 11:49 PM
> (regarding interrupts, I think we can do that work post-merge.  But I'd
> like to see Kevin's comments addressed)

My earlier comment has been addressed by Nadav with his explanation.

Thanks
Kevin


RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-23 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Sunday, May 22, 2011 3:23 PM
> 
> Hi,
> 
> On Sun, May 22, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX:
> Introduce vmcs02: VMCS used to run L2":
> > Here the vmcs02 being overridden may have been run on another processor
> before
> > but is not vmclear-ed yet. When you resume this vmcs02 with new content on
> a
> > separate processor, the risk of corruption exists.
> 
> I still believe that my current code is correct (in this area). I'll try to
> explain it here and would be grateful if you could point to me the error (if
> there is one) in my logic:
> 
> Nested_vmx_run() is our function which is switches from running L1 to L2
> (patch 18).
> 
> This function starts by calling nested_get_current_vmcs02(), which gets us
> *some* vmcs to use for vmcs02. This may be a fresh new VMCS, or a
> "recycled"
> VMCS, some VMCS we've previously used to run some, potentially different L2
> guest on some, potentially different, CPU.
> nested_get_current_vmcs02() returns a "saved_vmcs" structure, which
> not only contains a VMCS, but also remembers on which (if any) cpu it is
> currently loaded (and whether it was VMLAUNCHed once on that cpu).
> 
> The next thing that Nested_vmx_run() now does is to set up in the vcpu object
> the vmcs, cpu and launched fields according to what was returned above.
> 
> Now it calls vmx_vcpu_load(). This standard KVM function checks if we're now
> running on a different CPU from the vcpu->cpu, and if it a different one, is
> uses vcpu_clear() to VMCLEAR the vmcs on the CPU where it was last loaded
> (using an IPI). Only after it vmclears the VMCS on the old CPU, it can finally
> load the VMCS on the new CPU.
> 
> Only now Nested_vmx_run() can call prepare_vmcs02, which starts
> VMWRITEing
> to this VMCS, and finally returns.
> 

Yes, you're correct. Previously I had only looked at 07/31 when I raised the
concern above. With the nested_vmx_run flow you explained, this part is clear
to me now. :-)

> P.S. Seeing that you're from Intel, maybe you can help me with a pointer:
> I found what appears to be a small error in the SDM - who can I report it to?
> 

Let me ask for you.

Thanks
Kevin


RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-21 Thread Tian, Kevin
> From: Nadav Har'El [mailto:n...@math.technion.ac.il]
> Sent: Saturday, May 21, 2011 4:32 AM
> 
> On Fri, May 20, 2011, Tian, Kevin wrote about "RE: [PATCH 07/31] nVMX:
> Introduce vmcs02: VMCS used to run L2":
> > btw, shouldn't you clear recycled VMCS and reset 'cpu' and 'launched' 
> > fields?
> 
> Well, I believe the answer is "no": As far as I understood, a host is allowed
> to take a VMCS that has been used once to launch a certain guest, and then
> modify all the VMCS's fields to define a completely different guest, and then
> VMRESUME it, without doing the regular VMCLEAR/VMLAUNCH, even though
> it's
> "a different guest". Is there something wrong in my assumption? Does VMX
> keep
> anything constant between successive VMRESUMEs?

Yes, you can reuse a VMCS with completely different state if the VMCS is used
on the same processor, and you must ensure that the VMCS has no dirty state
on any other processor. The SDM Vol 3B (21.10.1) explicitly requires:



No VMCS should ever be active on more than one logical processor. If a VMCS is 
to be
"migrated" from one logical processor to another, the first logical processor 
should
execute VMCLEAR for the VMCS (to make it inactive on that logical processor and 
to
ensure that all VMCS data are in memory) before the other logical processor
executes VMPTRLD for the VMCS (to make it active on the second logical 
processor).
A VMCS that is made active on more than one logical processor may become
corrupted



Here the vmcs02 being overridden may have run on another processor before but
has not been VMCLEARed yet. When you resume this vmcs02 with new content on a
different processor, there is a risk of corruption.
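In code terms, the ordering the SDM asks for is essentially the following.
This is only a sketch of the sequence; the wrapper and the IPI callback names
are made up for illustration (the real code spreads this across the existing
clear/load paths), and it uses the struct saved_vmcs from this patch:

static void clear_vmcs_on_cpu(void *arg)
{
        struct saved_vmcs *s = arg;

        vmcs_clear(s->vmcs);    /* VMCLEAR on the cpu where it was active */
}

/*
 * Sketch of SDM 3B 21.10.1: a VMCS must be VMCLEARed on the cpu where it is
 * still active before it may be VMPTRLDed on the current cpu.
 */
static void migrate_vmcs(struct saved_vmcs *s)
{
        int cpu = raw_smp_processor_id();

        if (s->cpu != -1 && s->cpu != cpu)
                smp_call_function_single(s->cpu, clear_vmcs_on_cpu, s, 1);
        vmcs_load(s->vmcs);     /* VMPTRLD on this cpu */
        s->cpu = cpu;
}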

> 
> > Have you tried SMP L2 guest?
> 
> It "sort of" works, but it *does* appear to still have a bug which I didn't
> yet have the time to hunt... In one case, for example, an 8-vcpu L2 on an
> 8-vcpu L1 seemed to work well (e.g., doing parallel make) for about a minute,
> and then hung with some sort of page fault in the kernel.
> 

See whether cleaning up above can help here.

Thanks
Kevin


RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-20 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Friday, May 20, 2011 4:05 PM
> 
> > From: Nadav Har'El
> > Sent: Tuesday, May 17, 2011 3:48 AM
> >
> > We saw in a previous patch that L1 controls its L2 guest with a vcms12.
> > L0 needs to create a real VMCS for running L2. We call that "vmcs02".
> > A later patch will contain the code, prepare_vmcs02(), for filling the 
> > vmcs02
> > fields. This patch only contains code for allocating vmcs02.
> >
> > In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
> > enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It 
> > can
> > be reused even when L1 runs multiple L2 guests. However, in future versions
> > we'll probably want to add an optimization where vmcs02 fields that rarely
> > change will not be set each time. For that, we may want to keep around
> several
> > vmcs02s of L2 guests that have recently run, so that potentially we could 
> > run
> > these L2s again more quickly because less vmwrites to vmcs02 will be
> needed.
> 
> That would be a neat enhancement and should have an obvious improvement.
> Possibly we can maintain the vmcs02 pool along with L1 VMCLEAR ops, which
> is similar to the hardware behavior regarding to cleared and launched state.
> 
> >
> > This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
> > which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE
> L2s.
> > As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
> > I.e., one vmcs02 is allocated (and loaded onto the processor), and it is
> > reused to enter any L2 guest. In the future, when prepare_vmcs02() is
> > optimized not to set all fields every time, VMCS02_POOL_SIZE should be
> > increased.
> >
> > Signed-off-by: Nadav Har'El 
> > ---
> >  arch/x86/kvm/vmx.c |  139
> > +++
> >  1 file changed, 139 insertions(+)
> >
> > --- .before/arch/x86/kvm/vmx.c  2011-05-16 22:36:47.0 +0300
> > +++ .after/arch/x86/kvm/vmx.c   2011-05-16 22:36:47.0 +0300
> > @@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
> >  module_param(ple_window, int, S_IRUGO);
> >
> >  #define NR_AUTOLOAD_MSRS 1
> > +#define VMCS02_POOL_SIZE 1
> >
> >  struct vmcs {
> > u32 revision_id;
> > @@ -166,6 +167,30 @@ struct __packed vmcs12 {
> >  #define VMCS12_SIZE 0x1000
> >
> >  /*
> > + * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's
> VMCS
> > + * while we use L2's VMCS), and we wish to save the previous VMCS, we
> must
> > also
> > + * remember on which CPU it was last loaded (vcpu->cpu), so when we
> return
> > to
> > + * using this VMCS we'll know if we're now running on a different CPU and
> > need
> > + * to clear the VMCS on the old CPU, and load it on the new one.
> Additionally,
> > + * we need to remember whether this VMCS was launched (vmx->launched),
> > so when
> > + * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot
> > deduce
> > + * this from other state, because it's possible that this VMCS had once 
> > been
> > + * launched, but has since been cleared after a CPU switch).
> > + */
> > +struct saved_vmcs {
> > +   struct vmcs *vmcs;
> > +   int cpu;
> > +   int launched;
> > +};
> 
> "saved" looks a bit misleading here. It's simply a list of all active vmcs02
> tracked
> by kvm, isn't it?
> 
> > +
> > +/* Used to remember the last vmcs02 used for some recently used vmcs12s
> > */
> > +struct vmcs02_list {
> > +   struct list_head list;
> > +   gpa_t vmcs12_addr;
> 
> uniform the name 'vmptr' as nested_vmx strucure:
>  /* The guest-physical address of the current VMCS L1 keeps for L2 */
>   gpa_t current_vmptr;
>   /* The host-usable pointer to the above */
>   struct page *current_vmcs12_page;
>   struct vmcs12 *current_vmcs12;
> 
> you should keep consistent meaning for vmcs12, which means the arch-neutral
> state interpreted by KVM only.
> 
> > +   struct saved_vmcs vmcs02;
> > +};
> > +
> > +/*
> >   * The nested_vmx structure is part of vcpu_vmx, and holds information we
> > need
> >   * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
> >   */
> > @@ -178,6 +203,10 @@ struct nested_vmx {
> > /* The host-usable pointer to the above */
> > struct page *current_vmcs12_page;

RE: [PATCH 09/31] nVMX: Add VMCS fields to the vmcs12

2011-05-20 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:49 AM
> 
> In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
> standard VMCS fields.
> 
> Later patches will enable L1 to read and write these fields using VMREAD/
> VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing
> vmcs02,
> a hardware VMCS for running L2.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |  275
> +++
>  1 file changed, 275 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:47.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:47.0 +0300
> @@ -144,12 +144,148 @@ struct shared_msr_entry {
>   * machines (necessary for live migration).
>   * If there are changes in this struct, VMCS12_REVISION must be changed.
>   */
> +typedef u64 natural_width;
>  struct __packed vmcs12 {
>   /* According to the Intel spec, a VMCS region must start with the
>* following two fields. Then follow implementation-specific data.
>*/
>   u32 revision_id;
>   u32 abort;
> +
> + u64 io_bitmap_a;
> + u64 io_bitmap_b;
> + u64 msr_bitmap;
> + u64 vm_exit_msr_store_addr;
> + u64 vm_exit_msr_load_addr;
> + u64 vm_entry_msr_load_addr;
> + u64 tsc_offset;
> + u64 virtual_apic_page_addr;
> + u64 apic_access_addr;
> + u64 ept_pointer;
> + u64 guest_physical_address;
> + u64 vmcs_link_pointer;
> + u64 guest_ia32_debugctl;
> + u64 guest_ia32_pat;
> + u64 guest_ia32_efer;
> + u64 guest_pdptr0;
> + u64 guest_pdptr1;
> + u64 guest_pdptr2;
> + u64 guest_pdptr3;
> + u64 host_ia32_pat;
> + u64 host_ia32_efer;
> + u64 padding64[8]; /* room for future expansion */
> + /*
> +  * To allow migration of L1 (complete with its L2 guests) between
> +  * machines of different natural widths (32 or 64 bit), we cannot have
> +  * unsigned long fields with no explict size. We use u64 (aliased
> +  * natural_width) instead. Luckily, x86 is little-endian.
> +  */
> + natural_width cr0_guest_host_mask;
> + natural_width cr4_guest_host_mask;
> + natural_width cr0_read_shadow;
> + natural_width cr4_read_shadow;
> + natural_width cr3_target_value0;
> + natural_width cr3_target_value1;
> + natural_width cr3_target_value2;
> + natural_width cr3_target_value3;
> + natural_width exit_qualification;
> + natural_width guest_linear_address;
> + natural_width guest_cr0;
> + natural_width guest_cr3;
> + natural_width guest_cr4;
> + natural_width guest_es_base;
> + natural_width guest_cs_base;
> + natural_width guest_ss_base;
> + natural_width guest_ds_base;
> + natural_width guest_fs_base;
> + natural_width guest_gs_base;
> + natural_width guest_ldtr_base;
> + natural_width guest_tr_base;
> + natural_width guest_gdtr_base;
> + natural_width guest_idtr_base;
> + natural_width guest_dr7;
> + natural_width guest_rsp;
> + natural_width guest_rip;
> + natural_width guest_rflags;
> + natural_width guest_pending_dbg_exceptions;
> + natural_width guest_sysenter_esp;
> + natural_width guest_sysenter_eip;
> + natural_width host_cr0;
> + natural_width host_cr3;
> + natural_width host_cr4;
> + natural_width host_fs_base;
> + natural_width host_gs_base;
> + natural_width host_tr_base;
> + natural_width host_gdtr_base;
> + natural_width host_idtr_base;
> + natural_width host_ia32_sysenter_esp;
> + natural_width host_ia32_sysenter_eip;
> + natural_width host_rsp;
> + natural_width host_rip;
> + natural_width paddingl[8]; /* room for future expansion */
> + u32 pin_based_vm_exec_control;
> + u32 cpu_based_vm_exec_control;
> + u32 exception_bitmap;
> + u32 page_fault_error_code_mask;
> + u32 page_fault_error_code_match;
> + u32 cr3_target_count;
> + u32 vm_exit_controls;
> + u32 vm_exit_msr_store_count;
> + u32 vm_exit_msr_load_count;
> + u32 vm_entry_controls;
> + u32 vm_entry_msr_load_count;
> + u32 vm_entry_intr_info_field;
> + u32 vm_entry_exception_error_code;
> + u32 vm_entry_instruction_len;
> + u32 tpr_threshold;
> + u32 secondary_vm_exec_control;
> + u32 vm_instruction_error;
> + u32 vm_exit_reason;
> + u32 vm_exit_intr_info;
> + u32 vm_exit_intr_error_code;
> + u32 idt_vectoring_info_field;
> + u32 idt_vectoring_error_code;
> + u32 vm_exit_instruction_len;
> + u32 vmx_instruction_info;
> + u32 guest_es_limit;
> + u32 guest_cs_limit;
> + u32 guest_ss_limit;
> + u32 guest_ds_limit;
> + u32 guest_fs_limit;
> + u32 guest_gs_limit;
> + u32 guest_ldtr_limit;
> + u32 guest_tr_limit;
> + u32 guest_gdtr_limit;
> + u32 guest_idtr_limit;
> + u32 guest_es_ar_bytes;
> + u32 guest_cs_ar_bytes;
> + u32 guest_ss_ar_bytes;
> + u32 guest

RE: [PATCH 07/31] nVMX: Introduce vmcs02: VMCS used to run L2

2011-05-20 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:48 AM
> 
> We saw in a previous patch that L1 controls its L2 guest with a vcms12.
> L0 needs to create a real VMCS for running L2. We call that "vmcs02".
> A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
> fields. This patch only contains code for allocating vmcs02.
> 
> In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
> enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
> be reused even when L1 runs multiple L2 guests. However, in future versions
> we'll probably want to add an optimization where vmcs02 fields that rarely
> change will not be set each time. For that, we may want to keep around several
> vmcs02s of L2 guests that have recently run, so that potentially we could run
> these L2s again more quickly because less vmwrites to vmcs02 will be needed.

That would be a neat enhancement and should give an obvious improvement.
Possibly we could maintain the vmcs02 pool in step with L1's VMCLEAR
operations, which would be similar to the hardware's behavior regarding the
cleared and launched states.
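Roughly what I have in mind, as a sketch only (the helper name is made up, it
would be invoked from handle_vmclear, and it reuses the vmcs02_list/saved_vmcs
structures and the nested_free_saved_vmcs() helper from this series):

/* Sketch: when L1 VMCLEARs a vmcs12, drop the vmcs02 that was backing it */
static void nested_drop_vmcs02(struct vcpu_vmx *vmx, gpa_t vmptr)
{
        struct vmcs02_list *item;

        list_for_each_entry(item, &vmx->nested.vmcs02_pool, list) {
                if (item->vmcs12_addr != vmptr)
                        continue;
                nested_free_saved_vmcs(vmx, &item->vmcs02);
                list_del(&item->list);
                kfree(item);
                vmx->nested.vmcs02_num--;
                return;
        }
}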

> 
> This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
> which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
> As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
> I.e., one vmcs02 is allocated (and loaded onto the processor), and it is
> reused to enter any L2 guest. In the future, when prepare_vmcs02() is
> optimized not to set all fields every time, VMCS02_POOL_SIZE should be
> increased.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |  139
> +++
>  1 file changed, 139 insertions(+)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:47.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:47.0 +0300
> @@ -117,6 +117,7 @@ static int ple_window = KVM_VMX_DEFAULT_
>  module_param(ple_window, int, S_IRUGO);
> 
>  #define NR_AUTOLOAD_MSRS 1
> +#define VMCS02_POOL_SIZE 1
> 
>  struct vmcs {
>   u32 revision_id;
> @@ -166,6 +167,30 @@ struct __packed vmcs12 {
>  #define VMCS12_SIZE 0x1000
> 
>  /*
> + * When we temporarily switch a vcpu's VMCS (e.g., stop using an L1's VMCS
> + * while we use L2's VMCS), and we wish to save the previous VMCS, we must
> also
> + * remember on which CPU it was last loaded (vcpu->cpu), so when we return
> to
> + * using this VMCS we'll know if we're now running on a different CPU and
> need
> + * to clear the VMCS on the old CPU, and load it on the new one. 
> Additionally,
> + * we need to remember whether this VMCS was launched (vmx->launched),
> so when
> + * we return to it we know if to VMLAUNCH or to VMRESUME it (we cannot
> deduce
> + * this from other state, because it's possible that this VMCS had once been
> + * launched, but has since been cleared after a CPU switch).
> + */
> +struct saved_vmcs {
> + struct vmcs *vmcs;
> + int cpu;
> + int launched;
> +};

"saved" looks a bit misleading here. It's simply a list of all active vmcs02 
tracked
by kvm, isn't it?

> +
> +/* Used to remember the last vmcs02 used for some recently used vmcs12s
> */
> +struct vmcs02_list {
> + struct list_head list;
> + gpa_t vmcs12_addr;

Please use a uniform name such as 'vmptr', consistent with the nested_vmx
structure:

 /* The guest-physical address of the current VMCS L1 keeps for L2 */
 gpa_t current_vmptr;
 /* The host-usable pointer to the above */
 struct page *current_vmcs12_page;
 struct vmcs12 *current_vmcs12;

You should keep a consistent meaning for vmcs12, i.e. the arch-neutral
state interpreted by KVM only.

> + struct saved_vmcs vmcs02;
> +};
> +
> +/*
>   * The nested_vmx structure is part of vcpu_vmx, and holds information we
> need
>   * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
>   */
> @@ -178,6 +203,10 @@ struct nested_vmx {
>   /* The host-usable pointer to the above */
>   struct page *current_vmcs12_page;
>   struct vmcs12 *current_vmcs12;
> +
> + /* vmcs02_list cache of VMCSs recently used to run L2 guests */
> + struct list_head vmcs02_pool;
> + int vmcs02_num;
>  };
> 
>  struct vcpu_vmx {
> @@ -4200,6 +4229,111 @@ static int handle_invalid_op(struct kvm_
>  }
> 
>  /*
> + * To run an L2 guest, we need a vmcs02 based the L1-specified vmcs12.
> + * We could reuse a single VMCS for all the L2 guests, but we also want the
> + * option to allocate a separate vmcs02 for each separate loaded vmcs12 -
> this
> + * allows keeping them loaded on the processor, and in the future will allow
> + * optimizations where prepare_vmcs02 doesn't need to set all the fields on
> + * every entry if they never change.
> + * So we keep, in vmx->nested.vmcs02_pool, a cache of size
> VMCS02_POOL_SIZE
> + * (>=0) with a vmcs02 for each recently loaded vmcs12s, most recent first.
> + *
> + * The following functions allocate and free a vmcs02 in this pool.
> + */
> +

RE: [PATCH 02/31] nVMX: Implement VMXON and VMXOFF

2011-05-20 Thread Tian, Kevin
> From: Nadav Har'El
> Sent: Tuesday, May 17, 2011 3:45 AM
> 
> This patch allows a guest to use the VMXON and VMXOFF instructions, and
> emulates them accordingly. Basically this amounts to checking some
> prerequisites, and then remembering whether the guest has enabled or
> disabled VMX operation.
> 
> Signed-off-by: Nadav Har'El 
> ---
>  arch/x86/kvm/vmx.c |  110
> ++-
>  1 file changed, 108 insertions(+), 2 deletions(-)
> 
> --- .before/arch/x86/kvm/vmx.c2011-05-16 22:36:46.0 +0300
> +++ .after/arch/x86/kvm/vmx.c 2011-05-16 22:36:46.0 +0300
> @@ -130,6 +130,15 @@ struct shared_msr_entry {
>   u64 mask;
>  };
> 
> +/*
> + * The nested_vmx structure is part of vcpu_vmx, and holds information
> +we need
> + * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
> + */
> +struct nested_vmx {
> + /* Has the level1 guest done vmxon? */
> + bool vmxon;
> +};
> +
>  struct vcpu_vmx {
>   struct kvm_vcpu   vcpu;
>   struct list_head  local_vcpus_link;
> @@ -184,6 +193,9 @@ struct vcpu_vmx {
>   u32 exit_reason;
> 
>   bool rdtscp_enabled;
> +
> + /* Support for a guest hypervisor (nested VMX) */
> + struct nested_vmx nested;
>  };
> 
>  enum segment_cache_field {
> @@ -3890,6 +3902,99 @@ static int handle_invalid_op(struct kvm_  }
> 
>  /*
> + * Emulate the VMXON instruction.
> + * Currently, we just remember that VMX is active, and do not save or
> +even
> + * inspect the argument to VMXON (the so-called "VMXON pointer")
> +because we
> + * do not currently need to store anything in that guest-allocated
> +memory

Though we don't need to store anything, VMXON still needs to check the
revision ID of the VMXON region to make sure it matches the processor's
expectation. Consider a user who uses nVMX to practice VMM development
and forgets to fill the revision ID into the region; we should fail the
instruction in the first place.
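
Something along these lines could do it (a rough sketch only, not the actual
patch code; I assume 'vmptr' has already been obtained by decoding the VMXON
operand, and VMCS12_REVISION stands for whatever revision id the nested
implementation advertises in IA32_VMX_BASIC):

/* Sketch: validate the revision id at the start of the VMXON region. */
static bool nested_vmxon_region_valid(struct kvm_vcpu *vcpu, gpa_t vmptr)
{
	u32 revision;

	/* Read the first dword of the guest's VMXON region. */
	if (kvm_read_guest(vcpu->kvm, vmptr, &revision, sizeof(revision)))
		return false;	/* region not readable */

	return revision == VMCS12_REVISION;
}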

> + * region. Consequently, VMCLEAR and VMPTRLD also do not verify that
> +the their
> + * argument is different from the VMXON pointer (which the spec says they
> do).
> + */
> +static int handle_vmon(struct kvm_vcpu *vcpu) {
> + struct kvm_segment cs;
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> + /* The Intel VMX Instruction Reference lists a bunch of bits that
> +  * are prerequisite to running VMXON, most notably cr4.VMXE must be
> +  * set to 1 (see vmx_set_cr4() for when we allow the guest to set this).
> +  * Otherwise, we should fail with #UD. We test these now:
> +  */
> + if (!kvm_read_cr4_bits(vcpu, X86_CR4_VMXE) ||
> + !kvm_read_cr0_bits(vcpu, X86_CR0_PE) ||
> + (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
> + kvm_queue_exception(vcpu, UD_VECTOR);
> + return 1;
> + }
> +
> + vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
> + if (is_long_mode(vcpu) && !cs.l) {
> + kvm_queue_exception(vcpu, UD_VECTOR);
> + return 1;
> + }
> +
> + if (vmx_get_cpl(vcpu)) {
> + kvm_inject_gp(vcpu, 0);
> + return 1;
> + }

You also need to check bits 0/1/2 of the IA32_FEATURE_CONTROL MSR, as
described in the SDM, and likewise check the 4K alignment and
physical-address width of the VMXON region.
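
For illustration, the extra prerequisites could look roughly like this (a
sketch under the assumption that 'feature_control' holds the guest's view of
that MSR and that cpuid_maxphyaddr() gives the guest's physical-address
width; not the actual patch code):

/* Sketch of the additional VMXON prerequisite checks mentioned above. */
static bool nested_vmxon_prereqs_ok(struct kvm_vcpu *vcpu, gpa_t vmptr,
				    u64 feature_control)
{
	/* Bit 0 (lock) and bit 2 (enable VMX outside SMX) must be set. */
	if (!(feature_control & (1ULL << 0)) ||
	    !(feature_control & (1ULL << 2)))
		return false;

	/* VMXON region must be 4K aligned and within the guest's
	 * physical-address width. */
	if ((vmptr & ~PAGE_MASK) || (vmptr >> cpuid_maxphyaddr(vcpu)))
		return false;

	return true;
}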

> +
> + vmx->nested.vmxon = true;
> +
> + skip_emulated_instruction(vcpu);
> + return 1;
> +}
> +
> +/*
> + * Intel's VMX Instruction Reference specifies a common set of
> +prerequisites
> + * for running VMX instructions (except VMXON, whose prerequisites are
> + * slightly different). It also specifies what exception to inject otherwise.
> + */
> +static int nested_vmx_check_permission(struct kvm_vcpu *vcpu) {
> + struct kvm_segment cs;
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> + if (!vmx->nested.vmxon) {
> + kvm_queue_exception(vcpu, UD_VECTOR);
> + return 0;
> + }
> +
> + vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
> + if ((vmx_get_rflags(vcpu) & X86_EFLAGS_VM) ||
> + (is_long_mode(vcpu) && !cs.l)) {
> + kvm_queue_exception(vcpu, UD_VECTOR);
> + return 0;
> + }
> +
> + if (vmx_get_cpl(vcpu)) {
> + kvm_inject_gp(vcpu, 0);
> + return 0;
> + }
> +
> + return 1;
> +}
> +
> +/*
> + * Free whatever needs to be freed from vmx->nested when L1 goes down,
> +or
> + * just stops using VMX.
> + */
> +static void free_nested(struct vcpu_vmx *vmx) {
> + if (!vmx->nested.vmxon)
> + return;
> + vmx->nested.vmxon = false;
> +}
> +
> +/* Emulate the VMXOFF instruction */
> +static int handle_vmoff(struct kvm_vcpu *vcpu) {
> + if (!nested_vmx_check_permission(vcpu))
> + return 1;

This misses a check on CR0.PE.

> + free_nested(to_vmx(vcpu));
> + skip_emulated_instruction(vcpu);
> + return 1;
> +}
> +
> +/*
>   * The exit handlers return 1 if the exit was handled fully and guest 
> execution
>   * may resume.  Otherwise they set the kvm_run p

RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Tian, Kevin
>From: Avi Kivity [mailto:[EMAIL PROTECTED]
>Sent: September 28, 2008 12:23
>
>There is no issue on the host, since all drivers operate on the same
>trust level. A misbehaving driver on the host will take down the entire
>system even without shared interrupts, by corrupting memory, not
>releasing a lock, etc.
>
>But if you move a driver to the guest, you expect it will be isolated
>from the rest of the system, and if there are shared
>interrupts, it isn't.
>

Yes, you're right

>> Or do you have better alternative?
>>
>
>No. Maybe the Neocleus polarity trick (which also reduces performance).
>

To my knowledge, the Neocleus polarity trick can't solve this isolation
issue; it just provides an efficient way to track assertion/deassertion
transitions on the irq line. For example, the polarity is reversed when an
interrupt instance is received, a new irq instance then occurs once all
devices de-assert the shared irq line, and the polarity is restored. In the
case you're concerned about, where the guest driver misbehaves, the polarity
trick can't work either, since one device keeps asserting the line.
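
Roughly, the trick amounts to a polarity toggle like the following (purely
illustrative pseudo-code with hypothetical helpers set_line_polarity() and
forward_irq_to_guest(); not from the Neocleus code):

enum line_polarity { POLARITY_NORMAL, POLARITY_INVERTED };

struct line_state {
	bool inverted;	/* is the polarity currently reversed? */
};

static void shared_irq_polarity_handler(int irq, struct line_state *line)
{
	if (!line->inverted) {
		/* Line asserted: inject into the guest and flip the polarity
		 * so the next interrupt fires on de-assertion. */
		forward_irq_to_guest(irq);
		set_line_polarity(irq, POLARITY_INVERTED);
		line->inverted = true;
	} else {
		/* All devices have de-asserted the line: restore the normal
		 * polarity and wait for the next assertion. */
		set_line_polarity(irq, POLARITY_NORMAL);
		line->inverted = false;
	}
}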

Thanks,
Kevin

RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Tian, Kevin
>From: Dong, Eddie
>Sent: September 28, 2008 10:04
>
>Tian, Kevin wrote:
>>> From:Avi Kivity
>>> Sent: September 27, 2008 17:50
>>>
>>> Yang, Sheng wrote:
>>>> After check host shared interrupts situation, I got a
>>>> question here:
>>>>
>>>> If I understand correctly, current solution don't block
>>>> host shared irq, just come with the performance pentry.
>>>> The penalty come with host disabled irq line for a
>>>> period. We have to wait guest to write EOI. But I fail
>>>> to see the correctness problem here (except a lot of
>>>> spurious interrupt in the guest).
>>>>
>>>> I've checked mail, but can't find clue about that. Can
>>>> you explain the situation?
>>>>
>>>>
>>>
>>> If the guest fails to disable interrupts on a device
>>> that shares an interrupt line with the host, the host
>>> will experience an interrupt flood.  Eventually the host
>>> will disable the host device as well.
>>>
>>
>> This issue also exists on host side, that one misbehaved
>> driver can hurt all other drivers sharing same irq line.
>> But it seems no good way to avoid it. Since not all
>> devices support MSI, we still need support irq sharing
>> possibly with above caveats given.
>>
>> Existing approach at least works with a sane guest
>> driver, with some performance penality there.
>>
>> Or do you have better alternative?
>>
>> Thanks,
>> Kevin
>>
>
>MSI is always 1st choice. Including taking host MSI for guest
>IOAPIC situation because we don't if guest OS has MSI support
>but we are sure host Linux can.
>
>When MSI is impossible, I recommend we disable device
>assignment for those sharing interrupt , or we assign all
>devices with same interrupt to same guest. Yes the issue is
>same in native, but in native the whole OS (kernel) is in same
>isolation domain, but now different guest has different
>isolation domain :(
>
>In one world, MSI is pretty important for direct IO, and
>SR-IOV is #1 usage in future. Just advocate more and wish more
>people can ack the SR-IOV patch from ZhaoYU so that we can see
>2.6,28 work for direct I/O without sacrificing sharing :)
>

Yes, irq sharing is the trickiest part, and it's hard to make it
architecturally clean. Besides the irq storm mentioned by Avi, driver
timeouts or device buffer overflows can also be subtly disturbed by the
guest sharing the irq, and guest inter-dependency can impact shared irq
handling too. If people do care about those issues, which the known irq
sharing approaches can't address, your recommendation makes sense.

Thanks
Kevin

RE: Remaining passthrough/VT-d tasks list

2008-09-27 Thread Tian, Kevin
>From:Avi Kivity
>Sent: September 27, 2008 17:50
>
>Yang, Sheng wrote:
>> After check host shared interrupts situation, I got a question here:
>>
>> If I understand correctly, current solution don't block host
>shared irq, just
>> come with the performance pentry. The penalty come with host
>disabled irq
>> line for a period. We have to wait guest to write EOI. But I
>fail to see the
>> correctness problem here (except a lot of spurious interrupt
>in the guest).
>>
>> I've checked mail, but can't find clue about that. Can you
>explain the
>> situation?
>>
>>
>
>If the guest fails to disable interrupts on a device that shares an
>interrupt line with the host, the host will experience an interrupt
>flood.  Eventually the host will disable the host device as well.
>

This issue also exists on the host side: one misbehaving driver can hurt
all other drivers sharing the same irq line. But there seems to be no good
way to avoid it. Since not all devices support MSI, we still need to
support irq sharing, possibly with the caveats given above.

The existing approach at least works with a sane guest driver, with some
performance penalty.

Or do you have a better alternative?

Thanks,
Kevin


RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-27 Thread Tian, Kevin
>From: Avi Kivity [mailto:[EMAIL PROTECTED] 
>Sent: July 27, 2008 16:27
>
>Tian, Kevin wrote:
>>> From: Darrick J. Wong
>>> Sent: July 16, 2008 7:18
>>>
>>> I envision four scenarios:
>>>
>>> 0. Guests that don't know about cpufreq still run at whatever 
>>> nice level
>>> they started with.
>>>
>>> 1. If we have a system with a lot of idle VMs, they will all 
>>> run with +5
>>> nice and this patch has no effect.
>>>
>>> 2. If we have a system with a lot of busy VMs, they all run 
>>> with -5 nice
>>> and this patch also has no effect.
>>>
>>> 3. If, however, we have a lot of idle VMs and a few busy 
>ones, then the
>>> -5 nice of the busy VMs will get those VMs extra CPU time.  
>On a really
>>> crummy FPU microbenchmark I have, the score goes from about 
>500 to 2000
>>> with the patch applied, though of course YMMV.  In some 
>respects this
>>> 
>>
>> How many VMs did you run in this test? All the VMs are idle except
>> the one where your benchmark runs?
>>
>> How about the actual effect when several VMs are doing some stuff?
>>
>> There's another scenario where some VMs don't support cpufreq while
>> others do. Here is it unfair to just renice the latter when 
>the former is
>> not 'nice' at all?
>>   
>
>I guess the solution for such issues is not to have kvm (or qemu) play
>with nice levels, but instead send notifications on virtual frequency
>changes on the qemu monitor. The management application can then choose
>whether to ignore the information, play with nice levels, or even
>propagate the frequency change to the host (useful in client-side
>virtualization).
>

Yes, that'd be more flexible and cleaner.

Thanks,
Kevin


RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-17 Thread Tian, Kevin
>From: Darrick J. Wong [mailto:[EMAIL PROTECTED] 
>Sent: July 18, 2008 3:05
>
>If there are multiple VMs that are busy, the busy ones will fight among
>themselves for CPU time.  I still see some priority boost, just not as
>much.

Some micro-level analysis would be useful here.

>
>I wonder how stable the virtual tsc is...?  Will have to study this.

My point is that exposing virtual freq states doesn't change whether the
virtual tsc is stable, since the interception logic for virtual freq change
requests only impacts nice. That's expected behavior.

Whether the virtual tsc is stable is a separate issue, outside the scope of
this feature.

>
>IDA has the same problem... the T61 BIOS "compensates" for this fakery
>by reporting a frequency of $max_freq + 1 so if you're smart 
>then you'll
>somehow know that you might see a boost that you can't measure. :P

It can be measured. One necessary requirement on any hardware-coordinated
logic is to provide some type of feedback mechanism. For example, Intel
processors provide the APERF/MPERF pair, with MPERF incremented in
proportion to a fixed boot frequency, while APERF increments in proportion
to actual performance. Software should use APERF/MPERF to determine the
actual freq over the elapsed sampling period.
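
Concretely, the effective frequency over a window is
base_freq * delta_APERF / delta_MPERF; a minimal host-side sketch (assuming
base_khz is the fixed reference frequency MPERF counts at; not code from
this patch) would be:

/* Sketch: effective frequency over a sampling window from APERF/MPERF. */
static unsigned int effective_freq_khz(unsigned int base_khz,
				       u64 *last_aperf, u64 *last_mperf)
{
	u64 aperf, mperf, da, dm;

	rdmsrl(MSR_IA32_APERF, aperf);
	rdmsrl(MSR_IA32_MPERF, mperf);

	da = aperf - *last_aperf;
	dm = mperf - *last_mperf;
	*last_aperf = aperf;
	*last_mperf = mperf;

	if (!dm)
		return base_khz;

	/* effective = base * delta_APERF / delta_MPERF */
	return (unsigned int)div64_u64(da * base_khz, dm);
}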

>
>I suppose the problem here is that p-states were designed on the
>assumption that you're directly manipulating hardware speeds, whereas
>what we really want in both this patch and IDA are qualitative values
>("medium speed", "highest speed", "ludicrous speed?")

It's still a bit different. 

For IDA, when ludicrous speed is requested it may be granted; when it's
not, the actual freq will still be the highest speed and never lower.

However, for this feature, how many cpu cycles are granted is not decided
by a single 'nice' value; it instead depends on the number of active vcpus
on a given cpu at a given time. Whatever speed is requested, whether medium,
highest or ludicrous, the granted cycles can vary anywhere from some minimum
(many vcpus contending) to 100% (only the current vcpu is active).

>On the other hand, if you get the same performance  at both 
>high and low
>speeds, then it doesn't really matter which one you choose.  At least
>not until the load changes.  I suppose the next question is, how much
>software is dependent on knowing the exact CPU frequency, and are
>workload schedulers smart enough to realize that performance
>characteristics can change over time (throttling, TM1/TM2, etc)?
>Inasmuch as you actually ever know, since with hardware coordination of
>cpufreq the hardware can do whatever it wants.

Throttling and TM1/TM2 are related to thermal events when some threshold
is reached. Here let's focus on DBS (Demand Based Switching), which is
actively conducted by OSPM based on workload estimation. A typical freq
demotion flow looks like:

	if (PercentBusy * Pc/Pn) < threshold
		switch Pc to Pn;

Here PercentBusy represents the CPU utilization in the elapsed sampling
period, Pc stands for the freq used in that period, and Pn is the candidate
lower freq. If the freq change can still keep CPU utilization under the
predefined threshold, the transition is viable.
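
In code form, the demotion test is roughly the following (just a sketch of
the check above, close to what an ondemand-style governor does; the
percentages and kHz values are illustrative):

/* Sketch of the DBS demotion check described above. */
static bool should_demote(unsigned int percent_busy,	/* load in last period */
			  unsigned int freq_cur,	/* Pc, in kHz */
			  unsigned int freq_next,	/* candidate Pn < Pc */
			  unsigned int threshold)	/* e.g. 80 (%) */
{
	/* Scale the measured load to what it would be at the lower speed;
	 * demote only if it would still stay below the threshold. */
	return (percent_busy * freq_cur) / freq_next < threshold;
}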

Here the key point is PercentBusy and Pc, which may make the final decision
pointless if they are inaccurate. That's why hardware coordination logic is
required to provide some feedback from which Pc can be obtained.

I agree that the guest should eventually be able to catch up if a wrong
decision leaves its workload restrained or over-granted. E.g., when there's
only one vcpu active on a pcpu and it requests a medium speed, 100% of the
cycles are granted, making it think that medium speed is enough for its
current workload. Later, when other vcpus become active on the same pcpu,
its granted cycles shrink and it may realize that medium speed is not
enough, then request the highest speed, which may add back some cycles via
a lower nice value.

But it's better to do some micro-level analysis to understand whether it
works as expected and, more importantly, how fast this catch-up is. Note
that the guest checks for freq changes at roughly the 20ms level, and we
need to make sure no thrashing is caused that messes up both guest and host.

Actually, another concern just raised is the measurement of PercentBusy.
Take Linux for example: it normally subtracts idle time from elapsed time.
If the guest doesn't understand steal time, PercentBusy may not reflect the
facts at all. For example, say a vcpu is continuously busy for 10ms, then
happens to enter the idle loop and is scheduled out for 20ms. The next time
it is re-scheduled in, its dbs timer will compute PercentBusy as 33.3%,
although it was actually fully busy. However, when the vcpu is scheduled out
outside of the idle loop, that steal time is counted as busy. I'm still not
clear how this may affect this patch...
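
To make the arithmetic concrete, here is a tiny sketch (a hypothetical
helper, not from any patch) contrasting a steal-time-unaware guest with a
steal-time-aware one for the 10ms-busy / 20ms-stolen example above:

/* percent_busy(10, 0, 20, false) == 33, percent_busy(10, 0, 20, true) == 100 */
static unsigned int percent_busy(unsigned int busy_ms, unsigned int idle_ms,
				 unsigned int steal_ms, bool steal_aware)
{
	unsigned int wall_ms = busy_ms + idle_ms + steal_ms;
	/* A steal-aware guest measures load only against the time it actually
	 * had the cpu; an unaware one counts stolen time as idle. */
	unsigned int avail_ms = steal_aware ? wall_ms - steal_ms : wall_ms;

	return avail_ms ? (busy_ms * 100) / avail_ms : 0;
}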

>
>I don't think it's easily deduced.  I also don't think 
>APERF/MPERF are emulated in
>kvm yet.  I suppose it wouldn't be difficult to add those two, though
>measuring that might be a bit messy.
>
>Maybe t

RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority

2008-07-15 Thread Tian, Kevin
>From: Darrick J. Wong
>Sent: July 16, 2008 7:18
>
>I envision four scenarios:
>
>0. Guests that don't know about cpufreq still run at whatever 
>nice level
>they started with.
>
>1. If we have a system with a lot of idle VMs, they will all 
>run with +5
>nice and this patch has no effect.
>
>2. If we have a system with a lot of busy VMs, they all run 
>with -5 nice
>and this patch also has no effect.
>
>3. If, however, we have a lot of idle VMs and a few busy ones, then the
>-5 nice of the busy VMs will get those VMs extra CPU time.  On a really
>crummy FPU microbenchmark I have, the score goes from about 500 to 2000
>with the patch applied, though of course YMMV.  In some respects this

How many VMs did you run in this test? Were all the VMs idle except the one
where your benchmark runs?

What about the actual effect when several VMs are doing some work?

There's another scenario where some VMs don't support cpufreq while others
do. Is it unfair there to renice only the latter when the former is not
being 'nice' at all?

I guess this feature has to be applied with some qualifications, e.g. to a
group of VMs with the same known PM abilities...

>
>There are some warts to this patch--most notably, the current
>implementation uses the Intel MSRs and EST feature flag ... even if the
>guest reports the CPU as being AuthenticAMD.  Also, there could be
>timing problems introduced by this change--the OS thinks the CPU
>frequency changes, but I don't know the effect on the guest CPU TSCs.

You can report the constant-tsc feature via cpuid virtualization. Of course,
if the physical TSC is unstable, it's another story how to mark the guest
TSC as untrustworthy (e.g. Marcelo developed one method by simulating C2).

>
>Control values are as as follows:
>0: Nobody's touched cpufreq.  nice is the whatever the default is.
>1: Lowest speed.  nice +5.
>2: Medium speed.  nice is reset.
>3: High speed.  nice -5.

This description seems to mismatch the implementation, which pushes +10 and
-10 for cases 1 and 3. Maybe I misinterpret the code?

One interesting point is the initial value of the PERF_CTL MSR. The current
'zero' doesn't reflect a meaningful state to the guest, since there's no
perf entry in the ACPI table to carry such a value. One likely result is
that the guest would think the current freq is 0 when initializing the ACPI
cpufreq driver. So it would make more sense to set the initial value to
2 (P1), keeping the default nice value, or even 3 (P0), if you treat that
state as IDA-style, which may over-clock but gives no guarantee.

More critical points to think further about, if we expect this feature to
see real use, are the definition of the exposed virtual freq states and how
these states map to scheduler knobs. Inappropriate exposure may cause the
guest to bounce excessively between virtual freq points. For example, the
'nice' value is only a relative hint to the scheduler, and there's no
guarantee that the portion of cpu cycles added matches the 'nice' change.
There's even the case where the guest requests the lowest speed while the
cpu cycles actually allocated to it stay similar to the last epoch when it
ran at high speed. This will fool the guest into thinking the lowest speed
satisfies its requirement. It's similar to the requirement on core-based
hardware coordination logic, where some feedback mechanism (e.g. the
APERF/MPERF MSR pair) is required to reveal the actual freq in the last
sampling period. Here the VM case may need a similar virtualized feedback
mechanism; I'm not sure whether the 'actual' freq is easily deduced,
however.

It would be worthwhile to compare the freq change count for the same
benchmark between VM and native, and, more interestingly, to look at the
effect when multiple VMs all use this feature. For example, is the expected
effect counteracted, with only overhead added? Do any strange behaviors
show up, given that a real 'nice' value wouldn't be changed so frequently,
at the dozens-of-ms level? :-)

Thanks,
Kevin

