Re: [PATCH v3 4/5] Sysfs: Export VMCSINFO via sysfs

2012-06-28 Thread Greg KH
On Thu, Jun 28, 2012 at 04:37:38AM -0700, Greg KH wrote:
> On Thu, Jun 28, 2012 at 05:54:30PM +0800, Yanfei Zhang wrote:
> > On 2012-06-28 03:22, Greg KH wrote:
> > > On Wed, Jun 27, 2012 at 04:54:54PM +0800, Yanfei Zhang wrote:
> > >> This patch exports offsets of fields via /sys/devices/cpu/vmcs/.
> > >> Individual offsets are contained in subfiles named by the field's
> > >> encoding, e.g.: /sys/devices/cpu/vmcs/0800
> > >>
> > >> Signed-off-by: zhangyanfei 
> > >> ---
> > >>  drivers/base/core.c |   13 +
> > >>  1 files changed, 13 insertions(+), 0 deletions(-)
> > >>
> > >> diff --git a/drivers/base/core.c b/drivers/base/core.c
> > >> index 346be8b..dd05ee7 100644
> > >> --- a/drivers/base/core.c
> > >> +++ b/drivers/base/core.c
> > >> @@ -26,6 +26,7 @@
> > >>  #include 
> > >>  #include 
> > >>  #include 
> > >> +#include 
> > > 
> > > Did you just break the build on all other arches?  Not nice.
> > > 
> > >> @@ -1038,6 +1039,11 @@ int device_add(struct device *dev)
> > >>  error = dpm_sysfs_add(dev);
> > >>  if (error)
> > >>  goto DPMError;
> > >> +#if defined(CONFIG_KVM_INTEL) || defined(CONFIG_KVM_INTEL_MODULE)
> > >> +error = vmcs_sysfs_add(dev);
> > >> +if (error)
> > >> +goto VMCSError;
> > >> +#endif
> > > 
> > > Oh my no, that's no way to ever do this, you know better than that,
> > > please fix.
> > > 
> > > greg k-h
> > > 
> > 
> > Sorry for my thoughtlessness. Here is the new patch.
> > 
> > ---
> >  drivers/base/core.c |   13 +
> >  1 files changed, 13 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index 346be8b..7b5266a 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -30,6 +30,13 @@
> >  #include "base.h"
> >  #include "power/power.h"
> >  
> > +#if defined(CONFIG_KVM_INTEL) || defined(CONFIG_KVM_INTEL_MODULE)
> > +#include 
> > +#else
> > +static inline int vmcs_sysfs_add(struct device *dev) { return 0; }
> > +static inline void vmcs_sysfs_remove(struct device *dev) { }
> > +#endif
> 
> {sigh}  No, again, you know better, don't do this.

Ok, as others have rightly pointed out, this wasn't the most helpful
review comment, sorry about that.

In Linux, we don't put ifdefs in .c files, we put them in .h files.  See
many examples of this all over the place.  That's been my main complaint
the past two times with this patch.
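
For illustration, the idiom looks roughly like this (a minimal sketch
reusing the declarations from the patch; the header name is up to the
author):

    /* in a header, e.g. asm/vmcsinfo.h */
    #if defined(CONFIG_KVM_INTEL) || defined(CONFIG_KVM_INTEL_MODULE)
    int vmcs_sysfs_add(struct device *dev);
    void vmcs_sysfs_remove(struct device *dev);
    #else
    static inline int vmcs_sysfs_add(struct device *dev) { return 0; }
    static inline void vmcs_sysfs_remove(struct device *dev) { }
    #endif

    /* drivers/base/core.c then includes that header and calls
     * vmcs_sysfs_add(dev) unconditionally, with no #ifdef at the call site. */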

But, for this, I would question why you even want / need to do this in
the drivers/base/core.c file in the first place.  Shouldn't it be in some
arch- or cpu-specific file instead that already handles the cpu files?

thanks,

greg k-h


RE: [PATCH v4] KVM: x86: Implement PCID/INVPCID for guests with EPT

2012-06-28 Thread Mao, Junjie
> -Original Message-
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Thursday, June 28, 2012 11:49 PM
> To: Mao, Junjie
> Cc: 'kvm@vger.kernel.org'
> Subject: Re: [PATCH v4] KVM: x86: Implement PCID/INVPCID for guests with
> EPT
> 
> On 06/14/2012 05:04 AM, Mao, Junjie wrote:
> > This patch handles PCID/INVPCID for guests.
> >
> > Process-context identifiers (PCIDs) are a facility by which a logical
> > processor may cache information for multiple linear-address spaces so
> > that the processor may retain cached information when software
> > switches to a different linear address space. Refer to section 4.10.1
> > in IA32 Intel Software Developer's Manual Volume 3A for details.
> >
> > For guests with EPT, the PCID feature is enabled and INVPCID behaves
> > as it does when running natively.
> > For guests without EPT, the PCID feature is disabled and INVPCID triggers
> #UD.
> >
> >
> >  #endif
> > unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
> > +   unsigned f_pcid = boot_cpu_has(X86_FEATURE_PCID) ? F(PCID) : 0;
> 
> Unneeded, just put F(PCID) below.

Understood. Thanks for your suggestion.

> 
> > +   unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) :
> > +0;
> >
> > /* cpuid 1.edx */
> > const u32 kvm_supported_word0_x86_features = @@ -228,7 +230,7 @@
> > static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> > 0 /* DS-CPL, VMX, SMX, EST */ |
> > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
> > F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> > -   0 /* Reserved, DCA */ | F(XMM4_1) |
> > +   f_pcid | 0 /* Reserved, DCA */ | F(XMM4_1) |
> 
> 
> > @@ -1739,6 +1745,11 @@ static bool vmx_rdtscp_supported(void)
> > return cpu_has_vmx_rdtscp();
> >  }
> >
> > +static bool vmx_invpcid_supported(void) {
> > +   return cpu_has_vmx_invpcid();
> 
> Should that have && ept_enabled?  You turn off invpcid below if !ept_enabled.
> 
> > +}
> > +
> >  /*
> >   * Swap MSR entry in host/guest MSR entry array.
> >   */
> > @@ -2458,7 +2469,8 @@ static __init int setup_vmcs_config(struct
> vmcs_config *vmcs_conf)
> > SECONDARY_EXEC_ENABLE_EPT |
> > SECONDARY_EXEC_UNRESTRICTED_GUEST |
> > SECONDARY_EXEC_PAUSE_LOOP_EXITING |
> > -   SECONDARY_EXEC_RDTSCP;
> > +   SECONDARY_EXEC_RDTSCP |
> > +   SECONDARY_EXEC_ENABLE_INVPCID;
> > if (adjust_vmx_controls(min2, opt2,
> > MSR_IA32_VMX_PROCBASED_CTLS2,
> > &_cpu_based_2nd_exec_control) < 0) @@ 
> > -3731,6
> +3743,8 @@ static
> > u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
> > if (!enable_ept) {
> > exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
> > enable_unrestricted_guest = 0;
> > > +   /* Enabling INVPCID for non-EPT guests may cause a performance
> regression. */
> > +   exec_control &= ~SECONDARY_EXEC_ENABLE_INVPCID;
> > }
> > if (!enable_unrestricted_guest)
> > exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
> 
> This code is unneeded..
> 
> > @@ -6467,6 +6481,23 @@ static void vmx_cpuid_update(struct kvm_vcpu
> *vcpu)
> > }
> > }
> > }
> > +
> > +   if (vmx_invpcid_supported()) {
> > +   exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> > +   /* Exposing INVPCID only when PCID is exposed */
> > +   best = kvm_find_cpuid_entry(vcpu, 0x7, 0);
> > +   if (best && (best->ecx & bit(X86_FEATURE_INVPCID)) &&
> guest_cpuid_has_pcid(vcpu)) {
> > +   exec_control |= SECONDARY_EXEC_ENABLE_INVPCID;
> > +   vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
> > +exec_control);
> > +   } else {
> > +   exec_control &= ~SECONDARY_EXEC_ENABLE_INVPCID;
> > +   vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
> > +exec_control);
> > +   if (best)
> > +   best->ecx &= ~bit(X86_FEATURE_INVPCID);
> > +   }
> > +   }
> >  }
> 
> Since you override it here anyway.  But maybe it's needed if the host never
> calls KVM_SET_CPUID.

The code in vmx_secondary_exec_control may be necessary in the situation you
mentioned. The missing enable_ept check in vmx_invpcid_supported will make the
guest think it can enable INVPCID support when the platform supports it but
enable_ept is not set, even though it actually cannot. I'll fix that in the
next version.
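
A minimal sketch of the fix being described, using the enable_ept flag
that vmx.c already has:

    static bool vmx_invpcid_supported(void)
    {
            return cpu_has_vmx_invpcid() && enable_ept;
    }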

> 
> >
> >  static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2
> > *entry) @@ -6610,6 +6641,9 @@ static void prepare_vmcs02(struct
> kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
> >   page_to_phys(vmx->nested.apic_access_page));
> > }
> >
> > +   /* Explicitly disab

Re: [RFC 07/10] KVM: add KVM TMEM host side interface

2012-06-28 Thread Konrad Rzeszutek Wilk
On Wed, Jun 06, 2012 at 01:00:15PM +0200, Sasha Levin wrote:
> This is the host-side interface that guests which support KVM TMEM
> talk to.
> 
> Signed-off-by: Sasha Levin 
> ---
>  arch/x86/kvm/tmem/Kconfig|6 +++
>  arch/x86/kvm/tmem/Makefile   |2 +
>  arch/x86/kvm/tmem/host.c |   78 
> ++
>  arch/x86/kvm/tmem/host.h |   20 +
>  arch/x86/kvm/x86.c   |8 +---
>  drivers/staging/zcache/zcache-main.c |   35 +++-
>  6 files changed, 141 insertions(+), 8 deletions(-)
>  create mode 100644 arch/x86/kvm/tmem/host.c
>  create mode 100644 arch/x86/kvm/tmem/host.h
> 
> diff --git a/arch/x86/kvm/tmem/Kconfig b/arch/x86/kvm/tmem/Kconfig
> index 15d8301..1a59e4f 100644
> --- a/arch/x86/kvm/tmem/Kconfig
> +++ b/arch/x86/kvm/tmem/Kconfig
> @@ -13,4 +13,10 @@ menuconfig KVM_TMEM
>  
>  if KVM_TMEM
>  
> +config KVM_TMEM_HOST
> + bool "Host-side KVM TMEM"
> + ---help---
> + With this option on, the KVM host will be able to process KVM TMEM 
> requests
> + coming from guests.
> +
>  endif # KVM_TMEM
> diff --git a/arch/x86/kvm/tmem/Makefile b/arch/x86/kvm/tmem/Makefile
> index 6812d46..706cd36 100644
> --- a/arch/x86/kvm/tmem/Makefile
> +++ b/arch/x86/kvm/tmem/Makefile
> @@ -1 +1,3 @@
>  ccflags-y += -Idrivers/staging/zcache/
> +
> +obj-$(CONFIG_KVM_TMEM_HOST)  += host.o
> diff --git a/arch/x86/kvm/tmem/host.c b/arch/x86/kvm/tmem/host.c
> new file mode 100644
> index 000..9e73395
> --- /dev/null
> +++ b/arch/x86/kvm/tmem/host.c
> @@ -0,0 +1,78 @@
> +/*
> + * KVM TMEM host side interface
> + *
> + * Copyright (c) 2012 Sasha Levin
> + *
> + */
> +
> +#include 
> +#include 
> +
> +#include "tmem.h"
> +#include 
> +
> +int use_kvm_tmem_host = 1;

__read_mostly and bool
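
That is, something like:

    static bool use_kvm_tmem_host __read_mostly = true;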

> +
> +static int no_kvmtmemhost(char *s)
> +{
> + use_kvm_tmem_host = 0;
> + return 1;
> +}
> +
> +__setup("nokvmtmemhost", no_kvmtmemhost);
> +
> +int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned long *ret)
> +{
> + struct tmem_kvm_op op;
> + struct page *page;
> + int r;
> + unsigned long flags;
> +
> + if (!use_kvm_tmem_host || !zcache_enabled) {
> + *ret = -ENXIO;
> + return 0;
> + }
> +
> + r = kvm_read_guest(vcpu->kvm, addr, &op, sizeof(op));
> + if (r < 0) {
> + *ret = r;
> + return 0;

Shouldn't this return r?

> + }
> +
> + switch (op.cmd) {
> + case TMEM_NEW_POOL:
> + *ret = zcache_new_pool(op.cli_id, op.u.new.flags);
> + break;
> + case TMEM_DESTROY_POOL:
> + *ret = zcache_destroy_pool(op.cli_id, op.pool_id);
> + break;
> + case TMEM_NEW_PAGE:
> + break;
> + case TMEM_PUT_PAGE:
> + page = gfn_to_page(vcpu->kvm, op.u.gen.gfn);
> + local_irq_save(flags);
> + *ret = zcache_put_page(op.cli_id, op.pool_id,
> + &op.u.gen.oid, op.u.gen.index, page);
> + local_irq_restore(flags);
> + break;
> + case TMEM_GET_PAGE:
> + page = gfn_to_page(vcpu->kvm, op.u.gen.gfn);
> + local_irq_save(flags);
> + *ret = zcache_get_page(op.cli_id, op.pool_id,
> + &op.u.gen.oid, op.u.gen.index, page);
> + local_irq_restore(flags);
> + break;
> + case TMEM_FLUSH_PAGE:
> + local_irq_save(flags);
> + *ret = zcache_flush_page(op.cli_id, op.pool_id,
> + &op.u.gen.oid, op.u.gen.index);
> + local_irq_restore(flags);
> + break;
> + case TMEM_FLUSH_OBJECT:
> + local_irq_save(flags);
> + *ret = zcache_flush_object(op.cli_id, op.pool_id, 
> &op.u.gen.oid);
> + local_irq_restore(flags);
> + break;
> + }
> + return 0;
> +}
> diff --git a/arch/x86/kvm/tmem/host.h b/arch/x86/kvm/tmem/host.h
> new file mode 100644
> index 000..17ba0c4
> --- /dev/null
> +++ b/arch/x86/kvm/tmem/host.h
> @@ -0,0 +1,20 @@
> +#ifndef _KVM_TMEM_HOST_H_
> +#define _KVM_TMEM_HOST_H_
> +
> +#ifdef CONFIG_KVM_TMEM_HOST
> +
> +extern int use_kvm_tmem_host;
> +
> +extern int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned long 
> *ret);
> +
> +#else
> +
> +static inline int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t addr, unsigned 
> long *ret)
> +{
> + *ret = -ENOSUPP;
> + return 0;
> +}
> +
> +#endif
> +
> +#endif
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4c5b6ab..c92d4c8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -27,6 +27,7 @@
>  #include "kvm_cache_regs.h"
>  #include "x86.h"
>  #include "cpuid.h"
> +#include "tmem/host.h"
>  
>  #include 
>  #include 
> @@ -4993,13 +4994,6 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>   return 1;
>  }
>  
> -static int kvm_pv_tmem_op(struct kvm_vcpu *vcpu, gpa_t

Re: [PATCH v3 1/5] x86: Add helper variables and functions to hold VMCSINFO

2012-06-28 Thread HATAYAMA Daisuke
From: Yanfei Zhang 
Subject: [PATCH v3 1/5] x86: Add helper variables and functions to hold VMCSINFO
Date: Wed, 27 Jun 2012 16:51:58 +0800

> This patch provides a set of variables to hold the VMCSINFO and also
> some helper functions to help fill the VMCSINFO.
> 
> Signed-off-by: zhangyanfei 
> ---
>  arch/x86/include/asm/vmcsinfo.h |  219 ++
>  arch/x86/include/asm/vmx.h  |  158 +
>  arch/x86/kernel/Makefile|1 +
>  arch/x86/kernel/vmcsinfo.c  |  381 
> +++
>  4 files changed, 603 insertions(+), 156 deletions(-)
>  create mode 100644 arch/x86/include/asm/vmcsinfo.h
>  create mode 100644 arch/x86/kernel/vmcsinfo.c
> 
> diff --git a/arch/x86/include/asm/vmcsinfo.h b/arch/x86/include/asm/vmcsinfo.h
> new file mode 100644
> index 000..4b9f56b
> --- /dev/null
> +++ b/arch/x86/include/asm/vmcsinfo.h
> @@ -0,0 +1,219 @@
> +#ifndef _ASM_X86_VMCSINFO_H
> +#define _ASM_X86_VMCSINFO_H
> +
> +#ifndef __ASSEMBLY__
> +#include 
> +#include 
> +#include 
> +
> +/* VMCS Encodings */
> +enum vmcs_field {
> + VIRTUAL_PROCESSOR_ID= 0x,



> + HOST_RIP= 0x6c16,
> +};
> +
> +/*
> + * vmcs field offsets.
> + */
> +struct vmcsinfo {
> + u32 vmcs_revision_id;
> + int filled;
> + u16 vmcs_field_to_offset_table[HOST_RIP + 1];

HOST_RIP is so large that this array becomes large. Also, there are
unused elements in this array because the field encodings are not
contiguous.

Instead, how about defining the number of vmcs fields, 152?, as a
specific constant, indexing each field using integers, and newly
preparing an index_to_field_table[]?
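
A rough sketch of the suggested layout (the constant value and the table
contents here are illustrative, not a complete field list):

    #define NR_VMCS_FIELDS 152

    /* dense index -> sparse VMCS field encoding */
    static const u32 index_to_field_table[NR_VMCS_FIELDS] = {
            VIRTUAL_PROCESSOR_ID,
            /* ... remaining field encodings in order ... */
            HOST_RIP,
    };

    struct vmcsinfo {
            u32 vmcs_revision_id;
            int filled;
            u16 field_offsets[NR_VMCS_FIELDS]; /* indexed 0..NR_VMCS_FIELDS-1 */
    };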

Thanks.
HATAYAMA, Daisuke



Re: [PATCH 6/6] KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()

2012-06-28 Thread Takuya Yoshikawa
On Thu, 28 Jun 2012 20:53:47 +0300
Avi Kivity  wrote:

> > Note: in the new code we could not use trace_kvm_age_page(), so we just
> > dropped the tracepoint from kvm_handle_hva_range().
> > 
> 
> Can't it be pushed to handler()?

Yes, but it will be changed to print rmap, not hva and gfn.
I will do that in the next version.

Thanks,
Takuya


Re: [PATCH v8 07/15] ARM: KVM: Hypervisor initialization

2012-06-28 Thread Marcelo Tosatti
On Thu, Jun 28, 2012 at 06:53:43PM -0400, Christoffer Dall wrote:
> > should assign per_cpu(kvm_arm_hyp_stack_page, cpu) to NULL.
> >
> 
> why? This is run as part of the init code, and thus the only way it
> could ever run again would be to have the module unloaded and reloaded,
> in which case the variable would be re-initialized to zero as per the
> static declaration, no?

Right.



Re: [Qemu-devel] [PATCH] kvm: First step to push iothread lock out of inner run loop

2012-06-28 Thread Marcelo Tosatti
On Thu, Jun 28, 2012 at 09:10:39AM -0500, Anthony Liguori wrote:
> >
> >1.  read_lock(memmap_lock)
> >2.  MemoryRegionSection mrs = lookup(addr)
> >3.  qom_ref(mrs.mr->dev)
> >4.  read_unlock(memmap_lock)
> >
> >5.  mutex_lock(dev->lock)
> >6.  dispatch(&mrs, addr, data, size)
> >7.  mutex_unlock(dev->lock)
> 
> Just a detail, I don't think we should acquire a device specific
> lock in global code.  Rather, I think we should acquire the global
> lock before dispatch unless a MemoryRegion is marked as being
> unlocked.

"The basic plan is introduce granular locking starting at the KVM
dispatch level until we can get to MemoryRegion dispatch. We'll then
have some way to indicate that a MemoryRegion's callbacks should be
invoked without holding the qemu global mutex."

Before that is possible, the callback must not make use of data
structures currently protected by qemu_global_mutex, such as
timers and interrupts (that is, you would have to split locks
for each individual service invoked from inside callbacks,
which is a recipe for disaster).

With lock_device() below you can have

mutex_lock(dev->lock)
device specific work
mutex_lock(qemu_global_mutex)
raise irq, send packet, etc
mutex_unlock(qemu_global_mutex)
mutex_unlock(dev->lock)

and iothread doing the select() pseudocode in the previous email.
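
As a sketch, that nesting could look like this inside a device handler
(Device, update_device_registers and needs_irq are illustrative names,
not existing QEMU APIs):

    static void device_mmio_write(Device *dev, hwaddr addr, uint64_t val)
    {
            qemu_mutex_lock(&dev->lock);
            update_device_registers(dev, addr, val); /* device-specific work */

            if (needs_irq(dev)) {
                    qemu_mutex_lock_iothread();  /* the global mutex */
                    qemu_set_irq(dev->irq, 1);   /* raise irq, send packet, etc. */
                    qemu_mutex_unlock_iothread();
            }
            qemu_mutex_unlock(&dev->lock);
    }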



Re: [Qemu-devel] [PATCH] kvm: First step to push iothread lock out of inner run loop

2012-06-28 Thread Marcelo Tosatti
On Thu, Jun 28, 2012 at 09:10:39AM -0500, Anthony Liguori wrote:
> On 06/26/2012 02:34 PM, Marcelo Tosatti wrote:
> >On Sat, Jun 23, 2012 at 12:55:49AM +0200, Jan Kiszka wrote:
> >>Should have declared this [RFC] in the subject and CC'ed kvm...
> >>
> >>On 2012-06-23 00:45, Jan Kiszka wrote:
> >>>This sketches a possible path to get rid of the iothread lock on vmexits
> >>>in KVM mode. On x86, the in-kernel irqchip has to be used because
> >>>we otherwise need to synchronize APIC and other per-cpu state accesses
> >>>that could be changed concurrently. Not yet fully analyzed is the NMI
> >>>injection path in the absence of an APIC.
> >>>
> >>>s390x should be fine without specific locking as their pre/post-run
> >>>callbacks are empty. Power requires locking for the pre-run callback.
> >>>
> >>>This patch is untested, but a similar version was successfully used in
> >>>a x86 setup with a network I/O path that needed no central iothread
> >>>locking anymore (required special MMIO exit handling).
> >>>---
> >>>  kvm-all.c |   18 --
> >>>  target-i386/kvm.c |7 +++
> >>>  target-ppc/kvm.c  |4 
> >>>  3 files changed, 27 insertions(+), 2 deletions(-)
> >>>
> >>>diff --git a/kvm-all.c b/kvm-all.c
> >>>index f8e4328..9c3e26f 100644
> >>>--- a/kvm-all.c
> >>>+++ b/kvm-all.c
> >>>@@ -1460,6 +1460,8 @@ int kvm_cpu_exec(CPUArchState *env)
> >>>  return EXCP_HLT;
> >>>  }
> >>>
> >>>+qemu_mutex_unlock_iothread();
> >>>+
> >>>  do {
> >>>  if (env->kvm_vcpu_dirty) {
> >>>  kvm_arch_put_registers(env, KVM_PUT_RUNTIME_STATE);
> >>>@@ -1476,14 +1478,16 @@ int kvm_cpu_exec(CPUArchState *env)
> >>>   */
> >>>  qemu_cpu_kick_self();
> >>>  }
> >>>-qemu_mutex_unlock_iothread();
> >>>
> >>>  run_ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);
> >>>
> >>>-qemu_mutex_lock_iothread();
> >>>  kvm_arch_post_run(env, run);
> >>>
> >>>+/* TODO: push coalesced mmio flushing to the point where we access
> >>>+ * devices that are using it (currently VGA and E1000). */
> >>>+qemu_mutex_lock_iothread();
> >>>  kvm_flush_coalesced_mmio_buffer();
> >>>+qemu_mutex_unlock_iothread();
> >>>
> >>>  if (run_ret<  0) {
> >>>  if (run_ret == -EINTR || run_ret == -EAGAIN) {
> >>>@@ -1499,19 +1503,23 @@ int kvm_cpu_exec(CPUArchState *env)
> >>>  switch (run->exit_reason) {
> >>>  case KVM_EXIT_IO:
> >>>  DPRINTF("handle_io\n");
> >>>+qemu_mutex_lock_iothread();
> >>>  kvm_handle_io(run->io.port,
> >>>(uint8_t *)run + run->io.data_offset,
> >>>run->io.direction,
> >>>run->io.size,
> >>>run->io.count);
> >>>+qemu_mutex_unlock_iothread();
> >>>  ret = 0;
> >>>  break;
> >>>  case KVM_EXIT_MMIO:
> >>>  DPRINTF("handle_mmio\n");
> >>>+qemu_mutex_lock_iothread();
> >>>  cpu_physical_memory_rw(run->mmio.phys_addr,
> >>> run->mmio.data,
> >>> run->mmio.len,
> >>> run->mmio.is_write);
> >>>+qemu_mutex_unlock_iothread();
> >>>  ret = 0;
> >>>  break;
> >>>  case KVM_EXIT_IRQ_WINDOW_OPEN:
> >>>@@ -1520,7 +1528,9 @@ int kvm_cpu_exec(CPUArchState *env)
> >>>  break;
> >>>  case KVM_EXIT_SHUTDOWN:
> >>>  DPRINTF("shutdown\n");
> >>>+qemu_mutex_lock_iothread();
> >>>  qemu_system_reset_request();
> >>>+qemu_mutex_unlock_iothread();
> >>>  ret = EXCP_INTERRUPT;
> >>>  break;
> >>>  case KVM_EXIT_UNKNOWN:
> >>>@@ -1533,11 +1543,15 @@ int kvm_cpu_exec(CPUArchState *env)
> >>>  break;
> >>>  default:
> >>>  DPRINTF("kvm_arch_handle_exit\n");
> >>>+qemu_mutex_lock_iothread();
> >>>  ret = kvm_arch_handle_exit(env, run);
> >>>+qemu_mutex_unlock_iothread();
> >>>  break;
> >>>  }
> >>>  } while (ret == 0);
> >>>
> >>>+qemu_mutex_lock_iothread();
> >>>+
> >>>  if (ret<  0) {
> >>>  cpu_dump_state(env, stderr, fprintf, CPU_DUMP_CODE);
> >>>  vm_stop(RUN_STATE_INTERNAL_ERROR);
> >>>diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> >>>index 0d0d8f6..0ad64d1 100644
> >>>--- a/target-i386/kvm.c
> >>>+++ b/target-i386/kvm.c
> >>>@@ -1631,7 +1631,10 @@ void kvm_arch_pre_run(CPUX86State *env, struct 
> >>>kvm_run *run)
> >>>
> >>>  /* Inject NMI */
> >>>  if (env->interrupt_request&  CPU_INTERRUPT_NMI) {
> >>>+qemu_mutex_lock_iothread();
> >>>  env->interrupt_request&= ~CPU_INTERRUPT_NMI;
> >>>+qemu_mutex_unlock_iothread();
> >>>+
> >>>  DPRI

Re: [PATCHv2 1/5] Provide userspace IO exit completion callback.

2012-06-28 Thread Marcelo Tosatti
On Tue, Jun 12, 2012 at 03:01:23PM +0300, Gleb Natapov wrote:
> Current code assumes that IO exit was due to instruction emulation
> and handles execution back to emulator directly. This patch adds new
> userspace IO exit completion callback that can be set by any other code
> that caused IO exit to userspace.
> 
> Signed-off-by: Gleb Natapov 
> ---
>  arch/x86/include/asm/kvm_host.h |1 +
>  arch/x86/kvm/x86.c  |   92 
> +++
>  2 files changed, 56 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index db7c1f2..1a1bba6 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -406,6 +406,7 @@ struct kvm_vcpu_arch {
>   struct x86_emulate_ctxt emulate_ctxt;
>   bool emulate_regs_need_sync_to_vcpu;
>   bool emulate_regs_need_sync_from_vcpu;
> + int (*complete_userspace_io)(struct kvm_vcpu *vcpu);
>  
>   gpa_t time;
>   struct pvclock_vcpu_time_info hv_clock;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a01a424..6fa0e21 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4547,6 +4547,9 @@ static bool retry_instruction(struct x86_emulate_ctxt 
> *ctxt,
>   return true;
>  }
>  
> +static int complete_emulated_mmio(struct kvm_vcpu *vcpu);
> +static int complete_emulated_pio(struct kvm_vcpu *vcpu);
> +
>  int x86_emulate_instruction(struct kvm_vcpu *vcpu,
>   unsigned long cr2,
>   int emulation_type,
> @@ -4617,13 +4620,16 @@ restart:
>   } else if (vcpu->arch.pio.count) {
>   if (!vcpu->arch.pio.in)
>   vcpu->arch.pio.count = 0;
> - else
> + else {
>   writeback = false;
> + vcpu->arch.complete_userspace_io = 
> complete_emulated_pio;
> + }
>   r = EMULATE_DO_MMIO;
>   } else if (vcpu->mmio_needed) {
>   if (!vcpu->mmio_is_write)
>   writeback = false;
>   r = EMULATE_DO_MMIO;
> + vcpu->arch.complete_userspace_io = complete_emulated_mmio;
>   } else if (r == EMULATION_RESTART)
>   goto restart;
>   else
> @@ -5474,6 +5480,24 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>   return r;
>  }
>  
> +static inline int complete_emulated_io(struct kvm_vcpu *vcpu)
> +{
> + int r;
> + vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
> + srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + if (r != EMULATE_DONE)
> + return 0;
> + return 1;
> +}
> +
> +static int complete_emulated_pio(struct kvm_vcpu *vcpu)
> +{
> + BUG_ON(!vcpu->arch.pio.count);
> +
> + return complete_emulated_io(vcpu);
> +}
> +
>  /*
>   * Implements the following, as a state machine:
>   *
> @@ -5490,47 +5514,37 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
>   *  copy data
>   *  exit
>   */
> -static int complete_mmio(struct kvm_vcpu *vcpu)
> +static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
>  {
>   struct kvm_run *run = vcpu->run;
>   struct kvm_mmio_fragment *frag;
> - int r;
>  
> - if (!(vcpu->arch.pio.count || vcpu->mmio_needed))
> - return 1;
> + BUG_ON(!vcpu->mmio_needed);
>  
> - if (vcpu->mmio_needed) {
> - /* Complete previous fragment */
> - frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment++];
> - if (!vcpu->mmio_is_write)
> - memcpy(frag->data, run->mmio.data, frag->len);
> - if (vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments) {
> - vcpu->mmio_needed = 0;
> - if (vcpu->mmio_is_write)
> - return 1;
> - vcpu->mmio_read_completed = 1;
> - goto done;
> - }
> - /* Initiate next fragment */
> - ++frag;
> - run->exit_reason = KVM_EXIT_MMIO;
> - run->mmio.phys_addr = frag->gpa;
> + /* Complete previous fragment */
> + frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment++];
> + if (!vcpu->mmio_is_write)
> + memcpy(frag->data, run->mmio.data, frag->len);
> + if (vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments) {
> + vcpu->mmio_needed = 0;
>   if (vcpu->mmio_is_write)
> - memcpy(run->mmio.data, frag->data, frag->len);
> - run->mmio.len = frag->len;
> - run->mmio.is_write = vcpu->mmio_is_write;
> - return 0;
> -
> - }
> -done:
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> - r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> - if (r != EMULATE_DONE)
> - return 0;
> - return 1;

Re: [PATCH 5/6] KVM: Separate rmap_pde from kvm_lpage_info->write_count

2012-06-28 Thread Takuya Yoshikawa
On Thu, 28 Jun 2012 20:39:55 +0300
Avi Kivity  wrote:

> > Note: write_count: 4 bytes, rmap_pde: 8 bytes.  So we are wasting
> > extra padding by packing them into lpage_info.
> 
> The wastage is quite low since it's just 4 bytes per 2MB.

Yes.

> >> Why not just introduce a function to get the next rmap? Something like 
> >> this:
> > 
> > I want to eliminate this kind of overhead.
> 
> I don't think the overhead is significant.  rmap walk speed is largely a
> function of cache misses IMO, and we may even be adding cache misses by
> splitting lpage_info.

Maybe.  But as far as I can see, write_count does not gain much from
being close to rmap_pde.

> But I still think it's the right thing since it simplifies the code.

After the rmap integration, we can remove the
if (likely(level == PT_PAGE_TABLE_LEVEL))
heuristic from __gfn_to_rmap().

As a bonus, the helper will become simple enough to be always inlined,
which reduces some function calls.
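
As a sketch, the integrated helper could then look something like this
(assuming a per-memslot rmap[KVM_NR_PAGE_SIZES] array as described above;
the gfn_to_index helper is illustrative):

    static unsigned long *__gfn_to_rmap(struct kvm_memory_slot *slot,
                                        gfn_t gfn, int level)
    {
            /* index of this gfn within the slot, scaled by the page-size level */
            unsigned long idx = gfn_to_index(gfn, slot->base_gfn, level);

            return &slot->arch.rmap[level - PT_PAGE_TABLE_LEVEL][idx];
    }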

> Maybe we should add a prefetch() on write_count do mitigate the
> overhead, if it starts showing up in profiles.

OK, I will post the rmap integration work as soon as possible, but it
still needs to be synced with unrelated ppc work at some point in
the future: so please take that separately from this series.

Thanks,
Takuya


RE: [PATCH] kvm: handle last_boosted_vcpu = 0 case

2012-06-28 Thread Vinod, Chegu
Hello,

I am just catching up on this email thread... 

Perhaps one of you may be able to help answer this query, preferably along
with some data.  [BTW, I do understand the basic intent behind PLE in a typical
[sweet spot] use case where there is oversubscription etc. and the need to
optimize the PLE handler in the host etc.]

In a use case where the host has fewer but much larger guests (say 40 VCPUs and
higher) and there is no oversubscription (i.e. # of vcpus across guests <=
physical cpus in the host, and perhaps each guest has its vcpus pinned to
specific physical cpus for other reasons), I would like to understand if/how
PLE really helps.  For these use cases, would it be ok to turn PLE off
(ple_gap=0), since there is no real need to take an exit and find some other
VCPU to yield to?

Thanks
Vinod

-Original Message-
From: Raghavendra K T [mailto:raghavendra...@linux.vnet.ibm.com] 
Sent: Thursday, June 28, 2012 9:22 AM
To: Andrew Jones
Cc: Rik van Riel; Marcelo Tosatti; Srikar; Srivatsa Vaddagiri; Peter Zijlstra; 
Nikunj A. Dadhania; KVM; LKML; Gleb Natapov; Vinod, Chegu; Jeremy Fitzhardinge; 
Avi Kivity; Ingo Molnar
Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/28/2012 09:30 PM, Andrew Jones wrote:
>
>
> - Original Message -
>> In summary, current PV has huge benefit on non-PLE machine.
>>
>> On PLE machine, the results become very sensitive to load, type of 
>> workload and SPIN_THRESHOLD. Also PLE interference has significant 
>> effect on them. But still it has slight edge over non PV.
>>
>
> Hi Raghu,
>
> sorry for my slow response. I'm on vacation right now (until the 9th 
> of July) and I have limited access to mail.

Ok. Happy Vacation :)

> Also, thanks for
> continuing the benchmarking. Question, when you compare PLE vs.
> non-PLE, are you using different machines (one with and one without), 
> or are you disabling its use by loading the kvm module with the 
> ple_gap=0 modparam as I did?

Yes, that is what I am doing when I say PLE is disabled and compare the
benchmarks (i.e. loading the kvm module with ple_gap=0).

But the older non-PLE results were on a different machine altogether. (I had
limited access to the PLE machine.)




Re: [PATCH v8 07/15] ARM: KVM: Hypervisor initialization

2012-06-28 Thread Christoffer Dall
On Thu, Jun 28, 2012 at 6:35 PM, Marcelo Tosatti  wrote:
> On Fri, Jun 15, 2012 at 03:07:59PM -0400, Christoffer Dall wrote:
>> Sets up the required registers to run code in HYP-mode from the kernel.
>>
>> By setting the HVBAR the kernel can execute code in Hyp-mode with
>> the MMU disabled. The HVBAR initially points to initialization code,
>> which initializes other Hyp-mode registers and enables the MMU
>> for Hyp-mode. Afterwards, the HVBAR is changed to point to KVM
>> Hyp vectors used to catch guest faults and to switch to Hyp mode
>> to perform a world-switch into a KVM guest.
>>
>> Also provides memory mapping code to map required code pages and data
>> structures accessed in Hyp mode at the same virtual address as the
>> host kernel virtual addresses, but which conforms to the architectural
>> requirements for translations in Hyp mode. This interface is added in
>> arch/arm/kvm/arm_mmu.c and is comprised of:
>>  - create_hyp_mappings(hyp_pgd, start, end);
>>  - free_hyp_pmds(pgd_hyp);
>>
>> Note: The initialization mechanism currently relies on an SMC #0 call
>> to the secure monitor, which was merely a fast way of getting to the
>> hypervisor. Dave Martin and Rusty Russell have patches out to make the
>> boot-wrapper and the kernel boot in Hyp-mode and setup a generic way for
>> hypervisors to get access to Hyp-mode if the boot-loader allows such
>> access.
>>
>> Signed-off-by: Christoffer Dall 
>> ---
>>  arch/arm/include/asm/kvm_arm.h              |  117 +++
>>  arch/arm/include/asm/kvm_asm.h              |   22 +++
>>  arch/arm/include/asm/kvm_mmu.h              |   37 ++
>>  arch/arm/include/asm/pgtable-3level-hwdef.h |    4 +
>>  arch/arm/include/asm/pgtable-3level.h       |    4 +
>>  arch/arm/include/asm/pgtable.h              |    1
>>  arch/arm/kvm/arm.c                          |  167 
>> +++
>>  arch/arm/kvm/exports.c                      |   15 ++
>>  arch/arm/kvm/init.S                         |   99 
>>  arch/arm/kvm/interrupts.S                   |   47 +++
>>  arch/arm/kvm/mmu.c                          |  170 
>> +++
>>  mm/memory.c                                 |    2
>>  12 files changed, 685 insertions(+)
>>  create mode 100644 arch/arm/include/asm/kvm_arm.h
>>  create mode 100644 arch/arm/include/asm/kvm_mmu.h
>>
>> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
>> new file mode 100644
>> index 000..7f30cbd
>> --- /dev/null
>> +++ b/arch/arm/include/asm/kvm_arm.h
>> @@ -0,0 +1,117 @@
>> +/*
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License, version 2, as
>> + * published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
>> + *
>> + */
>> +
>> +#ifndef __KVM_ARM_H__
>> +#define __KVM_ARM_H__
>> +
>> +#include 
>> +
>> +/* Hyp Configuration Register (HCR) bits */
>> +#define HCR_TGE              (1 << 27)
>> +#define HCR_TVM              (1 << 26)
>> +#define HCR_TTLB     (1 << 25)
>> +#define HCR_TPU              (1 << 24)
>> +#define HCR_TPC              (1 << 23)
>> +#define HCR_TSW              (1 << 22)
>> +#define HCR_TAC              (1 << 21)
>> +#define HCR_TIDCP    (1 << 20)
>> +#define HCR_TSC              (1 << 19)
>> +#define HCR_TID3     (1 << 18)
>> +#define HCR_TID2     (1 << 17)
>> +#define HCR_TID1     (1 << 16)
>> +#define HCR_TID0     (1 << 15)
>> +#define HCR_TWE              (1 << 14)
>> +#define HCR_TWI              (1 << 13)
>> +#define HCR_DC               (1 << 12)
>> +#define HCR_BSU              (3 << 10)
>> +#define HCR_BSU_IS   (1 << 10)
>> +#define HCR_FB               (1 << 9)
>> +#define HCR_VA               (1 << 8)
>> +#define HCR_VI               (1 << 7)
>> +#define HCR_VF               (1 << 6)
>> +#define HCR_AMO              (1 << 5)
>> +#define HCR_IMO              (1 << 4)
>> +#define HCR_FMO              (1 << 3)
>> +#define HCR_PTW              (1 << 2)
>> +#define HCR_SWIO     (1 << 1)
>> +#define HCR_VM               1
>> +
>> +/*
>> + * The bits we set in HCR:
>> + * TAC:              Trap ACTLR
>> + * TSC:              Trap SMC
>> + * TWI:              Trap WFI
>> + * BSU_IS:   Upgrade barriers to the inner shareable domain
>> + * FB:               Force broadcast of all maintenance operations
>> + * AMO:              Override CPSR.A and enable signaling with VA
>> + * IMO:              Override CPSR.I and enable signaling with VI
>> + * FMO:           

Re: [PATCH v8 09/15] ARM: KVM: Memory virtualization setup

2012-06-28 Thread Christoffer Dall
On Thu, Jun 28, 2012 at 6:34 PM, Marcelo Tosatti  wrote:
> On Fri, Jun 15, 2012 at 03:08:22PM -0400, Christoffer Dall wrote:
>> From: Christoffer Dall 
>>
>> This commit introduces the framework for guest memory management
>> through the use of 2nd stage translation. Each VM has a pointer
>> to a level-1 table (the pgd field in struct kvm_arch) which is
>> used for the 2nd stage translations. Entries are added when handling
>> guest faults (later patch) and the table itself can be allocated and
>> freed through the following functions implemented in
>> arch/arm/kvm/arm_mmu.c:
>>  - kvm_alloc_stage2_pgd(struct kvm *kvm);
>>  - kvm_free_stage2_pgd(struct kvm *kvm);
>>
>> Further, each entry in TLBs and caches are tagged with a VMID
>> identifier in addition to ASIDs. The VMIDs are assigned consecutively
>> to VMs in the order that VMs are executed, and caches and tlbs are
>> invalidated when the VMID space has been used to allow for more than
>> 255 simultaneously running guests.
>>
>> The 2nd stage pgd is allocated in kvm_arch_init_vm(). The table is
>> freed in kvm_arch_destroy_vm(). Both functions are called from the main
>> KVM code.
>>
>> Signed-off-by: Christoffer Dall 
>
> Can you explain on a high level how the IPA -> PA mappings work?
>

The memory system on ARM with Virtualization Extensions is separated
into two stages: stage 1 and stage 2.

If stage 2 translation is disabled, which it is when we boot the host
kernel, the MMU performs only stage 1 translations through a three-level
page table, and the result of the stage 1 translation is used to
physically access memory.

If stage 2 translation is enabled, the output of the stage 1
translation (which is a 40-bit intermediate physical address, IPA,
a.k.a. gpa_t/gfn in KVM language) is used for a stage 2 translation
that takes the 40-bit address as input and uses another set of page
tables, in 3 levels, to produce the resulting address.

If a fault happens during stage 1 translation, this fault is taken:
 a) directly by the host when the host is running and stage 2
translation is disabled
 b) directly by the guest VM

If a fault happens during stage 2 translation, the fault is always
taken by the hypervisor, which populates the missing entry in the
stage 2 page table (or changes the entry to be a writable entry in the
case of a permission fault).

During boot of a VM, the MMU is disabled for the guest Stage 1
translations and the address produced by the CPU is fed directly to
the stage 2 translation system.
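
In short, the path described above is:

    guest VA --[stage 1: guest-controlled page tables]--> IPA
              (40-bit, gpa_t/gfn in KVM terms)
    IPA      --[stage 2: KVM-managed page tables]-------> host PA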

A nice diagram is shown on page B3-1330 of the ARM ARM I referred you
to in the other mail.

Let me know if this is the level you had in mind.

-Christoffer


Re: [PATCH v8 00/15] KVM/ARM Implementation

2012-06-28 Thread Christoffer Dall
>
> Is there public documentation for "hyp-mode" available?
>
Yes, you have to register on the ARM website
(http://infocenter.arm.com), but there you can download the ARMv7
Architecture Reference Manual.


Re: [PATCH v8 07/15] ARM: KVM: Hypervisor initialization

2012-06-28 Thread Marcelo Tosatti
On Fri, Jun 15, 2012 at 03:07:59PM -0400, Christoffer Dall wrote:
> Sets up the required registers to run code in HYP-mode from the kernel.
> 
> By setting the HVBAR the kernel can execute code in Hyp-mode with
> the MMU disabled. The HVBAR initially points to initialization code,
> which initializes other Hyp-mode registers and enables the MMU
> for Hyp-mode. Afterwards, the HVBAR is changed to point to KVM
> Hyp vectors used to catch guest faults and to switch to Hyp mode
> to perform a world-switch into a KVM guest.
> 
> Also provides memory mapping code to map required code pages and data
> structures accessed in Hyp mode at the same virtual address as the
> host kernel virtual addresses, but which conforms to the architectural
> requirements for translations in Hyp mode. This interface is added in
> arch/arm/kvm/arm_mmu.c and is comprised of:
>  - create_hyp_mappings(hyp_pgd, start, end);
>  - free_hyp_pmds(pgd_hyp);
> 
> Note: The initialization mechanism currently relies on an SMC #0 call
> to the secure monitor, which was merely a fast way of getting to the
> hypervisor. Dave Martin and Rusty Russell have patches out to make the
> boot-wrapper and the kernel boot in Hyp-mode and setup a generic way for
> hypervisors to get access to Hyp-mode if the boot-loader allows such
> access.
> 
> Signed-off-by: Christoffer Dall 
> ---
>  arch/arm/include/asm/kvm_arm.h  |  117 +++
>  arch/arm/include/asm/kvm_asm.h  |   22 +++
>  arch/arm/include/asm/kvm_mmu.h  |   37 ++
>  arch/arm/include/asm/pgtable-3level-hwdef.h |4 +
>  arch/arm/include/asm/pgtable-3level.h   |4 +
>  arch/arm/include/asm/pgtable.h  |1 
>  arch/arm/kvm/arm.c  |  167 
> +++
>  arch/arm/kvm/exports.c  |   15 ++
>  arch/arm/kvm/init.S |   99 
>  arch/arm/kvm/interrupts.S   |   47 +++
>  arch/arm/kvm/mmu.c  |  170 
> +++
>  mm/memory.c |2 
>  12 files changed, 685 insertions(+)
>  create mode 100644 arch/arm/include/asm/kvm_arm.h
>  create mode 100644 arch/arm/include/asm/kvm_mmu.h
> 
> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
> new file mode 100644
> index 000..7f30cbd
> --- /dev/null
> +++ b/arch/arm/include/asm/kvm_arm.h
> @@ -0,0 +1,117 @@
> +/*
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License, version 2, as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
> + *
> + */
> +
> +#ifndef __KVM_ARM_H__
> +#define __KVM_ARM_H__
> +
> +#include 
> +
> +/* Hyp Configuration Register (HCR) bits */
> +#define HCR_TGE  (1 << 27)
> +#define HCR_TVM  (1 << 26)
> +#define HCR_TTLB (1 << 25)
> +#define HCR_TPU  (1 << 24)
> +#define HCR_TPC  (1 << 23)
> +#define HCR_TSW  (1 << 22)
> +#define HCR_TAC  (1 << 21)
> +#define HCR_TIDCP(1 << 20)
> +#define HCR_TSC  (1 << 19)
> +#define HCR_TID3 (1 << 18)
> +#define HCR_TID2 (1 << 17)
> +#define HCR_TID1 (1 << 16)
> +#define HCR_TID0 (1 << 15)
> +#define HCR_TWE  (1 << 14)
> +#define HCR_TWI  (1 << 13)
> +#define HCR_DC   (1 << 12)
> +#define HCR_BSU  (3 << 10)
> +#define HCR_BSU_IS   (1 << 10)
> +#define HCR_FB   (1 << 9)
> +#define HCR_VA   (1 << 8)
> +#define HCR_VI   (1 << 7)
> +#define HCR_VF   (1 << 6)
> +#define HCR_AMO  (1 << 5)
> +#define HCR_IMO  (1 << 4)
> +#define HCR_FMO  (1 << 3)
> +#define HCR_PTW  (1 << 2)
> +#define HCR_SWIO (1 << 1)
> +#define HCR_VM   1
> +
> +/*
> + * The bits we set in HCR:
> + * TAC:  Trap ACTLR
> + * TSC:  Trap SMC
> + * TWI:  Trap WFI
> + * BSU_IS:   Upgrade barriers to the inner shareable domain
> + * FB:   Force broadcast of all maintenance operations
> + * AMO:  Override CPSR.A and enable signaling with VA
> + * IMO:  Override CPSR.I and enable signaling with VI
> + * FMO:  Override CPSR.F and enable signaling with VF
> + * SWIO: Turn set/way invalidates into set/way clean+invalidate
> + */
> +#define HCR_GUEST_MASK (HCR_TSC | 

Re: [PATCH v8 09/15] ARM: KVM: Memory virtualization setup

2012-06-28 Thread Marcelo Tosatti
On Fri, Jun 15, 2012 at 03:08:22PM -0400, Christoffer Dall wrote:
> From: Christoffer Dall 
> 
> This commit introduces the framework for guest memory management
> through the use of 2nd stage translation. Each VM has a pointer
> to a level-1 table (the pgd field in struct kvm_arch) which is
> used for the 2nd stage translations. Entries are added when handling
> guest faults (later patch) and the table itself can be allocated and
> freed through the following functions implemented in
> arch/arm/kvm/arm_mmu.c:
>  - kvm_alloc_stage2_pgd(struct kvm *kvm);
>  - kvm_free_stage2_pgd(struct kvm *kvm);
> 
> Further, each entry in TLBs and caches are tagged with a VMID
> identifier in addition to ASIDs. The VMIDs are assigned consecutively
> to VMs in the order that VMs are executed, and caches and tlbs are
> invalidated when the VMID space has been used to allow for more than
> 255 simultaneously running guests.
> 
> The 2nd stage pgd is allocated in kvm_arch_init_vm(). The table is
> freed in kvm_arch_destroy_vm(). Both functions are called from the main
> KVM code.
> 
> Signed-off-by: Christoffer Dall 

Can you explain on a high level how the IPA -> PA mappings work? 



Re: [PATCH v8 00/15] KVM/ARM Implementation

2012-06-28 Thread Marcelo Tosatti
On Fri, Jun 15, 2012 at 03:06:39PM -0400, Christoffer Dall wrote:
> The following series implements KVM support for ARM processors,
> specifically on the Cortex A-15 platform.  Work is done in
> collaboration between Columbia University, Virtual Open Systems and
> ARM/Linaro.
> 
> The patch series applies to kvm/next, specifically commit:
> 25e531a988ea5a64bd97a72dc9d3c65ad5850120
> 
> This is Version 8 of the patch series, but the first two versions
> were reviewed outside of the KVM mailing list. Changes can also be
> pulled from:
>  git://github.com/virtualopensystems/linux-kvm-arm.git kvm-a15-v8
> 
> A non-flattened edition of the patch series can be found at:
>  git://github.com/virtualopensystems/linux-kvm-arm.git kvm-a15-v8-stage
> 
> The implementation is broken up into a logical set of patches, the first
> four are preparatory patches:
>  1. Add mem_type prot_pte accessor (ARM community)
>  2. Use KVM_CAP_IRQ_ROUTING to protect routing code  (KVM community)
>  3. Introduce __KVM_HAVE_IRQ_LINE (KVM community)
>  4. Guard code with CONFIG_MMU_NOTIFIER (KVM community)
> 
> KVM guys, please consider pulling the KVM generic patches as early as
> possible. Thanks.
> 
> The main implementation is broken up into separate patches, the first
> containing a skeleton of files, makefile changes, the basic user space
> interface and KVM architecture specific stubs.  Subsequent patches
> implement parts of the system as listed:
>  1.  Preparatory patch introducing __KVM_HAVE_IRQ_LINE
>  2.  Preparatory patch guarding mmu_notifier code with CONFIG_MMU_NOTIFIER
>  3.  Skeleton
>  4.  Identity Mapping for Hyp mode
>  5.  Hypervisor initialization
>  6.  Hypervisor module unloading
>  7.  Memory virtualization setup (hyp mode mappings and 2nd stage)
>  8.  Inject IRQs and FIQs from userspace
>  9.  World-switch implementation and Hyp exception vectors
>  10. Emulation framework and CP15 emulation
>  11. Handle guest user memory aborts
>  12. Handle guest MMIO aborts
>  13. Support guest wait-for-interrupt instructions
> 
> Testing:
> Limited testing, but have run GCC inside guest, which compiled a small
> hello-world program, which was successfully run. For v8 both ARM/Thumb-2
> kernels were tested as both host/guest and both a compiled-in version
> and a kernel module version of KVM was tested. Hardware still
> unavailable to me, so all testing has been done on ARM Fast Models.
> 
> For a guide on how to set up a testing environment and try out these
> patches, see:
>  http://www.virtualopensystems.com/media/pdf/kvm-arm-guide.pdf
> 
> There is an issue list available using the issue tracker on:
> https://github.com/virtualopensystems/linux-kvm-arm

Is there public documentation for "hyp-mode" available?



Re: [PATCH v8 04/15] KVM: Guard mmu_notifier specific code with CONFIG_MMU_NOTIFIER

2012-06-28 Thread Marcelo Tosatti
On Fri, Jun 15, 2012 at 03:07:24PM -0400, Christoffer Dall wrote:
> From: Marc Zyngier 
> 
> In order to avoid compilation failure when KVM is not compiled in,
> guard the mmu_notifier specific sections with both CONFIG_MMU_NOTIFIER
> and KVM_ARCH_WANT_MMU_NOTIFIER, like it is being done in the rest of
> the KVM code.
> 
> Signed-off-by: Marc Zyngier 
> Signed-off-by: Christoffer Dall 
> ---
>  include/linux/kvm_host.h |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Applied, thanks.



Re: [patch 3/3] vfio: return -EFAULT on failure

2012-06-28 Thread Alex Williamson
On Thu, 2012-06-28 at 09:45 +0300, Dan Carpenter wrote:
> This ioctl function is supposed to return a negative error code or zero
> on success.  copy_to_user() returns zero or the number of bytes
> remaining to be copied.
> 
> Signed-off-by: Dan Carpenter 
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 457acf3..1aa373f 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1159,6 +1159,8 @@ static long vfio_group_fops_unl_ioctl(struct file 
> *filep,
>   status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET;
>  
>   ret = copy_to_user((void __user *)arg, &status, minsz);
> + if (ret)
> + ret = -EFAULT;
>  
>   break;
>   }
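
For reference, the usual idiom for this check (Dan's fix keeps ret and
breaks out of the switch instead of returning early):

    if (copy_to_user((void __user *)arg, &status, minsz))
            return -EFAULT; /* copy_to_user() returns bytes *not* copied */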

Yes, thank you!  I've folded all of these into the commits on my next
branch, so they should be cleaned up in tomorrow's tree.  Thanks for the
reports, please let me know if you find more.

Alex



Re: [patch 2/3] vfio: make count unsigned to prevent integer underflow

2012-06-28 Thread Alex Williamson
On Thu, 2012-06-28 at 09:44 +0300, Dan Carpenter wrote:
> In vfio_pci_ioctl() there is a potential integer underflow where we
> might allocate less data than intended.  We check that hdr.count is not
> too large, but we don't check whether it is negative:
> 
> drivers/vfio/pci/vfio_pci.c
>312  if (hdr.argsz - minsz < hdr.count * size ||
>313  hdr.count > vfio_pci_get_irq_count(vdev, hdr.index))
>314  return -EINVAL;
>315
>316  data = kmalloc(hdr.count * size, GFP_KERNEL);
> 
> Signed-off-by: Dan Carpenter 
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 300d49b..86ef2da 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -347,7 +347,7 @@ struct vfio_irq_set {
>  #define VFIO_IRQ_SET_ACTION_TRIGGER  (1 << 5) /* Trigger interrupt */
>   __u32   index;
>   __s32   start;
> - __s32   count;
> + __u32   count;
>   __u8data[];
>  };
>  #define VFIO_DEVICE_SET_IRQS _IO(VFIO_TYPE, VFIO_BASE + 10)

Good find.  I've actually trickled this through to change a number of
the function params to unsigned from int.  Also in this struct, start
should be unsigned.  Thanks for the report!

Alex



Re: [patch 1/3 v2] vfio: signedness bug in vfio_config_do_rw()

2012-06-28 Thread Alex Williamson
On Thu, 2012-06-28 at 11:07 +0300, Dan Carpenter wrote:
> The "count" variable needs to be signed here because we use it to store
> negative error codes.
> 
> Signed-off-by: Dan Carpenter 
> ---
> v2: Just declare count as signed.
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> b/drivers/vfio/pci/vfio_pci_config.c
> index a4f7321..2e00aa8 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1419,7 +1419,7 @@ void vfio_config_free(struct vfio_pci_device *vdev)
>  }
>  
>  static ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user 
> *buf,
> -  size_t count, loff_t *ppos, bool iswrite)
> +  ssize_t count, loff_t *ppos, bool iswrite)
>  {
>   struct pci_dev *pdev = vdev->pdev;
>   struct perm_bits *perm;

signed doesn't seem right for count since just below this chunk we do:

if (*ppos < 0 || *ppos + count > pdev->cfg_size)
return -EFAULT;

So then we have to start testing for negative count.  I've added a
ssize_t variable for return that should clear things up.  Thanks for the
report!

Alex
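
A sketch of the pattern Alex describes: count stays unsigned, and a
separate signed variable carries the error or byte count (do_access is a
hypothetical helper standing in for the real body):

    static ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev,
                                     char __user *buf, size_t count,
                                     loff_t *ppos, bool iswrite)
    {
            struct pci_dev *pdev = vdev->pdev;
            ssize_t ret; /* signed: may carry -errno */

            if (*ppos < 0 || *ppos + count > pdev->cfg_size)
                    return -EFAULT;

            ret = do_access(vdev, buf, count, ppos, iswrite);
            return ret; /* -errno or bytes transferred */
    }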



Re: [KVM][Kemari]: Build error fix

2012-06-28 Thread OHMURA Kei

On 2012/06/28 5:34, Sterling Windmill wrote:

Is Kemari still in active development?


No, it's not. Currently we have no intention to add new features into
Kemari.

Thanks,
Kei



Best regards,
Sterling Windmill

On Sun, Dec 4, 2011 at 9:45 PM, OHMURA Kei <ohmura@lab.ntt.co.jp> wrote:

On 2011/12/02 21:51, Pradeep Kumar wrote:
 > It fixes build failure.
 >
 > I hit this error, after succsfull migration and sync.
 >
 > (qemu) qemu-system-x86_64: fill buffer failed, Interrupted system call
 >
 > qemu-system-x86_64: recv header failed
 >
 > qemu-system-x86_64: recv ack failed
 >
 > qemu_transaction_begin failed

Did you use the master branch?
It is not the latest version. The next branch is the latest and has this
error fixed.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari  (branch: next)

Thanks,
Kei

 >
 > Any one working on this now?
 >
 >>From 827c04da6574be80d8352acd7c40b0b4524af5f4 Mon Sep 17 00:00:00 2001
 > Date: Fri, 2 Dec 2011 18:11:40 +0530
 > Subject: [PATCH]  [Qemu][Kemari]: Build Failure
 >Signed-off-by: pradeep <psuri...@linux.vnet.ibm.com>
 >modified:   ft_trans_file.c
 >
 > ---
 >  ft_trans_file.c |2 +-
 >  1 files changed, 1 insertions(+), 1 deletions(-)
 >
 > diff --git a/ft_trans_file.c b/ft_trans_file.c
 > index 4e33034..dc36757 100644
 > --- a/ft_trans_file.c
 > +++ b/ft_trans_file.c
 > @@ -174,7 +174,7 @@ static int ft_trans_send_header(QEMUFileFtTrans *s,
 >  static int ft_trans_put_buffer(void *opaque, const uint8_t *buf, 
int64_t pos, int size)
 >  {
 >  QEMUFileFtTrans *s = opaque;
 > -ssize_t ret;
 > +ssize_t ret = 0;
 >
 >  trace_ft_trans_put_buffer(size, pos);
 >





--
OHMURA Kei

NTT Software Innovation Center
Distributed Processing Platform Technology Project
E-mail: ohmura@lab.ntt.co.jp
TEL: 0422-59-2823
FAX: 0422-59-2965





Re: [PATCH v2 5/6] kvm: KVM_EOIFD, an eventfd for EOIs

2012-06-28 Thread Michael S. Tsirkin
On Tue, Jun 26, 2012 at 11:10:08PM -0600, Alex Williamson wrote:
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index b216709..87a2558 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1987,6 +1987,30 @@ interrupts with those injected through KVM_IRQ_LINE.  
> IRQFDs created
>  with KVM_IRQFD_FLAG_LEVEL must also set this flag when de-assigning.
>  KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
>  
> +4.77 KVM_EOIFD
> +
> +Capability: KVM_CAP_EOIFD
> +Architectures: x86
> +Type: vm ioctl
> +Parameters: struct kvm_eoifd (in)
> +Returns: 0 on success, -1 on error
> +
> +KVM_EOIFD allows userspace to receive EOI notification through an
> +eventfd for level triggered irqchip interrupts.  Behavior for edge
> +triggered interrupts is undefined.  kvm_eoifd.fd specifies the eventfd
> +used for notification and kvm_eoifd.gsi specifies the irqchip pin,
> +similar to KVM_IRQFD.  KVM_EOIFD_FLAG_DEASSIGN is used to deassign
> +a previously enabled eoifd and should also set fd and gsi to match.
> +
> +The KVM_EOIFD_FLAG_LEVEL_IRQFD flag indicates that the EOI is for
> +a level triggered EOI and the kvm_eoifd structure includes
> +kvm_eoifd.irqfd, which must be previously configured using KVM_IRQFD
> +with the KVM_IRQFD_FLAG_LEVEL flag.  This allows both EOI notification
> +through kvm_eoifd.fd as well as automatically de-asserting level
> +irqfds on EOI.  Both KVM_EOIFD_FLAG_DEASSIGN and
> +KVM_EOIFD_FLAG_LEVEL_IRQFD should be used to de-assign an eoifd
> +initially set up with KVM_EOIFD_FLAG_LEVEL_IRQFD.
> +
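
For reference, userspace would drive the proposed interface roughly like
this (a sketch based only on the documentation quoted above; the API is
still under review in this thread):

    struct kvm_eoifd eoifd = {
            .fd    = eventfd(0, 0), /* EOI notifications arrive here */
            .gsi   = gsi,           /* irqchip pin, as with KVM_IRQFD */
            .flags = KVM_EOIFD_FLAG_LEVEL_IRQFD,
            .irqfd = irqfd_fd,      /* irqfd created with KVM_IRQFD_FLAG_LEVEL */
    };

    if (ioctl(vm_fd, KVM_EOIFD, &eoifd) < 0)
            perror("KVM_EOIFD");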

OK, so thinking about this some more, does the below make sense:
it is enough to have LEVEL either in IRQFD or in EOIFD but not both.
I weakly prefer it in EOIFD: when you bind it to irqfd you
also assign a source id to that irqfd.

We can also make it a bit clearer: rename KVM_EOIFD_FLAG_LEVEL_IRQFD
to KVM_EOIFD_FLAG_LEVEL_CLEAR which makes it explicit that we clear
on EOI and that it is for a level interrupt. The fact that it needs
an irqfd is less important IMHO.
Make this flag mandatory for now, we'll see later how to handle
vcpu filtering and edge.

How does it sound?

-- 
MST


Re: [PATCH RFC V6 1/5] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-06-28 Thread Raghavendra K T

On 04/27/2012 09:23 PM, Gleb Natapov wrote:

On Fri, Apr 27, 2012 at 04:15:35PM +0530, Raghavendra K T wrote:

On 04/24/2012 03:29 PM, Gleb Natapov wrote:

On Mon, Apr 23, 2012 at 03:29:47PM +0530, Raghavendra K T wrote:

From: Srivatsa Vaddagiri

KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt state.

The presence of these hypercalls is indicated to guest via
KVM_FEATURE_PV_UNHALT/KVM_CAP_PV_UNHALT.

Signed-off-by: Srivatsa Vaddagiri
Signed-off-by: Suzuki Poulose
Signed-off-by: Raghavendra K T
---

[...]

+/*
+ * kvm_pv_kick_cpu_op:  Kick a vcpu.
+ *
+ * @apicid - apicid of vcpu to be kicked.
+ */
+static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
+{
+   struct kvm_vcpu *vcpu = NULL;
+   int i;
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (!kvm_apic_present(vcpu))
+   continue;
+
+   if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
+   break;
+   }
+   if (vcpu) {
+   /*
+* Setting unhalt flag here can result in spurious runnable
+* state when unhalt reset does not happen in vcpu_block.
+* But that is harmless since that should soon result in halt.
+*/
+   vcpu->arch.pv.pv_unhalted = 1;
+   /* We need everybody to see unhalt before the vcpu unblocks */
+   smp_mb();
+   kvm_vcpu_kick(vcpu);
+   }
+}

This is too similar to kvm_irq_delivery_to_apic(). Why not reuse it? We
can use one of reserved delivery modes as PV delivery mode. We will
disallow guest to trigger it through apic interface, so this will not be
part of ABI and can be changed at will.

[...]

kvm/x86.c
=
kvm_pv_kick_cpu_op()
{

  struct kvm_lapic_irq lapic_irq;

  lapic_irq.shorthand = 0;
  lapic_irq.dest_mode = 0;
  lapic_irq.dest_id = apicid;

  lapic_irq.delivery_mode = PV_DELIVERY_MODE;
  kvm_irq_delivery_to_apic(kvm, 0, &lapic_irq);

}

kvm/lapic.c
==
_apic_accept_irq()
{
...
case APIC_DM_REMRD:
 result = 1;
 vcpu->pv_unhalted = 1
 smp_mb();
 kvm_make_request(KVM_REQ_EVENT, vcpu);
 kvm_vcpu_kick(vcpu);
 break;

...
}

Here we use PV_DELIVERY_MODE = APIC_DM_REMRD, which was previously unused.


Yes, this is what I mean except that PV_DELIVERY_MODE should be
number defined as reserved by Intel spec.



Hi Gleb, Avi,

This had been a TODO in my V8 patches.
I'll fold this into V9 (while rebasing to 3.5-rc).
Please let me know if that is OK.



Re: [PATCH 6/6] KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()

2012-06-28 Thread Avi Kivity
On 06/28/2012 05:02 AM, Takuya Yoshikawa wrote:
> When we invalidate a THP page, we call the handler with the same
> rmap_pde argument 512 times in the following loop:
> 
>   for each guest page in the range
> for each level
>   unmap using rmap
> 
> This patch avoids these extra handler calls by changing the loop order
> like this:
> 
>   for each level
> for each rmap in the range
>   unmap using rmap
> 
> With the preceding patches in the patch series, this made THP page
> invalidation more than 5 times faster on our x86 host: the host became
> more responsive during swapping the guest's memory as a result.
> 
> Note: in the new code we could not use trace_kvm_age_page(), so we just
> dropped the tracepoint from kvm_handle_hva_range().
> 
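(To make the reordering concrete: a simplified sketch of the level-outer
walk, assuming the era's helpers, __gfn_to_rmap() and KVM_PAGES_PER_HPAGE(),
and eliding the slot lookup and locking of the real kvm_handle_hva_range().)

/* Sketch only: visit each rmap head once per level, instead of hitting
 * the same rmap_pde 512 times for a THP page. */
static int handle_gfn_range(struct kvm *kvm, struct kvm_memory_slot *slot,
			    gfn_t gfn_start, gfn_t gfn_end,
			    int (*handler)(struct kvm *kvm,
					   unsigned long *rmapp,
					   unsigned long data),
			    unsigned long data)
{
	int ret = 0;
	int level;
	gfn_t gfn;

	for (level = PT_PAGE_TABLE_LEVEL;
	     level < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; level++) {
		/* step by the page size of this level */
		for (gfn = gfn_start; gfn < gfn_end;
		     gfn += KVM_PAGES_PER_HPAGE(level)) {
			unsigned long *rmapp = __gfn_to_rmap(gfn, level, slot);

			ret |= handler(kvm, rmapp, data);
		}
	}
	return ret;
}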

Can't it be pushed to handler()?



-- 
error compiling committee.c: too many arguments to function




Re: [PATCH 5/6] KVM: Separate rmap_pde from kvm_lpage_info->write_count

2012-06-28 Thread Avi Kivity
On 06/28/2012 06:45 AM, Takuya Yoshikawa wrote:
> On Thu, 28 Jun 2012 11:12:51 +0800
> Xiao Guangrong  wrote:
> 
>> >  struct kvm_arch_memory_slot {
>> > +  unsigned long *rmap_pde[KVM_NR_PAGE_SIZES - 1];
>> >struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
>> >  };
>> > 
>> 
>> It looks a little more complex than before - we need to manage more
>> allocated/freed buffers.
> 
> Actually I want to integrate rmap and rmap_pde in the future:
> 
>   rmap[KVM_NR_PAGE_SIZES]

That's a good direction.

> 
> For this we need to modify some unrelated ppc code, so I just
> avoided the integration in this series.
> 
> Note: write_count: 4 bytes, rmap_pde: 8 bytes.  So we are wasting
> extra paddings by packing them into lpage_info.

The wastage is quite low since it's just 4 bytes per 2MB.

> 
>> Why not just introduce a function to get the next rmap? Something like this:
> 
> I want to eliminate this kind of overheads.

I don't think the overhead is significant.  rmap walk speed is largely a
function of cache misses IMO, and we may even be adding cache misses by
splitting lpage_info.

But I still think it's the right thing since it simplifies the code.
Maybe we should add a prefetch() on write_count to mitigate the
overhead, if it starts showing up in profiles.

-- 
error compiling committee.c: too many arguments to function




Re: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

2012-06-28 Thread Avi Kivity
On 06/28/2012 08:26 PM, Jan Kiszka wrote:
>> 
>> This is both impressive and scary.  What is the target scenario here?
>> Partitioning?  I don't see this working for generic consolidation.
>> 
> 
> From my POV, partitioning - including hard realtime partitions - would
> provide some use cases. But, as far as I saw, there are still major
> restrictions in this approach, e.g. that you can't return to userspace
> on the slave core. Or even execute the in-kernel device models on that core.
> 
> I think we need something based on the no-hz work on the long run, ie.
> the ability to run a single VCPU thread of the userland hypervisor on a
> single core with zero rescheduling and unrelated interruptions - as far
> as the guest load scenario allows this (we have some here).

What we can do perhaps is switch from direct mode to indirect mode on
exit.  Instead of running with interrupts disabled, enable interrupts
and make sure they are forwarded to the guest on the next entry.

> Well, and we need proper hardware support for direct IRQ injection on x86...

Hardware support always helps, but it always seems to come after the
software support is in place and needs to be supported forever.

-- 
error compiling committee.c: too many arguments to function




Re: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

2012-06-28 Thread Jan Kiszka
On 2012-06-28 18:58, Avi Kivity wrote:
> On 06/28/2012 09:07 AM, Tomoki Sekiyama wrote:
>> Hello,
>>
>> This RFC patch series provides facility to dedicate CPUs to KVM guests
>> and enable the guests to handle interrupts from passed-through PCI devices
>> directly (without VM exit and relay by the host).
>>
>> With this feature, we can improve throughput and response time of the device
>> and the host's CPU usage by reducing the overhead of interrupt handling.
>> This is good for the application using very high throughput/frequent
>> interrupt device (e.g. 10GbE NIC).
>> CPU-intensive high performance applications and real-time applications
>> also benefit from the CPU isolation feature, which reduces VM exits and
>> scheduling delay.
>>
>> The current implementation is still just a PoC and has many limitations, but
>> submitted for RFC. Any comments are appreciated.
>>
>> * Overview
>> Intel and AMD CPUs have a feature to let guests handle interrupts without
>> a VM Exit. However, because the VM Exit cannot be switched per IRQ vector,
>> interrupts for both the host and the guest will be routed to guests.
>>
>> To avoid mixture of host and guest interrupts, in this patch, some of CPUs
>> are cut off from the host and dedicated to the guests. In addition, IRQ
>> affinity of the passed-through devices are set to the guest CPUs only.
>>
>> For IPIs from the host to the guest, we use NMIs, which are the only
>> interrupts that have a separate VM Exit flag.
>>
>> * Benefits
>> This feature provides benefits of virtualization to areas where high
>> performance and low latency are required, such as HPC and trading,
>> and so on. It is also useful for consolidation in large scale systems with
>> many CPU cores and PCI devices passed-through or with SR-IOV.
>> For the future, it may be used to keep the guests running even if the host
>> crashes (but that would need additional features like memory isolation).
>>
>> * Limitations
>> Current implementation is experimental, unstable, and has a lot of 
>> limitations.
>>  - SMP guests don't work correctly
>>  - Only Linux guest is supported
>>  - Only Intel VT-x is supported
>>  - Only MSI and MSI-X pass-through; no ISA interrupts support
>>  - Non passed-through PCI devices (including virtio) are slower
>>  - Kernel space PIT emulation does not work
>>  - Needs a lot of cleanups
>>
> 
> This is both impressive and scary.  What is the target scenario here?
> Partitioning?  I don't see this working for generic consolidation.
> 

>From my POV, partitioning - including hard realtime partitions - would
provide some use cases. But, as far as I saw, there are still major
restrictions in this approach, e.g. that you can't return to userspace
on the slave core. Or even execute the in-kernel device models on that core.

I think we need something based on the no-hz work on the long run, ie.
the ability to run a single VCPU thread of the userland hypervisor on a
single core with zero rescheduling and unrelated interruptions - as far
as the guest load scenario allows this (we have some here).

Well, and we need proper hardware support for direct IRQ injection on x86...

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

2012-06-28 Thread Alex Williamson
On Thu, 2012-06-28 at 19:27 +0300, Avi Kivity wrote:
> On 06/24/2012 06:02 PM, Alex Williamson wrote:
> > On Sun, 2012-06-24 at 15:56 +0300, Avi Kivity wrote:
> >> On 06/23/2012 01:16 AM, Alex Williamson wrote:
> >> > I think we're probably also going to need something like this.
> >> > When running in non-accelerated qemu, we're going to have to
> >> > create some kind of EOI notifier for drivers.  VFIO can make
> >> > additional improvements when running on KVM so it will probably
> >> > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
> >> > want to have a generic EOI notifier in qemu that just stops
> >> > working when kvm-ioapic is enabled.
> >> 
> >> Why?
> > 
> > Hmm, I must be missing something or not describing it correctly, because
> > it seems obvious.  
> 
> I have not exhausted this quarter's quota of stupid questions yet.

;)

> > If we create a dependency in qemu of needing to know
> > when an eoi occurs and notify a driver and have no way to fulfill that
> > dependency when running on kvm... that'd be bad, right?  I don't want to
> > assume that every consumer of such an interface would prefer to make use
> > of an irqfd.  Not sure if that answers your question though.  Thanks,
> 
> I meant, what scenario do you have in mind where we want the EOI
> notifier while running with kvm-irqchip enabled?  Perhaps I phrased my
> question a bit too tersely.

Aha.  Well, the v2 series really pulls eoifd in as more of a primary
player and is now the only way to get EOI notification regardless of
whether it's bound to a separate level irq source or used by qemu for
the common userspace source.  I don't know of any users that currently
need EOI notification for the common userspace source id, but depending
on where we land with whether KVM_IRQFD supports level interrupts, it
may be vfio that needs it.

Alex



Re: [RFC PATCH 06/18] KVM: Add facility to run guests on slave CPUs

2012-06-28 Thread Avi Kivity
On 06/28/2012 09:07 AM, Tomoki Sekiyama wrote:
> Add path to migrate execution of vcpu_enter_guest to a slave CPU when
> vcpu->arch.slave_cpu is set.
> 
> After moving to the slave CPU, it goes back to the online CPU when the
> guest is exited by reasons that cannot be handled by the slave CPU only
> (e.g. handling async page faults).

What about, say, instruction emulation?  It may need to touch guest
memory, which cannot be done from interrupt disabled context.

> +
> +static int vcpu_post_run(struct kvm_vcpu *vcpu, struct task_struct *task,
> +  int *can_complete_async_pf)
> +{
> + int r = LOOP_ONLINE;
> +
> + clear_bit(KVM_REQ_PENDING_TIMER, &vcpu->requests);
> + if (kvm_cpu_has_pending_timer(vcpu))
> + kvm_inject_pending_timer_irqs(vcpu);
> +
> + if (dm_request_for_irq_injection(vcpu)) {
> + r = -EINTR;
> + vcpu->run->exit_reason = KVM_EXIT_INTR;
> + ++vcpu->stat.request_irq_exits;
> + }
> +
> + if (can_complete_async_pf) {
> + *can_complete_async_pf = kvm_can_complete_async_pf(vcpu);
> + if (r == LOOP_ONLINE)
> + r = *can_complete_async_pf ? LOOP_APF : LOOP_SLAVE;
> + } else
> + kvm_check_async_pf_completion(vcpu);
> +
> + if (signal_pending(task)) {
> + r = -EINTR;
> + vcpu->run->exit_reason = KVM_EXIT_INTR;
> + ++vcpu->stat.signal_exits;
> + }

Isn't this racy?  The signal can come right after this.

-- 
error compiling committee.c: too many arguments to function




Re: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

2012-06-28 Thread Avi Kivity
On 06/28/2012 09:07 AM, Tomoki Sekiyama wrote:
> Hello,
> 
> This RFC patch series provides facility to dedicate CPUs to KVM guests
> and enable the guests to handle interrupts from passed-through PCI devices
> directly (without VM exit and relay by the host).
> 
> With this feature, we can improve throughput and response time of the device
> and the host's CPU usage by reducing the overhead of interrupt handling.
> This is good for the application using very high throughput/frequent
> interrupt device (e.g. 10GbE NIC).
> CPU-intensive high performance applications and real-time applications
> also benefit from the CPU isolation feature, which reduces VM exits and
> scheduling delay.
> 
> The current implementation is still just a PoC and has many limitations, but
> submitted for RFC. Any comments are appreciated.
> 
> * Overview
> Intel and AMD CPUs have a feature to let guests handle interrupts without
> a VM Exit. However, because the VM Exit cannot be switched per IRQ vector,
> interrupts for both the host and the guest will be routed to guests.
> 
> To avoid mixture of host and guest interrupts, in this patch, some of CPUs
> are cut off from the host and dedicated to the guests. In addition, IRQ
> affinity of the passed-through devices are set to the guest CPUs only.
> 
> For IPIs from the host to the guest, we use NMIs, which are the only
> interrupts that have a separate VM Exit flag.
> 
> * Benefits
> This feature provides benefits of virtualization to areas where high
> performance and low latency are required, such as HPC and trading,
> and so on. It is also useful for consolidation in large scale systems with
> many CPU cores and PCI devices passed-through or with SR-IOV.
> For the future, it may be used to keep the guests running even if the host
> crashes (but that would need additional features like memory isolation).
> 
> * Limitations
> Current implementation is experimental, unstable, and has a lot of 
> limitations.
>  - SMP guests don't work correctly
>  - Only Linux guest is supported
>  - Only Intel VT-x is supported
>  - Only MSI and MSI-X pass-through; no ISA interrupts support
>  - Non passed-through PCI devices (including virtio) are slower
>  - Kernel space PIT emulation does not work
>  - Needs a lot of cleanups
> 

This is both impressive and scary.  What is the target scenario here?
Partitioning?  I don't see this working for generic consolidation.


-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v2 1/6] kvm: Pass kvm_irqfd to functions

2012-06-28 Thread Michael S. Tsirkin
On Thu, Jun 28, 2012 at 06:51:09PM +0200, Cornelia Huck wrote:
> On Thu, 28 Jun 2012 15:09:49 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Thu, Jun 28, 2012 at 02:00:41PM +0200, Cornelia Huck wrote:
> > > On Thu, 28 Jun 2012 12:34:43 +0300
> > > "Michael S. Tsirkin"  wrote:
> > > 
> > > > On Thu, Jun 28, 2012 at 11:03:16AM +0200, Cornelia Huck wrote:
> > > 
> > > > > How about something like this as parameter for the new ioctl?
> > > > > 
> > > > > struct kvm_irqfd2 {
> > > > >   __u32 fd;
> > > > >   __u32 flags;  /* for things like deassign */
> > > > >   __u64 type;   /* determines the payload */
> > > > >   union {
> > > > >   /* type traditional */
> > > > >   struct {
> > > > >   __u32 gsi;
> > > > >   } trad;
> > > > >   /* type s390 */
> > > > >   struct {
> > > > >   __u32 int_type;
> > > > >   __u32 parm;
> > > > >   __u64 parm64;
> > > > >   } s390;
> > > > >   __u8 pad[20];
> > > > >   };
> > > > > }
> > > > > 
> > > > > This could be combined with an arch or a per-kvm callback to keep the
> > > > > generic code clean of architecture dependencies.
> > > > > 
> > > > > Cornelia
> > > > 
> > > > Looks a bit weird - shouldn't all this be part of gsi routing?
> > > > But no idea really, I don't see the big picture here.
> > > > 
> > > 
> > > Well, on s390 we don't have anything like "gsi routing" (I'm not even
> > > really sure what that is).
> > 
> > I really mean kvm_irq_routing. This has options for
> > irqchip, msi, I guess we can add more.
> 
> I stared at irq_comm.c for a bit and it seems to fulfill a purpose
> similar to arch/s390/kvm/interrupt.c (although it looks more static).
> But I don't really see how they could fit together easily.
> 
> > 
> > 
> > > My understanding is the following:
> > > 
> > > - Basically, we want to notify the guest for a virtqueue.
> > > - For this, we want to inject an interrupt for the associated device.
> > > - On x86, this means raising an interrupt on an interrupt line, as
> > >   specified by some kind of number.
> > 
> > Not just that: for MSI we pass in data encoding priority
> > destination etc.
> > 
> > > - On s390, we need some more information to (a) identify the device and
> > >   (b) additional information that needs to be transmitted for an
> > >   interrupt (device specific). (This is what basically goes into struct
> > >   kvm_s390_interrupt, which I reproduced in the s390 part.)
> > > 
> > > Cornelia
> > 
> > Is this b mostly static or does it change for each interrupt?
> 
> For Linux guests it will be static, although the architecture would
> allow for changing (some of) it.
> 
> Cornelia

So storing this info in routing might be a good fit.

-- 
MST


Re: [PATCH v2 5/6] kvm: KVM_EOIFD, an eventfd for EOIs

2012-06-28 Thread Michael S. Tsirkin
On Thu, Jun 28, 2012 at 05:08:04PM +0300, Gleb Natapov wrote:
> On Thu, Jun 28, 2012 at 04:11:40PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 27, 2012 at 09:55:44PM -0600, Alex Williamson wrote:
> > > On Wed, 2012-06-27 at 17:51 +0300, Gleb Natapov wrote:
> > > > On Wed, Jun 27, 2012 at 08:29:04AM -0600, Alex Williamson wrote:
> > > > > On Wed, 2012-06-27 at 16:58 +0300, Gleb Natapov wrote:
> > > > > > On Tue, Jun 26, 2012 at 11:10:08PM -0600, Alex Williamson wrote:
> > > > > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > > > > written for a specified irqchip pin.  By default this is a simple
> > > > > > > notification, but we can also tie the eoifd to a level irqfd, 
> > > > > > > which
> > > > > > > enables the irqchip pin to be automatically de-asserted on EOI.
> > > > > > > This mode is particularly useful for device-assignment 
> > > > > > > applications
> > > > > > > where the unmask and notify triggers a hardware unmask.  The 
> > > > > > > default
> > > > > > > mode is most applicable to simple notify with no side-effects for
> > > > > > > userspace usage, such as Qemu.
> > > > > > > 
> > > > > > > Here we make use of the reference counting of the _irq_source
> > > > > > > object allowing us to share it with an irqfd and cleanup 
> > > > > > > regardless
> > > > > > > of the release order.
> > > > > > > 
> > > > > > > Signed-off-by: Alex Williamson 
> > > > > > > ---
> > > > > > > 
> > > > > > >  Documentation/virtual/kvm/api.txt |   24 +
> > > > > > >  arch/x86/kvm/x86.c|1 
> > > > > > >  include/linux/kvm.h   |   14 +++
> > > > > > >  include/linux/kvm_host.h  |   13 +++
> > > > > > >  virt/kvm/eventfd.c|  189 
> > > > > > > +
> > > > > > >  virt/kvm/kvm_main.c   |   11 ++
> > > > > > >  6 files changed, 250 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/Documentation/virtual/kvm/api.txt 
> > > > > > > b/Documentation/virtual/kvm/api.txt
> > > > > > > index b216709..87a2558 100644
> > > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > > @@ -1987,6 +1987,30 @@ interrupts with those injected through 
> > > > > > > KVM_IRQ_LINE.  IRQFDs created
> > > > > > >  with KVM_IRQFD_FLAG_LEVEL must also set this flag when 
> > > > > > > de-assigning.
> > > > > > >  KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > > > > >  
> > > > > > > +4.77 KVM_EOIFD
> > > > > > > +
> > > > > > > +Capability: KVM_CAP_EOIFD
> > > > > > > +Architectures: x86
> > > > > > > +Type: vm ioctl
> > > > > > > +Parameters: struct kvm_eoifd (in)
> > > > > > > +Returns: 0 on success, -1 on error
> > > > > > > +
> > > > > > > +KVM_EOIFD allows userspace to receive EOI notification through an
> > > > > > > +eventfd for level triggered irqchip interrupts.  Behavior for 
> > > > > > > edge
> > > > > > > +triggered interrupts is undefined.  kvm_eoifd.fd specifies the 
> > > > > > > eventfd
> > > > > > Lets make it defined. EOI notification can be used by userspace to 
> > > > > > fix
> > > > > > time drift due to lost interrupts. But than userspace needs to know
> > > > > > which vcpu did EOI.
> > > > > 
> > > > > Hmm, do we need an additional flag and field in kvm_eoifd to filter by
> > > > > vCPU then?
> > > > > 
> > > > This will be enough for a use case I am aware of. Don't know if this
> > > > interface is generic enough for all possible use cases.
> > > 
> > > That's generally a hard prediction to make ;)  We currently don't pass a
> > > kvm_vcpu anywhere close to the irq ack notifier.  The ioapic path could
> > > be relatively trivial, but the pic path is a bit further disconnected.
> > > If we had that plumbing, a KVM_CAP plus vcpu filter flag and specifying
> > > the vcpu using some of the padding space seems like it's sufficient.
> > > I'll drop mention of level-only from the description, but the plumbing
> > > and vcpu filtering can be a follow-on.  Thanks,
> > > 
> > > Alex
> > 
> > If we don't implement what's needed for timedrift to be fixed,
> > then IMO it's better to simply require an IRQFD for EOIFD for now,
> > and limit this to level. Otherwise when we actually try to implement
> > we might find issues.
> > 
> > Another reason to explicitly say EOI is not supported for edge is that
> > EOI might not get invoked at all with PV EOI.
> > 
> Good point, but easily addressable by disabling PV EOI for a GSI that
> has EOI notifier registered.

This patch doesn't do this though, so it will need a separate
patch and a separate capability.

> --
>   Gleb.


Re: [PATCH v2 1/6] kvm: Pass kvm_irqfd to functions

2012-06-28 Thread Cornelia Huck
On Thu, 28 Jun 2012 15:09:49 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Jun 28, 2012 at 02:00:41PM +0200, Cornelia Huck wrote:
> > On Thu, 28 Jun 2012 12:34:43 +0300
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Thu, Jun 28, 2012 at 11:03:16AM +0200, Cornelia Huck wrote:
> > 
> > > > How about something like this as parameter for the new ioctl?
> > > > 
> > > > struct kvm_irqfd2 {
> > > > __u32 fd;
> > > > __u32 flags;  /* for things like deassign */
> > > > __u64 type;   /* determines the payload */
> > > > union {
> > > > /* type traditional */
> > > > struct {
> > > > __u32 gsi;
> > > > } trad;
> > > > /* type s390 */
> > > > struct {
> > > > __u32 int_type;
> > > > __u32 parm;
> > > > __u64 parm64;
> > > > } s390;
> > > > __u8 pad[20];
> > > > };
> > > > }
> > > > 
> > > > This could be combined with an arch or a per-kvm callback to keep the
> > > > generic code clean of architecture dependencies.
> > > > 
> > > > Cornelia
> > > 
> > > Looks a bit weird - shouldn't all this be part of gsi routing?
> > > But no idea really, I don't see the big picture here.
> > > 
> > 
> > Well, on s390 we don't have anything like "gsi routing" (I'm not even
> > really sure what that is).
> 
> I really mean kvm_irq_routing. This has options for
> irqchip, msi, I guess we can add more.

I stared at irq_comm.c for a bit and it seems to fulfill a purpose
similar to arch/s390/kvm/interrupt.c (although it looks more static).
But I don't really see how they could fit together easily.

> 
> 
> > My understanding is the following:
> > 
> > - Basically, we want to notify the guest for a virtqueue.
> > - For this, we want to inject an interrupt for the associated device.
> > - On x86, this means raising an interrupt on an interrupt line, as
> >   specified by some kind of number.
> 
> Not just that: for MSI we pass in data encoding priority
> destination etc.
> 
> > - On s390, we need some more information to (a) identify the device and
> >   (b) additional information that needs to be transmitted for an
> >   interrupt (device specific). (This is what basically goes into struct
> >   kvm_s390_interrupt, which I reproduced in the s390 part.)
> > 
> > Cornelia
> 
> Is this b mostly static or does it change for each interrupt?

For Linux guests it will be static, although the architecture would
allow for changing (some of) it.

Cornelia
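(For reference, a sketch of how the proposed payload union could be filled
on each side. All names are taken from the proposal above or invented for
illustration; nothing here exists in the upstream headers.)

#include <linux/types.h>

struct kvm_irqfd2 {
	__u32 fd;
	__u32 flags;
	__u64 type;			/* selects the payload */
	union {
		struct { __u32 gsi; } trad;
		struct {
			__u32 int_type;
			__u32 parm;
			__u64 parm64;
		} s390;
		__u8 pad[20];
	};
};

enum {					/* assumed discriminator values */
	KVM_IRQFD2_TYPE_TRAD = 0,
	KVM_IRQFD2_TYPE_S390 = 1,
};

/* x86 keeps the traditional gsi payload... */
static void irqfd2_fill_trad(struct kvm_irqfd2 *p, __u32 fd, __u32 gsi)
{
	p->fd = fd;
	p->type = KVM_IRQFD2_TYPE_TRAD;
	p->trad.gsi = gsi;
}

/* ...while s390 carries the extra, mostly static, interrupt description. */
static void irqfd2_fill_s390(struct kvm_irqfd2 *p, __u32 fd, __u32 int_type,
			     __u32 parm, __u64 parm64)
{
	p->fd = fd;
	p->type = KVM_IRQFD2_TYPE_S390;
	p->s390.int_type = int_type;
	p->s390.parm = parm;
	p->s390.parm64 = parm64;
}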



Re: [RFC PATCH 16/18] KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received

2012-06-28 Thread Avi Kivity
On 06/28/2012 09:08 AM, Tomoki Sekiyama wrote:
> Since NMIs cannot be disabled around VM enter, there is a race between
> receiving an NMI to kick a guest and entering the guest on slave CPUs. If the
> NMI is received just before entering the VM, then after the NMI handler is
> invoked, it continues entering the guest and the effect of the NMI is lost.
> 
> This patch adds kvm_arch_vcpu_prevent_run(), which causes VM exit right
> after VM enter. The NMI handler uses this to ensure the execution of the
> guest is cancelled after NMI.
> 
>  
> +/*
> + * Make VMRESUME fail using preemption timer with timer value = 0.
> + * On processors that don't support the preemption timer, VMRESUME will
> + * fail with an internal error.
> + */
> +static void vmx_prevent_run(struct kvm_vcpu *vcpu, int prevent)
> +{
> + if (prevent)
> + vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
> +   PIN_BASED_PREEMPTION_TIMER);
> + else
> + vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
> + PIN_BASED_PREEMPTION_TIMER);
> +}

This may interrupt another RMW sequence, which will then overwrite the
control.  So it needs to be called only if inside the entry sequence
(otherwise can just set a KVM_REQ_IMMEDIATE_EXIT in vcpu->requests).

-- 
error compiling committee.c: too many arguments to function




Re: [RFC PATCH 18/18] x86: request TLB flush to slave CPU using NMI

2012-06-28 Thread Avi Kivity
On 06/28/2012 09:08 AM, Tomoki Sekiyama wrote:
> For slave CPUs, it is inappropriate to request a TLB flush using an IPI,
> because the IPI may be sent to a KVM guest while the slave CPU is running
> the guest with direct interrupt routing.
> 
> Instead, it registers a TLB flush request in a per-cpu bitmask and sends an
> NMI to interrupt execution of the guest. The NMI handler then checks the
> bitmask and handles the requests.


Currently x86's get_user_pages_fast() depends on TLB flushes being held
up by local_irq_disable().  With this patch, this is no longer true and
get_user_pages_fast() can race with page table freeing.  There are
patches from Peter Zijlstra to remove this dependency though.  NMIs are
still slow and fragile when compared to normal interrupts, so this patch
is somewhat problematic.


-- 
error compiling committee.c: too many arguments to function




Re: qemu-kvm-1.0.1 - unable to exit if vcpu is in infinite loop

2012-06-28 Thread Avi Kivity
On 06/28/2012 07:29 PM, Peter Lieven wrote:
>> Yes. A signal is sent, and KVM returns from the guest to userspace on
>> pending signals.

> is there a description available of how exactly this process works?

The kernel part is in vcpu_enter_guest(), see the check for
signal_pending().  But this hasn't seen changes for quite a long while.
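(Roughly, the shape of that check; a condensed and approximate sketch with
most of the loop elided, not the verbatim kernel code:)

/* Condensed sketch of the x86 vcpu run loop around vcpu_enter_guest(): */
static int vcpu_run_sketch(struct kvm_vcpu *vcpu)
{
	int r = 1;

	while (r > 0) {
		r = vcpu_enter_guest(vcpu);
		/* ... */
		if (signal_pending(current)) {
			r = -EINTR;
			vcpu->run->exit_reason = KVM_EXIT_INTR;
			++vcpu->stat.signal_exits;
		}
		/* -EINTR propagates out of KVM_RUN back to userspace */
	}
	return r;
}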

-- 
error compiling committee.c: too many arguments to function




Re: qemu-kvm-1.0.1 - unable to exit if vcpu is in infinite loop

2012-06-28 Thread Peter Lieven

On 28.06.2012 17:22, Jan Kiszka wrote:

On 2012-06-28 17:02, Peter Lieven wrote:

On 28.06.2012 15:25, Jan Kiszka wrote:

On 2012-06-28 15:05, Peter Lieven wrote:

Hi,

i debugged my initial problem further and found out that the problem
happens to be that
the main thread is stuck in pause_all_vcpus() on reset or quit commands
in the monitor
if one cpu is stuck in the do-while loop kvm_cpu_exec. If I modify the
condition from while (ret == 0)
to while ((ret == 0) && !env->stop); it works, but is this the right fix?
"Quit" command seems to work, but on "Reset" the VM enters pause state.

Before entering the wait loop in pause_all_vcpus, there are kicks sent
to all vcpus. Now we need to find out why some of those kicks apparently
don't reach the destination.

can you explain in short what exactly these kicks do? Do these kicks lead
to leaving kernel mode and returning to userspace?

Yes. A signal is sent, and KVM returns from the guest to userspace on
pending signals.

is there a description available of how exactly this process works?

thanks
peter



Re: [PATCH 4/4][RFC] kvm: eoi_eventfd

2012-06-28 Thread Avi Kivity
On 06/24/2012 06:02 PM, Alex Williamson wrote:
> On Sun, 2012-06-24 at 15:56 +0300, Avi Kivity wrote:
>> On 06/23/2012 01:16 AM, Alex Williamson wrote:
>> > I think we're probably also going to need something like this.
>> > When running in non-accelerated qemu, we're going to have to
>> > create some kind of EOI notifier for drivers.  VFIO can make
>> > additional improvements when running on KVM so it will probably
>> > make use of the KVM_IRQFD_LEVEL_EOI interface, but we don't
>> > want to have a generic EOI notifier in qemu that just stops
>> > working when kvm-ioapic is enabled.
>> 
>> Why?
> 
> Hmm, I must be missing something or not describing it correctly, because
> it seems obvious.  

I have not exhausted this quarter's quota of stupid questions yet.

> If we create a dependency in qemu of needing to know
> when an eoi occurs and notify a driver and have no way to fulfill that
> dependency when running on kvm... that'd be bad, right?  I don't want to
> assume that every consumer of such an interface would prefer to make use
> of an irqfd.  Not sure if that answers your question though.  Thanks,

I meant, what scenario do you have in mind where we want the EOI
notifier while running with kvm-irqchip enabled?  Perhaps I phrased my
question a bit too tersely.

-- 
error compiling committee.c: too many arguments to function




Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

2012-06-28 Thread Raghavendra K T

On 06/28/2012 09:30 PM, Andrew Jones wrote:



- Original Message -

In summary, the current PV approach has a huge benefit on non-PLE machines.

On PLE machines, the results become very sensitive to load, type of
workload and SPIN_THRESHOLD. Also, PLE interference has a significant
effect on them. But it still has a slight edge over non-PV.



Hi Raghu,

sorry for my slow response. I'm on vacation right now (until the
9th of July) and I have limited access to mail.


Ok. Happy Vacation :)

Also, thanks for

continuing the benchmarking. Question, when you compare PLE vs.
non-PLE, are you using different machines (one with and one
without), or are you disabling its use by loading the kvm module
with the ple_gap=0 modparam as I did?


Yes, that is what I do when I say PLE is disabled and compare the
benchmarks (i.e. loading the kvm module with ple_gap=0).

But the older non-PLE results were on a different machine altogether. (I
had limited access to the PLE machine).




Re: [PATCH v2 0/7] kvm: Get coalesced MMIO flushing out of the hot-path

2012-06-28 Thread Avi Kivity
On 06/27/2012 07:27 PM, Jan Kiszka wrote:
> Changes in v2:
>  - added memory_region_clear_flush_coalesced
>  - call memory_region_clear_flush_coalesced from
>memory_region_clear_coalescing
>  - wrap all region manipulations via memory_region_transaction_begin/
>commit internally
>  - flush coalesced MMIO only on memory_region_transaction_begin
> 
> Original description:
> 
> We currently flush the coalesced MMIO buffer on every vmexit to
> userspace. KVM only provides a single buffer per VM, so a central lock
> is required to read from it. This is a contention point given a large
> enough VCPU set. Moreover, we need to hold the BQL while replaying the
> queued requests, probably for a long time until there is more fine
> grained locking available. Good reasons to overcome the unconditional
> flush.
> 
> The series achieves this by flushing only on selected memory region
> accesses, either generically via the memory access dispatcher or
> directly on certain VGA PIO accesses that are not yet fully converted.
> Another reason to flush are remappings or other relevant region state
> changes.

Looks good; please see minor comment on patch 2.

-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v2 2/7] memory: Flush coalesced MMIO on selected region access

2012-06-28 Thread Avi Kivity
On 06/27/2012 07:27 PM, Jan Kiszka wrote:
> Instead of flushing pending coalesced MMIO requests on every vmexit,
> this provides a mechanism to selectively flush when memory regions
> related to the coalesced one are accessed. This first of all includes
> the coalesced region itself but can also applied to other regions, e.g.
> of the same device, by calling memory_region_set_flush_coalesced.
> 
> +
> +void memory_region_clear_flush_coalesced(MemoryRegion *mr)
> +{
> +qemu_flush_coalesced_mmio_buffer();
> +mr->flush_coalesced_mmio = false;
> +}


This will clear the implicit flag set by set_coalesced, but also any
explicit flag set by set_flush_coalesced.  In some convoluted setup that
will never happen, this can cause breakage.  This can be addressed
either using two flags (or perhaps using the coalesced list for the
implicit flag) or by a comment in clear_coalesced's documentation.
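(A sketch of the two-flag variant, for concreteness; flush_coalesced_explicit
is an invented field name, and this is not the actual memory.c code.)

void memory_region_set_flush_coalesced(MemoryRegion *mr)
{
    mr->flush_coalesced_explicit = true;    /* invented field */
}

void memory_region_clear_flush_coalesced(MemoryRegion *mr)
{
    qemu_flush_coalesced_mmio_buffer();
    mr->flush_coalesced_explicit = false;
    /* keep flushing while the region itself still has coalesced ranges */
    mr->flush_coalesced_mmio = !QTAILQ_EMPTY(&mr->coalesced);
}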

-- 
error compiling committee.c: too many arguments to function




Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

2012-06-28 Thread Andrew Jones


- Original Message -
> In summary, the current PV approach has a huge benefit on non-PLE machines.
> 
> On PLE machines, the results become very sensitive to load, type of
> workload and SPIN_THRESHOLD. Also, PLE interference has a significant
> effect on them. But it still has a slight edge over non-PV.
> 

Hi Raghu,

sorry for my slow response. I'm on vacation right now (until the
9th of July) and I have limited access to mail. Also, thanks for
continuing the benchmarking. Question, when you compare PLE vs.
non-PLE, are you using different machines (one with and one
without), or are you disabling its use by loading the kvm module
with the ple_gap=0 modparam as I did?

Drew


Re: [PATCH v4] KVM: x86: Implement PCID/INVPCID for guests with EPT

2012-06-28 Thread Avi Kivity
On 06/28/2012 06:49 PM, Avi Kivity wrote:
> On 06/14/2012 05:04 AM, Mao, Junjie wrote:
>> This patch handles PCID/INVPCID for guests.
>> 
>> Process-context identifiers (PCIDs) are a facility by which a logical 
>> processor
>> may cache information for multiple linear-address spaces so that the 
>> processor
>> may retain cached information when software switches to a different linear
>> address space. Refer to section 4.10.1 in IA32 Intel Software Developer's 
>> Manual
>> Volume 3A for details.
>> 
>> For guests with EPT, the PCID feature is enabled and INVPCID behaves as 
>> running
>> natively.
>> For guests without EPT, the PCID feature is disabled and INVPCID triggers 
>> #UD.
>> 


Oh, and sorry about the late review.


-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v4] KVM: x86: Implement PCID/INVPCID for guests with EPT

2012-06-28 Thread Avi Kivity
On 06/14/2012 05:04 AM, Mao, Junjie wrote:
> This patch handles PCID/INVPCID for guests.
> 
> Process-context identifiers (PCIDs) are a facility by which a logical 
> processor
> may cache information for multiple linear-address spaces so that the processor
> may retain cached information when software switches to a different linear
> address space. Refer to section 4.10.1 in IA32 Intel Software Developer's 
> Manual
> Volume 3A for details.
> 
> For guests with EPT, the PCID feature is enabled and INVPCID behaves as 
> running
> natively.
> For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.
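(For reference, a sketch of what issuing INVPCID looks like from the guest
side; the helper below is illustrative and not part of this patch. Per the
SDM, the instruction takes an invalidation type in a register and a 16-byte
memory descriptor of {PCID, reserved, linear address}.)

/* Illustrative guest-side helper; requires CPUID.(EAX=07H,ECX=0H):EBX.INVPCID.
 * type: 0 = individual address, 1 = single context,
 *       2 = all contexts incl. globals, 3 = all contexts except globals. */
static inline void invpcid_sketch(unsigned long pcid, unsigned long addr,
				  unsigned long type)
{
	struct { unsigned long long d[2]; } desc = { { pcid, addr } };

	asm volatile("invpcid %[desc], %[type]"
		     :: [desc] "m" (desc), [type] "r" (type) : "memory");
}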
> 
> 
>  #endif
>   unsigned f_rdtscp = kvm_x86_ops->rdtscp_supported() ? F(RDTSCP) : 0;
> + unsigned f_pcid = boot_cpu_has(X86_FEATURE_PCID) ? F(PCID) : 0;

Unneeded, just put F(PCID) below.

> + unsigned f_invpcid = kvm_x86_ops->invpcid_supported() ? F(INVPCID) : 0;
>  
>   /* cpuid 1.edx */
>   const u32 kvm_supported_word0_x86_features =
> @@ -228,7 +230,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, 
> u32 function,
>   0 /* DS-CPL, VMX, SMX, EST */ |
>   0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
>   F(FMA) | F(CX16) | 0 /* xTPR Update, PDCM */ |
> - 0 /* Reserved, DCA */ | F(XMM4_1) |
> + f_pcid | 0 /* Reserved, DCA */ | F(XMM4_1) |


> @@ -1739,6 +1745,11 @@ static bool vmx_rdtscp_supported(void)
>   return cpu_has_vmx_rdtscp();
>  }
>  
> +static bool vmx_invpcid_supported(void)
> +{
> + return cpu_has_vmx_invpcid();

Should that have && ept_enabled?  You turn off invpcid below if
!ept_enabled.

> +}
> +
>  /*
>   * Swap MSR entry in host/guest MSR entry array.
>   */
> @@ -2458,7 +2469,8 @@ static __init int setup_vmcs_config(struct vmcs_config 
> *vmcs_conf)
>   SECONDARY_EXEC_ENABLE_EPT |
>   SECONDARY_EXEC_UNRESTRICTED_GUEST |
>   SECONDARY_EXEC_PAUSE_LOOP_EXITING |
> - SECONDARY_EXEC_RDTSCP;
> + SECONDARY_EXEC_RDTSCP |
> + SECONDARY_EXEC_ENABLE_INVPCID;
>   if (adjust_vmx_controls(min2, opt2,
>   MSR_IA32_VMX_PROCBASED_CTLS2,
>   &_cpu_based_2nd_exec_control) < 0)
> @@ -3731,6 +3743,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx 
> *vmx)
>   if (!enable_ept) {
>   exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
>   enable_unrestricted_guest = 0;
> + /* Enabling INVPCID for non-EPT guests may cause a performance regression. */
> + exec_control &= ~SECONDARY_EXEC_ENABLE_INVPCID;
>   }
>   if (!enable_unrestricted_guest)
>   exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;

This code is unneeded..

> @@ -6467,6 +6481,23 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
>   }
>   }
>   }
> +
> + if (vmx_invpcid_supported()) {
> + exec_control = vmcs_read32(SECONDARY_VM_EXEC_CONTROL);
> + /* Exposing INVPCID only when PCID is exposed */
> + best = kvm_find_cpuid_entry(vcpu, 0x7, 0);
> + if (best && (best->ecx & bit(X86_FEATURE_INVPCID)) && 
> guest_cpuid_has_pcid(vcpu)) {
> + exec_control |= SECONDARY_EXEC_ENABLE_INVPCID;
> + vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
> +  exec_control);
> + } else {
> + exec_control &= ~SECONDARY_EXEC_ENABLE_INVPCID;
> + vmcs_write32(SECONDARY_VM_EXEC_CONTROL,
> +  exec_control);
> + if (best)
> + best->ecx &= ~bit(X86_FEATURE_INVPCID);
> + }
> + }
>  }

Since you override it here anyway.  But maybe it's needed if the host
never calls KVM_SET_CPUID.

>  
>  static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
> @@ -6610,6 +6641,9 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12)
> page_to_phys(vmx->nested.apic_access_page));
>   }
>  
> + /* Explicitly disable INVPCID until PCID for L2 guest is 
> supported */
> + exec_control &= ~SECONDARY_EXEC_ENABLE_INVPCID;
> +

We can't just disable it without the guest knowing.  If we don't expose
INVPCID through the MSR, then we should fail VMLAUNCH or VMRESUME if
this bit is set.

>   vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
>   }
>  
> @@ -528,6 +528,10 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>   return 1;
>   }
>  
> + if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG) &&
> + kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
> + return 1;

Why check old_cr0?

>  
> + if ((cr4 & X86_CR4_PCIDE

Re: Is there a mail-list for answering questions against kvm??

2012-06-28 Thread Ademar de Souza Reis Jr.
On Tue, Jun 26, 2012 at 10:50:36AM +0800, Zhengwang Ruan wrote:
> Hello,
> 
> I am freshman to this community, I want to know if there is a
> mail-list in which I can ask questions against kvm and people are
> willing to answer too?  Thanks! :-)
> 

Hi Zhengwang, welcome to KVM.

You can try IRC: #kvm on freenode and maybe #virtualization on OFTC.
(check http://www.linux-kvm.org/page/Lists%2C_IRC)

But anyway, your questions are well suited to this list, it's
just that people might be too busy right now...

You can refresh your questions after a few days with what you've
found by yourself (a self reply, with answers and maybe new
questions).

Good luck.

-- 
Ademar de Souza Reis Jr.
Red Hat

^[:wq!


Re: qemu-kvm-1.0.1 - unable to exit if vcpu is in infinite loop

2012-06-28 Thread Jan Kiszka
On 2012-06-28 17:02, Peter Lieven wrote:
> On 28.06.2012 15:25, Jan Kiszka wrote:
>> On 2012-06-28 15:05, Peter Lieven wrote:
>>> Hi,
>>>
>>> i debugged my initial problem further and found out that the problem
>>> happens to be that
>>> the main thread is stuck in pause_all_vcpus() on reset or quit commands
>>> in the monitor
>>> if one cpu is stuck in the do-while loop kvm_cpu_exec. If I modify the
>>> condition from while (ret == 0)
>>> to while ((ret == 0) && !env->stop); it works, but is this the right fix?
>>> "Quit" command seems to work, but on "Reset" the VM enters pause state.
>> Before entering the wait loop in pause_all_vcpus, there are kicks sent
>> to all vcpus. Now we need to find out why some of those kicks apparently
>> don't reach the destination.
> can you explain in short what exactly these kicks do? Do these kicks lead
> to leaving kernel mode and returning to userspace?

Yes. A signal is sent, and KVM returns from the guest to userspace on
pending signals.
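(Concretely, the kick is roughly this, simplified from qemu-kvm's cpus.c of
that time; SIG_IPI interrupts the blocking KVM_RUN ioctl, which then returns
-EINTR so the vcpu loop can get back to userspace:)

/* Simplified from qemu-kvm's cpus.c (1.0-era naming, approximate): */
static void qemu_cpu_kick_thread(CPUState *env)
{
    int err = pthread_kill(env->thread->thread, SIG_IPI);

    if (err) {
        fprintf(stderr, "qemu:%s: %s", __func__, strerror(err));
        exit(1);
    }
}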

>> Again:
>>   - on which host kernels does this occur, and which change may have
>> changed it?
> I do not see it in 3.0.0 and have also not seen it in 2.6.38, both
> the mainline 64-bit ubuntu-server kernels (for natty / oneiric respectively).
> If I compile a more recent kvm-kmod 3.3 or 3.4 on these machines,
> it is no longer working.

I was asking for kernel 3.3 or 3.4 without kvm-kmod.

>>   - with which qemu-kvm version is it reproducible, and which commit
>> introduced or fixed it?
> qemu-kvm-1.0.1 from sourceforge. to get into the scenario it
> is not sufficient to boot from an empty harddisk. to reproduce

Please also try qemu-kvm git to see if something fixed it there.

> i have to use a live cd like ubuntu-server 12.04 and choose to
> boot from the first harddisk. i think the isolinux loader does
> not check for a valid bootsector and just executes what is found
> in sector 0. this leads to the mmio reads i posted and 100%
> cpu load (most spent in kernel). at that time the monitor/qmp
> is still responsive. if i send a command that pauses all vcpus,
> the first cpu is looping in kvm_cpu_exec and the main thread
> is waiting. at that time the monitor stops responding.
> i have also seen this issue on very old windows 2000 servers
> where the system fails to power off and is just halted. maybe
> this is also a busy loop.
> 
> i will try to bisect this asap and let you know, maybe the above
> info helps you already to reproduce.

OK, thanks,

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux




Re: [Qemu-devel] [PATCH] kvm: First step to push iothread lock out of inner run loop

2012-06-28 Thread Avi Kivity
On 06/28/2012 05:10 PM, Anthony Liguori wrote:
>>
>> 1.  read_lock(memmap_lock)
>> 2.  MemoryRegionSection mrs = lookup(addr)
>> 3.  qom_ref(mrs.mr->dev)
>> 4.  read_unlock(memmap_lock)
>>
>> 5.  mutex_lock(dev->lock)
>> 6.  dispatch(&mrs, addr, data, size)
>> 7.  mutex_unlock(dev->lock)
> 
> Just a detail, I don't think we should acquire a device specific lock in
> global code.  Rather, I think we should acquire the global lock before
> dispatch unless a MemoryRegion is marked as being unlocked.

I think I agree.  The mutex acquisition is moved inside dispatch (in
device-specific code).  A side effect is that device code must not call
memory_region_destroy() outside its constructor or destructor.

-- 
error compiling committee.c: too many arguments to function




Re: qemu-kvm-1.0.1 - unable to exit if vcpu is in infinite loop

2012-06-28 Thread Peter Lieven

On 28.06.2012 15:25, Jan Kiszka wrote:

On 2012-06-28 15:05, Peter Lieven wrote:

Hi,

i debugged my initial problem further and found out that the problem
happens to be that
the main thread is stuck in pause_all_vcpus() on reset or quit commands
in the monitor
if one cpu is stuck in the do-while loop kvm_cpu_exec. If I modify the
condition from while (ret == 0)
to while ((ret == 0) && !env->stop); it works, but is this the right fix?
"Quit" command seems to work, but on "Reset" the VM enters pause state.

Before entering the wait loop in pause_all_vcpus, there are kicks sent
to all vcpus. Now we need to find out why some of those kicks apparently
don't reach the destination.

can you explain in short what exactly these kicks do? Do these kicks lead
to leaving kernel mode and returning to userspace?

Again:
  - on which host kernels does this occur, and which change may have
changed it?

I do not see it in 3.0.0 and have also not seen it in 2.6.38, both
the mainline 64-bit ubuntu-server kernels (for natty / oneiric respectively).

If I compile a more recent kvm-kmod 3.3 or 3.4 on these machines,
it is no longer working.

  - with which qemu-kvm version is it reproducible, and which commit
introduced or fixed it?

qemu-kvm-1.0.1 from sourceforge. to get into the scenario it
is not sufficient to boot from an empty harddisk. to reproduce
i have to use a live cd like ubuntu-server 12.04 and choose to
boot from the first harddisk. i think the isolinux loader does
not check for a valid bootsector and just executes what is found
in sector 0. this leads to the mmio reads i posted and 100%
cpu load (most spent in kernel). at that time the monitor/qmp
is still responsive. if i send a command that pauses all vcpus,
the first cpu is looping in kvm_cpu_exec and the main thread
is waiting. at that time the monitor stops responding.
i have also seen this issue on very old windows 2000 servers
where the system fails to power off and is just halted. maybe
this is also a busy loop.

i will try to bisect this asap and let you know, maybe the above
info helps you already to reproduce.

thanks,
peter


I failed reproducing so far.

Jan





Re: [Qemu-devel] [PATCH] kvm: First step to push iothread lock out of inner run loop

2012-06-28 Thread Anthony Liguori

On 06/26/2012 02:34 PM, Marcelo Tosatti wrote:

On Sat, Jun 23, 2012 at 12:55:49AM +0200, Jan Kiszka wrote:

Should have declared this [RFC] in the subject and CC'ed kvm...

On 2012-06-23 00:45, Jan Kiszka wrote:

This sketches a possible path to get rid of the iothread lock on vmexits
in KVM mode. On x86, the in-kernel irqchip has to be used because
we otherwise need to synchronize APIC and other per-cpu state accesses
that could be changed concurrently. Not yet fully analyzed is the NMI
injection path in the absence of an APIC.

s390x should be fine without specific locking as their pre/post-run
callbacks are empty. Power requires locking for the pre-run callback.

This patch is untested, but a similar version was successfully used in
a x86 setup with a network I/O path that needed no central iothread
locking anymore (required special MMIO exit handling).
---
  kvm-all.c |   18 --
  target-i386/kvm.c |7 +++
  target-ppc/kvm.c  |4 
  3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index f8e4328..9c3e26f 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1460,6 +1460,8 @@ int kvm_cpu_exec(CPUArchState *env)
  return EXCP_HLT;
  }

+qemu_mutex_unlock_iothread();
+
  do {
  if (env->kvm_vcpu_dirty) {
  kvm_arch_put_registers(env, KVM_PUT_RUNTIME_STATE);
@@ -1476,14 +1478,16 @@ int kvm_cpu_exec(CPUArchState *env)
   */
  qemu_cpu_kick_self();
  }
-qemu_mutex_unlock_iothread();

  run_ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);

-qemu_mutex_lock_iothread();
  kvm_arch_post_run(env, run);

+/* TODO: push coalesced mmio flushing to the point where we access
+ * devices that are using it (currently VGA and E1000). */
+qemu_mutex_lock_iothread();
  kvm_flush_coalesced_mmio_buffer();
+qemu_mutex_unlock_iothread();

  if (run_ret<  0) {
  if (run_ret == -EINTR || run_ret == -EAGAIN) {
@@ -1499,19 +1503,23 @@ int kvm_cpu_exec(CPUArchState *env)
  switch (run->exit_reason) {
  case KVM_EXIT_IO:
  DPRINTF("handle_io\n");
+qemu_mutex_lock_iothread();
  kvm_handle_io(run->io.port,
(uint8_t *)run + run->io.data_offset,
run->io.direction,
run->io.size,
run->io.count);
+qemu_mutex_unlock_iothread();
  ret = 0;
  break;
  case KVM_EXIT_MMIO:
  DPRINTF("handle_mmio\n");
+qemu_mutex_lock_iothread();
  cpu_physical_memory_rw(run->mmio.phys_addr,
 run->mmio.data,
 run->mmio.len,
 run->mmio.is_write);
+qemu_mutex_unlock_iothread();
  ret = 0;
  break;
  case KVM_EXIT_IRQ_WINDOW_OPEN:
@@ -1520,7 +1528,9 @@ int kvm_cpu_exec(CPUArchState *env)
  break;
  case KVM_EXIT_SHUTDOWN:
  DPRINTF("shutdown\n");
+qemu_mutex_lock_iothread();
  qemu_system_reset_request();
+qemu_mutex_unlock_iothread();
  ret = EXCP_INTERRUPT;
  break;
  case KVM_EXIT_UNKNOWN:
@@ -1533,11 +1543,15 @@ int kvm_cpu_exec(CPUArchState *env)
  break;
  default:
  DPRINTF("kvm_arch_handle_exit\n");
+qemu_mutex_lock_iothread();
  ret = kvm_arch_handle_exit(env, run);
+qemu_mutex_unlock_iothread();
  break;
  }
  } while (ret == 0);

+qemu_mutex_lock_iothread();
+
  if (ret<  0) {
  cpu_dump_state(env, stderr, fprintf, CPU_DUMP_CODE);
  vm_stop(RUN_STATE_INTERNAL_ERROR);
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0d0d8f6..0ad64d1 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1631,7 +1631,10 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run 
*run)

  /* Inject NMI */
  if (env->interrupt_request&  CPU_INTERRUPT_NMI) {
+qemu_mutex_lock_iothread();
  env->interrupt_request&= ~CPU_INTERRUPT_NMI;
+qemu_mutex_unlock_iothread();
+
  DPRINTF("injected NMI\n");
  ret = kvm_vcpu_ioctl(env, KVM_NMI);
  if (ret<  0) {
@@ -1641,6 +1644,8 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run 
*run)
  }

  if (!kvm_irqchip_in_kernel()) {
+qemu_mutex_lock_iothread();
+
  /* Force the VCPU out of its inner loop to process any INIT requests
   * or pending TPR access reports. */
  if (env->interrupt_request&
@@ -1682,6 +1687,8 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run 
*run)

  DPRINTF("setting tpr\n");
  run->cr8 = cpu_get_apic_tpr(env->apic_state);
+
+qemu_mutex_unlock_iothread();
  

Re: [PATCH v2 5/6] kvm: KVM_EOIFD, an eventfd for EOIs

2012-06-28 Thread Gleb Natapov
On Thu, Jun 28, 2012 at 04:11:40PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 27, 2012 at 09:55:44PM -0600, Alex Williamson wrote:
> > On Wed, 2012-06-27 at 17:51 +0300, Gleb Natapov wrote:
> > > On Wed, Jun 27, 2012 at 08:29:04AM -0600, Alex Williamson wrote:
> > > > On Wed, 2012-06-27 at 16:58 +0300, Gleb Natapov wrote:
> > > > > On Tue, Jun 26, 2012 at 11:10:08PM -0600, Alex Williamson wrote:
> > > > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > > > written for a specified irqchip pin.  By default this is a simple
> > > > > > notification, but we can also tie the eoifd to a level irqfd, which
> > > > > > enables the irqchip pin to be automatically de-asserted on EOI.
> > > > > > This mode is particularly useful for device-assignment applications
> > > > > > where the unmask and notify triggers a hardware unmask.  The default
> > > > > > mode is most applicable to simple notify with no side-effects for
> > > > > > userspace usage, such as Qemu.
> > > > > > 
> > > > > > Here we make use of the reference counting of the _irq_source
> > > > > > object allowing us to share it with an irqfd and cleanup regardless
> > > > > > of the release order.
> > > > > > 
> > > > > > Signed-off-by: Alex Williamson 
> > > > > > ---
> > > > > > 
> > > > > >  Documentation/virtual/kvm/api.txt |   24 +
> > > > > >  arch/x86/kvm/x86.c|1 
> > > > > >  include/linux/kvm.h   |   14 +++
> > > > > >  include/linux/kvm_host.h  |   13 +++
> > > > > >  virt/kvm/eventfd.c|  189 
> > > > > > +
> > > > > >  virt/kvm/kvm_main.c   |   11 ++
> > > > > >  6 files changed, 250 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/Documentation/virtual/kvm/api.txt 
> > > > > > b/Documentation/virtual/kvm/api.txt
> > > > > > index b216709..87a2558 100644
> > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > @@ -1987,6 +1987,30 @@ interrupts with those injected through 
> > > > > > KVM_IRQ_LINE.  IRQFDs created
> > > > > >  with KVM_IRQFD_FLAG_LEVEL must also set this flag when de-assigning.
> > > > > >  KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > > > >  
> > > > > > +4.77 KVM_EOIFD
> > > > > > +
> > > > > > +Capability: KVM_CAP_EOIFD
> > > > > > +Architectures: x86
> > > > > > +Type: vm ioctl
> > > > > > +Parameters: struct kvm_eoifd (in)
> > > > > > +Returns: 0 on success, -1 on error
> > > > > > +
> > > > > > +KVM_EOIFD allows userspace to receive EOI notification through an
> > > > > > +eventfd for level triggered irqchip interrupts.  Behavior for edge
> > > > > > +triggered interrupts is undefined.  kvm_eoifd.fd specifies the 
> > > > > > eventfd
> > > > > Let's make it defined. EOI notification can be used by userspace to fix
> > > > > time drift due to lost interrupts. But then userspace needs to know
> > > > > which vcpu did EOI.
> > > > 
> > > > Hmm, do we need an additional flag and field in kvm_eoifd to filter by
> > > > vCPU then?
> > > > 
> > > This will be enough for a use case I am aware of. Don't know if this
> > > interface is generic enough for all possible use cases.
> > 
> > That's generally a hard prediction to make ;)  We currently don't pass a
> > kvm_vcpu anywhere close to the irq ack notifier.  The ioapic path could
> > be relatively trivial, but the pic path is a bit further disconnected.
> > If we had that plumbing, a KVM_CAP plus vcpu filter flag and specifying
> > the vcpu using some of the padding space seems like it's sufficient.
> > I'll drop mention of level-only from the description, but the plumbing
> > and vcpu filtering can be a follow-on.  Thanks,
> > 
> > Alex
> 
> If we don't implement what's needed for time drift to be fixed,
> then IMO it's better to simply require an IRQFD for EOIFD for now,
> and limit this to level. Otherwise when we actually try to implement
> we might find issues.
> 
> Another reason to explicitly say EOI is not supported for edge is that
> EOI might not get invoked at all with PV EOI.
> 
Good point, but easily addressable by disabling PV EOI for a GSI that
has an EOI notifier registered.

--
Gleb.
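To make the proposed interface concrete, userspace registration could look
roughly like the sketch below. Only kvm_eoifd.fd is documented in the quoted
api.txt hunk; the .gsi and .flags members are assumptions drawn from this
discussion, not a posted header:

    /* Hypothetical userspace use of the proposed interface.  struct
     * kvm_eoifd and KVM_EOIFD would come from the series' <linux/kvm.h>;
     * only the .fd member is documented above. */
    #include <linux/kvm.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int register_eoi_notifier(int vm_fd, unsigned gsi)
    {
        int efd = eventfd(0, 0);
        if (efd < 0)
            return -1;

        struct kvm_eoifd eoifd = {
            .fd  = efd,     /* eventfd signalled on each EOI */
            .gsi = gsi,     /* assumed field: irqchip pin to watch */
        };
        if (ioctl(vm_fd, KVM_EOIFD, &eoifd) < 0) {
            close(efd);
            return -1;
        }
        return efd;         /* read() it to consume EOI notifications */
    }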


Re: qemu-kvm-1.0.1 - unable to exit if vcpu is in infinite loop

2012-06-28 Thread Jan Kiszka
On 2012-06-28 15:05, Peter Lieven wrote:
> Hi,
> 
> i debugged my initial problem further and found out that the problem
> happens to be that
> the main thread is stuck in pause_all_vcpus() on reset or quit commands
> in the monitor
> if one cpu is stuck in the do-while loop in kvm_cpu_exec. If I modify the
> condition from while (ret == 0)
> to while ((ret == 0) && !env->stop); it works, but is this the right fix?
> "Quit" command seems to work, but on "Reset" the VM enterns pause state.

Before entering the wait loop in pause_all_vcpus, there are kicks sent
to all vcpus. Now we need to find out why some of those kicks apparently
don't reach the destination.

Again:
 - on which host kernels does this occur, and which change may have
   changed it?
 - with which qemu-kvm version is it reproducible, and which commit
   introduced or fixed it?

I failed reproducing so far.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
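For reference, the kick-then-wait pattern Jan describes looks roughly like
this in QEMU 1.0's cpus.c (paraphrased from memory, so check the tree before
relying on details):

    void pause_all_vcpus(void)
    {
        CPUState *penv = first_cpu;

        while (penv) {
            penv->stop = 1;
            qemu_cpu_kick(penv);               /* force the vcpu out of KVM_RUN */
            penv = (CPUState *)penv->next_cpu;
        }

        while (!all_vcpus_paused()) {
            qemu_cond_wait(&qemu_pause_cond, &qemu_global_mutex);
            /* the real code re-kicks all vcpus here on each wakeup */
        }
    }

If a vcpu thread never re-checks env->stop, or consumes a kick without
leaving its inner loop, all_vcpus_paused() stays false and the main thread
blocks here forever, which matches the reported hang.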



Re: [PATCH v2 5/6] kvm: KVM_EOIFD, an eventfd for EOIs

2012-06-28 Thread Michael S. Tsirkin
On Wed, Jun 27, 2012 at 09:55:44PM -0600, Alex Williamson wrote:
> On Wed, 2012-06-27 at 17:51 +0300, Gleb Natapov wrote:
> > On Wed, Jun 27, 2012 at 08:29:04AM -0600, Alex Williamson wrote:
> > > On Wed, 2012-06-27 at 16:58 +0300, Gleb Natapov wrote:
> > > > On Tue, Jun 26, 2012 at 11:10:08PM -0600, Alex Williamson wrote:
> > > > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > > > written for a specified irqchip pin.  By default this is a simple
> > > > > notification, but we can also tie the eoifd to a level irqfd, which
> > > > > enables the irqchip pin to be automatically de-asserted on EOI.
> > > > > This mode is particularly useful for device-assignment applications
> > > > > where the unmask and notify triggers a hardware unmask.  The default
> > > > > mode is most applicable to simple notify with no side-effects for
> > > > > userspace usage, such as Qemu.
> > > > > 
> > > > > Here we make use of the reference counting of the _irq_source
> > > > > object allowing us to share it with an irqfd and cleanup regardless
> > > > > of the release order.
> > > > > 
> > > > > Signed-off-by: Alex Williamson 
> > > > > ---
> > > > > 
> > > > >  Documentation/virtual/kvm/api.txt |   24 +
> > > > >  arch/x86/kvm/x86.c|1 
> > > > >  include/linux/kvm.h   |   14 +++
> > > > >  include/linux/kvm_host.h  |   13 +++
> > > > >  virt/kvm/eventfd.c|  189 
> > > > > +
> > > > >  virt/kvm/kvm_main.c   |   11 ++
> > > > >  6 files changed, 250 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > > > index b216709..87a2558 100644
> > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > @@ -1987,6 +1987,30 @@ interrupts with those injected through KVM_IRQ_LINE.  IRQFDs created
> > > > >  with KVM_IRQFD_FLAG_LEVEL must also set this flag when de-assigning.
> > > > >  KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > > >  
> > > > > +4.77 KVM_EOIFD
> > > > > +
> > > > > +Capability: KVM_CAP_EOIFD
> > > > > +Architectures: x86
> > > > > +Type: vm ioctl
> > > > > +Parameters: struct kvm_eoifd (in)
> > > > > +Returns: 0 on success, -1 on error
> > > > > +
> > > > > +KVM_EOIFD allows userspace to receive EOI notification through an
> > > > > +eventfd for level triggered irqchip interrupts.  Behavior for edge
> > > > > +triggered interrupts is undefined.  kvm_eoifd.fd specifies the 
> > > > > eventfd
> > > > Let's make it defined. EOI notification can be used by userspace to fix
> > > > time drift due to lost interrupts. But then userspace needs to know
> > > > which vcpu did EOI.
> > > 
> > > Hmm, do we need an additional flag and field in kvm_eoifd to filter by
> > > vCPU then?
> > > 
> > This will be enough for a use case I am aware of. Don't know if this
> > interface is generic enough for all possible use cases.
> 
> That's generally a hard prediction to make ;)  We currently don't pass a
> kvm_vcpu anywhere close to the irq ack notifier.  The ioapic path could
> be relatively trivial, but the pic path is a bit further disconnected.
> If we had that plumbing, a KVM_CAP plus vcpu filter flag and specifying
> the vcpu using some of the padding space seems like it's sufficient.
> I'll drop mention of level-only from the description, but the plumbing
> and vcpu filtering can be a follow-on.  Thanks,
> 
> Alex

If we don't implement what's needed for time drift to be fixed,
then IMO it's better to simply require an IRQFD for EOIFD for now,
and limit this to level. Otherwise when we actually try to implement
we might find issues.

Another reason to explicitly say EOI is not supported for edge is that
EOI might not get invoked at all with PV EOI.


-- 
MST


qemu-kvm-1.0.1 - unable to exit if vcpu is in infinite loop

2012-06-28 Thread Peter Lieven

Hi,

i debugged my initial problem further and found out that the problem 
happens to be that
the main thread is stuck in pause_all_vcpus() on reset or quit commands 
in the monitor
if one cpu is stuck in the do-while loop in kvm_cpu_exec. If I modify the
condition from while (ret == 0)

to while ((ret == 0) && !env->stop); it works, but is this the right fix?
"Quit" command seems to work, but on "Reset" the VM enterns pause state.

Thanks,
Peter
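Spelled out, the change described above has this shape against qemu-kvm
1.0.x's kvm_cpu_exec() (everything except the loop condition is abbreviated;
illustrative, not the actual diff):

    int kvm_cpu_exec(CPUState *env)
    {
        int ret;

        do {
            /* ... kvm_arch_pre_run(), the KVM_RUN ioctl, and exit
             * handling all elided; 'ret' comes from the exit handler ... */
        } while (ret == 0 && !env->stop);   /* proposed: also honour a pending
                                               stop request from pause_all_vcpus() */

        return ret;
    }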



Re: [PATCH v2 6/6] kvm: Level IRQ de-assert for KVM_IRQFD

2012-06-28 Thread Avi Kivity
On 06/27/2012 08:10 AM, Alex Williamson wrote:
> This is an alternate level irqfd de-assert mode that's potentially
> useful for emulated drivers.  It's included here to show how easy it
> is to implement with the new level irqfd and eoifd support.  It's
> possible this mode might also prove interesting for device-assignment
> where we inject via level irqfd, receive an EOI (w/o de-assert), and
> use the level de-assert irqfd here.

This use case is racy.  The guest driver will have shut down the
interrupt before EOI, but with what you describe, it will fire again
until the eoifd/deassertfd sequence completes.

An emulated device will see the guest driver shutting down the interrupt
so it's not a problem there.

-- 
error compiling committee.c: too many arguments to function




Re: [PATCH v2 4/6] kvm: Extend irqfd to support level interrupts

2012-06-28 Thread Avi Kivity
On 06/28/2012 12:19 AM, Alex Williamson wrote:
>> > @@ -302,6 +385,7 @@ kvm_irqfd_deassign(struct kvm *kvm, struct kvm_irqfd 
>> > *args)
>> >  {
>> >struct _irqfd *irqfd, *tmp;
>> >struct eventfd_ctx *eventfd;
>> > +  bool is_level = (args->flags & KVM_IRQFD_FLAG_LEVEL) != 0;
>> 
>> != 0 is not needed here I think.
> 
> Strictly, no it's not.  I just prefer to set bools from bools and I
> think Avi has commented on using !! before.

I prefer the naked version too, but != 0 is acceptable.
-- 
error compiling committee.c: too many arguments to function
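For readers following the style point: with a C99/kernel bool, all three
spellings yield the same value, because assignment to bool normalizes any
nonzero result to true:

    bool is_level = (args->flags & KVM_IRQFD_FLAG_LEVEL) != 0; /* as posted */
    bool is_level = !!(args->flags & KVM_IRQFD_FLAG_LEVEL);    /* the !! idiom */
    bool is_level = args->flags & KVM_IRQFD_FLAG_LEVEL;        /* the "naked" form */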




Re: [PATCH v2 3/6] kvm: Sanitize KVM_IRQFD flags

2012-06-28 Thread Avi Kivity
On 06/27/2012 12:21 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 26, 2012 at 11:09:32PM -0600, Alex Williamson wrote:
>> We only know of one so far.
>> 
>> Signed-off-by: Alex Williamson 
> 
> Ugh. So we have a bug: we should have sanitized the fields.
> If there's buggy userspace that only set the low bit
> it will break with this change.
> Is it too late now? Do we need KVM_IRQFD2 which
> sanitizes fields properly? Avi?

We try and see.  Commit this; if somebody complains, revert after
apologizing profusely.  If no one notices, we can claim those bits.



-- 
error compiling committee.c: too many arguments to function
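The kind of sanitization being discussed would look roughly like this;
KVM_IRQFD_FLAG_DEASSIGN is existing uapi, KVM_IRQFD_FLAG_LEVEL comes from
this series, and the mask name is invented for the sketch:

    #define KVM_IRQFD_FLAGS_MASK (KVM_IRQFD_FLAG_DEASSIGN | KVM_IRQFD_FLAG_LEVEL)

    int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args)
    {
        /* Reject unknown bits so the remaining flag space and the
         * padding stay claimable by future extensions. */
        if (args->flags & ~KVM_IRQFD_FLAGS_MASK)
            return -EINVAL;

        if (args->flags & KVM_IRQFD_FLAG_DEASSIGN)
            return kvm_irqfd_deassign(kvm, args);

        return kvm_irqfd_assign(kvm, args);
    }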




Re: [PATCH v2 1/6] kvm: Pass kvm_irqfd to functions

2012-06-28 Thread Michael S. Tsirkin
On Thu, Jun 28, 2012 at 02:00:41PM +0200, Cornelia Huck wrote:
> On Thu, 28 Jun 2012 12:34:43 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Thu, Jun 28, 2012 at 11:03:16AM +0200, Cornelia Huck wrote:
> 
> > > How about something like this as parameter for the new ioctl?
> > > 
> > > struct kvm_irqfd2 {
> > >   __u32 fd;
> > >   __u32 flags;  /* for things like deassign */
> > >   __u64 type;   /* determines the payload */
> > >   union {
> > >   /* type traditional */
> > >   struct {
> > >   __u32 gsi;
> > >   } trad;
> > >   /* type s390 */
> > >   struct {
> > >   __u32 int_type;
> > >   __u32 parm;
> > >   __u64 parm64;
> > >   } s390;
> > >   __u8 pad[20];
> > >   };
> > > }
> > > 
> > > This could be combined with an arch or a per-kvm callback to keep the
> > > generic code clean of architecture dependencies.
> > > 
> > > Cornelia
> > 
> > Looks a bit weird - shouldn't all this be part of gsi routing?
> > But no idea really, I don't see the big picture here.
> > 
> 
> Well, on s390 we don't have anything like "gsi routing" (I'm not even
> really sure what that is).

I really mean kvm_irq_routing. This has options for
irqchip, msi, I guess we can add more.


> My understanding is the following:
> 
> - Basically, we want to notify the guest for a virtqueue.
> - For this, we want to inject an interrupt for the associated device.
> - On x86, this means raising an interrupt on an interrupt line, as
>   specified by some kind of number.

Not just that: for MSI we pass in data encoding priority,
destination, etc.

> - On s390, we need some more information to (a) identify the device and
>   (b) additional information that needs to be transmitted for an
>   interrupt (device specific). (This is what basically goes into struct
>   kvm_s390_interrupt, which I reproduced in the s390 part.)
> 
> Cornelia

Is this (b) mostly static or does it change for each interrupt?
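For readers unfamiliar with the mechanism Michael means: the existing uapi
routing entry already carries a per-type payload union, which is why folding
new interrupt kinds into routing looks natural. Quoted from memory from
include/linux/kvm.h of this era, so verify against a real tree:

    struct kvm_irq_routing_irqchip {
            __u32 irqchip;
            __u32 pin;
    };

    struct kvm_irq_routing_msi {
            __u32 address_lo;
            __u32 address_hi;
            __u32 data;
            __u32 pad;
    };

    struct kvm_irq_routing_entry {
            __u32 gsi;
            __u32 type;     /* KVM_IRQ_ROUTING_IRQCHIP or KVM_IRQ_ROUTING_MSI */
            __u32 flags;
            __u32 pad;
            union {
                    struct kvm_irq_routing_irqchip irqchip;
                    struct kvm_irq_routing_msi msi;
                    __u32 pad[8];
            } u;
    };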


Re: [PATCH v2 1/6] kvm: Pass kvm_irqfd to functions

2012-06-28 Thread Cornelia Huck
On Thu, 28 Jun 2012 12:34:43 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Jun 28, 2012 at 11:03:16AM +0200, Cornelia Huck wrote:

> > How about something like this as parameter for the new ioctl?
> > 
> > struct kvm_irqfd2 {
> > __u32 fd;
> > __u32 flags;  /* for things like deassign */
> > __u64 type;   /* determines the payload */
> > union {
> > /* type traditional */
> > struct {
> > __u32 gsi;
> > } trad;
> > /* type s390 */
> > struct {
> > __u32 int_type;
> > __u32 parm;
> > __u64 parm64;
> > } s390;
> > __u8 pad[20];
> > };
> > }
> > 
> > This could be combined with an arch or a per-kvm callback to keep the
> > generic code clean of architecture dependencies.
> > 
> > Cornelia
> 
> Looks a bit weird - shouldn't all this be part of gsi routing?
> But no idea really, I don't see the big picture here.
> 

Well, on s390 we don't have anything like "gsi routing" (I'm not even
really sure what that is).

My understanding is the following:

- Basically, we want to notify the guest for a virtqueue.
- For this, we want to inject an interrupt for the associated device.
- On x86, this means raising an interrupt on an interrupt line, as
  specified by some kind of number.
- On s390, we need some more information to (a) identify the device and
  (b) additional information that needs to be transmitted for an
  interrupt (device specific). (This is what basically goes into struct
  kvm_s390_interrupt, which I reproduced in the s390 part.)

Cornelia
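For reference, the existing structure the s390 fields above mirror is small;
from include/linux/kvm.h of this era (quoted from memory):

    struct kvm_s390_interrupt {
            __u32 type;
            __u32 parm;
            __u64 parm64;
    };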



Re: [PATCH v3 4/5] Sysfs: Export VMCSINFO via sysfs

2012-06-28 Thread Greg KH
On Thu, Jun 28, 2012 at 05:54:30PM +0800, Yanfei Zhang wrote:
> 于 2012年06月28日 03:22, Greg KH 写道:
> > On Wed, Jun 27, 2012 at 04:54:54PM +0800, Yanfei Zhang wrote:
> >> This patch exports offsets of fields via /sys/devices/cpu/vmcs/.
> >> Individual offsets are contained in subfiles named by the field's
> >> encoding, e.g.: /sys/devices/cpu/vmcs/0800
> >>
> >> Signed-off-by: zhangyanfei 
> >> ---
> >>  drivers/base/core.c |   13 +
> >>  1 files changed, 13 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/drivers/base/core.c b/drivers/base/core.c
> >> index 346be8b..dd05ee7 100644
> >> --- a/drivers/base/core.c
> >> +++ b/drivers/base/core.c
> >> @@ -26,6 +26,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> > 
> > Did you just break the build on all other arches?  Not nice.
> > 
> >> @@ -1038,6 +1039,11 @@ int device_add(struct device *dev)
> >>error = dpm_sysfs_add(dev);
> >>if (error)
> >>goto DPMError;
> >> +#if defined(CONFIG_KVM_INTEL) || defined(CONFIG_KVM_INTEL_MODULE)
> >> +  error = vmcs_sysfs_add(dev);
> >> +  if (error)
> >> +  goto VMCSError;
> >> +#endif
> > 
> > Oh my no, that's no way to ever do this, you know better than that,
> > please fix.
> > 
> > greg k-h
> > 
> 
> Sorry for my thoughtlessness. Here is the new patch.
> 
> ---
>  drivers/base/core.c |   13 +
>  1 files changed, 13 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 346be8b..7b5266a 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -30,6 +30,13 @@
>  #include "base.h"
>  #include "power/power.h"
>  
> +#if defined(CONFIG_KVM_INTEL) || defined(CONFIG_KVM_INTEL_MODULE)
> +#include 
> +#else
> +static inline int vmcs_sysfs_add(struct device *dev) { return 0; }
> +static inline void vmcs_sysfs_remove(struct device *dev) { }
> +#endif

{sigh}  No, again, you know better, don't do this.

greg k-h


Re: [Openstack-operators] [Openstack] Nova Controller HA issues

2012-06-28 Thread Igor Laskovy
Hi guys, are there any updates here?

On Sun, Jun 17, 2012 at 12:59 PM, Igor Laskovy  wrote:
> John, Jason, can you please clarify concretely what these bad things are? For
> example, the worst case.
>
> Yoshi, Kei, can you please clarify the current status of Kemari. How far is it
> from production usage?
>
> On Fri, Jun 15, 2012 at 5:48 PM, Jason Hedden  wrote:
>> I'm running 2 full nova controllers behind a NGINX load balancer.  While
>> there still is that chance of half completed tasks, it's been working very
>> well.
>>
>> Each nova controller is running (full time) nova-scheduler, nova-cert,
>> keystone, and 6 nova-api processes. All API requests go through NGINX which
>> reverse proxies the traffic to these 2 systems.
>>
>> example Nginx nova-api config:
>> upstream nova-api  {
>>  server hostA:8774 fail_timeout=30s;
>>  server hostB:8774 fail_timeout=30s;
>>  server hostA:18774 fail_timeout=30s;
>>  server hostB:18774 fail_timeout=30s;
>>  server hostA:28774 fail_timeout=30s;
>>  server hostB:28774 fail_timeout=30s;
>>  server hostA:38774 fail_timeout=30s;
>>  server hostB:38774 fail_timeout=30s;
>>  server hostA:48774 fail_timeout=30s;
>>  server hostB:48774 fail_timeout=30s;
>>  server hostA:58774 fail_timeout=30s;
>>  server hostB:58774 fail_timeout=30s;
>> }
>>
>> server {
>>  listen x.x.x.x:8774;
>>  server_name public.name;
>>
>>  location / {
>>    proxy_pass  http://nova-api;
>>    proxy_set_header        Host            "public.address:8774";
>>    proxy_set_header        X-Real-IP       $remote_addr;
>>    proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
>>  }
>> }
>>
>>
>> Attached is a diagram that gives a brief overview of the HA environment
>> I've setup.
>>
>> --Jason Hedden
>>
>>
>> On Jun 15, 2012, at 5:36 AM, John Garbutt wrote:
>>
>>> I know there is some work in the XenAPI driver to make it resilient to
>>> these kinds of failures (to allow frequent updates of the nova code), and I
>>> think there were plans for the work to be reused in the Libvirt driver.
>>>
>>> AFAIK, in Essex and lower, bad things can happen if you don’t wait for
>>> all the tasks to finish. You may well be OK some of the time.
>>>
>>> It boils down to an issue of consuming the message from Rabbit but not
>>> completing the task, and not being able to recover from half completed
>>> tasks.
>>>
>>> Hope that helps,
>>> John
>>>
>>> From: Igor Laskovy [mailto:igor.lask...@gmail.com]
>>> Sent: 15 June 2012 11:31
>>> To: Christian Parpart
>>> Cc: John Garbutt; openstack-operat...@lists.openstack.org;
>>> <openst...@lists.launchpad.net>
>>> Subject: Re: [Openstack-operators] Nova Controller HA issues
>>>
>>> I am using OpenStack for my little lab for a short time too))
>>>
>>> Ok, you are right of course, but I meant a somewhat different design when I
>>> talked about virtualizing the controller nodes.
>>>
>>> It can be just two dedicated hypervisors with a dedicated share/DRBD
>>> between them. These hypervisors would be standalone, and not part of nova.
>>> Then maybe Pacemaker or another tool can take over the availability function
>>> and restart the VM on the surviving node when the active one dies.
>>>
>>> The main question here is how bad it can be if a controller node powers off
>>> unexpectedly. In other words, when the VM restarts it will be in a
>>> crash-consistent state.
>>> Will some nova services lose state here?
>>> Will RabbitMQ lose some data here? (I am new to RabbitMQ too)
>>>
>>> Igor Laskovy
>>> facebook.com/igor.laskovy
>>> Kiev, Ukraine
>>>
>>> On Jun 15, 2012 10:54 AM, "Christian Parpart"  wrote:
>>> Hey,
>>>
>>> well, I said "I might be wrong" because I have no "clear" vision on how
>>> OpenStack works in
>>> its deepest detail. However, I would not like to depend on a controller
>>> node that
>>> is inside a virtual machine, controlled by compute nodes, that are
>>> controlled by the controller
>>> node. This sounds quite like a chicken-and-egg problem.
>>>
>>> However, at the time of this writing, I think you'll have to have a
>>> working nova-scheduler process,
>>> which is responsible for deciding on which compute node to spawn your VM
>>> (what else?),
>>> and think about what you do when this (or all your controller) VMs
>>> die terribly
>>> and you want to rebuild them: how do you plan to do this when your
>>> controller node is out of service?
>>>
>>> In my case, I have put the controller services onto two compute nodes, and
>>> use Pacemaker
>>> to switch between them, in case one node goes down, the other can take
>>> over (via shared service-IP).
>>>
>>> Again, these are my thoughts, and I am using OpenStack for just about a
>>> month now :-)
>>> But I hope this helps a bit...
>>>
>>> Best regards,
>>> Christian Parpart.
>>>
>>> On Fri, Jun 15, 2012 at 8:16 AM, Igor Laskovy 
>>> wrote:
>>> Why? Can you please clarify.
>>>
>>> Igor Laskovy
>>> facebook.com/igor.laskovy
>>> Kiev, Ukraine
>>>
>>> On Jun 15, 2012 1:55 AM, "Christian Parpart"  wrote:
>>> I don't think putting the controller node completely into a VM is a g

Re: race between kvm-kmod-3.0 and kvm-kmod-3.3 // was: race condition in qemu-kvm-1.0.1

2012-06-28 Thread Peter Lieven

On 28.06.2012 11:39, Jan Kiszka wrote:

On 2012-06-28 11:31, Peter Lieven wrote:

On 28.06.2012 11:21, Jan Kiszka wrote:

On 2012-06-28 11:11, Peter Lieven wrote:

On 27.06.2012 18:54, Jan Kiszka wrote:

On 2012-06-27 17:39, Peter Lieven wrote:

Hi all,

i debugged this further and found out that kvm-kmod-3.0 is working with
qemu-kvm-1.0.1 while kvm-kmod-3.3 and kvm-kmod-3.4 are not. What is
working as well is kvm-kmod-3.4 with an old userspace (qemu-kvm-0.13.0).
Has anyone a clue which new KVM feature could cause this if a vcpu is in
an infinite loop?

Before accusing kvm-kmod ;), can you check if the effect is visible with
an original Linux 3.3.x or 3.4.x kernel as well?

sorry, i should have been more specific. maybe I also misunderstood sth.
I was believing that kvm-kmod-3.0 is basically what is in vanilla kernel
3.0. If I use the ubuntu kernel from ubuntu oneiric (3.0.0) it works, if
I use
a self-compiled kvm-kmod-3.3/3.4 with that kernel it doesn't.
however, maybe we don't have to dig too deep - see below.

kvm-kmod wraps and patches things to make the kvm code from 3.3/3.4
working on an older kernel. This step may introduce bugs of its own.
Therefore my suggestion to use a "real" 3.x kernel to exclude that risk
first of all.


Then, bisecting the change in qemu-kvm that apparently resolved the
issue would be interesting.

If we have to dig deeper, tracing [1] the lockup would likely be helpful
(all events of the qemu process, not just KVM related ones: trace-cmd
record -e all qemu-system-x86_64 ...).

here is basically what's going on:

qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio read
len 3 gpa 0xa val 0x10ff
  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)

it's doing that forever. this is tracing the kvm module. doing the
qemu-system-x86_64 trace is a bit complicated, but
maybe this is already sufficient. otherwise i will of course gather this
info as well.

That's only tracing KVM event, and it's tracing when things went wrong
already. We may need a full trace (-e all) specifically for the period
when this pattern above started.

i will do that. maybe i should explain that the vcpu is executing
garbage when the above starts. it's basically booting from an empty
harddisk.

if i understand correctly qemu-kvm loops in kvm_cpu_exec(CPUState *env);

maybe the time to handle the monitor/qmp connection is just too short.
if i further understand correctly, it can only handle monitor connections
while qemu-kvm is executing kvm_vcpu_ioctl(env, KVM_RUN, 0); or am i
wrong here? the time spent in this state might be rather short.

Unless you played with priorities and affinities, the Linux scheduler
should provide the required time to the iothread.


my concern is not that the machine hangs, just that the hypervisor is
unresponsive
and it's impossible to reset or quit gracefully. the only way to get the
hypervisor
ended is via SIGKILL.

Right. Even if the guest runs wild, you must be able to control the vm
via the monitor etc. If not, that's a bug.

what i observed just now is that the monitor is working up to the
point i try to quit the hypervisor or try to reset the cpu.

so we were looking in a completely wrong place...

it seems from this short excerpt that the deadlock appears not
during execution but when the vcpus are to be paused.

Program received signal SIGINT, Interrupt.
0x7fc8ec36785c in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib/libpthread.so.0

(gdb) thread apply all bt

Thread 4 (Thread 0x7

Re: race between kvm-kmod-3.0 and kvm-kmod-3.3 // was: race condition in qemu-kvm-1.0.1

2012-06-28 Thread Peter Lieven

On 28.06.2012 11:39, Jan Kiszka wrote:

On 2012-06-28 11:31, Peter Lieven wrote:

On 28.06.2012 11:21, Jan Kiszka wrote:

On 2012-06-28 11:11, Peter Lieven wrote:

On 27.06.2012 18:54, Jan Kiszka wrote:

On 2012-06-27 17:39, Peter Lieven wrote:

Hi all,

i debugged this further and found out that kvm-kmod-3.0 is working with
qemu-kvm-1.0.1 while kvm-kmod-3.3 and kvm-kmod-3.4 are not. What is
working as well is kvm-kmod-3.4 with an old userspace (qemu-kvm-0.13.0).
Has anyone a clue which new KVM feature could cause this if a vcpu is in
an infinite loop?

Before accusing kvm-kmod ;), can you check if the effect is visible with
an original Linux 3.3.x or 3.4.x kernel as well?

sorry, i should have been more specific. maybe I also misunderstood sth.
I was believing that kvm-kmod-3.0 is basically what is in vanilla kernel
3.0. If I use the ubuntu kernel from ubuntu oneiric (3.0.0) it works, if
I use
a self-compiled kvm-kmod-3.3/3.4 with that kernel it doesn't.
however, maybe we don't have to dig too deep - see below.

kvm-kmod wraps and patches things to make the kvm code from 3.3/3.4
working on an older kernel. This step may introduce bugs of its own.
Therefore my suggestion to use a "real" 3.x kernel to exclude that risk
first of all.


Then, bisecting the change in qemu-kvm that apparently resolved the
issue would be interesting.

If we have to dig deeper, tracing [1] the lockup would likely be helpful
(all events of the qemu process, not just KVM related ones: trace-cmd
record -e all qemu-system-x86_64 ...).

here is basically what's going on:

qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio read
len 3 gpa 0xa val 0x10ff
  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)

it's doing that forever. this is tracing the kvm module. doing the
qemu-system-x86_64 trace is a bit complicated, but
maybe this is already sufficient. otherwise i will of course gather this
info as well.

That's only tracing KVM event, and it's tracing when things went wrong
already. We may need a full trace (-e all) specifically for the period
when this pattern above started.

i will do that. maybe i should explain that the vcpu is executing
garbage when the above starts. it's basically booting from an empty
harddisk.

if i understand correctly qemu-kvm loops in kvm_cpu_exec(CPUState *env);

maybe the time to handle the monitor/qmp connection is just too short.
if i further understand correctly, it can only handle monitor connections
while qemu-kvm is executing kvm_vcpu_ioctl(env, KVM_RUN, 0); or am i
wrong here? the time spent in this state might be rather short.

Unless you played with priorities and affinities, the Linux scheduler
should provide the required time to the iothread.

I have a 1.1GB (85MB compressed) trace-file. If you have time to
look at it I could drop it somewhere.

We currently run all VMs with nice 1 because we observed that
this improves the controllability of the node in case all VMs
have excessive CPU load. Running the VM un-niced does
not change the behaviour, unfortunately.

Peter

my concern is not that the machine hangs, just that the hypervisor is
unresponsive
and it's impossible to reset or quit gracefully. the only way to get the
hypervisor
ended is via SIGKILL.

Right. Even if the guest runs wild, you must be able to control the vm
via the monitor etc. If not, that's a bug.

Jan




Re: [PATCH v3 4/5] Sysfs: Export VMCSINFO via sysfs

2012-06-28 Thread Yanfei Zhang
于 2012年06月28日 03:22, Greg KH 写道:
> On Wed, Jun 27, 2012 at 04:54:54PM +0800, Yanfei Zhang wrote:
>> This patch exports offsets of fields via /sys/devices/cpu/vmcs/.
>> Individual offsets are contained in subfiles named by the field's
>> encoding, e.g.: /sys/devices/cpu/vmcs/0800
>>
>> Signed-off-by: zhangyanfei 
>> ---
>>  drivers/base/core.c |   13 +
>>  1 files changed, 13 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/base/core.c b/drivers/base/core.c
>> index 346be8b..dd05ee7 100644
>> --- a/drivers/base/core.c
>> +++ b/drivers/base/core.c
>> @@ -26,6 +26,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
> 
> Did you just break the build on all other arches?  Not nice.
> 
>> @@ -1038,6 +1039,11 @@ int device_add(struct device *dev)
>>  error = dpm_sysfs_add(dev);
>>  if (error)
>>  goto DPMError;
>> +#if defined(CONFIG_KVM_INTEL) || defined(CONFIG_KVM_INTEL_MODULE)
>> +error = vmcs_sysfs_add(dev);
>> +if (error)
>> +goto VMCSError;
>> +#endif
> 
> Oh my no, that's no way to ever do this, you know better than that,
> please fix.
> 
> greg k-h
> 

Sorry for my thoughtlessness. Here is the new patch.

---
 drivers/base/core.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 346be8b..7b5266a 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -30,6 +30,13 @@
 #include "base.h"
 #include "power/power.h"
 
+#if defined(CONFIG_KVM_INTEL) || defined(CONFIG_KVM_INTEL_MODULE)
+#include 
+#else
+static inline int vmcs_sysfs_add(struct device *dev) { return 0; }
+static inline void vmcs_sysfs_remove(struct device *dev) { }
+#endif
+
 #ifdef CONFIG_SYSFS_DEPRECATED
 #ifdef CONFIG_SYSFS_DEPRECATED_V2
 long sysfs_deprecated = 1;
@@ -1038,6 +1045,9 @@ int device_add(struct device *dev)
error = dpm_sysfs_add(dev);
if (error)
goto DPMError;
+   error = vmcs_sysfs_add(dev);
+   if (error)
+   goto VMCSError;
device_pm_add(dev);
 
/* Notify clients of device addition.  This call must come
@@ -1069,6 +1079,8 @@ int device_add(struct device *dev)
 done:
put_device(dev);
return error;
+ VMCSError:
+   dpm_sysfs_remove(dev);
  DPMError:
bus_remove_device(dev);
  BusError:
@@ -1171,6 +1183,7 @@ void device_del(struct device *dev)
blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
 BUS_NOTIFY_DEL_DEVICE, dev);
device_pm_remove(dev);
+   vmcs_sysfs_remove(dev);
dpm_sysfs_remove(dev);
if (parent)
klist_del(&dev->p->knode_parent);
-- 
1.7.1
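One way to keep the conditional out of drivers/base/core.c entirely is to
move it into a header of its own and include that unconditionally; the
header name below is made up for the sketch, and only the two function
signatures come from the patch:

    /* hypothetical include/linux/vmcsinfo.h */
    #ifndef _LINUX_VMCSINFO_H
    #define _LINUX_VMCSINFO_H

    struct device;

    #if defined(CONFIG_KVM_INTEL) || defined(CONFIG_KVM_INTEL_MODULE)
    int vmcs_sysfs_add(struct device *dev);
    void vmcs_sysfs_remove(struct device *dev);
    #else
    static inline int vmcs_sysfs_add(struct device *dev) { return 0; }
    static inline void vmcs_sysfs_remove(struct device *dev) { }
    #endif

    #endif /* _LINUX_VMCSINFO_H */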



Re: race between kvm-kmod-3.0 and kvm-kmod-3.3 // was: race condition in qemu-kvm-1.0.1

2012-06-28 Thread Jan Kiszka
On 2012-06-28 11:31, Peter Lieven wrote:
> On 28.06.2012 11:21, Jan Kiszka wrote:
>> On 2012-06-28 11:11, Peter Lieven wrote:
>>> On 27.06.2012 18:54, Jan Kiszka wrote:
 On 2012-06-27 17:39, Peter Lieven wrote:
> Hi all,
>
> i debugged this further and found out that kvm-kmod-3.0 is working with
> qemu-kvm-1.0.1 while kvm-kmod-3.3 and kvm-kmod-3.4 are not. What is
> working as well is kvm-kmod-3.4 with an old userspace (qemu-kvm-0.13.0).
> Has anyone a clue which new KVM feature could cause this if a vcpu is in
> an infinite loop?
 Before accusing kvm-kmod ;), can you check if the effect is visible with
 an original Linux 3.3.x or 3.4.x kernel as well?
>>> sorry, i should have been more specific. maybe I also misunderstood sth.
>>> I was believing that kvm-kmod-3.0 is basically what is in vanilla kernel
>>> 3.0. If I use the ubuntu kernel from ubuntu oneiric (3.0.0) it works, if
>>> I use
>>> a self-compiled kvm-kmod-3.3/3.4 with that kernel it doesn't.
>>> however, maybe we don't have to dig too deep - see below.
>> kvm-kmod wraps and patches things to make the kvm code from 3.3/3.4
>> working on an older kernel. This step may introduce bugs of its own.
>> Therefore my suggestion to use a "real" 3.x kernel to exclude that risk
>> first of all.
>>
 Then, bisecting the change in qemu-kvm that apparently resolved the
 issue would be interesting.

 If we have to dig deeper, tracing [1] the lockup would likely be helpful
 (all events of the qemu process, not just KVM related ones: trace-cmd
 record -e all qemu-system-x86_64 ...).
>>> here is basically what's going on:
>>>
>>>qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio read
>>> len 3 gpa 0xa val 0x10ff
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
>>> 0xa gpa 0xa Read GPA
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
>>> unsatisfied-read len 1 gpa 0xa val 0x0
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
>>> KVM_EXIT_MMIO (6)
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
>>> read len 3 gpa 0xa val 0x10ff
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
>>> 0xa gpa 0xa Read GPA
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
>>> unsatisfied-read len 1 gpa 0xa val 0x0
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
>>> KVM_EXIT_MMIO (6)
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
>>> read len 3 gpa 0xa val 0x10ff
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
>>> 0xa gpa 0xa Read GPA
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
>>> unsatisfied-read len 1 gpa 0xa val 0x0
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
>>> KVM_EXIT_MMIO (6)
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
>>> read len 3 gpa 0xa val 0x10ff
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
>>> 0xa gpa 0xa Read GPA
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
>>> unsatisfied-read len 1 gpa 0xa val 0x0
>>>  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
>>> KVM_EXIT_MMIO (6)
>>>
>>> it's doing that forever. this is tracing the kvm module. doing the
>>> qemu-system-x86_64 trace is a bit complicated, but
>>> maybe this is already sufficient. otherwise i will of course gather this
>>> info as well.
>> That's only tracing KVM event, and it's tracing when things went wrong
>> already. We may need a full trace (-e all) specifically for the period
>> when this pattern above started.
> i will do that. maybe i should explain that the vcpu is executing
> garbage when the above starts. it's basically booting from an empty
> harddisk.
> 
> if i understand correctly qemu-kvm loops in kvm_cpu_exec(CPUState *env);
> 
> maybe the time to handle the monitor/qmp connection is just too short.
> if i further understand correctly, it can only handle monitor connections
> while qemu-kvm is executing kvm_vcpu_ioctl(env, KVM_RUN, 0); or am i
> wrong here? the time spent in this state might be rather short.

Unless you played with priorities and affinities, the Linux scheduler
should provide the required time to the iothread.

> 
> my concern is not that the machine hangs, just that the hypervisor is 
> unresponsive
> and it's impossible to reset or quit gracefully. the only way to get the 
> hypervisor
> ended is via SIGKILL.

Right. Even if the guest runs wild, you must be able to control the vm
via the monitor etc. If not, that's a bug.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux



Re: race between kvm-kmod-3.0 and kvm-kmod-3.3 // was: race condition in qemu-kvm-1.0.1

2012-06-28 Thread Peter Lieven

does anyone know what this is here in handle_mmio?

/* hack: Red Hat 7.1 generates these weird accesses. */
if ((addr > 0xa-4 && addr <= 0xa) && kvm_run->mmio.len == 3)
return 0;

thanks,
peter

On 28.06.2012 11:31, Peter Lieven wrote:

On 28.06.2012 11:21, Jan Kiszka wrote:

On 2012-06-28 11:11, Peter Lieven wrote:

On 27.06.2012 18:54, Jan Kiszka wrote:

On 2012-06-27 17:39, Peter Lieven wrote:

Hi all,

i debugged this further and found out that kvm-kmod-3.0 is working 
with

qemu-kvm-1.0.1 while kvm-kmod-3.3 and kvm-kmod-3.4 are not. What is
working as well is kvm-kmod-3.4 with an old userspace 
(qemu-kvm-0.13.0).
Has anyone a clue which new KVM feature could cause this if a vcpu 
is in

an infinite loop?
Before accusing kvm-kmod ;), can you check if the effect is visible 
with

an original Linux 3.3.x or 3.4.x kernel as well?
sorry, i should have been more specific. maybe I also misunderstood 
sth.
I was believing that kvm-kmod-3.0 is basically what is in vanilla 
kernel
3.0. If I use the ubuntu kernel from ubuntu oneiric (3.0.0) it 
works, if

I use
a self-compiled kvm-kmod-3.3/3.4 with that kernel it doesn't.
however, maybe we don't have to dig too deep - see below.

kvm-kmod wraps and patches things to make the kvm code from 3.3/3.4
working on an older kernel. This step may introduce bugs of its own.
Therefore my suggestion to use a "real" 3.x kernel to exclude that risk
first of all.


Then, bisecting the change in qemu-kvm that apparently resolved the
issue would be interesting.

If we have to dig deeper, tracing [1] the lockup would likely be 
helpful

(all events of the qemu process, not just KVM related ones: trace-cmd
record -e all qemu-system-x86_64 ...).

here is basically what's going on:

   qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
read

len 3 gpa 0xa val 0x10ff
 qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   
reason

KVM_EXIT_MMIO (6)
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
 qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   
reason

KVM_EXIT_MMIO (6)
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
 qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   
reason

KVM_EXIT_MMIO (6)
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
 qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   
reason

KVM_EXIT_MMIO (6)

it's doing that forever. this is tracing the kvm module. doing the
qemu-system-x86_64 trace is a bit complicated, but
maybe this is already sufficient. otherwise i will of course gather 
this

info as well.

That's only tracing KVM event, and it's tracing when things went wrong
already. We may need a full trace (-e all) specifically for the period
when this pattern above started.

i will do that. maybe i should explain that the vcpu is executing
garbage when the above starts. it's basically booting from an empty 
harddisk.


if i understand correctly qemu-kvm loops in kvm_cpu_exec(CPUState *env);

maybe the time to handle the monitor/qmp connection is just too short.
if i further understand correctly, it can only handle monitor connections
while qemu-kvm is executing kvm_vcpu_ioctl(env, KVM_RUN, 0); or am i
wrong here? the time spent in this state might be rather short.

my concern is not that the machine hangs, just that the hypervisor is 
unresponsive
and it's impossible to reset or quit gracefully. the only way to get 
the hypervisor

ended is via SIGKILL.

thanks
peter




Re: [PATCH v2 1/6] kvm: Pass kvm_irqfd to functions

2012-06-28 Thread Michael S. Tsirkin
On Thu, Jun 28, 2012 at 11:03:16AM +0200, Cornelia Huck wrote:
> On Thu, 28 Jun 2012 11:38:57 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Jun 27, 2012 at 04:24:30PM +0200, Cornelia Huck wrote:
> > > On Tue, 26 Jun 2012 23:09:04 -0600
> > > Alex Williamson  wrote:
> > > 
> > > > Prune this down to just the struct kvm_irqfd so we can avoid
> > > > changing function definition for every flag or field we use.
> > > > 
> > > > Signed-off-by: Alex Williamson 
> > > 
> > > I'm currently trying to find a way to make irqfd workable for s390
> > > which will likely include using a new field in kvm_irqfd, so I'd like
> > > to have this change (and I also think it makes the code nicer to read).
> > > So:
> > > 
> > > Acked-by: Cornelia Huck 
> > 
> > Unfortunately it looks like we are not sanitizing kvm_irqfd
> > at all so we won't be able to use the padding :(
> > We'll need a new ioctl instead.
> > 
> 
> How about something like this as parameter for the new ioctl?
> 
> struct kvm_irqfd2 {
>   __u32 fd;
>   __u32 flags;  /* for things like deassign */
>   __u64 type;   /* determines the payload */
>   union {
>   /* type traditional */
>   struct {
>   __u32 gsi;
>   } trad;
>   /* type s390 */
>   struct {
>   __u32 int_type;
>   __u32 parm;
>   __u64 parm64;
>   } s390;
>   __u8 pad[20];
>   };
> }
> 
> This could be combined with an arch or a per-kvm callback to keep the
> generic code clean of architecture dependencies.
> 
> Cornelia

Looks a bit weird - shouldn't all this be part of gsi routing?
But no idea really, I don't see the big picture here.


Re: race between kvm-kmod-3.0 and kvm-kmod-3.3 // was: race condition in qemu-kvm-1.0.1

2012-06-28 Thread Peter Lieven

On 28.06.2012 11:21, Jan Kiszka wrote:

On 2012-06-28 11:11, Peter Lieven wrote:

On 27.06.2012 18:54, Jan Kiszka wrote:

On 2012-06-27 17:39, Peter Lieven wrote:

Hi all,

i debugged this further and found out that kvm-kmod-3.0 is working with
qemu-kvm-1.0.1 while kvm-kmod-3.3 and kvm-kmod-3.4 are not. What is
working as well is kvm-kmod-3.4 with an old userspace (qemu-kvm-0.13.0).
Has anyone a clue which new KVM feature could cause this if a vcpu is in
an infinite loop?

Before accusing kvm-kmod ;), can you check if the effect is visible with
an original Linux 3.3.x or 3.4.x kernel as well?

sorry, i should have been more specific. maybe I also misunderstood sth.
I was believing that kvm-kmod-3.0 is basically what is in vanilla kernel
3.0. If I use the ubuntu kernel from ubuntu oneiric (3.0.0) it works, if
I use
a self-compiled kvm-kmod-3.3/3.4 with that kernel it doesn't.
however, maybe we don't have to dig too deep - see below.

kvm-kmod wraps and patches things to make the kvm code from 3.3/3.4
working on an older kernel. This step may introduce bugs of its own.
Therefore my suggestion to use a "real" 3.x kernel to exclude that risk
first of all.


Then, bisecting the change in qemu-kvm that apparently resolved the
issue would be interesting.

If we have to dig deeper, tracing [1] the lockup would likely be helpful
(all events of the qemu process, not just KVM related ones: trace-cmd
record -e all qemu-system-x86_64 ...).

here is basically what's going on:

   qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio read
len 3 gpa 0xa val 0x10ff
 qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
 qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
 qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
read len 3 gpa 0xa val 0x10ff
 qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
0xa gpa 0xa Read GPA
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
unsatisfied-read len 1 gpa 0xa val 0x0
 qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
KVM_EXIT_MMIO (6)

it's doing that forever. this is tracing the kvm module. doing the
qemu-system-x86_64 trace is a bit complicated, but
maybe this is already sufficient. otherwise i will of course gather this
info as well.

That's only tracing KVM event, and it's tracing when things went wrong
already. We may need a full trace (-e all) specifically for the period
when this pattern above started.

i will do that. maybe i should explain that the vcpu is executing
garbage when the above starts. it's basically booting from an empty 
harddisk.


if i understand correctly qemu-kvm loops in kvm_cpu_exec(CPUState *env);

maybe the time to handle the monitor/qmp connection is just too short.
if i further understand correctly, it can only handle monitor connections
while qemu-kvm is executing kvm_vcpu_ioctl(env, KVM_RUN, 0); or am i
wrong here? the time spent in this state might be rather short.

my concern is not that the machine hangs, just that the hypervisor is 
unresponsive
and it's impossible to reset or quit gracefully. the only way to get the 
hypervisor

ended is via SIGKILL.

thanks
peter


Re: race between kvm-kmod-3.0 and kvm-kmod-3.3 // was: race condition in qemu-kvm-1.0.1

2012-06-28 Thread Jan Kiszka
On 2012-06-28 11:11, Peter Lieven wrote:
> On 27.06.2012 18:54, Jan Kiszka wrote:
>> On 2012-06-27 17:39, Peter Lieven wrote:
>>> Hi all,
>>>
>>> i debugged this further and found out that kvm-kmod-3.0 is working with
>>> qemu-kvm-1.0.1 while kvm-kmod-3.3 and kvm-kmod-3.4 are not. What is
>>> working as well is kvm-kmod-3.4 with an old userspace (qemu-kvm-0.13.0).
>>> Has anyone a clue which new KVM feature could cause this if a vcpu is in
>>> an infinite loop?
>> Before accusing kvm-kmod ;), can you check if the effect is visible with
>> an original Linux 3.3.x or 3.4.x kernel as well?
> sorry, i should have been more specific. maybe I also misunderstood sth.
> I was believing that kvm-kmod-3.0 is basically what is in vanilla kernel
> 3.0. If I use the ubuntu kernel from ubuntu oneiric (3.0.0) it works, if
> I use
> a self-compiled kvm-kmod-3.3/3.4 with that kernel it doesn't.
> however, maybe we don't have to dig too deep - see below.

kvm-kmod wraps and patches things to make the kvm code from 3.3/3.4
working on an older kernel. This step may introduce bugs of its own.
Therefore my suggestion to use a "real" 3.x kernel to exclude that risk
first of all.

>> Then, bisecting the change in qemu-kvm that apparently resolved the
>> issue would be interesting.
>>
>> If we have to dig deeper, tracing [1] the lockup would likely be helpful
>> (all events of the qemu process, not just KVM related ones: trace-cmd
>> record -e all qemu-system-x86_64 ...).
> here is basically what's going on:
> 
>   qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio read
> len 3 gpa 0xa val 0x10ff
> qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
> 0xa gpa 0xa Read GPA
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
> unsatisfied-read len 1 gpa 0xa val 0x0
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
> KVM_EXIT_MMIO (6)
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
> read len 3 gpa 0xa val 0x10ff
> qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
> 0xa gpa 0xa Read GPA
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
> unsatisfied-read len 1 gpa 0xa val 0x0
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
> KVM_EXIT_MMIO (6)
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
> read len 3 gpa 0xa val 0x10ff
> qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
> 0xa gpa 0xa Read GPA
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
> unsatisfied-read len 1 gpa 0xa val 0x0
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
> KVM_EXIT_MMIO (6)
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
> read len 3 gpa 0xa val 0x10ff
> qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva
> 0xa gpa 0xa Read GPA
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio
> unsatisfied-read len 1 gpa 0xa val 0x0
> qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason
> KVM_EXIT_MMIO (6)
> 
> it's doing that forever. this is tracing the kvm module. doing the
> qemu-system-x86_64 trace is a bit complicated, but
> maybe this is already sufficient. otherwise i will of course gather this
> info as well.

That's only tracing KVM event, and it's tracing when things went wrong
already. We may need a full trace (-e all) specifically for the period
when this pattern above started.

However, let's focus on bisecting at kernel and qemu-kvm level first.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
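Spelled out, the capture-and-inspect cycle suggested above is just two
commands; the qemu arguments are placeholders for the real command line:

    trace-cmd record -e all qemu-system-x86_64 [usual VM arguments]
    trace-cmd report trace.dat | less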



Re: race between kvm-kmod-3.0 and kvm-kmod-3.3 // was: race condition in qemu-kvm-1.0.1

2012-06-28 Thread Peter Lieven

On 27.06.2012 18:54, Jan Kiszka wrote:

On 2012-06-27 17:39, Peter Lieven wrote:

Hi all,

i debugged this further and found out that kvm-kmod-3.0 is working with
qemu-kvm-1.0.1 while kvm-kmod-3.3 and kvm-kmod-3.4 are not. What is
working as well is kvm-kmod-3.4 with an old userspace (qemu-kvm-0.13.0).
Has anyone a clue which new KVM feature could cause this if a vcpu is in
an infinite loop?

Before accusing kvm-kmod ;), can you check if the effect is visible with
an original Linux 3.3.x or 3.4.x kernel as well?

sorry, i should have been more specific. maybe I also misunderstood sth.
I was believing that kvm-kmod-3.0 is basically what is in vanilla kernel
3.0. If I use the ubuntu kernel from ubuntu oneiric (3.0.0) it works, if 
I use

a self-compiled kvm-kmod-3.3/3.4 with that kernel it doesn't.
however, maybe we don't have to dig too deep - see below.

Then, bisecting the change in qemu-kvm that apparently resolved the
issue would be interesting.

If we have to dig deeper, tracing [1] the lockup would likely be helpful
(all events of the qemu process, not just KVM related ones: trace-cmd
record -e all qemu-system-x86_64 ...).

here is basically what's going on:

  qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
read len 3 gpa 0xa val 0x10ff
qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva 
0xa gpa 0xa Read GPA
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
unsatisfied-read len 1 gpa 0xa val 0x0
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason 
KVM_EXIT_MMIO (6)
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
read len 3 gpa 0xa val 0x10ff
qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva 
0xa gpa 0xa Read GPA
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
unsatisfied-read len 1 gpa 0xa val 0x0
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason 
KVM_EXIT_MMIO (6)
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
read len 3 gpa 0xa val 0x10ff
qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva 
0xa gpa 0xa Read GPA
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
unsatisfied-read len 1 gpa 0xa val 0x0
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason 
KVM_EXIT_MMIO (6)
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
read len 3 gpa 0xa val 0x10ff
qemu-kvm-1.0-2506  [010] 60996.908000: vcpu_match_mmio:  gva 
0xa gpa 0xa Read GPA
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_mmio: mmio 
unsatisfied-read len 1 gpa 0xa val 0x0
qemu-kvm-1.0-2506  [010] 60996.908000: kvm_userspace_exit:   reason 
KVM_EXIT_MMIO (6)


it's doing that forever. this is tracing the kvm module. doing the 
qemu-system-x86_64 trace is a bit complicated, but
maybe this is already sufficient. otherwise i will of course gather this 
info as well.


thanks
peter



Jan

[1] http://www.linux-kvm.org/page/Tracing


---

Hi,

we recently came across multiple VMs hitting a race and stopping working. It
seems to happen when the system is at 100% cpu.
One way to reproduce this is:
qemu-kvm-1.0.1 with vnc-thread enabled

cmdline (or similar):
/usr/bin/qemu-kvm-1.0.1 -net
tap,vlan=141,script=no,downscript=no,ifname=tap15,vnet_hdr -net
nic,vlan=141,model=virtio,macaddr=52:54:00:ff:00:f7 -drive
format=host_device,file=/dev/mapper/iqn.2001-05.com.equallogic:0-8a0906-efdf4e007-16700198c7f4fead-02-debug-race-hd01,if=virtio,cache=none,aio=native
-m 2048 -smp 2,sockets=1,cores=2,threads=1 -monitor
tcp:0:4026,server,nowait -vnc :26 -qmp tcp:0:3026,server,nowait -name
02-debug-race -boot order=dc,menu=off -cdrom
/home/kvm/cdrom//root/ubuntu-12.04-server-amd64.iso -k de -pidfile
/var/run/qemu/vm-221.pid -mem-prealloc -cpu
host,+x2apic,model_id=Intel(R) Xeon(R) CPU   L5640  @
2.27GHz,-tsc -rtc base=utc -usb -usbdevice tablet -no-hpet -vga cirrus

it is important that the attached virtio image contains only zeroes. if
the system boots from cd, select boot from first hard disk.
the hypervisor then hangs at 100% cpu and neither the monitor nor qmp is
responsive anymore.

i have also seen customers reporting this when a VM is shut down.

if this is connected to the threaded vnc server, it might be important
whether a vnc client is connected at this time.

debug backtrace attached.

Thanks,
Peter

--

(gdb) file /usr/bin/qemu-kvm-1.0.1
Reading symbols from /usr/bin/qemu-kvm-1.0.1...done.
(gdb) attach 5145
Attaching to program: /usr/bin/qemu-kvm-1.0.1, process 5145
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
[Thread debugging using libthread_db enabled]
[New Thread 0x7f54d08b9700 (LWP 5253)]
[New Thread 0x7f5552757700 (LWP 5152)]
[New Thread 0x7f5552f58700 (LWP 5151)]
0x7f5553c6b5a3 in select () from /lib/libc.so.6
(gdb) info threads

Re: [PATCH v2 1/6] kvm: Pass kvm_irqfd to functions

2012-06-28 Thread Cornelia Huck
On Thu, 28 Jun 2012 11:38:57 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Jun 27, 2012 at 04:24:30PM +0200, Cornelia Huck wrote:
> > On Tue, 26 Jun 2012 23:09:04 -0600
> > Alex Williamson  wrote:
> > 
> > > Prune this down to just the struct kvm_irqfd so we can avoid
> > > changing function definition for every flag or field we use.
> > > 
> > > Signed-off-by: Alex Williamson 
> > 
> > I'm currently trying to find a way to make irqfd workable for s390
> > which will likely include using a new field in kvm_irqfd, so I'd like
> > to have this change (and I also think it makes the code nicer to read).
> > So:
> > 
> > Acked-by: Cornelia Huck 
> 
> Unfortunately it looks like we are not sanitizing kvm_irqfd
> at all so we won't be able to use the padding :(
> We'll need a new ioctl instead.
> 

How about something like this as parameter for the new ioctl?

struct kvm_irqfd2 {
	__u32 fd;
	__u32 flags;	/* for things like deassign */
	__u64 type;	/* determines the payload */
	union {
		/* type traditional */
		struct {
			__u32 gsi;
		} trad;
		/* type s390 */
		struct {
			__u32 int_type;
			__u32 parm;
			__u64 parm64;
		} s390;
		__u8 pad[20];
	};
};

This could be combined with an arch or a per-kvm callback to keep the
generic code clean of architecture dependencies.
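
A minimal userspace sketch of how such an ioctl might be driven is below,
assuming a hypothetical KVM_IRQFD2 request number and 'type' encoding --
neither exists in the kernel, both are made up for illustration:

#include <stdint.h>
#include <string.h>
#include <linux/ioctl.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

/* Hypothetical request number; KVMIO is 0xAE in <linux/kvm.h>. */
#define KVM_IRQFD2 _IOW(0xAE, 0x78, struct kvm_irqfd2)

struct kvm_irqfd2 {
	uint32_t fd;
	uint32_t flags;
	uint64_t type;
	union {
		struct { uint32_t gsi; } trad;
		struct { uint32_t int_type; uint32_t parm; uint64_t parm64; } s390;
		uint8_t pad[20];
	};
};

/* Bind an eventfd to an s390 interrupt source and return it; the
 * caller signals the interrupt by write()ing to the returned fd. */
static int assign_s390_irqfd(int vm_fd, uint32_t int_type,
			     uint32_t parm, uint64_t parm64)
{
	struct kvm_irqfd2 args;
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;
	memset(&args, 0, sizeof(args));
	args.fd = efd;
	args.type = 1;			/* hypothetical: "s390 payload" */
	args.s390.int_type = int_type;
	args.s390.parm = parm;
	args.s390.parm64 = parm64;
	if (ioctl(vm_fd, KVM_IRQFD2, &args) < 0)
		return -1;		/* close(efd) omitted for brevity */
	return efd;
}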

Cornelia



Re: [PATCH v2 4/6] kvm: Extend irqfd to support level interrupts

2012-06-28 Thread Gleb Natapov
On Thu, Jun 28, 2012 at 11:48:40AM +0300, Michael S. Tsirkin wrote:
> On Thu, Jun 28, 2012 at 11:46:11AM +0300, Gleb Natapov wrote:
> > On Thu, Jun 28, 2012 at 11:41:05AM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Jun 28, 2012 at 11:35:41AM +0300, Gleb Natapov wrote:
> > > > On Thu, Jun 28, 2012 at 11:34:35AM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Jun 28, 2012 at 09:34:31AM +0300, Gleb Natapov wrote:
> > > > > > On Thu, Jun 28, 2012 at 01:31:29AM +0300, Michael S. Tsirkin wrote:
> > > > > > > On Wed, Jun 27, 2012 at 04:04:18PM -0600, Alex Williamson wrote:
> > > > > > > > On Wed, 2012-06-27 at 18:26 +0300, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Jun 26, 2012 at 11:09:46PM -0600, Alex Williamson 
> > > > > > > > > wrote:
> > > > > > > > > > @@ -71,6 +130,14 @@ irqfd_inject(struct work_struct *work)
> > > > > > > > > > kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 
> > > > > > > > > > irqfd->gsi, 0);
> > > > > > > > > >  }
> > > > > > > > > >  
> > > > > > > > > > +static void
> > > > > > > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > > > > > > +{
> > > > > > > > > > +   struct _irqfd *irqfd = container_of(work, struct 
> > > > > > > > > > _irqfd, inject);
> > > > > > > > > > +
> > > > > > > > > > +   kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 
> > > > > > > > > > 1);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > >  /*
> > > > > > > > > >   * Race-free decouple logic (ordering is critical)
> > > > > > > > > >   */
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Why is it safe to ignore return value here?
> > > > > > > > > needs a comment.
> > > > > > > > 
> > > > > > > > Well, it seems like you and Gleb came to the conclusion that 
> > > > > > > > it's safe,
> > > > > > > > > but I can't really follow from the list thread.  Can you explain 
> > > > > > > > and I'll
> > > > > > > > add a comment?  Thanks,
> > > > > > > > 
> > > > > > > > Alex
> > > > > > > 
> > > > > > > We merely talked about edge interrupts.
> > > > > > > 
> > > > > > In fact it would have been nice to return -EBUSY when write() to 
> > > > > > level
> > > > > > irqfd is coalesced.
> > > > > 
> > > > > Possibly nice but not really practical.
> > > > > 
> > > > What do you mean by that? Impossible to implement or not useful?
> > > 
> > > Impossible to implement and also does not match normal eventfd
> > > semantics.
> > > 
> > Hmm, I remember we discussed using irqfd for level triggered interrupts ~2
> > years ago and came to the conclusion that eventfd is a bad fit for it;
> > what was true then is true now. Not being able to detect coalescing will
> > make irqfd level interrupts inferior to the IRQ_LINE ioctl.
> 
> This aspect isn't different from MSI though, is it?
> 
For MSI delivered through irqfd it isn't, so yes, level triggered
interrupts are no more broken in this regard. You can detect coalescing if
you deliver MSI through the ioctl.
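
For reference, a minimal sketch of that ioctl path: KVM_IRQ_LINE_STATUS
reuses the union in struct kvm_irq_level to pass the delivery status back,
which a plain eventfd write has no way to return. Error handling is elided
and the status convention in the comment is approximate.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Inject via ioctl and learn whether the interrupt was coalesced. */
static int inject_and_check(int vm_fd, unsigned int gsi, int level)
{
	struct kvm_irq_level irq = {
		.irq   = gsi,
		.level = level,
	};

	if (ioctl(vm_fd, KVM_IRQ_LINE_STATUS, &irq) < 0)
		return -1;
	/* on return the union holds the status: > 0 roughly means
	 * delivered, 0 means the interrupt was coalesced */
	return irq.status;
}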

--
Gleb.


Re: [PATCH v2 4/6] kvm: Extend irqfd to support level interrupts

2012-06-28 Thread Michael S. Tsirkin
On Thu, Jun 28, 2012 at 11:46:11AM +0300, Gleb Natapov wrote:
> On Thu, Jun 28, 2012 at 11:41:05AM +0300, Michael S. Tsirkin wrote:
> > On Thu, Jun 28, 2012 at 11:35:41AM +0300, Gleb Natapov wrote:
> > > On Thu, Jun 28, 2012 at 11:34:35AM +0300, Michael S. Tsirkin wrote:
> > > > On Thu, Jun 28, 2012 at 09:34:31AM +0300, Gleb Natapov wrote:
> > > > > On Thu, Jun 28, 2012 at 01:31:29AM +0300, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jun 27, 2012 at 04:04:18PM -0600, Alex Williamson wrote:
> > > > > > > On Wed, 2012-06-27 at 18:26 +0300, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jun 26, 2012 at 11:09:46PM -0600, Alex Williamson wrote:
> > > > > > > > > @@ -71,6 +130,14 @@ irqfd_inject(struct work_struct *work)
> > > > > > > > >   kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 
> > > > > > > > > irqfd->gsi, 0);
> > > > > > > > >  }
> > > > > > > > >  
> > > > > > > > > +static void
> > > > > > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > > > > > +{
> > > > > > > > > + struct _irqfd *irqfd = container_of(work, struct 
> > > > > > > > > _irqfd, inject);
> > > > > > > > > +
> > > > > > > > > + kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 
> > > > > > > > > 1);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > >  /*
> > > > > > > > >   * Race-free decouple logic (ordering is critical)
> > > > > > > > >   */
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Why is it safe to ignore return value here?
> > > > > > > > needs a comment.
> > > > > > > 
> > > > > > > Well, it seems like you and Gleb came to the conclusion that it's 
> > > > > > > safe,
> > > > > > > but I can't really follow from the list thread.  Can you explain 
> > > > > > > and I'll
> > > > > > > add a comment?  Thanks,
> > > > > > > 
> > > > > > > Alex
> > > > > > 
> > > > > > We merely talked about edge interrupts.
> > > > > > 
> > > > > In fact it would have been nice to return -EBUSY when write() to level
> > > > > irqfd is coalesced.
> > > > 
> > > > Possibly nice but not really practical.
> > > > 
> > > What do you mean by that? Impossible to implement or not useful?
> > 
> > Impossible to implement and also does not match normal eventfd
> > semantics.
> > 
> Hmm, I remember we discussed using irqfd for level triggered interrupts ~2
> years ago and came to the conclusion that eventfd is a bad fit for it;
> what was true then is true now. Not being able to detect coalescing will
> make irqfd level interrupts inferior to the IRQ_LINE ioctl.

This aspect isn't different from MSI though, is it?

> --
>   Gleb.


Re: [PATCH v2 4/6] kvm: Extend irqfd to support level interrupts

2012-06-28 Thread Gleb Natapov
On Thu, Jun 28, 2012 at 11:41:05AM +0300, Michael S. Tsirkin wrote:
> On Thu, Jun 28, 2012 at 11:35:41AM +0300, Gleb Natapov wrote:
> > On Thu, Jun 28, 2012 at 11:34:35AM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Jun 28, 2012 at 09:34:31AM +0300, Gleb Natapov wrote:
> > > > On Thu, Jun 28, 2012 at 01:31:29AM +0300, Michael S. Tsirkin wrote:
> > > > > On Wed, Jun 27, 2012 at 04:04:18PM -0600, Alex Williamson wrote:
> > > > > > On Wed, 2012-06-27 at 18:26 +0300, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Jun 26, 2012 at 11:09:46PM -0600, Alex Williamson wrote:
> > > > > > > > @@ -71,6 +130,14 @@ irqfd_inject(struct work_struct *work)
> > > > > > > > kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, 
> > > > > > > > irqfd->gsi, 0);
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +static void
> > > > > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > > > > +{
> > > > > > > > +   struct _irqfd *irqfd = container_of(work, struct 
> > > > > > > > _irqfd, inject);
> > > > > > > > +
> > > > > > > > +   kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 
> > > > > > > > 1);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >  /*
> > > > > > > >   * Race-free decouple logic (ordering is critical)
> > > > > > > >   */
> > > > > > > 
> > > > > > > 
> > > > > > > Why is it safe to ignore return value here?
> > > > > > > needs a comment.
> > > > > > 
> > > > > > Well, it seems like you and Gleb came to the conclusion that it's 
> > > > > > safe,
> > > > > > but I can't really follow from the list thread.  Can you explain and 
> > > > > > I'll
> > > > > > add a comment?  Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > We merely talked about edge interrupts.
> > > > > 
> > > > In fact it would have been nice to return -EBUSY when write() to level
> > > > irqfd is coalesced.
> > > 
> > > Possibly nice but not really practical.
> > > 
> > What do you mean by that? Impossible to implement or not useful?
> 
> Impossible to implement and also does not match normal eventfd
> semantics.
> 
Hmm, I remember we discussed using irqfd for level triggered interrupts ~2
years ago and came to the conclusion that eventfd is a bad fit for it;
what was true then is true now. Not being able to detect coalescing will
make irqfd level interrupts inferior to the IRQ_LINE ioctl.

--
Gleb.


Re: [Qemu-devel] [PATCH] kvm: First step to push iothread lock out of inner run loop

2012-06-28 Thread Stefan Hajnoczi
On Wed, Jun 27, 2012 at 12:19 PM, Marcelo Tosatti  wrote:
> On Wed, Jun 27, 2012 at 08:41:49AM +0100, Stefan Hajnoczi wrote:
>> On Wed, Jun 27, 2012 at 8:39 AM, Stefan Hajnoczi  wrote:
>> > On Tue, Jun 26, 2012 at 8:34 PM, Marcelo Tosatti  
>> > wrote:
>> >> On Sat, Jun 23, 2012 at 12:55:49AM +0200, Jan Kiszka wrote:
>> >> net.txt
>> >> =======
>> >>
>> >>
>> >> iothread flow
>> >> =
>> >>
>> >> 1) Skip-work-if-device-locked
>> >>
>> >> select(tap fd ready)
>> >> tap_send
>> >>    if (trylock(TAPState->NetClientState->dev))
>> >>        proceed;
>> >>    else
>> >>        continue; (data remains in queue)
>> >>    tap_read_packet
>> >>    qemu_send_packet_async
>> >>
>> >> DRAWBACK: lock intensive.
>> >> DRAWBACK: VCPU threads can starve IOTHREAD (can be fixed with
>> >> a scheme such as trylock() marks a flag saying "iothread wants lock".
>> >
>> > One alternative is to say the lock cannot be held across system calls
>> > or other long-running operations (similar to interrupt handler
>> > spinlocks in the kernel).  But QEMU threads are still at the mercy of
>> > the scheduler or page faults, so perhaps this is not a good idea
>> > because waiting on the lock could tie up the iothread.
>> >
>> >> 2) Queue-work-to-vcpu-context
>> >>
>> >> select(tap fd ready)
>> >> tap_send
>> >>    if (trylock(TAPState->NetClientState->dev)) {
>> >>        proceed;
>> >>    } else {
>> >>        AddToDeviceWorkQueue(work);
>> >>    }
>> >>    tap_read_packet
>> >>    qemu_send_packet_async
>> >>
>> >> DRAWBACK: lock intensive
>> >> DRAWBACK: cache effects
>> >
>> > This almost suggests we should invert packet reception for NICs.  tap
>> > fd ready should add a work item and the guest device calls into
>> > net/tap.c to pull out any packets from the work function:
>> >
>> > tap_send
>> >    dev_add_work(work);
>> >
>> > virtio_net_rx_work_fn
>> >    while ((req = virtqueue_pop(rx_queue))) {
>> >        if (!net_receive_packet(netclient, req->buf)) {
>> >            break; /* no more packets available */
>> >        }
>> >        /* ...mark req complete and push onto virtqueue... */
>> >    }
>> >    virtqueue_notify(rx_queue);
>> >
>> > The idea is to avoid copies on the rx path and make the locking simple
>> > by always deferring to a work function (which can be invoked
>> > immediately if the device isn't locked).
>>
>> I just realized this approach bypasses net/queue.c.  I think it works
>> really well for the peer model (no "VLANs").  Basically flow control
>> is dictated by the peer (receiver) and we don't need to use NetQueue
>> anymore.  Whether this works elegantly for slirp and other backends,
>> I'm not sure yet.
>>
>> Stefan
>
> Right. The advantage is that only backends where performance matters need
> to be converted (and likewise only the net hw drivers where performance
> matters).
>
> Apparently you guys have very different ideas on the higher level
> model here. It would be good to agree on one structure (including
> modifications to the one above) to follow and share the work in that
> direction.

"You guys" == Ping Fan + Jan + Anthony + Avi?  I'm not working
directly on QBL removal myself.

Stefan
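
As a side note for readers skimming the flows quoted above, here is a
small compilable sketch of the trylock-or-queue idea in plain pthreads;
the device struct and the queueing helper are stand-ins invented for
illustration, not QEMU APIs.

#include <pthread.h>

typedef void (*work_fn)(void *opaque);

struct device {
	pthread_mutex_t lock;
	/* deferred-work queue elided */
};

/* Stand-in for a real queueing primitive; whoever holds the device
 * lock would drain the queue before unlocking. */
static void add_to_device_work_queue(struct device *d, work_fn fn,
				     void *opaque)
{
	(void)d; (void)fn; (void)opaque;
}

/* Skip-work-if-device-locked: run inline on the fast path, otherwise
 * defer instead of blocking the iothread on the lock. */
static void run_or_defer(struct device *d, work_fn fn, void *opaque)
{
	if (pthread_mutex_trylock(&d->lock) == 0) {
		fn(opaque);
		pthread_mutex_unlock(&d->lock);
	} else {
		add_to_device_work_queue(d, fn, opaque);
	}
}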


Re: [PATCH v2 0/6] kvm: level triggered irqfd support

2012-06-28 Thread Michael S. Tsirkin
On Wed, Jun 27, 2012 at 08:33:40AM -0600, Alex Williamson wrote:
> On Wed, 2012-06-27 at 12:58 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jun 26, 2012 at 11:08:52PM -0600, Alex Williamson wrote:
> > > Ok, let's see how this flies.  I actually quite like this, so be
> > > gentle tearing it apart ;)
> > > 
> > > I just couldn't bring myself to contort KVM_IRQFD into something
> > > that either sets up an irqfd or specifies a nearly unrelated EOI
> > > eventfd.  The solution I've come up with, that also avoids exposing
> > > irq_source_ids to userspace, is to work through the irqfd.  If we
> > > setup a level irqfd, we can optionally associate an eoifd with
> > > the same irq_source_id, by passing the irqfd.  To do this, we just
> > > need to create a new reference counted object for the source ID
> > > so we don't run into problems ordering release.  This means we
> > > end up with a KVM_EOIFD ioctl that has both general usefulness and
> > > can still tie into an irqfd.
> > 
> > I like this API.
> 
> Yay!  I'll work on the bugs you've spotted.
> 
> > > In patch 6/6 I also include an alternate de-assert mechanism via an
> > > irqfd with the opposite polarity.  I don't currently use this, but
> > > it's pretty trivial and at least available in archives now.
> > 
> > The nasty bit about that is that if you assert twice then
> > deassert once it's not clear what should happen.
> > But yea, it does not hurt to put them in the archives.
> 
> It's no different than KVM_IRQ_LINE in that sense.  It's not an
> accumulator, it's a set state.  Thanks,
> 
> Alex

Yes, but eventfd semantics are those of an accumulator, so this
does not match what a normal eventfd does very well.
Anyway, as it's not required now we can argue about it
when we get to it.
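
For what it's worth, the accumulator semantics are easy to demonstrate in
isolation: two writes before a read collapse into a single counter value,
so a per-write result such as -EBUSY has nowhere to go.

#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	uint64_t v = 1, out = 0;
	int fd = eventfd(0, 0);

	if (write(fd, &v, sizeof(v)) != sizeof(v) ||
	    write(fd, &v, sizeof(v)) != sizeof(v))	/* just bumps the counter */
		return 1;
	if (read(fd, &out, sizeof(out)) != sizeof(out))	/* drains and resets */
		return 1;
	printf("counter drained: %llu\n", (unsigned long long)out); /* prints 2 */
	close(fd);
	return 0;
}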

-- 
MST


Re: [PATCH v2 4/6] kvm: Extend irqfd to support level interrupts

2012-06-28 Thread Michael S. Tsirkin
On Thu, Jun 28, 2012 at 11:35:41AM +0300, Gleb Natapov wrote:
> On Thu, Jun 28, 2012 at 11:34:35AM +0300, Michael S. Tsirkin wrote:
> > On Thu, Jun 28, 2012 at 09:34:31AM +0300, Gleb Natapov wrote:
> > > On Thu, Jun 28, 2012 at 01:31:29AM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 27, 2012 at 04:04:18PM -0600, Alex Williamson wrote:
> > > > > On Wed, 2012-06-27 at 18:26 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jun 26, 2012 at 11:09:46PM -0600, Alex Williamson wrote:
> > > > > > > @@ -71,6 +130,14 @@ irqfd_inject(struct work_struct *work)
> > > > > > >   kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > > > > >  }
> > > > > > >  
> > > > > > > +static void
> > > > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > > > +{
> > > > > > > + struct _irqfd *irqfd = container_of(work, struct _irqfd, 
> > > > > > > inject);
> > > > > > > +
> > > > > > > + kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > > > > +}
> > > > > > > +
> > > > > > >  /*
> > > > > > >   * Race-free decouple logic (ordering is critical)
> > > > > > >   */
> > > > > > 
> > > > > > 
> > > > > > Why is it safe to ignore return value here?
> > > > > > needs a comment.
> > > > > 
> > > > > Well, it seems like you and Gleb came to the conclusion that it's 
> > > > > safe,
> > > > > but I can't really follow from the list thread.  Can you explain and 
> > > > > I'll
> > > > > add a comment?  Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > We merely talked about edge interrupts.
> > > > 
> > > In fact it would have been nice to return -EBUSY when write() to level
> > > irqfd is coalesced.
> > 
> > Possibly nice but not really practical.
> > 
> What do you mean by that? Impossible to implement or not useful?

Impossible to implement and also does not match normal eventfd
semantics.

> --
>   Gleb.


Re: [PATCH v2 1/6] kvm: Pass kvm_irqfd to functions

2012-06-28 Thread Michael S. Tsirkin
On Wed, Jun 27, 2012 at 04:24:30PM +0200, Cornelia Huck wrote:
> On Tue, 26 Jun 2012 23:09:04 -0600
> Alex Williamson  wrote:
> 
> > Prune this down to just the struct kvm_irqfd so we can avoid
> > changing function definition for every flag or field we use.
> > 
> > Signed-off-by: Alex Williamson 
> 
> I'm currently trying to find a way to make irqfd workable for s390
> which will likely include using a new field in kvm_irqfd, so I'd like
> to have this change (and I also think it makes the code nicer to read).
> So:
> 
> Acked-by: Cornelia Huck 

Unfortunately it looks like we are not sanitizing kvm_irqfd
at all so we won't be able to use the padding :(
We'll need a new ioctl instead.

-- 
MST


Re: [PATCH v2 4/6] kvm: Extend irqfd to support level interrupts

2012-06-28 Thread Gleb Natapov
On Thu, Jun 28, 2012 at 11:34:35AM +0300, Michael S. Tsirkin wrote:
> On Thu, Jun 28, 2012 at 09:34:31AM +0300, Gleb Natapov wrote:
> > On Thu, Jun 28, 2012 at 01:31:29AM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jun 27, 2012 at 04:04:18PM -0600, Alex Williamson wrote:
> > > > On Wed, 2012-06-27 at 18:26 +0300, Michael S. Tsirkin wrote:
> > > > > On Tue, Jun 26, 2012 at 11:09:46PM -0600, Alex Williamson wrote:
> > > > > > @@ -71,6 +130,14 @@ irqfd_inject(struct work_struct *work)
> > > > > > kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > > > >  }
> > > > > >  
> > > > > > +static void
> > > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > > +{
> > > > > > +   struct _irqfd *irqfd = container_of(work, struct _irqfd, 
> > > > > > inject);
> > > > > > +
> > > > > > +   kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > > > +}
> > > > > > +
> > > > > >  /*
> > > > > >   * Race-free decouple logic (ordering is critical)
> > > > > >   */
> > > > > 
> > > > > 
> > > > > Why is it safe to ignore return value here?
> > > > > needs a comment.
> > > > 
> > > > Well, it seems like you and Gleb came to the conclusion that it's safe,
> > > > but I can't really follow from the list thread.  Can you explain and I'll
> > > > add a comment?  Thanks,
> > > > 
> > > > Alex
> > > 
> > > We merely talked about edge interrupts.
> > > 
> > In fact it would have been nice to return -EBUSY when write() to level
> > irqfd is coalesced.
> 
> Possibly nice but not really practical.
> 
What do you mean by that? Impossible to implement or not useful?

--
Gleb.


Re: [PATCH v2 4/6] kvm: Extend irqfd to support level interrupts

2012-06-28 Thread Michael S. Tsirkin
On Thu, Jun 28, 2012 at 09:34:31AM +0300, Gleb Natapov wrote:
> On Thu, Jun 28, 2012 at 01:31:29AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 27, 2012 at 04:04:18PM -0600, Alex Williamson wrote:
> > > On Wed, 2012-06-27 at 18:26 +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Jun 26, 2012 at 11:09:46PM -0600, Alex Williamson wrote:
> > > > > @@ -71,6 +130,14 @@ irqfd_inject(struct work_struct *work)
> > > > >   kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID, irqfd->gsi, 0);
> > > > >  }
> > > > >  
> > > > > +static void
> > > > > +irqfd_inject_level(struct work_struct *work)
> > > > > +{
> > > > > + struct _irqfd *irqfd = container_of(work, struct _irqfd, 
> > > > > inject);
> > > > > +
> > > > > + kvm_set_irq(irqfd->kvm, irqfd->source->id, irqfd->gsi, 1);
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Race-free decouple logic (ordering is critical)
> > > > >   */
> > > > 
> > > > 
> > > > Why is it safe to ignore return value here?
> > > > needs a comment.
> > > 
> > > Well, it seems like you and Gleb came to the conclusion that it's safe,
> > > but I can't really follow from the list thread.  Can you explain and I'll
> > > add a comment?  Thanks,
> > > 
> > > Alex
> > 
> > We merely talked about edge interrupts.
> > 
> In fact it would have been nice to return -EBUSY when write() to level
> irqfd is coalesced.

Possibly nice but not really practical.

> --
>   Gleb.


Re: [PATCH v2 4/6] kvm: Extend irqfd to support level interrupts

2012-06-28 Thread Michael S. Tsirkin
On Wed, Jun 27, 2012 at 09:52:52PM -0600, Alex Williamson wrote:
> On Thu, 2012-06-28 at 01:28 +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 27, 2012 at 03:28:19PM -0600, Alex Williamson wrote:
> > > On Thu, 2012-06-28 at 00:14 +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 27, 2012 at 02:59:09PM -0600, Alex Williamson wrote:
> > > > > On Wed, 2012-06-27 at 12:51 +0300, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jun 26, 2012 at 11:09:46PM -0600, Alex Williamson wrote:
> > > > > > > In order to inject an interrupt from an external source using an
> > > > > > > irqfd, we need to allocate a new irq_source_id.  This allows us to
> > > > > > > assert and (later) de-assert an interrupt line independently from
> > > > > > > users of KVM_IRQ_LINE and avoid lost interrupts.
> > > > > > > 
> > > > > > > We also add what may appear like a bit of excessive infrastructure
> > > > > > > around an object for storing this irq_source_id.  However, notice
> > > > > > > that we only provide a way to assert the interrupt here.  A 
> > > > > > > follow-on
> > > > > > > interface will make use of the same irq_source_id to allow 
> > > > > > > de-assert.
> > > > > > > 
> > > > > > > Signed-off-by: Alex Williamson 
> > > > > > > ---
> > > > > > > 
> > > > > > >  Documentation/virtual/kvm/api.txt |5 ++
> > > > > > >  arch/x86/kvm/x86.c|1 
> > > > > > >  include/linux/kvm.h   |3 +
> > > > > > >  virt/kvm/eventfd.c|   95 
> > > > > > > +++--
> > > > > > >  4 files changed, 99 insertions(+), 5 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/Documentation/virtual/kvm/api.txt 
> > > > > > > b/Documentation/virtual/kvm/api.txt
> > > > > > > index ea9edce..b216709 100644
> > > > > > > --- a/Documentation/virtual/kvm/api.txt
> > > > > > > +++ b/Documentation/virtual/kvm/api.txt
> > > > > > > @@ -1981,6 +1981,11 @@ the guest using the specified gsi pin.  
> > > > > > > The irqfd is removed using
> > > > > > >  the KVM_IRQFD_FLAG_DEASSIGN flag, specifying both kvm_irqfd.fd
> > > > > > >  and kvm_irqfd.gsi.
> > > > > > >  
> > > > > > > +With KVM_IRQFD_FLAG_LEVEL KVM_IRQFD allocates a new IRQ source 
> > > > > > > ID for
> > > > > > > +the requested irqfd.  This is necessary to share level triggered
> > > > > > > +interrupts with those injected through KVM_IRQ_LINE.  IRQFDs 
> > > > > > > created
> > > > > > > +with KVM_IRQFD_FLAG_LEVEL must also set this flag when 
> > > > > > > +de-assigning.
> > > > > > > +KVM_IRQFD_FLAG_LEVEL support is indicated by KVM_CAP_IRQFD_LEVEL.
> > > > > > 
> > > > > > Note that if my patch removing auto-deassert gets accepted,
> > > > > > this is not needed at all: we can just look at the GSI
> > > > > > to see if it's level or edge.
> > > > > 
> > > > > I'm not sure this is a good idea.  I know from vfio that I'm 
> > > > > injecting a
> > > > > level interrupt regardless of how the guest has the pic/ioapic
> > > > > programmed at the time I'm calling this ioctl.  Peeking across address
> > > > > spaces to get to the right pin on the right pic/ioapic and see how 
> > > > > it's
> > > > > currently programmed seems fragile.  Thanks,
> > > > > 
> > > > > Alex
> > > > 
> > > > Fragile? If you set eventfd as LEVEL but GSI is really edge then
> > > > it all explodes, right? So why give users the option to shoot
> > > > themselves in the foot?
> > > 
> > > If the guest has the ioapic rte set to edge at the time I call KVM_IRQFD
> > > to register my level interrupt then it all explodes, right?  I'd rather
> > > let the user shoot themselves than play Russian roulette with the guest.
> > > Am I misunderstanding what you mean by looking that the GSI to see if
> > > it's level or edge?
> > 
> > Not sure.
> > I simply mean this: if eventfd is bound to irqfd, set level from irqfd
> > and clear from eventfd ack notifier.
> 
> Are you simply saying assert (kvm_set_irq(,,,1)) from irqfd trigger and
> de-assert (kvm_set_irq(,,,0)) from eventfd ack notifier (aka KVM_EOIFD)?

Yes.

> > There's no need for a special IRQ_LEVEL for this.
> 
> That ignores the whole problem of when do we need to allocate a new
> irq_source_id and when do we inject using KVM_USERSPACE_IRQ_SOURCE_ID.
> We've already discussed that a level triggered, externally fired
> interrupt must use a separate source ID from Qemu userspace.  Therefore
> when you say "look at the GSI to see if it's level or edge", I assume
> you mean trace the gsi back to the pic/ioapic pin and look at the
> trigger mode.  That trigger mode is configured by the guest, so that
> means that at the point in time when we call KVM_IRQFD we make a
> determination based on how the _guest_ has programmed the ioapic.  That
> may not match the interrupt we expect to inject.  On the other hand, the
> user calling KVM_IRQFD absolutely knows the type of interrupt provided
> by their device.  I think we need a flag regardless of whether your
> patch is accepted.  We may be able to share the 

Re: [PATCH 5/6 v5] deal with guest panicked event according to -onpanic parameter

2012-06-28 Thread Jan Kiszka
On 2012-06-28 03:15, Wen Congyang wrote:
> At 06/27/2012 10:39 PM, Jan Kiszka Wrote:
>> On 2012-06-27 09:02, Wen Congyang wrote:
>>> When the guest is panicked, it will write 0x1 to the port KVM_PV_PORT.
>>> So if qemu reads 0x1 from this port, we can do one of the following
>>> things according to the parameter -onpanic:
>>> 1. emit QEVENT_GUEST_PANICKED only
>>> 2. emit QEVENT_GUEST_PANICKED and pause the guest
>>> 3. emit QEVENT_GUEST_PANICKED and poweroff the guest
>>> 4. emit QEVENT_GUEST_PANICKED and reset the guest
>>>
>>> Note: if we emit QEVENT_GUEST_PANICKED only, and the management
>>> application does not receive this event (the management may not
>>> be running when the event is emitted), the management won't know the
>>> guest has panicked.
>>>
>>> Signed-off-by: Wen Congyang 
>>> ---
>>>  kvm-all.c   |  101 +++
>>>  kvm-stub.c  |9 +
>>>  kvm.h   |3 ++
>>>  qemu-options.hx |   15 
>>>  vl.c|   10 +
>>>  5 files changed, 138 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/kvm-all.c b/kvm-all.c
>>> index f8e4328..9494dd2 100644
>>> --- a/kvm-all.c
>>> +++ b/kvm-all.c
>>> @@ -19,6 +19,8 @@
>>>  #include 
>>>  
>>>  #include 
>>> +#include 
>>> +#include 
>>>  
>>>  #include "qemu-common.h"
>>>  #include "qemu-barrier.h"
>>> @@ -32,6 +34,9 @@
>>>  #include "bswap.h"
>>>  #include "memory.h"
>>>  #include "exec-memory.h"
>>> +#include "iorange.h"
>>> +#include "qemu-objects.h"
>>> +#include "monitor.h"
>>>  
>>>  /* This check must be after config-host.h is included */
>>>  #ifdef CONFIG_EVENTFD
>>> @@ -1931,3 +1936,99 @@ int kvm_on_sigbus(int code, void *addr)
>>>  {
>>>  return kvm_arch_on_sigbus(code, addr);
>>>  }
>>> +
>>> +/* Possible values for action parameter. */
>>> +#define PANICKED_REPORT 1   /* emit QEVENT_GUEST_PANICKED only */
>>> +#define PANICKED_PAUSE  2   /* emit QEVENT_GUEST_PANICKED and pause VM */
>>> +#define PANICKED_POWEROFF   3   /* emit QEVENT_GUEST_PANICKED and quit VM */
>>> +#define PANICKED_RESET  4   /* emit QEVENT_GUEST_PANICKED and reset VM */
>>> +
>>> +static int panicked_action = PANICKED_REPORT;
>>> +
>>> +static void kvm_pv_port_read(IORange *iorange, uint64_t offset, unsigned width,
>>> + uint64_t *data)
>>> +{
>>> +*data = (1 << KVM_PV_FEATURE_PANICKED);
>>> +}
>>> +
>>> +static void panicked_mon_event(const char *action)
>>> +{
>>> +QObject *data;
>>> +
>>> +data = qobject_from_jsonf("{ 'action': %s }", action);
>>> +monitor_protocol_event(QEVENT_GUEST_PANICKED, data);
>>> +qobject_decref(data);
>>> +}
>>> +
>>> +static void panicked_perform_action(void)
>>> +{
>>> +switch (panicked_action) {
>>> +case PANICKED_REPORT:
>>> +panicked_mon_event("report");
>>> +break;
>>> +
>>> +case PANICKED_PAUSE:
>>> +panicked_mon_event("pause");
>>> +vm_stop(RUN_STATE_GUEST_PANICKED);
>>> +break;
>>> +
>>> +case PANICKED_POWEROFF:
>>> +panicked_mon_event("poweroff");
>>> +exit(0);
>>> +break;
>>> +case PANICKED_RESET:
>>> +panicked_mon_event("reset");
>>> +qemu_system_reset_request();
>>> +break;
>>> +}
>>> +}
>>> +
>>> +static void kvm_pv_port_write(IORange *iorange, uint64_t offset, unsigned width,
>>> +  uint64_t data)
>>> +{
>>> +if (data == KVM_PV_PANICKED) {
>>> +panicked_perform_action();
>>> +}
>>> +}
>>> +
>>> +static void kvm_pv_port_destructor(IORange *iorange)
>>> +{
>>> +g_free(iorange);
>>> +}
>>> +
>>> +static IORangeOps pv_io_range_ops = {
>>> +.read = kvm_pv_port_read,
>>> +.write = kvm_pv_port_write,
>>> +.destructor = kvm_pv_port_destructor,
>>> +};
>>> +
>>> +#if defined(KVM_PV_PORT)
>>> +void kvm_pv_port_init(void)
>>> +{
>>> +IORange *pv_io_range = g_malloc(sizeof(IORange));
>>> +
>>> +iorange_init(pv_io_range, &pv_io_range_ops, KVM_PV_PORT, 1);
>>> +ioport_register(pv_io_range);
>>
>> This modeling is still not ok. We don't open-code ports anymore, we
>> introduce devices. And this doesn't belong into generic code as it is
> 
> Do you mean introducing a new device instead of an I/O port?

I mean encapsulating the I/O registration (PIO or MMIO) in a QOM device
and building that device only for target archs that support it. I already
pointed you to examples in hw/kvm/.
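
For concreteness, a rough sketch of such a device follows. The
MemoryRegionOps wiring only approximates the QEMU memory API of this era,
the QOM TypeInfo boilerplate is elided, and KVM_PV_PORT and friends come
from the proposed kernel header above -- treat it as a shape, not working
code.

/* QEMU-internal headers; paths approximate */
#include "hw/isa.h"
#include "memory.h"

typedef struct PVPanicState {
	ISADevice dev;
	MemoryRegion io;
} PVPanicState;

static uint64_t pv_panic_read(void *opaque, target_phys_addr_t addr,
			      unsigned size)
{
	return 1 << KVM_PV_FEATURE_PANICKED;
}

static void pv_panic_write(void *opaque, target_phys_addr_t addr,
			   uint64_t val, unsigned size)
{
	if (val == KVM_PV_PANICKED)
		panicked_perform_action();
}

static const MemoryRegionOps pv_panic_ops = {
	.read = pv_panic_read,
	.write = pv_panic_write,
};

static int pv_panic_initfn(ISADevice *dev)
{
	PVPanicState *s = DO_UPCAST(PVPanicState, dev, dev);

	memory_region_init_io(&s->io, &pv_panic_ops, s, "pv-panic", 1);
	isa_register_ioport(dev, &s->io, KVM_PV_PORT);
	return 0;
}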

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


[patch 1/3 v2] vfio: signedness bug in vfio_config_do_rw()

2012-06-28 Thread Dan Carpenter
The "count" variable needs to be signed here because we use it to store
negative error codes.

Signed-off-by: Dan Carpenter 
---
v2: Just declare count as signed.

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a4f7321..2e00aa8 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1419,7 +1419,7 @@ void vfio_config_free(struct vfio_pci_device *vdev)
 }
 
 static ssize_t vfio_config_do_rw(struct vfio_pci_device *vdev, char __user *buf,
-				 size_t count, loff_t *ppos, bool iswrite)
+				 ssize_t count, loff_t *ppos, bool iswrite)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct perm_bits *perm;
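
As an aside, the bug class is easy to demonstrate standalone: a negative
error code stored in an unsigned size_t becomes a huge positive value, so
any sign check silently stops firing.

#include <stdio.h>

int main(void)
{
	size_t count = -22;	/* e.g. a helper returned -EINVAL */

	if (count < 0)		/* always false for an unsigned type */
		printf("error detected\n");
	else
		printf("error lost: count = %zu\n", count);
	return 0;
}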


[PATCH 2/8][TRIVIAL] Fix typos in arch/x86/kvm/svm.c

2012-06-28 Thread Guo Chao

Signed-off-by: Guo Chao 
---
 arch/x86/kvm/svm.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f75af40..0d20bdd 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2063,7 +2063,7 @@ static inline bool nested_svm_intr(struct vcpu_svm *svm)
if (svm->nested.intercept & 1ULL) {
/*
 * The #vmexit can't be emulated here directly because this
-* code path runs with irqs and preemtion disabled. A
+* code path runs with irqs and preemption disabled. A
 * #vmexit emulation might sleep. Only signal request for
 * the #vmexit here.
 */
@@ -2409,7 +2409,7 @@ static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
 {
/*
 * This function merges the msr permission bitmaps of kvm and the
-* nested vmcb. It is omptimized in that it only merges the parts where
+* nested vmcb. It is optimized in that it only merges the parts where
 * the kvm msr permission bitmap may contain zero bits
 */
int i;
-- 
1.7.9.5





[PATCH 1/8][TRIVIAL] Fix typos in arch/x86/kvm/vmx.c

2012-06-28 Thread Guo Chao

Signed-off-by: Guo Chao 
---
 arch/x86/kvm/vmx.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 32eb588..7c40477 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1325,7 +1325,7 @@ static bool update_transition_efer(struct vcpu_vmx *vmx, int efer_offset)
guest_efer = vmx->vcpu.arch.efer;
 
/*
-* NX is emulated; LMA and LME handled by hardware; SCE meaninless
+* NX is emulated; LMA and LME handled by hardware; SCE meaningless
 * outside long mode
 */
ignore_bits = EFER_NX | EFER_SCE;
@@ -3218,7 +3218,7 @@ static void vmx_set_segment(struct kvm_vcpu *vcpu,
 * qemu binaries.
 *   IA32 arch specifies that at the time of processor reset the
 * "Accessed" bit in the AR field of segment registers is 1. And qemu
-* is setting it to 0 in the usedland code. This causes invalid guest
+* is setting it to 0 in the userland code. This causes invalid guest
 * state vmexit when "unrestricted guest" mode is turned on.
 *Fix for this setup issue in cpu_reset is being pushed in the qemu
 * tree. Newer qemu binaries with that qemu fix would not need this
@@ -4363,7 +4363,7 @@ vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
hypercall[2] = 0xc1;
 }
 
-/* called to set cr0 as approriate for a mov-to-cr0 exit. */
+/* called to set cr0 as appropriate for a mov-to-cr0 exit. */
 static int handle_set_cr0(struct kvm_vcpu *vcpu, unsigned long val)
 {
if (to_vmx(vcpu)->nested.vmxon &&
-- 
1.7.9.5


