Re: [PATCH v2 00/22] KVM: ARM64: Add guest PMU support

2015-09-16 Thread Shannon Zhao
Hi Wei,

On 2015/9/17 13:56, Wei Huang wrote:
> 
> 
> On 09/16/2015 08:32 PM, Shannon Zhao wrote:
>> Hi Wei,
>>
>> On 2015/9/17 5:07, Wei Huang wrote:
>>> I am testing this series. 
>> Thanks for your time and help.
>>
>>> The first question is: do you plan to add ACPI
>>> support in QEMU?
> 
> I saw "KVM_{SET/GET}_DEVICE_ATTR failed: Invalid argument" while using
> your QEMU tree (PMU_v2 branch). A quick debugging:
> 
From this log, it might fail at the check below:
+   if (reg < VGIC_NR_SGIS || reg > dev->kvm->arch.vgic.nr_irqs)
+   return -EINVAL;

> (1) dmesg on host kernel didn't show any vPMU initialization errors. So
> I suspect the problem is related to QEMU.
> (2) Commit 58771bc2a78 worked fine. So the problem was probably
> introduced by the new PMU code.
> 
> Have you seen it before?
> 

Oh, I haven't seen this. I checked out the code from git.linaro.org, and
it's the same as my local code.

Could you add some debug prints in kvm_arm_pmu_set_irq in QEMU's
hw/misc/arm_pmu_kvm.c, and in kvm_arm_pmu_set_attr and kvm_arm_pmu_set_irq
in the kernel's virt/kvm/arm/pmu.c? A rough sketch is shown below.
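
For illustration, a minimal kernel-side sketch of the kind of print meant
here. The function signature is an assumption (the real prototype lives in
the KVM_ARM64_PMU_v2 branch), and none of this is actual patch content:

#include <linux/kernel.h>
#include <linux/kvm_host.h>

/* Hypothetical instrumentation for virt/kvm/arm/pmu.c */
int kvm_arm_pmu_set_irq(struct kvm *kvm, int irq)
{
	int ret = 0;

	pr_info("%s: irq=%d\n", __func__, irq);

	/* ... existing body of the function ... */

	pr_info("%s: returning %d\n", __func__, ret);
	return ret;
}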

Thanks,

-- 
Shannon



Re: [PATCH] ARM/ARM64: KVM: remove 'config KVM_ARM_MAX_VCPUS'

2015-09-16 Thread Ming Lei
On Wed, Sep 2, 2015 at 7:42 PM, Ming Lei  wrote:
> On Wed, Sep 2, 2015 at 6:25 PM, Christoffer Dall
>  wrote:
>> On Wed, Sep 02, 2015 at 02:31:21PM +0800, Ming Lei wrote:
>>> This patch removes the config option KVM_ARM_MAX_VCPUS
>>> and, like other architectures, simply chooses the maximum
>>> allowed value from the hardware, for the following reasons:
>>>
>>> 1) from a distribution point of view, the option has to be
>>> defined as the maximum allowed value because it needs to
>>> cover all kinds of virtualization applications and
>>> support most SoCs;
>>>
>>> 2) using a bigger value doesn't introduce extra memory
>>> consumption, and the help text in Kconfig isn't accurate
>>> because the kvm_vcpu structure isn't allocated until a
>>> VCPU-creation request is sent from QEMU;
>>
>> This used to be true because of the vgic bitmaps, but that is now
>> dynamically allocated, so I believe you're correct in saying that the
>> text is no longer accurate.
>>
>>>
>>> 3) the main effect is that the vcpus[] field in 'struct kvm'
>>> becomes a bit bigger (sizeof(void *) per vcpu) and needs more cache
>>> lines to hold the structure, but 'struct kvm' is one generic struct,
>>> and it already works well this way on other architectures. Also,
>>> the world-switch frequency is often low; for example, it is ~2000
>>> when running a kernel-build load in a VM on an APM X-Gene KVM host,
>>> so the effect is very small, and the difference can't be observed
>>> in my test at all.
>>
>> While I'm not principally opposed to removing this option, I have to
>> point out that this analysis is far, far over-simplified.  You have
>> chosen a workload which exercises only CPU and memory virtualization,
>> mostly solved by the hardware virtualization support, and therefore you
>> don't see many exits.
>>
>> Try running an I/O bound workload, or something which involves a lot of
>> virtual IPIs, and you'll see a higher number of exits.
>
> Yeah, the frequency of exits becomes higher (6600/sec) when I run a
> purely I/O-bound benchmark (fio: 4 jobs, bs 4k, libaio over virtio-blk) in a
> quad-core VM, but it is still not high enough to cause any difference
> in the test result.
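
For reference, a small userspace sketch of the size argument from point 3
above; the 64-bit pointer size and VGIC_V2_MAX_CPUS == 8 are assumptions
made purely for illustration:

#include <stdio.h>

#define OLD_KVM_MAX_VCPUS 4	/* previous Kconfig default */
#define NEW_KVM_MAX_VCPUS 8	/* stand-in for VGIC_V2_MAX_CPUS */

/* struct kvm embeds a fixed array of vcpu pointers, so raising the limit
 * only costs sizeof(void *) per additional vcpu slot. */
struct fake_kvm {
	void *vcpus[NEW_KVM_MAX_VCPUS];
	/* ... the rest of struct kvm ... */
};

int main(void)
{
	printf("extra bytes per struct kvm: %zu\n",
	       (size_t)(NEW_KVM_MAX_VCPUS - OLD_KVM_MAX_VCPUS) * sizeof(void *));
	return 0;
}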
>
>>
>> However, I still doubt that the effects will be noticeable in the grand
>> scheme of things.
>>>
>>> Cc: Dann Frazier 
>>> Cc: Christoffer Dall 
>>> Cc: Marc Zyngier 
>>> Cc: kvm...@lists.cs.columbia.edu
>>> Cc: kvm@vger.kernel.org
>>> Signed-off-by: Ming Lei 
>>> ---
>>>  arch/arm/include/asm/kvm_host.h   |  8 ++--
>>>  arch/arm/kvm/Kconfig  | 11 ---
>>>  arch/arm64/include/asm/kvm_host.h |  8 ++--
>>>  arch/arm64/kvm/Kconfig| 11 ---
>>>  include/kvm/arm_vgic.h|  6 +-
>>>  virt/kvm/arm/vgic-v3.c|  2 +-
>>>  6 files changed, 6 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>>> index dcba0fa..c8c226a 100644
>>> --- a/arch/arm/include/asm/kvm_host.h
>>> +++ b/arch/arm/include/asm/kvm_host.h
>>> @@ -29,12 +29,6 @@
>>>
>>>  #define __KVM_HAVE_ARCH_INTC_INITIALIZED
>>>
>>> -#if defined(CONFIG_KVM_ARM_MAX_VCPUS)
>>> -#define KVM_MAX_VCPUS CONFIG_KVM_ARM_MAX_VCPUS
>>> -#else
>>> -#define KVM_MAX_VCPUS 0
>>> -#endif
>>> -
>>>  #define KVM_USER_MEM_SLOTS 32
>>>  #define KVM_PRIVATE_MEM_SLOTS 4
>>>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>>> @@ -44,6 +38,8 @@
>>>
>>>  #include 
>>>
>>> +#define KVM_MAX_VCPUS VGIC_V2_MAX_CPUS
>>> +
>>>  u32 *kvm_vcpu_reg(struct kvm_vcpu *vcpu, u8 reg_num, u32 mode);
>>>  int __attribute_const__ kvm_target_cpu(void);
>>>  int kvm_reset_vcpu(struct kvm_vcpu *vcpu);
>>> diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
>>> index bfb915d..210ecca 100644
>>> --- a/arch/arm/kvm/Kconfig
>>> +++ b/arch/arm/kvm/Kconfig
>>> @@ -45,15 +45,4 @@ config KVM_ARM_HOST
>>>   ---help---
>>> Provides host support for ARM processors.
>>>
>>> -config KVM_ARM_MAX_VCPUS
>>> - int "Number maximum supported virtual CPUs per VM"
>>> - depends on KVM_ARM_HOST
>>> - default 4
>>> - help
>>> -   Static number of max supported virtual CPUs per VM.
>>> -
>>> -   If you choose a high number, the vcpu structures will be quite
>>> -   large, so only choose a reasonable number that you expect to
>>> -   actually use.
>>> -
>>>  endif # VIRTUALIZATION
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index 415938d..3fb58ea 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -30,12 +30,6 @@
>>>
>>>  #define __KVM_HAVE_ARCH_INTC_INITIALIZED
>>>
>>> -#if defined(CONFIG_KVM_ARM_MAX_VCPUS)
>>> -#define KVM_MAX_VCPUS CONFIG_KVM_ARM_MAX_VCPUS
>>> -#else
>>> -#define KVM_MAX_VCPUS 0
>>> -#endif
>>> -
>>>  #define KVM_USER_MEM_SLOTS 32
>>>  #define KVM_PRIVATE_MEM_SLOTS 4
>>>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>>> @@ -43,6 +37,8 @@
>>>  #include 
>>>  #include 
>>>
>>> +#define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
>>>

Re: [PATCH v2 00/22] KVM: ARM64: Add guest PMU support

2015-09-16 Thread Wei Huang


On 09/16/2015 08:32 PM, Shannon Zhao wrote:
> Hi Wei,
> 
> On 2015/9/17 5:07, Wei Huang wrote:
>> I am testing this series. 
> Thanks for your time and help.
> 
>> The first question is: do you plan to add ACPI
>> support in QEMU?

I saw "KVM_{SET/GET}_DEVICE_ATTR failed: Invalid argument" while using
your QEMU tree (PMU_v2 branch). A quick debugging:

(1) dmesg on host kernel didn't show any vPMU initialization errors. So
I suspect the problem is related to QEMU.
(2) Commit 58771bc2a78 worked fine. So the problem was probably
introduced by the new PMU code.

Have you seen it before?

Thanks,
-Wei

> For completeness, this should be added; maybe it could be added in
> v3. But I had a look at the kernel PMU driver, and it doesn't support
> probing through ACPI, although there are some out-of-tree patches[1].
> 
>> My in-house kernel uses ACPI for device probing. I had
>> to force "acpi=off" when I tested this patch series.
> The guest kernel only boots with ACPI when you add "acpi=force"; there is
> no need to add "acpi=off".
> 
> Thanks,
> 
> [1] http://marc.info/?l=linaro-acpi&m=137949337925645&w=2
> 


RE: [PATCH v8 03/13] KVM: Define a new interface kvm_intr_is_single_vcpu()

2015-09-16 Thread Wu, Feng


> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Wednesday, September 16, 2015 5:23 PM
> To: Wu, Feng; alex.william...@redhat.com; j...@8bytes.org;
> mtosa...@redhat.com
> Cc: eric.au...@linaro.org; kvm@vger.kernel.org;
> io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH v8 03/13] KVM: Define a new interface
> kvm_intr_is_single_vcpu()
> 
> 
> 
> On 16/09/2015 10:49, Feng Wu wrote:
> > This patch defines a new interface, kvm_intr_is_single_vcpu(),
> > which returns whether the interrupt targets a single vCPU or not.
> >
> > It is used by VT-d PI, since for now we only support single-CPU
> > interrupts. For lowest-priority interrupts, if the user configures
> > them via /proc/irq or uses irqbalance to make them single-CPU, we
> > can use PI to deliver the interrupts. Full functionality
> > of lowest-priority support will be added later.
> >
> > Signed-off-by: Feng Wu 
> > ---
> > v8:
> > - Some optimizations in kvm_intr_is_single_vcpu().
> > - Expose kvm_intr_is_single_vcpu() so we can use it in vmx code.
> > - Add kvm_intr_is_single_vcpu_fast() as the fast path to find
> >   the target vCPU for the single-destination interrupt
> >
> >  arch/x86/include/asm/kvm_host.h |  3 ++
> >  arch/x86/kvm/irq_comm.c | 94 +
> >  arch/x86/kvm/lapic.c|  5 +--
> >  arch/x86/kvm/lapic.h|  2 +
> >  4 files changed, 101 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 49ec903..af11bca 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1204,4 +1204,7 @@ int __x86_set_memory_region(struct kvm *kvm,
> >  int x86_set_memory_region(struct kvm *kvm,
> >   const struct kvm_userspace_memory_region *mem);
> >
> > +bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
> > +struct kvm_vcpu **dest_vcpu);
> > +
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> > index 9efff9e..97ba1d6 100644
> > --- a/arch/x86/kvm/irq_comm.c
> > +++ b/arch/x86/kvm/irq_comm.c
> > @@ -297,6 +297,100 @@ out:
> > return r;
> >  }
> >
> > +static bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm,
> > +struct kvm_lapic_irq *irq,
> > +struct kvm_vcpu **dest_vcpu)
> 
> Please put this in lapic.c, similar to kvm_irq_delivery_to_apic_fast, so
> that you do not have to export other functions.
> 
> > +{
> > +   struct kvm_apic_map *map;
> > +   bool ret = false;
> > +   struct kvm_lapic *dst = NULL;
> > +
> > +   if (irq->shorthand)
> > +   return false;
> > +
> > +   rcu_read_lock();
> > +   map = rcu_dereference(kvm->arch.apic_map);
> > +
> > +   if (!map)
> > +   goto out;
> > +
> > +   if (irq->dest_mode == APIC_DEST_PHYSICAL) {
> > +   if (irq->dest_id == 0xFF)
> > +   goto out;
> > +
> > +   if (irq->dest_id >= ARRAY_SIZE(map->phys_map)) {
> 
> Warning here is wrong, the guest can trigger it.

Could you please share more information about how the guest can
trigger these conditions (including the following two)? Thanks
a lot!

Thanks,
Feng

> 
> > +   WARN_ON_ONCE(1);
> > +   goto out;
> > +   }
> > +
> > +   dst = map->phys_map[irq->dest_id];
> > +   if (dst && kvm_apic_present(dst->vcpu))
> > +   *dest_vcpu = dst->vcpu;
> > +   else
> > +   goto out;
> > +   } else {
> > +   u16 cid;
> > +   unsigned long bitmap = 1;
> > +   int i, r = 0;
> > +
> > +   if (!kvm_apic_logical_map_valid(map)) {
> > +   WARN_ON_ONCE(1);
> 
> Same here.
> 
> > +   goto out;
> > +   }
> > +
> > +   apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
> > +
> > +   if (cid >= ARRAY_SIZE(map->logical_map)) {
> > +   WARN_ON_ONCE(1);
> 
> Same here.
> 
> Otherwise looks good.
> 
> Paolo
> 
> > +   goto out;
> > +   }
> > +
> > +   for_each_set_bit(i, &bitmap, 16) {
> > +   dst = map->logical_map[cid][i];
> > +   if (++r == 2)
> > +   goto out;
> > +   }
> > +
> > +   if (dst && kvm_apic_present(dst->vcpu))
> > +   *dest_vcpu = dst->vcpu;
> > +   else
> > +   goto out;
> > +   }
> > +
> > +   ret = true;
> > +out:
> > +   rcu_read_unlock();
> > +   return ret;
> > +}
> > +
> > +
> > +bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
> > +struct kvm_vcpu **dest_vcpu)
> > +{
> > +   int i, r = 0;
> > +   struct kvm_vcpu *vcpu;
> > +
> > +   if (kvm_intr_is_single_vcpu_fast(kvm, irq, dest_vcpu))
> >

RE: [PATCH v8 09/13] KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'

2015-09-16 Thread Wu, Feng


> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Wednesday, September 16, 2015 5:27 PM
> To: Wu, Feng; alex.william...@redhat.com; j...@8bytes.org;
> mtosa...@redhat.com
> Cc: eric.au...@linaro.org; kvm@vger.kernel.org;
> io...@lists.linux-foundation.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH v8 09/13] KVM: Add an arch specific hooks in 'struct
> kvm_kernel_irqfd'
> 
> 
> 
> On 16/09/2015 10:50, Feng Wu wrote:
> > +int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
> > +  uint32_t guest_irq, bool set)
> > +{
> > +   return !kvm_x86_ops->update_pi_irte ? -EINVAL :
> > +   kvm_x86_ops->update_pi_irte(kvm, host_irq, guest_irq, set);
> > +}
> > +
> 
> Just use "if" here.  No need to resend if this is the only comment.

I am sorry, I don't quite understand. Do you mean I don't need to include
this patch in v9? If so, what about the other patches with your Reviewed-by?

Thanks,
Feng

> 
> >
> >  }
> > +int  __attribute__((weak)) kvm_arch_update_irqfd_routing(
> > +   struct kvm *kvm, unsigned
> 
> Empty line after "}".
> 
> Paolo


Re: [PATCH v2 00/22] KVM: ARM64: Add guest PMU support

2015-09-16 Thread Shannon Zhao
Hi Wei,

On 2015/9/17 5:07, Wei Huang wrote:
> I am testing this series. 
Thanks for your time and help.

> The first question is: do you plan to add ACPI
> support in QEMU?
For completeness, this should be added; maybe it could be added in
v3. But I had a look at the kernel PMU driver, and it doesn't support
probing through ACPI, although there are some out-of-tree patches[1].

> My in-house kernel uses ACPI for device probing. I had
> to force "acpi=off" when I tested this patch series.
The guest kernel only boots with ACPI when you add "acpi=force"; there is
no need to add "acpi=off".

Thanks,

[1] http://marc.info/?l=linaro-acpi&m=137949337925645&w=2
-- 
Shannon


[PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops

2015-09-16 Thread Andy Lutomirski
Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently
turns all rdmsr and wrmsr operations into the safe variants without
any checks that the operations actually succeed.

This is IMO awful: it papers over bugs.  In particular, KVM guests
might be unwittingly depending on this behavior because
CONFIG_KVM_GUEST currently depends on CONFIG_PARAVIRT.  I'm not
aware of any such problems, but applying this series would be a good
way to shake them out.

Fix it so that the MSR operations work the same on CONFIG_PARAVIRT=n
and CONFIG_PARAVIRT=y as long as Xen isn't being used.  The Xen
maintainers are welcome to make a similar change on top of this.

Since there's plenty of time before the next merge window, I think
we should apply and fix anything that breaks.

Doing this is probably a prerequisite to sanely decoupling
CONFIG_KVM_GUEST and CONFIG_PARAVIRT, which would probably make
Arjan and the rest of the Clear Containers people happy :)
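
To make the behavioural difference concrete, here is a rough sketch of the
two wrmsr flavours on x86 -- not the kernel's exact implementation, just
the contract this series restores on bare metal:

/* "Unsafe": a write to a non-existent MSR raises #GP and oopses. */
static inline void sketch_wrmsr(unsigned int msr, unsigned int low,
				unsigned int high)
{
	asm volatile("wrmsr" : : "c" (msr), "a" (low), "d" (high) : "memory");
}

/* "Safe": the fault is caught (the real kernel uses an exception-table
 * fixup) and reported as an error instead of crashing. */
static inline int sketch_wrmsr_safe(unsigned int msr, unsigned int low,
				    unsigned int high)
{
	int err = 0;

	/* ... wrmsr wrapped in exception handling, err = -EIO on #GP ... */
	return err;
}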

Andy Lutomirski (3):
  x86/paravirt: Add _safe to the read_msr and write_msr PV hooks
  x86/paravirt: Add paravirt_{read,write}_msr
  x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y

 arch/x86/include/asm/paravirt.h   | 45 +--
 arch/x86/include/asm/paravirt_types.h | 12 +++---
 arch/x86/kernel/paravirt.c|  6 +++--
 arch/x86/xen/enlighten.c  | 27 +++--
 4 files changed, 65 insertions(+), 25 deletions(-)

-- 
2.4.3



[PATCH 2/3] x86/paravirt: Add paravirt_{read,write}_msr

2015-09-16 Thread Andy Lutomirski
This adds paravirt hooks for unsafe MSR access.  On native, they
call native_{read,write}_msr.  On Xen, they use
xen_{read,write}_msr_safe.

Nothing uses them yet for ease of bisection.  The next patch will
use them in rdmsrl, wrmsrl, etc.

I intentionally didn't make them OOPS on #GP on Xen.  I think that
should be done separately by the Xen maintainers.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/paravirt.h   | 11 +++
 arch/x86/include/asm/paravirt_types.h | 10 --
 arch/x86/kernel/paravirt.c|  2 ++
 arch/x86/xen/enlighten.c  | 23 +++
 4 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9cf6e5232b0d..e6569a3b0a37 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -123,6 +123,17 @@ static inline void wbinvd(void)
 
 #define get_kernel_rpl()  (pv_info.kernel_rpl)
 
+static inline u64 paravirt_read_msr(unsigned msr)
+{
+   return PVOP_CALL1(u64, pv_cpu_ops.read_msr, msr);
+}
+
+static inline void paravirt_write_msr(unsigned msr,
+ unsigned low, unsigned high)
+{
+   return PVOP_VCALL3(pv_cpu_ops.write_msr, msr, low, high);
+}
+
 static inline u64 paravirt_read_msr_safe(unsigned msr, int *err)
 {
return PVOP_CALL2(u64, pv_cpu_ops.read_msr_safe, msr, err);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 810e8ccaa42a..cb1e7af9fe42 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -151,8 +151,14 @@ struct pv_cpu_ops {
void (*cpuid)(unsigned int *eax, unsigned int *ebx,
  unsigned int *ecx, unsigned int *edx);
 
-   /* MSR operations.
-  err = 0/-EIO.  wrmsr returns 0/-EIO. */
+   /* Unsafe MSR operations.  These either succeed or crash. */
+   u64 (*read_msr)(unsigned int msr);
+   int (*write_msr)(unsigned int msr, unsigned low, unsigned high);
+
+   /*
+* Safe MSR operations.
+* err = 0/-EIO.  wrmsr returns 0/-EIO.
+*/
u64 (*read_msr_safe)(unsigned int msr, int *err);
int (*write_msr_safe)(unsigned int msr, unsigned low, unsigned high);
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index fe8cd519a796..21fc4686f760 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -349,6 +349,8 @@ __visible struct pv_cpu_ops pv_cpu_ops = {
.write_cr8 = native_write_cr8,
 #endif
.wbinvd = native_wbinvd,
+   .read_msr = native_read_msr,
+   .write_msr = native_write_msr,
.read_msr_safe = native_read_msr_safe,
.write_msr_safe = native_write_msr_safe,
.read_pmc = native_read_pmc,
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 8523d42d163e..d2bc6afeaf33 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1086,6 +1086,26 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high)
return ret;
 }
 
+static u64 xen_read_msr(unsigned int msr)
+{
+   /*
+* This will silently swallow a #GP from RDMSR.  It may be worth
+* changing that.
+*/
+   int err;
+
+   return xen_read_msr_safe(msr, &err);
+}
+
+static void xen_write_msr(unsigned int msr, unsigned low, unsigned high)
+{
+   /*
+* This will silently swallow a #GP from WRMSR.  It may be worth
+* changing that.
+*/
+   xen_write_msr_safe(msr, low, high);
+}
+
 void xen_setup_shared_info(void)
 {
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
@@ -1216,6 +1236,9 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
 
.wbinvd = native_wbinvd,
 
+   .read_msr = xen_read_msr,
+   .write_msr = xen_write_msr,
+
.read_msr_safe = xen_read_msr_safe,
.write_msr_safe = xen_write_msr_safe,
 
-- 
2.4.3



[PATCH 1/3] x86/paravirt: Add _safe to the read_msr and write_msr PV hooks

2015-09-16 Thread Andy Lutomirski
These hooks match the _safe variants, so name them accordingly.
This will make room for unsafe PV hooks.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/paravirt.h   | 33 +
 arch/x86/include/asm/paravirt_types.h |  8 
 arch/x86/kernel/paravirt.c|  4 ++--
 arch/x86/xen/enlighten.c  |  4 ++--
 4 files changed, 25 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 10d0596433f8..9cf6e5232b0d 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -123,34 +123,35 @@ static inline void wbinvd(void)
 
 #define get_kernel_rpl()  (pv_info.kernel_rpl)
 
-static inline u64 paravirt_read_msr(unsigned msr, int *err)
+static inline u64 paravirt_read_msr_safe(unsigned msr, int *err)
 {
-   return PVOP_CALL2(u64, pv_cpu_ops.read_msr, msr, err);
+   return PVOP_CALL2(u64, pv_cpu_ops.read_msr_safe, msr, err);
 }
 
-static inline int paravirt_write_msr(unsigned msr, unsigned low, unsigned high)
+static inline int paravirt_write_msr_safe(unsigned msr,
+ unsigned low, unsigned high)
 {
-   return PVOP_CALL3(int, pv_cpu_ops.write_msr, msr, low, high);
+   return PVOP_CALL3(int, pv_cpu_ops.write_msr_safe, msr, low, high);
 }
 
 /* These should all do BUG_ON(_err), but our headers are too tangled. */
 #define rdmsr(msr, val1, val2) \
 do {   \
int _err;   \
-   u64 _l = paravirt_read_msr(msr, &_err); \
+   u64 _l = paravirt_read_msr_safe(msr, &_err);\
val1 = (u32)_l; \
val2 = _l >> 32;\
 } while (0)
 
 #define wrmsr(msr, val1, val2) \
 do {   \
-   paravirt_write_msr(msr, val1, val2);\
+   paravirt_write_msr_safe(msr, val1, val2);   \
 } while (0)
 
 #define rdmsrl(msr, val)   \
 do {   \
int _err;   \
-   val = paravirt_read_msr(msr, &_err);\
+   val = paravirt_read_msr_safe(msr, &_err);   \
 } while (0)
 
 static inline void wrmsrl(unsigned msr, u64 val)
@@ -158,23 +159,23 @@ static inline void wrmsrl(unsigned msr, u64 val)
wrmsr(msr, (u32)val, (u32)(val>>32));
 }
 
-#define wrmsr_safe(msr, a, b)  paravirt_write_msr(msr, a, b)
+#define wrmsr_safe(msr, a, b)  paravirt_write_msr_safe(msr, a, b)
 
 /* rdmsr with exception handling */
-#define rdmsr_safe(msr, a, b)  \
-({ \
-   int _err;   \
-   u64 _l = paravirt_read_msr(msr, &_err); \
-   (*a) = (u32)_l; \
-   (*b) = _l >> 32;\
-   _err;   \
+#define rdmsr_safe(msr, a, b)  \
+({ \
+   int _err;   \
+   u64 _l = paravirt_read_msr_safe(msr, &_err);\
+   (*a) = (u32)_l; \
+   (*b) = _l >> 32;\
+   _err;   \
 })
 
 static inline int rdmsrl_safe(unsigned msr, unsigned long long *p)
 {
int err;
 
-   *p = paravirt_read_msr(msr, &err);
+   *p = paravirt_read_msr_safe(msr, &err);
return err;
 }
 
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index ce029e4fa7c6..810e8ccaa42a 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -151,10 +151,10 @@ struct pv_cpu_ops {
void (*cpuid)(unsigned int *eax, unsigned int *ebx,
  unsigned int *ecx, unsigned int *edx);
 
-   /* MSR, PMC and TSR operations.
-  err = 0/-EFAULT.  wrmsr returns 0/-EFAULT. */
-   u64 (*read_msr)(unsigned int msr, int *err);
-   int (*write_msr)(unsigned int msr, unsigned low, unsigned high);
+   /* MSR operations.
+  err = 0/-EIO.  wrmsr returns 0/-EIO. */
+   u64 (*read_msr_safe)(unsigned int msr, int *err);
+   int (*write_msr_safe)(unsigned int msr, unsigned low, unsigned high);
 
u64 (*read_pmc)(int counter);
 
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index f68e48f5f6c2..fe8cd519a796 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -349,8 +349,8 @@ __visible struct pv_cpu_ops pv_cpu_ops = {
.write_cr8 = native_write_cr8,
 #endif
.wbinvd = native_wbinvd,
-   .read_msr = native_read_msr_safe,
-   .write_msr = native_write_msr_safe,
+   .read_msr_safe = native_read_msr_safe,
+   .write_msr_safe = native_write_msr_safe,
.read_pmc = native_read

[PATCH 3/3] x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y

2015-09-16 Thread Andy Lutomirski
Enabling CONFIG_PARAVIRT had an unintended side effect: rdmsr turned
into rdmsr_safe and wrmsr turned into wrmsr_safe, even on bare
metal.  Undo that by using the new unsafe paravirt MSR hooks.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/paravirt.h | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index e6569a3b0a37..f61975a3ccfd 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -145,24 +145,21 @@ static inline int paravirt_write_msr_safe(unsigned msr,
return PVOP_CALL3(int, pv_cpu_ops.write_msr_safe, msr, low, high);
 }
 
-/* These should all do BUG_ON(_err), but our headers are too tangled. */
 #define rdmsr(msr, val1, val2) \
 do {   \
-   int _err;   \
-   u64 _l = paravirt_read_msr_safe(msr, &_err);\
+   u64 _l = paravirt_read_msr(msr);\
val1 = (u32)_l; \
val2 = _l >> 32;\
 } while (0)
 
 #define wrmsr(msr, val1, val2) \
 do {   \
-   paravirt_write_msr_safe(msr, val1, val2);   \
+   paravirt_write_msr(msr, val1, val2);\
 } while (0)
 
 #define rdmsrl(msr, val)   \
 do {   \
-   int _err;   \
-   val = paravirt_read_msr_safe(msr, &_err);   \
+   val = paravirt_read_msr(msr);   \
 } while (0)
 
 static inline void wrmsrl(unsigned msr, u64 val)
-- 
2.4.3



Re: [PATCH v2 00/22] KVM: ARM64: Add guest PMU support

2015-09-16 Thread Wei Huang


On 09/11/2015 03:54 AM, Shannon Zhao wrote:
> From: Shannon Zhao 
> 
> This patchset adds guest PMU support for KVM on ARM64. It takes a
> trap-and-emulate approach. When the guest wants to monitor an event, the
> access is trapped by KVM, which calls the perf_event API to create a perf
> event and then uses the relevant perf_event APIs to read the event's count.
> 
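As a rough illustration of that flow (not the patchset's actual code; the
attribute fields below are assumptions), a trapped guest event selection
can be backed by the in-kernel perf API roughly like this:

#include <linux/perf_event.h>
#include <linux/sched.h>

static struct perf_event *sketch_create_guest_counter(u64 guest_evtype)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_RAW,
		.size		= sizeof(attr),
		.config		= guest_evtype,	/* event number written by the guest */
		.exclude_host	= 1,		/* count only while the vcpu runs */
	};

	/* cpu = -1, task = current: follow the vcpu thread */
	return perf_event_create_kernel_counter(&attr, -1, current, NULL, NULL);
}
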
> Use perf to test this patchset in the guest. "perf list" shows the
> hardware events and hardware cache events perf supports. Then use
> "perf stat -e EVENT" to monitor some event; for example, use
> "perf stat -e cycles" to count CPU cycles and
> "perf stat -e cache-misses" to count cache misses.
> 
> Below are the outputs of "perf stat -r 5 sleep 5" when running in host
> and guest.
> 
> Host:
>  Performance counter stats for 'sleep 5' (5 runs):
> 
>   0.551428  task-clock (msec) #0.000 CPUs utilized
> ( +-  0.91% )
>  1  context-switches  #0.002 M/sec
>  0  cpu-migrations#0.000 K/sec
> 48  page-faults   #0.088 M/sec
> ( +-  1.05% )
>1150265  cycles#2.086 GHz  
> ( +-  0.92% )
>  <not supported>  stalled-cycles-frontend
>  <not supported>  stalled-cycles-backend
> 526398  instructions  #0.46  insns per cycle  
> ( +-  0.89% )
>  <not supported>  branches
>   9485  branch-misses #   17.201 M/sec
> ( +-  2.35% )
> 
>5.000831616 seconds time elapsed   
>( +-  0.00% )
> 
> Guest:
>  Performance counter stats for 'sleep 5' (5 runs):
> 
>   0.730868  task-clock (msec) #0.000 CPUs utilized
> ( +-  1.13% )
>  1  context-switches  #0.001 M/sec
>  0  cpu-migrations#0.000 K/sec
> 48  page-faults   #0.065 M/sec
> ( +-  0.42% )
>1642982  cycles#2.248 GHz  
> ( +-  1.04% )
>  <not supported>  stalled-cycles-frontend
>  <not supported>  stalled-cycles-backend
> 637964  instructions  #0.39  insns per cycle  
> ( +-  0.65% )
>  <not supported>  branches
>  10377  branch-misses #   14.198 M/sec
> ( +-  1.09% )
> 
>5.001289068 seconds time elapsed   
>( +-  0.00% )
> 
> This patchset can be fetched from [1] and the relevant QEMU version for
> test can be fetched from [2].
> 
> Thanks,
> Shannon
> 
> [1] https://git.linaro.org/people/shannon.zhao/linux-mainline.git  
> KVM_ARM64_PMU_v2
> [2] https://git.linaro.org/people/shannon.zhao/qemu.git  PMU_v2

I am testing this series. The first question is: do you plan to add ACPI
support in QEMU? My in-house kernel uses ACPI for device probing. I had
to force "acpi=off" when I test this patch series.

> 
> Shannon Zhao (22):
>   ARM64: Move PMU register related defines to asm/pmu.h
>   KVM: ARM64: Define PMU data structure for each vcpu
>   KVM: ARM64: Add offset defines for PMU registers
>   KVM: ARM64: Add reset and access handlers for PMCR_EL0 register
>   KVM: ARM64: Add a helper for CP15 registers reset to UNKNOWN
>   KVM: ARM64: Add reset and access handlers for PMSELR register
>   KVM: ARM64: Add reset and access handlers for PMCEID0 and PMCEID1
> register
>   KVM: ARM64: PMU: Add perf event map and introduce perf event creating
> function
>   KVM: ARM64: Add reset and access handlers for PMXEVTYPER register
>   KVM: ARM64: Add reset and access handlers for PMXEVCNTR register
>   KVM: ARM64: Add reset and access handlers for PMCCNTR register
>   KVM: ARM64: Add reset and access handlers for PMCNTENSET and
> PMCNTENCLR register
>   KVM: ARM64: Add reset and access handlers for PMINTENSET and
> PMINTENCLR register
>   KVM: ARM64: Add reset and access handlers for PMOVSSET and PMOVSCLR
> register
>   KVM: ARM64: Add a helper for CP15 registers reset to specified value
>   KVM: ARM64: Add reset and access handlers for PMUSERENR register
>   KVM: ARM64: Add reset and access handlers for PMSWINC register
>   KVM: ARM64: Add access handlers for PMEVCNTRn and PMEVTYPERn register
>   KVM: ARM64: Add PMU overflow interrupt routing
>   KVM: ARM64: Reset PMU state when resetting vcpu
>   KVM: ARM64: Free perf event of PMU when destroying vcpu
>   KVM: ARM64: Add a new kvm ARM PMU device
> 
>  Documentation/virtual/kvm/devices/arm-pmu.txt |  15 +
>  arch/arm/kvm/arm.c|   4 +
>  arch/arm64/include/asm/kvm_asm.h  |  59 ++-
>  arch/arm64/include/asm/kvm_host.h |   2 +
>  arch/arm64/include/asm/pmu.h  |  49 +++
>  arch/arm64/include/uapi/asm/kvm.h |   3 +
>  arch/arm64

Re: [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking"

2015-09-16 Thread Christian Borntraeger
Both patches in combination
Tested-by: Christian Borntraeger  # on top of 4.3-rc1


As a side note, patch 2 does not apply cleanly on 4.2, so we probably need to 
provide a separate backport.

Christian



Re: [PATCH] KVM: arm64: remove all traces of the ThumbEE registers

2015-09-16 Thread Marc Zyngier
On 15/09/15 17:23, Peter Maydell wrote:
> On 15 September 2015 at 17:15, Will Deacon  wrote:
>> Although the ThumbEE registers and traps were present in earlier
>> versions of the v8 architecture, they were retrospectively removed and so
>> we can do the same.
>>
>> Cc: Marc Zyngier 
>> Signed-off-by: Will Deacon 
> 
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index b41607d270ac..6c35e49757d8 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -539,13 +539,6 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>> { Op0(0b10), Op1(0b000), CRn(0b0111), CRm(0b1110), Op2(0b110),
>>   trap_dbgauthstatus_el1 },
>>
>> -   /* TEECR32_EL1 */
>> -   { Op0(0b10), Op1(0b010), CRn(0b), CRm(0b), Op2(0b000),
>> - NULL, reset_val, TEECR32_EL1, 0 },
>> -   /* TEEHBR32_EL1 */
>> -   { Op0(0b10), Op1(0b010), CRn(0b0001), CRm(0b), Op2(0b000),
>> - NULL, reset_val, TEEHBR32_EL1, 0 },
>> -
> 
> I guess this is a VM migration compatibility break between kernels
> without this patch and kernels with it? I think that's OK at this
> point, but it would be nice to mention it in the commit message.

I'll add a comment to that effect.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking"

2015-09-16 Thread Oleg Nesterov
On 09/16, Tejun Heo wrote:
>
> From f9f9e7b776142fb1c0782cade004cc8e0147a199 Mon Sep 17 00:00:00 2001
> From: Tejun Heo 
> Date: Wed, 16 Sep 2015 11:51:12 -0400
>
> This reverts commit b5ba75b5fc0e8404e2c50cb68f39bb6a53fc916f.
>
> d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
> a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
> threadgroup locking") changed how cgroup synchronizes against task
> fork and exits so that it uses global percpu_rwsem instead of
> per-process rwsem; unfortunately, the write [un]lock paths of
> percpu_rwsem always involve synchronize_rcu_expedited() which turned
> out to be too expensive.
>
> Improvements for percpu_rwsem are scheduled to be merged in the coming
> v4.4-rc1 merge window which alleviates this issue.  For now, revert
> the two commits to restore per-process rwsem.  They will be re-applied
> for the v4.4-rc1 merge window.
>
> Signed-off-by: Tejun Heo 
> Link: http://lkml.kernel.org/g/55f8097a.7000...@de.ibm.com
> Reported-by: Christian Borntraeger 
> Cc: Oleg Nesterov 
> Cc: "Paul E. McKenney" 
> Cc: Peter Zijlstra 
> Cc: Paolo Bonzini 
> Cc: sta...@vger.kernel.org # v4.2+

So just in case, I agree. Perhaps we could merge the percpu_rwsem changes
in v4.3, but these patches look much safer for -stable.

Oleg.



[PATCH 1/3] arm/arm64: KVM: vgic: Check for !irqchip_in_kernel() when mapping resources

2015-09-16 Thread Marc Zyngier
From: Pavel Fedin 

Until b26e5fdac43c ("arm/arm64: KVM: introduce per-VM ops"),
kvm_vgic_map_resources() used to include a check on irqchip_in_kernel(),
and vgic_v2_map_resources() still has it.

But now vm_ops are not initialized until we call kvm_vgic_create().
Therefore kvm_vgic_map_resources() can being called without a VGIC,
and we die because vm_ops.map_resources is NULL.

Fixing this restores QEMU's kernel-irqchip=off option to a working state,
allowing GIC emulation in userspace to be used.

Fixes: b26e5fdac43c ("arm/arm64: KVM: introduce per-VM ops")
Cc: sta...@vger.kernel.org
Signed-off-by: Pavel Fedin 
[maz: reworked commit message]
Signed-off-by: Marc Zyngier 
---
 arch/arm/kvm/arm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index ce404a5..dc017ad 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -446,7 +446,7 @@ static int kvm_vcpu_first_run_init(struct kvm_vcpu *vcpu)
 * Map the VGIC hardware resources before running a vcpu the first
 * time on this VM.
 */
-   if (unlikely(!vgic_ready(kvm))) {
+   if (unlikely(irqchip_in_kernel(kvm) && !vgic_ready(kvm))) {
ret = kvm_vgic_map_resources(kvm);
if (ret)
return ret;
-- 
2.1.4



[PATCH 2/3] arm64: KVM: Disable virtual timer even if the guest is not using it

2015-09-16 Thread Marc Zyngier
When running a guest with the architected timer disabled (with QEMU and
the kernel_irqchip=off option, for example), it is important to make
sure the timer gets turned off. Otherwise, the guest may try to
enable it anyway, leading to a screaming HW interrupt.

The fix is to unconditionally turn off the virtual timer on guest
exit.
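
A minimal C-level sketch of what the assembly change below does, assuming
EL2/hyp context (the actual fix is in hyp.S):

/* On every guest exit, zero CNTV_CTL_EL0 (including ENABLE) instead of
 * clearing the enable bit only on the "timer in use" path. */
static inline void sketch_vtimer_force_off(void)
{
	asm volatile("msr cntv_ctl_el0, xzr" : : : "memory");
}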

Cc: sta...@vger.kernel.org
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/hyp.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index 6addf97..38f5434 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -570,8 +570,6 @@ alternative_endif
mrs x3, cntv_ctl_el0
and x3, x3, #3
str w3, [x0, #VCPU_TIMER_CNTV_CTL]
-   bic x3, x3, #1  // Clear Enable
-   msr cntv_ctl_el0, x3
 
isb
 
@@ -579,6 +577,8 @@ alternative_endif
str x3, [x0, #VCPU_TIMER_CNTV_CVAL]
 
 1:
+   msr cntv_ctl_el0, xzr
+
// Allow physical timer/counter access for the host
mrs x2, cnthctl_el2
orr x2, x2, #3
-- 
2.1.4



[PATCH 3/3] arm: KVM: Disable virtual timer even if the guest is not using it

2015-09-16 Thread Marc Zyngier
When running a guest with the architected timer disabled (with QEMU and
the kernel_irqchip=off option, for example), it is important to make
sure the timer gets turned off. Otherwise, the guest may try to
enable it anyway, leading to a screaming HW interrupt.

The fix is to unconditionally turn off the virtual timer on guest
exit.

Cc: sta...@vger.kernel.org
Signed-off-by: Marc Zyngier 
---
 arch/arm/kvm/interrupts_head.S | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kvm/interrupts_head.S b/arch/arm/kvm/interrupts_head.S
index 702740d..51a5950 100644
--- a/arch/arm/kvm/interrupts_head.S
+++ b/arch/arm/kvm/interrupts_head.S
@@ -515,8 +515,7 @@ ARM_BE8(rev r6, r6  )
 
mrc p15, 0, r2, c14, c3, 1  @ CNTV_CTL
str r2, [vcpu, #VCPU_TIMER_CNTV_CTL]
-   bic r2, #1  @ Clear ENABLE
-   mcr p15, 0, r2, c14, c3, 1  @ CNTV_CTL
+
isb
 
mrrcp15, 3, rr_lo_hi(r2, r3), c14   @ CNTV_CVAL
@@ -529,6 +528,9 @@ ARM_BE8(rev r6, r6  )
mcrrp15, 4, r2, r2, c14 @ CNTVOFF
 
 1:
+   mov r2, #0  @ Clear ENABLE
+   mcr p15, 0, r2, c14, c3, 1  @ CNTV_CTL
+
@ Allow physical timer/counter access for the host
mrc p15, 4, r2, c14, c1, 0  @ CNTHCTL
orr r2, r2, #(CNTHCTL_PL1PCEN | CNTHCTL_PL1PCTEN)
-- 
2.1.4



[PATCH 0/3] arm/arm64: KVM: Fix !irqchip_in_kernel() handling

2015-09-16 Thread Marc Zyngier
It is quite obvious that the non-kernel-irqchip case has been bitrotting
for a while now. Pavel has been doing some work to address this, but
it turns out that there is more fun to be had...

This series picks up one of Pavel's series (which is an obvious fix),
and adds another fix for both arm and arm64.

With these two patches applied, I get a working system as long as I
don't need timers - which is expected.

Tested on X-Gene.

Marc Zyngier (2):
  arm64: KVM: Disable virtual timer even if the guest is not using it
  arm: KVM: Disable virtual timer even if the guest is not using it

Pavel Fedin (1):
  arm/arm64: KVM: vgic: Check for !irqchip_in_kernel() when mapping
resources

 arch/arm/kvm/arm.c | 2 +-
 arch/arm/kvm/interrupts_head.S | 6 --
 arch/arm64/kvm/hyp.S   | 4 ++--
 3 files changed, 7 insertions(+), 5 deletions(-)

-- 
2.1.4



[PATCH cgroup/for-4.3-fixes 2/2] Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem"

2015-09-16 Thread Tejun Heo
From 0c986253b939cc14c69d4adbe2b4121bdf4aa220 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Wed, 16 Sep 2015 11:51:12 -0400

This reverts commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14.

d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
threadgroup locking") changed how cgroup synchronizes against task
fork and exits so that it uses global percpu_rwsem instead of
per-process rwsem; unfortunately, the write [un]lock paths of
percpu_rwsem always involve synchronize_rcu_expedited() which turned
out to be too expensive.

Improvements for percpu_rwsem are scheduled to be merged in the coming
v4.4-rc1 merge window which alleviates this issue.  For now, revert
the two commits to restore per-process rwsem.  They will be re-applied
for the v4.4-rc1 merge window.

Signed-off-by: Tejun Heo 
Link: http://lkml.kernel.org/g/55f8097a.7000...@de.ibm.com
Reported-by: Christian Borntraeger 
Cc: Oleg Nesterov 
Cc: "Paul E. McKenney" 
Cc: Peter Zijlstra 
Cc: Paolo Bonzini 
Cc: sta...@vger.kernel.org # v4.2+
---
 include/linux/cgroup-defs.h | 27 ++--
 include/linux/init_task.h   |  8 +
 include/linux/sched.h   | 12 +++
 kernel/cgroup.c | 77 +
 kernel/fork.c   |  4 +++
 5 files changed, 83 insertions(+), 45 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 4d8fcf2..8492721 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -473,31 +473,8 @@ struct cgroup_subsys {
unsigned int depends_on;
 };
 
-extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
-
-/**
- * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
- * @tsk: target task
- *
- * Called from threadgroup_change_begin() and allows cgroup operations to
- * synchronize against threadgroup changes using a percpu_rw_semaphore.
- */
-static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
-{
-   percpu_down_read(&cgroup_threadgroup_rwsem);
-}
-
-/**
- * cgroup_threadgroup_change_end - threadgroup exclusion for cgroups
- * @tsk: target task
- *
- * Called from threadgroup_change_end().  Counterpart of
- * cgroup_threadcgroup_change_begin().
- */
-static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
-{
-   percpu_up_read(&cgroup_threadgroup_rwsem);
-}
+void cgroup_threadgroup_change_begin(struct task_struct *tsk);
+void cgroup_threadgroup_change_end(struct task_struct *tsk);
 
 #else  /* CONFIG_CGROUPS */
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index d0b380e..e38681f 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -25,6 +25,13 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 
+#ifdef CONFIG_CGROUPS
+#define INIT_GROUP_RWSEM(sig)  \
+   .group_rwsem = __RWSEM_INITIALIZER(sig.group_rwsem),
+#else
+#define INIT_GROUP_RWSEM(sig)
+#endif
+
 #ifdef CONFIG_CPUSETS
 #define INIT_CPUSET_SEQ(tsk)   
\
.mems_allowed_seq = SEQCNT_ZERO(tsk.mems_allowed_seq),
@@ -57,6 +64,7 @@ extern struct fs_struct init_fs;
INIT_PREV_CPUTIME(sig)  \
.cred_guard_mutex = \
 __MUTEX_INITIALIZER(sig.cred_guard_mutex), \
+   INIT_GROUP_RWSEM(sig)   \
 }
 
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4ab9da..b7b9501 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -762,6 +762,18 @@ struct signal_struct {
unsigned audit_tty_log_passwd;
struct tty_audit_buf *tty_audit_buf;
 #endif
+#ifdef CONFIG_CGROUPS
+   /*
+* group_rwsem prevents new tasks from entering the threadgroup and
+* member tasks from exiting, more specifically, setting of
+* PF_EXITING.  fork and exit paths are protected with this rwsem
+* using threadgroup_change_begin/end().  Users which require
+* threadgroup to remain stable should use threadgroup_[un]lock()
+* which also takes care of exec path.  Currently, cgroup is the
+* only user.
+*/
+   struct rw_semaphore group_rwsem;
+#endif
 
oom_flags_t oom_flags;
short oom_score_adj;/* OOM kill score adjustment */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 115091e..2c9eae6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -46,7 +46,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -104,8 +103,6 @@ static DEFINE_SPINLOCK(cgroup_idr_lock);
  */
 static DEFINE_SPINLOCK(release_agent_path_lock);
 
-struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
-
 #define cgroup_assert_mutex_or_rcu_locked()   

[PATCH cgroup/for-4.3-fixes 1/2] Revert "cgroup: simplify threadgroup locking"

2015-09-16 Thread Tejun Heo
From f9f9e7b776142fb1c0782cade004cc8e0147a199 Mon Sep 17 00:00:00 2001
From: Tejun Heo 
Date: Wed, 16 Sep 2015 11:51:12 -0400

This reverts commit b5ba75b5fc0e8404e2c50cb68f39bb6a53fc916f.

d59cfc09c32a ("sched, cgroup: replace signal_struct->group_rwsem with
a global percpu_rwsem") and b5ba75b5fc0e ("cgroup: simplify
threadgroup locking") changed how cgroup synchronizes against task
fork and exits so that it uses global percpu_rwsem instead of
per-process rwsem; unfortunately, the write [un]lock paths of
percpu_rwsem always involve synchronize_rcu_expedited() which turned
out to be too expensive.

Improvements for percpu_rwsem are scheduled to be merged in the coming
v4.4-rc1 merge window which alleviates this issue.  For now, revert
the two commits to restore per-process rwsem.  They will be re-applied
for the v4.4-rc1 merge window.

Signed-off-by: Tejun Heo 
Link: http://lkml.kernel.org/g/55f8097a.7000...@de.ibm.com
Reported-by: Christian Borntraeger 
Cc: Oleg Nesterov 
Cc: "Paul E. McKenney" 
Cc: Peter Zijlstra 
Cc: Paolo Bonzini 
Cc: sta...@vger.kernel.org # v4.2+
---
Hello,

These are the two reverts that I'm pushing through
cgroup/for-4.3-fixes.  I'll re-apply the reverted patches on the
for-4.4 branch so that they can land together with percpu_rwsem
updates during the next merge window.

Thanks.

 kernel/cgroup.c | 45 +
 1 file changed, 33 insertions(+), 12 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2cf0f79..115091e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2460,13 +2460,14 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
if (!cgrp)
return -ENODEV;
 
-   percpu_down_write(&cgroup_threadgroup_rwsem);
+retry_find_task:
rcu_read_lock();
if (pid) {
tsk = find_task_by_vpid(pid);
if (!tsk) {
+   rcu_read_unlock();
ret = -ESRCH;
-   goto out_unlock_rcu;
+   goto out_unlock_cgroup;
}
} else {
tsk = current;
@@ -2482,23 +2483,37 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 */
if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
ret = -EINVAL;
-   goto out_unlock_rcu;
+   rcu_read_unlock();
+   goto out_unlock_cgroup;
}
 
get_task_struct(tsk);
rcu_read_unlock();
 
+   percpu_down_write(&cgroup_threadgroup_rwsem);
+   if (threadgroup) {
+   if (!thread_group_leader(tsk)) {
+   /*
+* a race with de_thread from another thread's exec()
+* may strip us of our leadership, if this happens,
+* there is no choice but to throw this task away and
+* try again; this is
+* "double-double-toil-and-trouble-check locking".
+*/
+   percpu_up_write(&cgroup_threadgroup_rwsem);
+   put_task_struct(tsk);
+   goto retry_find_task;
+   }
+   }
+
ret = cgroup_procs_write_permission(tsk, cgrp, of);
if (!ret)
ret = cgroup_attach_task(cgrp, tsk, threadgroup);
 
-   put_task_struct(tsk);
-   goto out_unlock_threadgroup;
-
-out_unlock_rcu:
-   rcu_read_unlock();
-out_unlock_threadgroup:
percpu_up_write(&cgroup_threadgroup_rwsem);
+
+   put_task_struct(tsk);
+out_unlock_cgroup:
cgroup_kn_unlock(of->kn);
return ret ?: nbytes;
 }
@@ -2643,8 +2658,6 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
 
lockdep_assert_held(&cgroup_mutex);
 
-   percpu_down_write(&cgroup_threadgroup_rwsem);
-
/* look up all csses currently attached to @cgrp's subtree */
down_read(&css_set_rwsem);
css_for_each_descendant_pre(css, cgroup_css(cgrp, NULL)) {
@@ -2700,8 +2713,17 @@ static int cgroup_update_dfl_csses(struct cgroup *cgrp)
goto out_finish;
last_task = task;
 
+   percpu_down_write(&cgroup_threadgroup_rwsem);
+   /* raced against de_thread() from another thread? */
+   if (!thread_group_leader(task)) {
+   percpu_up_write(&cgroup_threadgroup_rwsem);
+   put_task_struct(task);
+   continue;
+   }
+
ret = cgroup_migrate(src_cset->dfl_cgrp, task, true);
 
+   percpu_up_write(&cgroup_threadgroup_rwsem);
put_task_struct(task);
 
if (WARN(ret, "cgroup: failed to update controllers for 
the default hierarchy (%d), further operations may cras

Re: [PATCH] Fixes: 805de8f43c20 (atomic: Replace atomic_{set,clear}_mask() usage)

2015-09-16 Thread Cornelia Huck
On Wed, 16 Sep 2015 09:13:50 -0400
"Jason J. Herne"  wrote:
> The offending commit accidentally replaces an atomic_clear with an
> atomic_or instead of an atomic_andnot in kvm_s390_vcpu_request_handled.
> The symptom is that kvm guests on s390 hang on startup.
> This patch simply replaces the incorrect atomic_or with atomic_andnot
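
For context, a minimal sketch of the semantic difference; the PROG_REQUEST
value below is illustrative, not the real s390 definition:

#include <stdio.h>

#define PROG_REQUEST 0x1u	/* illustrative bit value only */

int main(void)
{
	unsigned int prog20 = PROG_REQUEST;	/* request bit currently set */

	/* atomic_or() leaves the bit set, so the request never looks handled. */
	unsigned int with_or = prog20 | PROG_REQUEST;

	/* atomic_andnot() clears it, which is what "request handled" means. */
	unsigned int with_andnot = prog20 & ~PROG_REQUEST;

	printf("or: %#x, andnot: %#x\n", with_or, with_andnot);
	return 0;
}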
> 
> Signed-off-by: Jason J. Herne 
> ---
>  arch/s390/kvm/kvm-s390.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index c91eb94..49e76be 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -1574,7 +1574,7 @@ static void kvm_s390_vcpu_request(struct kvm_vcpu *vcpu)
> 
>  static void kvm_s390_vcpu_request_handled(struct kvm_vcpu *vcpu)
>  {
> - atomic_or(PROG_REQUEST, &vcpu->arch.sie_block->prog20);
> + atomic_andnot(PROG_REQUEST, &vcpu->arch.sie_block->prog20);
>  }
> 
>  /*

Acked-by: Cornelia Huck 



Re: [PATCH] arm64: KVM: Fix user access for debug registers

2015-09-16 Thread Marc Zyngier
On 16/09/15 15:35, Alex Bennée wrote:
> 
> Christoffer Dall  writes:
> 
>> On Wed, Sep 16, 2015 at 11:41:10AM +0100, Marc Zyngier wrote:
>>> When setting the debug register from userspace, make sure that
>>> copy_from_user() is called with its parameters in the expected
>>> order. It otherwise doesn't do what you think.
>>>
>>> Reported-by: Peter Maydell 
>>> Cc: Alex Bennée 
>>> Fixes: 84e690bfbed1 ("KVM: arm64: introduce vcpu->arch.debug_ptr")
>>> Signed-off-by: Marc Zyngier 
>>
>> yikes!
> 
> OK I'm now muchly confused as to how it could have worked...

Well, we only write the registers at boot time, so corrupting userspace
went unnoticed. I was only able to reproduce this on a model with PAN
enabled.

Copy-paste bug.

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH] Fixes: 805de8f43c20 (atomic: Replace atomic_{set,clear}_mask() usage)

2015-09-16 Thread Peter Zijlstra
On Wed, Sep 16, 2015 at 09:13:50AM -0400, Jason J. Herne wrote:
> The offending commit accidentally replaces an atomic_clear with an
> atomic_or instead of an atomic_andnot in kvm_s390_vcpu_request_handled.
> The symptom is that kvm guests on s390 hang on startup.
> This patch simply replaces the incorrect atomic_or with atomic_andnot
> 
> Signed-off-by: Jason J. Herne 

Urgh, sorry about that.

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH] arm64: KVM: Fix user access for debug registers

2015-09-16 Thread Alex Bennée

Christoffer Dall  writes:

> On Wed, Sep 16, 2015 at 11:41:10AM +0100, Marc Zyngier wrote:
>> When setting the debug register from userspace, make sure that
>> copy_from_user() is called with its parameters in the expected
>> order. It otherwise doesn't do what you think.
>> 
>> Reported-by: Peter Maydell 
>> Cc: Alex Bennée 
>> Fixes: 84e690bfbed1 ("KVM: arm64: introduce vcpu->arch.debug_ptr")
>> Signed-off-by: Marc Zyngier 
>
> yikes!

OK I'm now muchly confused as to how it could have worked...

>
> Reviewed-by: Christoffer Dall 

-- 
Alex Bennée


Re: [PATCH] Fixes: 805de8f43c20 (atomic: Replace atomic_{set,clear}_mask() usage)

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 16:34, Christian Borntraeger wrote:
> Am 16.09.2015 um 15:13 schrieb Jason J. Herne:
>> The offending commit accidentally replaces an atomic_clear with an
>> atomic_or instead of an atomic_andnot in kvm_s390_vcpu_request_handled.
>> The symptom is that kvm guests on s390 hang on startup.
>> This patch simply replaces the incorrect atomic_or with atomic_andnot
>>
>> Signed-off-by: Jason J. Herne 
> 
> Acked-by: Christian Borntraeger 
> 
> Paolo,
> can you take this via kvm tree for 4.3?
> 
> Maybe move the subject line into the mail body and rephrase the subject line.

Sure.

Paolo


Re: [PATCH] Fixes: 805de8f43c20 (atomic: Replace atomic_{set,clear}_mask() usage)

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 15:13 schrieb Jason J. Herne:
> The offending commit accidentally replaces an atomic_clear with an
> atomic_or instead of an atomic_andnot in kvm_s390_vcpu_request_handled.
> The symptom is that kvm guests on s390 hang on startup.
> This patch simply replaces the incorrect atomic_or with atomic_andnot
> 
> Signed-off-by: Jason J. Herne 

Acked-by: Christian Borntraeger 

Paolo,
can you take this via kvm tree for 4.3?

Maybe move the subject line into the mail body and rephrase the subject line.

Christian


> ---
>  arch/s390/kvm/kvm-s390.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index c91eb94..49e76be 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -1574,7 +1574,7 @@ static void kvm_s390_vcpu_request(struct kvm_vcpu *vcpu)
> 
>  static void kvm_s390_vcpu_request_handled(struct kvm_vcpu *vcpu)
>  {
> - atomic_or(PROG_REQUEST, &vcpu->arch.sie_block->prog20);
> + atomic_andnot(PROG_REQUEST, &vcpu->arch.sie_block->prog20);
>  }
> 
>  /*
> 



Re: [PATCH] arm64: KVM: Fix user access for debug registers

2015-09-16 Thread Alex Bennée

Marc Zyngier  writes:

> When setting the debug register from userspace, make sure that
> copy_from_user() is called with its parameters in the expected
> order. It otherwise doesn't do what you think.
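
For reference, the uaccess prototypes involved (assuming the usual kernel
declarations) -- the destination comes first, which is why swapping the
arguments turned a read from userspace into a write over it:

/* Declared in <linux/uaccess.h>; both return the number of bytes that
 * could NOT be copied (0 on success). Destination first, like memcpy(). */
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);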

Oops. Well, that exposes a big hole in my testing. While I tested that
debugging inside the guest worked before and after the guest itself was
debugged, I think GDB's tendency to reload all the debug registers between
each step may have masked this.

Debugging GDB in action, or some sort of migration event, would of course
have tripped over this, but I'm afraid my testing wasn't evil enough.

Anyway have a:

Reviewed-by: Alex Bennée 

>
> Reported-by: Peter Maydell 
> Cc: Alex Bennée 
> Fixes: 84e690bfbed1 ("KVM: arm64: introduce vcpu->arch.debug_ptr")
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kvm/sys_regs.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index b41607d..1d0463e 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -272,7 +272,7 @@ static int set_bvr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
>  {
>   __u64 *r = &vcpu->arch.vcpu_debug_state.dbg_bvr[rd->reg];
>  
> - if (copy_from_user(uaddr, r, KVM_REG_SIZE(reg->id)) != 0)
> + if (copy_from_user(r, uaddr, KVM_REG_SIZE(reg->id)) != 0)
>   return -EFAULT;
>   return 0;
>  }
> @@ -314,7 +314,7 @@ static int set_bcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
>  {
>   __u64 *r = &vcpu->arch.vcpu_debug_state.dbg_bcr[rd->reg];
>  
> - if (copy_from_user(uaddr, r, KVM_REG_SIZE(reg->id)) != 0)
> + if (copy_from_user(r, uaddr, KVM_REG_SIZE(reg->id)) != 0)
>   return -EFAULT;
>  
>   return 0;
> @@ -358,7 +358,7 @@ static int set_wvr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
>  {
>   __u64 *r = &vcpu->arch.vcpu_debug_state.dbg_wvr[rd->reg];
>  
> - if (copy_from_user(uaddr, r, KVM_REG_SIZE(reg->id)) != 0)
> + if (copy_from_user(r, uaddr, KVM_REG_SIZE(reg->id)) != 0)
>   return -EFAULT;
>   return 0;
>  }
> @@ -400,7 +400,7 @@ static int set_wcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *rd,
>  {
>   __u64 *r = &vcpu->arch.vcpu_debug_state.dbg_wcr[rd->reg];
>  
> - if (copy_from_user(uaddr, r, KVM_REG_SIZE(reg->id)) != 0)
> + if (copy_from_user(r, uaddr, KVM_REG_SIZE(reg->id)) != 0)
>   return -EFAULT;
>   return 0;
>  }

-- 
Alex Bennée


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 16:16, Tejun Heo wrote:
> On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
>>> > > If the revert isn't easy, I think backporting rcu_sync is the best bet.
>> > 
>> > I leave this to Paul and Tejun... at least I think this is not v4.2 
>> > material.
> Will route reverts through cgroup branch.  Should be pretty painless.
> Nice job on percpu_rwsem.  It's so much better than having to come up
> with a separate middleground solution.

Thanks all for the quick resolution!

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Wed, Sep 16, 2015 at 02:22:49PM +0200, Oleg Nesterov wrote:
> > If the revert isn't easy, I think backporting rcu_sync is the best bet.
> 
> I leave this to Paul and Tejun... at least I think this is not v4.2 material.

Will route reverts through cgroup branch.  Should be pretty painless.
Nice job on percpu_rwsem.  It's so much better than having to come up
with a separate middleground solution.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] arm64: KVM: Fix user access for debug registers

2015-09-16 Thread Marc Zyngier
On 16/09/15 14:41, Christoffer Dall wrote:
> On Wed, Sep 16, 2015 at 11:41:10AM +0100, Marc Zyngier wrote:
>> When setting the debug register from userspace, make sure that
>> copy_from_user() is called with its parameters in the expected
>> order. It otherwise doesn't do what you think.
>>
>> Reported-by: Peter Maydell 
>> Cc: Alex Bennée 
>> Fixes: 84e690bfbed1 ("KVM: arm64: introduce vcpu->arch.debug_ptr")
>> Signed-off-by: Marc Zyngier 
>
> yikes!
>
> Reviewed-by: Christoffer Dall 
>

Thanks. Merged and pushed out to -next, together with the physaddr_t patch.

M.
--
Jazz is not dead. It just smells funny...




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: vmx: fix VPID is 0000H in non-root operation

2015-09-16 Thread Wanpeng Li

On 9/16/15 8:42 PM, Paolo Bonzini wrote:


On 16/09/2015 14:32, Jan Kiszka wrote:

BTW, what will happen if allocate_vpid runs out of free slots and
returns 0? Will we always fail then...?

The return value of vmx_secondary_exec_control will not have
SECONDARY_EXEC_ENABLE_VPID, so it's okay.  However, I think
we need this in the nested VPID patches:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c23482cda1a7..e6859b45b00b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9456,7 +9460,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
else
vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);

-   if (enable_vpid) {
+   if (vmx->nested.vpid02) {
/*
 * There is no direct mapping between vpid02 and vpid12, the
 * vpid02 is per-vCPU for L0 and reused while the value of



Looks good to me.

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] arm64: KVM: Fix user access for debug registers

2015-09-16 Thread Christoffer Dall
On Wed, Sep 16, 2015 at 11:41:10AM +0100, Marc Zyngier wrote:
> When setting the debug register from userspace, make sure that
> copy_from_user() is called with its parameters in the expected
> order. It otherwise doesn't do what you think.
> 
> Reported-by: Peter Maydell 
> Cc: Alex Bennée 
> Fixes: 84e690bfbed1 ("KVM: arm64: introduce vcpu->arch.debug_ptr")
> Signed-off-by: Marc Zyngier 

yikes!

Reviewed-by: Christoffer Dall 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Fixes: 805de8f43c20 (atomic: Replace atomic_{set,clear}_mask() usage)

2015-09-16 Thread Jason J. Herne
The offending commit accidentally replaces an atomic_clear with an
atomic_or instead of an atomic_andnot in kvm_s390_vcpu_request_handled.
The symptom is that kvm guests on s390 hang on startup.
This patch simply replaces the incorrect atomic_or with atomic_andnot.

Signed-off-by: Jason J. Herne 
---
 arch/s390/kvm/kvm-s390.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index c91eb94..49e76be 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1574,7 +1574,7 @@ static void kvm_s390_vcpu_request(struct kvm_vcpu *vcpu)
 
 static void kvm_s390_vcpu_request_handled(struct kvm_vcpu *vcpu)
 {
-   atomic_or(PROG_REQUEST, &vcpu->arch.sie_block->prog20);
+   atomic_andnot(PROG_REQUEST, &vcpu->arch.sie_block->prog20);
 }
 
 /*
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
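A minimal userspace illustration of the point of the fix: clearing a request bit needs an AND with the complement (atomic_andnot), while an OR leaves it set. The PROG_REQUEST value below is a stand-in, not the real s390 definition.

#include <stdatomic.h>
#include <stdio.h>

#define PROG_REQUEST 0x0001u	/* stand-in value for illustration */

int main(void)
{
	atomic_uint prog20 = PROG_REQUEST;		/* request pending */

	atomic_fetch_or(&prog20, PROG_REQUEST);		/* buggy variant: bit stays set */
	printf("after or:     %#x\n", atomic_load(&prog20));

	atomic_fetch_and(&prog20, ~PROG_REQUEST);	/* and-not: bit is cleared */
	printf("after andnot: %#x\n", atomic_load(&prog20));
	return 0;
}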


Re: [Qemu-devel] [PATCH 1/2] target-i386: disable LINT0 after reset

2015-09-16 Thread Nadav Amit
I don’t happen to have a similar platform. On regular qemu/kvm runs with
q35, I see APIC_LVT0 is set once to 0x8700 on the BSP - as expected:

 qemu-system-x86-19345 [011] d... 2583274.503018: kvm_entry: vcpu 0
 qemu-system-x86-19345 [011] d... 2583274.503019: kvm_exit: reason APIC_ACCESS 
rip 0x7ffb8288 info 1350 0
 qemu-system-x86-19345 [011]  2583274.503020: kvm_emulate_insn: 
0:7ffb8288:c7 05 50 03 e0 fe 00 87 00 00 (prot32)
 qemu-system-x86-19345 [011]  2583274.503021: kvm_mmio: mmio write len 4 
gpa 0xfee00350 val 0x8700
 qemu-system-x86-19345 [011]  2583274.503021: kvm_apic: apic_write 
APIC_LVT0 = 0x8700

If someone sends a trace ( http://www.linux-kvm.org/page/Tracing ) of the
failure, I would be happy to assist.

Nadav

Gerd Hoffmann  wrote:

> On Mi, 2015-09-16 at 07:23 +0200, Jan Kiszka wrote:
>> On 2015-09-15 23:19, Alex Williamson wrote:
>>> On Mon, 2015-04-13 at 02:32 +0300, Nadav Amit wrote:
 Due to old Seabios bug, QEMU reenable LINT0 after reset. This bug is long 
 gone
 and therefore this hack is no longer needed.  Since it violates the
 specifications, it is removed.
 
 Signed-off-by: Nadav Amit 
 ---
 hw/intc/apic_common.c | 9 -
 1 file changed, 9 deletions(-)
>>> 
>>> Please see bug: https://bugs.launchpad.net/qemu/+bug/1488363
>>> 
>>> Is this bug perhaps not as long gone as we thought, or is there
>>> something else going on here?  Thanks,
>> 
>> I would say, someone needs to check if the SeaBIOS line that is supposed
>> to enable LINT0 is actually executed on one of the broken systems and,
>> if not, why not.
> 
> There is only one reason (beside miscompiling seabios with
> CONFIG_QEMU=n) why seabios would skip acpi initialization, and that is
> apic not being present according to cpuid:
> 
>cpuid(1, &eax, &ebx, &ecx, &cpuid_features);
>if (eax < 1 || !(cpuid_features & CPUID_APIC)) {
>// No apic - only the main cpu is present.
> 
> https://www.kraxel.org/cgit/seabios/tree/src/fw/smp.c#n79
> 
> cheers,
>  Gerd
> 
> PS: coreboot tripped over this too, fixed just a few days ago.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 14:22 schrieb Oleg Nesterov:
>>  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.
> And percpu_up_write() doesn't do synchronize_sched() at all.
> 
>> it only
>> makes it more rare.
> 
> Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
> a single synchronize_sched().

And in fact it does. Paolo suggested to trace how often we call 
synchronize_sched so I applied some advanced printk debugging technology ;-)
Until login I have 41 and after starting the 70 guests this went up to 48.
Nice work.

> 
>> So it's possible that it isn't eliminating the root
>> cause of the problem.
> 
> We will see... Just in case, currently the usage of percpu_down_write()
> is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
> this needs some WIP changes in rcu_sync. Plus we can do more improvements,
> but this is off-topic right now.
> 
> Oleg.
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Oleg Nesterov
On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 14:22, Oleg Nesterov wrote:
> > >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> >
> > Yes, but it eliminates _expedited(). This is good, but otoh this means
> > that (say) individual __cgroup_procs_write() can take much more time.
> > However, it won't block the readers and/or disturb the whole system.
>
> According to Christian, removing the _expedited() "makes things worse"

Yes sure, we can not just remove _expedited() from down/up_read().

> in that the system takes ages to boot up and systemd timeouts.

Yes, this is clear

> So I'm
> still a bit wary about anything that uses RCU for the cgroups write side.
>
> However, rcu_sync is okay with him, so perhaps it is really really
> effective.  Christian, can you instrument how many synchronize_sched
> (out of the 6479 cgroup_procs_write calls) are actually executed at boot
> with the rcu rework?

Heh, another change I have in mind. It would be nice to add some trace
points. But firstly we should merge the current code.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: vmx: fix VPID is 0000H in non-root operation

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 14:32, Jan Kiszka wrote:
> 
> BTW, what will happen if allocate_vpid runs out of free slots and
> returns 0? Will we always fail then...?

The return value of vmx_secondary_exec_control will not have
SECONDARY_EXEC_ENABLE_VPID, so it's okay.  However, I think
we need this in the nested VPID patches:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c23482cda1a7..e6859b45b00b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9456,7 +9460,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
else
vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
-   if (enable_vpid) {
+   if (vmx->nested.vpid02) {
/*
 * There is no direct mapping between vpid02 and vpid12, the
 * vpid02 is per-vCPU for L0 and reused while the value of

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 14:22, Oleg Nesterov wrote:
> >  The issue is that rcu_sync doesn't eliminate synchronize_sched,
> 
> Yes, but it eliminates _expedited(). This is good, but otoh this means
> that (say) individual __cgroup_procs_write() can take much more time.
> However, it won't block the readers and/or disturb the whole system.

According to Christian, removing the _expedited() "makes things worse"
in that the system takes ages to boot up and systemd timeouts.  So I'm
still a bit wary about anything that uses RCU for the cgroups write side.

However, rcu_sync is okay with him, so perhaps it is really really
effective.  Christian, can you instrument how many synchronize_sched
(out of the 6479 cgroup_procs_write calls) are actually executed at boot
with the rcu rework?

Paolo

> And percpu_up_write() doesn't do synchronize_sched() at all.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: vmx: fix VPID is 0000H in non-root operation

2015-09-16 Thread Jan Kiszka
On 2015-09-16 13:31, Wanpeng Li wrote:
> Reference SDM 28.1:
> 
> The current VPID is 0000H in the following situations:
> — Outside VMX operation. (This includes operation in system-management 
>   mode under the default treatment of SMIs and SMM with VMX operation; 
>   see Section 34.14.)
> — In VMX root operation.
> — In VMX non-root operation when the “enable VPID” VM-execution control 
>   is 0.
> 
> The VPID should never be 0000H in non-root operation when "enable VPID" 
> VM-execution control is 1. However, commit (34a1cd60: 'kvm: x86: vmx: 
> move some vmx setting from vmx_init() to hardware_setup()') removed the 
> code which reserves 0000H for VMX root operation. 
> 
> This patch fixes it by reintroducing the reservation of 0000H for VMX root operation.
> 
> Reported-by: Wincy Van 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9ff6a3f..a63b9ca 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6056,6 +6056,8 @@ static __init int hardware_setup(void)
>   memcpy(vmx_msr_bitmap_longmode_x2apic,
>   vmx_msr_bitmap_longmode, PAGE_SIZE);
>  
> + set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
> +
>   if (enable_apicv) {
>   for (msr = 0x800; msr <= 0x8ff; msr++)
>   vmx_disable_intercept_msr_read_x2apic(msr);
> 

Good point.

BTW, what will happen if allocate_vpid runs out of free slots and
returns 0? Will we always fail then...?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Oleg Nesterov
On 09/16, Paolo Bonzini wrote:
>
>
> On 16/09/2015 10:57, Christian Borntraeger wrote:
> > Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
> >>
> >>
> >> On 15/09/2015 19:38, Paul E. McKenney wrote:
> >>> Excellent points!
> >>>
> >>> Other options in such situations include the following:
> >>>
> >>> o Rework so that the code uses call_rcu*() instead of *_expedited().
> >>>
> >>> o Maintain a per-task or per-CPU counter so that every so many
> >>>   *_expedited() invocations instead uses the non-expedited
> >>>   counterpart.  (For example, synchronize_rcu instead of
> >>>   synchronize_rcu_expedited().)
> >>
> >> Or just use ratelimit (untested):
> >
> > One of my tests was to always replace synchronize_sched_expedited with
> > synchronize_sched and things turned out to be even worse. Not sure if
> > it makes sense to test yopur in-the-middle approach?
>
> I don't think it applies here, since down_write/up_write is a
> synchronous API.
>
> If the revert isn't easy, I think backporting rcu_sync is the best bet.

I leave this to Paul and Tejun... at least I think this is not v4.2 material.

>  The issue is that rcu_sync doesn't eliminate synchronize_sched,

Yes, but it eliminates _expedited(). This is good, but otoh this means
that (say) individual __cgroup_procs_write() can take much more time.
However, it won't block the readers and/or disturb the whole system.
And percpu_up_write() doesn't do synchronize_sched() at all.

> it only
> makes it more rare.

Yes, so we can hope that multiple __cgroup_procs_write()'s can "share"
a single synchronize_sched().

> So it's possible that it isn't eliminating the root
> cause of the problem.

We will see... Just in case, currently the usage of percpu_down_write()
is suboptimal. We do not need to do ->sync() under cgroup_mutex. But
this needs some WIP changes in rcu_sync. Plus we can do more improvements,
but this is off-topic right now.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
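A rough sketch of the second option in the list quoted above (keep the expedited grace periods, but let every Nth call take the non-expedited slow path). This is kernel-style illustration only: the counter, the threshold of 16 and the helper name are invented, and the thread ultimately went with the rcu_sync rework instead.

#include <linux/percpu.h>
#include <linux/rcupdate.h>

static DEFINE_PER_CPU(unsigned int, exp_count);

static void sync_maybe_expedited(void)
{
	/* every 16th caller on this CPU pays for a full grace period */
	if (this_cpu_inc_return(exp_count) % 16 == 0)
		synchronize_sched();
	else
		synchronize_sched_expedited();
}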


Re: [PATCH] KVM: add halt_attempted_poll to VCPU stats

2015-09-16 Thread Wanpeng Li

On 9/16/15 6:12 PM, Christian Borntraeger wrote:

Am 15.09.2015 um 18:27 schrieb Paolo Bonzini:

This new statistic can help diagnosing VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.

For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then


For the first 481us wakeup, block_ns should be 481us; since block_ns > 
halt_poll_ns (480us), a long halt is detected and the vcpu->halt_poll_ns 
will be shrunk.



is consuming about 30% more CPU than it would use without
polling.  This would show as an abnormally high number of
attempted polling compared to the successful polls.

Cc: Christian Borntraeger 
Signed-off-by: Paolo Bonzini 

Acked-by: Christian Borntraeger 

yes, this will help to detect some bad cases, but not all.

PS:
upstream maintenance keeps me really busy at the moment :-)
I am looking into a case right now, where auto polling goes
completely nuts on my system:

guest1: 8vcpus  guest2: 1 vcpu
iperf with 25 process (-P25) from guest1 to guest2.

I/O interrupts on s390 are floating (pending on all CPUs) so on
ALL VCPUs that go to sleep, polling will consider any pending
network interrupt as successful poll. So with auto polling the
guest consumes up to 5 host CPUs without auto polling only 1.
Reducing halt_poll_ns to 100000 seems to work (goes back to 
1 cpu).

The proper way might be to feedback the result of the
interrupt dequeue into the heuristics. Don't know yet how
to handle that properly.


If this can be reproduced on x86 platform?

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
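The arithmetic in the commit message quoted above can be checked with a few lines of C. The grow-from-10us-and-double rule is inferred from the 10+20+40+... series in the message; the real grow/shrink logic lives in virt/kvm/kvm_main.c.

#include <stdio.h>

int main(void)
{
	int cap_us = 480;	/* halt_poll_ns = 480000 ns */
	int wakeup_us[] = { 479, 481, 479, 481, 479, 481, 479 };
	int poll_us = 0, wasted = 0, blocked = 0, i;

	for (i = 0; i < 7; i++) {
		/* window starts at 10us and doubles on every failed poll, capped */
		poll_us = poll_us ? poll_us * 2 : 10;
		if (poll_us > cap_us)
			poll_us = cap_us;
		wasted += poll_us;		/* every poll misses */
		blocked += wakeup_us[i];
	}
	printf("wasted %dus out of %dus (~%d%%)\n",
	       wasted, blocked, 100 * wasted / blocked);	/* 1110us / 3359us, ~33% */
	return 0;
}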


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 13:03 schrieb Tejun Heo:
> Hello,
> 
> On Wed, Sep 16, 2015 at 12:58:00PM +0200, Christian Borntraeger wrote:
>> FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
>> just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
>> systemd seems to do that for the processes. 
>>
>> So a revert is really the right thing to do. In fact, I dont know if the
>> rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
>> be NOT a cold/seldom case.
> 
> Booting would usually be the hottest operation for that and it's still
> *relatively* cold path compared to the reader side which is task
> fork/exit paths.  The whole point is shift overhead from hotter reader
> side.  Can you see problems with percpu_rwsem rework?

As I said, the rcu tree with that change seems to work fine on my
system. This needs more testing on other machines, though. I guess a 
revert plus a re-add in the 4.4 merge window should give us enough test
coverage.

Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: vmx: fix VPID is 0000H in non-root operation

2015-09-16 Thread Wanpeng Li
Reference SDM 28.1:

The current VPID is 0000H in the following situations:
— Outside VMX operation. (This includes operation in system-management 
  mode under the default treatment of SMIs and SMM with VMX operation; 
  see Section 34.14.)
— In VMX root operation.
— In VMX non-root operation when the “enable VPID” VM-execution control 
  is 0.

The VPID should never be 0000H in non-root operation when "enable VPID" 
VM-execution control is 1. However, commit (34a1cd60: 'kvm: x86: vmx: 
move some vmx setting from vmx_init() to hardware_setup()') removed the 
code which reserves 0000H for VMX root operation. 

This patch fixes it by reintroducing the reservation of 0000H for VMX root operation.

Reported-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9ff6a3f..a63b9ca 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6056,6 +6056,8 @@ static __init int hardware_setup(void)
memcpy(vmx_msr_bitmap_longmode_x2apic,
vmx_msr_bitmap_longmode, PAGE_SIZE);
 
+   set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
+
if (enable_apicv) {
for (msr = 0x800; msr <= 0x8ff; msr++)
vmx_disable_intercept_msr_read_x2apic(msr);
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
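A small userspace illustration of why the one-line fix works: with bit 0 set at setup time, a find_first_zero_bit()-style allocator can never hand out VPID 0 (which the SDM reserves for VMX root operation), and running out of slots falls back to 0, i.e. "run without a VPID". Sizes and helpers below are simplified stand-ins for the kernel bitmap API.

#include <stdio.h>

#define NR_VPIDS 64

static unsigned char vpid_bitmap[NR_VPIDS];

static int allocate_vpid(void)
{
	int vpid;

	for (vpid = 0; vpid < NR_VPIDS; vpid++)
		if (!vpid_bitmap[vpid])
			break;
	if (vpid == NR_VPIDS)
		return 0;	/* out of slots: VPID 0 means "no VPID" */
	vpid_bitmap[vpid] = 1;
	return vpid;
}

int main(void)
{
	vpid_bitmap[0] = 1;			/* 0 is reserved for the host */
	printf("%d\n", allocate_vpid());	/* prints 1, never 0 */
	return 0;
}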


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Tue, Sep 15, 2015 at 09:35:47PM -0700, Paul E. McKenney wrote:
> > > I am suggesting trying the options and seeing what works best, then
> > > working to convince people as needed.
> > 
> > Yeah, sure thing.  Let's wait for Christian.
> 
> Indeed.  Is there enough benefit to risk jamming this thing into 4.3?
> I believe that 4.4 should be a no-brainer.

In that case, I'm gonna revert the threadgroup percpu_rwsem
conversion patches through -stable and reapply them for the 4.4 merge
window.

Thanks!

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Tejun Heo
Hello,

On Wed, Sep 16, 2015 at 12:58:00PM +0200, Christian Borntraeger wrote:
> FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
> just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
> systemd seems to do that for the processes. 
>
> So a revert is really the right thing to do. In fact, I dont know if the
> rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
> be NOT a cold/seldom case.

Booting would usually be the hottest operation for that and it's still
*relatively* cold path compared to the reader side which is task
fork/exit paths.  The whole point is shift overhead from hotter reader
side.  Can you see problems with percpu_rwsem rework?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: suspicious RCU usage with kvm_pr

2015-09-16 Thread Denis Kirjanov
On 9/16/15, Thomas Huth  wrote:
> On 16/09/15 10:51, Denis Kirjanov wrote:
>> Hi,
>>
>> I see the following trace on qemu startup (ps700 blade):
>>
>> v4.2-11169-g64d1def
>>
>>
>> [  143.369638] ===
>> [  143.369640] [ INFO: suspicious RCU usage. ]
>> [  143.369643] 4.2.0-11169-g64d1def #10 Tainted: G S
>> [  143.369645] ---
>> [  143.369647] arch/powerpc/kvm/../../../virt/kvm/kvm_main.c:3310
>> suspicious rcu_dereference_check() usage!
>> [  143.369649]
>> other info that might help us debug this:
>>
>> [  143.369652]
>> rcu_scheduler_active = 1, debug_locks = 1
>> [  143.369655] 1 lock held by qemu-system-ppc/2292:
>> [  143.369656]  #0:  (&vcpu->mutex){+.+.+.}, at: []
>> .vcpu_load+0x2c/0xb0 [kvm]
>> [  143.369672]
>> stack backtrace:
>> [  143.369675] CPU: 12 PID: 2292 Comm: qemu-system-ppc Tainted: G S
>>   4.2.0-11169-g64d1def #10
>> [  143.369677] Call Trace:
>> [  143.369682] [c001d08bf200] [c0816dd0]
>> .dump_stack+0x98/0xd4 (unreliable)
>> [  143.369687] [c001d08bf280] [c00f7058]
>> .lockdep_rcu_suspicious+0x108/0x170
>> [  143.369696] [c001d08bf310] [d42296d8]
>> .kvm_io_bus_read+0x1d8/0x220 [kvm]
>> [  143.369705] [c001d08bf3c0] [d422f980]
>> .kvmppc_h_logical_ci_load+0x60/0xe0 [kvm]
>
> Could it be that we need to srcu_read_lock(&vcpu->kvm->srcu) before
> calling the kvm_io_bus_read/write() function in the
> kvmppc_h_logical_ci_load/store() function?

I haven't had time to dig into this. I'll try it.

Thanks

>
>  Thomas
>
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 09:44 schrieb Christian Borntraeger:
> Am 16.09.2015 um 03:24 schrieb Tejun Heo:
>> Hello, Paul.
>>
>> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
>>> Well, the decision as to what is too big for -stable is owned by the
>>> -stable maintainers, not by me.
>>
>> Is it tho?  Usually the subsystem maintainer knows the best and has
>> most say in it.  I was mostly curious whether you'd think that the
>> changes would be too risky.  If not, great.
>>
>>> I am suggesting trying the options and seeing what works best, then
>>> working to convince people as needed.
>>
>> Yeah, sure thing.  Let's wait for Christian.
> 
> Well, I have optimized my testcase so that it puts enough pressure on
> the system to confuse systemd (the older 209 version, which still has
> some event loop issues), so that systemd restarts the journal daemon and does
> several other recoveries.
> To avoid regressions - even for somewhat shaky userspaces - we should
> consider a revert for 4.2 stable.
> There are several followup patches, which makes the revert non-trivial,
> though.
> 
> The rework of the percpu rwsem seems to work fine, but we are beyond the
> merge window so 4.4 seems better to me. (and consider a revert for 4.3)

FWIW, I added a printk to percpu_down_write. With KVM and uprobes disabled,
just booting up a fedora20 gives me __6749__ percpu_down_write calls on 4.2.
systemd seems to do that for the processes. 

So a revert is really the right thing to do. In fact, I dont know if the
rcu_sync_enter rework is enough. With systemd setting the cgroup seem to
be NOT a cold/seldom case.

Christian







--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] arm64: KVM: Fix user access for debug registers

2015-09-16 Thread Marc Zyngier
When setting the debug register from userspace, make sure that
copy_from_user() is called with its parameters in the expected
order. It otherwise doesn't do what you think.

Reported-by: Peter Maydell 
Cc: Alex Bennée 
Fixes: 84e690bfbed1 ("KVM: arm64: introduce vcpu->arch.debug_ptr")
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/sys_regs.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index b41607d..1d0463e 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -272,7 +272,7 @@ static int set_bvr(struct kvm_vcpu *vcpu, const struct 
sys_reg_desc *rd,
 {
__u64 *r = &vcpu->arch.vcpu_debug_state.dbg_bvr[rd->reg];
 
-   if (copy_from_user(uaddr, r, KVM_REG_SIZE(reg->id)) != 0)
+   if (copy_from_user(r, uaddr, KVM_REG_SIZE(reg->id)) != 0)
return -EFAULT;
return 0;
 }
@@ -314,7 +314,7 @@ static int set_bcr(struct kvm_vcpu *vcpu, const struct 
sys_reg_desc *rd,
 {
__u64 *r = &vcpu->arch.vcpu_debug_state.dbg_bcr[rd->reg];
 
-   if (copy_from_user(uaddr, r, KVM_REG_SIZE(reg->id)) != 0)
+   if (copy_from_user(r, uaddr, KVM_REG_SIZE(reg->id)) != 0)
return -EFAULT;
 
return 0;
@@ -358,7 +358,7 @@ static int set_wvr(struct kvm_vcpu *vcpu, const struct 
sys_reg_desc *rd,
 {
__u64 *r = &vcpu->arch.vcpu_debug_state.dbg_wvr[rd->reg];
 
-   if (copy_from_user(uaddr, r, KVM_REG_SIZE(reg->id)) != 0)
+   if (copy_from_user(r, uaddr, KVM_REG_SIZE(reg->id)) != 0)
return -EFAULT;
return 0;
 }
@@ -400,7 +400,7 @@ static int set_wcr(struct kvm_vcpu *vcpu, const struct 
sys_reg_desc *rd,
 {
__u64 *r = &vcpu->arch.vcpu_debug_state.dbg_wcr[rd->reg];
 
-   if (copy_from_user(uaddr, r, KVM_REG_SIZE(reg->id)) != 0)
+   if (copy_from_user(r, uaddr, KVM_REG_SIZE(reg->id)) != 0)
return -EFAULT;
return 0;
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
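For reference, the direction of the two uaccess helpers (the destination always comes first). The sketch below is kernel-style and the set_/get_ names are illustrative, not the actual sys_regs.c functions.

#include <linux/errno.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/*
 * unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
 * unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
 * Both return the number of bytes that could NOT be copied (0 on success).
 */
static int set_reg_sketch(__u64 *reg, const void __user *uaddr, unsigned long size)
{
	if (copy_from_user(reg, uaddr, size) != 0)	/* kernel <- user */
		return -EFAULT;
	return 0;
}

static int get_reg_sketch(const __u64 *reg, void __user *uaddr, unsigned long size)
{
	if (copy_to_user(uaddr, reg, size) != 0)	/* user <- kernel */
		return -EFAULT;
	return 0;
}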


Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 11:54, Houcheng Lin wrote:
> 2015-09-16 17:38 GMT+08:00 Paolo Bonzini :
>>
>> Actually it's even simpler.  shm_open is basically just
>>
>> char *s;
>> int fd;
>>
>> asprintf(&s, "/dev/shm/%s", name);
>> fd = open(s, oflag | O_CLOEXEC, mode);
>> free(s);
>> return fd;
>>
>> plus some error checking.  Do Android systems have /dev/shm?
>>
>> Paolo
> 
> It's simple, thanks.
> Android has no /dev/shm. Though we could mknod it, that would need root
> privilege to do it.

Oh well.  Then I think it's okay to disable ivshmem on Android.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] kvm/x86: Hyper-V HV_X64_MSR_VP_INDEX export for QEMU.

2015-09-16 Thread Denis V. Lunev
From: Andrey Smetanin 

Insert Hyper-V HV_X64_MSR_VP_INDEX into msr's emulated list,
so QEMU can set Hyper-V features cpuid HV_X64_MSR_VP_INDEX_AVAILABLE
bit correctly. KVM emulation part is in place already.

Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Paolo Bonzini 
CC: Gleb Natapov 
---
 arch/x86/kvm/x86.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5a14a66..c2028ac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -952,6 +952,7 @@ static u32 emulated_msrs[] = {
HV_X64_MSR_CRASH_P0, HV_X64_MSR_CRASH_P1, HV_X64_MSR_CRASH_P2,
HV_X64_MSR_CRASH_P3, HV_X64_MSR_CRASH_P4, HV_X64_MSR_CRASH_CTL,
HV_X64_MSR_RESET,
+   HV_X64_MSR_VP_INDEX,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_KVM_PV_EOI_EN,
 
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] kvm/x86: Hyper-V HV_X64_MSR_RESET msr

2015-09-16 Thread Denis V. Lunev
From: Andrey Smetanin 

HV_X64_MSR_RESET msr is used by Hyper-V based Windows guest
to reset guest VM by hypervisor.

Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Paolo Bonzini 
CC: Gleb Natapov 
---
 arch/x86/include/uapi/asm/hyperv.h |  3 +++
 arch/x86/kvm/hyperv.c  | 10 ++
 arch/x86/kvm/x86.c |  7 +++
 include/linux/kvm_host.h   |  1 +
 4 files changed, 21 insertions(+)

diff --git a/arch/x86/include/uapi/asm/hyperv.h 
b/arch/x86/include/uapi/asm/hyperv.h
index f0412c5..dab584b 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -153,6 +153,9 @@
 /* MSR used to provide vcpu index */
 #define HV_X64_MSR_VP_INDEX			0x40000002
 
+/* MSR used to reset the guest OS. */
+#define HV_X64_MSR_RESET			0x40000003
+
 /* MSR used to read the per-partition time reference counter */
 #define HV_X64_MSR_TIME_REF_COUNT		0x40000020
 
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index a8160d2..0ad11a2 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -41,6 +41,7 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
case HV_X64_MSR_TIME_REF_COUNT:
case HV_X64_MSR_CRASH_CTL:
case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
+   case HV_X64_MSR_RESET:
r = true;
break;
}
@@ -163,6 +164,12 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 
msr, u64 data,
 data);
case HV_X64_MSR_CRASH_CTL:
return kvm_hv_msr_set_crash_ctl(vcpu, data, host);
+   case HV_X64_MSR_RESET:
+   if (data == 1) {
+   vcpu_debug(vcpu, "hyper-v reset requested\n");
+   kvm_make_request(KVM_REQ_HV_RESET, vcpu);
+   }
+   break;
default:
vcpu_unimpl(vcpu, "Hyper-V uhandled wrmsr: 0x%x data 0x%llx\n",
msr, data);
@@ -241,6 +248,9 @@ static int kvm_hv_get_msr_pw(struct kvm_vcpu *vcpu, u32 
msr, u64 *pdata)
 pdata);
case HV_X64_MSR_CRASH_CTL:
return kvm_hv_msr_get_crash_ctl(vcpu, pdata);
+   case HV_X64_MSR_RESET:
+   data = 0;
+   break;
default:
vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr);
return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a60bdbc..5a14a66 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -951,6 +951,7 @@ static u32 emulated_msrs[] = {
HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_REFERENCE_TSC,
HV_X64_MSR_CRASH_P0, HV_X64_MSR_CRASH_P1, HV_X64_MSR_CRASH_P2,
HV_X64_MSR_CRASH_P3, HV_X64_MSR_CRASH_P4, HV_X64_MSR_CRASH_CTL,
+   HV_X64_MSR_RESET,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_KVM_PV_EOI_EN,
 
@@ -6268,6 +6269,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
r = 0;
goto out;
}
+   if (kvm_check_request(KVM_REQ_HV_RESET, vcpu)) {
+   vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+   vcpu->run->system_event.type = KVM_SYSTEM_EVENT_RESET;
+   r = 0;
+   goto out;
+   }
}
 
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1bef9e2..f6beba2 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -140,6 +140,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_APIC_PAGE_RELOAD  25
 #define KVM_REQ_SMI   26
 #define KVM_REQ_HV_CRASH  27
+#define KVM_REQ_HV_RESET  28
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID   1
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
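A sketch of the userspace side of the hunk above: the guest's MSR write surfaces as KVM_EXIT_SYSTEM_EVENT with type KVM_SYSTEM_EVENT_RESET, which a VMM run loop handles like any other reset request. reset_machine() is a placeholder, not an existing QEMU function.

#include <linux/kvm.h>

static void reset_machine(void)
{
	/* VMM-specific: stop vcpus, reset devices, restart the guest */
}

static void handle_kvm_exit(struct kvm_run *run)
{
	switch (run->exit_reason) {
	case KVM_EXIT_SYSTEM_EVENT:
		if (run->system_event.type == KVM_SYSTEM_EVENT_RESET)
			reset_machine();
		break;
	default:
		/* other exit reasons elided */
		break;
	}
}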


Re: [PATCH v5 2/2] KVM: nVMX: nested VPID emulation

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 11:30, Wanpeng Li wrote:
> VPID is used to tag address space and avoid a TLB flush. Currently L0 use 
> the same VPID to run L1 and all its guests. KVM flushes VPID when switching 
> between L1 and L2. 
> 
> This patch advertises VPID to the L1 hypervisor, then address space of L1 
> and L2 can be separately treated and avoid TLB flush when swithing between 
> L1 and L2. For each nested vmentry, if vpid12 is changed, reuse shadow vpid 
> w/ an invvpid.
> 
> Performance: 
> 
> run lmbench on L2 w/ 3.5 kernel.
> 
> Context switching - times in microseconds - smaller is better
> -------------------------------------------------------------------------
> Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
>                          ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
> --------- ------------- ------ ------ ------ ------ ------ ------- -------
> kernel    Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.60000 2.88000  nested VPID
> kernel    Linux 3.5.0-1 1.2600 1.4300 1.5600   12.7   12.9 3.49000 7.46000  vanilla
> 
> Reviewed-by: Jan Kiszka 
> Suggested-by: Wincy Van 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 37 +++--
>  1 file changed, 31 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index f8d704d..c23482c 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -424,6 +424,9 @@ struct nested_vmx {
>   /* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
>   u64 vmcs01_debugctl;
>  
> + u16 vpid02;
> + u16 last_vpid;
> +
>   u32 nested_vmx_procbased_ctls_low;
>   u32 nested_vmx_procbased_ctls_high;
>   u32 nested_vmx_true_procbased_ctls_low;
> @@ -1155,6 +1158,11 @@ static inline bool 
> nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
>  }
>  
> +static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)
> +{
> + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
> +}
> +
>  static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
>  {
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
> @@ -2469,6 +2477,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
> *vmx)
>   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>   SECONDARY_EXEC_RDTSCP |
>   SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
> + SECONDARY_EXEC_ENABLE_VPID |
>   SECONDARY_EXEC_APIC_REGISTER_VIRT |
>   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
>   SECONDARY_EXEC_WBINVD_EXITING |
> @@ -6663,6 +6672,7 @@ static void free_nested(struct vcpu_vmx *vmx)
>   return;
>  
>   vmx->nested.vmxon = false;
> + free_vpid(vmx->nested.vpid02);
>   nested_release_vmcs12(vmx);
>   if (enable_shadow_vmcs)
>   free_vmcs(vmx->nested.current_shadow_vmcs);
> @@ -8548,8 +8558,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm 
> *kvm, unsigned int id)
>   goto free_vmcs;
>   }
>  
> - if (nested)
> + if (nested) {
>   nested_vmx_setup_ctls_msrs(vmx);
> + vmx->nested.vpid02 = allocate_vpid();
> + }
>  
>   vmx->nested.posted_intr_nv = -1;
>   vmx->nested.current_vmptr = -1ull;
> @@ -8570,6 +8582,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm 
> *kvm, unsigned int id)
>   return &vmx->vcpu;
>  
>  free_vmcs:
> + free_vpid(vmx->nested.vpid02);
>   free_loaded_vmcs(vmx->loaded_vmcs);
>  free_msrs:
>   kfree(vmx->guest_msrs);
> @@ -9445,12 +9458,24 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
> struct vmcs12 *vmcs12)
>  
>   if (enable_vpid) {
>   /*
> -  * Trivially support vpid by letting L2s share their parent
> -  * L1's vpid. TODO: move to a more elaborate solution, giving
> -  * each L2 its own vpid and exposing the vpid feature to L1.
> +  * There is no direct mapping between vpid02 and vpid12, the
> +  * vpid02 is per-vCPU for L0 and reused while the value of
> +  * vpid12 is changed w/ one invvpid during nested vmentry.
> +  * The vpid12 is allocated by L1 for L2, so it will not
> +  * influence global bitmap(for vpid01 and vpid02 allocation)
> +  * even if spawn a lot of nested vCPUs.
>*/
> - vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> - vmx_flush_tlb(vcpu);
> + if (nested_cpu_has_vpid(vmcs12)) {
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
> + if (vmcs12->virtual_processor_id != 
> vmx->nested.last_vpid) {
> + vmx->nested.last_vpid = 
> vmcs12->virtual_processor_id;
> + vmx_flush_tlb(vcpu);
> + }
>

Re: [PATCH v5 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-16 Thread Wanpeng Li

On 9/16/15 6:04 PM, Paolo Bonzini wrote:


On 16/09/2015 11:30, Wanpeng Li wrote:

Enhance allocate/free_vpid to handle shadow vpid.

Adjusting the commit message:

 KVM: nVMX: adjust interface to allocate/free_vpid
 
 Adjust allocate/free_vpid so that they can be reused for the nested vpid.


Cool, always thanks for your help, Paolo! :-)

Regards,
Wanpeng Li

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: add halt_attempted_poll to VCPU stats

2015-09-16 Thread Christian Borntraeger
Am 15.09.2015 um 18:27 schrieb Paolo Bonzini:
> This new statistic can help diagnosing VCPUs that, for any reason,
> trigger bad behavior of halt_poll_ns autotuning.
> 
> For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
> like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
> 10+20+40+80+160+320+480 = 1110 microseconds out of every
> 479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
> is consuming about 30% more CPU than it would use without
> polling.  This would show as an abnormally high number of
> attempted polling compared to the successful polls.
> 
> Cc: Christian Borntraeger  Cc: David Matlack 
> Signed-off-by: Paolo Bonzini 

Acked-by: Christian Borntraeger 

yes, this will help to detect some bad cases, but not all.

PS: 
upstream maintenance keeps me really busy at the moment :-)
I am looking into a case right now, where auto polling goes 
completely nuts on my system:

guest1: 8vcpus  guest2: 1 vcpu
iperf with 25 process (-P25) from guest1 to guest2.

I/O interrupts on s390 are floating (pending on all CPUs) so on 
ALL VCPUs that go to sleep, polling will consider any pending
network interrupt as successful poll. So with auto polling the
guest consumes up to 5 host CPUs without auto polling only 1.
Reducing halt_poll_ns to 100000 seems to work (goes back to 
1 cpu).

The proper way might be to feedback the result of the
interrupt dequeue into the heuristics. Don't know yet how
to handle that properly.

Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: suspicious RCU usage with kvm_pr

2015-09-16 Thread Thomas Huth
On 16/09/15 10:51, Denis Kirjanov wrote:
> Hi,
> 
> I see the following trace on qemu startup (ps700 blade):
> 
> v4.2-11169-g64d1def
> 
> 
> [  143.369638] ===
> [  143.369640] [ INFO: suspicious RCU usage. ]
> [  143.369643] 4.2.0-11169-g64d1def #10 Tainted: G S
> [  143.369645] ---
> [  143.369647] arch/powerpc/kvm/../../../virt/kvm/kvm_main.c:3310
> suspicious rcu_dereference_check() usage!
> [  143.369649]
> other info that might help us debug this:
> 
> [  143.369652]
> rcu_scheduler_active = 1, debug_locks = 1
> [  143.369655] 1 lock held by qemu-system-ppc/2292:
> [  143.369656]  #0:  (&vcpu->mutex){+.+.+.}, at: []
> .vcpu_load+0x2c/0xb0 [kvm]
> [  143.369672]
> stack backtrace:
> [  143.369675] CPU: 12 PID: 2292 Comm: qemu-system-ppc Tainted: G S
>   4.2.0-11169-g64d1def #10
> [  143.369677] Call Trace:
> [  143.369682] [c001d08bf200] [c0816dd0]
> .dump_stack+0x98/0xd4 (unreliable)
> [  143.369687] [c001d08bf280] [c00f7058]
> .lockdep_rcu_suspicious+0x108/0x170
> [  143.369696] [c001d08bf310] [d42296d8]
> .kvm_io_bus_read+0x1d8/0x220 [kvm]
> [  143.369705] [c001d08bf3c0] [d422f980]
> .kvmppc_h_logical_ci_load+0x60/0xe0 [kvm]

Could it be that we need to srcu_read_lock(&vcpu->kvm->srcu) before
calling the kvm_io_bus_read/write() function in the
kvmppc_h_logical_ci_load/store() function?

 Thomas

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
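A sketch of the locking Thomas suggests above: kvm_io_bus_read()/write() dereference the bus under kvm->srcu, so the hcall path has to enter that SRCU read side first. The helper name and the exact call site in arch/powerpc/kvm/book3s.c are assumptions, not the final patch.

#include <linux/kvm_host.h>

static int logical_ci_load_sketch(struct kvm_vcpu *vcpu, gpa_t addr,
				  int len, void *val)
{
	int srcu_idx, ret;

	srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
	ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, len, val);
	srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);

	return ret;
}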


Re: [PATCH] KVM: add halt_attempted_poll to VCPU stats

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 12:12, Christian Borntraeger wrote:
> I am looking into a case right now, where auto polling goes 
> completely nuts on my system:
> 
> guest1: 8vcpus    guest2: 1 vcpu
> iperf with 25 process (-P25) from guest1 to guest2.
> 
> I/O interrupts on s390 are floating (pending on all CPUs) so on 
> ALL VCPUs that go to sleep, polling will consider any pending
> network interrupt as successful poll. So with auto polling the
> guest consumes up to 5 host CPUs without auto polling only 1.
> Reducing halt_poll_ns to 100000 seems to work (goes back to 
> 1 cpu).
> 
> The proper way might be to feedback the result of the
> interrupt dequeue into the heuristics. Don't know yet how
> to handle that properly.

I think it's simplest to disable halt_poll_ns by default on s390.  On
x86, for example, you can mark interrupts so that they _can_ be
delivered to all CPUs but only one will get it.

You can add a Kconfig symbol for that to other architectures, and not s390.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 11:30, Wanpeng Li wrote:
> Enhance allocate/free_vpid to handle shadow vpid.

Adjusting the commit message:

KVM: nVMX: adjust interface to allocate/free_vpid

Adjust allocate/free_vpid so that they can be reused for the nested vpid.

and committing to kvm/queue.

Paolo

> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 25 -
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9ff6a3f..f8d704d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4155,29 +4155,28 @@ static int alloc_identity_pagetable(struct kvm *kvm)
>   return r;
>  }
>  
> -static void allocate_vpid(struct vcpu_vmx *vmx)
> +static int allocate_vpid(void)
>  {
>   int vpid;
>  
> - vmx->vpid = 0;
>   if (!enable_vpid)
> - return;
> + return 0;
>   spin_lock(&vmx_vpid_lock);
>   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
> - if (vpid < VMX_NR_VPIDS) {
> - vmx->vpid = vpid;
> + if (vpid < VMX_NR_VPIDS)
>   __set_bit(vpid, vmx_vpid_bitmap);
> - }
> + else
> + vpid = 0;
>   spin_unlock(&vmx_vpid_lock);
> + return vpid;
>  }
>  
> -static void free_vpid(struct vcpu_vmx *vmx)
> +static void free_vpid(int vpid)
>  {
> - if (!enable_vpid)
> + if (!enable_vpid || vpid == 0)
>   return;
>   spin_lock(&vmx_vpid_lock);
> - if (vmx->vpid != 0)
> - __clear_bit(vmx->vpid, vmx_vpid_bitmap);
> + __clear_bit(vpid, vmx_vpid_bitmap);
>   spin_unlock(&vmx_vpid_lock);
>  }
>  
> @@ -8482,7 +8481,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
>  
>   if (enable_pml)
>   vmx_disable_pml(vmx);
> - free_vpid(vmx);
> + free_vpid(vmx->vpid);
>   leave_guest_mode(vcpu);
>   vmx_load_vmcs01(vcpu);
>   free_nested(vmx);
> @@ -8501,7 +8500,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm 
> *kvm, unsigned int id)
>   if (!vmx)
>   return ERR_PTR(-ENOMEM);
>  
> - allocate_vpid(vmx);
> + vmx->vpid = allocate_vpid();
>  
>   err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
>   if (err)
> @@ -8577,7 +8576,7 @@ free_msrs:
>  uninit_vcpu:
>   kvm_vcpu_uninit(&vmx->vcpu);
>  free_vcpu:
> - free_vpid(vmx);
> + free_vpid(vmx->vpid);
>   kmem_cache_free(kvm_vcpu_cache, vmx);
>   return ERR_PTR(err);
>  }
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support

2015-09-16 Thread Denis V. Lunev
From: Andrey Smetanin 

HV_X64_MSR_VP_RUNTIME msr used by guest to get
"the time the virtual processor consumes running guest code,
and the time the associated logical processor spends running
hypervisor code on behalf of that guest."

Calculation of this time is performed by task_cputime_adjusted()
for vcpu task.

Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Paolo Bonzini 
CC: Gleb Natapov 
---
 arch/x86/include/asm/kvm_host.h|  1 +
 arch/x86/include/uapi/asm/hyperv.h |  3 +++
 arch/x86/kvm/hyperv.c  | 21 +++--
 arch/x86/kvm/x86.c |  1 +
 kernel/sched/cputime.c |  2 ++
 5 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c12e845..39ebb4d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -373,6 +373,7 @@ struct kvm_mtrr {
 /* Hyper-V per vcpu emulation context */
 struct kvm_vcpu_hv {
u64 hv_vapic;
+   s64 runtime_offset;
 };
 
 struct kvm_vcpu_arch {
diff --git a/arch/x86/include/uapi/asm/hyperv.h 
b/arch/x86/include/uapi/asm/hyperv.h
index dab584b..2677a0a 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -156,6 +156,9 @@
 /* MSR used to reset the guest OS. */
 #define HV_X64_MSR_RESET			0x40000003
 
+/* MSR used to provide vcpu runtime in 100ns units */
+#define HV_X64_MSR_VP_RUNTIME			0x40000010
+
 /* MSR used to read the per-partition time reference counter */
 #define HV_X64_MSR_TIME_REF_COUNT		0x40000020
 
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 0ad11a2..62cf8c9 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -178,7 +178,16 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 
msr, u64 data,
return 0;
 }
 
-static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
+/* Calculate cpu time spent by current task in 100ns units */
+static u64 current_task_runtime_100ns(void)
+{
+   cputime_t utime, stime;
+
+   task_cputime_adjusted(current, &utime, &stime);
+   return div_u64(cputime_to_nsecs(utime + stime), 100);
+}
+
+static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
 {
struct kvm_vcpu_hv *hv = &vcpu->arch.hyperv;
 
@@ -212,6 +221,11 @@ static int kvm_hv_set_msr(struct kvm_vcpu *vcpu, u32 msr, 
u64 data)
return kvm_hv_vapic_msr_write(vcpu, APIC_ICR, data);
case HV_X64_MSR_TPR:
return kvm_hv_vapic_msr_write(vcpu, APIC_TASKPRI, data);
+   case HV_X64_MSR_VP_RUNTIME:
+   if (!host)
+   return 1;
+   hv->runtime_offset = data - current_task_runtime_100ns();
+   break;
default:
vcpu_unimpl(vcpu, "Hyper-V uhandled wrmsr: 0x%x data 0x%llx\n",
msr, data);
@@ -287,6 +301,9 @@ static int kvm_hv_get_msr(struct kvm_vcpu *vcpu, u32 msr, 
u64 *pdata)
case HV_X64_MSR_APIC_ASSIST_PAGE:
data = hv->hv_vapic;
break;
+   case HV_X64_MSR_VP_RUNTIME:
+   data = current_task_runtime_100ns() + hv->runtime_offset;
+   break;
default:
vcpu_unimpl(vcpu, "Hyper-V unhandled rdmsr: 0x%x\n", msr);
return 1;
@@ -305,7 +322,7 @@ int kvm_hv_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, 
u64 data, bool host)
mutex_unlock(&vcpu->kvm->lock);
return r;
} else
-   return kvm_hv_set_msr(vcpu, msr, data);
+   return kvm_hv_set_msr(vcpu, msr, data, host);
 }
 
 int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c2028ac..d6263b7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -953,6 +953,7 @@ static u32 emulated_msrs[] = {
HV_X64_MSR_CRASH_P3, HV_X64_MSR_CRASH_P4, HV_X64_MSR_CRASH_CTL,
HV_X64_MSR_RESET,
HV_X64_MSR_VP_INDEX,
+   HV_X64_MSR_VP_RUNTIME,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_KVM_PV_EOI_EN,
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8cbc3db..26a5446 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -444,6 +444,7 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t 
*ut, cputime_t *st)
*ut = p->utime;
*st = p->stime;
 }
+EXPORT_SYMBOL_GPL(task_cputime_adjusted);
 
 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, 
cputime_t *st)
 {
@@ -652,6 +653,7 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t 
*ut, cputime_t *st)
task_cputime(p, &cputime.utime, &cputime.stime);
cputime_adjust(&cputi
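
A userspace illustration of the offset trick in the kvm_hv_set_msr() hunk above: the MSR reads as a monotonically growing runtime, so a host-side write (for example on migration) is stored as an offset against the task's current runtime and added back on reads. The numbers below are made up.

#include <stdio.h>

static long long runtime_100ns = 12345;	/* stand-in for task_cputime_adjusted() */
static long long runtime_offset;

static void wrmsr_vp_runtime(long long data)
{
	runtime_offset = data - runtime_100ns;
}

static long long rdmsr_vp_runtime(void)
{
	return runtime_100ns + runtime_offset;
}

int main(void)
{
	wrmsr_vp_runtime(50000);		/* value restored by the host */
	runtime_100ns += 1000;			/* guest runs a bit more */
	printf("%lld\n", rdmsr_vp_runtime());	/* 51000 */
	return 0;
}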

[PATCH 0/3] KVM: Necessary simple pre-requisites for VMBus emulation

2015-09-16 Thread Denis V. Lunev
Hyper-V reset, vp index, vp runtime support is required to
support loading Windows guest driver Winhv.sys. Winhv.sys in guest
is required to support Windows VMBus.

These changes are simple and straightforward. Let them go first.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Paolo Bonzini 
CC: Gleb Natapov 

Andrey Smetanin (3):
  kvm/x86: Hyper-V HV_X64_MSR_RESET msr
  kvm/x86: Hyper-V HV_X64_MSR_VP_INDEX export for QEMU.
  kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support

 arch/x86/include/asm/kvm_host.h|  1 +
 arch/x86/include/uapi/asm/hyperv.h |  6 ++
 arch/x86/kvm/hyperv.c  | 31 +--
 arch/x86/kvm/x86.c |  9 +
 include/linux/kvm_host.h   |  1 +
 kernel/sched/cputime.c |  2 ++
 6 files changed, 48 insertions(+), 2 deletions(-)

-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-16 Thread Houcheng Lin
2015-09-16 17:38 GMT+08:00 Paolo Bonzini :
>
> Actually it's even simpler.  shm_open is basically just
>
> char *s;
> int fd;
>
> asprintf(&s, "/dev/shm/%s", name);
> fd = open(s, oflag | O_CLOEXEC, mode);
> free(s);
> return fd;
>
> plus some error checking.  Do Android systems have /dev/shm?
>
> Paolo

It's simple, thanks.
Android has no /dev/shm. Though we could mknod it, that would need root privilege to
do it.


-- 
Best regards,
Houcheng Lin
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] KVM: Necessary simple pre-requisites for VMBus emulation

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 11:29, Denis V. Lunev wrote:
> Hyper-V reset, vp index, vp runtime support is required to
> support loading Windows guest driver Winhv.sys. Winhv.sys in guest
> is required to support Windows VMBus.
> 
> These changes are simple and straightforward. Let them go first.
> 
> Signed-off-by: Andrey Smetanin 
> Reviewed-by: Roman Kagan 
> Signed-off-by: Denis V. Lunev 
> CC: Paolo Bonzini 
> CC: Gleb Natapov 
> 
> Andrey Smetanin (3):
>   kvm/x86: Hyper-V HV_X64_MSR_RESET msr
>   kvm/x86: Hyper-V HV_X64_MSR_VP_INDEX export for QEMU.
>   kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support
> 
>  arch/x86/include/asm/kvm_host.h|  1 +
>  arch/x86/include/uapi/asm/hyperv.h |  6 ++
>  arch/x86/kvm/hyperv.c  | 31 +--
>  arch/x86/kvm/x86.c |  9 +
>  include/linux/kvm_host.h   |  1 +
>  kernel/sched/cputime.c |  2 ++
>  6 files changed, 48 insertions(+), 2 deletions(-)
> 

Thanks, applying this to kvm/queue.

I had to change the request number to 29 in the first patch.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vhost: build failure

2015-09-16 Thread Sudip Mukherjee
On Wed, Sep 16, 2015 at 11:36:45AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2015 at 01:50:08PM +0530, Sudip Mukherjee wrote:
> > Hi,
> > While crosscompiling the kernel for openrisc with allmodconfig the build
> > failed with the error:
> > drivers/vhost/vhost.c: In function 'vhost_vring_ioctl':
> > drivers/vhost/vhost.c:818:3: error: call to '__compiletime_assert_818' 
> > declared with attribute error: BUILD_BUG_ON failed: __alignof__
> > *vq->avail > VRING_AVAIL_ALIGN_SIZE
> > 
> > Can you please give me any idea about what the problem might be and how
> > it can be solved.
> > 
> > You can see the build log at:
> > https://travis-ci.org/sudipm-mukherjee/parport/jobs/80365425
> > 
> > regards
> > sudip
> 
> Yes - I think I saw this already.
> I think the openrisc cross-compiler is broken.
I thought so. Thanks for the quick reply. I will open a bug in gcc and
let's see what they say.

regards
sudip
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 11:28, Houcheng Lin wrote:
> 2015-09-16 16:09 GMT+08:00 Paolo Bonzini :
>>
>>
>>>
>>> I'll modify the bionic C library to support these functions and feedback
>>> to google's AOSP project. But the android kernel does not support shmem,
>>
>> It doesn't support tmpfs?  /dev/shm is just a tmpfs.
>>
>> Paolo
> 
> Oh, you are right. Android has shmget, shmat and shmdt functions in
> its libc. The POSIX shm_open can be built on top of these. I'll fix my
> libc to support the POSIX shared memory functions.

Actually it's even simpler.  shm_open is basically just

char *s;
int fd;

asprintf(&s, "/dev/shm/%s", name);
fd = open(s, oflag | O_CLOEXEC, mode);
free(s);
return fd;

plus some error checking.  Do Android systems have /dev/shm?

Paolo
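
Spelled out with the error checking, such a fallback might look roughly like
the sketch below; the hard-coded /dev/shm prefix and the unconditional
O_CLOEXEC are assumptions, and a bionic port would presumably substitute
whatever tmpfs mount point it has available:

#define _GNU_SOURCE             /* for asprintf() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Sketch of a shm_open() replacement built on a tmpfs directory. */
static int fallback_shm_open(const char *name, int oflag, mode_t mode)
{
        char *s;
        int fd;

        if (asprintf(&s, "/dev/shm/%s", name) < 0)
                return -1;              /* errno left as set by asprintf */

        fd = open(s, oflag | O_CLOEXEC, mode);
        free(s);
        return fd;                      /* -1 with errno set on failure */
}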


Re: [PATCH v8 00/13] Add VT-d Posted-Interrupts support

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:49, Feng Wu wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> You can find the VT-d Posted-Interrupts spec at the following URL:
> http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> 
> v8:
> refer to the changelog in each patch

Thanks, it mostly looks good.

Since we've more or less converged, could you post the whole series for
v9, including the other prerequisite series?

Paolo


Re: [PATCH v8 12/13] KVM: Warn if 'SN' is set during posting interrupts by software

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:50, Feng Wu wrote:
> Currently, we don't support urgent interrupt, all interrupts
> are recognized as non-urgent interrupt, so we cannot post
> interrupts when 'SN' is set.
> 
> If the vcpu is in guest mode, it cannot have been scheduled out,
> and that's the only case when SN is set currently, warning if
> SN is set.
> 
> Signed-off-by: Feng Wu 
> Reviewed-by: Paolo Bonzini 

Please fold this into patch 10.

Paolo

> ---
>  arch/x86/kvm/vmx.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9888c43..58fbbc6 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4498,6 +4498,22 @@ static inline bool 
> kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu)
>  {
>  #ifdef CONFIG_SMP
>   if (vcpu->mode == IN_GUEST_MODE) {
> + struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> + /*
> +  * Currently, we don't support urgent interrupt,
> +  * all interrupts are recognized as non-urgent
> +  * interrupt, so we cannot post interrupts when
> +  * 'SN' is set.
> +  *
> +  * If the vcpu is in guest mode, it means it is
> +  * running instead of being scheduled out and
> +  * waiting in the run queue, and that's the only
> +  * case when 'SN' is set currently, warning if
> +  * 'SN' is set.
> +  */
> + WARN_ON_ONCE(pi_test_sn(&vmx->pi_desc));
> +
>   apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
>   POSTED_INTR_VECTOR);
>   return true;
> 


Re: [PATCH v8 11/13] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:50, Feng Wu wrote:
>* are two possible cases:
> -  * 1. After running 'pi_pre_block', context switch
> +  * 1. After running 'pre_block', context switch

Please fold this in the previous patch.

>*happened. For this case, 'sn' was set in
>*vmx_vcpu_put(), so we need to clear it here.
> -  * 2. After running 'pi_pre_block', we were blocked,
> +  * 2. After running 'pre_block', we were blocked,
>*and woken up by some other guy. For this case,

(Same).

> + spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> + list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
> + blocked_vcpu_list) {
> + struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
> +
> + if (pi_test_on(pi_desc) == 1)
> + kvm_vcpu_kick(vcpu);
> + }
> + spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
> +}

Please document the lock in Documentation/virtual/kvm/locking.txt.

Paolo


[PATCH v5 0/2] KVM: nested VPID emulation

2015-09-16 Thread Wanpeng Li
v4 -> v5:
 * add vpid == 0 check

v3 -> v4:
 * return 0 if vpid == VMX_NR_VPIDs
 * skip vpid != 0 check

v2 -> v3:
 * enhance allocate/free_vpid as Jan's suggestion
 * add more comments to 2/2

v1 -> v2:
 * enhance allocate/free_vpid to handle shadow vpid
 * drop empty space
 * allocate shadow vpid during initialization
 * For each nested vmentry, if vpid12 is changed, reuse shadow vpid w/ an 
   invvpid.

VPID is used to tag the address space and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes the VPID when switching
between L1 and L2.

This patch advertises VPID to the L1 hypervisor, so that the address spaces of
L1 and L2 can be tagged separately and the TLB flush when switching between L1
and L2 can be avoided.

Performance: 

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------------
Host    OS             2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                       ctxsw  ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw
------- -------------- ------ ------ ------ ------ ------ ------- -------
kernel  Linux 3.5.0-1  1.2200 1.3700 1.4500 4.7800 2.3300 5.6     2.88000  (nested VPID)
kernel  Linux 3.5.0-1  1.2600 1.4300 1.5600 12.7   12.9   3.49000 7.46000  (vanilla)

Wanpeng Li (2):
  KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid
  KVM: nVMX: nested VPID emulation

 arch/x86/kvm/vmx.c | 62 +-
 1 file changed, 43 insertions(+), 19 deletions(-)

-- 
1.9.1



[PATCH v5 2/2] KVM: nVMX: nested VPID emulation

2015-09-16 Thread Wanpeng Li
VPID is used to tag the address space and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes the VPID when switching
between L1 and L2.

This patch advertises VPID to the L1 hypervisor, so that the address spaces of
L1 and L2 can be tagged separately and the TLB flush when switching between L1
and L2 can be avoided. For each nested vmentry, if vpid12 is changed, the
shadow vpid is reused with an invvpid.

Performance: 

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------------
Host    OS             2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                       ctxsw  ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw
------- -------------- ------ ------ ------ ------ ------ ------- -------
kernel  Linux 3.5.0-1  1.2200 1.3700 1.4500 4.7800 2.3300 5.6     2.88000  (nested VPID)
kernel  Linux 3.5.0-1  1.2600 1.4300 1.5600 12.7   12.9   3.49000 7.46000  (vanilla)

Reviewed-by: Jan Kiszka 
Suggested-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f8d704d..c23482c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -424,6 +424,9 @@ struct nested_vmx {
/* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
u64 vmcs01_debugctl;
 
+   u16 vpid02;
+   u16 last_vpid;
+
u32 nested_vmx_procbased_ctls_low;
u32 nested_vmx_procbased_ctls_high;
u32 nested_vmx_true_procbased_ctls_low;
@@ -1155,6 +1158,11 @@ static inline bool 
nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 }
 
+static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
+}
+
 static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
 {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
@@ -2469,6 +2477,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+   SECONDARY_EXEC_ENABLE_VPID |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_WBINVD_EXITING |
@@ -6663,6 +6672,7 @@ static void free_nested(struct vcpu_vmx *vmx)
return;
 
vmx->nested.vmxon = false;
+   free_vpid(vmx->nested.vpid02);
nested_release_vmcs12(vmx);
if (enable_shadow_vmcs)
free_vmcs(vmx->nested.current_shadow_vmcs);
@@ -8548,8 +8558,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, 
unsigned int id)
goto free_vmcs;
}
 
-   if (nested)
+   if (nested) {
nested_vmx_setup_ctls_msrs(vmx);
+   vmx->nested.vpid02 = allocate_vpid();
+   }
 
vmx->nested.posted_intr_nv = -1;
vmx->nested.current_vmptr = -1ull;
@@ -8570,6 +8582,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, 
unsigned int id)
return &vmx->vcpu;
 
 free_vmcs:
+   free_vpid(vmx->nested.vpid02);
free_loaded_vmcs(vmx->loaded_vmcs);
 free_msrs:
kfree(vmx->guest_msrs);
@@ -9445,12 +9458,24 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, 
struct vmcs12 *vmcs12)
 
if (enable_vpid) {
/*
-* Trivially support vpid by letting L2s share their parent
-* L1's vpid. TODO: move to a more elaborate solution, giving
-* each L2 its own vpid and exposing the vpid feature to L1.
+* There is no direct mapping between vpid02 and vpid12, the
+* vpid02 is per-vCPU for L0 and reused while the value of
+* vpid12 is changed w/ one invvpid during nested vmentry.
+* The vpid12 is allocated by L1 for L2, so it will not
+* influence global bitmap(for vpid01 and vpid02 allocation)
+* even if spawn a lot of nested vCPUs.
 */
-   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
-   vmx_flush_tlb(vcpu);
+   if (nested_cpu_has_vpid(vmcs12)) {
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
+   if (vmcs12->virtual_processor_id != 
vmx->nested.last_vpid) {
+   vmx->nested.last_vpid = 
vmcs12->virtual_processor_id;
+   vmx_flush_tlb(vcpu);
+   }
+   } else {
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+   vmx_flush_tlb(vcpu);
+   }
+
}
 

[PATCH v5 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-16 Thread Wanpeng Li
Enhance allocate/free_vpid to handle the shadow vpid.

Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9ff6a3f..f8d704d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4155,29 +4155,28 @@ static int alloc_identity_pagetable(struct kvm *kvm)
return r;
 }
 
-static void allocate_vpid(struct vcpu_vmx *vmx)
+static int allocate_vpid(void)
 {
int vpid;
 
-   vmx->vpid = 0;
if (!enable_vpid)
-   return;
+   return 0;
spin_lock(&vmx_vpid_lock);
vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
-   if (vpid < VMX_NR_VPIDS) {
-   vmx->vpid = vpid;
+   if (vpid < VMX_NR_VPIDS)
__set_bit(vpid, vmx_vpid_bitmap);
-   }
+   else
+   vpid = 0;
spin_unlock(&vmx_vpid_lock);
+   return vpid;
 }
 
-static void free_vpid(struct vcpu_vmx *vmx)
+static void free_vpid(int vpid)
 {
-   if (!enable_vpid)
+   if (!enable_vpid || vpid == 0)
return;
spin_lock(&vmx_vpid_lock);
-   if (vmx->vpid != 0)
-   __clear_bit(vmx->vpid, vmx_vpid_bitmap);
+   __clear_bit(vpid, vmx_vpid_bitmap);
spin_unlock(&vmx_vpid_lock);
 }
 
@@ -8482,7 +8481,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 
if (enable_pml)
vmx_disable_pml(vmx);
-   free_vpid(vmx);
+   free_vpid(vmx->vpid);
leave_guest_mode(vcpu);
vmx_load_vmcs01(vcpu);
free_nested(vmx);
@@ -8501,7 +8500,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, 
unsigned int id)
if (!vmx)
return ERR_PTR(-ENOMEM);
 
-   allocate_vpid(vmx);
+   vmx->vpid = allocate_vpid();
 
err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
if (err)
@@ -8577,7 +8576,7 @@ free_msrs:
 uninit_vcpu:
kvm_vcpu_uninit(&vmx->vcpu);
 free_vcpu:
-   free_vpid(vmx);
+   free_vpid(vmx->vpid);
kmem_cache_free(kvm_vcpu_cache, vmx);
return ERR_PTR(err);
 }
-- 
1.9.1



Re: [PATCH v8 10/13] KVM: Update Posted-Interrupts Descriptor when vCPU is preempted

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:50, Feng Wu wrote:
> +
> + if (!irq_remapping_cap(IRQ_POSTING_CAP) ||
> + (!kvm_arch_has_assigned_device(vcpu->kvm)))
> + return;
> +

Better:

	if (!kvm_arch_has_assigned_device(vcpu->kvm) ||
	    !irq_remapping_cap(IRQ_POSTING_CAP))
		return;

(In the future we might add a static_key here).

Paolo
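
For the parenthetical, a rough sketch of what such a static key could look
like; the key name and the inc/dec call sites below are made up for
illustration, only the static_key API itself is existing infrastructure:

/* hypothetical key, patched to "false" until a device is assigned */
static struct static_key pi_has_assigned_device = STATIC_KEY_INIT_FALSE;

	/* fast path, e.g. in vmx_vcpu_pi_load()/vmx_vcpu_pi_put() */
	if (!static_key_false(&pi_has_assigned_device) ||
	    !irq_remapping_cap(IRQ_POSTING_CAP))
		return;

	/* hypothetical update sites on device (de)assignment */
	static_key_slow_inc(&pi_has_assigned_device);
	...
	static_key_slow_dec(&pi_has_assigned_device);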


Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-16 Thread Houcheng Lin
2015-09-16 16:09 GMT+08:00 Paolo Bonzini :
>
>
>>
>> I'll modify the bionic C library to support these functions and feedback
>> to google's AOSP project. But the android kernel does not support shmem,
>
> It doesn't support tmpfs?  /dev/shm is just a tmpfs.
>
> Paolo

Oh, you are right. Android has shmget, shmat and shmdt functions in
its libc. The POSIX shm_open can be built on top of these. I'll fix my
libc to support the POSIX shared memory functions.

-- 
Best regards,
Houcheng Lin


Re: [PATCH v8 09/13] KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:50, Feng Wu wrote:
> +int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
> +uint32_t guest_irq, bool set)
> +{
> + return !kvm_x86_ops->update_pi_irte ? -EINVAL :
> + kvm_x86_ops->update_pi_irte(kvm, host_irq, guest_irq, set);
> +}
> +

Just use "if" here.  No need to resend if this is the only comment.

> 
>  }
> +int  __attribute__((weak)) kvm_arch_update_irqfd_routing(
> + struct kvm *kvm, unsigned

Empty line after "}".

Paolo


Re: [PATCH v8 03/13] KVM: Define a new interface kvm_intr_is_single_vcpu()

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:49, Feng Wu wrote:
> This patch defines a new interface kvm_intr_is_single_vcpu(),
> which returns whether the interrupt is for a single CPU or not.
> 
> It is used by VT-d PI, since for now we only support single-CPU
> interrupts. For lowest-priority interrupts, if the user configures
> it via /proc/irq or uses irqbalance to make it single-CPU, we
> can use PI to deliver the interrupts to it. Full functionality
> of lowest-priority support will be added later.
> 
> Signed-off-by: Feng Wu 
> ---
> v8:
> - Some optimizations in kvm_intr_is_single_vcpu().
> - Expose kvm_intr_is_single_vcpu() so we can use it in vmx code.
> - Add kvm_intr_is_single_vcpu_fast() as the fast path to find
>   the target vCPU for the single-destination interrupt
> 
>  arch/x86/include/asm/kvm_host.h |  3 ++
>  arch/x86/kvm/irq_comm.c | 94 
> +
>  arch/x86/kvm/lapic.c|  5 +--
>  arch/x86/kvm/lapic.h|  2 +
>  4 files changed, 101 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 49ec903..af11bca 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1204,4 +1204,7 @@ int __x86_set_memory_region(struct kvm *kvm,
>  int x86_set_memory_region(struct kvm *kvm,
> const struct kvm_userspace_memory_region *mem);
>  
> +bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
> +  struct kvm_vcpu **dest_vcpu);
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> index 9efff9e..97ba1d6 100644
> --- a/arch/x86/kvm/irq_comm.c
> +++ b/arch/x86/kvm/irq_comm.c
> @@ -297,6 +297,100 @@ out:
>   return r;
>  }
>  
> +static bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm,
> +  struct kvm_lapic_irq *irq,
> +  struct kvm_vcpu **dest_vcpu)

Please put this in lapic.c, similar to kvm_irq_delivery_to_apic_fast, so
that you do not have to export other functions.

> +{
> + struct kvm_apic_map *map;
> + bool ret = false;
> + struct kvm_lapic *dst = NULL;
> +
> + if (irq->shorthand)
> + return false;
> +
> + rcu_read_lock();
> + map = rcu_dereference(kvm->arch.apic_map);
> +
> + if (!map)
> + goto out;
> +
> + if (irq->dest_mode == APIC_DEST_PHYSICAL) {
> + if (irq->dest_id == 0xFF)
> + goto out;
> +
> + if (irq->dest_id >= ARRAY_SIZE(map->phys_map)) {

Warning here is wrong, the guest can trigger it.

> + WARN_ON_ONCE(1);
> + goto out;
> + }
> +
> + dst = map->phys_map[irq->dest_id];
> + if (dst && kvm_apic_present(dst->vcpu))
> + *dest_vcpu = dst->vcpu;
> + else
> + goto out;
> + } else {
> + u16 cid;
> + unsigned long bitmap = 1;
> + int i, r = 0;
> +
> + if (!kvm_apic_logical_map_valid(map)) {
> + WARN_ON_ONCE(1);

Same here.

> + goto out;
> + }
> +
> + apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
> +
> + if (cid >= ARRAY_SIZE(map->logical_map)) {
> + WARN_ON_ONCE(1);

Same here.

Otherwise looks good.

Paolo

> + goto out;
> + }
> +
> + for_each_set_bit(i, &bitmap, 16) {
> + dst = map->logical_map[cid][i];
> + if (++r == 2)
> + goto out;
> + }
> +
> + if (dst && kvm_apic_present(dst->vcpu))
> + *dest_vcpu = dst->vcpu;
> + else
> + goto out;
> + }
> +
> + ret = true;
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> +
> +bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
> +  struct kvm_vcpu **dest_vcpu)
> +{
> + int i, r = 0;
> + struct kvm_vcpu *vcpu;
> +
> + if (kvm_intr_is_single_vcpu_fast(kvm, irq, dest_vcpu))
> + return true;
> +
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + if (!kvm_apic_present(vcpu))
> + continue;
> +
> + if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
> + irq->dest_id, irq->dest_mode))
> + continue;
> +
> + if (++r == 2)
> + return false;
> +
> + *dest_vcpu = vcpu;
> + }
> +
> + return r == 1;
> +}
> +EXPORT_SYMBOL_GPL(kvm_intr_is_single_vcpu);
> +
>  #define IOAPIC_ROUTING_ENTRY(irq) \
>   { .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,  \
> .u.irqchip = { .irqchip = KVM_IRQCHIP_IOAPIC, .pin = (irq) } }
> diff --git a/a
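
For the three WARN_ON_ONCE() comments above, the point is presumably just to
bail out silently, since the offending values are guest-controlled; a sketch
against the quoted code:

	if (irq->dest_id >= ARRAY_SIZE(map->phys_map))
		goto out;	/* guest-controlled value: no WARN */
	...
	if (!kvm_apic_logical_map_valid(map))
		goto out;
	...
	if (cid >= ARRAY_SIZE(map->logical_map))
		goto out;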

[PATCH v8 01/13] KVM: Extend struct pi_desc for VT-d Posted-Interrupts

2015-09-16 Thread Feng Wu
Extend struct pi_desc for VT-d Posted-Interrupts.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/vmx.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 83b7b5c..271dd70 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -446,8 +446,24 @@ struct nested_vmx {
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
u32 pir[8]; /* Posted interrupt requested */
-   u32 control;/* bit 0 of control is outstanding notification bit */
-   u32 rsvd[7];
+   union {
+   struct {
+   /* bit 256 - Outstanding Notification */
+   u16 on  : 1,
+   /* bit 257 - Suppress Notification */
+   sn  : 1,
+   /* bit 271:258 - Reserved */
+   rsvd_1  : 14;
+   /* bit 279:272 - Notification Vector */
+   u8  nv;
+   /* bit 287:280 - Reserved */
+   u8  rsvd_2;
+   /* bit 319:288 - Notification Destination */
+   u32 ndst;
+   };
+   u64 control;
+   };
+   u32 rsvd[6];
 } __aligned(64);
 
 static bool pi_test_and_set_on(struct pi_desc *pi_desc)
-- 
2.1.0



[PATCH v8 03/13] KVM: Define a new interface kvm_intr_is_single_vcpu()

2015-09-16 Thread Feng Wu
This patch defines a new interface kvm_intr_is_single_vcpu(),
which returns whether the interrupt is for a single CPU or not.

It is used by VT-d PI, since for now we only support single-CPU
interrupts. For lowest-priority interrupts, if the user configures
it via /proc/irq or uses irqbalance to make it single-CPU, we
can use PI to deliver the interrupts to it. Full functionality
of lowest-priority support will be added later.

Signed-off-by: Feng Wu 
---
v8:
- Some optimizations in kvm_intr_is_single_vcpu().
- Expose kvm_intr_is_single_vcpu() so we can use it in vmx code.
- Add kvm_intr_is_single_vcpu_fast() as the fast path to find
  the target vCPU for the single-destination interrupt

 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/irq_comm.c | 94 +
 arch/x86/kvm/lapic.c|  5 +--
 arch/x86/kvm/lapic.h|  2 +
 4 files changed, 101 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 49ec903..af11bca 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1204,4 +1204,7 @@ int __x86_set_memory_region(struct kvm *kvm,
 int x86_set_memory_region(struct kvm *kvm,
  const struct kvm_userspace_memory_region *mem);
 
+bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
+struct kvm_vcpu **dest_vcpu);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 9efff9e..97ba1d6 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -297,6 +297,100 @@ out:
return r;
 }
 
+static bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm,
+struct kvm_lapic_irq *irq,
+struct kvm_vcpu **dest_vcpu)
+{
+   struct kvm_apic_map *map;
+   bool ret = false;
+   struct kvm_lapic *dst = NULL;
+
+   if (irq->shorthand)
+   return false;
+
+   rcu_read_lock();
+   map = rcu_dereference(kvm->arch.apic_map);
+
+   if (!map)
+   goto out;
+
+   if (irq->dest_mode == APIC_DEST_PHYSICAL) {
+   if (irq->dest_id == 0xFF)
+   goto out;
+
+   if (irq->dest_id >= ARRAY_SIZE(map->phys_map)) {
+   WARN_ON_ONCE(1);
+   goto out;
+   }
+
+   dst = map->phys_map[irq->dest_id];
+   if (dst && kvm_apic_present(dst->vcpu))
+   *dest_vcpu = dst->vcpu;
+   else
+   goto out;
+   } else {
+   u16 cid;
+   unsigned long bitmap = 1;
+   int i, r = 0;
+
+   if (!kvm_apic_logical_map_valid(map)) {
+   WARN_ON_ONCE(1);
+   goto out;
+   }
+
+   apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
+
+   if (cid >= ARRAY_SIZE(map->logical_map)) {
+   WARN_ON_ONCE(1);
+   goto out;
+   }
+
+   for_each_set_bit(i, &bitmap, 16) {
+   dst = map->logical_map[cid][i];
+   if (++r == 2)
+   goto out;
+   }
+
+   if (dst && kvm_apic_present(dst->vcpu))
+   *dest_vcpu = dst->vcpu;
+   else
+   goto out;
+   }
+
+   ret = true;
+out:
+   rcu_read_unlock();
+   return ret;
+}
+
+
+bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
+struct kvm_vcpu **dest_vcpu)
+{
+   int i, r = 0;
+   struct kvm_vcpu *vcpu;
+
+   if (kvm_intr_is_single_vcpu_fast(kvm, irq, dest_vcpu))
+   return true;
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (!kvm_apic_present(vcpu))
+   continue;
+
+   if (!kvm_apic_match_dest(vcpu, NULL, irq->shorthand,
+   irq->dest_id, irq->dest_mode))
+   continue;
+
+   if (++r == 2)
+   return false;
+
+   *dest_vcpu = vcpu;
+   }
+
+   return r == 1;
+}
+EXPORT_SYMBOL_GPL(kvm_intr_is_single_vcpu);
+
 #define IOAPIC_ROUTING_ENTRY(irq) \
{ .gsi = irq, .type = KVM_IRQ_ROUTING_IRQCHIP,  \
  .u.irqchip = { .irqchip = KVM_IRQCHIP_IOAPIC, .pin = (irq) } }
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 2a5ca97..9848cd50 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -136,13 +136,12 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
 /* The logical map is definitely wrong if we have multiple
  * modes at the same time.  (Physical map is always right.)
  */
-static inline bool kvm_apic_logical_map_valid(struct kvm_apic_map *map)
+bool kvm_

[PATCH v8 05/13] KVM: make kvm_set_msi_irq() public

2015-09-16 Thread Feng Wu
Make kvm_set_msi_irq() public, we can use this function outside.

Signed-off-by: Feng Wu 
Reviewed-by: Paolo Bonzini 
---
v8:
- Export kvm_set_msi_irq() so we can use it in vmx code

 arch/x86/include/asm/kvm_host.h | 4 
 arch/x86/kvm/irq_comm.c | 5 +++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index af11bca..daa6126 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -175,6 +175,8 @@ enum {
  */
 #define KVM_APIC_PV_EOI_PENDING1
 
+struct kvm_kernel_irq_routing_entry;
+
 /*
  * We don't want allocation failures within the mmu code, so we preallocate
  * enough memory for a single page fault in a cache.
@@ -1207,4 +1209,6 @@ int x86_set_memory_region(struct kvm *kvm,
 bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
 struct kvm_vcpu **dest_vcpu);
 
+void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
+struct kvm_lapic_irq *irq);
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 97ba1d6..add52d8 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -91,8 +91,8 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct 
kvm_lapic *src,
return r;
 }
 
-static inline void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
-  struct kvm_lapic_irq *irq)
+void kvm_set_msi_irq(struct kvm_kernel_irq_routing_entry *e,
+struct kvm_lapic_irq *irq)
 {
trace_kvm_msi_set_irq(e->msi.address_lo, e->msi.data);
 
@@ -108,6 +108,7 @@ static inline void kvm_set_msi_irq(struct 
kvm_kernel_irq_routing_entry *e,
irq->level = 1;
irq->shorthand = 0;
 }
+EXPORT_SYMBOL_GPL(kvm_set_msi_irq);
 
 int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
struct kvm *kvm, int irq_source_id, int level, bool line_status)
-- 
2.1.0



[PATCH v8 02/13] KVM: Add some helper functions for Posted-Interrupts

2015-09-16 Thread Feng Wu
This patch adds some helper functions to manipulate the
Posted-Interrupts Descriptor.

Signed-off-by: Feng Wu 
Reviewed-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 271dd70..316f9bf 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -443,6 +443,8 @@ struct nested_vmx {
 };
 
 #define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
+
 /* Posted-Interrupt Descriptor */
 struct pi_desc {
u32 pir[8]; /* Posted interrupt requested */
@@ -483,6 +485,30 @@ static int pi_test_and_set_pir(int vector, struct pi_desc 
*pi_desc)
return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
 }
 
+static void pi_clear_sn(struct pi_desc *pi_desc)
+{
+   return clear_bit(POSTED_INTR_SN,
+   (unsigned long *)&pi_desc->control);
+}
+
+static void pi_set_sn(struct pi_desc *pi_desc)
+{
+   return set_bit(POSTED_INTR_SN,
+   (unsigned long *)&pi_desc->control);
+}
+
+static int pi_test_on(struct pi_desc *pi_desc)
+{
+   return test_bit(POSTED_INTR_ON,
+   (unsigned long *)&pi_desc->control);
+}
+
+static int pi_test_sn(struct pi_desc *pi_desc)
+{
+   return test_bit(POSTED_INTR_SN,
+   (unsigned long *)&pi_desc->control);
+}
+
 struct vcpu_vmx {
struct kvm_vcpu   vcpu;
unsigned long host_rsp;
-- 
2.1.0



[PATCH v8 04/13] KVM: Make struct kvm_irq_routing_table accessible

2015-09-16 Thread Feng Wu
Move struct kvm_irq_routing_table from irqchip.c to kvm_host.h,
so we can use it outside of irqchip.c.

Signed-off-by: Feng Wu 
Reviewed-by: Paolo Bonzini 
---
 include/linux/kvm_host.h | 14 ++
 virt/kvm/irqchip.c   | 10 --
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5ac8d21..5f183fb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -328,6 +328,20 @@ struct kvm_kernel_irq_routing_entry {
struct hlist_node link;
 };
 
+#ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
+
+struct kvm_irq_routing_table {
+   int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
+   u32 nr_rt_entries;
+   /*
+* Array indexed by gsi. Each entry contains list of irq chips
+* the gsi is connected to.
+*/
+   struct hlist_head map[0];
+};
+
+#endif
+
 #ifndef KVM_PRIVATE_MEM_SLOTS
 #define KVM_PRIVATE_MEM_SLOTS 0
 #endif
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 21c1424..2cf45d3 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -31,16 +31,6 @@
 #include 
 #include "irq.h"
 
-struct kvm_irq_routing_table {
-   int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
-   u32 nr_rt_entries;
-   /*
-* Array indexed by gsi. Each entry contains list of irq chips
-* the gsi is connected to.
-*/
-   struct hlist_head map[0];
-};
-
 int kvm_irq_map_gsi(struct kvm *kvm,
struct kvm_kernel_irq_routing_entry *entries, int gsi)
 {
-- 
2.1.0



Re: [PATCH v4 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-16 Thread Wanpeng Li

On 9/16/15 5:11 PM, Jan Kiszka wrote:

On 2015-09-16 09:19, Wanpeng Li wrote:

Enhance allocate/free_vpid to handle the shadow vpid.

Signed-off-by: Wanpeng Li 
---
  arch/x86/kvm/vmx.c | 23 +++
  1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9ff6a3f..c5222b8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4155,29 +4155,28 @@ static int alloc_identity_pagetable(struct kvm *kvm)
return r;
  }
  
-static void allocate_vpid(struct vcpu_vmx *vmx)

+static int allocate_vpid(void)
  {
int vpid;
  
-	vmx->vpid = 0;

if (!enable_vpid)
-   return;
+   return 0;
spin_lock(&vmx_vpid_lock);
vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
-   if (vpid < VMX_NR_VPIDS) {
-   vmx->vpid = vpid;
+   if (vpid < VMX_NR_VPIDS)
__set_bit(vpid, vmx_vpid_bitmap);
-   }
+   else
+   vpid = 0;
spin_unlock(&vmx_vpid_lock);
+   return vpid;
  }
  
-static void free_vpid(struct vcpu_vmx *vmx)

+static void free_vpid(int vpid)
  {
if (!enable_vpid)

|| vpid == 0

Otherwise you clear bit zero and cause the next allocate_vpid to return 0
from the bitmap.


Will do, thanks. :-)

Regards,
Wanpeng Li


Re: [PATCH v4 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-16 Thread Jan Kiszka
On 2015-09-16 09:19, Wanpeng Li wrote:
> Enhance allocate/free_vpid to handle the shadow vpid.
> 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 23 +++
>  1 file changed, 11 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9ff6a3f..c5222b8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4155,29 +4155,28 @@ static int alloc_identity_pagetable(struct kvm *kvm)
>   return r;
>  }
>  
> -static void allocate_vpid(struct vcpu_vmx *vmx)
> +static int allocate_vpid(void)
>  {
>   int vpid;
>  
> - vmx->vpid = 0;
>   if (!enable_vpid)
> - return;
> + return 0;
>   spin_lock(&vmx_vpid_lock);
>   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
> - if (vpid < VMX_NR_VPIDS) {
> - vmx->vpid = vpid;
> + if (vpid < VMX_NR_VPIDS)
>   __set_bit(vpid, vmx_vpid_bitmap);
> - }
> + else
> + vpid = 0;
>   spin_unlock(&vmx_vpid_lock);
> + return vpid;
>  }
>  
> -static void free_vpid(struct vcpu_vmx *vmx)
> +static void free_vpid(int vpid)
>  {
>   if (!enable_vpid)

|| vpid == 0

Otherwise you clear bit zero and cause the next allocate_vpid to return 0
from the bitmap.

Jan

>   return;
>   spin_lock(&vmx_vpid_lock);
> - if (vmx->vpid != 0)
> - __clear_bit(vmx->vpid, vmx_vpid_bitmap);
> + __clear_bit(vpid, vmx_vpid_bitmap);
>   spin_unlock(&vmx_vpid_lock);
>  }
>  
> @@ -8482,7 +8481,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
>  
>   if (enable_pml)
>   vmx_disable_pml(vmx);
> - free_vpid(vmx);
> + free_vpid(vmx->vpid);
>   leave_guest_mode(vcpu);
>   vmx_load_vmcs01(vcpu);
>   free_nested(vmx);
> @@ -8501,7 +8500,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm 
> *kvm, unsigned int id)
>   if (!vmx)
>   return ERR_PTR(-ENOMEM);
>  
> - allocate_vpid(vmx);
> + vmx->vpid = allocate_vpid();
>  
>   err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
>   if (err)
> @@ -8577,7 +8576,7 @@ free_msrs:
>  uninit_vcpu:
>   kvm_vcpu_uninit(&vmx->vcpu);
>  free_vcpu:
> - free_vpid(vmx);
> + free_vpid(vmx->vpid);
>   kmem_cache_free(kvm_vcpu_cache, vmx);
>   return ERR_PTR(err);
>  }
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 16/09/2015 10:57, Christian Borntraeger wrote:
> Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
>>
>>
>> On 15/09/2015 19:38, Paul E. McKenney wrote:
>>> Excellent points!
>>>
>>> Other options in such situations include the following:
>>>
>>> o   Rework so that the code uses call_rcu*() instead of *_expedited().
>>>
>>> o   Maintain a per-task or per-CPU counter so that every so many
>>> *_expedited() invocations instead uses the non-expedited
>>> counterpart.  (For example, synchronize_rcu instead of
>>> synchronize_rcu_expedited().)
>>
>> Or just use ratelimit (untested):
> 
> One of my tests was to always replace synchronize_sched_expedited with 
> synchronize_sched and things turned out to be even worse. Not sure if
> it makes sense to test your in-the-middle approach?

I don't think it applies here, since down_write/up_write is a
synchronous API.

If the revert isn't easy, I think backporting rcu_sync is the best bet.
 The issue is that rcu_sync doesn't eliminate synchronize_sched, it only
makes it more rare.  So it's possible that it isn't eliminating the root
cause of the problem.

Paolo


[PATCH v8 08/13] KVM: Implement IRQ bypass consumer callbacks for x86

2015-09-16 Thread Feng Wu
Implement the following callbacks for x86:

- kvm_arch_irq_bypass_add_producer
- kvm_arch_irq_bypass_del_producer
- kvm_arch_irq_bypass_stop: dummy callback
- kvm_arch_irq_bypass_resume: dummy callback

and set CONFIG_HAVE_KVM_IRQ_BYPASS for x86.

Signed-off-by: Feng Wu 
---
v8:
- Move the weak irq bypas stop and irq bypass start to this patch.
- Call kvm_x86_ops->update_pi_irte() instead of kvm_arch_update_pi_irte().

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/Kconfig|  1 +
 arch/x86/kvm/x86.c  | 44 +
 virt/kvm/eventfd.c  | 12 +++
 4 files changed, 58 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8c44286..0ddd353 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index c951d44..b90776f 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -30,6 +30,7 @@ config KVM
select HAVE_KVM_IRQCHIP
select HAVE_KVM_IRQFD
select IRQ_BYPASS_MANAGER
+   select HAVE_KVM_IRQ_BYPASS
select HAVE_KVM_IRQ_ROUTING
select HAVE_KVM_EVENTFD
select KVM_APIC_ARCHITECTURE
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9dcd501..79dac02 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -50,6 +50,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
 #define CREATE_TRACE_POINTS
@@ -8249,6 +8251,48 @@ bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);
 
+int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
+ struct irq_bypass_producer *prod)
+{
+   struct kvm_kernel_irqfd *irqfd =
+   container_of(cons, struct kvm_kernel_irqfd, consumer);
+
+   if (kvm_x86_ops->update_pi_irte) {
+   irqfd->producer = prod;
+   return kvm_x86_ops->update_pi_irte(irqfd->kvm,
+   prod->irq, irqfd->gsi, 1);
+   }
+
+   return -EINVAL;
+}
+
+void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
+ struct irq_bypass_producer *prod)
+{
+   int ret;
+   struct kvm_kernel_irqfd *irqfd =
+   container_of(cons, struct kvm_kernel_irqfd, consumer);
+
+   if (!kvm_x86_ops->update_pi_irte) {
+   WARN_ON(irqfd->producer != NULL);
+   return;
+   }
+
+   WARN_ON(irqfd->producer != prod);
+   irqfd->producer = NULL;
+
+   /*
+* When the producer of a consumer is unregistered, we change back to
+* remapped mode, so we can re-use the current implementation
+* when the irq is masked/disabled or the consumer side (KVM
+* in this case) doesn't want to receive the interrupts.
+   */
+   ret = kvm_x86_ops->update_pi_irte(irqfd->kvm, prod->irq, irqfd->gsi, 0);
+   if (ret)
+   printk(KERN_INFO "irq bypass consumer (token %p) unregistration"
+  " fails: %d\n", irqfd->consumer.token, ret);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index d7a230f..c0a56a1 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -256,6 +256,18 @@ static void irqfd_update(struct kvm *kvm, struct 
kvm_kernel_irqfd *irqfd)
write_seqcount_end(&irqfd->irq_entry_sc);
 }
 
+#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
+void __attribute__((weak)) kvm_arch_irq_bypass_stop(
+   struct irq_bypass_consumer *cons)
+{
+}
+
+void __attribute__((weak)) kvm_arch_irq_bypass_start(
+   struct irq_bypass_consumer *cons)
+{
+}
+#endif
+
 static int
 kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 {
-- 
2.1.0



[PATCH v8 07/13] KVM: x86: Update IRTE for posted-interrupts

2015-09-16 Thread Feng Wu
This patch adds the routine to update the IRTE for posted-interrupts
when the guest changes the interrupt configuration.

Signed-off-by: Feng Wu 
---
v8:
- Move 'kvm_arch_update_pi_irte' to vmx.c as a callback
- Only update the PI irte when VM has assigned devices
- Add a trace point for VT-d posted-interrupts when we update
  or disable it for a specific irq.

 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/trace.h| 33 
 arch/x86/kvm/vmx.c  | 83 +
 arch/x86/kvm/x86.c  |  2 +
 4 files changed, 121 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index daa6126..8c44286 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -862,6 +862,9 @@ struct kvm_x86_ops {
   gfn_t offset, unsigned long mask);
/* pmu operations of sub-arch */
const struct kvm_pmu_ops *pmu_ops;
+
+   int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
+ uint32_t guest_irq, bool set);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 4eae7c3..539a9e4 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -974,6 +974,39 @@ TRACE_EVENT(kvm_enter_smm,
  __entry->smbase)
 );
 
+/*
+ * Tracepoint for VT-d posted-interrupts.
+ */
+TRACE_EVENT(kvm_pi_irte_update,
+   TP_PROTO(unsigned int vcpu_id, unsigned int gsi,
+unsigned int gvec, u64 pi_desc_addr, bool set),
+   TP_ARGS(vcpu_id, gsi, gvec, pi_desc_addr, set),
+
+   TP_STRUCT__entry(
+   __field(unsigned int,   vcpu_id )
+   __field(unsigned int,   gsi )
+   __field(unsigned int,   gvec)
+   __field(u64,pi_desc_addr)
+   __field(bool,   set )
+   ),
+
+   TP_fast_assign(
+   __entry->vcpu_id= vcpu_id;
+   __entry->gsi= gsi;
+   __entry->gvec   = gvec;
+   __entry->pi_desc_addr   = pi_desc_addr;
+   __entry->set= set;
+   ),
+
+   TP_printk("VT-d PI is %s for this irq, vcpu %u, gsi: 0x%x, "
+ "gvec: 0x%x, pi_desc_addr: 0x%llx",
+ __entry->set ? "enabled and being updated" : "disabled",
+ __entry->vcpu_id,
+ __entry->gsi,
+ __entry->gvec,
+ __entry->pi_desc_addr)
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 316f9bf..5a25651 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "trace.h"
 #include "pmu.h"
@@ -605,6 +606,11 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu 
*vcpu)
return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
+{
+   return &(to_vmx(vcpu)->pi_desc);
+}
+
 #define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
 #define FIELD(number, name)[number] = VMCS12_OFFSET(name)
 #define FIELD64(number, name)  [number] = VMCS12_OFFSET(name), \
@@ -10344,6 +10350,81 @@ static void vmx_enable_log_dirty_pt_masked(struct kvm 
*kvm,
kvm_mmu_clear_dirty_pt_masked(kvm, memslot, offset, mask);
 }
 
+/*
+ * vmx_update_pi_irte - set IRTE for Posted-Interrupts
+ *
+ * @kvm: kvm
+ * @host_irq: host irq of the interrupt
+ * @guest_irq: gsi of the interrupt
+ * @set: set or unset PI
+ * returns 0 on success, < 0 on failure
+ */
+int vmx_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
+  uint32_t guest_irq, bool set)
+{
+   struct kvm_kernel_irq_routing_entry *e;
+   struct kvm_irq_routing_table *irq_rt;
+   struct kvm_lapic_irq irq;
+   struct kvm_vcpu *vcpu;
+   struct vcpu_data vcpu_info;
+   int idx, ret = -EINVAL;
+
+   if (!irq_remapping_cap(IRQ_POSTING_CAP) ||
+   (!kvm_arch_has_assigned_device(kvm)))
+   return 0;
+
+   idx = srcu_read_lock(&kvm->irq_srcu);
+   irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
+   BUG_ON(guest_irq >= irq_rt->nr_rt_entries);
+
+   hlist_for_each_entry(e, &irq_rt->map[guest_irq], link) {
+   if (e->type != KVM_IRQ_ROUTING_MSI)
+   continue;
+   /*
+* VT-d PI cannot support posting multicast/broadcast
+* interrupts to a vCPU, we still use interrupt remapping
+* for these kind of interrupts.
+*
+* For lowest-priority interrupts, we only support
+* those with single CPU as the destination, e.g. user
+* configures the interrupts via /proc/irq

[PATCH v8 06/13] vfio: Register/unregister irq_bypass_producer

2015-09-16 Thread Feng Wu
This patch adds the registration/unregistration of an
irq_bypass_producer for MSI/MSIx on vfio pci devices.

Signed-off-by: Feng Wu 
---
v8:
- Merge "[PATCH v7 08/17] vfio: Select IRQ_BYPASS_MANAGER for vfio PCI devices"
  into this patch.

v6:
- Make the add_consumer and del_consumer callbacks static
- Remove pointless INIT_LIST_HEAD to 'vdev->ctx[vector].producer.node)'
- Use dev_info instead of WARN_ON() when irq_bypass_register_producer fails
- Remove optional dummy callbacks for irq producer

 drivers/vfio/pci/Kconfig| 1 +
 drivers/vfio/pci/vfio_pci_intrs.c   | 9 +
 drivers/vfio/pci/vfio_pci_private.h | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 579d83b..02912f1 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -2,6 +2,7 @@ config VFIO_PCI
tristate "VFIO support for PCI devices"
depends on VFIO && PCI && EVENTFD
select VFIO_VIRQFD
+   select IRQ_BYPASS_MANAGER
help
  Support for the PCI VFIO bus driver.  This is required to make
  use of PCI drivers using the VFIO framework.
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 1f577b4..c65299d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -319,6 +319,7 @@ static int vfio_msi_set_vector_signal(struct 
vfio_pci_device *vdev,
 
if (vdev->ctx[vector].trigger) {
free_irq(irq, vdev->ctx[vector].trigger);
+   irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
kfree(vdev->ctx[vector].name);
eventfd_ctx_put(vdev->ctx[vector].trigger);
vdev->ctx[vector].trigger = NULL;
@@ -360,6 +361,14 @@ static int vfio_msi_set_vector_signal(struct 
vfio_pci_device *vdev,
return ret;
}
 
+   vdev->ctx[vector].producer.token = trigger;
+   vdev->ctx[vector].producer.irq = irq;
+   ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
+   if (unlikely(ret))
+   dev_info(&pdev->dev,
+   "irq bypass producer (token %p) registeration fails: %d\n",
+   vdev->ctx[vector].producer.token, ret);
+
vdev->ctx[vector].trigger = trigger;
 
return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index ae0e1b4..0e7394f 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -13,6 +13,7 @@
 
 #include 
 #include 
+#include 
 
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
@@ -29,6 +30,7 @@ struct vfio_pci_irq_ctx {
struct virqfd   *mask;
char*name;
boolmasked;
+   struct irq_bypass_producer  producer;
 };
 
 struct vfio_pci_device {
-- 
2.1.0



[PATCH v8 09/13] KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'

2015-09-16 Thread Feng Wu
This patch adds an arch-specific hook, kvm_arch_update_irqfd_routing(),
which is called when irqfd routing is updated. On the Intel side, it is
used to update the IRTE when VT-d posted-interrupts is used.

Signed-off-by: Feng Wu 
---
v8:
- Remove callback .arch_update()
- Remove kvm_arch_irqfd_init()
- Call kvm_arch_update_irqfd_routing() instead.

 arch/x86/kvm/x86.c   |  7 +++
 include/linux/kvm_host.h |  2 ++
 virt/kvm/eventfd.c   | 19 ++-
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 79dac02..e189a94 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8293,6 +8293,13 @@ void kvm_arch_irq_bypass_del_producer(struct 
irq_bypass_consumer *cons,
   " fails: %d\n", irqfd->consumer.token, ret);
 }
 
+int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
+  uint32_t guest_irq, bool set)
+{
+   return !kvm_x86_ops->update_pi_irte ? -EINVAL :
+   kvm_x86_ops->update_pi_irte(kvm, host_irq, guest_irq, set);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 5f183fb..feba1fb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1174,6 +1174,8 @@ void kvm_arch_irq_bypass_del_producer(struct 
irq_bypass_consumer *,
   struct irq_bypass_producer *);
 void kvm_arch_irq_bypass_stop(struct irq_bypass_consumer *);
 void kvm_arch_irq_bypass_start(struct irq_bypass_consumer *);
+int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
+ uint32_t guest_irq, bool set);
 #endif /* CONFIG_HAVE_KVM_IRQ_BYPASS */
 #endif
 
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index c0a56a1..89c9635 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -266,6 +266,12 @@ void __attribute__((weak)) kvm_arch_irq_bypass_start(
struct irq_bypass_consumer *cons)
 {
 }
+int  __attribute__((weak)) kvm_arch_update_irqfd_routing(
+   struct kvm *kvm, unsigned int host_irq,
+   uint32_t guest_irq, bool set)
+{
+   return 0;
+}
 #endif
 
 static int
@@ -582,13 +588,24 @@ kvm_irqfd_release(struct kvm *kvm)
  */
 void kvm_irq_routing_update(struct kvm *kvm)
 {
+   int ret;
struct kvm_kernel_irqfd *irqfd;
 
spin_lock_irq(&kvm->irqfds.lock);
 
-   list_for_each_entry(irqfd, &kvm->irqfds.items, list)
+   list_for_each_entry(irqfd, &kvm->irqfds.items, list) {
irqfd_update(kvm, irqfd);
 
+#ifdef CONFIG_HAVE_KVM_IRQ_BYPASS
+   if (irqfd->producer) {
+   ret = kvm_arch_update_irqfd_routing(
+   irqfd->kvm, irqfd->producer->irq,
+   irqfd->gsi, 1);
+   WARN_ON(ret);
+   }
+#endif
+   }
+
spin_unlock_irq(&kvm->irqfds.lock);
 }
 
-- 
2.1.0



[PATCH v8 11/13] KVM: Update Posted-Interrupts Descriptor when vCPU is blocked

2015-09-16 Thread Feng Wu
This patch updates the Posted-Interrupts Descriptor when vCPU
is blocked.

pre-block:
- Add the vCPU to the blocked per-CPU list
- Set 'NV' to POSTED_INTR_WAKEUP_VECTOR

post-block:
- Remove the vCPU from the per-CPU list

Signed-off-by: Feng Wu 
---
v8:
- Rename 'pi_pre_block' to 'pre_block'
- Rename 'pi_post_block' to 'post_block'
- Change some comments
- Only add the vCPU to the blocking list when the VM has assigned devices.

 arch/x86/include/asm/kvm_host.h |  13 
 arch/x86/kvm/vmx.c  | 157 +++-
 arch/x86/kvm/x86.c  |  53 +++---
 include/linux/kvm_host.h|   3 +
 virt/kvm/kvm_main.c |   3 +
 5 files changed, 217 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0ddd353..304fbb5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -552,6 +552,8 @@ struct kvm_vcpu_arch {
 */
bool write_fault_to_shadow_pgtable;
 
+   bool halted;
+
/* set at EPT violation at this point */
unsigned long exit_qualification;
 
@@ -864,6 +866,17 @@ struct kvm_x86_ops {
/* pmu operations of sub-arch */
const struct kvm_pmu_ops *pmu_ops;
 
+   /*
+* Architecture specific hooks for vCPU blocking due to
+* HLT instruction.
+* Returns for .pre_block():
+*- 0 means continue to block the vCPU.
+*- 1 means we cannot block the vCPU since some event
+*happens during this period, such as, 'ON' bit in
+*posted-interrupts descriptor is set.
+*/
+   int (*pre_block)(struct kvm_vcpu *vcpu);
+   void (*post_block)(struct kvm_vcpu *vcpu);
int (*update_pi_irte)(struct kvm *kvm, unsigned int host_irq,
  uint32_t guest_irq, bool set);
 };
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5ceb280..9888c43 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -879,6 +879,13 @@ static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
 static DEFINE_PER_CPU(struct list_head, loaded_vmcss_on_cpu);
 static DEFINE_PER_CPU(struct desc_ptr, host_gdt);
 
+/*
+ * We maintain a per-CPU linked list of vCPUs, so in wakeup_handler() we
+ * can find which vCPU should be woken up.
+ */
+static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
+static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
+
 static unsigned long *vmx_io_bitmap_a;
 static unsigned long *vmx_io_bitmap_b;
 static unsigned long *vmx_msr_bitmap_legacy;
@@ -1959,10 +1966,10 @@ static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int 
cpu)
/*
 * If 'nv' field is POSTED_INTR_WAKEUP_VECTOR, there
 * are two possible cases:
-* 1. After running 'pi_pre_block', context switch
+* 1. After running 'pre_block', context switch
 *happened. For this case, 'sn' was set in
 *vmx_vcpu_put(), so we need to clear it here.
-* 2. After running 'pi_pre_block', we were blocked,
+* 2. After running 'pre_block', we were blocked,
 *and woken up by some other guy. For this case,
 *we don't need to do anything, 'pi_post_block'
 *will do everything for us. However, we cannot
@@ -2985,6 +2992,8 @@ static int hardware_enable(void)
return -EBUSY;
 
INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+   INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
+   spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
 
/*
 * Now we can enable the vmclear operation in kdump
@@ -6105,6 +6114,25 @@ static void update_ple_window_actual_max(void)
ple_window_grow, INT_MIN);
 }
 
+/*
+ * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
+ */
+static void wakeup_handler(void)
+{
+   struct kvm_vcpu *vcpu;
+   int cpu = smp_processor_id();
+
+   spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+   list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
+   blocked_vcpu_list) {
+   struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+
+   if (pi_test_on(pi_desc) == 1)
+   kvm_vcpu_kick(vcpu);
+   }
+   spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+}
+
 static __init int hardware_setup(void)
 {
int r = -ENOMEM, i, msr;
@@ -6289,6 +6317,8 @@ static __init int hardware_setup(void)
kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
}
 
+   kvm_set_posted_intr_wakeup_handler(wakeup_handler);
+
return alloc_kvm_area();
 
 out8:
@@ -10414,6 +10444,126 @@ static void vmx_enable_log_dirty_pt_masked(struct kvm 
*kvm,
 }
 
 /*
+ * This routine does the following things for vCPU which is going
+ * to be blocked if VT-d PI

[PATCH v8 10/13] KVM: Update Posted-Interrupts Descriptor when vCPU is preempted

2015-09-16 Thread Feng Wu
This patch updates the Posted-Interrupts Descriptor when vCPU
is preempted.

sched out:
- Set 'SN' to suppress future non-urgent interrupts posted for
the vCPU.

sched in:
- Clear 'SN'
- Change NDST if vCPU is scheduled to a different CPU
- Set 'NV' to POSTED_INTR_VECTOR

Signed-off-by: Feng Wu 
---
v8:
- Add two wrapper functions, vmx_vcpu_pi_load() and vmx_vcpu_pi_put().
- Only handle VT-d PI related logic when the VM has assigned devices.

 arch/x86/kvm/vmx.c | 63 ++
 1 file changed, 63 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5a25651..5ceb280 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1943,6 +1943,52 @@ static void vmx_load_host_state(struct vcpu_vmx *vmx)
preempt_enable();
 }
 
+static void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
+{
+   struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+   struct pi_desc old, new;
+   unsigned int dest;
+
+   if (!irq_remapping_cap(IRQ_POSTING_CAP) ||
+   (!kvm_arch_has_assigned_device(vcpu->kvm)))
+   return;
+
+   do {
+   old.control = new.control = pi_desc->control;
+
+   /*
+* If 'nv' field is POSTED_INTR_WAKEUP_VECTOR, there
+* are two possible cases:
+* 1. After running 'pi_pre_block', context switch
+*happened. For this case, 'sn' was set in
+*vmx_vcpu_put(), so we need to clear it here.
+* 2. After running 'pi_pre_block', we were blocked,
+*and woken up by some other guy. For this case,
+*we don't need to do anything, 'pi_post_block'
+*will do everything for us. However, we cannot
+*check whether it is case #1 or case #2 here
+*(maybe, not needed), so we also clear sn here,
+*I think it is not a big deal.
+*/
+   if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR) {
+   if (vcpu->cpu != cpu) {
+   dest = cpu_physical_id(cpu);
+
+   if (x2apic_enabled())
+   new.ndst = dest;
+   else
+   new.ndst = (dest << 8) & 0xFF00;
+   }
+
+   /* set 'NV' to 'notification vector' */
+   new.nv = POSTED_INTR_VECTOR;
+   }
+
+   /* Allow posting non-urgent interrupts */
+   new.sn = 0;
+   } while (cmpxchg(&pi_desc->control, old.control,
+   new.control) != old.control);
+}
 /*
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
@@ -1993,10 +2039,27 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int 
cpu)
vmcs_writel(HOST_IA32_SYSENTER_ESP, sysenter_esp); /* 22.2.3 */
vmx->loaded_vmcs->cpu = cpu;
}
+
+   vmx_vcpu_pi_load(vcpu, cpu);
+}
+
+static void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
+{
+   struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+
+   if (!irq_remapping_cap(IRQ_POSTING_CAP) ||
+   (!kvm_arch_has_assigned_device(vcpu->kvm)))
+   return;
+
+   /* Set SN when the vCPU is preempted */
+   if (vcpu->preempted)
+   pi_set_sn(pi_desc);
 }
 
 static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   vmx_vcpu_pi_put(vcpu);
+
__vmx_load_host_state(to_vmx(vcpu));
if (!vmm_exclusive) {
__loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs);
-- 
2.1.0



[PATCH v8 12/13] KVM: Warn if 'SN' is set during posting interrupts by software

2015-09-16 Thread Feng Wu
Currently we do not support urgent interrupts; all interrupts are
recognized as non-urgent, so we cannot post interrupts when 'SN'
is set.

If the vCPU is in guest mode, it cannot have been scheduled out, and
being scheduled out is currently the only case in which 'SN' is set,
so warn if 'SN' is set.
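
pi_test_sn() below comes from the Posted-Interrupts helpers added earlier
in this series; as a rough sketch (assuming POSTED_INTR_SN is the bit
position of 'SN' within the descriptor's control word):

static int pi_test_sn(struct pi_desc *pi_desc)
{
	/* non-zero when 'SN' (Suppress Notification) is set */
	return test_bit(POSTED_INTR_SN,
			(unsigned long *)&pi_desc->control);
}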

Signed-off-by: Feng Wu 
Reviewed-by: Paolo Bonzini 
---
 arch/x86/kvm/vmx.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9888c43..58fbbc6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4498,6 +4498,22 @@ static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_SMP
if (vcpu->mode == IN_GUEST_MODE) {
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   /*
+* Currently, we don't support urgent interrupt,
+* all interrupts are recognized as non-urgent
+* interrupt, so we cannot post interrupts when
+* 'SN' is set.
+*
+* If the vcpu is in guest mode, it means it is
+* running instead of being scheduled out and
+* waiting in the run queue, and that's the only
+* case when 'SN' is set currently, warning if
+* 'SN' is set.
+*/
+   WARN_ON_ONCE(pi_test_sn(&vmx->pi_desc));
+
apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
POSTED_INTR_VECTOR);
return true;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 13/13] iommu/vt-d: Add a command line parameter for VT-d posted-interrupts

2015-09-16 Thread Feng Wu
Enable VT-d Posted-Interrupts and add a command-line
parameter for it.
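
With this in place, interrupt posting can be disabled independently of
remapping from the kernel command line, e.g. (assuming these are
sub-options of the existing intremap= parameter, as the documentation
hunk below suggests):

	intremap=on,nopost

while "intremap=off" now disables both remapping and posting.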

Signed-off-by: Feng Wu 
Reviewed-by: Paolo Bonzini 
---
 Documentation/kernel-parameters.txt |  1 +
 drivers/iommu/irq_remapping.c   | 12 
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f045..52aca36 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1547,6 +1547,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
nosid   disable Source ID checking
no_x2apic_optout
BIOS x2APIC opt-out request will be ignored
+   nopost  disable Interrupt Posting
 
iomem=  Disable strict checking of access to MMIO memory
strict  regions from userspace.
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 2d99930..d8c3997 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -22,7 +22,7 @@ int irq_remap_broken;
 int disable_sourceid_checking;
 int no_x2apic_optout;
 
-int disable_irq_post = 1;
+int disable_irq_post = 0;
 
 static int disable_irq_remap;
 static struct irq_remap_ops *remap_ops;
@@ -58,14 +58,18 @@ static __init int setup_irqremap(char *str)
return -EINVAL;
 
while (*str) {
-   if (!strncmp(str, "on", 2))
+   if (!strncmp(str, "on", 2)) {
disable_irq_remap = 0;
-   else if (!strncmp(str, "off", 3))
+   disable_irq_post = 0;
+   } else if (!strncmp(str, "off", 3)) {
disable_irq_remap = 1;
-   else if (!strncmp(str, "nosid", 5))
+   disable_irq_post = 1;
+   } else if (!strncmp(str, "nosid", 5))
disable_sourceid_checking = 1;
else if (!strncmp(str, "no_x2apic_optout", 16))
no_x2apic_optout = 1;
+   else if (!strncmp(str, "nopost", 6))
+   disable_irq_post = 1;
 
str += strcspn(str, ",");
while (*str == ',')
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v8 00/13] Add VT-d Posted-Interrupts support

2015-09-16 Thread Feng Wu
VT-d Posted-Interrupts is an enhancement to the CPU-side Posted-Interrupt mechanism.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.

You can find the VT-d Posted-Interrupts spec at the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

v8:
refer to the changelog in each patch

v7:
* Define two weak irq bypass callbacks:
  - kvm_arch_irq_bypass_start()
  - kvm_arch_irq_bypass_stop()
* Remove the x86 dummy implementation of the above two functions.
* Print some useful information instead of WARN_ON() when the
  irq bypass consumer unregistration fails.
* Fix an issue when calling pi_pre_block and pi_post_block.

v6:
* Rebase on 4.2.0-rc6
* Rebase on https://lkml.org/lkml/2015/8/6/526 and 
http://www.gossamer-threads.com/lists/linux/kernel/2235623
* Make the add_consumer and del_consumer callbacks static
* Remove pointless INIT_LIST_HEAD to 'vdev->ctx[vector].producer.node)'
* Use dev_info instead of WARN_ON() when irq_bypass_register_producer fails
* Remove optional dummy callbacks for irq producer

v4:
* For lowest-priority interrupts, only single-CPU destinations are
supported at the current stage; more general lowest-priority support
will be added later.
* According to Marcelo's suggestion, when the vCPU is blocked, we handle
the posted-interrupts in the HLT emulation path.
* Some small changes (coding style, typo, add some code comments)

v3:
* Adjust the Posted-interrupts Descriptor updating logic when vCPU is
  preempted or blocked.
* KVM_DEV_VFIO_DEVICE_POSTING_IRQ --> KVM_DEV_VFIO_DEVICE_POST_IRQ
* __KVM_HAVE_ARCH_KVM_VFIO_POSTING --> __KVM_HAVE_ARCH_KVM_VFIO_POST
* Add KVM_DEV_VFIO_DEVICE_UNPOST_IRQ attribute for VFIO irq, which
  can be used to change back to remapping mode.
* Fix typo

v2:
* Use VFIO framework to enable this feature, the VFIO part of this series is
  based on Eric's patch "[PATCH v3 0/8] KVM-VFIO IRQ forward control"
* Rebase this patchset on 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git,
  then revise some irq logic based on the new hierarchy irqdomain patches 
provided
  by Jiang Liu 

Feng Wu (13):
  KVM: Extend struct pi_desc for VT-d Posted-Interrupts
  KVM: Add some helper functions for Posted-Interrupts
  KVM: Define a new interface kvm_intr_is_single_vcpu()
  KVM: Make struct kvm_irq_routing_table accessible
  KVM: make kvm_set_msi_irq() public
  vfio: Register/unregister irq_bypass_producer
  KVM: x86: Update IRTE for posted-interrupts
  KVM: Implement IRQ bypass consumer callbacks for x86
  KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'
  KVM: Update Posted-Interrupts Descriptor when vCPU is preempted
  KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
  KVM: Warn if 'SN' is set during posting interrupts by software
  iommu/vt-d: Add a command line parameter for VT-d posted-interrupts

 Documentation/kernel-parameters.txt |   1 +
 arch/x86/include/asm/kvm_host.h |  24 +++
 arch/x86/kvm/Kconfig|   1 +
 arch/x86/kvm/irq_comm.c |  99 +-
 arch/x86/kvm/lapic.c|   5 +-
 arch/x86/kvm/lapic.h|   2 +
 arch/x86/kvm/trace.h|  33 
 arch/x86/kvm/vmx.c  | 361 +++-
 arch/x86/kvm/x86.c  | 106 ++-
 drivers/iommu/irq_remapping.c   |  12 +-
 drivers/vfio/pci/Kconfig|   1 +
 drivers/vfio/pci/vfio_pci_intrs.c   |   9 +
 drivers/vfio/pci/vfio_pci_private.h |   2 +
 include/linux/kvm_host.h|  19 ++
 virt/kvm/eventfd.c  |  31 +++-
 virt/kvm/irqchip.c  |  10 -
 virt/kvm/kvm_main.c |   3 +
 17 files changed, 687 insertions(+), 32 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V6 0/6] Fast mmio eventfd fixes

2015-09-16 Thread Paolo Bonzini


On 15/09/2015 21:26, Michael S. Tsirkin wrote:
> > Applied to kvm/queue and will send patches 1-4 for 4.3-rc.  Thanks!
> 
> I'd prefer at least 6 to be there as well:
> without 6 userspace can't safely use the code, and without 5,
> it can't trace it.

The idea is to just make old userspace work without crashing.  New
features do not belong in stable releases.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 10:32 schrieb Paolo Bonzini:
> 
> 
> On 15/09/2015 19:38, Paul E. McKenney wrote:
>> Excellent points!
>>
>> Other options in such situations include the following:
>>
>> o   Rework so that the code uses call_rcu*() instead of *_expedited().
>>
>> o   Maintain a per-task or per-CPU counter so that every so many
>>  *_expedited() invocations instead uses the non-expedited
>>  counterpart.  (For example, synchronize_rcu instead of
>>  synchronize_rcu_expedited().)
> 
> Or just use ratelimit (untested):

One of my tests was to always replace synchronize_sched_expedited with 
synchronize_sched and things turned out to be even worse. Not sure if
it makes sense to test your in-the-middle approach?

Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vhost: build failure

2015-09-16 Thread Michael S. Tsirkin
On Wed, Sep 16, 2015 at 01:50:08PM +0530, Sudip Mukherjee wrote:
> Hi,
> While crosscompiling the kernel for openrisc with allmodconfig the build
> failed with the error:
> drivers/vhost/vhost.c: In function 'vhost_vring_ioctl':
> drivers/vhost/vhost.c:818:3: error: call to '__compiletime_assert_818' 
> declared with attribute error: BUILD_BUG_ON failed: __alignof__
> *vq->avail > VRING_AVAIL_ALIGN_SIZE
> 
> Can you please give me any idea about what the problem might be and how
> it can be solved.
> 
> You can see the build log at:
> https://travis-ci.org/sudipm-mukherjee/parport/jobs/80365425
> 
> regards
> sudip

Yes - I think I saw this already.
I think the openrisc cross-compiler is broken.

VRING_AVAIL_ALIGN_SIZE is 2

*vq->avail is:

struct vring_avail {
__virtio16 flags;
__virtio16 idx;
__virtio16 ring[];
};

And __virtio16 is just a u16 with some sparse annotations.

Looking at openrisc architecture document:
Operand             Length   addr[3:0] if aligned
Halfword (or half)  2 bytes  xxx0

Type    C-TYPE        Sizeof  Alignment  Openrisc Equivalent
Short   Signed short  2       2          Signed halfword

and

16.1.2
Aggregates and Unions
Aggregates (structures and arrays) and unions assume the alignment of their most
strictly aligned element.

So to me, it looks like your gcc violates the ABI
by adding alignment requirements > 2.
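
A quick way to check what the cross-compiler actually does is to build a
trivial test case with it; a minimal sketch (substituting uint16_t for
__virtio16, which is what it is underneath):

#include <stdint.h>
#include <stdio.h>

struct vring_avail {
	uint16_t flags;
	uint16_t idx;
	uint16_t ring[];
};

int main(void)
{
	/* an ABI-conforming openrisc target should print 2 here */
	printf("alignof(struct vring_avail) = %u\n",
	       (unsigned int)_Alignof(struct vring_avail));
	return 0;
}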

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Paolo Bonzini


On 15/09/2015 19:38, Paul E. McKenney wrote:
> Excellent points!
> 
> Other options in such situations include the following:
> 
> o Rework so that the code uses call_rcu*() instead of *_expedited().
> 
> o Maintain a per-task or per-CPU counter so that every so many
>   *_expedited() invocations instead uses the non-expedited
>   counterpart.  (For example, synchronize_rcu instead of
>   synchronize_rcu_expedited().)

Or just use ratelimit (untested):

diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
index 834c4e52cb2d..8fb66b2aeed9 100644
--- a/include/linux/percpu-rwsem.h
+++ b/include/linux/percpu-rwsem.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct percpu_rw_semaphore {
unsigned int __percpu   *fast_read_ctr;
@@ -13,6 +14,7 @@ struct percpu_rw_semaphore {
struct rw_semaphore rw_sem;
atomic_tslow_read_ctr;
wait_queue_head_t   write_waitq;
+   struct ratelimit_state  expedited_ratelimit;
 };
 
 extern void percpu_down_read(struct percpu_rw_semaphore *);
diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
index f32567254867..c33f8bc89384 100644
--- a/kernel/locking/percpu-rwsem.c
+++ b/kernel/locking/percpu-rwsem.c
@@ -20,6 +20,8 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
atomic_set(&brw->write_ctr, 0);
atomic_set(&brw->slow_read_ctr, 0);
init_waitqueue_head(&brw->write_waitq);
+   /* Expedite one down_write and one up_write per second.  */
+   ratelimit_state_init(&brw->expedited_ratelimit, HZ, 2);
return 0;
 }
 
@@ -152,7 +156,10 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
 *fast-path, it executes a full memory barrier before we return.
 *See R_W case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
+   if (__ratelimit(&brw->expedited_ratelimit))
+   synchronize_sched_expedited();
+   else
+   synchronize_sched();
 
/* exclude other writers, and block the new readers completely */
down_write(&brw->rw_sem);
@@ -172,7 +179,10 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
 * Insert the barrier before the next fast-path in down_read,
 * see W_R case in the comment above update_fast_ctr().
 */
-   synchronize_sched_expedited();
+   if (__ratelimit(&brw->expedited_ratelimit))
+   synchronize_sched_expedited();
+   else
+   synchronize_sched();
/* the last writer unblocks update_fast_ctr() */
atomic_dec(&brw->write_ctr);
 }


> Note that synchronize_srcu_expedited() is less troublesome than are the
> other *_expedited() functions, because synchronize_srcu_expedited() does
> not inflict OS jitter on other CPUs.

Yup, synchronize_srcu_expedited() is just a busy wait and it can
complete extremely fast if you use SRCU as a "local RCU" rather
than a "sleepable RCU".  However it doesn't apply here since you
want to avoid SRCU's 2 memory barriers per lock/unlock.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


vhost: build failure

2015-09-16 Thread Sudip Mukherjee
Hi,
While crosscompiling the kernel for openrisc with allmodconfig the build
failed with the error:
drivers/vhost/vhost.c: In function 'vhost_vring_ioctl':
drivers/vhost/vhost.c:818:3: error: call to '__compiletime_assert_818' declared 
with attribute error: BUILD_BUG_ON failed: __alignof__
*vq->avail > VRING_AVAIL_ALIGN_SIZE

Can you please give me any idea about what the problem might be and how
it can be solved.

You can see the build log at:
https://travis-ci.org/sudipm-mukherjee/parport/jobs/80365425

regards
sudip
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] os-android: Add support to android platform, built by ndk-r10

2015-09-16 Thread Paolo Bonzini


On 15/09/2015 19:34, Houcheng Lin wrote:
> Hi Paolo,
> 
> (Please ignore the previous mail that did not include "qemu-devel")
> 
> Thanks for your review and suggestions. I'll fix this patch
> accordingly and please see my replies below.
> 
> best regards,
> Houcheng Lin
> 
> 2015-09-15 17:41 GMT+08:00 Paolo Bonzini :
> 
>> This is okay and can be done unconditionally (introduce a new
>> qemu_getdtablesize function that is defined in util/oslib-posix.c).
> 
> Will fix it.
>>
>>
>>> - sigtimewait(): call __rt_sigtimewait() instead.
>>> - lockf(): not see this feature in android, directly return -1.
>>> - shm_open(): not see this feature in android, directly return -1.
>>
>> This is not okay.  Please fix your libc instead.
> 
> I'll modify the bionic C library to support these functions and feedback
> to google's AOSP project. But the android kernel does not support shmem,

It doesn't support tmpfs?  /dev/shm is just a tmpfs.
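
If a tmpfs is mounted at /dev/shm, shm_open() is little more than open()
on a file under that mount; a minimal sketch of the idea (not the bionic
code, name hypothetical):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

static int my_shm_open(const char *name, int oflag, mode_t mode)
{
	char path[PATH_MAX];

	while (*name == '/')	/* POSIX shm names start with '/' */
		name++;
	snprintf(path, sizeof(path), "/dev/shm/%s", name);
	return open(path, oflag | O_NOFOLLOW | O_CLOEXEC, mode);
}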

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 kvmtool] Make static libc and guest-init functionality optional.

2015-09-16 Thread Dimitri John Ledkov
Hello Will,

Looks good to me =)

On 15 September 2015 at 18:20, Will Deacon  wrote:
> Hi Dmitri,
>
> On Fri, Sep 11, 2015 at 03:40:00PM +0100, Dimitri John Ledkov wrote:
>> If one typically only boots full disk-images, one wouldn't necessarily
>> want to statically link glibc for the guest-init feature of the
>> kvmtool, as statically linked glibc triggers heavy security
>> maintenance.
>>
>> Signed-off-by: Dimitri John Ledkov 
>> ---
>>  Changes since v1:
>>  - rename CONFIG_HAS_LIBC to CONFIG_GUEST_INIT for clarity
>>  - use more ifdefs, instead of runtime check of _binary_guest_init_size==0
>
> The idea looks good, but I think we can tidy some of this up at the same
> time by moving all the guest_init code in builtin_setup.c.
>
> How about the patch below?
>
> Will
>
> --->8
>
> From cdce942c1a3a04635065a7972ca4e21386664756 Mon Sep 17 00:00:00 2001
> From: Dimitri John Ledkov 
> Date: Fri, 11 Sep 2015 15:40:00 +0100
> Subject: [PATCH] Make static libc and guest-init functionality optional.
>
> If one typically only boots full disk-images, one wouldn't necessarily
> want to statically link glibc for the guest-init feature of the
> kvmtool, as statically linked glibc triggers heavy security
> maintenance.
>
> Signed-off-by: Dimitri John Ledkov 
> [will: moved all the guest_init handling into builtin_setup.c]
> Signed-off-by: Will Deacon 
> ---
>  Makefile| 12 +++-
>  builtin-run.c   | 29 +
>  builtin-setup.c | 19 ++-
>  include/kvm/builtin-setup.h |  1 +
>  4 files changed, 23 insertions(+), 38 deletions(-)
>
> diff --git a/Makefile b/Makefile
> index 7b17d529d13b..f1701aa7b8ec 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -34,8 +34,6 @@ bindir_SQ = $(subst ','\'',$(bindir))
>  PROGRAM:= lkvm
>  PROGRAM_ALIAS := vm
>
> -GUEST_INIT := guest/init
> -
>  OBJS   += builtin-balloon.o
>  OBJS   += builtin-debug.o
>  OBJS   += builtin-help.o
> @@ -279,8 +277,13 @@ ifeq ($(LTO),1)
> endif
>  endif
>
> -ifneq ($(call try-build,$(SOURCE_STATIC),,-static),y)
> -$(error No static libc found. Please install glibc-static package.)
> +ifeq ($(call try-build,$(SOURCE_STATIC),,-static),y)
> +   CFLAGS  += -DCONFIG_GUEST_INIT
> +   GUEST_INIT  := guest/init
> +   GUEST_OBJS  = guest/guest_init.o
> +else
> +   $(warning No static libc found. Skipping guest init)
> +   NOTFOUND+= static-libc
>  endif
>
>  ifeq (y,$(ARCH_WANT_LIBFDT))
> @@ -356,7 +359,6 @@ c_flags = -Wp,-MD,$(depfile) $(CFLAGS)
>  # $(OTHEROBJS) are things that do not get substituted like this.
>  #
>  STATIC_OBJS = $(patsubst %.o,%.static.o,$(OBJS) $(OBJS_STATOPT))
> -GUEST_OBJS = guest/guest_init.o
>
>  $(PROGRAM)-static:  $(STATIC_OBJS) $(OTHEROBJS) $(GUEST_INIT)
> $(E) "  LINK" $@
> diff --git a/builtin-run.c b/builtin-run.c
> index 1ee75ad3f010..e0c87329e52b 100644
> --- a/builtin-run.c
> +++ b/builtin-run.c
> @@ -59,9 +59,6 @@ static int  kvm_run_wrapper;
>
>  bool do_debug_print = false;
>
> -extern char _binary_guest_init_start;
> -extern char _binary_guest_init_size;
> -
>  static const char * const run_usage[] = {
> "lkvm run [] []",
> NULL
> @@ -345,30 +342,6 @@ void kvm_run_help(void)
> usage_with_options(run_usage, options);
>  }
>
> -static int kvm_setup_guest_init(struct kvm *kvm)
> -{
> -   const char *rootfs = kvm->cfg.custom_rootfs_name;
> -   char tmp[PATH_MAX];
> -   size_t size;
> -   int fd, ret;
> -   char *data;
> -
> -   /* Setup /virt/init */
> -   size = (size_t)&_binary_guest_init_size;
> -   data = (char *)&_binary_guest_init_start;
> -   snprintf(tmp, PATH_MAX, "%s%s/virt/init", kvm__get_dir(), rootfs);
> -   remove(tmp);
> -   fd = open(tmp, O_CREAT | O_WRONLY, 0755);
> -   if (fd < 0)
> -   die("Fail to setup %s", tmp);
> -   ret = xwrite(fd, data, size);
> -   if (ret < 0)
> -   die("Fail to setup %s", tmp);
> -   close(fd);
> -
> -   return 0;
> -}
> -
>  static int kvm_run_set_sandbox(struct kvm *kvm)
>  {
> const char *guestfs_name = kvm->cfg.custom_rootfs_name;
> @@ -631,7 +604,7 @@ static struct kvm *kvm_cmd_run_init(int argc, const char **argv)
>
> if (!kvm->cfg.no_dhcp)
> strcat(real_cmdline, "  ip=dhcp");
> -   if (kvm_setup_guest_init(kvm))
> +   if (kvm_setup_guest_init(kvm->cfg.custom_rootfs_name))
> die("Failed to setup init for guest.");
> }
> } else if (!strstr(real_cmdline, "root=")) {
> diff --git a/builtin-setup.c b/builtin-setup.c
> index 8b45c5645ad4..40fef15dbbe4 100644
> --- a/builtin-setup.c
> +++ b/builtin-setup.c
> @@ -16,9 +16,6 @@
>  #include 
>  #include 
>
> -extern char _binary_guest_init_start;
> -extern char _binary_guest_init_size;
> -
>  stati

Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-16 Thread Christian Borntraeger
Am 16.09.2015 um 03:24 schrieb Tejun Heo:
> Hello, Paul.
> 
> On Tue, Sep 15, 2015 at 04:38:18PM -0700, Paul E. McKenney wrote:
>> Well, the decision as to what is too big for -stable is owned by the
>> -stable maintainers, not by me.
> 
> Is it tho?  Usually the subsystem maintainer knows the best and has
> most say in it.  I was mostly curious whether you'd think that the
> changes would be too risky.  If not, great.
> 
>> I am suggesting trying the options and seeing what works best, then
>> working to convince people as needed.
> 
> Yeah, sure thing.  Let's wait for Christian.

Well, I have optimized my testcase so that it now puts enough pressure on
the system to confuse systemd (the older 209 version, which still has
some event loop issues): systemd restarts the journal daemon and does
several other recoveries.
To avoid regressions - even for somewhat shaky userspaces - we should
consider a revert for 4.2 stable.
There are several follow-up patches, which make the revert non-trivial,
though.

The rework of the percpu rwsem seems to work fine, but we are beyond the
merge window so 4.4 seems better to me. (and consider a revert for 4.3)

Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 0/2] KVM: nested VPID emulation

2015-09-16 Thread Wanpeng Li
v3 -> v4:
 * return 0 if vpid == VMX_NR_VPIDS
 * skip vpid != 0 check

v2 -> v3:
 * enhance allocate/free_vpid as Jan's suggestion
 * add more comments to 2/2

v1 -> v2:
 * enhance allocate/free_vpid to handle shadow vpid
 * drop empty space
 * allocate shadow vpid during initialization
 * For each nested vmentry, if vpid12 is changed, reuse shadow vpid w/ an 
   invvpid.

VPID is used to tag the address space and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes the VPID when switching
between L1 and L2.

This patch advertises VPID to the L1 hypervisor, so that the address spaces of
L1 and L2 can be tagged separately and a TLB flush can be avoided when switching
between L1 and L2.

Performance: 

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host     OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                       ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
-------- ------------- ------ ------ ------ ------ ------ ------- -------
kernel   Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.6     2.88000  nested VPID
kernel   Linux 3.5.0-1 1.2600 1.4300 1.5600   12.7   12.9 3.49000 7.46000  vanilla

Wanpeng Li (2):
  KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid
  KVM: nVMX: nested VPID emulation

 arch/x86/kvm/vmx.c | 60 ++
 1 file changed, 42 insertions(+), 18 deletions(-)

-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 2/2] KVM: nVMX: nested VPID emulation

2015-09-16 Thread Wanpeng Li
VPID is used to tag the address space and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes the VPID when switching
between L1 and L2.

This patch advertises VPID to the L1 hypervisor, so that the address spaces of
L1 and L2 can be tagged separately and a TLB flush can be avoided when switching
between L1 and L2. For each nested vmentry, if vpid12 has changed, the shadow
vpid is reused with an invvpid.
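
For reference, the vpid02 used below is allocated with the
allocate_vpid()/free_vpid() helpers reworked in 1/2 (not shown in this
patch); roughly, following the v4 changelog (fall back to vpid 0 when the
bitmap is exhausted):

static int allocate_vpid(void)
{
	int vpid;

	if (!enable_vpid)
		return 0;

	spin_lock(&vmx_vpid_lock);
	vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
	if (vpid < VMX_NR_VPIDS)
		__set_bit(vpid, vmx_vpid_bitmap);
	else
		vpid = 0;	/* vpid 0 means "none", callers fall back to flushing */
	spin_unlock(&vmx_vpid_lock);

	return vpid;
}

static void free_vpid(int vpid)
{
	if (!enable_vpid || vpid == 0)
		return;

	spin_lock(&vmx_vpid_lock);
	__clear_bit(vpid, vmx_vpid_bitmap);
	spin_unlock(&vmx_vpid_lock);
}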

Performance: 

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host     OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                       ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
-------- ------------- ------ ------ ------ ------ ------ ------- -------
kernel   Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.6     2.88000  nested VPID
kernel   Linux 3.5.0-1 1.2600 1.4300 1.5600   12.7   12.9 3.49000 7.46000  vanilla

Reviewed-by: Jan Kiszka 
Suggested-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c5222b8..780a2ed 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -424,6 +424,9 @@ struct nested_vmx {
/* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
u64 vmcs01_debugctl;
 
+   u16 vpid02;
+   u16 last_vpid;
+
u32 nested_vmx_procbased_ctls_low;
u32 nested_vmx_procbased_ctls_high;
u32 nested_vmx_true_procbased_ctls_low;
@@ -1155,6 +1158,11 @@ static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 }
 
+static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
+}
+
 static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
 {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
@@ -2469,6 +2477,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+   SECONDARY_EXEC_ENABLE_VPID |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_WBINVD_EXITING |
@@ -6663,6 +6672,7 @@ static void free_nested(struct vcpu_vmx *vmx)
return;
 
vmx->nested.vmxon = false;
+   free_vpid(vmx->nested.vpid02);
nested_release_vmcs12(vmx);
if (enable_shadow_vmcs)
free_vmcs(vmx->nested.current_shadow_vmcs);
@@ -8548,8 +8558,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
goto free_vmcs;
}
 
-   if (nested)
+   if (nested) {
nested_vmx_setup_ctls_msrs(vmx);
+   vmx->nested.vpid02 = allocate_vpid();
+   }
 
vmx->nested.posted_intr_nv = -1;
vmx->nested.current_vmptr = -1ull;
@@ -8570,6 +8582,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
return &vmx->vcpu;
 
 free_vmcs:
+   free_vpid(vmx->nested.vpid02);
free_loaded_vmcs(vmx->loaded_vmcs);
 free_msrs:
kfree(vmx->guest_msrs);
@@ -9445,12 +9458,24 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 
if (enable_vpid) {
/*
-* Trivially support vpid by letting L2s share their parent
-* L1's vpid. TODO: move to a more elaborate solution, giving
-* each L2 its own vpid and exposing the vpid feature to L1.
+* There is no direct mapping between vpid02 and vpid12, the
+* vpid02 is per-vCPU for L0 and reused while the value of
+* vpid12 is changed w/ one invvpid during nested vmentry.
+* The vpid12 is allocated by L1 for L2, so it will not
+* influence global bitmap(for vpid01 and vpid02 allocation)
+* even if spawn a lot of nested vCPUs.
 */
-   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
-   vmx_flush_tlb(vcpu);
+   if (nested_cpu_has_vpid(vmcs12)) {
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
+   if (vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
+   vmx->nested.last_vpid = vmcs12->virtual_processor_id;
+   vmx_flush_tlb(vcpu);
+   }
+   } else {
+   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
+   vmx_flush_tlb(vcpu);
+   }
+
}
 
