RE: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-12-20 Thread Wu, Feng


> -----Original Message-----
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Yang Zhang
> Sent: Monday, December 21, 2015 9:50 AM
> To: Wu, Feng ; pbonz...@redhat.com;
> rkrc...@redhat.com
> Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d
> posted-interrupts
> 
> On 2015/12/16 9:37, Feng Wu wrote:
> > Use vector-hashing to deliver lowest-priority interrupts for
> > VT-d posted-interrupts.
> >
> > Signed-off-by: Feng Wu 
> > ---
> >   arch/x86/kvm/lapic.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >   arch/x86/kvm/lapic.h |  2 ++
> >   arch/x86/kvm/vmx.c   | 12 ++++++++++--
> >   3 files changed, 79 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index e29001f..d4f2c8f 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -854,6 +854,73 @@ out:
> >   }
> >
> >   /*
> > + * This routine handles lowest-priority interrupts using the vector-hashing
> > + * mechanism. As an example, modern Intel CPUs use this method to handle
> > + * lowest-priority interrupts.
> > + *
> > + * Here are the details of the vector-hashing mechanism:
> > + * 1. For lowest-priority interrupts, store all the possible destination
> > + *vCPUs in an array.
> > + * 2. Use "guest vector % max number of destination vCPUs" to find the right
> > + *destination vCPU in the array for the lowest-priority interrupt.
> > + */
> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> > + struct kvm_lapic_irq *irq)
> > +{
> > +   struct kvm_apic_map *map;
> > +   struct kvm_vcpu *vcpu = NULL;
> > +
> > +   if (irq->shorthand)
> > +   return NULL;
> > +
> > +   rcu_read_lock();
> > +   map = rcu_dereference(kvm->arch.apic_map);
> > +
> > +   if (!map)
> > +   goto out;
> > +
> > +   if ((irq->dest_mode != APIC_DEST_PHYSICAL) &&
> > +   kvm_lowest_prio_delivery(irq)) {
> > +   u16 cid;
> > +   int i, idx = 0;
> > +   unsigned long bitmap = 1;
> > +   unsigned int dest_vcpus = 0;
> > +   struct kvm_lapic **dst = NULL;
> > +
> > +
> > +   if (!kvm_apic_logical_map_valid(map))
> > +   goto out;
> > +
> > +   apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
> > +
> > +   if (cid >= ARRAY_SIZE(map->logical_map))
> > +   goto out;
> > +
> > +   dst = map->logical_map[cid];
> > +
> > +   for_each_set_bit(i, &bitmap, 16) {
> > +   if (!dst[i] || !kvm_lapic_enabled(dst[i]->vcpu)) {
> > +   clear_bit(i, &bitmap);
> > +   continue;
> > +   }
> > +   }
> > +
> > +   dest_vcpus = hweight16(bitmap);
> > +
> > +   if (dest_vcpus != 0) {
> > +   idx = kvm_vector_2_index(irq->vector, dest_vcpus,
> > +&bitmap, 16);
> > +   vcpu = dst[idx-1]->vcpu;
> > +   }
> > +   }
> > +
> > +out:
> > +   rcu_read_unlock();
> > +   return vcpu;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
> > +
> > +/*
> >* Add a pending IRQ into lapic.
> >* Return 1 if successfully added and 0 if discarded.
> >*/
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 6890ef0..52bffce 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -172,4 +172,6 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm,
> struct kvm_lapic_irq *irq,
> > struct kvm_vcpu **dest_vcpu);
> >   int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
> >const unsigned long *bitmap, u32 bitmap_size);
> > +struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
> > + struct kvm_lapic_irq *irq);
> >   #endif
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 5eb56ed..3f89189 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -10702,8 +10702,16 @@ static int vmx_update_pi_irte(struct kvm *kvm,
> unsigned int host_irq,
> >  */
> >
> > kvm_set_msi_irq(e, &irq);
> > -   if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu))
> > -   continue;
> > +
> > +   if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
> > +   if (!kvm_vector_hashing_enabled() ||
> > +   irq.delivery_mode != APIC_DM_LOWEST)
> > +   continue;
> > +
> > +   vcpu = kvm_intr_vector_hashing_dest(kvm, &irq);
> > +   if (!vcpu)
> > +   continue;
> > +   }
> 
> I am a little confused with the 'continue'. If the destination is not
> single vcpu, shouldn't we rollback to use non-PI mode?

Here is the logic:
- If it is single destination, we will use PI no matter it is fixed or
lowest-priority.
- If it is not single destination:
a) It is fixed, we will use non-PI
b) It is lowest-priority and vector-hashing is enabled, we will use PI
c) It is lowest-priority and vector-hashing is disabled, we will use non-PI

Re: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-12-20 Thread Yang Zhang

On 2015/12/21 9:55, Wu, Feng wrote:




-----Original Message-----
From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
ow...@vger.kernel.org] On Behalf Of Yang Zhang
Sent: Monday, December 21, 2015 9:50 AM
To: Wu, Feng ; pbonz...@redhat.com;
rkrc...@redhat.com
Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d
posted-interrupts

On 2015/12/16 9:37, Feng Wu wrote:

Use vector-hashing to deliver lowest-priority interrupts for
VT-d posted-interrupts.

Signed-off-by: Feng Wu 
---
   arch/x86/kvm/lapic.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
   arch/x86/kvm/lapic.h |  2 ++
   arch/x86/kvm/vmx.c   | 12 ++++++++++--
   3 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e29001f..d4f2c8f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -854,6 +854,73 @@ out:
   }

   /*
+ * This routine handles lowest-priority interrupts using the vector-hashing
+ * mechanism. As an example, modern Intel CPUs use this method to handle
+ * lowest-priority interrupts.
+ *
+ * Here are the details of the vector-hashing mechanism:
+ * 1. For lowest-priority interrupts, store all the possible destination
+ *vCPUs in an array.
+ * 2. Use "guest vector % max number of destination vCPUs" to find the right
+ *destination vCPU in the array for the lowest-priority interrupt.
+ */
+struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
+ struct kvm_lapic_irq *irq)
+{
+   struct kvm_apic_map *map;
+   struct kvm_vcpu *vcpu = NULL;
+
+   if (irq->shorthand)
+   return NULL;
+
+   rcu_read_lock();
+   map = rcu_dereference(kvm->arch.apic_map);
+
+   if (!map)
+   goto out;
+
+   if ((irq->dest_mode != APIC_DEST_PHYSICAL) &&
+   kvm_lowest_prio_delivery(irq)) {
+   u16 cid;
+   int i, idx = 0;
+   unsigned long bitmap = 1;
+   unsigned int dest_vcpus = 0;
+   struct kvm_lapic **dst = NULL;
+
+
+   if (!kvm_apic_logical_map_valid(map))
+   goto out;
+
+   apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
+
+   if (cid >= ARRAY_SIZE(map->logical_map))
+   goto out;
+
+   dst = map->logical_map[cid];
+
+   for_each_set_bit(i, &bitmap, 16) {
+   if (!dst[i] || !kvm_lapic_enabled(dst[i]->vcpu)) {
+   clear_bit(i, &bitmap);
+   continue;
+   }
+   }
+
+   dest_vcpus = hweight16(bitmap);
+
+   if (dest_vcpus != 0) {
+   idx = kvm_vector_2_index(irq->vector, dest_vcpus,
+&bitmap, 16);
+   vcpu = dst[idx-1]->vcpu;
+   }
+   }
+
+out:
+   rcu_read_unlock();
+   return vcpu;
+}
+EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
+
+/*
* Add a pending IRQ into lapic.
* Return 1 if successfully added and 0 if discarded.
*/
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 6890ef0..52bffce 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -172,4 +172,6 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm,

struct kvm_lapic_irq *irq,

struct kvm_vcpu **dest_vcpu);
   int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
   const unsigned long *bitmap, u32 bitmap_size);
+struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
+ struct kvm_lapic_irq *irq);
   #endif
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5eb56ed..3f89189 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10702,8 +10702,16 @@ static int vmx_update_pi_irte(struct kvm *kvm,

unsigned int host_irq,

 */

kvm_set_msi_irq(e, &irq);
-   if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu))
-   continue;
+
+   if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
+   if (!kvm_vector_hashing_enabled() ||
+   irq.delivery_mode != APIC_DM_LOWEST)
+   continue;
+
+   vcpu = kvm_intr_vector_hashing_dest(kvm, &irq);
+   if (!vcpu)
+   continue;
+   }


I am a little confused with the 'continue'. If the destination is not
single vcpu, shouldn't we rollback to use non-PI mode?


Here is the logic:
- If it is single destination, we will use PI no matter it is fixed or 
lowest-priority.
- If it is not single destination:
a) It is fixed, we will use non-PI
b) It is lowest-priority and vector-hashing is enabled, we will use PI
c) It is lowest-priority and vector-hashing is disabled, we will use non-PI
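
In code form, the choice above reduces to the sketch below. This is
illustrative only, assuming the helpers quoted from the patch; use_pi() and
use_remapped_mode() are hypothetical placeholders, not kernel APIs:

#include <stdbool.h>

/* Illustrative stand-ins; these are NOT the kernel definitions. */
struct kvm;
struct kvm_vcpu;
struct kvm_lapic_irq { unsigned int delivery_mode; };
#define APIC_DM_LOWEST 0x100

bool kvm_intr_is_single_vcpu(struct kvm *kvm, struct kvm_lapic_irq *irq,
			     struct kvm_vcpu **vcpu);
bool kvm_vector_hashing_enabled(void);
struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
					      struct kvm_lapic_irq *irq);
void use_pi(struct kvm_vcpu *vcpu);	/* hypothetical placeholder */
void use_remapped_mode(void);		/* hypothetical placeholder */

/* The fallback logic described above, in one function. */
static void pi_irte_decision(struct kvm *kvm, struct kvm_lapic_irq *irq)
{
	struct kvm_vcpu *vcpu;

	if (kvm_intr_is_single_vcpu(kvm, irq, &vcpu)) {
		use_pi(vcpu);		/* single dest: always PI */
		return;
	}
	if (kvm_vector_hashing_enabled() &&
	    irq->delivery_mode == APIC_DM_LOWEST &&
	    (vcpu = kvm_intr_vector_hashing_dest(kvm, irq))) {
		use_pi(vcpu);		/* hashed lowest-priority: PI */
		return;
	}
	use_remapped_mode();		/* everything else: non-PI */
}

vmx_update_pi_irte() in the quoted patch implements this same flow for each
routing entry.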

Re: [PATCH kernel] vfio: Add explicit alignments in vfio_iommu_spapr_tce_create

2015-12-20 Thread David Gibson
On Fri, Dec 18, 2015 at 12:35:47PM +1100, Alexey Kardashevskiy wrote:
> The vfio_iommu_spapr_tce_create struct has 4x32bit and 2x64bit fields
> which should have resulted in sizeof(vfio_iommu_spapr_tce_create) equal
> to 32 bytes. However due to the gcc's default alignment, the actual
> size of this struct is 40 bytes.
> 
> This fills gaps with __resv1/2 fields.
> 
> This should not cause any change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy 

Oops, that was a bit sloppy.  Oh well.

Acked-by: David Gibson 

> ---
>  include/uapi/linux/vfio.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9fd7b5d..d117233 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -568,8 +568,10 @@ struct vfio_iommu_spapr_tce_create {
>   __u32 flags;
>   /* in */
>   __u32 page_shift;
> + __u32 __resv1;
>   __u64 window_size;
>   __u32 levels;
> + __u32 __resv2;
>   /* out */
>   __u64 start_addr;
>  };
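
Both layouts come out to 40 bytes under the default x86-64 ABI; the fix just
makes the compiler-inserted padding explicit. A minimal stand-alone sketch
with user-space stand-in types (not the uapi header itself):

#include <stdint.h>
#include <stdio.h>

/* Old layout: the u64 members force 8-byte struct alignment, so gcc
 * silently pads after page_shift and after levels. */
struct tce_create_old {
	uint32_t argsz;
	uint32_t flags;
	uint32_t page_shift;
	uint64_t window_size;
	uint32_t levels;
	uint64_t start_addr;
};

/* Fixed layout: the same 40 bytes, but every byte is now explicit. */
struct tce_create_new {
	uint32_t argsz;
	uint32_t flags;
	uint32_t page_shift;
	uint32_t __resv1;
	uint64_t window_size;
	uint32_t levels;
	uint32_t __resv2;
	uint64_t start_addr;
};

int main(void)
{
	printf("old: %zu bytes\n", sizeof(struct tce_create_old)); /* 40 */
	printf("new: %zu bytes\n", sizeof(struct tce_create_new)); /* 40 */
	return 0;
}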

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-priority interrupts

2015-12-20 Thread Yang Zhang

On 2015/12/16 9:37, Feng Wu wrote:

Use vector-hashing to deliver lowest-priority interrupts. As an
example, modern Intel CPUs in server platforms use this method to
handle lowest-priority interrupts.

Signed-off-by: Feng Wu 
---
  arch/x86/kvm/irq_comm.c | 27 ++++++++++++++++++++-------
  arch/x86/kvm/lapic.c    | 57 ++++++++++++++++++++++++++++++++++++++-----------
  arch/x86/kvm/lapic.h    |  2 ++
  arch/x86/kvm/x86.c      |  9 +++++++++
  arch/x86/kvm/x86.h      |  1 +
  5 files changed, 81 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 84b96d3..c8c5f61 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -32,6 +32,7 @@
  #include "ioapic.h"

  #include "lapic.h"
+#include "x86.h"

  static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
   struct kvm *kvm, int irq_source_id, int level,
@@ -53,8 +54,10 @@ static int kvm_set_ioapic_irq(struct 
kvm_kernel_irq_routing_entry *e,
  int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
struct kvm_lapic_irq *irq, unsigned long *dest_map)
  {
-   int i, r = -1;
+   int i, r = -1, idx = 0;
struct kvm_vcpu *vcpu, *lowest = NULL;
+   unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
+   unsigned int dest_vcpus = 0;

if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
kvm_lowest_prio_delivery(irq)) {
@@ -65,6 +68,8 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct 
kvm_lapic *src,
if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
return r;

+   memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
+
kvm_for_each_vcpu(i, vcpu, kvm) {
if (!kvm_apic_present(vcpu))
continue;
@@ -78,13 +83,25 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct 
kvm_lapic *src,
r = 0;
r += kvm_apic_set_irq(vcpu, irq, dest_map);
} else if (kvm_lapic_enabled(vcpu)) {
-   if (!lowest)
-   lowest = vcpu;
-   else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
-   lowest = vcpu;
+   if (!kvm_vector_hashing_enabled()) {
+   if (!lowest)
+   lowest = vcpu;
+   else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
+   lowest = vcpu;
+   } else {
+   __set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
+   dest_vcpus++;
+   }
}
}

+   if (dest_vcpus != 0) {
+   idx = kvm_vector_2_index(irq->vector, dest_vcpus,
+dest_vcpu_bitmap, KVM_MAX_VCPUS);
+
+   lowest = kvm_get_vcpu(kvm, idx - 1);
+   }
+
if (lowest)
r = kvm_apic_set_irq(lowest, irq, dest_map);

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ecd4ea1..e29001f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -678,6 +678,22 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct 
kvm_lapic *source,
}
  }

+int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
+  const unsigned long *bitmap, u32 bitmap_size)
+{
+   u32 mod;
+   int i, idx = 0;
+
+   mod = vector % dest_vcpus;
+
+   for (i = 0; i <= mod; i++) {
+   idx = find_next_bit(bitmap, bitmap_size, idx) + 1;
+   BUG_ON(idx > bitmap_size);
+   }
+
+   return idx;
+}
+
  bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map)
  {
@@ -731,17 +747,38 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, 
struct kvm_lapic *src,
dst = map->logical_map[cid];

if (kvm_lowest_prio_delivery(irq)) {
-   int l = -1;
-   for_each_set_bit(i, &bitmap, 16) {
-   if (!dst[i])
-   continue;
-   if (l < 0)
-   l = i;
-   else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
-   l = i;
+   if (!kvm_vector_hashing_enabled()) {
+   int l = -1;
+   for_each_set_bit(i, &bitmap, 16) {
+   if (!dst[i])
+   continue;
+   if (l < 0)
+   l = i;
+   else if 
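
The selection done by kvm_vector_2_index() above can be reproduced
stand-alone. A minimal user-space sketch (illustrative only, not the kernel
helper):

#include <stdint.h>
#include <stdio.h>

/* Pick the (vector % dest_vcpus)-th set bit, 1-based, out of a 16-bit
 * destination bitmap; the caller guarantees dest_vcpus equals the number
 * of set bits, so the scan cannot run off the end. */
static int vector_to_index(uint32_t vector, uint32_t dest_vcpus,
			   uint16_t bitmap)
{
	uint32_t mod = vector % dest_vcpus;
	int idx = 0;

	for (uint32_t i = 0; i <= mod; i++) {
		while (!(bitmap & (1u << idx)))	/* next set bit at or after idx */
			idx++;
		idx++;				/* 1-based, like the kernel helper */
	}
	return idx;
}

int main(void)
{
	/* vCPUs 1, 3 and 6 are possible destinations (3 set bits). */
	uint16_t dests = (1 << 1) | (1 << 3) | (1 << 6);

	/* vector 0x31: 0x31 % 3 == 1, so the second set bit wins (vCPU 3,
	 * returned as 1-based index 4 and used as dst[idx - 1]). */
	printf("idx = %d\n", vector_to_index(0x31, 3, dests));
	return 0;
}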

Re: [PATCH v2 2/2] KVM: x86: Add lowest-priority support for vt-d posted-interrupts

2015-12-20 Thread Yang Zhang

On 2015/12/16 9:37, Feng Wu wrote:

Use vector-hashing to deliver lowest-priority interrupts for
VT-d posted-interrupts.

Signed-off-by: Feng Wu 
---
  arch/x86/kvm/lapic.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  arch/x86/kvm/lapic.h |  2 ++
  arch/x86/kvm/vmx.c   | 12 ++++++++++--
  3 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index e29001f..d4f2c8f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -854,6 +854,73 @@ out:
  }

  /*
+ * This routine handles lowest-priority interrupts using the vector-hashing
+ * mechanism. As an example, modern Intel CPUs use this method to handle
+ * lowest-priority interrupts.
+ *
+ * Here are the details of the vector-hashing mechanism:
+ * 1. For lowest-priority interrupts, store all the possible destination
+ *vCPUs in an array.
+ * 2. Use "guest vector % max number of destination vCPUs" to find the right
+ *destination vCPU in the array for the lowest-priority interrupt.
+ */
+struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
+ struct kvm_lapic_irq *irq)
+{
+   struct kvm_apic_map *map;
+   struct kvm_vcpu *vcpu = NULL;
+
+   if (irq->shorthand)
+   return NULL;
+
+   rcu_read_lock();
+   map = rcu_dereference(kvm->arch.apic_map);
+
+   if (!map)
+   goto out;
+
+   if ((irq->dest_mode != APIC_DEST_PHYSICAL) &&
+   kvm_lowest_prio_delivery(irq)) {
+   u16 cid;
+   int i, idx = 0;
+   unsigned long bitmap = 1;
+   unsigned int dest_vcpus = 0;
+   struct kvm_lapic **dst = NULL;
+
+
+   if (!kvm_apic_logical_map_valid(map))
+   goto out;
+
+   apic_logical_id(map, irq->dest_id, &cid, (u16 *)&bitmap);
+
+   if (cid >= ARRAY_SIZE(map->logical_map))
+   goto out;
+
+   dst = map->logical_map[cid];
+
+   for_each_set_bit(i, &bitmap, 16) {
+   if (!dst[i] || !kvm_lapic_enabled(dst[i]->vcpu)) {
+   clear_bit(i, &bitmap);
+   continue;
+   }
+   }
+
+   dest_vcpus = hweight16(bitmap);
+
+   if (dest_vcpus != 0) {
+   idx = kvm_vector_2_index(irq->vector, dest_vcpus,
+&bitmap, 16);
+   vcpu = dst[idx-1]->vcpu;
+   }
+   }
+
+out:
+   rcu_read_unlock();
+   return vcpu;
+}
+EXPORT_SYMBOL_GPL(kvm_intr_vector_hashing_dest);
+
+/*
   * Add a pending IRQ into lapic.
   * Return 1 if successfully added and 0 if discarded.
   */
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 6890ef0..52bffce 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -172,4 +172,6 @@ bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct 
kvm_lapic_irq *irq,
struct kvm_vcpu **dest_vcpu);
  int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
   const unsigned long *bitmap, u32 bitmap_size);
+struct kvm_vcpu *kvm_intr_vector_hashing_dest(struct kvm *kvm,
+ struct kvm_lapic_irq *irq);
  #endif
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5eb56ed..3f89189 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10702,8 +10702,16 @@ static int vmx_update_pi_irte(struct kvm *kvm, 
unsigned int host_irq,
 */

kvm_set_msi_irq(e, &irq);
-   if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu))
-   continue;
+
+   if (!kvm_intr_is_single_vcpu(kvm, &irq, &vcpu)) {
+   if (!kvm_vector_hashing_enabled() ||
+   irq.delivery_mode != APIC_DM_LOWEST)
+   continue;
+
+   vcpu = kvm_intr_vector_hashing_dest(kvm, &irq);
+   if (!vcpu)
+   continue;
+   }


I am a little confused with the 'continue'. If the destination is not 
single vcpu, shouldn't we rollback to use non-PI mode?




vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
vcpu_info.vector = irq.vector;




--
best regards
yang


Re: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-priority interrupts

2015-12-20 Thread Yang Zhang

On 2015/12/21 9:50, Wu, Feng wrote:




-----Original Message-----
From: Yang Zhang [mailto:yang.zhang...@gmail.com]
Sent: Monday, December 21, 2015 9:46 AM
To: Wu, Feng ; pbonz...@redhat.com;
rkrc...@redhat.com
Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-
priority interrupts

On 2015/12/16 9:37, Feng Wu wrote:

Use vector-hashing to deliver lowest-priority interrupts. As an
example, modern Intel CPUs in server platforms use this method to
handle lowest-priority interrupts.

Signed-off-by: Feng Wu 
---
   arch/x86/kvm/irq_comm.c | 27 ++++++++++++++++++++-------
   arch/x86/kvm/lapic.c    | 57 ++++++++++++++++++++++++++++++++++++++-----------
   arch/x86/kvm/lapic.h    |  2 ++
   arch/x86/kvm/x86.c      |  9 +++++++++
   arch/x86/kvm/x86.h      |  1 +
   5 files changed, 81 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index 84b96d3..c8c5f61 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -32,6 +32,7 @@
   #include "ioapic.h"

   #include "lapic.h"
+#include "x86.h"

   static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
   struct kvm *kvm, int irq_source_id, int level,
@@ -53,8 +54,10 @@ static int kvm_set_ioapic_irq(struct

kvm_kernel_irq_routing_entry *e,

   int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
struct kvm_lapic_irq *irq, unsigned long *dest_map)
   {
-   int i, r = -1;
+   int i, r = -1, idx = 0;
struct kvm_vcpu *vcpu, *lowest = NULL;
+   unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
+   unsigned int dest_vcpus = 0;

if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
kvm_lowest_prio_delivery(irq)) {
@@ -65,6 +68,8 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct

kvm_lapic *src,

if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
return r;

+   memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
+
kvm_for_each_vcpu(i, vcpu, kvm) {
if (!kvm_apic_present(vcpu))
continue;
@@ -78,13 +83,25 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct

kvm_lapic *src,

r = 0;
r += kvm_apic_set_irq(vcpu, irq, dest_map);
} else if (kvm_lapic_enabled(vcpu)) {
-   if (!lowest)
-   lowest = vcpu;
-   else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
-   lowest = vcpu;
+   if (!kvm_vector_hashing_enabled()) {
+   if (!lowest)
+   lowest = vcpu;
+   else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
+   lowest = vcpu;
+   } else {
+   __set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
+   dest_vcpus++;
+   }
}
}

+   if (dest_vcpus != 0) {
+   idx = kvm_vector_2_index(irq->vector, dest_vcpus,
+dest_vcpu_bitmap, KVM_MAX_VCPUS);
+
+   lowest = kvm_get_vcpu(kvm, idx - 1);
+   }
+
if (lowest)
r = kvm_apic_set_irq(lowest, irq, dest_map);

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index ecd4ea1..e29001f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -678,6 +678,22 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu,

struct kvm_lapic *source,

}
   }

+int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
+  const unsigned long *bitmap, u32 bitmap_size)
+{
+   u32 mod;
+   int i, idx = 0;
+
+   mod = vector % dest_vcpus;
+
+   for (i = 0; i <= mod; i++) {
+   idx = find_next_bit(bitmap, bitmap_size, idx) + 1;
+   BUG_ON(idx > bitmap_size);
+   }
+
+   return idx;
+}
+
   bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map)
   {
@@ -731,17 +747,38 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm

*kvm, struct kvm_lapic *src,

dst = map->logical_map[cid];

if (kvm_lowest_prio_delivery(irq)) {
-   int l = -1;
-   for_each_set_bit(i, &bitmap, 16) {
-   if (!dst[i])
-   continue;
-   if (l < 0)
-   l = i;
-   else if (kvm_apic_compare_prio(dst[i]->vcpu, dst[l]->vcpu) < 0)
-   l = i;
+   if 

RE: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-priority interrupts

2015-12-20 Thread Wu, Feng


> -----Original Message-----
> From: Yang Zhang [mailto:yang.zhang...@gmail.com]
> Sent: Monday, December 21, 2015 9:46 AM
> To: Wu, Feng ; pbonz...@redhat.com;
> rkrc...@redhat.com
> Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH v2 1/2] KVM: x86: Use vector-hashing to deliver lowest-
> priority interrupts
> 
> On 2015/12/16 9:37, Feng Wu wrote:
> > Use vector-hashing to deliver lowest-priority interrupts. As an
> > example, modern Intel CPUs in server platforms use this method to
> > handle lowest-priority interrupts.
> >
> > Signed-off-by: Feng Wu 
> > ---
> >   arch/x86/kvm/irq_comm.c | 27 ++++++++++++++++++++-------
> >   arch/x86/kvm/lapic.c    | 57 ++++++++++++++++++++++++++++++++++++++-----------
> >   arch/x86/kvm/lapic.h    |  2 ++
> >   arch/x86/kvm/x86.c      |  9 +++++++++
> >   arch/x86/kvm/x86.h      |  1 +
> >   5 files changed, 81 insertions(+), 15 deletions(-)
> >
> > diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> > index 84b96d3..c8c5f61 100644
> > --- a/arch/x86/kvm/irq_comm.c
> > +++ b/arch/x86/kvm/irq_comm.c
> > @@ -32,6 +32,7 @@
> >   #include "ioapic.h"
> >
> >   #include "lapic.h"
> > +#include "x86.h"
> >
> >   static int kvm_set_pic_irq(struct kvm_kernel_irq_routing_entry *e,
> >struct kvm *kvm, int irq_source_id, int level,
> > @@ -53,8 +54,10 @@ static int kvm_set_ioapic_irq(struct
> kvm_kernel_irq_routing_entry *e,
> >   int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
> > struct kvm_lapic_irq *irq, unsigned long *dest_map)
> >   {
> > -   int i, r = -1;
> > +   int i, r = -1, idx = 0;
> > struct kvm_vcpu *vcpu, *lowest = NULL;
> > +   unsigned long dest_vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> > +   unsigned int dest_vcpus = 0;
> >
> > if (irq->dest_mode == 0 && irq->dest_id == 0xff &&
> > kvm_lowest_prio_delivery(irq)) {
> > @@ -65,6 +68,8 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct
> kvm_lapic *src,
> > if (kvm_irq_delivery_to_apic_fast(kvm, src, irq, &r, dest_map))
> > return r;
> >
> > +   memset(dest_vcpu_bitmap, 0, sizeof(dest_vcpu_bitmap));
> > +
> > kvm_for_each_vcpu(i, vcpu, kvm) {
> > if (!kvm_apic_present(vcpu))
> > continue;
> > @@ -78,13 +83,25 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct
> kvm_lapic *src,
> > r = 0;
> > r += kvm_apic_set_irq(vcpu, irq, dest_map);
> > } else if (kvm_lapic_enabled(vcpu)) {
> > -   if (!lowest)
> > -   lowest = vcpu;
> > -   else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
> > -   lowest = vcpu;
> > +   if (!kvm_vector_hashing_enabled()) {
> > +   if (!lowest)
> > +   lowest = vcpu;
> > +   else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
> > +   lowest = vcpu;
> > +   } else {
> > +   __set_bit(vcpu->vcpu_id, dest_vcpu_bitmap);
> > +   dest_vcpus++;
> > +   }
> > }
> > }
> >
> > +   if (dest_vcpus != 0) {
> > +   idx = kvm_vector_2_index(irq->vector, dest_vcpus,
> > +dest_vcpu_bitmap, KVM_MAX_VCPUS);
> > +
> > +   lowest = kvm_get_vcpu(kvm, idx - 1);
> > +   }
> > +
> > if (lowest)
> > r = kvm_apic_set_irq(lowest, irq, dest_map);
> >
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index ecd4ea1..e29001f 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -678,6 +678,22 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu,
> struct kvm_lapic *source,
> > }
> >   }
> >
> > +int kvm_vector_2_index(u32 vector, u32 dest_vcpus,
> > +  const unsigned long *bitmap, u32 bitmap_size)
> > +{
> > +   u32 mod;
> > +   int i, idx = 0;
> > +
> > +   mod = vector % dest_vcpus;
> > +
> > +   for (i = 0; i <= mod; i++) {
> > +   idx = find_next_bit(bitmap, bitmap_size, idx) + 1;
> > +   BUG_ON(idx > bitmap_size);
> > +   }
> > +
> > +   return idx;
> > +}
> > +
> >   bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map)
> >   {
> > @@ -731,17 +747,38 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm
> *kvm, struct kvm_lapic *src,
> > dst = map->logical_map[cid];
> >
> > if (kvm_lowest_prio_delivery(irq)) {
> > -   int l = -1;
> > -   for_each_set_bit(i, &bitmap, 16) {
> > -   if (!dst[i])
> > -   continue;
> > -   if (l < 0)
> > -   l = i;
> > -

RE: [PATCH v4 5/5] kvm/x86: Hyper-V kvm exit

2015-12-20 Thread Pavel Fedin
 Hello!

 Replying to everything in one message.

> > As far as i understand this code, KVM_EXIT_HYPERV is called when one
> > of three MSRs are accessed. But, shouldn't we have implemented
> > instead something more generic, like KVM_EXIT_REG_IO, which would
> > work similar to KVM_EXIT_PIO or KVM_EXIT_MMIO, but carry register
> > code and value?
> 
> Yes, we considered that.  There were actually patches for this as well.

 Ah, I missed them, what a pity. There are lots of patches, I don't review them
all. Actually, I noticed the change only after it appeared in linux-next.

>  However, in this case the register is still emulated in the kernel, and
> userspace just gets informed of the new value.

 I see, but this doesn't change the semantics. All we need to do is to tell the
userland that "the register has been written". Additionally, we could do
whatever we want, including caching the data in the kernel, using it in the
kernel, and processing reads in the kernel.

> If we do get that, we will just rename KVM_EXIT_HYPERV to
> KVM_EXIT_MSR_ACCESS, and KVM_EXIT_HYPERV_SYNIC to
> KVM_EXIT_MSR_HYPERV_SYNIC, and struct kvm_hyperv_exit to kvm_msr_exit.

 Actually, I see this in a more generic way, something like:

/* KVM_EXIT_REG_ACCESS */
struct {
__u64 reg;
__u64 data;
__u8  is_write;
} mmio;
 
 'data' and 'is_write' are self-explanatory; 'reg' would be a generalized
register code, the same as used for KVM_(GET|SET)_ONE_REG:
 - for ARM64: ARM64_SYS_REG(op0, op1, crn, crm, op2) - see
http://lxr.free-electrons.com/source/arch/arm64/include/uapi/asm/kvm.h#L189
 - for x86  : to be defined (I know, we don't use ..._ONE_REG operations here
yet), like X86_MSR_REG(id), where the macro itself is:

#define X86_MSR_REG(id) (KVM_REG_X86 | KVM_REG_X86_MSR | 
KVM_REG_SIZE_U64 | (id))

 - for other architectures: to be defined in a similar way, once needed.
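
For illustration only, userspace could consume such an exit as in the sketch
below. Everything here is hypothetical: KVM_EXIT_REG_ACCESS and its payload
are a proposal, not part of the mainline uapi:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical mirror of the proposed exit payload. */
#define KVM_EXIT_REG_ACCESS 99		/* made-up exit code */

struct kvm_reg_access_exit {
	uint64_t reg;			/* ONE_REG-style register code */
	uint64_t data;
	uint8_t  is_write;
};

static uint64_t msr_cache[4];		/* e.g. SCONTROL/SIEFP/SIMP mirrors */

static void handle_reg_access(const struct kvm_reg_access_exit *ra)
{
	if (ra->is_write) {
		/* cache the new value internally, as suggested above */
		msr_cache[ra->reg & 3] = ra->data;
		printf("reg %#llx <- %#llx\n",
		       (unsigned long long)ra->reg,
		       (unsigned long long)ra->data);
	}
}

int main(void)
{
	struct kvm_reg_access_exit demo = {
		.reg = 1, .data = 0x42, .is_write = 1,
	};

	handle_reg_access(&demo);
	return 0;
}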

> On brief inspection of Andrey's patch (I have not been following
> closely) it looks like the kvm_hyperv_exit struct that's returned to
> userspace contains more data (control, evt_page, and msg_page fields)
> than simply the value of the MSR, so would the desired SynIC exit fit
> into a general-purpose exit for MSR emulation?

 I have looked at the code too, and these three fields are nothing more than
the values of the respective MSRs:

case HV_X64_MSR_SCONTROL:
synic->control = data;
if (!host)
synic_exit(synic, msr);
break;



case HV_X64_MSR_SIEFP:
if (data & HV_SYNIC_SIEFP_ENABLE)
if (kvm_clear_guest(vcpu->kvm,
data & PAGE_MASK, PAGE_SIZE)) {
ret = 1;
break;
}
synic->evt_page = data;
if (!host)
synic_exit(synic, msr);
break;
case HV_X64_MSR_SIMP:
if (data & HV_SYNIC_SIMP_ENABLE)
if (kvm_clear_guest(vcpu->kvm,
data & PAGE_MASK, PAGE_SIZE)) {
ret = 1;
break;
}
synic->msg_page = data;
if (!host)
synic_exit(synic, msr);
break;

 So, every time one of these three MSRs is written, we get a vmexit with the
values of all three registers, and that's all. We could easily have
'synic_exit(synic, msr, data)' in all three cases, and I think the userspace
could easily deal with the proposed KVM_EXIT_REG_ACCESS, just caching these
values internally if needed.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




[PATCH v2 3/4] x86/vdso: Remove pvclock fixmap machinery

2015-12-20 Thread Andy Lutomirski
Acked-by: Paolo Bonzini 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/vdso/vclock_gettime.c |  1 -
 arch/x86/entry/vdso/vma.c            |  1 +
 arch/x86/include/asm/fixmap.h        |  5 -----
 arch/x86/include/asm/pvclock.h       |  5 -----
 arch/x86/kernel/kvmclock.c           |  6 ------
 arch/x86/kernel/pvclock.c            | 24 ------------------------
 6 files changed, 1 insertion(+), 41 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 5dd363d54348..59a98c25bde7 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -45,7 +45,6 @@ extern u8 pvclock_page
 
 #include 
 #include 
-#include 
 #include 
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index aa828191c654..b8f69e264ac4 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index f80d70009ff8..6d7d0e52ed5a 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -19,7 +19,6 @@
 #include 
 #include 
 #include 
-#include 
 #ifdef CONFIG_X86_32
 #include 
 #include 
@@ -72,10 +71,6 @@ enum fixed_addresses {
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT,
 #endif
-#ifdef CONFIG_PARAVIRT_CLOCK
-   PVCLOCK_FIXMAP_BEGIN,
-   PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1,
-#endif
 #endif
FIX_DBGP_BASE,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 571dad355bbc..fdcc04020636 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info {
 } __attribute__((__aligned__(SMP_CACHE_BYTES)));
 
 #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info)
-#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1)
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-int size);
-struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu);
 
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index ec1b06dc82d2..72cef58693c7 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
 {
 #ifdef CONFIG_X86_64
int cpu;
-   int ret;
u8 flags;
struct pvclock_vcpu_time_info *vcpu_time;
unsigned int size;
@@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void)
return 1;
}
 
-   if ((ret = pvclock_init_vsyscall(hv_clock, size))) {
-   put_cpu();
-   return ret;
-   }
-
put_cpu();
 
kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK;
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 2f355d229a58..99bfc025111d 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock 
*wall_clock,
 
set_normalized_timespec(ts, now.tv_sec, now.tv_nsec);
 }
-
-#ifdef CONFIG_X86_64
-/*
- * Initialize the generic pvclock vsyscall state.  This will allocate
- * a/some page(s) for the per-vcpu pvclock information, set up a
- * fixmap mapping for the page(s)
- */
-
-int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
-int size)
-{
-   int idx;
-
-   WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE);
-
-   for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) {
-   __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx,
-__pa(i) + (idx*PAGE_SIZE),
-PAGE_KERNEL_VVAR);
-   }
-
-   return 0;
-}
-#endif
-- 
2.5.0



[PATCH v2 2/4] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap

2015-12-20 Thread Andy Lutomirski
Acked-by: Paolo Bonzini 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/vdso/vclock_gettime.c  | 20 ++++++--------------
 arch/x86/entry/vdso/vdso-layout.lds.S |  3 ++-
 arch/x86/entry/vdso/vdso2c.c          |  3 +++
 arch/x86/entry/vdso/vma.c             | 13 +++++++++++++
 arch/x86/include/asm/pvclock.h        |  9 +++++++++
 arch/x86/include/asm/vdso.h           |  1 +
 arch/x86/kernel/kvmclock.c            |  5 +++++
 7 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index c325ba1bdddf..5dd363d54348 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void)
 }
 #endif
 
+#ifdef CONFIG_PARAVIRT_CLOCK
+extern u8 pvclock_page
+   __attribute__((visibility("hidden")));
+#endif
+
 #ifndef BUILD_VDSO32
 
 #include 
@@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
struct timezone *tz)
 
 #ifdef CONFIG_PARAVIRT_CLOCK
 
-static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
+static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
-   const struct pvclock_vsyscall_time_info *pvti_base;
-   int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
-   int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
-
-   BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
-
-   pvti_base = (struct pvclock_vsyscall_time_info *)
-   __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
-
-   return &pvti_base[offset];
+   return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
 }
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-   const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
+   const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
cycle_t ret;
u64 tsc, pvti_tsc;
u64 last, delta, pvti_system_time;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S 
b/arch/x86/entry/vdso/vdso-layout.lds.S
index de2c921025f5..4158acc17df0 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
 * segment.
 */
 
-   vvar_start = . - 2 * PAGE_SIZE;
+   vvar_start = . - 3 * PAGE_SIZE;
vvar_page = vvar_start;
 
/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
 #undef EMIT_VVAR
 
hpet_page = vvar_start + PAGE_SIZE;
+   pvclock_page = vvar_start + 2 * PAGE_SIZE;
 
. = SIZEOF_HEADERS;
 
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 785d9922b106..491020b2826d 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -73,6 +73,7 @@ enum {
sym_vvar_start,
sym_vvar_page,
sym_hpet_page,
+   sym_pvclock_page,
sym_VDSO_FAKE_SECTION_TABLE_START,
sym_VDSO_FAKE_SECTION_TABLE_END,
 };
@@ -80,6 +81,7 @@ enum {
 const int special_pages[] = {
sym_vvar_page,
sym_hpet_page,
+   sym_pvclock_page,
 };
 
 struct vdso_sym {
@@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = {
[sym_vvar_start] = {"vvar_start", true},
[sym_vvar_page] = {"vvar_page", true},
[sym_hpet_page] = {"hpet_page", true},
+   [sym_pvclock_page] = {"pvclock_page", true},
[sym_VDSO_FAKE_SECTION_TABLE_START] = {
"VDSO_FAKE_SECTION_TABLE_START", false
},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 64df47148160..aa828191c654 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool 
calculate_addr)
.name = "[vvar]",
.pages = no_pages,
};
+   struct pvclock_vsyscall_time_info *pvti;
 
if (calculate_addr) {
addr = vdso_addr(current->mm->start_stack,
@@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool 
calculate_addr)
}
 #endif
 
+   pvti = pvclock_pvti_cpu0_va();
+   if (pvti && image->sym_pvclock_page) {
+   ret = remap_pfn_range(vma,
+ text_start + image->sym_pvclock_page,
+ __pa(pvti) >> PAGE_SHIFT,
+ PAGE_SIZE,
+ PAGE_READONLY);
+
+   if (ret)
+   goto up_fail;
+   }
+
 up_fail:
if (ret)
current->mm->context.vdso = NULL;
diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 7a6bed5c08bc..571dad355bbc 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -4,6 +4,15 @@
 #include 
 #include 
 
+#ifdef CONFIG_KVM_GUEST
+extern struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void);
+#else
+static inline struct pvclock_vsyscall_time_info *pvclock_pvti_cpu0_va(void)
+{
+	return NULL;
+}
+#endif

[PATCH v2 0/4] x86: KVM vdso and clock improvements

2015-12-20 Thread Andy Lutomirski
x86: KVM vdso and clock improvements

NB: patch 1 doesn't really belong here, but it makes this a lot
easier for me to test.  Patch 1, if it's okay at all, should go
through the kvm tree.  The rest should probably go through
tip:x86/vdso once they're reviewed.

I'll do a followup to enable vdso pvclock on 32-bit guests.
I'm not currently set up to test it.  (The KVM people could also
do it very easily on top of these patches.)

Changes from v1:
 - Dropped patch 1
 - Added Paolo's review and acks
 - Fixed a build issue on some configs

Andy Lutomirski (4):
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
  x86/vdso: Remove pvclock fixmap machinery
  x86/vdso: Enable vdso pvclock access on all vdso variants

 arch/x86/entry/vdso/vclock_gettime.c  | 151 +++++++++++++++++------------------
 arch/x86/entry/vdso/vdso-layout.lds.S |   3 +-
 arch/x86/entry/vdso/vdso2c.c          |   3 +
 arch/x86/entry/vdso/vma.c             |  14 ++++
 arch/x86/include/asm/fixmap.h         |   5 --
 arch/x86/include/asm/pvclock.h        |  14 ++--
 arch/x86/include/asm/vdso.h           |   1 +
 arch/x86/kernel/kvmclock.c            |  11 ++-
 arch/x86/kernel/pvclock.c             |  24 ------
 9 files changed, 107 insertions(+), 119 deletions(-)

-- 
2.5.0



[PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-12-20 Thread Andy Lutomirski
From: Andy Lutomirski 

The pvclock vdso code was too abstracted to understand easily and
excessively paranoid.  Simplify it for a huge speedup.

This opens the door for additional simplifications, as the vdso no
longer accesses the pvti for any vcpu other than vcpu 0.

Before, vclock_gettime using kvm-clock took about 45ns on my machine.
With this change, it takes 29ns, which is almost as fast as the pure TSC
implementation.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/vdso/vclock_gettime.c | 81 ++++++++++++++++++++++++++++++++++++++++++-----------------------------------
 1 file changed, 46 insertions(+), 35 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index ca94fa649251..c325ba1bdddf 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info 
*get_pvti(int cpu)
 
 static notrace cycle_t vread_pvclock(int *mode)
 {
-   const struct pvclock_vsyscall_time_info *pvti;
+   const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti;
cycle_t ret;
-   u64 last;
-   u32 version;
-   u8 flags;
-   unsigned cpu, cpu1;
-
+   u64 tsc, pvti_tsc;
+   u64 last, delta, pvti_system_time;
+   u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
/*
-* Note: hypervisor must guarantee that:
-* 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
-* 2. that per-CPU pvclock time info is updated if the
-*underlying CPU changes.
-* 3. that version is increased whenever underlying CPU
-*changes.
+* Note: The kernel and hypervisor must guarantee that cpu ID
+* number maps 1:1 to per-CPU pvclock time info.
+*
+* Because the hypervisor is entirely unaware of guest userspace
+* preemption, it cannot guarantee that per-CPU pvclock time
+* info is updated if the underlying CPU changes or that that
+* version is increased whenever underlying CPU changes.
 *
+* On KVM, we are guaranteed that pvti updates for any vCPU are
+* atomic as seen by *all* vCPUs.  This is an even stronger
+* guarantee than we get with a normal seqlock.
+*
+* On Xen, we don't appear to have that guarantee, but Xen still
+* supplies a valid seqlock using the version field.
+
+* We only do pvclock vdso timing at all if
+* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
+* mean that all vCPUs have matching pvti and that the TSC is
+* synced, so we can just look at vCPU 0's pvti.
 */
-   do {
-   cpu = __getcpu() & VGETCPU_CPU_MASK;
-   /* TODO: We can put vcpu id into higher bits of pvti.version.
-* This will save a couple of cycles by getting rid of
-* __getcpu() calls (Gleb).
-*/
-
-   pvti = get_pvti(cpu);
-
-   version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
-
-   /*
-* Test we're still on the cpu as well as the version.
-* We could have been migrated just after the first
-* vgetcpu but before fetching the version, so we
-* wouldn't notice a version change.
-*/
-   cpu1 = __getcpu() & VGETCPU_CPU_MASK;
-   } while (unlikely(cpu != cpu1 ||
- (pvti->pvti.version & 1) ||
- pvti->pvti.version != version));
-
-   if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
+
+   if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
+   return 0;
+   }
+
+   do {
+   version = pvti->version;
+
+   /* This is also a read barrier, so we'll read version first. */
+   tsc = rdtsc_ordered();
+
+   pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
+   pvti_tsc_shift = pvti->tsc_shift;
+   pvti_system_time = pvti->system_time;
+   pvti_tsc = pvti->tsc_timestamp;
+
+   /* Make sure that the version double-check is last. */
+   smp_rmb();
+   } while (unlikely((version & 1) || version != pvti->version));
+
+   delta = tsc - pvti_tsc;
+   ret = pvti_system_time +
+   pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
+   pvti_tsc_shift);
 
/* refer to tsc.c read_tsc() comment for rationale */
last = gtod->cycle_last;
-- 
2.5.0
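
For reference, the conversion the simplified reader performs is the standard
pvclock formula: system_time plus the scaled TSC delta. A minimal user-space
sketch of the arithmetic (field meanings follow the pvclock ABI; this is not
the kernel implementation):

#include <stdint.h>
#include <stdio.h>

/* (delta << shift) * mul / 2^32, matching pvclock scale-delta semantics */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* nanoseconds = system_time + scaled TSC ticks since tsc_timestamp */
static uint64_t pvclock_ns(uint64_t tsc, uint64_t tsc_timestamp,
			   uint64_t system_time, uint32_t mul, int8_t shift)
{
	return system_time + scale_delta(tsc - tsc_timestamp, mul, shift);
}

int main(void)
{
	/* 3 GHz TSC: mul ~= 2^32 / 3, so 3e9 ticks ~= 1e9 ns */
	printf("%llu ns\n", (unsigned long long)
	       pvclock_ns(3000000000ull, 0, 0, 0x55555555u, 0));
	return 0;
}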



[PATCH v2 4/4] x86/vdso: Enable vdso pvclock access on all vdso variants

2015-12-20 Thread Andy Lutomirski
Now that pvclock doesn't require access to the fixmap, all vdso
variants can use it.

The kernel side isn't wired up for 32-bit kernels yet, but this
covers 32-bit and x32 userspace on 64-bit kernels.

Acked-by: Paolo Bonzini 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/entry/vdso/vclock_gettime.c | 91 ++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
 1 file changed, 40 insertions(+), 51 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 59a98c25bde7..8602f06c759f 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,8 +17,10 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
+#include 
 
#define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -43,10 +45,6 @@ extern u8 pvclock_page
 
 #ifndef BUILD_VDSO32
 
-#include 
-#include 
-#include 
-
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
 {
long ret;
@@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, 
struct timezone *tz)
return ret;
 }
 
-#ifdef CONFIG_PARAVIRT_CLOCK
 
+#else
+
+notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
+{
+   long ret;
+
+   asm(
+   "mov %%ebx, %%edx \n"
+   "mov %2, %%ebx \n"
+   "call __kernel_vsyscall \n"
+   "mov %%edx, %%ebx \n"
+   : "=a" (ret)
+   : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
+   : "memory", "edx");
+   return ret;
+}
+
+notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
+{
+   long ret;
+
+   asm(
+   "mov %%ebx, %%edx \n"
+   "mov %2, %%ebx \n"
+   "call __kernel_vsyscall \n"
+   "mov %%edx, %%ebx \n"
+   : "=a" (ret)
+   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
+   : "memory", "edx");
+   return ret;
+}
+
+#endif
+
+#ifdef CONFIG_PARAVIRT_CLOCK
 static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void)
 {
return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
@@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode)
do {
version = pvti->version;
 
-   /* This is also a read barrier, so we'll read version first. */
-   tsc = rdtsc_ordered();
+   smp_rmb();
 
+   tsc = rdtsc_ordered();
pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
pvti_tsc_shift = pvti->tsc_shift;
pvti_system_time = pvti->system_time;
@@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode)
pvclock_scale_delta(delta, pvti_tsc_to_system_mul,
pvti_tsc_shift);
 
-   /* refer to tsc.c read_tsc() comment for rationale */
+   /* refer to vread_tsc() comment for rationale */
last = gtod->cycle_last;
 
if (likely(ret >= last))
@@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode)
 }
 #endif
 
-#else
-
-notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_clock_gettime), "g" (clock), "c" (ts)
-   : "memory", "edx");
-   return ret;
-}
-
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-   long ret;
-
-   asm(
-   "mov %%ebx, %%edx \n"
-   "mov %2, %%ebx \n"
-   "call __kernel_vsyscall \n"
-   "mov %%edx, %%ebx \n"
-   : "=a" (ret)
-   : "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-   : "memory", "edx");
-   return ret;
-}
-
-#ifdef CONFIG_PARAVIRT_CLOCK
-
-static notrace cycle_t vread_pvclock(int *mode)
-{
-   *mode = VCLOCK_NONE;
-   return 0;
-}
-#endif
-
-#endif
-
 notrace static cycle_t vread_tsc(void)
 {
cycle_t ret = (cycle_t)rdtsc_ordered();
-- 
2.5.0



RE: [Qemu-devel] [PATCH] SeaBios: Fix reset procedure reentrancy problem on qemu-kvm platform

2015-12-20 Thread Gonglei (Arei)

> -----Original Message-----
> From: Kevin O'Connor [mailto:ke...@koconnor.net]
> Sent: Saturday, December 19, 2015 11:12 PM
> On Sat, Dec 19, 2015 at 12:03:15PM +, Gonglei (Arei) wrote:
> > Maybe the root cause is not NMI but INTR, so yield() can open hardware
> interrupt,
> > And then execute interrupt handler, but the interrupt handler make the
> SeaBIOS
> > stack broken, so that the BSP can't execute the instruction and occur
> exception,
> > VM_EXIT to Kmod, which is an infinite loop. But I don't have any proofs 
> > except
> > the surface phenomenon.
> 
> I can't see any reason why allowing interrupts at this location would
> be a problem.
> 
Does it have any relationship with the *extra stack* of SeaBIOS?

> > Kevin, can we drop yield() in smp_setup() ?
> 
> It's possible to eliminate this instance of yield, but I think it
> would just push the crash to the next time interrupts are enabled.
> 
Perhaps. I'm not sure.

> > Is it really useful and allowable for SeaBIOS? Maybe for other components?
> > I'm not sure. Because we found that when SeaBIOS is booting, if we inject a
> > NMI by QMP, the guest will *stuck*. And the kvm tracing log is the same with
> > the current problem.
> 
> If you apply the patches you had to prevent that NMI crash problem,
> does it also prevent the above crash?
> 
Yes, but we cannot prevent the NMI injection (though I'll submit some patches to
forbid users' NMI injection after NMI_EN disabled by RTC bit7 of port 0x70).


Regards,
-Gonglei


[PULL] vhost: cleanups and fixes

2015-12-20 Thread Michael S. Tsirkin
The following changes since commit 9f9499ae8e6415cefc4fe0a96ad0e27864353c89:

  Linux 4.4-rc5 (2015-12-13 17:42:58 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 74a599f09bec7419b2490039f0fb33bc8581ef7c:

  virtio/s390: handle error values in irb (2015-12-17 10:37:33 +0200)


virtio: fixes on top of 4.4-rc5

This includes a single fix for virtio ccw error handling.

Signed-off-by: Michael S. Tsirkin 


Cornelia Huck (1):
  virtio/s390: handle error values in irb

 drivers/s390/virtio/virtio_ccw.c | 62 +++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 37 insertions(+), 25 deletions(-)


Re: [Qemu-devel] [PATCH] SeaBios: Fix reset procedure reentrancy problem on qemu-kvm platform

2015-12-20 Thread Kevin O'Connor
On Sun, Dec 20, 2015 at 09:49:54AM +, Gonglei (Arei) wrote:
> > From: Kevin O'Connor [mailto:ke...@koconnor.net]
> > Sent: Saturday, December 19, 2015 11:12 PM
> > On Sat, Dec 19, 2015 at 12:03:15PM +, Gonglei (Arei) wrote:
> > > Maybe the root cause is not NMI but INTR, so yield() can open hardware
> > interrupt,
> > > And then execute interrupt handler, but the interrupt handler make the
> > SeaBIOS
> > > stack broken, so that the BSP can't execute the instruction and occur
> > exception,
> > > VM_EXIT to Kmod, which is an infinite loop. But I don't have any proofs 
> > > except
> > > the surface phenomenon.
> > 
> > I can't see any reason why allowing interrupts at this location would
> > be a problem.
> > 
> Does it have any relationship with the *extra stack* of SeaBIOS?

None that I can see.  Also, the kvm trace seems to show the code
trying to execute at rip=0x03 - that will crash long before the extra
stack is used.

> > > Kevin, can we drop yield() in smp_setup() ?
> > 
> > It's possible to eliminate this instance of yield, but I think it
> > would just push the crash to the next time interrupts are enabled.
> > 
> Perhaps. I'm not sure.
> 
> > > Is it really useful and allowable for SeaBIOS? Maybe for other components?
> > > I'm not sure. Because we found that when SeaBIOS is booting, if we inject 
> > > a
> > > NMI by QMP, the guest will *stuck*. And the kvm tracing log is the same 
> > > with
> > > the current problem.
> > 
> > If you apply the patches you had to prevent that NMI crash problem,
> > does it also prevent the above crash?
> > 
> Yes, but we cannot prevent the NMI injection (though I'll submit some patches 
> to
> forbid users' NMI injection after NMI_EN disabled by RTC bit7 of port 0x70).
> 

-Kevin