Re: [PATCH v5 6/8] virtio: add explicit big-endian support to memory accessors

2015-04-24 Thread Cornelia Huck
On Thu, 23 Apr 2015 21:27:19 +0200
Thomas Huth th...@redhat.com wrote:

 On Thu, 23 Apr 2015 17:29:06 +0200
 Greg Kurz gk...@linux.vnet.ibm.com wrote:
 
  The current memory accessors logic is:
  - little endian if little_endian
  - native endian (i.e. no byteswap) if !little_endian
  
  If we want to fully support cross-endian vhost, we also need to be
  able to convert to big endian.
  
  Instead of changing the little_endian argument to some 3-value enum, this
  patch changes the logic to:
  - little endian if little_endian
  - big endian if !little_endian
  
  The native endian case is handled by all users with a trivial helper. This
  patch doesn't change any functionality, nor does it add overhead.
  
  Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
  ---
  
  Changes since v4:
  - style fixes (I have chosen if ... else in most places to stay below
80 columns, with the notable exception of the vhost helper which gets
shortened in a later patch)
  
   drivers/net/macvtap.c|5 -
   drivers/net/tun.c|5 -
   drivers/vhost/vhost.h|2 +-
   include/linux/virtio_byteorder.h |   24 ++--
   include/linux/virtio_config.h|5 -
   include/linux/vringh.h   |2 +-
   6 files changed, 28 insertions(+), 15 deletions(-)
  
  diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
  index a2f2958..6cf6b3e 100644
  --- a/drivers/net/macvtap.c
  +++ b/drivers/net/macvtap.c
  @@ -51,7 +51,10 @@ struct macvtap_queue {
   
   static inline bool macvtap_is_little_endian(struct macvtap_queue *q)
   {
  -   return q->flags & MACVTAP_VNET_LE;
  +   if (q->flags & MACVTAP_VNET_LE)
  +   return true;
  +   else
  +   return virtio_legacy_is_little_endian();
 
 simply:
 
   return (q->flags & MACVTAP_VNET_LE) ||
  virtio_legacy_is_little_endian();
 
 ?

ISTR that MST preferred the current notation, but I like either your
way or ?: (with linebreak) even better.

 
   }
   
(...)
  diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
  index 6a49960..954c657 100644
  --- a/drivers/vhost/vhost.h
  +++ b/drivers/vhost/vhost.h
  @@ -175,7 +175,7 @@ static inline bool vhost_has_feature(struct 
  vhost_virtqueue *vq, int bit)
   
   static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
   {
  -   return vhost_has_feature(vq, VIRTIO_F_VERSION_1);
  +   return vhost_has_feature(vq, VIRTIO_F_VERSION_1) ? true : 
  virtio_legacy_is_little_endian();
   }
 
 That line is way longer than 80 characters ... may I suggest to switch
 at least here to:
 
   return vhost_has_feature(vq, VIRTIO_F_VERSION_1) ||
  virtio_legacy_is_little_endian();

I think the line will collapse in a further patch anyway.



Re: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-24 Thread Paolo Bonzini


On 24/04/2015 09:46, Zhang, Yang Z wrote:
  On the other hand vmexit is lighter and lighter on newer processors; a
  Sandy Bridge has less than half the vmexit cost of a Core 2 (IIRC 1000
  vs. 2500 clock cycles approximately).
 
 1000 cycles? I remember it takes about 4000 cycle even in HSW server.

I was going from memory, but I now measured it with the vmexit test of
kvm-unit-tests.  With both an SNB Xeon E5 and an IVB Core i7, it returns about
1400 clock cycles for a vmcall exit.  This includes the overhead of
doing the cpuid itself.

Thus the vmexit cost is around 1300 cycles.  Of this the vmresume
instruction is probably around 800 cycles, and the rest is introduced by
KVM.  There are at least 4-5 memory barriers and locked instructions.
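
(As a rough illustration of what such a measurement boils down to: this is not
the kvm-unit-tests source; the loop count, inline asm and output format are
illustrative, and the program has to run inside the guest. It times a burst of
cpuid instructions, each of which unconditionally exits to the hypervisor on
VMX.)

#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;

    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static inline void do_cpuid(void)
{
    uint32_t a = 0, b, c, d;

    /* cpuid always traps to the hypervisor when running under VMX */
    asm volatile("cpuid" : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
}

int main(void)
{
    enum { N = 1 << 20 };
    uint64_t t0 = rdtsc();
    int i;

    for (i = 0; i < N; i++)
        do_cpuid();

    printf("avg cpuid round trip: %llu cycles\n",
           (unsigned long long)((rdtsc() - t0) / N));
    return 0;
}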

Paolo


Re: [PATCH v5 6/8] virtio: add explicit big-endian support to memory accessors

2015-04-24 Thread Greg Kurz
On Fri, 24 Apr 2015 09:04:21 +0200
Cornelia Huck cornelia.h...@de.ibm.com wrote:
 On Thu, 23 Apr 2015 21:27:19 +0200
 Thomas Huth th...@redhat.com wrote:
 

Thomas's e-mail did not make it to my mailbox... weird. :-\

  On Thu, 23 Apr 2015 17:29:06 +0200
  Greg Kurz gk...@linux.vnet.ibm.com wrote:
  
   The current memory accessors logic is:
   - little endian if little_endian
   - native endian (i.e. no byteswap) if !little_endian
   
   If we want to fully support cross-endian vhost, we also need to be
   able to convert to big endian.
   
   Instead of changing the little_endian argument to some 3-value enum, this
   patch changes the logic to:
   - little endian if little_endian
   - big endian if !little_endian
   
   The native endian case is handled by all users with a trivial helper. This
    patch doesn't change any functionality, nor does it add overhead.
   
   Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
   ---
   
   Changes since v4:
   - style fixes (I have chosen if ... else in most places to stay below
 80 columns, with the notable exception of the vhost helper which gets
  shortened in a later patch)
   
drivers/net/macvtap.c|5 -
drivers/net/tun.c|5 -
drivers/vhost/vhost.h|2 +-
include/linux/virtio_byteorder.h |   24 ++--
include/linux/virtio_config.h|5 -
include/linux/vringh.h   |2 +-
6 files changed, 28 insertions(+), 15 deletions(-)
   
   diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
   index a2f2958..6cf6b3e 100644
   --- a/drivers/net/macvtap.c
   +++ b/drivers/net/macvtap.c
   @@ -51,7 +51,10 @@ struct macvtap_queue {

static inline bool macvtap_is_little_endian(struct macvtap_queue *q)
{
   - return q->flags & MACVTAP_VNET_LE;
   + if (q->flags & MACVTAP_VNET_LE)
   + return true;
   + else
   + return virtio_legacy_is_little_endian();
  
  simply:
  
  return (q->flags & MACVTAP_VNET_LE) ||
 virtio_legacy_is_little_endian();
  
  ?
 
 ISTR that MST preferred the current notation, but I like either your
 way or ?: (with linebreak) even better.
 

MST did not like the initial notation I had used actually. FWIW I
like the simplicity of Thomas's suggestion... even better than ?:
which is:
 some_LE_check() ? true : some_default_endianness()

  
}

 (...)
   diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
   index 6a49960..954c657 100644
   --- a/drivers/vhost/vhost.h
   +++ b/drivers/vhost/vhost.h
   @@ -175,7 +175,7 @@ static inline bool vhost_has_feature(struct 
   vhost_virtqueue *vq, int bit)

static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
{
   - return vhost_has_feature(vq, VIRTIO_F_VERSION_1);
   + return vhost_has_feature(vq, VIRTIO_F_VERSION_1) ? true : 
   virtio_legacy_is_little_endian();
}
  
  That line is way longer than 80 characters ... may I suggest to switch
  at least here to:
  
  return vhost_has_feature(vq, VIRTIO_F_VERSION_1) ||
 virtio_legacy_is_little_endian();
 
 I think the line will collapse in a further patch anyway.
 

Yes, as mentioned in the changelog :)



Re: [PATCH v5 7/8] vhost: cross-endian support for legacy devices

2015-04-24 Thread Cornelia Huck
On Fri, 24 Apr 2015 10:06:19 +0200
Greg Kurz gk...@linux.vnet.ibm.com wrote:

 On Fri, 24 Apr 2015 09:19:26 +0200
 Cornelia Huck cornelia.h...@de.ibm.com wrote:
 
  On Thu, 23 Apr 2015 17:29:42 +0200
  Greg Kurz gk...@linux.vnet.ibm.com wrote:

   diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
   index bb6a5b4..b980b53 100644
   --- a/include/uapi/linux/vhost.h
   +++ b/include/uapi/linux/vhost.h
   @@ -103,6 +103,18 @@ struct vhost_memory {
/* Get accessor: reads index, writes value in num */
#define VHOST_GET_VRING_BASE _IOWR(VHOST_VIRTIO, 0x12, struct 
   vhost_vring_state)
   
   +/* Set the vring byte order in num. Valid values are 
   VHOST_VRING_LITTLE_ENDIAN
   + * or VHOST_VRING_BIG_ENDIAN (other values return EINVAL).
  
  -EINVAL?
  
 
 Oops, yes. :)
 
  Should you also mention when you return -EBUSY?
  
 
 Indeed... what about:
 
 The byte order cannot be changed when the device is active: trying to do so
 returns -EBUSY.

s/when/while/ ?

But sounds good.
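
(For context, a hedged sketch of the intended call sequence from userspace;
VHOST_SET_VRING_ENDIAN, VHOST_VRING_BIG_ENDIAN and struct vhost_vring_state
are taken from the updated <linux/vhost.h> once the series is applied, and
vhost_fd/queue_index are placeholders.)

#include <linux/vhost.h>
#include <sys/ioctl.h>
#include <errno.h>
#include <stdio.h>

/* Request big-endian vring accesses for one queue of a legacy device. */
static int set_vring_big_endian(int vhost_fd, unsigned int queue_index)
{
    struct vhost_vring_state s = {
        .index = queue_index,
        .num = VHOST_VRING_BIG_ENDIAN,
    };

    /* Must happen before the ring is started, otherwise -EBUSY. */
    if (ioctl(vhost_fd, VHOST_SET_VRING_ENDIAN, &s) < 0) {
        perror("VHOST_SET_VRING_ENDIAN");
        return -errno;
    }
    return 0;
}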



RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-24 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-04-24:
 
 
 On 24/04/2015 09:46, Zhang, Yang Z wrote:
 On the other hand vmexit is lighter and lighter on newer
 processors; a Sandy Bridge has less than half the vmexit cost of a
 Core 2 (IIRC
 1000 vs. 2500 clock cycles approximately).
 
 1000 cycles? I remember it takes about 4000 cycle even in HSW server.
 
 I was going from memory, but I now measured it with the vmexit test of
 kvm-unit-tests.  With both SNB Xeon E5 and IVB Core i7, returns about
 1400 clock cycles for a vmcall exit.  This includes the overhead of
 doing the cpuid itself.
 
 Thus the vmexit cost is around 1300 cycles.  Of this the vmresume
 instruction is probably around 800 cycles, and the rest is introduced
 by KVM.  There are at least 4-5 memory barriers and locked instructions.

Yes, that makes sense. The average vmexit/vmentry handling cost is around 4000
cycles. But I guess xsaveopt doesn't take that many cycles. Does anyone have the
xsaveopt cost data?

 
 Paolo


Best regards,
Yang




Re: [PATCH v5 7/8] vhost: cross-endian support for legacy devices

2015-04-24 Thread Greg Kurz
On Fri, 24 Apr 2015 09:19:26 +0200
Cornelia Huck cornelia.h...@de.ibm.com wrote:

 On Thu, 23 Apr 2015 17:29:42 +0200
 Greg Kurz gk...@linux.vnet.ibm.com wrote:
 
  This patch brings cross-endian support to vhost when used to implement
  legacy virtio devices. Since it is a relatively rare situation, the
  feature availability is controlled by a kernel config option (not set
  by default).
  
  The vq->is_le boolean field is added to cache the endianness to be
  used for ring accesses. It defaults to native endian, as expected
  by legacy virtio devices. When the ring gets active, we force little
  endian if the device is modern. When the ring is deactivated, we
  revert to the native endian default.
  
  If cross-endian was compiled in, a vq->user_be boolean field is added
  so that userspace may request a specific endianness. This field is
  used to override the default when activating the ring of a legacy
  device. It has no effect on modern devices.
  
  Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
  ---
  
  Changes since v4:
  - rewrote patch title to mention cross-endian
  - renamed config to VHOST_CROSS_ENDIAN_LEGACY
  - rewrote config description and help
  - moved ifdefery to top of vhost.c
  - added a detailed comment about the lifecycle of vq->user_be in
vhost_init_is_le()
  - renamed ioctls to VHOST_[GS]ET_VRING_ENDIAN
  - added LE/BE defines to the ioctl API
  - rewrote ioctl sanity check with the LE/BE defines
  - updated comment in uapi/linux/vhost.h to mention that the availability
of both SET and GET ioctls depends on the kernel config
  
   drivers/vhost/Kconfig  |   15 
   drivers/vhost/vhost.c  |   86 
  +++-
   drivers/vhost/vhost.h  |   10 +
   include/uapi/linux/vhost.h |   12 ++
   4 files changed, 121 insertions(+), 2 deletions(-)
  
  diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
  index 017a1e8..74d7380 100644
  --- a/drivers/vhost/Kconfig
  +++ b/drivers/vhost/Kconfig
  @@ -32,3 +32,18 @@ config VHOST
  ---help---
This option is selected by any driver which needs to access
the core of vhost.
  +
  +config VHOST_CROSS_ENDIAN_LEGACY
  +   bool "Cross-endian support for vhost"
  +   default n
  +   ---help---
  + This option allows vhost to support guests with a different byte
  + ordering from host.
 
 ...while using legacy virtio.
 
 Might help to explain the LEGACY in the config option ;)
 

Makes sense indeed !

  +
  + Userspace programs can control the feature using the
  + VHOST_SET_VRING_ENDIAN and VHOST_GET_VRING_ENDIAN ioctls.
  +
  + This is only useful on a few platforms (ppc64 and arm64). Since it
  + adds some overhead, it is disabled default.
 
 s/default/by default/
 

Ok.

  +
  + If unsure, say N.
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index 2ee2826..8c4390d 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -36,6 +36,78 @@ enum {
    #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
    #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
  
  +#ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
  +static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
  +{
  +   vq->user_be = !virtio_legacy_is_little_endian();
  +}
  +
  +static long vhost_set_vring_endian(struct vhost_virtqueue *vq, int __user 
  *argp)
  +{
  +   struct vhost_vring_state s;
  +
  +   if (vq->private_data)
  +   return -EBUSY;
  +
  +   if (copy_from_user(&s, argp, sizeof(s)))
  +   return -EFAULT;
  +
  +   if (s.num != VHOST_VRING_LITTLE_ENDIAN &&
  +   s.num != VHOST_VRING_BIG_ENDIAN)
  +   return -EINVAL;
  +
  +   vq->user_be = s.num;
  +
  +   return 0;
  +}
  +
  +static long vhost_get_vring_endian(struct vhost_virtqueue *vq, u32 idx,
  +  int __user *argp)
  +{
  +   struct vhost_vring_state s = {
  +   .index = idx,
  +   .num = vq->user_be
  +   };
  +
  +   if (copy_to_user(argp, &s, sizeof(s)))
  +   return -EFAULT;
  +
  +   return 0;
  +}
  +
  +static void vhost_init_is_le(struct vhost_virtqueue *vq)
  +{
  +   /* Note for legacy virtio: user_be is initialized at reset time
  +* according to the host endianness. If userspace does not set an
  +* explicit endianness, the default behavior is native endian, as
  +* expected by legacy virtio.
  +*/
  +   vq->is_le = vhost_has_feature(vq, VIRTIO_F_VERSION_1) || !vq->user_be;
  +}
  +#else
  +static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
  +{
  +   ;
 
 Just leave the function body empty?
 

Ok.

  +}
  +
  +static long vhost_set_vring_endian(struct vhost_virtqueue *vq, int __user 
  *argp)
  +{
  +   return -ENOIOCTLCMD;
  +}
  +
  +static long vhost_get_vring_endian(struct vhost_virtqueue *vq, u32 idx,
  +  int __user *argp)
  +{
  +   return -ENOIOCTLCMD;
  +}
  +
  +static void 

[PATCH 1/2] mips/kvm: Fix Big endian 32-bit register access

2015-04-24 Thread James Hogan
Fix access to 32-bit registers on big endian targets. The pointer passed
to the kernel must be for the actual 32-bit value, not a temporary
64-bit value, otherwise on big endian systems the kernel will only
interpret the upper half.
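
(The failure mode can be shown with a standalone toy program, independent of
QEMU: when a 32-bit value is stored in a 64-bit temporary, its low 32 bits sit
at the higher address on a big-endian host, so a reader that takes 32 bits from
the start of the buffer sees the zero upper half.)

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    int32_t reg = 0x12345678;
    uint64_t val64 = reg;        /* the old 64-bit temporary */
    uint32_t first_word;

    /* Read 32 bits from the address that would have been handed out. */
    memcpy(&first_word, &val64, sizeof(first_word));

    /* Prints 0x12345678 on little endian, 0x00000000 on big endian. */
    printf("first 32 bits at &val64: 0x%08x\n", first_word);
    return 0;
}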

Signed-off-by: James Hogan james.ho...@imgtec.com
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Leon Alrae leon.al...@imgtec.com
Cc: Aurelien Jarno aurel...@aurel32.net
Cc: kvm@vger.kernel.org
Cc: qemu-sta...@nongnu.org
---
 target-mips/kvm.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/target-mips/kvm.c b/target-mips/kvm.c
index 4d1f7ead8142..1597bbeac17a 100644
--- a/target-mips/kvm.c
+++ b/target-mips/kvm.c
@@ -240,10 +240,9 @@ int kvm_mips_set_ipi_interrupt(MIPSCPU *cpu, int irq, int 
level)
 static inline int kvm_mips_put_one_reg(CPUState *cs, uint64_t reg_id,
int32_t *addr)
 {
-uint64_t val64 = *addr;
 struct kvm_one_reg cp0reg = {
 .id = reg_id,
-.addr = (uintptr_t)&val64
+.addr = (uintptr_t)addr
 };
 
 return kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &cp0reg);
@@ -275,18 +274,12 @@ static inline int kvm_mips_put_one_reg64(CPUState *cs, 
uint64_t reg_id,
 static inline int kvm_mips_get_one_reg(CPUState *cs, uint64_t reg_id,
int32_t *addr)
 {
-int ret;
-uint64_t val64 = 0;
 struct kvm_one_reg cp0reg = {
 .id = reg_id,
-.addr = (uintptr_t)&val64
+.addr = (uintptr_t)addr
 };
 
-ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &cp0reg);
-if (ret >= 0) {
-*addr = val64;
-}
-return ret;
+return kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &cp0reg);
 }
 
 static inline int kvm_mips_get_one_ulreg(CPUState *cs, uint64 reg_id,
-- 
2.0.5



[PATCH 2/2] mips/kvm: Sign extend registers written to KVM

2015-04-24 Thread James Hogan
In case we're running on a 64-bit host, be sure to sign extend the
general purpose registers and hi/lo/pc before writing them to KVM, so as
to take advantage of MIPS32/MIPS64 compatibility.
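
(A toy program, not QEMU code, showing what the (int64_t)(target_long) casts
below buy us: zero extension of a negative MIPS32 word yields a non-canonical
MIPS64 register value, while sign extension keeps it canonical.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t gpr = 0x80001234;                     /* negative as a MIPS32 word */
    uint64_t zero_ext = gpr;                       /* 0x0000000080001234 */
    uint64_t sign_ext = (int64_t)(int32_t)gpr;     /* 0xffffffff80001234 */

    printf("zero extended: %016llx\nsign extended: %016llx\n",
           (unsigned long long)zero_ext,
           (unsigned long long)sign_ext);
    return 0;
}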

Signed-off-by: James Hogan james.ho...@imgtec.com
Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Leon Alrae leon.al...@imgtec.com
Cc: Aurelien Jarno aurel...@aurel32.net
Cc: kvm@vger.kernel.org
Cc: qemu-sta...@nongnu.org
---
 target-mips/kvm.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/target-mips/kvm.c b/target-mips/kvm.c
index 1597bbeac17a..d5388ca27568 100644
--- a/target-mips/kvm.c
+++ b/target-mips/kvm.c
@@ -633,12 +633,12 @@ int kvm_arch_put_registers(CPUState *cs, int level)
 
 /* Set the registers based on QEMU's view of things */
 for (i = 0; i < 32; i++) {
-regs.gpr[i] = env->active_tc.gpr[i];
+regs.gpr[i] = (int64_t)(target_long)env->active_tc.gpr[i];
 }
 
-regs.hi = env->active_tc.HI[0];
-regs.lo = env->active_tc.LO[0];
-regs.pc = env->active_tc.PC;
+regs.hi = (int64_t)(target_long)env->active_tc.HI[0];
+regs.lo = (int64_t)(target_long)env->active_tc.LO[0];
+regs.pc = (int64_t)(target_long)env->active_tc.PC;
 
 ret = kvm_vcpu_ioctl(cs, KVM_SET_REGS, &regs);
 
-- 
2.0.5



[PATCH 0/2] mips/kvm: Fixes for big endian MIPS64 hosts

2015-04-24 Thread James Hogan
A couple of small fixes for accessing 32-bit KVM registers on big
endian, and to sign extend struct kvm_regs registers so as to work on
MIPS64 hosts.

James Hogan (2):
  mips/kvm: Fix Big endian 32-bit register access
  mips/kvm: Sign extend registers written to KVM

 target-mips/kvm.c | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

Cc: Paolo Bonzini pbonz...@redhat.com
Cc: Leon Alrae leon.al...@imgtec.com
Cc: Aurelien Jarno aurel...@aurel32.net
Cc: kvm@vger.kernel.org
Cc: qemu-sta...@nongnu.org
-- 
2.0.5



Re: [PATCH v6 0/8] vhost: support for cross endian guests

2015-04-24 Thread Michael S. Tsirkin
On Fri, Apr 24, 2015 at 02:24:15PM +0200, Greg Kurz wrote:
 Only cosmetic and documentation changes since v5.
 
 ---

Looks sane to me. I plan to review and apply next week.

 Greg Kurz (8):
   virtio: introduce virtio_is_little_endian() helper
   tun: add tun_is_little_endian() helper
   macvtap: introduce macvtap_is_little_endian() helper
   vringh: introduce vringh_is_little_endian() helper
   vhost: introduce vhost_is_little_endian() helper
   virtio: add explicit big-endian support to memory accessors
   vhost: cross-endian support for legacy devices
   macvtap/tun: cross-endian support for little-endian hosts
 
 
  drivers/net/Kconfig  |   14 ++
  drivers/net/macvtap.c|   66 +-
  drivers/net/tun.c|   68 ++
  drivers/vhost/Kconfig|   15 +++
  drivers/vhost/vhost.c|   85 
 ++
  drivers/vhost/vhost.h|   25 ---
  include/linux/virtio_byteorder.h |   24 ++-
  include/linux/virtio_config.h|   18 +---
  include/linux/vringh.h   |   18 +---
  include/uapi/linux/if_tun.h  |6 +++
  include/uapi/linux/vhost.h   |   14 ++
  11 files changed, 320 insertions(+), 33 deletions(-)
 
 --
 Greg


[PATCH v6 5/8] vhost: introduce vhost_is_little_endian() helper

2015-04-24 Thread Greg Kurz
Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
---
 drivers/vhost/vhost.h |   17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8c1c792..6a49960 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -173,34 +173,39 @@ static inline bool vhost_has_feature(struct 
vhost_virtqueue *vq, int bit)
return vq->acked_features & (1ULL << bit);
 }
 
+static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
+{
+   return vhost_has_feature(vq, VIRTIO_F_VERSION_1);
+}
+
 /* Memory accessors */
 static inline u16 vhost16_to_cpu(struct vhost_virtqueue *vq, __virtio16 val)
 {
-   return __virtio16_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+   return __virtio16_to_cpu(vhost_is_little_endian(vq), val);
 }
 
 static inline __virtio16 cpu_to_vhost16(struct vhost_virtqueue *vq, u16 val)
 {
-   return __cpu_to_virtio16(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+   return __cpu_to_virtio16(vhost_is_little_endian(vq), val);
 }
 
 static inline u32 vhost32_to_cpu(struct vhost_virtqueue *vq, __virtio32 val)
 {
-   return __virtio32_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+   return __virtio32_to_cpu(vhost_is_little_endian(vq), val);
 }
 
 static inline __virtio32 cpu_to_vhost32(struct vhost_virtqueue *vq, u32 val)
 {
-   return __cpu_to_virtio32(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+   return __cpu_to_virtio32(vhost_is_little_endian(vq), val);
 }
 
 static inline u64 vhost64_to_cpu(struct vhost_virtqueue *vq, __virtio64 val)
 {
-   return __virtio64_to_cpu(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+   return __virtio64_to_cpu(vhost_is_little_endian(vq), val);
 }
 
 static inline __virtio64 cpu_to_vhost64(struct vhost_virtqueue *vq, u64 val)
 {
-   return __cpu_to_virtio64(vhost_has_feature(vq, VIRTIO_F_VERSION_1), 
val);
+   return __cpu_to_virtio64(vhost_is_little_endian(vq), val);
 }
 #endif



[PATCH v6 6/8] virtio: add explicit big-endian support to memory accessors

2015-04-24 Thread Greg Kurz
The current memory accessors logic is:
- little endian if little_endian
- native endian (i.e. no byteswap) if !little_endian

If we want to fully support cross-endian vhost, we also need to be
able to convert to big endian.

Instead of changing the little_endian argument to some 3-value enum, this
patch changes the logic to:
- little endian if little_endian
- big endian if !little_endian

The native endian case is handled by all users with a trivial helper. This
patch doesn't change any functionality, nor does it add overhead.

Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
---

Changes since v5:
- changed endian checking helpers as suggested by Thomas (use || and line
  breaker)

 drivers/net/macvtap.c|3 ++-
 drivers/net/tun.c|3 ++-
 drivers/vhost/vhost.h|3 ++-
 include/linux/virtio_byteorder.h |   24 ++--
 include/linux/virtio_config.h|3 ++-
 include/linux/vringh.h   |3 ++-
 6 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index a2f2958..0327d9d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -51,7 +51,8 @@ struct macvtap_queue {
 
 static inline bool macvtap_is_little_endian(struct macvtap_queue *q)
 {
-   return q->flags & MACVTAP_VNET_LE;
+   return q->flags & MACVTAP_VNET_LE ||
+   virtio_legacy_is_little_endian();
 }
 
 static inline u16 macvtap16_to_cpu(struct macvtap_queue *q, __virtio16 val)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3c3d6c0..7c4f6b6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -208,7 +208,8 @@ struct tun_struct {
 
 static inline bool tun_is_little_endian(struct tun_struct *tun)
 {
-   return tun->flags & TUN_VNET_LE;
+   return tun->flags & TUN_VNET_LE ||
+   virtio_legacy_is_little_endian();
 }
 
 static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6a49960..a4fa33a 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -175,7 +175,8 @@ static inline bool vhost_has_feature(struct vhost_virtqueue 
*vq, int bit)
 
 static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
 {
-   return vhost_has_feature(vq, VIRTIO_F_VERSION_1);
+   return vhost_has_feature(vq, VIRTIO_F_VERSION_1) ||
+   virtio_legacy_is_little_endian();
 }
 
 /* Memory accessors */
diff --git a/include/linux/virtio_byteorder.h b/include/linux/virtio_byteorder.h
index 51865d0..ce63a2c 100644
--- a/include/linux/virtio_byteorder.h
+++ b/include/linux/virtio_byteorder.h
@@ -3,17 +3,21 @@
 #include <linux/types.h>
 #include <uapi/linux/virtio_types.h>
 
-/*
- * Low-level memory accessors for handling virtio in modern little endian and 
in
- * compatibility native endian format.
- */
+static inline bool virtio_legacy_is_little_endian(void)
+{
+#ifdef __LITTLE_ENDIAN
+   return true;
+#else
+   return false;
+#endif
+}
 
 static inline u16 __virtio16_to_cpu(bool little_endian, __virtio16 val)
 {
if (little_endian)
return le16_to_cpu((__force __le16)val);
else
-   return (__force u16)val;
+   return be16_to_cpu((__force __be16)val);
 }
 
 static inline __virtio16 __cpu_to_virtio16(bool little_endian, u16 val)
@@ -21,7 +25,7 @@ static inline __virtio16 __cpu_to_virtio16(bool 
little_endian, u16 val)
if (little_endian)
return (__force __virtio16)cpu_to_le16(val);
else
-   return (__force __virtio16)val;
+   return (__force __virtio16)cpu_to_be16(val);
 }
 
 static inline u32 __virtio32_to_cpu(bool little_endian, __virtio32 val)
@@ -29,7 +33,7 @@ static inline u32 __virtio32_to_cpu(bool little_endian, 
__virtio32 val)
if (little_endian)
return le32_to_cpu((__force __le32)val);
else
-   return (__force u32)val;
+   return be32_to_cpu((__force __be32)val);
 }
 
 static inline __virtio32 __cpu_to_virtio32(bool little_endian, u32 val)
@@ -37,7 +41,7 @@ static inline __virtio32 __cpu_to_virtio32(bool 
little_endian, u32 val)
if (little_endian)
return (__force __virtio32)cpu_to_le32(val);
else
-   return (__force __virtio32)val;
+   return (__force __virtio32)cpu_to_be32(val);
 }
 
 static inline u64 __virtio64_to_cpu(bool little_endian, __virtio64 val)
@@ -45,7 +49,7 @@ static inline u64 __virtio64_to_cpu(bool little_endian, 
__virtio64 val)
if (little_endian)
return le64_to_cpu((__force __le64)val);
else
-   return (__force u64)val;
+   return be64_to_cpu((__force __be64)val);
 }
 
 static inline __virtio64 __cpu_to_virtio64(bool little_endian, u64 val)
@@ -53,7 +57,7 @@ static inline __virtio64 __cpu_to_virtio64(bool 
little_endian, u64 val)
if (little_endian)
return 

[PATCH v6 3/8] macvtap: introduce macvtap_is_little_endian() helper

2015-04-24 Thread Greg Kurz
Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
---
 drivers/net/macvtap.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 27ecc5c..a2f2958 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -49,14 +49,19 @@ struct macvtap_queue {
 
 #define MACVTAP_VNET_LE 0x8000
 
+static inline bool macvtap_is_little_endian(struct macvtap_queue *q)
+{
+   return q->flags & MACVTAP_VNET_LE;
+}
+
 static inline u16 macvtap16_to_cpu(struct macvtap_queue *q, __virtio16 val)
 {
-   return __virtio16_to_cpu(q->flags & MACVTAP_VNET_LE, val);
+   return __virtio16_to_cpu(macvtap_is_little_endian(q), val);
 }
 
 static inline __virtio16 cpu_to_macvtap16(struct macvtap_queue *q, u16 val)
 {
-   return __cpu_to_virtio16(q->flags & MACVTAP_VNET_LE, val);
+   return __cpu_to_virtio16(macvtap_is_little_endian(q), val);
 }
 
 static struct proto macvtap_proto = {



[PATCH v6 4/8] vringh: introduce vringh_is_little_endian() helper

2015-04-24 Thread Greg Kurz
Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
---
 include/linux/vringh.h |   17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/vringh.h b/include/linux/vringh.h
index a3fa537..3ed62ef 100644
--- a/include/linux/vringh.h
+++ b/include/linux/vringh.h
@@ -226,33 +226,38 @@ static inline void vringh_notify(struct vringh *vrh)
vrh->notify(vrh);
 }
 
+static inline bool vringh_is_little_endian(const struct vringh *vrh)
+{
+   return vrh->little_endian;
+}
+
 static inline u16 vringh16_to_cpu(const struct vringh *vrh, __virtio16 val)
 {
-   return __virtio16_to_cpu(vrh->little_endian, val);
+   return __virtio16_to_cpu(vringh_is_little_endian(vrh), val);
 }
 
 static inline __virtio16 cpu_to_vringh16(const struct vringh *vrh, u16 val)
 {
-   return __cpu_to_virtio16(vrh->little_endian, val);
+   return __cpu_to_virtio16(vringh_is_little_endian(vrh), val);
 }
 
 static inline u32 vringh32_to_cpu(const struct vringh *vrh, __virtio32 val)
 {
-   return __virtio32_to_cpu(vrh->little_endian, val);
+   return __virtio32_to_cpu(vringh_is_little_endian(vrh), val);
 }
 
 static inline __virtio32 cpu_to_vringh32(const struct vringh *vrh, u32 val)
 {
-   return __cpu_to_virtio32(vrh->little_endian, val);
+   return __cpu_to_virtio32(vringh_is_little_endian(vrh), val);
 }
 
 static inline u64 vringh64_to_cpu(const struct vringh *vrh, __virtio64 val)
 {
-   return __virtio64_to_cpu(vrh->little_endian, val);
+   return __virtio64_to_cpu(vringh_is_little_endian(vrh), val);
 }
 
 static inline __virtio64 cpu_to_vringh64(const struct vringh *vrh, u64 val)
 {
-   return __cpu_to_virtio64(vrh->little_endian, val);
+   return __cpu_to_virtio64(vringh_is_little_endian(vrh), val);
 }
 #endif /* _LINUX_VRINGH_H */



[PATCH v6 1/8] virtio: introduce virtio_is_little_endian() helper

2015-04-24 Thread Greg Kurz
Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
---
 include/linux/virtio_config.h |   17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index ca3ed78..bd1a582 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -205,35 +205,40 @@ int virtqueue_set_affinity(struct virtqueue *vq, int cpu)
return 0;
 }
 
+static inline bool virtio_is_little_endian(struct virtio_device *vdev)
+{
+   return virtio_has_feature(vdev, VIRTIO_F_VERSION_1);
+}
+
 /* Memory accessors */
 static inline u16 virtio16_to_cpu(struct virtio_device *vdev, __virtio16 val)
 {
-   return __virtio16_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), 
val);
+   return __virtio16_to_cpu(virtio_is_little_endian(vdev), val);
 }
 
 static inline __virtio16 cpu_to_virtio16(struct virtio_device *vdev, u16 val)
 {
-   return __cpu_to_virtio16(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), 
val);
+   return __cpu_to_virtio16(virtio_is_little_endian(vdev), val);
 }
 
 static inline u32 virtio32_to_cpu(struct virtio_device *vdev, __virtio32 val)
 {
-   return __virtio32_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), 
val);
+   return __virtio32_to_cpu(virtio_is_little_endian(vdev), val);
 }
 
 static inline __virtio32 cpu_to_virtio32(struct virtio_device *vdev, u32 val)
 {
-   return __cpu_to_virtio32(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), 
val);
+   return __cpu_to_virtio32(virtio_is_little_endian(vdev), val);
 }
 
 static inline u64 virtio64_to_cpu(struct virtio_device *vdev, __virtio64 val)
 {
-   return __virtio64_to_cpu(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), 
val);
+   return __virtio64_to_cpu(virtio_is_little_endian(vdev), val);
 }
 
 static inline __virtio64 cpu_to_virtio64(struct virtio_device *vdev, u64 val)
 {
-   return __cpu_to_virtio64(virtio_has_feature(vdev, VIRTIO_F_VERSION_1), 
val);
+   return __cpu_to_virtio64(virtio_is_little_endian(vdev), val);
 }
 
 /* Config space accessors. */



[PATCH v6 8/8] macvtap/tun: cross-endian support for little-endian hosts

2015-04-24 Thread Greg Kurz
The VNET_LE flag was introduced to fix accesses to virtio 1.0 headers
that are always little-endian. It can also be used to handle the special
case of a legacy little-endian device implemented by a big-endian host.

Let's add a flag and ioctls for big-endian devices as well. If both flags
are set, little-endian wins.

Since this isn't a common use case, the feature is controlled by a kernel
config option (not set by default).

Both macvtap and tun are covered by this patch since they share the same
API with userland.
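
(A hedged userspace sketch of the new knob; TUNSETVNETBE comes from the updated
<linux/if_tun.h> once this series is applied, and tap_fd stands in for an
already configured tap or macvtap descriptor.)

#include <linux/if_tun.h>
#include <sys/ioctl.h>
#include <errno.h>
#include <stdio.h>

/* Mark the vnet headers of this queue as big endian (legacy device). */
static int set_vnet_big_endian(int tap_fd, int enable)
{
    int be = !!enable;

    /* -EINVAL means the kernel was built without CONFIG_TUN_VNET_CROSS_LE. */
    if (ioctl(tap_fd, TUNSETVNETBE, &be) < 0) {
        perror("TUNSETVNETBE");
        return -errno;
    }
    return 0;
}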

Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
---

Changes since v5:
- changed {macvtap,tun}_legacy_is_little_endian() to use ?:

 drivers/net/Kconfig |   14 ++
 drivers/net/macvtap.c   |   57 +-
 drivers/net/tun.c   |   59 ++-
 include/uapi/linux/if_tun.h |6 
 4 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index df51d60..71ac0ec 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -244,6 +244,20 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config TUN_VNET_CROSS_LE
+   bool "Support for cross-endian vnet headers on little-endian kernels"
+   default n
+   ---help---
+ This option allows TUN/TAP and MACVTAP device drivers in a
+ little-endian kernel to parse vnet headers that come from a
+ big-endian legacy virtio device.
+
+ Userspace programs can control the feature using the TUNSETVNETBE
+ and TUNGETVNETBE ioctls.
+
+ Unless you have a little-endian system hosting a big-endian virtual
+ machine with a legacy virtio NIC, you should say N.
+
 config VETH
 tristate "Virtual ethernet pair device"
---help---
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 0327d9d..dc0a47c 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -48,11 +48,60 @@ struct macvtap_queue {
 #define MACVTAP_FEATURES (IFF_VNET_HDR | IFF_MULTI_QUEUE)
 
 #define MACVTAP_VNET_LE 0x8000
+#define MACVTAP_VNET_BE 0x4000
+
+#ifdef CONFIG_TUN_VNET_CROSS_LE
+static inline bool macvtap_legacy_is_little_endian(struct macvtap_queue *q)
+{
+   return q->flags & MACVTAP_VNET_BE ? false :
+   virtio_legacy_is_little_endian();
+}
+
+static long macvtap_get_vnet_be(struct macvtap_queue *q, int __user *sp)
+{
+   int s = !!(q->flags & MACVTAP_VNET_BE);
+
+   if (put_user(s, sp))
+   return -EFAULT;
+
+   return 0;
+}
+
+static long macvtap_set_vnet_be(struct macvtap_queue *q, int __user *sp)
+{
+   int s;
+
+   if (get_user(s, sp))
+   return -EFAULT;
+
+   if (s)
+   q->flags |= MACVTAP_VNET_BE;
+   else
+   q->flags &= ~MACVTAP_VNET_BE;
+
+   return 0;
+}
+#else
+static inline bool macvtap_legacy_is_little_endian(struct macvtap_queue *q)
+{
+   return virtio_legacy_is_little_endian();
+}
+
+static long macvtap_get_vnet_be(struct macvtap_queue *q, int __user *argp)
+{
+   return -EINVAL;
+}
+
+static long macvtap_set_vnet_be(struct macvtap_queue *q, int __user *argp)
+{
+   return -EINVAL;
+}
+#endif /* CONFIG_TUN_VNET_CROSS_LE */
 
 static inline bool macvtap_is_little_endian(struct macvtap_queue *q)
 {
return q->flags & MACVTAP_VNET_LE ||
-   virtio_legacy_is_little_endian();
+   macvtap_legacy_is_little_endian(q);
 }
 
 static inline u16 macvtap16_to_cpu(struct macvtap_queue *q, __virtio16 val)
@@ -1096,6 +1145,12 @@ static long macvtap_ioctl(struct file *file, unsigned 
int cmd,
q->flags &= ~MACVTAP_VNET_LE;
return 0;
 
+   case TUNGETVNETBE:
+   return macvtap_get_vnet_be(q, sp);
+
+   case TUNSETVNETBE:
+   return macvtap_set_vnet_be(q, sp);
+
case TUNSETOFFLOAD:
/* let the user check for future flags */
if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 7c4f6b6..9fa05d6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -111,6 +111,7 @@ do {
\
 #define TUN_FASYNC IFF_ATTACH_QUEUE
 /* High bits in flags field are unused. */
 #define TUN_VNET_LE 0x8000
+#define TUN_VNET_BE 0x4000
 
 #define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
  IFF_MULTI_QUEUE)
@@ -206,10 +207,58 @@ struct tun_struct {
u32 flow_count;
 };
 
+#ifdef CONFIG_TUN_VNET_CROSS_LE
+static inline bool tun_legacy_is_little_endian(struct tun_struct *tun)
+{
+   return tun->flags & TUN_VNET_BE ? false :
+   virtio_legacy_is_little_endian();
+}
+
+static long tun_get_vnet_be(struct tun_struct *tun, int __user *argp)
+{
+   int be = !!(tun->flags & TUN_VNET_BE);
+
+   if (put_user(be, 

[PATCH v6 2/8] tun: add tun_is_little_endian() helper

2015-04-24 Thread Greg Kurz
Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
---
 drivers/net/tun.c |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 857dca4..3c3d6c0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -206,14 +206,19 @@ struct tun_struct {
u32 flow_count;
 };
 
+static inline bool tun_is_little_endian(struct tun_struct *tun)
+{
+   return tun->flags & TUN_VNET_LE;
+}
+
 static inline u16 tun16_to_cpu(struct tun_struct *tun, __virtio16 val)
 {
-   return __virtio16_to_cpu(tun->flags & TUN_VNET_LE, val);
+   return __virtio16_to_cpu(tun_is_little_endian(tun), val);
 }
 
 static inline __virtio16 cpu_to_tun16(struct tun_struct *tun, u16 val)
 {
-   return __cpu_to_virtio16(tun->flags & TUN_VNET_LE, val);
+   return __cpu_to_virtio16(tun_is_little_endian(tun), val);
 }
 
 static inline u32 tun_hashfn(u32 rxhash)



KVM call agenda for 2015-04-24

2015-04-24 Thread Juan Quintela

Hi

Please, send any topic that you are interested in covering.


 Call details:

By popular demand, a google calendar public entry with it

  
https://www.google.com/calendar/embed?src=dG9iMXRqcXAzN3Y4ZXZwNzRoMHE4a3BqcXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ

(Let me know if you have any problems with the calendar entry.  I just
gave up about getting right at the same time CEST, CET, EDT and DST).

If you need phone number details,  contact me privately

Thanks, Juan.




[PATCH v6 7/8] vhost: cross-endian support for legacy devices

2015-04-24 Thread Greg Kurz
This patch brings cross-endian support to vhost when used to implement
legacy virtio devices. Since it is a relatively rare situation, the
feature availability is controlled by a kernel config option (not set
by default).

The vq->is_le boolean field is added to cache the endianness to be
used for ring accesses. It defaults to native endian, as expected
by legacy virtio devices. When the ring gets active, we force little
endian if the device is modern. When the ring is deactivated, we
revert to the native endian default.

If cross-endian was compiled in, a vq->user_be boolean field is added
so that userspace may request a specific endianness. This field is
used to override the default when activating the ring of a legacy
device. It has no effect on modern devices.

Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
---

Changes since v5:
- fixed description in Kconfig
- fixed error description in uapi header
- dropped useless semi-colon in the vhost_vq_reset_user_be() stub

 drivers/vhost/Kconfig  |   15 
 drivers/vhost/vhost.c  |   85 +++-
 drivers/vhost/vhost.h  |   11 +-
 include/uapi/linux/vhost.h |   14 +++
 4 files changed, 122 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 017a1e8..533eaf0 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -32,3 +32,18 @@ config VHOST
---help---
  This option is selected by any driver which needs to access
  the core of vhost.
+
+config VHOST_CROSS_ENDIAN_LEGACY
+   bool "Cross-endian support for vhost"
+   default n
+   ---help---
+ This option allows vhost to support guests with a different byte
+ ordering from host while using legacy virtio.
+
+ Userspace programs can control the feature using the
+ VHOST_SET_VRING_ENDIAN and VHOST_GET_VRING_ENDIAN ioctls.
+
+ This is only useful on a few platforms (ppc64 and arm64). Since it
+ adds some overhead, it is disabled by default.
+
+ If unsure, say N.
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ee2826..9e8e004 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -36,6 +36,77 @@ enum {
 #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
 
+#ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
+static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
+{
+   vq->user_be = !virtio_legacy_is_little_endian();
+}
+
+static long vhost_set_vring_endian(struct vhost_virtqueue *vq, int __user 
*argp)
+{
+   struct vhost_vring_state s;
+
+   if (vq->private_data)
+   return -EBUSY;
+
+   if (copy_from_user(&s, argp, sizeof(s)))
+   return -EFAULT;
+
+   if (s.num != VHOST_VRING_LITTLE_ENDIAN &&
+   s.num != VHOST_VRING_BIG_ENDIAN)
+   return -EINVAL;
+
+   vq->user_be = s.num;
+
+   return 0;
+}
+
+static long vhost_get_vring_endian(struct vhost_virtqueue *vq, u32 idx,
+  int __user *argp)
+{
+   struct vhost_vring_state s = {
+   .index = idx,
+   .num = vq->user_be
+   };
+
+   if (copy_to_user(argp, &s, sizeof(s)))
+   return -EFAULT;
+
+   return 0;
+}
+
+static void vhost_init_is_le(struct vhost_virtqueue *vq)
+{
+   /* Note for legacy virtio: user_be is initialized at reset time
+* according to the host endianness. If userspace does not set an
+* explicit endianness, the default behavior is native endian, as
+* expected by legacy virtio.
+*/
+   vq->is_le = vhost_has_feature(vq, VIRTIO_F_VERSION_1) || !vq->user_be;
+}
+#else
+static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
+{
+}
+
+static long vhost_set_vring_endian(struct vhost_virtqueue *vq, int __user 
*argp)
+{
+   return -ENOIOCTLCMD;
+}
+
+static long vhost_get_vring_endian(struct vhost_virtqueue *vq, u32 idx,
+  int __user *argp)
+{
+   return -ENOIOCTLCMD;
+}
+
+static void vhost_init_is_le(struct vhost_virtqueue *vq)
+{
+   if (vhost_has_feature(vq, VIRTIO_F_VERSION_1))
+   vq->is_le = true;
+}
+#endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
+
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
poll_table *pt)
 {
@@ -199,6 +270,8 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->call = NULL;
vq->log_ctx = NULL;
vq->memory = NULL;
+   vq->is_le = virtio_legacy_is_little_endian();
+   vhost_vq_reset_user_be(vq);
 }
 
 static int vhost_worker(void *data)
@@ -806,6 +879,12 @@ long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, 
void __user *argp)
} else
filep = eventfp;
break;
+   case VHOST_SET_VRING_ENDIAN:
+   r = 

[PATCH v6 0/8] vhost: support for cross endian guests

2015-04-24 Thread Greg Kurz
Only cosmetic and documentation changes since v5.

---

Greg Kurz (8):
  virtio: introduce virtio_is_little_endian() helper
  tun: add tun_is_little_endian() helper
  macvtap: introduce macvtap_is_little_endian() helper
  vringh: introduce vringh_is_little_endian() helper
  vhost: introduce vhost_is_little_endian() helper
  virtio: add explicit big-endian support to memory accessors
  vhost: cross-endian support for legacy devices
  macvtap/tun: cross-endian support for little-endian hosts


 drivers/net/Kconfig  |   14 ++
 drivers/net/macvtap.c|   66 +-
 drivers/net/tun.c|   68 ++
 drivers/vhost/Kconfig|   15 +++
 drivers/vhost/vhost.c|   85 ++
 drivers/vhost/vhost.h|   25 ---
 include/linux/virtio_byteorder.h |   24 ++-
 include/linux/virtio_config.h|   18 +---
 include/linux/vringh.h   |   18 +---
 include/uapi/linux/if_tun.h  |6 +++
 include/uapi/linux/vhost.h   |   14 ++
 11 files changed, 320 insertions(+), 33 deletions(-)

--
Greg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


x86: kvmclock: drop rdtsc_barrier()

2015-04-24 Thread Marcelo Tosatti

Drop unnecessary rdtsc_barrier(), as has been determined empirically,
see 057e6a8c660e95c3f4e7162e00e2fee1fc90c50d for details.

Noticed by Andy Lutomirski.

Improves clock_gettime() by approximately 15% on 
Intel i7-3520M @ 2.90GHz.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index 25b1cc0..a5beb23 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -86,7 +86,6 @@ unsigned __pvclock_read_cycles(const struct 
pvclock_vcpu_time_info *src,
offset = pvclock_get_nsec_offset(src);
ret = src->system_time + offset;
ret_flags = src->flags;
-   rdtsc_barrier();
 
*cycles = ret;
*flags = ret_flags;



Re: x86: kvmclock: drop rdtsc_barrier()

2015-04-24 Thread Rik van Riel
On 04/24/2015 09:36 PM, Marcelo Tosatti wrote:
 
 Drop unnecessary rdtsc_barrier(), as has been determined empirically,
 see 057e6a8c660e95c3f4e7162e00e2fee1fc90c50d for details.
 
 Noticed by Andy Lutomirski.
 
 Improves clock_gettime() by approximately 15% on 
 Intel i7-3520M @ 2.90GHz.
 
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Acked-by: Rik van Riel r...@redhat.com

-- 
All rights reversed


[PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg

2015-04-24 Thread Waiman Long
In the pv_scan_next() function, the slow cmpxchg atomic operation is
performed even if the other CPU is not even close to being halted. This
extra cmpxchg can harm slowpath performance.

This patch introduces the new mayhalt flag to indicate if the other
spinning CPU is close to being halted or not. The current threshold
for x86 is 2k cpu_relax() calls. If this flag is not set, the other
spinning CPU will have at least 2k more cpu_relax() calls before
it can enter the halt state. This should give enough time for the
setting of the locked flag in struct mcs_spinlock to propagate to
that CPU without using an atomic op.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock_paravirt.h |   28 +---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 9b4ac3d..41ee033 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -19,7 +19,8 @@
  * native_queue_spin_unlock().
  */
 
-#define _Q_SLOW_VAL    (3U << _Q_LOCKED_OFFSET)
+#define _Q_SLOW_VAL    (3U << _Q_LOCKED_OFFSET)
+#define MAYHALT_THRESHOLD  (SPIN_THRESHOLD >> 4)
 
 /*
  * The vcpu_hashed is a special state that is set by the new lock holder on
@@ -39,6 +40,7 @@ struct pv_node {
 
int cpu;
u8  state;
+   u8  mayhalt;
 };
 
 /*
@@ -198,6 +200,7 @@ static void pv_init_node(struct mcs_spinlock *node)
 
pn->cpu = smp_processor_id();
pn->state = vcpu_running;
+   pn->mayhalt = false;
 }
 
 /*
@@ -214,23 +217,34 @@ static void pv_wait_node(struct mcs_spinlock *node)
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
+   if (loop == MAYHALT_THRESHOLD)
+   xchg(&pn->mayhalt, true);
cpu_relax();
}
 
/*
-* Order pn->state vs pn->locked thusly:
+* Order pn->state/pn->mayhalt vs pn->locked thusly:
 *
-* [S] pn->state = vcpu_halted       [S] next->locked = 1
+* [S] pn->mayhalt = 1               [S] next->locked = 1
+*     MB, delay                         barrier()
+* [S] pn->state = vcpu_halted       [L] pn->mayhalt
 *     MB                                MB
 * [L] pn->locked                    [RmW] pn->state = vcpu_hashed
 *
 * Matches the cmpxchg() from pv_scan_next().
+*
+* As the new lock holder may quit (when pn->mayhalt is not
+* set) without memory barrier, a sufficiently long delay is
+* inserted between the setting of pn->mayhalt and pn->state
+* to ensure that there is enough time for the new pn->locked
+* value to be propagated here to be checked below.
 */
(void)xchg(&pn->state, vcpu_halted);

if (!READ_ONCE(node->locked))
pv_wait(&pn->state, vcpu_halted);

+   pn->mayhalt = false;
/*
 * Reset the state except when vcpu_hashed is set.
 */
@@ -263,6 +277,14 @@ static void pv_scan_next(struct qspinlock *lock, struct 
mcs_spinlock *node)
struct __qspinlock *l = (void *)lock;
 
/*
+* If mayhalt is not set, there is enough time for the just set value
+* in pn->locked to be propagated to the other CPU before it is time
+* to halt.
+*/
+   if (!READ_ONCE(pn->mayhalt))
+   return;
+
+   /*
 * Transition CPU state: halted = hashed
 * Quit if the transition failed.
 */
-- 
1.7.1



[PATCH v16 12/14] pvqspinlock: Only kick CPU at unlock time

2015-04-24 Thread Waiman Long
Before this patch, a CPU may have been kicked twice before getting
the lock - one before it becomes queue head and once before it gets
the lock. All this CPU kicking and halting (VMEXIT) can be expensive
and slow down system performance, especially in an overcommitted guest.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until at unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU.

It also adds a second synchronization point in __pv_queue_spin_unlock()
so as to do the pv_kick() only if it is really necessary.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c  |   10 ++--
 kernel/locking/qspinlock_paravirt.h |   76 +-
 2 files changed, 61 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c009120..0a3a109 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -239,8 +239,8 @@ static __always_inline void set_locked(struct qspinlock 
*lock)
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_scan_next(struct qspinlock *lock,
+  struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_head(struct qspinlock *lock,
   struct mcs_spinlock *node) { }
 
@@ -248,7 +248,7 @@ static __always_inline void __pv_wait_head(struct qspinlock 
*lock,
 
 #define pv_init_node   __pv_init_node
 #define pv_wait_node   __pv_wait_node
-#define pv_kick_node   __pv_kick_node
+#define pv_scan_next   __pv_scan_next
 #define pv_wait_head   __pv_wait_head
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
@@ -440,7 +440,7 @@ queue:
cpu_relax();
 
arch_mcs_spin_unlock_contended(&next->locked);
-   pv_kick_node(next);
+   pv_scan_next(lock, next);
 
 release:
/*
@@ -461,7 +461,7 @@ EXPORT_SYMBOL(queue_spin_lock_slowpath);
 
 #undef pv_init_node
 #undef pv_wait_node
-#undef pv_kick_node
+#undef pv_scan_next
 #undef pv_wait_head
 
 #undef  queue_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 084e5c1..9b4ac3d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -21,9 +21,16 @@
 
 #define _Q_SLOW_VAL(3U  _Q_LOCKED_OFFSET)
 
+/*
+ * The vcpu_hashed is a special state that is set by the new lock holder on
+ * the new queue head to indicate that _Q_SLOW_VAL is set and hash entry
+ * filled. With this state, the queue head CPU will always be kicked even
+ * if it is not halted to avoid potential racing condition.
+ */
 enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
+   vcpu_hashed
 };
 
 struct pv_node {
@@ -162,13 +169,13 @@ __visible void __pv_queue_spin_unlock(struct qspinlock 
*lock)
 * The queue head has been halted. Need to locate it and wake it up.
 */
node = pv_hash_find(lock);
-   smp_store_release(&l->locked, 0);
+   (void)xchg(&l->locked, 0);
 
/*
 * At this point the memory pointed at by lock can be freed/reused,
 * however we can still use the PV node to kick the CPU.
 */
-   if (READ_ONCE(node->state) == vcpu_halted)
+   if (READ_ONCE(node->state) != vcpu_running)
 pv_kick(node->cpu);
 }
 /*
@@ -195,7 +202,8 @@ static void pv_init_node(struct mcs_spinlock *node)
 
 /*
 * Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
  */
 static void pv_wait_node(struct mcs_spinlock *node)
 {
@@ -214,9 +222,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
 *
 * [S] pn->state = vcpu_halted       [S] next->locked = 1
 *     MB                                MB
-* [L] pn->locked                    [RmW] pn->state = vcpu_running
+* [L] pn->locked                    [RmW] pn->state = vcpu_hashed
 *
-* Matches the xchg() from pv_kick_node().
+* Matches the cmpxchg() from pv_scan_next().
 */
(void)xchg(&pn->state, vcpu_halted);
 
@@ -224,9 +232,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
pv_wait(&pn->state, vcpu_halted);
 
/*
-* Reset the vCPU state to avoid unncessary CPU kicking
+* Reset the state except when vcpu_hashed is set.
 */
-   WRITE_ONCE(pn->state, vcpu_running);
+   

[PATCH v16 00/14] qspinlock: a 4-byte queue spinlock with PV support

2015-04-24 Thread Waiman Long
v15-v16:
 - Remove the lfsr patch and use linear probing as lfsr is not really
   necessary in most cases.
 - Move the paravirt PV_CALLEE_SAVE_REGS_THUNK code to an asm header.
 - Add a patch to collect PV qspinlock statistics which also
   supersedes the PV lock hash debug patch.
 - Add PV qspinlock performance numbers.

v14-v15:
 - Incorporate PeterZ's v15 qspinlock patch and improve upon the PV
   qspinlock code by dynamically allocating the hash table as well
   as some other performance optimization.
 - Simplified the Xen PV qspinlock code as suggested by David Vrabel
   david.vra...@citrix.com.
 - Add benchmarking data for 3.19 kernel to compare the performance
   of a spinlock heavy test with and without the qspinlock patch
   under different cpufreq drivers and scaling governors.

v13-v14:
 - Patches 1 & 2: Add queue_spin_unlock_wait() to accommodate commit
   78bff1c86 from Oleg Nesterov.
 - Fix the system hang problem when using PV qspinlock in an
   over-committed guest due to a racing condition in the
   pv_set_head_in_tail() function.
 - Increase the MAYHALT_THRESHOLD from 10 to 1024.
 - Change kick_cpu into a regular function pointer instead of a
   callee-saved function.
 - Change lock statistics code to use separate bits for different
   statistics.

v12-v13:
 - Change patch 9 to generate separate versions of the
   queue_spin_lock_slowpath functions for bare metal and PV guest. This
   reduces the performance impact of the PV code on bare metal systems.

v11-v12:
 - Based on PeterZ's version of the qspinlock patch
   (https://lkml.org/lkml/2014/6/15/63).
 - Incorporated many of the review comments from Konrad Wilk and
   Paolo Bonzini.
 - The pvqspinlock code is largely from my previous version with
   PeterZ's way of going from queue tail to head and his idea of
   using callee saved calls to KVM and XEN codes.

v10-v11:
  - Use a simple test-and-set unfair lock to simplify the code,
but performance may suffer a bit for large guest with many CPUs.
  - Take out Raghavendra KT's test results as the unfair lock changes
may render some of his results invalid.
  - Add PV support without increasing the size of the core queue node
structure.
  - Other minor changes to address some of the feedback comments.

v9-v10:
  - Make some minor changes to qspinlock.c to accommodate review feedback.
  - Change author to PeterZ for 2 of the patches.
  - Include Raghavendra KT's test results in patch 18.

v8-v9:
  - Integrate PeterZ's version of the queue spinlock patch with some
modification:
http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
  - Break the more complex patches into smaller ones to ease review effort.
  - Fix a race condition in the PV qspinlock code.

v7-v8:
  - Remove one unneeded atomic operation from the slowpath, thus
improving performance.
  - Simplify some of the codes and add more comments.
  - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
unfair lock.
  - Reduce unfair lock slowpath lock stealing frequency depending
on its distance from the queue head.
  - Add performance data for IvyBridge-EX CPU.

v6-v7:
  - Remove an atomic operation from the 2-task contending code
  - Shorten the names of some macros
  - Make the queue waiter attempt to steal the lock when the unfair lock
    is enabled.
  - Remove lock holder kick from the PV code and fix a race condition
  - Run the unfair lock  PV code on overcommitted KVM guests to collect
performance data.

v5-v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV
   support code.

v4-v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.

v3-v4:
 - Remove debugging code and fix a configuration error
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better
 - Add an x86 version of asm/qspinlock.h for holding x86 specific
   optimization.
 - Add an optimized x86 code path for 2 contending tasks to improve
   low contention performance.

v2-v3:
 - Simplify the code by using the normal queueing mode only, without an unfair option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1-v2:
 - Add some more comments to document what the code does.
 - 

[PATCH v16 02/14] qspinlock, x86: Enable x86-64 to use queue spinlock

2015-04-24 Thread Waiman Long
This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent there, so queue spinlock is not enabled.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/qspinlock.h  |   20 
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 
 4 files changed, 30 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..49fecb1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -125,6 +125,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+   select ARCH_USE_QUEUE_SPINLOCK
select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 000..222995b
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,20 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include asm-generic/qspinlock_types.h
+
+#definequeue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * A smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   smp_store_release((u8 *)lock, 0);
+}
+
+#include asm-generic/qspinlock.h
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index cf87de3..a9c01fd 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include asm/qspinlock.h
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -196,6 +200,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t 
*lock)
cpu_relax();
}
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 /*
  * Read-write spinlocks, allowing multiple readers
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index 5f9d757..5d654a1 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include asm-generic/qspinlock_types.h
+#else
 typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include asm-generic/qrwlock_types.h
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: x86: fix kvmclock update protocol

2015-04-24 Thread Marcelo Tosatti
On Thu, Apr 23, 2015 at 01:46:55PM +0200, Paolo Bonzini wrote:
 From: Radim Krčmář rkrc...@redhat.com
 
 The kvmclock spec says that the host will increment a version field to
 an odd number, then update stuff, then increment it to an even number.
 The host is buggy and doesn't do this, and the result is observable
 when one vcpu reads another vcpu's kvmclock data.
 
 There's no good way for a guest kernel to keep its vdso from reading
 a different vcpu's kvmclock data, but we don't need to care about
 changing VCPUs as long as we read consistent data from kvmclock.
 (VCPU can change outside of this loop too, so it doesn't matter if we
 return a value not fit for this VCPU.)
 
 Based on a patch by Radim Krčmář.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  arch/x86/kvm/x86.c | 33 -
  1 file changed, 28 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index ed31c31b2485..c73efcd03e29 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -1669,12 +1669,28 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
   guest_hv_clock, sizeof(guest_hv_clock
   return 0;
  
 - /*
 -  * The interface expects us to write an even number signaling that the
 -  * update is finished. Since the guest won't see the intermediate
 -  * state, we just increase by 2 at the end.
 + /* This VCPU is paused, but it's legal for a guest to read another
 +  * VCPU's kvmclock, so we really have to follow the specification where
 +  * it says that version is odd if data is being modified, and even after
 +  * it is consistent.
 +  *
 +  * Version field updates must be kept separate.  This is because
 +  * kvm_write_guest_cached might use a rep movs instruction, and
 +  * writes within a string instruction are weakly ordered.  So there
 +  * are three writes overall.
 +  *
 +  * As a small optimization, only write the version field in the first
 +  * and third write.  The vcpu-pv_time cache is still valid, because the
 +  * version field is the first in the struct.
*/
 - vcpu-hv_clock.version = guest_hv_clock.version + 2;
 + BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);
 +
 + vcpu-hv_clock.version = guest_hv_clock.version + 1;
 + kvm_write_guest_cached(v-kvm, vcpu-pv_time,
 + vcpu-hv_clock,
 + sizeof(vcpu-hv_clock.version));
 +
 + smp_wmb();
  
   /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
   pvclock_flags = (guest_hv_clock.flags  PVCLOCK_GUEST_STOPPED);
 @@ -1695,6 +1711,13 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
   kvm_write_guest_cached(v-kvm, vcpu-pv_time,
   vcpu-hv_clock,
   sizeof(vcpu-hv_clock));
 +
 + smp_wmb();
 +
 + vcpu-hv_clock.version++;
 + kvm_write_guest_cached(v-kvm, vcpu-pv_time,
 + vcpu-hv_clock,
 + sizeof(vcpu-hv_clock.version));
   return 0;
  }
  
 -- 
 1.8.3.1

Acked-by: Marcelo Tosatti mtosa...@redhat.com
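
For context, the guest-side counterpart of this protocol (not part of the
patch above) has to retry its read while the version field is odd or changes
underneath it. A minimal sketch, assuming a placeholder compute_ns() helper
for the tsc_timestamp/system_time arithmetic -- names are illustrative, not
the actual pvclock guest code:

/* Retry until a stable, even version is seen on both sides of the read. */
static u64 read_kvmclock(volatile struct pvclock_vcpu_time_info *src)
{
	u32 version;
	u64 ns;

	do {
		version = src->version;
		smp_rmb();		/* read version before the payload */
		ns = compute_ns(src);	/* placeholder: tsc_timestamp, system_time, ... */
		smp_rmb();		/* read payload before re-checking version */
	} while ((version & 1) || version != src->version);

	return ns;
}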

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v16 11/14] pvqspinlock, x86: Enable PV qspinlock for Xen

2015-04-24 Thread Waiman Long
From: David Vrabel david.vra...@citrix.com

This patch adds the necessary Xen specific code to allow Xen to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Signed-off-by: David Vrabel david.vra...@citrix.com
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/xen/spinlock.c |   64 ---
 kernel/Kconfig.locks|2 +-
 2 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 956374c..3215ffc 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,56 @@
 #include xen-ops.h
 #include debugfs.h
 
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+#include asm/qspinlock.h
+
+static void xen_qlock_kick(int cpu)
+{
+   xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+}
+
+/*
+ * Halt the current CPU  release it back to the host
+ */
+static void xen_qlock_wait(u8 *byte, u8 val)
+{
+   int irq = __this_cpu_read(lock_kicker_irq);
+
+   /* If kicker interrupts not initialized yet, just spin */
+   if (irq == -1)
+   return;
+
+   /* clear pending */
+   xen_clear_irq_pending(irq);
+   barrier();
+
+   /*
+* We check the byte value after clearing pending IRQ to make sure
+* that we won't miss a wakeup event because of the clearing.
+*
+* The sync_clear_bit() call in xen_clear_irq_pending() is atomic.
+* So it is effectively a memory barrier for x86.
+*/
+   if (READ_ONCE(*byte) != val)
+   return;
+
+   /*
+* If an interrupt happens here, it will leave the wakeup irq
+* pending, which will cause xen_poll_irq() to return
+* immediately.
+*/
+
+   /* Block until irq becomes pending (or perhaps a spurious wakeup) */
+   xen_poll_irq(irq);
+}
+
+#else /* CONFIG_QUEUE_SPINLOCK */
+
 enum xen_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -100,12 +150,9 @@ struct xen_lock_waiting {
__ticket_t want;
 };
 
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
 static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
 static cpumask_t waiting_cpus;
 
-static bool xen_pvspin = true;
 __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 {
int irq = __this_cpu_read(lock_kicker_irq);
@@ -217,6 +264,7 @@ static void xen_unlock_kick(struct arch_spinlock *lock, 
__ticket_t next)
}
}
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
@@ -280,8 +328,16 @@ void __init xen_init_spinlocks(void)
return;
}
printk(KERN_DEBUG xen: PV spinlocks enabled\n);
+#ifdef CONFIG_QUEUE_SPINLOCK
+   __pv_init_lock_hash();
+   pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath;
+   pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock);
+   pv_lock_ops.wait = xen_qlock_wait;
+   pv_lock_ops.kick = xen_qlock_kick;
+#else
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
pv_lock_ops.unlock_kick = xen_unlock_kick;
+#endif
 }
 
 /*
@@ -310,7 +366,7 @@ static __init int xen_parse_nopvspin(char *arg)
 }
 early_param(xen_nopvspin, xen_parse_nopvspin);
 
-#ifdef CONFIG_XEN_DEBUG_FS
+#if defined(CONFIG_XEN_DEBUG_FS)  !defined(CONFIG_QUEUE_SPINLOCK)
 
 static struct dentry *d_spin_debug;
 
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 537b13e..0b42933 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
-   depends on SMP  (!PARAVIRT_SPINLOCKS || !XEN)
+   depends on SMP
 
 config ARCH_USE_QUEUE_RWLOCK
bool
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v16 09/14] pvqspinlock, x86: Implement the paravirt qspinlock call patching

2015-04-24 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

We use the regular paravirt call patching to switch between:

  native_queue_spin_lock_slowpath() __pv_queue_spin_lock_slowpath()
  native_queue_spin_unlock()__pv_queue_spin_unlock()

We use a callee saved call for the unlock function which reduces the
i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
again.

We further optimize the unlock path by patching the direct call with a
movb $0,%arg1 if we are indeed using the native unlock code. This
makes the unlock code almost as fast as the !PARAVIRT case.

This significantly lowers the overhead of having
CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.
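
Concretely, on x86-64 the native unlock template that gets patched in is
roughly the following one-liner (a sketch of the paravirt_patch_64.c part
of this series, which appears further down the diffstat; treat the exact
symbol name as an assumption):

/* Patched over the call site when the native unlock code is in use. */
DEF_NATIVE(pv_lock_ops, queue_spin_unlock, "movb $0, (%rdi)");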

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/Kconfig  |2 +-
 arch/x86/include/asm/paravirt.h   |   29 -
 arch/x86/include/asm/paravirt_types.h |   10 ++
 arch/x86/include/asm/qspinlock.h  |   25 -
 arch/x86/include/asm/qspinlock_paravirt.h |6 ++
 arch/x86/kernel/paravirt-spinlocks.c  |   24 +++-
 arch/x86/kernel/paravirt_patch_32.c   |   22 ++
 arch/x86/kernel/paravirt_patch_64.c   |   22 ++
 8 files changed, 128 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock_paravirt.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 49fecb1..a0946e7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -661,7 +661,7 @@ config PARAVIRT_DEBUG
 config PARAVIRT_SPINLOCKS
bool Paravirtualization layer for spinlocks
depends on PARAVIRT  SMP
-   select UNINLINE_SPIN_UNLOCK
+   select UNINLINE_SPIN_UNLOCK if !QUEUE_SPINLOCK
---help---
  Paravirtualized spinlocks allow a pvops backend to replace the
  spinlock implementation with something virtualization-friendly
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 965c47d..f76199a 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,31 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP)  defined(CONFIG_PARAVIRT_SPINLOCKS)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_queue_spin_lock_slowpath(struct qspinlock *lock,
+   u32 val)
+{
+   PVOP_VCALL2(pv_lock_ops.queue_spin_lock_slowpath, lock, val);
+}
+
+static __always_inline void pv_queue_spin_unlock(struct qspinlock *lock)
+{
+   PVOP_VCALLEE1(pv_lock_ops.queue_spin_unlock, lock);
+}
+
+static __always_inline void pv_wait(u8 *ptr, u8 val)
+{
+   PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+}
+
+static __always_inline void pv_kick(int cpu)
+{
+   PVOP_VCALL1(pv_lock_ops.kick, cpu);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -724,7 +749,9 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
-#endif
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
+#endif /* SMP  PARAVIRT_SPINLOCKS */
 
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS pushl %ecx; pushl %edx;
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..f6acaea 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,19 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+struct qspinlock;
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+   void (*queue_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
+   struct paravirt_callee_save queue_spin_unlock;
+
+   void (*wait)(u8 *ptr, u8 val);
+   void (*kick)(int cpu);
+#else /* !CONFIG_QUEUE_SPINLOCK */
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif /* !CONFIG_QUEUE_SPINLOCK */
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 64c925e..c8290db 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -3,6 +3,7 @@
 
 #include asm/cpufeature.h
 #include asm-generic/qspinlock_types.h
+#include asm/paravirt.h
 
 #definequeue_spin_unlock queue_spin_unlock
 /**
@@ -11,11 +12,33 @@
  *
  * A smp_store_release() on the least-significant byte.
  */
-static inline void queue_spin_unlock(struct qspinlock *lock)
+static inline void native_queue_spin_unlock(struct qspinlock *lock)
 {
smp_store_release((u8 *)lock, 0);
 }
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void 

[PATCH v16 07/14] qspinlock: Revert to test-and-set on hypervisors

2015-04-24 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/qspinlock.h |   14 ++
 include/asm-generic/qspinlock.h  |7 +++
 kernel/locking/qspinlock.c   |3 +++
 3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 222995b..64c925e 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include asm/cpufeature.h
 #include asm-generic/qspinlock_types.h
 
 #definequeue_spin_unlock queue_spin_unlock
@@ -15,6 +16,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
smp_store_release((u8 *)lock, 0);
 }
 
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0)
+   cpu_relax();
+
+   return true;
+}
+
 #include asm-generic/qspinlock.h
 
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 315d6dc..bcbbc5e 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -111,6 +111,13 @@ static inline void queue_spin_unlock_wait(struct qspinlock 
*lock)
cpu_relax();
 }
 
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   return false;
+}
+#endif
+
 /*
  * Initializier
  */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 99503ef..fc2e5ab 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -249,6 +249,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   if (virt_queue_spin_lock(lock))
+   return;
+
/*
 * wait for in-progress pending-locked hand-overs
 *
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v16 01/14] qspinlock: A simple generic 4-byte queue spinlock

2015-04-24 Thread Waiman Long
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
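
The 24-bit encoding described above works out to roughly the following
(a sketch consistent with the patch body, which the archive truncates
further down; the _Q_TAIL_* constants come from qspinlock_types.h):

/* Pack (cpu, nesting index) into the tail code word; cpu + 1 is stored
 * so that a tail of 0 can mean "no tail".
 */
static inline u32 encode_tail(int cpu, int idx)
{
	u32 tail;

	tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
	tail |= idx << _Q_TAIL_IDX_OFFSET;	/* idx < 4 */

	return tail;
}

static inline struct mcs_spinlock *decode_tail(u32 tail)
{
	int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
	int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;

	return per_cpu_ptr(&mcs_nodes[idx], cpu);
}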

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 include/asm-generic/qspinlock.h   |  132 +
 include/asm-generic/qspinlock_types.h |   58 +
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 +
 kernel/locking/mcs_spinlock.h |1 +
 kernel/locking/qspinlock.c|  209 +
 6 files changed, 408 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 000..315d6dc
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,132 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include asm-generic/qspinlock_types.h
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+   return atomic_read(lock-val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+   return !atomic_read(lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+   return atomic_read(lock-val)  ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+   if (!atomic_read(lock-val) 
+  (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+   u32 

[PATCH v16 04/14] qspinlock: Extract out code snippets for the next patch

2015-04-24 Thread Waiman Long
This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.

 1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
 2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.

This patch also simplifies the trylock operation before queuing by
calling queue_spin_trylock() directly.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 include/asm-generic/qspinlock_types.h |2 +
 kernel/locking/qspinlock.c|   79 -
 2 files changed, 50 insertions(+), 31 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 9c3f5c2..ef36613 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -58,6 +58,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL (1U  _Q_PENDING_OFFSET)
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0351f78..11f6ad9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -97,6 +97,42 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,1,0 - *,0,1
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+   atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, lock-val);
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* - n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(lock-val);
+
+   for (;;) {
+   new = (val  _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(lock-val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -178,15 +214,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 *
 * *,1,0 - *,0,1
 */
-   for (;;) {
-   new = (val  ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
-   old = atomic_cmpxchg(lock-val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   clear_pending_set_locked(lock);
return;
 
/*
@@ -203,37 +231,26 @@ queue:
node-next = NULL;
 
/*
-* We have already touched the queueing cacheline; don't bother with
-* pending stuff.
-*
-* trylock || xchg(lock, node)
-*
-* 0,0,0 - 0,0,1 ; no tail, not locked - no tail, locked.
-* p,y,x - n,y,x ; tail was p - tail is n; preserving locked.
+* We touched a (possibly) cold cacheline in the per-cpu queue node;
+* attempt the trylock once more in the hope someone let go while we
+* weren't watching.
 */
-   for (;;) {
-   new = _Q_LOCKED_VAL;
-   if (val)
-   new = tail | (val  _Q_LOCKED_PENDING_MASK);
-
-   old = atomic_cmpxchg(lock-val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   if (queue_spin_trylock(lock))
+   goto release;
 
/*
-* we won the trylock; forget about queueing.
+* We have already touched the queueing cacheline; don't bother with
+* pending stuff.
+*
+* p,*,* - n,*,*
 */
-   if (new == _Q_LOCKED_VAL)
-   goto release;
+   old = xchg_tail(lock, tail);
 
/*
 * if there was a previous node; link it and wait until reaching the
 * head of the waitqueue.
 */
-   if (old  ~_Q_LOCKED_PENDING_MASK) {
+   if (old  _Q_TAIL_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev-next, node);
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v16 14/14] pvqspinlock: Collect slowpath lock statistics

2015-04-24 Thread Waiman Long
This patch enables the accumulation of PV qspinlock statistics
when any one of the following three sets of CONFIG parameters
is enabled:

 1) CONFIG_LOCK_STAT  CONFIG_DEBUG_FS
 2) CONFIG_KVM_DEBUG_FS
 3) CONFIG_XEN_DEBUG_FS

The accumulated lock statistics will be reported in debugfs under the
pv-qspinlock directory.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock_paravirt.h |  100 ++-
 1 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 41ee033..d512d9b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -43,6 +43,86 @@ struct pv_node {
u8  mayhalt;
 };
 
+#if defined(CONFIG_KVM_DEBUG_FS) || defined(CONFIG_XEN_DEBUG_FS) ||\
+   (defined(CONFIG_LOCK_STAT)  defined(CONFIG_DEBUG_FS))
+#define PV_QSPINLOCK_STAT
+#endif
+
+/*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+   pv_stat_wait_head,
+   pv_stat_wait_node,
+   pv_stat_wait_hash,
+   pv_stat_kick_cpu,
+   pv_stat_no_kick,
+   pv_stat_spurious,
+   pv_stat_hash,
+   pv_stat_hops,
+   pv_stat_num /* Total number of statistics counts */
+};
+
+#ifdef PV_QSPINLOCK_STAT
+
+#include linux/debugfs.h
+
+static const char * const stat_fsnames[pv_stat_num] = {
+   [pv_stat_wait_head] = wait_head_count,
+   [pv_stat_wait_node] = wait_node_count,
+   [pv_stat_wait_hash] = wait_hash_count,
+   [pv_stat_kick_cpu]  = kick_cpu_count,
+   [pv_stat_no_kick]   = no_kick_count,
+   [pv_stat_spurious]  = spurious_wakeup,
+   [pv_stat_hash]  = hash_count,
+   [pv_stat_hops]  = hash_hops_count,
+};
+
+static atomic_t pv_stats[pv_stat_num];
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+   struct dentry *d_pvqlock = debugfs_create_dir(pv-qspinlock, NULL);
+   int i;
+
+   if (!d_pvqlock)
+   printk(KERN_WARNING
+  Could not create 'pv-qspinlock' debugfs directory\n);
+
+   for (i = 0; i  pv_stat_num; i++)
+   debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+ (u32 *)pv_stats[i]);
+   return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat)
+{
+   atomic_inc(pv_stats[stat]);
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+   atomic_inc(pv_stats[pv_stat_hash]);
+   atomic_add(hopcnt, pv_stats[pv_stat_hops]);
+}
+
+#else /* PV_QSPINLOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat) { }
+static inline void pvstat_hop(int hopcnt)  { }
+
+#endif /* PV_QSPINLOCK_STAT */
+
 /*
  * Lock and MCS node addresses hash table for fast lookup
  *
@@ -102,11 +182,13 @@ pv_hash(struct qspinlock *lock, struct pv_node *node)
 {
unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he, *end;
+   int hopcnt = 0;
 
init_hash = hash;
for (;;) {
he = pv_lock_hash[hash].ent;
for (end = he + PV_HE_PER_LINE; he  end; he++) {
+   hopcnt++;
if (!cmpxchg(he-lock, NULL, lock)) {
/*
 * We haven't set the _Q_SLOW_VAL yet. So
@@ -122,6 +204,7 @@ pv_hash(struct qspinlock *lock, struct pv_node *node)
}
 
 done:
+   pvstat_hop(hopcnt);
return he-lock;
 }
 
@@ -177,8 +260,12 @@ __visible void __pv_queue_spin_unlock(struct qspinlock 
*lock)
 * At this point the memory pointed at by lock can be freed/reused,
 * however we can still use the PV node to kick the CPU.
 */
-   if (READ_ONCE(node-state) != vcpu_running)
+   if (READ_ONCE(node-state) != vcpu_running) {
+   pvstat_inc(pv_stat_kick_cpu);
pv_kick(node-cpu);
+   } else {
+   pvstat_inc(pv_stat_no_kick);
+   }
 }
 /*
  * Include the architecture specific callee-save thunk of the
@@ -241,8 +328,10 @@ static void pv_wait_node(struct mcs_spinlock *node)
 */
(void)xchg(pn-state, vcpu_halted);
 
-   if (!READ_ONCE(node-locked))
+   if (!READ_ONCE(node-locked)) {
+   pvstat_inc(pv_stat_wait_node);
pv_wait(pn-state, vcpu_halted);
+   }
 
pn-mayhalt = false;
/*
@@ -250,6 +339,8 @@ static void pv_wait_node(struct mcs_spinlock *node)
 */
(void)cmpxchg(pn-state, vcpu_halted, vcpu_running);
 
+   if (READ_ONCE(node-locked))
+   break;
/*
 * If the locked flag is still 

[PATCH v16 08/14] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-24 Thread Waiman Long
Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.

The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.

The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.

The head contender has no such storage directly visible to the
unlocker.  So the unlocker searches a hash table with open addressing
(linear probing within a cacheline in this version of the series; earlier
versions used a simple binary Galois linear feedback shift register).
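
Put differently, the unlock side ends up looking roughly like the sketch
below. This is a simplification, not the literal patch: the name of the
hash-lookup helper is an assumption, and the real function also copes with
the lock memory being freed or reused concurrently.

__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;
	struct pv_node *node;

	/* Fast path: no waiter stored _Q_SLOW_VAL, a plain release suffices. */
	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
		return;

	/* A halted queue head published itself in the hash table; find it. */
	node = pv_hash_find(lock);		/* helper name is an assumption */
	smp_store_release(&l->locked, 0);

	/* Only kick the vCPU if it actually went to sleep. */
	if (READ_ONCE(node->state) != vcpu_running)
		pv_kick(node->cpu);
}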

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c  |   68 +++-
 kernel/locking/qspinlock_paravirt.h |  324 +++
 2 files changed, 391 insertions(+), 1 deletions(-)
 create mode 100644 kernel/locking/qspinlock_paravirt.h

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc2e5ab..c009120 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -18,6 +18,9 @@
  * Authors: Waiman Long waiman.l...@hp.com
  *  Peter Zijlstra pet...@infradead.org
  */
+
+#ifndef _GEN_PV_LOCK_SLOWPATH
+
 #include linux/smp.h
 #include linux/bug.h
 #include linux/cpumask.h
@@ -65,13 +68,21 @@
 
 #include mcs_spinlock.h
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define MAX_NODES  8
+#else
+#define MAX_NODES  4
+#endif
+
 /*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ *
+ * PV doubles the storage and uses the second cacheline for PV state.
  */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
 
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -220,6 +231,32 @@ static __always_inline void set_locked(struct qspinlock 
*lock)
WRITE_ONCE(l-locked, _Q_LOCKED_VAL);
 }
 
+
+/*
+ * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for
+ * all the PV callbacks.
+ */
+
+static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
+
+static __always_inline void __pv_wait_head(struct qspinlock *lock,
+  struct mcs_spinlock *node) { }
+
+#define pv_enabled()   false
+
+#define pv_init_node   __pv_init_node
+#define pv_wait_node   __pv_wait_node
+#define pv_kick_node   __pv_kick_node
+#define pv_wait_head   __pv_wait_head
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define queue_spin_lock_slowpath   native_queue_spin_lock_slowpath
+#endif
+
+#endif /* _GEN_PV_LOCK_SLOWPATH */
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
@@ -249,6 +286,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   if (pv_enabled())
+   goto queue;
+
if (virt_queue_spin_lock(lock))
return;
 
@@ -325,6 +365,7 @@ queue:
node += idx;
node-locked = 0;
node-next = NULL;
+   pv_init_node(node);
 
/*
 * We touched a (possibly) cold cacheline in the per-cpu queue node;
@@ -350,6 +391,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev-next, node);
 
+   pv_wait_node(node);
arch_mcs_spin_lock_contended(node-locked);
}
 
@@ -365,6 +407,7 @@ queue:
 * does not imply a full barrier.
 *
 */
+   pv_wait_head(lock, node);
while ((val = smp_load_acquire(lock-val.counter))  
_Q_LOCKED_PENDING_MASK)
cpu_relax();
 
@@ -397,6 +440,7 @@ queue:
cpu_relax();
 
arch_mcs_spin_unlock_contended(next-locked);
+   pv_kick_node(next);
 
 release:
/*
@@ -405,3 +449,25 @@ release:
this_cpu_dec(mcs_nodes[0].count);
 }
 EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+/*
+ * Generate the paravirt code for queue_spin_unlock_slowpath().
+ */
+#if !defined(_GEN_PV_LOCK_SLOWPATH)  defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define _GEN_PV_LOCK_SLOWPATH
+
+#undef  pv_enabled
+#define pv_enabled()   true
+
+#undef pv_init_node
+#undef pv_wait_node
+#undef pv_kick_node
+#undef pv_wait_head
+
+#undef  queue_spin_lock_slowpath
+#define queue_spin_lock_slowpath   __pv_queue_spin_lock_slowpath
+
+#include qspinlock_paravirt.h
+#include qspinlock.c
+
+#endif
diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
new file mode 100644
index 000..084e5c1
--- /dev/null
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -0,0 +1,324 @@

[PATCH v16 05/14] qspinlock: Optimize for smaller NR_CPUS

2015-04-24 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16 bits.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   13 ++
 kernel/locking/qspinlock.c|   69 -
 2 files changed, 81 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index ef36613..f01b55d 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS  16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS = 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS  (1U  14)
+#define _Q_PENDING_BITS8
+#else
 #define _Q_PENDING_BITS1
+#endif
 #define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 11f6ad9..bcc99e6 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -24,6 +24,7 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/mutex.h
+#include asm/byteorder.h
 #include asm/qspinlock.h
 
 /*
@@ -56,6 +57,10 @@
  * node; whereby avoiding the need to carry a node from lock to unlock, and
  * preserving existing lock API. This also makes the unlock code simpler and
  * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
+ *
  */
 
 #include mcs_spinlock.h
@@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,1,0 - *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   WRITE_ONCE(l-locked_pending, _Q_LOCKED_VAL);
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* - n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   return (u32)xchg(l-tail, tail  _Q_TAIL_OFFSET)  _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -131,6 +192,7 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
}
return old;
 }
+#endif /* _Q_PENDING_BITS == 8 */
 
 

[PATCH v16 06/14] qspinlock: Use a simple write to grab the lock

2015-04-24 Thread Waiman Long
Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.

With that change, there is some slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasks   Before patch   After patch    %Change
  ----------   ------------   -----------    -------
       3        2324/2321      2248/2265     -3%/-2%
       4        2890/2896      2819/2831     -2%/-2%
       5        3611/3595      3522/3512     -2%/-2%
       6        4281/4276      4173/4160     -3%/-3%
       7        5018/5001      4875/4861     -3%/-3%
       8        5759/5750      5563/5568     -3%/-3%

[Standalone/Embedded - different nodes]
  # of tasks   Before patch    After patch    %Change
  ----------   ------------    -----------    -------
       3       12242/12237     12087/12093    -1%/-1%
       4       10688/10696     10507/10521    -2%/-2%

It was also found that this change produced a much bigger performance
improvement on the newer IvyBridge-EX chip, essentially closing the
performance gap between the ticket spinlock and the queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel        JPM       Real Time   Sys Time   Usr Time
  ----------    -------   ---------   --------   --------
  ticketlock    5678233     3.17        96.61      5.81
  qspinlock     5750799     3.13        94.83      5.97

AIM7 EXT4 Disk Test
  kernel        JPM       Real Time   Sys Time   Usr Time
  ----------    -------   ---------   --------   --------
  ticketlock    1114551    16.15       509.72     7.11
  qspinlock     2184466     8.24       232.99     6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.

The ebizzy -m test was also run with the following results:

  kernel       records/s   Real Time   Sys Time   Usr Time
  ----------   ---------   ---------   --------   --------
  ticketlock     2075        10.00      216.35      3.49
  qspinlock      3023        10.00      198.20      4.80

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 kernel/locking/qspinlock.c |   66 +--
 1 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index bcc99e6..99503ef 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -105,24 +105,37 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
union {
atomic_t val;
-   struct {
 #ifdef __LITTLE_ENDIAN
+   struct {
+   u8  locked;
+   u8  pending;
+   };
+   struct {
u16 locked_pending;
u16 tail;
+   };
 #else
+   struct {
u16 tail;
u16 locked_pending;
-#endif
};
+   struct {
+   u8  reserved[2];
+   u8  pending;
+   u8  locked;
+   };
+#endif
};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -195,6 +208,19 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 - *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   WRITE_ONCE(l-locked, _Q_LOCKED_VAL);
+}
+
+/**
  * 

[PATCH v16 10/14] pvqspinlock, x86: Enable PV qspinlock for KVM

2015-04-24 Thread Waiman Long
This patch adds the necessary KVM specific code to allow KVM to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/kernel/kvm.c |   43 +++
 kernel/Kconfig.locks  |2 +-
 2 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e354cc6..4bb42c0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -584,6 +584,39 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+#include asm/qspinlock.h
+
+static void kvm_wait(u8 *ptr, u8 val)
+{
+   unsigned long flags;
+
+   if (in_nmi())
+   return;
+
+   local_irq_save(flags);
+
+   if (READ_ONCE(*ptr) != val)
+   goto out;
+
+   /*
+* halt until it's our turn and kicked. Note that we do safe halt
+* for irq enabled case to avoid hang when lock info is overwritten
+* in irq spinlock slowpath and no spurious interrupt occur to save us.
+*/
+   if (arch_irqs_disabled_flags(flags))
+   halt();
+   else
+   safe_halt();
+
+out:
+   local_irq_restore(flags);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
 enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -817,6 +850,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, 
__ticket_t ticket)
}
 }
 
+#endif /* !CONFIG_QUEUE_SPINLOCK */
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
@@ -828,8 +863,16 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+   __pv_init_lock_hash();
+   pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath;
+   pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock);
+   pv_lock_ops.wait = kvm_wait;
+   pv_lock_ops.kick = kvm_kick_cpu;
+#else /* !CONFIG_QUEUE_SPINLOCK */
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
pv_lock_ops.unlock_kick = kvm_unlock_kick;
+#endif
 }
 
 static __init int kvm_spinlock_init_jump(void)
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index c6a8f7c..537b13e 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
-   depends on SMP  !PARAVIRT_SPINLOCKS
+   depends on SMP  (!PARAVIRT_SPINLOCKS || !XEN)
 
 config ARCH_USE_QUEUE_RWLOCK
bool
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v16 03/14] qspinlock: Add pending bit

2015-04-24 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]), add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   12 +++-
 kernel/locking/qspinlock.c|  119 +++--
 2 files changed, 107 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index c9348d8..9c3f5c2 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define_Q_SET_MASK(type)   (((1U  _Q_ ## type ## _BITS) - 1)\
   _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS1
+#define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U  _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 3456819..0351f78 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
return per_cpu_ptr(mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock value)
- *
 - *  fast  :slow  :unlock
- *:  :
- * uncontended  (0,0)   --:-- (0,1) :-- (*,0)
- *:   | ^./  :
- *:   v   \   |  :
- * uncontended:(n,x) --+-- (n,0) |  :
- *   queue:   | ^--'  |  :
- *:   v   |  :
- * contended  :(*,x) --+-- (*,0) - (*,1) ---'  :
- *   queue: ^--' :
+ * (queue tail, pending bit, lock value)
  *
+ *  fast :slow  :unlock
+ *   :  :
 + * uncontended  (0,0,0) -:-- (0,0,1) --:-- (*,*,0)
+ *   :   | ^.--. /  :
+ *   :   v   \  \|  :
+ * pending   :(0,1,1) +-- (0,1,0)   \   |  :
+ *   :   | ^--'  |   |  :
+ *   :   v   |   |  :
+ * uncontended   :(n,x,y) +-- (n,0,0) --'   |  :
+ *   queue   :   | ^--'  |  :
+ *   :   v   |  :
+ * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -'  :
+ *   queue   : ^--' :
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
@@ -121,6 +125,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   /*
+* wait for in-progress pending-locked hand-overs
+*
+* 0,1,0 - 0,0,1
+*/
+   if (val == _Q_PENDING_VAL) {
+   while ((val = atomic_read(lock-val)) == _Q_PENDING_VAL)
+ 
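
(The archived message is truncated above. For completeness, the remainder
of the added pending-bit fast path is roughly the sketch below -- a
reconstruction in the spirit of the patch, not a literal quote of the hunk.)

	/*
	 * trylock || pending
	 *
	 * 0,0,0 -> 0,0,1 ; trylock
	 * 0,0,1 -> 0,1,1 ; pending
	 */
	for (;;) {
		/* If we observe any contention; queue. */
		if (val & ~_Q_LOCKED_MASK)
			goto queue;

		new = _Q_LOCKED_VAL;
		if (val == new)
			new |= _Q_PENDING_VAL;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}

	/* We won the trylock. */
	if (new == _Q_LOCKED_VAL)
		return;

	/*
	 * We are pending: wait for the lock owner to go away, then take
	 * ownership and clear the pending bit.
	 *
	 * *,1,1 -> *,1,0 -> *,0,1
	 */
	while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
		cpu_relax();

	for (;;) {
		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
	return;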

Re: [PATCH v5 7/8] vhost: cross-endian support for legacy devices

2015-04-24 Thread Cornelia Huck
On Thu, 23 Apr 2015 17:29:42 +0200
Greg Kurz gk...@linux.vnet.ibm.com wrote:

 This patch brings cross-endian support to vhost when used to implement
 legacy virtio devices. Since it is a relatively rare situation, the
 feature availability is controlled by a kernel config option (not set
 by default).
 
 The vq-is_le boolean field is added to cache the endianness to be
 used for ring accesses. It defaults to native endian, as expected
 by legacy virtio devices. When the ring gets active, we force little
 endian if the device is modern. When the ring is deactivated, we
 revert to the native endian default.
 
 If cross-endian was compiled in, a vq-user_be boolean field is added
 so that userspace may request a specific endianness. This field is
 used to override the default when activating the ring of a legacy
 device. It has no effect on modern devices.
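
(For reference, a userspace backend would drive the new knob roughly like
this; the fd and ring index are assumptions, only the ioctl and constant
names come from this patch. It has to be done before the ring is made
active, since the SET ioctl returns -EBUSY otherwise.)

#include <sys/ioctl.h>
#include <linux/vhost.h>

static int vring_set_big_endian(int vhost_fd, unsigned int ring_idx)
{
	struct vhost_vring_state s = {
		.index = ring_idx,
		.num   = VHOST_VRING_BIG_ENDIAN,
	};

	/* Configure the ring endianness while the backend is not yet set. */
	return ioctl(vhost_fd, VHOST_SET_VRING_ENDIAN, &s);
}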
 
 Signed-off-by: Greg Kurz gk...@linux.vnet.ibm.com
 ---
 
 Changes since v4:
 - rewrote patch title to mention cross-endian
 - renamed config to VHOST_CROSS_ENDIAN_LEGACY
 - rewrote config description and help
 - moved ifdefery to top of vhost.c
 - added a detailed comment about the lifecycle of vq-user_be in
   vhost_init_is_le()
 - renamed ioctls to VHOST_[GS]ET_VRING_ENDIAN
 - added LE/BE defines to the ioctl API
 - rewrote ioctl sanity check with the LE/BE defines
 - updated comment in uapi/linux/vhost.h to mention that the availibility
   of both SET and GET ioctls depends on the kernel config
 
  drivers/vhost/Kconfig  |   15 
  drivers/vhost/vhost.c  |   86 
 +++-
  drivers/vhost/vhost.h  |   10 +
  include/uapi/linux/vhost.h |   12 ++
  4 files changed, 121 insertions(+), 2 deletions(-)
 
 diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
 index 017a1e8..74d7380 100644
 --- a/drivers/vhost/Kconfig
 +++ b/drivers/vhost/Kconfig
 @@ -32,3 +32,18 @@ config VHOST
   ---help---
 This option is selected by any driver which needs to access
 the core of vhost.
 +
 +config VHOST_CROSS_ENDIAN_LEGACY
 + bool Cross-endian support for vhost
 + default n
 + ---help---
 +   This option allows vhost to support guests with a different byte
 +   ordering from host.

...while using legacy virtio.

Might help to explain the LEGACY in the config option ;)

 +
 +   Userspace programs can control the feature using the
 +   VHOST_SET_VRING_ENDIAN and VHOST_GET_VRING_ENDIAN ioctls.
 +
 +   This is only useful on a few platforms (ppc64 and arm64). Since it
 +   adds some overhead, it is disabled default.

s/default/by default/

 +
 +   If unsure, say N.
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 2ee2826..8c4390d 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -36,6 +36,78 @@ enum {
 #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
 #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
 
 +#ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
 +static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
 +{
 + vq->user_be = !virtio_legacy_is_little_endian();
 +}
 +
 +static long vhost_set_vring_endian(struct vhost_virtqueue *vq, int __user 
 *argp)
 +{
 + struct vhost_vring_state s;
 +
 + if (vq->private_data)
 + return -EBUSY;
 +
 + if (copy_from_user(&s, argp, sizeof(s)))
 + return -EFAULT;
 +
 + if (s.num != VHOST_VRING_LITTLE_ENDIAN &&
 + s.num != VHOST_VRING_BIG_ENDIAN)
 + return -EINVAL;
 +
 + vq->user_be = s.num;
 +
 + return 0;
 +}
 +
 +static long vhost_get_vring_endian(struct vhost_virtqueue *vq, u32 idx,
 +int __user *argp)
 +{
 + struct vhost_vring_state s = {
 + .index = idx,
 + .num = vq->user_be
 + };
 +
 + if (copy_to_user(argp, &s, sizeof(s)))
 + return -EFAULT;
 +
 + return 0;
 +}
 +
 +static void vhost_init_is_le(struct vhost_virtqueue *vq)
 +{
 + /* Note for legacy virtio: user_be is initialized at reset time
 +  * according to the host endianness. If userspace does not set an
 +  * explicit endianness, the default behavior is native endian, as
 +  * expected by legacy virtio.
 +  */
 + vq->is_le = vhost_has_feature(vq, VIRTIO_F_VERSION_1) || !vq->user_be;
 +}
 +#else
 +static void vhost_vq_reset_user_be(struct vhost_virtqueue *vq)
 +{
 + ;

Just leave the function body empty?

 +}
 +
 +static long vhost_set_vring_endian(struct vhost_virtqueue *vq, int __user 
 *argp)
 +{
 + return -ENOIOCTLCMD;
 +}
 +
 +static long vhost_get_vring_endian(struct vhost_virtqueue *vq, u32 idx,
 +int __user *argp)
 +{
 + return -ENOIOCTLCMD;
 +}
 +
 +static void vhost_init_is_le(struct vhost_virtqueue *vq)
 +{
 + if (vhost_has_feature(vq, VIRTIO_F_VERSION_1))
 + vq->is_le = true;
 +}
 +#endif /* CONFIG_VHOST_CROSS_ENDIAN_LEGACY */
 +
  
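
(A minimal sketch, not part of the patch, of how a userspace backend would
drive the new ioctls for a big-endian legacy guest on a little-endian host.
It assumes a <linux/vhost.h> that already contains the uapi additions from
this series, namely VHOST_SET_VRING_ENDIAN and the VHOST_VRING_*_ENDIAN
values; their exact ioctl numbers are not visible in this excerpt.)

    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/vhost.h>

    /* Request big-endian ring accesses for one virtqueue of a legacy device.
     * Must be issued while the ring is stopped; the kernel side above
     * returns -EBUSY once vq->private_data is set.
     */
    int vring_set_big_endian(int vhost_fd, unsigned int ring_index)
    {
    	struct vhost_vring_state s = {
    		.index = ring_index,
    		.num   = VHOST_VRING_BIG_ENDIAN,
    	};

    	if (ioctl(vhost_fd, VHOST_SET_VRING_ENDIAN, &s) < 0) {
    		perror("VHOST_SET_VRING_ENDIAN");
    		return -1;
    	}
    	return 0;
    }

With this in place, vhost_init_is_le() selects little endian only for
VIRTIO_F_VERSION_1 devices or when user_be is false, so the legacy ring is
then accessed big endian as requested.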

Re: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-24 Thread Paolo Bonzini


On 24/04/2015 03:16, Zhang, Yang Z wrote:
 This is interesting since previous measurements on KVM have had
 the exact opposite results.  I think we need to understand this a
 lot more.
 
 What I can tell is that vmexit is heavy. So it is reasonable to see
 the improvement in some cases, especially since the kernel is using eager
 FPU now, which means each schedule may trigger a vmexit.

On the other hand vmexit is lighter and lighter on newer processors; a
Sandy Bridge has less than half the vmexit cost of a Core 2 (IIRC 1000
vs. 2500 clock cycles approximately).

Also, measurements were done on Westmere, but Sandy Bridge is the first
processor to have XSAVEOPT and thus use eager FPU.

Paolo
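
(The cycle counts being compared here come from timing forced exits from
inside a guest.  A rough user-space sketch of that kind of measurement
follows: it times a CPUID loop with RDTSC, so the result includes the cost
of CPUID itself.  The loop count and methodology are choices of this sketch,
not of the kvm-unit-tests vmexit test.)

    /* Run inside a guest: CPUID is intercepted by KVM, so each iteration
     * pays one guest->host->guest round trip.  On bare metal the same loop
     * gives the native CPUID cost for comparison.
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <cpuid.h>
    #include <x86intrin.h>

    int main(void)
    {
    	const uint64_t loops = 1000000;
    	unsigned int eax = 0, ebx, ecx, edx;
    	uint64_t start, end, sink = 0;

    	start = __rdtsc();
    	for (uint64_t i = 0; i < loops; i++) {
    		__get_cpuid(0, &eax, &ebx, &ecx, &edx);
    		sink += eax;            /* keep the loop from being elided */
    	}
    	end = __rdtsc();

    	printf("avg cycles per CPUID round trip: %llu (sink %llu)\n",
    	       (unsigned long long)((end - start) / loops),
    	       (unsigned long long)sink);
    	return 0;
    }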


RE: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-24 Thread Zhang, Yang Z
Paolo Bonzini wrote on 2015-04-24:
 
 
 On 24/04/2015 03:16, Zhang, Yang Z wrote:
 This is interesting since previous measurements on KVM have had the
 exact opposite results.  I think we need to understand this a lot
 more.
 
 What I can tell is that vmexit is heavy. So it is reasonable to see
 the improvement in some cases, especially since the kernel is using eager
 FPU now, which means each schedule may trigger a vmexit.
 
 On the other hand vmexit is lighter and lighter on newer processors; a
 Sandy Bridge has less than half the vmexit cost of a Core 2 (IIRC 1000
 vs. 2500 clock cycles approximately).
 

1000 cycles? I remember it takes about 4000 cycles even on an HSW server.

 Also, measurements were done on Westmere, but Sandy Bridge is the first
 processor to have XSAVEOPT and thus use eager FPU.
 
 Paolo


Best regards,
Yang
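
(Whether a given CPU actually has XSAVEOPT, as referenced above, can be
probed directly from userspace; a small sketch using GCC's cpuid.h helpers
and checking CPUID leaf 0xD, sub-leaf 1, EAX bit 0.  The kernel exposes the
same information as the "xsaveopt" flag in /proc/cpuinfo.)

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
    	unsigned int eax, ebx, ecx, edx;

    	__get_cpuid(0, &eax, &ebx, &ecx, &edx);    /* max basic leaf */
    	if (eax < 0xd) {
    		printf("CPUID leaf 0xD not supported\n");
    		return 1;
    	}
    	__cpuid_count(0xd, 1, eax, ebx, ecx, edx); /* XSAVE sub-features */
    	printf("XSAVEOPT: %s\n", (eax & 1) ? "yes" : "no");
    	return 0;
    }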




Re: [PATCH kvm-unit-tests] x86/run: Rearrange the valid binary and testdev support checks

2015-04-24 Thread Bandan Das

Ping, Paolo, did this slip through the cracks ?

Bandan Das b...@redhat.com writes:

 This extends the sanity checks done on known common Qemu binary
 paths when the user supplies a QEMU= on the command line

 Fixes: b895b967db94937d5b593c51b95eb32d2889a764

 Signed-off-by: Bandan Das b...@redhat.com
 ---
  x86/run | 43 +++
  1 file changed, 19 insertions(+), 24 deletions(-)

 diff --git a/x86/run b/x86/run
 index 219a93b..5281fca 100755
 --- a/x86/run
 +++ b/x86/run
 @@ -2,33 +2,28 @@
  NOTFOUND=1
  TESTDEVNOTSUPP=2
  
 -qemukvm=${QEMU:-qemu-kvm}
 -qemusystem=${QEMU:-qemu-system-x86_64}
 +qemubinarysearch="${QEMU:-qemu-kvm qemu-system-x86_64}"
  
 -if  ! [ -z "${QEMU}" ]
 -then
 - qemu=${QEMU}
 -else
 - for qemucmds in ${qemukvm} ${qemusystem}
 - do
 - unset QEMUFOUND
 - unset qemu
 - if ! [ -z "${QEMUFOUND=$(${qemucmds} --help 2>/dev/null | grep 
 QEMU)}" ] &&
 -${qemucmds} -device '?' 2>&1 | grep -F -e \"testdev\" -e 
 \"pc-testdev\" > /dev/null;
 - then
 - qemu=${qemucmds}
 - break
 - fi
 - done
 - if  [ -z "${QEMUFOUND}" ]
 - then
 - echo "A QEMU binary was not found, You can set a custom 
 location by using the QEMU=<path> environment variable "
 - exit ${NOTFOUND}
 - elif [ -z "${qemu}" ]
 +for qemucmd in ${qemubinarysearch}
 +do
 + unset QEMUFOUND
 + unset qemu
 + if ! [ -z "${QEMUFOUND=$(${qemucmd} --help 2>/dev/null | grep QEMU)}" 
 ] &&
 + ${qemucmd} -device '?' 2>&1 | grep -F -e \"testdev\" -e 
 \"pc-testdev\" > /dev/null;
   then
 - echo "No Qemu test device support found"
 - exit ${TESTDEVNOTSUPP}
 + qemu=${qemucmd}
 + break
   fi
 +done
 +
 +if  [ -z "${QEMUFOUND}" ]
 +then
 + echo "A QEMU binary was not found, You can set a custom location by 
 using the QEMU=<path> environment variable "
 + exit ${NOTFOUND}
 +elif [ -z "${qemu}" ]
 +then
 + echo "No Qemu test device support found"
 + exit ${TESTDEVNOTSUPP}
  fi
  
  if


[PATCH] vfio-pci: Reset workaround for AMD Bonaire and Hawaii GPUs

2015-04-24 Thread Alex Williamson
Somehow these GPUs manage not to respond to a PCI bus reset, removing
our primary mechanism for resetting graphics cards.  The result is
that these devices typically work well for a single VM boot.  If the
VM is rebooted or restarted, the guest driver is not able to init the
card from the dirty state, resulting in a blue screen for Windows
guests.

The workaround is to use a device specific reset.  This is not 100%
reliable though since it depends on the incoming state of the device,
but it substantially improves the usability of these devices in a VM.

Credit to Alex Deucher alexander.deuc...@amd.com for his guidance.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio/pci.c |  162 +
 1 file changed, 162 insertions(+)
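
(Aside for readers: the first hunk below adds a per-device resetfn hook to
VFIOPCIDevice and the rest implements the Radeon-specific reset; the hunk
that actually wires the hook into vfio's reset path comes later in the
patch, beyond this excerpt.  A minimal sketch of how such a hook is
typically consumed; every name other than resetfn is hypothetical.)

    #include <stdio.h>

    /* Hypothetical stand-in for VFIOPCIDevice, reduced to the hook itself. */
    typedef struct SketchVFIOPCIDevice {
    	int (*resetfn)(struct SketchVFIOPCIDevice *vdev);
    	const char *name;
    } SketchVFIOPCIDevice;

    /* Generic reset path: prefer the device-specific reset, fall back to
     * the usual FLR/bus-reset handling when the hook is absent or fails.
     */
    static void sketch_vfio_pci_reset(SketchVFIOPCIDevice *vdev)
    {
    	if (vdev->resetfn && vdev->resetfn(vdev) == 0) {
    		return;
    	}
    	printf("%s: falling back to generic FLR/bus reset\n", vdev->name);
    }

    static int sketch_radeon_reset(SketchVFIOPCIDevice *vdev)
    {
    	printf("%s: device-specific reset\n", vdev->name);
    	return 0;
    }

    int main(void)
    {
    	SketchVFIOPCIDevice dev = {
    		.resetfn = sketch_radeon_reset,
    		.name = "0000:01:00.0",
    	};

    	sketch_vfio_pci_reset(&dev);
    	return 0;
    }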

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index ebc1e0a..baae98c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -154,6 +154,7 @@ typedef struct VFIOPCIDevice {
 PCIHostDeviceAddress host;
 EventNotifier err_notifier;
 EventNotifier req_notifier;
+int (*resetfn)(struct VFIOPCIDevice *);
 uint32_t features;
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
 #define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT)
@@ -3319,6 +3320,162 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice 
*vdev)
 vdev->req_enabled = false;
 }
 
+/*
+ * AMD Radeon PCI config reset, based on Linux:
+ *   drivers/gpu/drm/radeon/ci_smc.c:ci_is_smc_running()
+ *   drivers/gpu/drm/radeon/radeon_device.c:radeon_pci_config_reset
+ *   drivers/gpu/drm/radeon/ci_smc.c:ci_reset_smc()
+ *   drivers/gpu/drm/radeon/ci_smc.c:ci_stop_smc_clock()
+ * IDs: include/drm/drm_pciids.h
+ * Registers: http://cgit.freedesktop.org/~agd5f/linux/commit/?id=4e2aa447f6f0
+ *
+ * Bonaire and Hawaii GPUs do not respond to a bus reset.  This is a bug in the
+ * hardware that should be fixed on future ASICs.  The symptom of this is that
+ * once the accelerated driver loads, Windows guests will bsod on subsequent
+ * attempts to load the driver, such as after VM reset or shutdown/restart.  To
+ * work around this, we do an AMD specific PCI config reset, followed by an SMC
+ * reset.  The PCI config reset only works if SMC firmware is running, so we
+ * have a dependency on the state of the device as to whether this reset will
+ * be effective.  There are still cases where we won't be able to kick the
+ * device into working, but this greatly improves the usability overall.  The
+ * config reset magic is relatively common on AMD GPUs, but the setup and SMC
+ * poking is largely ASIC specific.
+ */
+static bool vfio_radeon_smc_is_running(VFIOPCIDevice *vdev)
+{
+uint32_t clk, pc_c;
+
+/*
+ * Registers 200h and 204h are index and data registers for accessing
+ * indirect configuration registers within the device.
+ */
+vfio_region_write(&vdev->bars[5].region, 0x200, 0x8004, 4);
+clk = vfio_region_read(&vdev->bars[5].region, 0x204, 4);
+vfio_region_write(&vdev->bars[5].region, 0x200, 0x8370, 4);
+pc_c = vfio_region_read(&vdev->bars[5].region, 0x204, 4);
+
+return (!(clk & 1) && (0x20100 <= pc_c));
+}
+
+/*
+ * The scope of a config reset is controlled by a mode bit in the misc register
+ * and a fuse, exposed as a bit in another register.  The fuse is the default
+ * (0 = GFX, 1 = whole GPU), the misc bit is a toggle, with the formula
+ * scope = !(misc ^ fuse), where the resulting scope is defined the same as
+ * the fuse.  A truth table therefore tells us that if misc == fuse, we need
+ * to flip the value of the bit in the misc register.
+ */
+static void vfio_radeon_set_gfx_only_reset(VFIOPCIDevice *vdev)
+{
+uint32_t misc, fuse;
+bool a, b;
+
+vfio_region_write(&vdev->bars[5].region, 0x200, 0xc00c, 4);
+fuse = vfio_region_read(&vdev->bars[5].region, 0x204, 4);
+b = fuse & 64;
+
+vfio_region_write(&vdev->bars[5].region, 0x200, 0xc010, 4);
+misc = vfio_region_read(&vdev->bars[5].region, 0x204, 4);
+a = misc & 2;
+
+if (a == b) {
+vfio_region_write(&vdev->bars[5].region, 0x204, misc ^ 2, 4);
+vfio_region_read(&vdev->bars[5].region, 0x204, 4); /* flush */
+}
+}
+
+static int vfio_radeon_reset(VFIOPCIDevice *vdev)
+{
+PCIDevice *pdev = &vdev->pdev;
+int i, ret = 0;
+uint32_t data;
+
+/* Defer to a kernel implemented reset */
+if (vdev->vbasedev.reset_works) {
+return -ENODEV;
+}
+
+/* Enable only memory BAR access */
+vfio_pci_write_config(pdev, PCI_COMMAND, PCI_COMMAND_MEMORY, 2);
+
+/* Reset only works if SMC firmware is loaded and running */
+if (!vfio_radeon_smc_is_running(vdev)) {
+ret = -EINVAL;
+goto out;
+}
+
+/* Make sure only the GFX function is reset */
+vfio_radeon_set_gfx_only_reset(vdev);
+
+/* AMD PCI config reset */
+vfio_pci_write_config(pdev, 0x7c, 0x39d5e86b, 4);
+usleep(100);
+
+/* Read back the memory size to make sure we're out of reset */
+for (i =