RE: [Qemu-devel] The status about vhost-net on kvm-arm?

2014-10-16 Thread GAUGUEY Rémy 228890
Hello, 

Using this Qemu patchset as well as recent irqfd work, I've tried to get
vhost-net working on a Cortex-A15.
Unfortunately, even if I can correctly generate IRQs to the guest through
irqfd, it seems to me that some pieces are still missing.
Indeed, the virtio-mmio interrupt status register (@ offset 0x60) is not updated by
the vhost thread, and reading it or writing to the peer interrupt ack register
(offset 0x64) from the guest causes a VM exit.
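
For reference, a minimal sketch of the guest-side handling I am referring to,
using the standard virtio-mmio register offsets (illustrative only, not code
from this patchset):

	#define VIRTIO_MMIO_INTERRUPT_STATUS	0x060
	#define VIRTIO_MMIO_INTERRUPT_ACK	0x064

	static irqreturn_t vm_interrupt_sketch(int irq, void *opaque)
	{
		void __iomem *base = opaque;
		unsigned long status;

		/* Both of these MMIO accesses trap back to the Qemu I/O thread,
		 * and the status register is never set by the vhost thread. */
		status = readl(base + VIRTIO_MMIO_INTERRUPT_STATUS);
		writel(status, base + VIRTIO_MMIO_INTERRUPT_ACK);

		return status ? IRQ_HANDLED : IRQ_NONE;
	}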

After reading older posts, I understand that vhost-net with irqfd support could
only work with MSI-X support:
On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:
"When MSI is off, each interrupt needs to be bounced through the io thread when
it's set/cleared, so vhost-net causes more context switches and
higher CPU utilization than userspace virtio which handles networking in the
same thread."
Indeed, in the MSI-X case, the virtio spec indicates that the ISR status
field is unused…

I understand that vhost does not emulate a complete virtio PCI adapter but only
manages virtqueue operations.
However, I don't have a clear view of what is performed by Qemu and what is
performed by the vhost thread…
Could someone enlighten me on this point, and maybe give some clues for an
implementation of vhost with irqfd and without MSI support?

Thanks a lot in advance.
Best regards.
Rémy



From: kvmarm-boun...@lists.cs.columbia.edu 
[mailto:kvmarm-boun...@lists.cs.columbia.edu] On behalf of Yingshiuan Pan
Sent: Friday, August 15, 2014 09:25
To: Li Liu
Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org; qemu-devel
Subject: Re: [Qemu-devel] The status about vhost-net on kvm-arm?

Hi, Li,

It's ok, I did get those mails from the mailing list. I guess it was because I did
not subscribe to some of the mailing lists.

Currently, I have no plan to renew my patchset: since I have
resigned from my previous company, I do not have a Cortex-A15 platform to
test/verify on.

I'm fine with that; it would be great if you or someone else could take it and
improve it.
Thanks.



Best Regards,
Yingshiuan Pan

2014-08-15 11:04 GMT+08:00 Li Liu john.li...@huawei.com:
Hi Ying-Shiuan Pan,

I don't know why your mail went missing from my mailbox. Sorry about that.
The results of vhost-net performance have been attached in another mail.

Do you have a plan to renew your patchset to support irqfd? If not,
we will try to finish it based on yours.

On 2014/8/14 11:50, Li Liu wrote:


 On 2014/8/13 19:25, Nikolay Nikolaev wrote:
 On Wed, Aug 13, 2014 at 12:10 PM, Nikolay Nikolaev
 n.nikol...@virtualopensystems.com wrote:
 On Tue, Aug 12, 2014 at 6:47 PM, Nikolay Nikolaev
 n.nikol...@virtualopensystems.com wrote:

 Hello,


 On Tue, Aug 12, 2014 at 5:41 AM, Li Liu john.li...@huawei.com wrote:

 Hi all,

 Can anyone tell me the current status of vhost-net on kvm-arm?

 Half a year has passed since Isa Ansharullah asked this question:
 http://www.spinics.net/lists/kvm-arm/msg08152.html

 I have found two patches which have provided the kvm-arm support of
 eventfd and irqfd:

 1) [RFC PATCH 0/4] ARM: KVM: Enable the ioeventfd capability of KVM on ARM
 http://lists.gnu.org/archive/html/qemu-devel/2014-01/msg01770.html

 2) [RFC,v3] ARM: KVM: add irqfd and irq routing support
 https://patches.linaro.org/32261/

 And there's a rough patch for qemu to support eventfd from Ying-Shiuan 
 Pan:

 [Qemu-devel] [PATCH 0/4] ioeventfd support for virtio-mmio
 https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg00715.html

 But there are no comments on this patch. And I can find nothing about
 qemu supporting irqfd. Did I lose track?

 If nobody is trying to fix it, we have a plan to complete it, with virtio-mmio
 supporting irqfd and multiqueue.



 we at Virtual Open Systems did some work and tested vhost-net on ARM
 back in March.
 The setup was based on:
  - host kernel with our ioeventfd patches:
 http://www.spinics.net/lists/kvm-arm/msg08413.html

 - qemu with the aforementioned patches from Ying-Shiuan Pan
 https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg00715.html

 The testbed was ARM Chromebook with Exynos 5250, using a 1Gbps USB3
 Ethernet adapter connected to a 1Gbps switch. I can't find the actual
 numbers but I remember that with multiple streams the gain was clearly
 seen. Note that it used the minimum required ioeventfd implementation
 and not irqfd.

 I guess it is feasible to think that it all can be put together and
 rebased on top of the recent irqfd work. One can achieve even better
 performance (because of the irqfd).


 Managed to replicate the setup with the old versions we used in March:

 Single stream from another machine to chromebook with 1Gbps USB3
 Ethernet adapter.
 iperf -c address -P 1 -i 1 -p 5001 -f k -t 10
 to HOST: 858316 Kbits/sec
 to GUEST: 761563 Kbits/sec
 to GUEST vhost=off: 508150 Kbits/sec

 10 parallel streams
 iperf -c address -P 10 -i 1 -p 5001 -f k -t 10
 to HOST: 842420 Kbits/sec
 to GUEST: 625144 Kbits/sec
 to GUEST vhost=off: 425276 Kbits/sec

 I 

[Bug 86161] On KVM, Windows 7 32bit guests sometimes run into blue screen(0x0000005c) during reboot

2014-10-16 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=86161

GC Ngu ng...@qq.com changed:

   What    |Removed                     |Added
   --------+----------------------------+--------------------------
   Summary |PROBLEM: On KVM, Windows 7  |On KVM, Windows 7 32bit
           |32bit guests sometimes run  |guests sometimes run into
           |into blue screen (0x005c)   |blue screen (0x005c)
           |during reboot               |during reboot

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


A question about HTL VM-Exit handling time

2014-10-16 Thread Wu, Feng
Hi folks,

I ran a kernel build in the guest and used perf kvm to get VM-Exit results like
the following:

Analyze events for all VCPUs:

             VM-EXIT    Samples  Samples%     Time%   Min Time    Max Time    Avg time

           MSR_WRITE    3613908    57.53%    18.97%        5us      1362us        9.73
                 HLT    1399747    22.28%    74.90%        5us    432448us       99.24
           CR_ACCESS     961203    15.30%     3.28%        4us       188us        6.33
  EXTERNAL_INTERRUPT     213821     3.40%     2.25%        4us      4089us       19.54
       EXCEPTION_NMI      25152     0.40%     0.12%        4us        71us        9.05
       EPT_MISCONFIG      20104     0.32%     0.15%        8us      5628us       13.74
               CPUID      19904     0.32%     0.07%        4us       220us        6.90
      IO_INSTRUCTION      17097     0.27%     0.20%       13us      1008us       22.08
   PAUSE_INSTRUCTION      10737     0.17%     0.05%        4us        53us        8.33
            MSR_READ         48     0.00%     0.00%        4us         8us        5.62

Total Samples:6281721, Total events handled time:185457820.41us.
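
For reference, output of this form typically comes from something along these
lines (exact options may differ):

  perf kvm stat record -a sleep 60
  perf kvm stat report --event=vmexit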

I also did some other experiments with different workloads in the guest, and I got
the same results in terms of HLT VM-Exit handling time. Does anyone know why the
handling time for HLT VM-Exit is so high? I appreciate your help!

Thanks,
Feng
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


A question about HLT VM-Exit handling time

2014-10-16 Thread Wu, Feng
Correct the typo in the subject.

 -Original Message-
 From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
 Behalf Of Wu, Feng
 Sent: Thursday, October 16, 2014 4:16 PM
 To: kvm@vger.kernel.org
 Cc: Xiao Guangrong
 Subject: A question about HTL VM-Exit handling time
 
 Hi folks,
 
 I ran a kernel build in the guest and used perf kvm to get VM-Exit results like
 the following:
 
 Analyze events for all VCPUs:
 
              VM-EXIT    Samples  Samples%     Time%   Min Time    Max Time    Avg time
 
            MSR_WRITE    3613908    57.53%    18.97%        5us      1362us        9.73
                  HLT    1399747    22.28%    74.90%        5us    432448us       99.24
            CR_ACCESS     961203    15.30%     3.28%        4us       188us        6.33
   EXTERNAL_INTERRUPT     213821     3.40%     2.25%        4us      4089us       19.54
        EXCEPTION_NMI      25152     0.40%     0.12%        4us        71us        9.05
        EPT_MISCONFIG      20104     0.32%     0.15%        8us      5628us       13.74
                CPUID      19904     0.32%     0.07%        4us       220us        6.90
       IO_INSTRUCTION      17097     0.27%     0.20%       13us      1008us       22.08
    PAUSE_INSTRUCTION      10737     0.17%     0.05%        4us        53us        8.33
             MSR_READ         48     0.00%     0.00%        4us         8us        5.62
 
 Total Samples:6281721, Total events handled time:185457820.41us.
 
 I also did some other experiments with different workloads in the guest, and I got
 the same results in terms of HLT VM-Exit handling time. Does anyone know why the
 handling time for HLT VM-Exit is so high? I appreciate your help!
 
 Thanks,
 Feng
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] arm/arm64: KVM: Fix BE accesses to GICv2 EISR and ELRSR regs

2014-10-16 Thread Christoffer Dall
Hi Victor,

On Thu, Oct 16, 2014 at 1:54 AM, Victor Kamensky
victor.kamen...@linaro.org wrote:
 On 14 October 2014 08:21, Victor Kamensky victor.kamen...@linaro.org wrote:
 On 14 October 2014 02:47, Marc Zyngier marc.zyng...@arm.com wrote:
 On Sun, Sep 28 2014 at 03:04:26 PM, Christoffer Dall 
 christoffer.d...@linaro.org wrote:
 The EISR and ELRSR registers are 32-bit registers on GICv2, and we
 store these as an array of two such registers on the vgic vcpu struct.
 However, we access them as a single 64-bit value or as a bitmap pointer
 in the generic vgic code, which breaks BE support.

 Instead, store them as u64 values on the vgic structure and do the
 word-swapping in the assembly code, which already handles the byte order
 for BE systems.

 Signed-off-by: Christoffer Dall christoffer.d...@linaro.org

 (still going through my email backlog, hence the delay).

 This looks like a valuable fix. I haven't had a chance to try it (no BE
 setup at hand), but maybe Victor can help reproduce this?

 I'll give it a spin.

 Tested-by: Victor Kamensky victor.kamen...@linaro.org

 Tested on v3.17 + this fix on TC2 (V7) and Mustang (V8) with BE
 kvm host, tried different combination of guests BE/LE V7/V8. All looks
 good.

 Only with the latest qemu in BE V8 mode on v3.17 without this
 fix was I able to reproduce the issue that Will spotted. With kvmtool
 and older qemu, the V8 BE code never hit the vgic_v2_set_lr function, so
 that is why we did not run into it before. I guess the qemu pl011 fix
 mentioned in 1f2bb4acc125 uncovered the vgic_v2_set_lr
 code path and this BE issue. With this patch it works fine now.

Thanks for the detailed testing and explanation.  I'll apply this one to next.

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v2 1/4] vfio: platform: add device tree info API and skeleton

2014-10-16 Thread Antonios Motakis
This patch introduces the API to return device tree info about
a PLATFORM device (if described by a device tree) and the skeleton
of the implementation for VFIO_PLATFORM. Information about any device
node bound by VFIO_PLATFORM should be queried via the introduced ioctl
VFIO_DEVICE_GET_DEVTREE_INFO.

The proposed API allows the user to get a list of strings with the available
property names, and then to query each property. Note that the properties
are not indexed numerically, so they are always accessed by property name.
The user needs to know the data type of the property being accessed.
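
For illustration, a rough userspace sketch of how this ioctl could be driven.
The structure layout below is only an assumption inferred from the argsz/offset
handling in this RFC (property name in, property data out through a trailing
buffer) and may differ from the final uapi header:

	struct vfio_devtree_info {
		__u32 argsz;	/* total size: header plus data area */
		__u32 type;	/* VFIO_DEVTREE_PROP_LIST, VFIO_DEVTREE_TYPE_U32, ... */
		__u32 length;	/* set by the kernel: bytes returned or needed */
		__u8  data[];	/* property name in, property data out */
	};

	char buf[sizeof(struct vfio_devtree_info) + 256];
	struct vfio_devtree_info *info = (void *)buf;

	memset(buf, 0, sizeof(buf));
	info->argsz = sizeof(buf);
	info->type  = VFIO_DEVTREE_TYPE_STRINGS;
	strcpy((char *)info->data, "compatible");	/* property to query */

	/* device_fd is an already-open VFIO platform device fd */
	if (ioctl(device_fd, VFIO_DEVICE_GET_DEVTREE_INFO, info) == 0)
		printf("compatible: %s (%u bytes)\n", (char *)info->data, info->length);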

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/Makefile|  3 +-
 drivers/vfio/platform/devtree.c   | 70 +++
 drivers/vfio/platform/vfio_platform_common.c  | 39 +++
 drivers/vfio/platform/vfio_platform_private.h |  6 +++
 include/uapi/linux/vfio.h | 26 ++
 5 files changed, 143 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/platform/devtree.c

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
index 81de144..99f3ba1 100644
--- a/drivers/vfio/platform/Makefile
+++ b/drivers/vfio/platform/Makefile
@@ -1,5 +1,6 @@
 
-vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o
+vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o \
+  devtree.o
 
 obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
 
diff --git a/drivers/vfio/platform/devtree.c b/drivers/vfio/platform/devtree.c
new file mode 100644
index 000..c057be3
--- /dev/null
+++ b/drivers/vfio/platform/devtree.c
@@ -0,0 +1,70 @@
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <linux/of.h>
+#include <linux/platform_device.h>
+#include "vfio_platform_private.h"
+
+static int devtree_get_prop_list(struct device_node *np, unsigned *lenp,
+void __user *datap, unsigned long datasz)
+{
+   return -EINVAL;
+}
+
+static int devtree_get_strings(struct device_node *np,
+  char *name, unsigned *lenp,
+  void __user *datap, unsigned long datasz)
+{
+   return -EINVAL;
+}
+
+static int devtree_get_uint(struct device_node *np, char *name,
+   uint32_t type, unsigned *lenp,
+   void __user *datap, unsigned long datasz)
+{
+   return -EINVAL;
+}
+
+int vfio_platform_devtree_info(struct device_node *np,
+  uint32_t type, unsigned *lenp,
+  void __user *datap, unsigned long datasz)
+{
+   char *name;
+   long namesz;
+   int ret;
+
+   if (type == VFIO_DEVTREE_PROP_LIST) {
+   return devtree_get_prop_list(np, lenp, datap, datasz);
+   }
+
+   namesz = strnlen_user(datap, datasz);
+   if (!namesz)
+   return -EFAULT;
+   if (namesz > datasz)
+   return -EINVAL;
+
+   name = kzalloc(namesz, GFP_KERNEL);
+   if (!name)
+   return -ENOMEM;
+   if (strncpy_from_user(name, datap, namesz) <= 0) {
+   kfree(name);
+   return -EFAULT;
+   }
+
+   switch (type) {
+   case VFIO_DEVTREE_TYPE_STRINGS:
+   ret = devtree_get_strings(np, name, lenp, datap, datasz);
+   break;
+
+   case VFIO_DEVTREE_TYPE_U32:
+   case VFIO_DEVTREE_TYPE_U16:
+   case VFIO_DEVTREE_TYPE_U8:
+   ret = devtree_get_uint(np, name, type, lenp, datap, datasz);
+   break;
+
+   default:
+   ret = -EINVAL;
+   }
+
+   kfree(name);
+   return ret;
+}
diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 2a6c665..bfbee2f 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -24,6 +24,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/io.h>
+#include <linux/of.h>
 
 #include "vfio_platform_private.h"
 
@@ -244,6 +245,34 @@ static long vfio_platform_ioctl(void *device_data,
 
return ret;
 
+   } else if (cmd == VFIO_DEVICE_GET_DEVTREE_INFO) {
+   struct vfio_devtree_info info;
+   void __user *datap;
+   unsigned long datasz;
+   int ret;
+
+   if (!vdev->of_node)
+   return -EINVAL;
+
+   minsz = offsetofend(struct vfio_devtree_info, length);
+
+   if (copy_from_user(info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz < minsz)
+   return -EINVAL;
+
+   datap = (void __user *) arg + minsz;
+   datasz = info.argsz - minsz;
+
+   ret = vfio_platform_devtree_info(vdev->of_node, info.type,
+

[RFC PATCH v2 4/4] vfio: platform: devtree: return arrays of u32, u16, or u8 data

2014-10-16 Thread Antonios Motakis
Certain properties of a device tree node are accessible as an array
of unsigned integers, either u32, u16, or u8. Let the VFIO user query
this type of device node property. Accessing u64 arrays is not yet
implemented in this RFC.
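
For illustration, with the hypothetical userspace layout sketched in patch 1/4,
a u32 property such as "reg" could be read back as follows (on -EAGAIN the
caller would retry with a larger buffer):

	info->type = VFIO_DEVTREE_TYPE_U32;
	strcpy((char *)info->data, "reg");	/* property name in */

	if (ioctl(device_fd, VFIO_DEVICE_GET_DEVTREE_INFO, info) == 0) {
		__u32 *vals = (__u32 *)info->data;	/* property data out */
		unsigned int i;

		for (i = 0; i < info->length / sizeof(__u32); i++)
			printf("reg[%u] = 0x%x\n", i, vals[i]);
	}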

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/devtree.c | 55 -
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/platform/devtree.c b/drivers/vfio/platform/devtree.c
index 6d25f97..17f55d4 100644
--- a/drivers/vfio/platform/devtree.c
+++ b/drivers/vfio/platform/devtree.c
@@ -97,7 +97,60 @@ static int devtree_get_uint(struct device_node *np, char 
*name,
uint32_t type, unsigned *lenp,
void __user *datap, unsigned long datasz)
 {
-   return -EINVAL;
+   int ret, n;
+   size_t sz;
+   u8 *out;
+   int (*func)(const struct device_node *, const char *, void *, size_t)
+   = NULL;
+
+   switch (type) {
+   case VFIO_DEVTREE_TYPE_U32:
+   sz = sizeof(u32);
+   func = (int (*)(const struct device_node *,
+   const char *, void *, size_t))
+   of_property_read_u32_array;
+   break;
+   case VFIO_DEVTREE_TYPE_U16:
+   sz = sizeof(u16);
+   func = (int (*)(const struct device_node *,
+   const char *, void *, size_t))
+   of_property_read_u16_array;
+   break;
+   case VFIO_DEVTREE_TYPE_U8:
+   sz = sizeof(u8);
+   func = (int (*)(const struct device_node *,
+   const char *, void *, size_t))
+   of_property_read_u8_array;
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   n = of_property_count_elems_of_size(np, name, sz);
+   if (n < 0)
+   return n;
+
+   if (lenp)
+   *lenp = n * sz;
+
+   if (n * sz > datasz)
+   return -EAGAIN;
+
+   out = kcalloc(n, sz, GFP_KERNEL);
+   if (!out)
+   return -EFAULT;
+
+   ret = func(np, name, out, n);
+   if (ret)
+   goto out;
+
+   if (copy_to_user(datap, out, n * sz))
+   ret = -EFAULT;
+
+out:
+   kfree(out);
+   return ret;
 }
 
 int vfio_platform_devtree_info(struct device_node *np,
-- 
2.1.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v2 2/4] vfio: platform: devtree: return available property names

2014-10-16 Thread Antonios Motakis
The available properties of a device are not indexed numerically;
instead, they are accessible by property name.
Passing type = VFIO_DEVTREE_PROP_LIST to VFIO_DEVICE_GET_DEVTREE_INFO
returns a list of strings with the available properties that the VFIO
user can access.
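
As an illustration (reusing the hypothetical userspace layout sketched in
patch 1/4), the returned buffer can be walked as back-to-back NUL-terminated
strings:

	info->type = VFIO_DEVTREE_PROP_LIST;
	if (ioctl(device_fd, VFIO_DEVICE_GET_DEVTREE_INFO, info) == 0) {
		char *p = (char *)info->data;
		char *end = p + info->length;

		while (p < end) {
			/* includes the synthetic "full_name" entry */
			printf("property: %s\n", p);
			p += strlen(p) + 1;
		}
	}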

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/devtree.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/platform/devtree.c b/drivers/vfio/platform/devtree.c
index c057be3..032ee16 100644
--- a/drivers/vfio/platform/devtree.c
+++ b/drivers/vfio/platform/devtree.c
@@ -7,7 +7,42 @@
 static int devtree_get_prop_list(struct device_node *np, unsigned *lenp,
 void __user *datap, unsigned long datasz)
 {
-   return -EINVAL;
+   struct property *prop;
+   int len = 0, sz;
+   int ret = 0;
+
+   for_each_property_of_node(np, prop) {
+   sz = strlen(prop->name) + 1;
+
+   if (datasz < sz) {
+   ret = -EAGAIN;
+   break;
+   }
+
+   if (copy_to_user(datap, prop->name, sz))
+   return -EFAULT;
+
+   datap += sz;
+   datasz -= sz;
+   len += sz;
+   }
+
+   /* if overflow occurs, calculate remaining length */
+   while (prop) {
+   len += strlen(prop->name) + 1;
+   prop = prop->next;
+   }
+
+   /* we expose the full_name in addition to the usual properties */
+   len += sz = strlen("full_name") + 1;
+   if (datasz < sz) {
+   ret = -EAGAIN;
+   } else if (copy_to_user(datap, "full_name", sz))
+   return -EFAULT;
+
+   *lenp = len;
+
+   return ret;
 }
 
 static int devtree_get_strings(struct device_node *np,
-- 
2.1.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v2 3/4] vfio: platform: devtree: access property as a list of strings

2014-10-16 Thread Antonios Motakis
Certain device tree properties (e.g. the device node name or the compatible
string) are available as a list of strings (separated by the null
terminating character). Let the VFIO user query this type of property.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/devtree.c | 43 -
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/platform/devtree.c b/drivers/vfio/platform/devtree.c
index 032ee16..6d25f97 100644
--- a/drivers/vfio/platform/devtree.c
+++ b/drivers/vfio/platform/devtree.c
@@ -45,11 +45,52 @@ static int devtree_get_prop_list(struct device_node *np, 
unsigned *lenp,
return ret;
 }
 
+static int devtree_get_full_name(struct device_node *np, unsigned *lenp,
+void __user *datap, unsigned long datasz)
+{
+   int len = strlen(np->full_name) + 1;
+
+   if (lenp)
+   *lenp = len;
+
+   if (len > datasz)
+   return -EAGAIN;
+
+   if (copy_to_user(datap, np->full_name, len))
+   return -EFAULT;
+
+   return 0;
+}
+
 static int devtree_get_strings(struct device_node *np,
   char *name, unsigned *lenp,
   void __user *datap, unsigned long datasz)
 {
-   return -EINVAL;
+   struct property *prop;
+   int len;
+
+   prop = of_find_property(np, name, len);
+
+   if (!prop) {
+   /* special case full_name as a property that is not on the fdt,
+* but we wish to return to the user as it includes the full
+* path of the device */
+   if (!strcmp(name, "full_name"))
+   return devtree_get_full_name(np, lenp, datap, datasz);
+   else
+   return -EINVAL;
+   }
+
+   if (lenp)
+   *lenp = len;
+
+   if (len > datasz)
+   return -EAGAIN;
+
+   if (copy_to_user(datap, prop->value, len))
+   return -EFAULT;
+
+   return 0;
 }
 
 static int devtree_get_uint(struct device_node *np, char *name,
-- 
2.1.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v12 11/11] pvqspinlock, x86: Enable PV qspinlock for XEN

2014-10-16 Thread Waiman Long
This patch adds the necessary XEN specific code to allow XEN to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/xen/spinlock.c |  149 +--
 kernel/Kconfig.locks|2 +-
 2 files changed, 145 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d1b6a32..8edc197 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,12 @@
 #include "xen-ops.h"
 #include "debugfs.h"
 
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifndef CONFIG_QUEUE_SPINLOCK
+
 enum xen_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -100,12 +106,9 @@ struct xen_lock_waiting {
__ticket_t want;
 };
 
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
 static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
 static cpumask_t waiting_cpus;
 
-static bool xen_pvspin = true;
 __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 {
int irq = __this_cpu_read(lock_kicker_irq);
@@ -213,6 +216,118 @@ static void xen_unlock_kick(struct arch_spinlock *lock, 
__ticket_t next)
}
 }
 
+#else /* CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_XEN_DEBUG_FS
+static u32 kick_nohlt_stats;   /* Kick but not halt count  */
+static u32 halt_qhead_stats;   /* Queue head halting count */
+static u32 halt_qnode_stats;   /* Queue node halting count */
+static u32 halt_abort_stats;   /* Halting abort count  */
+static u32 wake_kick_stats;/* Wakeup by kicking count  */
+static u32 wake_spur_stats;/* Spurious wakeup count*/
+static u64 time_blocked;   /* Total blocking time  */
+
+static inline void xen_halt_stats(enum pv_lock_stats type)
+{
+   if (type == PV_HALT_QHEAD)
+   add_smp(&halt_qhead_stats, 1);
+   else if (type == PV_HALT_QNODE)
+   add_smp(&halt_qnode_stats, 1);
+   else /* type == PV_HALT_ABORT */
+   add_smp(&halt_abort_stats, 1);
+}
+
+void xen_lock_stats(enum pv_lock_stats type)
+{
+   if (type == PV_WAKE_KICKED)
+   add_smp(&wake_kick_stats, 1);
+   else if (type == PV_WAKE_SPURIOUS)
+   add_smp(&wake_spur_stats, 1);
+   else /* type == PV_KICK_NOHALT */
+   add_smp(&kick_nohlt_stats, 1);
+}
+PV_CALLEE_SAVE_REGS_THUNK(xen_lock_stats);
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u64 delta;
+
+   delta = sched_clock() - start;
+   add_smp(&time_blocked, delta);
+}
+#else /* CONFIG_XEN_DEBUG_FS */
+static inline void xen_halt_stats(enum pv_lock_stats type)
+{
+}
+
+static inline u64 spin_time_start(void)
+{
+   return 0;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+}
+#endif /* CONFIG_XEN_DEBUG_FS */
+
+void xen_kick_cpu(int cpu)
+{
+   xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+}
+PV_CALLEE_SAVE_REGS_THUNK(xen_kick_cpu);
+
+/*
+ * Halt the current CPU & release it back to the host
+ */
+void xen_halt_cpu(u8 *lockbyte)
+{
+   int irq = __this_cpu_read(lock_kicker_irq);
+   unsigned long flags;
+   u64 start;
+
+   /* If kicker interrupts not initialized yet, just spin */
+   if (irq == -1)
+   return;
+
+   /*
+* Make sure an interrupt handler can't upset things in a
+* partially setup state.
+*/
+   local_irq_save(flags);
+   start = spin_time_start();
+
+   xen_halt_stats(lockbyte ? PV_HALT_QHEAD : PV_HALT_QNODE);
+   /* clear pending */
+   xen_clear_irq_pending(irq);
+
+   /* Allow interrupts while blocked */
+   local_irq_restore(flags);
+   /*
+* Don't halt if the lock is now available
+*/
+   if (lockbyte && !ACCESS_ONCE(*lockbyte)) {
+   xen_halt_stats(PV_HALT_ABORT);
+   return;
+   }
+   /*
+* If an interrupt happens here, it will leave the wakeup irq
+* pending, which will cause xen_poll_irq() to return
+* immediately.
+*/
+
+   /* Block until irq becomes pending (or perhaps a spurious wakeup) */
+   xen_poll_irq(irq);
+   spin_time_accum_blocked(start);
+}
+PV_CALLEE_SAVE_REGS_THUNK(xen_halt_cpu);
+
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
BUG();
@@ -258,7 +373,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -275,8 +389,17 @@ void __init xen_init_spinlocks(void)

[PATCH v12 10/11] pvqspinlock, x86: Enable PV qspinlock for KVM

2014-10-16 Thread Waiman Long
This patch adds the necessary KVM specific code to allow KVM to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Two KVM guests of 20 CPU cores (2 nodes) were created for performance
testing in one of the following two configurations:
 1) Only 1 VM is active
 2) Both VMs are active and they share the same 20 physical CPUs
(200% overcommit)

The tests run included the disk workload of the AIM7 benchmark on
both ext4 and xfs RAM disks at 3000 users on a 3.17 based kernel. The
ebizzy -m test and futextest were also run and their performance
data were recorded.  With two VMs running, the idle=poll kernel
option was added to simulate a busy guest. If PV qspinlock is not
enabled, the unfairlock will be used automatically in a guest.

AIM7 XFS Disk Test (no overcommit)
  kernel                 JPM    Real Time   Sys Time    Usr Time
  ------                 ---    ---------   --------    --------
  PV ticketlock      2542373         7.08      98.95        5.44
  PV qspinlock       2549575         7.06      98.63        5.40
  unfairlock         2616279         6.91      97.05        5.42

AIM7 XFS Disk Test (200% overcommit)
  kernel                 JPM    Real Time   Sys Time    Usr Time
  ------                 ---    ---------   --------    --------
  PV ticketlock       644468        27.93     415.22        6.33
  PV qspinlock        645624        27.88     419.84        0.39
  unfairlock          695518        25.88     377.40        4.09

AIM7 EXT4 Disk Test (no overcommit)
  kernel                 JPM    Real Time   Sys Time    Usr Time
  ------                 ---    ---------   --------    --------
  PV ticketlock      1995565         9.02     103.67        5.76
  PV qspinlock       2011173         8.95     102.15        5.40
  unfairlock         2066590         8.71      98.13        5.46

AIM7 EXT4 Disk Test (200% overcommit)
  kernel                 JPM    Real Time   Sys Time    Usr Time
  ------                 ---    ---------   --------    --------
  PV ticketlock       478341        37.63     495.81       30.78
  PV qspinlock        474058        37.97     475.74       30.95
  unfairlock          560224        32.13     398.43       26.27

For the AIM7 disk workload, both PV ticketlock and qspinlock have
about the same performance. The unfairlock performs slightly better
than the PV lock.

EBIZZY-m Test (no overcommit)
  kernelRec/s   Real Time   Sys TimeUsr Time
  - -   -   
  PV ticketlock 3255  10.00   60.65   3.62
  PV qspinlock  3318  10.00   54.27   3.60
  unfairlock2833  10.00   26.66   3.09

EBIZZY-m Test (200% overcommit)
  kernelRec/s   Real Time   Sys TimeUsr Time
  - -   -   
  PV ticketlock  841  10.00   71.03   2.37
  PV qspinlock   834  10.00   68.27   2.39
  unfairlock 865  10.00   27.08   1.51

  futextest (no overcommit)
  kernel   kops/s
  ---
  PV ticketlock11523
  PV qspinlock 12328
  unfairlock9478

  futextest (200% overcommit)
  kernel   kops/s
  ---
  PV ticketlock 7276
  PV qspinlock  7095
  unfairlock5614

The ebizzy and futextest runs have much higher spinlock contention than
the AIM7 disk workload. In this case, the unfairlock performs worse
than both the PV ticketlock and qspinlock. The performance of the two
PV locks is comparable.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/kernel/kvm.c |  138 -
 kernel/Kconfig.locks  |2 +-
 2 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index bc11fb5..9fb9015 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -560,7 +560,7 @@ arch_initcall(activate_jump_labels);
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
-static void kvm_kick_cpu(int cpu)
+void kvm_kick_cpu(int cpu)
 {
int apicid;
unsigned long flags = 0;
@@ -568,7 +568,9 @@ static void kvm_kick_cpu(int cpu)
apicid = per_cpu(x86_cpu_to_apicid, cpu);
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
+PV_CALLEE_SAVE_REGS_THUNK(kvm_kick_cpu);
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -796,6 +798,132 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, 
__ticket_t ticket)
}
}
 }
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 kick_nohlt_stats;   /* Kick but 

[PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-16 Thread Waiman Long
This patch adds para-virtualization support to the queue spinlock
code base with minimal impact to the native case. There are some
minor code changes in the generic qspinlock.c file which should be
usable in other architectures. The other code changes are specific
to x86 processors and so are all put under the arch/x86 directory.

On the lock side, there are a couple of jump labels and 2 paravirt
callee-saved calls that default to NOPs and some register move
instructions, so the performance impact should be minimal.

Since enabling paravirt spinlock will disable unlock function inlining,
a jump label can be added to the unlock function without adding patch
sites all over the kernel.

The actual paravirt code comes in 5 parts;

 - init_node; this initializes the extra data members required for PV
   state. PV state data is kept 1 cacheline ahead of the regular data.

 - link_and_wait_node; this replaces the regular MCS queuing code. CPU
   halting can happen if the wait is too long.

 - wait_head; this waits until the lock is available and the CPU will
   be halted if the wait is too long.

 - wait_check; this is called after acquiring the lock to see if the
   next queue head CPU is halted. If this is the case, the lock bit is
   changed to indicate the queue head will have to be kicked on unlock.

 - queue_unlock;  this routine has a jump label to check if paravirt
   is enabled. If yes, it has to do an atomic cmpxchg to clear the lock
   bit or call the slowpath function to kick the queue head cpu (sketched below).

Tracking the head is done in two parts: first, pv_wait_head will
store its cpu number in whichever node is pointed to by the tail part
of the lock word. Second, pv_link_and_wait_node() will propagate the
existing head from the old to the new tail node.
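
For illustration, the unlock path described above boils down to roughly the
following sketch (names simplified; see the patch body for the actual
implementation):

	static inline void queue_spin_unlock(struct qspinlock *lock)
	{
		struct __qspinlock *l = (void *)lock;

		barrier();
		if (static_key_false(&paravirt_spinlocks_enabled)) {
			/*
			 * A value other than _Q_LOCKED_VAL in the lock byte
			 * means the queue head has halted and must be kicked.
			 */
			if (cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) != _Q_LOCKED_VAL)
				queue_spin_unlock_slowpath(lock);
			return;
		}
		ACCESS_ONCE(l->locked) = 0;	/* native fast path: byte store */
	}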

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/paravirt.h   |   20 ++
 arch/x86/include/asm/paravirt_types.h |   20 ++
 arch/x86/include/asm/pvqspinlock.h|  403 +
 arch/x86/include/asm/qspinlock.h  |   44 -
 arch/x86/kernel/paravirt-spinlocks.c  |6 +
 kernel/locking/qspinlock.c|   72 ++-
 6 files changed, 558 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..3b041db 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,25 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP)  defined(CONFIG_PARAVIRT_SPINLOCKS)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_kick_cpu(int cpu)
+{
+   PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu);
+}
+
+static __always_inline void
+pv_lockwait(u8 *lockbyte)
+{
+   PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte);
+}
+
+static __always_inline void pv_lockstat(enum pv_lock_stats type)
+{
+   PVOP_VCALLEE1(pv_lock_ops.lockstat, type);
+}
+
+#else
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -723,6 +742,7 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
 {
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
+#endif
 
 #endif
 
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..49e4b76 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -326,6 +326,9 @@ struct pv_mmu_ops {
   phys_addr_t phys, pgprot_t flags);
 };
 
+struct mcs_spinlock;
+struct qspinlock;
+
 struct arch_spinlock;
 #ifdef CONFIG_SMP
 #include <asm/spinlock_types.h>
@@ -333,9 +336,26 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+enum pv_lock_stats {
+   PV_HALT_QHEAD,  /* Queue head halting   */
+   PV_HALT_QNODE,  /* Other queue node halting */
+   PV_HALT_ABORT,  /* Halting aborted  */
+   PV_WAKE_KICKED, /* Wakeup by kicking*/
+   PV_WAKE_SPURIOUS,   /* Spurious wakeup  */
+   PV_KICK_NOHALT  /* Kick but CPU not halted  */
+};
+#endif
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+   struct paravirt_callee_save kick_cpu;
+   struct paravirt_callee_save lockstat;
+   struct paravirt_callee_save lockwait;
+#else
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/pvqspinlock.h 
b/arch/x86/include/asm/pvqspinlock.h
new file mode 100644
index 000..d424252
--- /dev/null
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -0,0 +1,403 @@
+#ifndef _ASM_X86_PVQSPINLOCK_H
+#define _ASM_X86_PVQSPINLOCK_H
+
+/*
+ *   

[PATCH v12 08/11] qspinlock, x86: Rename paravirt_ticketlocks_enabled

2014-10-16 Thread Waiman Long
This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/include/asm/spinlock.h  |4 ++--
 arch/x86/kernel/kvm.c|2 +-
 arch/x86/kernel/paravirt-spinlocks.c |4 ++--
 arch/x86/xen/spinlock.c  |2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 5899483..928751e 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -39,7 +39,7 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD (1  15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
 #ifdef CONFIG_QUEUE_SPINLOCK
@@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t 
*lock,
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	if (TICKET_SLOWPATH_FLAG &&
-   static_key_false(&paravirt_ticketlocks_enabled)) {
+   static_key_false(&paravirt_spinlocks_enabled)) {
arch_spinlock_t prev;
 
prev = *lock;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3dd8e2c..bc11fb5 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -819,7 +819,7 @@ static __init int kvm_spinlock_init_jump(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return 0;
 
-   static_key_slow_inc(&paravirt_ticketlocks_enabled);
+   static_key_slow_inc(&paravirt_spinlocks_enabled);
 	printk(KERN_INFO "KVM setup paravirtual spinlock\n");
 
return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..e434f24 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 0ba5f3b..d1b6a32 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jump(void)
if (!xen_domain())
return 0;
 
-   static_key_slow_inc(&paravirt_ticketlocks_enabled);
+   static_key_slow_inc(&paravirt_spinlocks_enabled);
return 0;
 }
 early_initcall(xen_init_spinlocks_jump);
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v12 07/11] qspinlock: Revert to test-and-set on hypervisors

2014-10-16 Thread Waiman Long
From: Peter Zijlstra pet...@infradead.org

When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/qspinlock.h |   14 ++
 include/asm-generic/qspinlock.h  |7 +++
 kernel/locking/qspinlock.c   |3 +++
 3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index a6a8762..05a77fe 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 
 #ifndef CONFIG_X86_PPRO_FENCE
@@ -20,6 +21,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 
 #endif /* !CONFIG_X86_PPRO_FENCE */
 
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+   cpu_relax();
+
+   return true;
+}
+
 #include <asm-generic/qspinlock.h>
 
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e8a7ae8..a53a7bb 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -98,6 +98,13 @@ static __always_inline void queue_spin_unlock(struct 
qspinlock *lock)
 }
 #endif
 
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   return false;
+}
+#endif
+
 /*
  * Initializier
  */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fb0e988..1c1926a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -257,6 +257,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   if (virt_queue_spin_lock(lock))
+   return;
+
/*
 * wait for in-progress pending-locked hand-overs
 *
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v12 06/11] qspinlock: Use a simple write to grab the lock

2014-10-16 Thread Waiman Long
Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.

With that change, there is some slight improvement in the performance
of the queue spinlock in the 5M-loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasksBefore patchAfter patch %Change
  ----- --  ---
   3 2324/2321  2248/2265-3%/-2%
   4 2890/2896  2819/2831-2%/-2%
   5 3611/3595  3522/3512-2%/-2%
   6 4281/4276  4173/4160-3%/-3%
   7 5018/5001  4875/4861-3%/-3%
   8 5759/5750  5563/5568-3%/-3%

[Standalone/Embedded - different nodes]
  # of tasksBefore patchAfter patch %Change
  ----- --  ---
   312242/12237 12087/12093  -1%/-1%
   410688/10696 10507/10521  -2%/-2%

It was also found that this change produced a much bigger performance
improvement in the newer IvyBridge-EX chip and essentially closed
the performance gap between the ticket spinlock and the queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  ticketlock56782333.17   96.61   5.81
  qspinlock 57507993.13   94.83   5.97

AIM7 EXT4 Disk Test
  kernel JPMReal Time   Sys TimeUsr Time
  -  ----   
  ticketlock1114551   16.15  509.72   7.11
  qspinlock 21844668.24  232.99   6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.

The ebizzy -m test was also run with the following results:

  kernel   records/s  Real Time   Sys TimeUsr Time
  --  -   
  ticketlock 2075   10.00  216.35   3.49
  qspinlock  3023   10.00  198.20   4.80

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 kernel/locking/qspinlock.c |   59 
 1 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7c127b4..fb0e988 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -103,24 +103,33 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
union {
atomic_t val;
-   struct {
 #ifdef __LITTLE_ENDIAN
+   u8   locked;
+   struct {
u16 locked_pending;
u16 tail;
+   };
 #else
+   struct {
u16 tail;
u16 locked_pending;
-#endif
};
+   struct {
+   u8  reserved[3];
+   u8  locked;
+   };
+#endif
};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -207,6 +216,19 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 -> *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue 

[PATCH v12 05/11] qspinlock: Optimize for smaller NR_CPUS

2014-10-16 Thread Waiman Long
From: Peter Zijlstra pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16 bits.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   13 ++
 kernel/locking/qspinlock.c|   71 -
 2 files changed, 83 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 88d647c..01b46df 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS8
+#else
 #define _Q_PENDING_BITS1
+#endif
 #define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 48bd2ad..7c127b4 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -22,6 +22,7 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/mutex.h
+#include <asm/byteorder.h>
 #include <asm/qspinlock.h>
 
 /*
@@ -54,6 +55,10 @@
  * node; whereby avoiding the need to carry a node from lock to unlock, and
  * preserving existing lock API. This also makes the unlock code simpler and
  * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
+ *
  */
 
 #include mcs_spinlock.h
@@ -94,6 +99,64 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -141,6 +204,7 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
}
   

[PATCH v12 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock

2014-10-16 Thread Waiman Long
This patch makes the necessary changes at the x86 architecture-specific
layer to enable the use of queue spinlock for x86-64. As x86-32 machines
are typically not multi-socket, the benefit of queue spinlock may not be
apparent, so queue spinlock is not enabled there.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/qspinlock.h  |   25 +
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 
 4 files changed, 35 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fad4aa6..da42708 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -123,6 +123,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+   select ARCH_USE_QUEUE_SPINLOCK
select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 000..a6a8762
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,25 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#ifndef CONFIG_X86_PPRO_FENCE
+
+#define	queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * An effective smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   barrier();
+   ACCESS_ONCE(*(u8 *)lock) = 0;
+}
+
+#endif /* !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 9295016..5899483 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -180,6 +184,7 @@ static __always_inline void 
arch_spin_lock_flags(arch_spinlock_t *lock,
 {
arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index 5f9d757..5d654a1 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm-generic/qrwlock_types.h>
 
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v12 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-10-16 Thread Waiman Long
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending on one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index),
leaving one byte for the lock.
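
For illustration, the tail encoding described above amounts to roughly the
following (a sketch; see the patch body for the exact helpers):

	/*
	 * Encode (cpu, idx) into the upper bits of the 32-bit lock word.
	 * The cpu number is stored +1 so that a tail value of 0 means
	 * "queue empty".
	 */
	static inline u32 encode_tail(int cpu, int idx)
	{
		u32 tail;

		tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
		tail |= idx << _Q_TAIL_IDX_OFFSET;	/* which per-cpu node (0-3) */

		return tail;
	}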

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock.h   |  118 +++
 include/asm-generic/qspinlock_types.h |   58 +
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 +
 kernel/locking/mcs_spinlock.h |1 +
 kernel/locking/qspinlock.c|  207 +
 6 files changed, 392 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 000..e8a7ae8
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,118 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+   return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+   if (!atomic_read(&lock->val) &&
+  (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+   u32 val;
+
+   

[PATCH v12 04/11] qspinlock: Extract out code snippets for the next patch

2014-10-16 Thread Waiman Long
This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.

 1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
 2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.

This patch also simplifies the trylock operation before queuing by
calling queue_spin_trylock() directly.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock_types.h |2 +
 kernel/locking/qspinlock.c|   91 +---
 2 files changed, 62 insertions(+), 31 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 4196694..88d647c 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -58,6 +58,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 226b11d..48bd2ad 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -95,6 +95,54 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   u32 new, old;
+
+   for (;;) {
+   new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(&lock->val);
+
+   for (;;) {
+   new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -176,15 +224,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 *
 * *,1,0 -> *,0,1
 */
-   for (;;) {
-   new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   clear_pending_set_locked(lock, val);
return;
 
/*
@@ -201,37 +241,26 @@ queue:
 node->next = NULL;
 
/*
-* We have already touched the queueing cacheline; don't bother with
-* pending stuff.
-*
-* trylock || xchg(lock, node)
-*
-* 0,0,0 -> 0,0,1 ; no tail, not locked -> no tail, locked.
-* p,y,x -> n,y,x ; tail was p -> tail is n; preserving locked.
+* We touched a (possibly) cold cacheline in the per-cpu queue node;
+* attempt the trylock once more in the hope someone let go while we
+* weren't watching.
 */
-   for (;;) {
-   new = _Q_LOCKED_VAL;
-   if (val)
-   new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   if (queue_spin_trylock(lock))
+   goto release;
 
/*
-* we won the trylock; forget about queueing.
+* We have already touched the queueing cacheline; don't bother with
+* pending stuff.
+*
+* p,*,* -> n,*,*
 */
-   if (new == _Q_LOCKED_VAL)
-   goto release;
+   old = xchg_tail(lock, tail);
 
/*
 * if there was a previous node; link it and wait until reaching the
 * head of the waitqueue.
 */
-   if (old & ~_Q_LOCKED_PENDING_MASK) {
+   if (old & _Q_TAIL_MASK) {
prev = 

[PATCH v12 03/11] qspinlock: Add pending bit

2014-10-16 Thread Waiman Long
From: Peter Zijlstra pet...@infradead.org

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible to observe the pending bit without the locked bit when
the last owner has just released the lock but the pending owner has not
yet taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.
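
As a rough illustration of the idea (a hedged sketch, not the exact diff
below; it reuses _Q_LOCKED_MASK, _Q_PENDING_VAL and the
clear_pending_set_locked() helper shown elsewhere in this series), the
in-word spinner path looks roughly like this:

static void sketch_pending_spin(struct qspinlock *lock, u32 val)
{
	u32 old;

	/* only a lone owner present: try to become the in-word spinner */
	while (val == _Q_LOCKED_VAL) {
		old = atomic_cmpxchg(&lock->val, val, val | _Q_PENDING_VAL);
		if (old == val) {
			/* wait for the owner to drop the locked byte ... */
			while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
				cpu_relax();
			/* ... then take the lock, never touching the MCS node */
			clear_pending_set_locked(lock, val);
			return;
		}
		val = old;
	}
	/* tail or pending already set: the real code falls back to queueing */
}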

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   12 +++-
 kernel/locking/qspinlock.c|  119 +++--
 2 files changed, 107 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 67a2110..4196694 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define _Q_SET_MASK(type)  (((1U << _Q_ ## type ## _BITS) - 1)\
    << _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS1
+#define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c114076..226b11d 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -92,24 +92,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 return per_cpu_ptr(&mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock value)
- *
- *              fast      :    slow                                  :    unlock
- *                        :                                          :
- * uncontended  (0,0)   --:--> (0,1) --------------------------------:--> (*,0)
- *                        :       | ^--------.                    /  :
- *                        :       v           \                   |  :
- * uncontended            :    (n,x) --+--> (n,0)                 |  :
- *   queue                :       | ^--'                          |  :
- *                        :       v                               |  :
- * contended              :    (*,x) --+--> (*,0) -----> (*,1) ---'  :
- *   queue                :         ^--'                             :
+ * (queue tail, pending bit, lock value)
  *
+ *              fast     :    slow                                  :    unlock
+ *                       :                                          :
+ * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ *                       :       | ^--------.------.             /  :
+ *                       :       v           \      \            |  :
+ * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
+ *                       :       | ^--'              |           |  :
+ *                       :       v                   |           |  :
+ * uncontended           :    (n,x,y) +--> (n,0,0) --'            |  :
+ *   queue               :       | ^--'                           |  :
+ *                       :       v                                |  :
+ * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
+ *   queue               :         ^--'                             :
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
@@ -119,6 +123,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
 BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   /*
+* wait for in-progress pending-locked hand-overs
+*
+* 0,1,0 -> 0,0,1
+*/
+   if (val == _Q_PENDING_VAL) {
+   while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
+   

[PATCH v12 00/11] qspinlock: a 4-byte queue spinlock with PV support

2014-10-16 Thread Waiman Long
v11-v12:
 - Based on PeterZ's version of the qspinlock patch
   (https://lkml.org/lkml/2014/6/15/63).
 - Incorporated many of the review comments from Konrad Wilk and
   Paolo Bonzini.
 - The pvqspinlock code is largely from my previous version with
   PeterZ's way of going from queue tail to head and his idea of
   using callee saved calls to KVM and XEN codes.

v10-v11:
  - Use a simple test-and-set unfair lock to simplify the code,
but performance may suffer a bit for large guests with many CPUs.
  - Take out Raghavendra KT's test results as the unfair lock changes
may render some of his results invalid.
  - Add PV support without increasing the size of the core queue node
structure.
  - Other minor changes to address some of the feedback comments.

v9-v10:
  - Make some minor changes to qspinlock.c to accommodate review feedback.
  - Change author to PeterZ for 2 of the patches.
  - Include Raghavendra KT's test results in patch 18.

v8-v9:
  - Integrate PeterZ's version of the queue spinlock patch with some
modification:
http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
  - Break the more complex patches into smaller ones to ease review effort.
  - Fix a racing condition in the PV qspinlock code.

v7-v8:
  - Remove one unneeded atomic operation from the slowpath, thus
improving performance.
  - Simplify some of the codes and add more comments.
  - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
unfair lock.
  - Reduce unfair lock slowpath lock stealing frequency depending
on its distance from the queue head.
  - Add performance data for IvyBridge-EX CPU.

v6-v7:
  - Remove an atomic operation from the 2-task contending code
  - Shorten the names of some macros
  - Make the queue waiter attempt to steal the lock when the unfair lock
    is enabled.
  - Remove lock holder kick from the PV code and fix a race condition
  - Run the unfair lock & PV code on overcommitted KVM guests to collect
performance data.

v5-v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV
   support code.

v4-v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.

v3-v4:
 - Remove debugging code and fix a configuration error
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better
 - Add an x86 version of asm/qspinlock.h for holding x86 specific
   optimization.
 - Add an optimized x86 code path for 2 contending tasks to improve
   low contention performance.

v2-v3:
 - Simplify the code by using numerous mode only without an unfair option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1-v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of queue head CPU at unlock time for non-numerous
   CPU mode.

This patch set has 3 different sections:
 1) Patches 1-6: Introduces a queue-based spinlock implementation that
can replace the default ticket spinlock without increasing the
size of the spinlock data structure. As a result, critical kernel
data structures that embed spinlock won't increase in size and
break data alignments.
 2) Patch 7: Enables the use of unfair lock in a virtual guest. This
can resolve some of the locking related performance issues due to
the fact that the next CPU to get the lock may have been scheduled
out for a period of time.
 3) Patches 8-11: Enable qspinlock para-virtualization support by
halting the waiting CPUs after spinning for a certain amount of
time. The unlock code will detect a sleeping waiter and wake it
up. This is essentially the same logic as the PV ticketlock code.

The queue spinlock has slightly better performance than the ticket
spinlock in the uncontended case. Its performance can be much better
with moderate to heavy contention.  This patch has the potential of
improving the performance of all the workloads that have moderate to
heavy spinlock contention.

The queue spinlock is especially suitable for NUMA machines with
at least 2 sockets. Though even at 

Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors

2014-10-16 Thread Jason Wang
On 10/15/2014 01:40 PM, Rusty Russell wrote:
> Jason Wang jasow...@redhat.com writes:
>> Below should be useful for some experiments Jason is doing.
>> I thought I'd send it out for early review/feedback.
>>
>> event idx feature allows us to defer interrupts until
>> a specific # of descriptors were used.
>> Sometimes it might be useful to get an interrupt after
>> a specific descriptor, regardless.
>> This adds a descriptor flag for this, and an API
>> to create an urgent output descriptor.
>> This is still an RFC:
>> we'll need a feature bit for drivers to detect this,
>> but we've run out of feature bits for virtio 0.X.
>> For experimentation purposes, drivers can assume
>> this is set, or add a driver-specific feature bit.
>>
>> Signed-off-by: Michael S. Tsirkin m...@redhat.com
>> Signed-off-by: Jason Wang jasow...@redhat.com
> The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
> networking (which tends to take packets in order) couldn't we just set
> the event counter to give us a tx interrupt at the packet we want?
>
> Cheers,
> Rusty.

Yes, we could. The recent RFC that enables tx interrupts uses this.
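
For reference, a hedged sketch of the device-side decision the two
mechanisms imply. vring_need_event() is the standard event-idx test from
linux/virtio_ring.h; the VRING_DESC_F_URGENT value used here is an
assumption for illustration only, the real one is defined by the RFC patch:

#include <linux/types.h>
#include <linux/virtio_ring.h>	/* vring_need_event() */

/* Assumed value for illustration; the RFC patch defines the real flag. */
#ifndef VRING_DESC_F_URGENT
#define VRING_DESC_F_URGENT	(1 << 3)
#endif

/*
 * Would the host signal the guest after pushing a used entry?
 * - event idx: only when used_idx crosses the driver-written used_event,
 *   which is how a driver can ask for a tx interrupt "at the packet it wants".
 * - urgent flag: signal for this descriptor regardless of the event index.
 */
static bool sketch_should_signal(u16 used_event, u16 new_used_idx,
				 u16 old_used_idx, u16 desc_flags)
{
	if (desc_flags & VRING_DESC_F_URGENT)
		return true;
	return vring_need_event(used_event, new_used_idx, old_used_idx);
}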
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 07/11] powerpc: kvm: the stopper func to cease secondary hwthread

2014-10-16 Thread kernelfans
To enter the guest, the primary hwthread schedules the stopper func on the
secondary threads and forces them into NAP mode.
When exiting to the host, the secondary threads restore the stack by hand,
then switch back to the stopper func, i.e. the host.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c| 15 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 34 +
 2 files changed, 49 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ba258c8..4348abd 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1486,6 +1486,21 @@ static void kvmppc_remove_runnable(struct kvmppc_vcore 
*vc,
 list_del(&vcpu->arch.run_list);
 }
 
+#ifdef KVMPPC_ENABLE_SECONDARY
+
+extern void kvmppc_secondary_stopper_enter();
+
+static int kvmppc_secondary_stopper(void *data)
+{
+   int cpu =smp_processor_id();
+   struct paca_struct *lpaca = get_paca();
+   BUG_ON(!(cpu%thread_per_core));
+
+   kvmppc_secondary_stopper_enter();
+}
+
+#endif
+
 static int kvmppc_grab_hwthread(int cpu)
 {
struct paca_struct *tpaca;
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index d5594b0..254038b 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -349,7 +349,41 @@ kvm_do_nap:
 
 #ifdef PPCKVM_ENABLE_SECONDARY
 kvm_secondary_exit_trampoline:
+
+   /* all register is free to use, later kvmppc_secondary_stopper_exit set 
up them*/
+   //loop-wait for the primary to signal that host env is ready
+
+   LOAD_REG_ADDR(r5, kvmppc_secondary_stopper_exit)
+   /* fixme, load msr from lpaca stack */
+   li  r6, MSR_IR | MSR_DR
+   mtsrr0  r5
+   mtsrr1  r6
+   RFI
+
+_GLOBAL_TOC(kvmppc_secondary_stopper_enter)
+   mflr r0
+   std r0, PPC_LR_STKOFF(r1)
+   stdu r1, -112(r1)
+
+   /* fixme: store other register such as msr */
+
+   /* prevent us to enter kernel */
+   li  r0, 1
+   stb r0, HSTATE_HWTHREAD_REQ(r13)
+   /* tell the primary that we are ready */
+li  r0,KVM_HWTHREAD_IN_KERNEL
+stb r0,HSTATE_HWTHREAD_STATE(r13)
+   nap
b   .
+
+/* enter with vmode */
+kvmppc_secondary_stopper_exit:
+   /* fixme, restore the stack which we store on lpaca */
+
+   ld  r0, 112+PPC_LR_STKOFF(r1)
+   addi r1, r1, 112
+   mtlr r0
+   blr
 #endif
 
 /**
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 09/11] powerpc: kvm: handle time base on secondary hwthread

2014-10-16 Thread kernelfans
(This is a place holder patch.)
We need to store the time base for host on secondary hwthread.
Later when switching back, we need to reprogram it with elapse
time.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 89ea16c..a817ba6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -371,6 +371,8 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)
 
/* fixme: store other register such as msr */
 
+   /* fixme: store the tb, and set it as MAX, so we cease the tick on 
secondary */
+
/* prevent us to enter kernel */
li  r0, 1
stb r0, HSTATE_HWTHREAD_REQ(r13)
@@ -382,6 +384,10 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)
 
 /* enter with vmode */
 kvmppc_secondary_stopper_exit:
+   /* fixme: restore the tb, with the orig val plus time elapse
+ * so we can fire the hrtimer as soon as possible
+ */
+
/* fixme, restore the stack which we store on lpaca */
 
ld  r0, 112+PPC_LR_STKOFF(r1)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 05/11] sched: introduce stop_cpus_async() to schedule special tsk on cpu

2014-10-16 Thread kernelfans
The proto will be:
     cpu1                        cpuX
  stop_cpus_async()
                          bring cpuX to a special state
                          signal flag and trapped
  check for flag

This function helps powerpc reuse the cpu_stopper_task scheme to force
the secondary hwthreads into NAP state, in which state the cpus will not
run any longer until the master cpu tells them to go.
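
A minimal caller-side sketch of the new API (assuming threads_per_core
from asm/cputhreads.h on powerpc and a stopper function such as the
kvmppc_secondary_stopper() introduced later in this series):

#include <linux/stop_machine.h>
#include <linux/cpumask.h>
#include <asm/cputhreads.h>	/* threads_per_core (powerpc) */

/* Kick the secondary hwthreads of 'core' into 'fn' without waiting. */
static int sketch_cease_secondaries(int core, cpu_stop_fn_t fn)
{
	struct cpumask msk;
	int thr;

	cpumask_clear(&msk);
	for (thr = 1; thr < threads_per_core; thr++)
		cpumask_set_cpu(core * threads_per_core + thr, &msk);

	/* returns as soon as the work is queued; no wait for completion */
	return stop_cpus_async(&msk, fn, NULL);
}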

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 include/linux/stop_machine.h |  2 ++
 kernel/stop_machine.c| 25 -
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index d2abbdb..871c1bf 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -32,6 +32,8 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, 
cpu_stop_fn_t fn, void *
 void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
 struct cpu_stop_work *work_buf);
 int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
+int stop_cpus_async(const struct cpumask *cpumask, cpu_stop_fn_t fn,
+   void *arg);
 int try_stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
 
 #else  /* CONFIG_SMP */
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 695f0c6..d26fd6a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -354,13 +354,15 @@ static void queue_stop_cpus_work(const struct cpumask 
*cpumask,
 }
 
 static int __stop_cpus(const struct cpumask *cpumask,
-  cpu_stop_fn_t fn, void *arg)
+  cpu_stop_fn_t fn, void *arg, bool sync)
 {
struct cpu_stop_done done;
 
-   cpu_stop_init_done(&done, cpumask_weight(cpumask));
+   if (sync)
+   cpu_stop_init_done(&done, cpumask_weight(cpumask));
 queue_stop_cpus_work(cpumask, fn, arg, &done);
-   wait_for_completion(&done.completion);
+   if (sync)
+   wait_for_completion(&done.completion);
return done.executed ? done.ret : -ENOENT;
 }
 
@@ -398,7 +400,20 @@ int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t 
fn, void *arg)
 
/* static works are used, process one request at a time */
 mutex_lock(&stop_cpus_mutex);
-   ret = __stop_cpus(cpumask, fn, arg);
+   ret = __stop_cpus(cpumask, fn, arg, true);
+   mutex_unlock(&stop_cpus_mutex);
+   return ret;
+}
+
+/* similar to stop_cpus(), but not wait for the ack. */
+int stop_cpus_async(const struct cpumask *cpumask, cpu_stop_fn_t fn,
+   void *arg)
+{
+   int ret;
+
+   /* static works are used, process one request at a time */
+   mutex_lock(&stop_cpus_mutex);
+   ret = __stop_cpus(cpumask, fn, arg, false);
 mutex_unlock(&stop_cpus_mutex);
return ret;
 }
@@ -428,7 +443,7 @@ int try_stop_cpus(const struct cpumask *cpumask, 
cpu_stop_fn_t fn, void *arg)
/* static works are used, process one request at a time */
 if (!mutex_trylock(&stop_cpus_mutex))
 return -EAGAIN;
-   ret = __stop_cpus(cpumask, fn, arg);
+   ret = __stop_cpus(cpumask, fn, arg, true);
 mutex_unlock(&stop_cpus_mutex);
return ret;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 03/11] powerpc: kvm: add interface to control kvm function on a core

2014-10-16 Thread kernelfans
When kvm is enabled on a core, we migrate all external irqs to the primary
thread, since currently the kvm irq logic is handled by the primary
hwthread.

Todo: this patch lacks re-enabling of irq balancing when kvm is disabled on
the core.
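
A hypothetical userspace usage sketch; the attribute is created on
cpu_subsys.dev_root in the diff below, so the path is assumed to be
/sys/devices/system/cpu/kvm_enable, and the value is the core number in
hex as parsed by store_kvm_enable():

#include <stdio.h>

int main(void)
{
	/* enable KVM handling on core 2 (hex, per the %lx in store_kvm_enable) */
	FILE *f = fopen("/sys/devices/system/cpu/kvm_enable", "w");

	if (!f)
		return 1;
	fprintf(f, "%x\n", 2);
	return fclose(f) ? 1 : 0;
}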

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/sysfs.c| 39 ++
 arch/powerpc/sysdev/xics/xics-common.c | 12 +++
 2 files changed, 51 insertions(+)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 67fd2fd..a2595dd 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -552,6 +552,45 @@ static void sysfs_create_dscr_default(void)
if (cpu_has_feature(CPU_FTR_DSCR))
 err = device_create_file(cpu_subsys.dev_root, &dev_attr_dscr_default);
 }
+
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+#define NR_CORES   (CONFIG_NR_CPUS/threads_per_core)
+static DECLARE_BITMAP(kvm_on_core, NR_CORES) __read_mostly;
+
+static ssize_t show_kvm_enable(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+}
+
+static ssize_t __used store_kvm_enable(struct device *dev,
+   struct device_attribute *attr, const char *buf,
+   size_t count)
+{
+   struct cpumask stop_cpus;
+   unsigned long core, thr;
+
+   sscanf(buf, "%lx", &core);
+   if (core > NR_CORES)
+   return -1;
+   if (!test_bit(core, kvm_on_core))
+   for (thr = 1; thr < threads_per_core; thr++)
+   if (cpu_online(thr * threads_per_core + thr))
+   cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);
+
+   stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
+   set_bit(core, kvm_on_core);
+   return count;
+}
+
+static DEVICE_ATTR(kvm_enable, 0600,
+   show_kvm_enable, store_kvm_enable);
+
+static void sysfs_create_kvm_enable(void)
+{
+   device_create_file(cpu_subsys.dev_root, &dev_attr_kvm_enable);
+}
+#endif
+
 #endif /* CONFIG_PPC64 */
 
 #ifdef HAS_PPC_PMC_PA6T
diff --git a/arch/powerpc/sysdev/xics/xics-common.c 
b/arch/powerpc/sysdev/xics/xics-common.c
index fe0cca4..68b33d8 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -258,6 +258,18 @@ unlock:
 raw_spin_unlock_irqrestore(&desc->lock, flags);
}
 }
+
+int xics_migrate_irqs_away_secondary(void *data)
+{
+   int cpu = smp_processor_id();
+   if(cpu%thread_per_core != 0) {
+   WARN(condition, format...);
+   return 0;
+   }
+   /* In fact, if we can migrate the primary, it will be more fine */
+   xics_migrate_irqs_away();
+   return 0;
+}
 #endif /* CONFIG_HOTPLUG_CPU */
 
 #ifdef CONFIG_SMP
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 04/11] powerpc: kvm: introduce a kthread on primary thread to anti tickless

2014-10-16 Thread kernelfans
(This patch is a place holder.)

If only one vcpu thread is ready (the other vcpu threads can be waiting
for it to execute), the primary thread can enter tickless mode, which
keeps the primary running, so the secondaries have no opportunity to
exit to the host even if they have other tasks on them.

Introduce a kthread (anti_tickless) on the primary, so when there is only
one vcpu thread on the primary, the secondary can resort to anti_tickless
to keep the primary out of tickless mode.
(I think the anti_tickless thread can go to NAP, so we can let the
secondary run.)
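
A hedged sketch of what the commented-out hook in the diff below could
look like; prevent_tickless() and its busy-yield loop are assumptions for
illustration (the placeholder patch leaves the real behaviour open, e.g.
letting the thread NAP instead), and threads_per_core is the powerpc
per-core thread count:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <asm/cputhreads.h>	/* threads_per_core (powerpc) */

/* Stay runnable so the primary never has a single runnable task,
 * which is the condition for entering full tickless mode. */
static int prevent_tickless(void *unused)
{
	while (!kthread_should_stop())
		cond_resched();
	return 0;
}

static struct task_struct *sketch_start_anti_tickless(int core)
{
	/* bind the helper to the primary hwthread of 'core' */
	return kthread_create_on_cpu(prevent_tickless, NULL,
				     core * threads_per_core,
				     "ppckvm_prevent_tickless/%u");
}

The caller would still need to wake_up_process() the returned task before
it does anything.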

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/sysfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index a2595dd..f0b110e 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -575,9 +575,11 @@ static ssize_t __used store_kvm_enable(struct device *dev,
if (!test_bit(core, kvm_on_core))
 for (thr = 1; thr < threads_per_core; thr++)
 if (cpu_online(thr * threads_per_core + thr))
-   cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);
+   cpumask_set_cpu(core * threads_per_core + thr, &stop_cpus);
 
 stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
+   /* fixme, create a kthread on primary hwthread to handle tickless mode */
+   //kthread_create_on_cpu(prevent_tickless, NULL, core * threads_per_core, "ppckvm_prevent_tickless");
set_bit(core, kvm_on_core);
return count;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 11/11] powerpc: kvm: Kconfig add an option for enabling secondary hwthread

2014-10-16 Thread kernelfans
Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/Kconfig | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 602eb51..de38566 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -93,6 +93,10 @@ config KVM_BOOK3S_64_HV
 
  If unsure, say N.
 
+config KVMPPC_ENABLE_SECONDARY
+   tristate "KVM support for running on secondary hwthread in host"
+   depends on KVM_BOOK3S_64_HV
+
 config KVM_BOOK3S_64_PR
 tristate "KVM support without using hypervisor mode in host"
depends on KVM_BOOK3S_64
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 06/11] powerpc: kvm: introduce online in paca to indicate whether cpu is needed by host

2014-10-16 Thread kernelfans
Nowadays, powerKVM runs with the secondary hwthreads offline. Although
we can make all secondary hwthreads online later, we still preserve
this behavior for a dedicated KVM env. Achieve this by setting
paca->online to false.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/paca.h |  3 +++
 arch/powerpc/kernel/asm-offsets.c   |  3 +++
 arch/powerpc/kernel/smp.c   |  3 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 
 4 files changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index a5139ea..67c2500 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -84,6 +84,9 @@ struct paca_struct {
u8 cpu_start;   /* At startup, processor spins until */
/* this becomes non-zero. */
u8 kexec_state; /* set when kexec down has irqs off */
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+   u8 online;
+#endif
 #ifdef CONFIG_PPC_STD_MMU_64
struct slb_shadow *slb_shadow_ptr;
struct dtl_entry *dispatch_log;
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 9d7dede..0faa8fe 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -182,6 +182,9 @@ int main(void)
DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc));
DEFINE(PACAKBASE, offsetof(struct paca_struct, kernelbase));
DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+   DEFINE(PACAONLINE, offsetof(struct paca_struct, online));
+#endif
DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
DEFINE(PACAIRQHAPPENED, offsetof(struct paca_struct, irq_happened));
DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a0738af..4c3843e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -736,6 +736,9 @@ void start_secondary(void *unused)
 
cpu_startup_entry(CPUHP_ONLINE);
 
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+   get_paca()->online = true;
+#endif 
BUG();
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index f0c4db7..d5594b0 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -322,6 +322,13 @@ kvm_no_guest:
li  r0, KVM_HWTHREAD_IN_NAP
stb r0, HSTATE_HWTHREAD_STATE(r13)
 kvm_do_nap:
+#ifdef PPCKVM_ENABLE_SECONDARY
+   /* check the cpu is needed by host or not */
+   ld  r2, PACAONLINE(r13)
+   ld  r3, 0
+   cmp r2, r3
+   bne kvm_secondary_exit_trampoline
+#endif
/* Clear the runlatch bit before napping */
mfspr   r2, SPRN_CTRLF
clrrdi  r2, r2, 1
@@ -340,6 +347,11 @@ kvm_do_nap:
nap
b   .
 
+#ifdef PPCKVM_ENABLE_SECONDARY
+kvm_secondary_exit_trampoline:
+   b   .
+#endif
+
 /**
  **
  *   Entry code   *
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 08/11] powerpc: kvm: add a flag in vcore to sync primary with secondry hwthread

2014-10-16 Thread kernelfans
The secondary thread can only jump back to host until primary has set
up the env. Add host_ready field in kvm_vcore to sync this action.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_host.h |  3 +++
 arch/powerpc/kernel/asm-offsets.c   |  3 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 9a3355e..1310e03 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -305,6 +305,9 @@ struct kvmppc_vcore {
u32 arch_compat;
ulong pcr;
ulong dpdes;/* doorbell state (POWER8) */
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+   u8 host_ready;
+#endif
void *mpp_buffer; /* Micro Partition Prefetch buffer */
bool mpp_buffer_is_valid;
 };
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 0faa8fe..9c04ac2 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -562,6 +562,9 @@ int main(void)
DEFINE(VCORE_LPCR, offsetof(struct kvmppc_vcore, lpcr));
DEFINE(VCORE_PCR, offsetof(struct kvmppc_vcore, pcr));
DEFINE(VCORE_DPDES, offsetof(struct kvmppc_vcore, dpdes));
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+   DEFINE(VCORE_HOST_READY, offsetof(struct kvmppc_vcore, host_ready));
+#endif
DEFINE(VCPU_SLB_E, offsetof(struct kvmppc_slb, orige));
DEFINE(VCPU_SLB_V, offsetof(struct kvmppc_slb, origv));
DEFINE(VCPU_SLB_SIZE, sizeof(struct kvmppc_slb));
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 254038b..89ea16c 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -351,7 +351,11 @@ kvm_do_nap:
 kvm_secondary_exit_trampoline:
 
/* all register is free to use, later kvmppc_secondary_stopper_exit set 
up them*/
-   //loop-wait for the primary to signal that host env is ready
+   /* wait until the primary to set up host env */
+   ld  r5, HSTATE_KVM_VCORE(r13)
+   ld  r0, VCORE_HOST_READY(r5)
+   cmp r0,  //primary is ready?
+   bne kvm_secondary_exit_trampoline
 
LOAD_REG_ADDR(r5, kvmppc_secondary_stopper_exit)
/* fixme, load msr from lpaca stack */
@@ -1821,6 +1825,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
li  r0, KVM_GUEST_MODE_NONE
stb r0, HSTATE_IN_GUEST(r13)
 
+#ifdef PPCKVM_ENABLE_SECONDARY
+   /* signal the secondary that host env is ready */
+   li  r0, 1
+   stb r0, VCORE_HOST_READY(r5)
+#endif
ld  r0, 112+PPC_LR_STKOFF(r1)
 addi r1, r1, 112
 mtlr r0
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 02/11] powerpc: kvm: ensure vcpu-thread run only on primary hwthread

2014-10-16 Thread kernelfans
When vcpu thread runs at the first time, it will ensure to stick
to the primary thread.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/include/asm/kvm_host.h |  3 +++
 arch/powerpc/kvm/book3s_hv.c| 17 +
 2 files changed, 20 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 98d9dd5..9a3355e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -666,6 +666,9 @@ struct kvm_vcpu_arch {
spinlock_t tbacct_lock;
u64 busy_stolen;
u64 busy_preempt;
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+   bool cpu_selected;
+#endif
 #endif
 };
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 27cced9..ba258c8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1909,6 +1909,23 @@ static int kvmppc_vcpu_run_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu)
 {
int r;
int srcu_idx;
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+   int cpu = smp_processor_id();
+   int target_cpu;
+   unsigned int cpu;
+   struct task_struct *p = current;
+
+   if (unlikely(!vcpu->arch.cpu_selected)) {
+   vcpu->arch.cpu_selected = true;
+   for (cpu = 0; cpu < NR_CPUS; cpu += threads_per_core) {
+   cpumask_set_cpu(cpu, p->sys_allowed);
+   }
+   if (cpu%threads_per_core != 0) {
+   target_cpu = cpu/threads_per_core*threads_per_core;
+   migrate_task_to(current, target_cpu);
+   }
+   }
+#endif
 
 if (!vcpu->arch.sane) {
 run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 10/11] powerpc: kvm: on_primary_thread() force the secondary threads into NAP mode

2014-10-16 Thread kernelfans
The primary hwthread ceases the scheduler of secondary hwthread by
bringing them into NAP. Then, the secondary is ready for guest.

Signed-off-by: Liu Ping Fan pingf...@linux.vnet.ibm.com
---
 arch/powerpc/kvm/book3s_hv.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4348abd..7896c31 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1593,15 +1593,22 @@ static int on_primary_thread(void)
 {
int cpu = smp_processor_id();
int thr;
+   struct cpumask msk;
 
/* Are we on a primary subcore? */
if (cpu_thread_in_subcore(cpu))
return 0;
 
thr = 0;
+#ifdef KVMPPC_ENABLE_SECONDARY
+   while (++thr < threads_per_subcore)
+   cpumask_set_cpu(thr, &msk);
+   stop_cpus_async(&msk, kvmppc_secondary_stopper, NULL);
+#else
 while (++thr < threads_per_subcore)
if (cpu_online(cpu + thr))
return 0;
+#endif
 
/* Grab all hw threads so they can't go into the kernel */
 for (thr = 1; thr < threads_per_subcore; ++thr) {
-- 
1.8.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html