RE: [Qemu-devel] The status about vhost-net on kvm-arm?
Hello,

Using this Qemu patchset as well as the recent irqfd work, I've tried to get vhost-net working on Cortex-A15. Unfortunately, even though I can correctly generate irqs to the guest through irqfd, it seems to me that some pieces are still missing.

Indeed, the virtio-mmio interrupt status register (at offset 0x60) is not updated by the vhost thread, and reading it or writing to the peer interrupt ack register (offset 0x64) from the guest causes a VM exit.

After reading older posts, I understand that vhost-net with irqfd support could only work with MSI-X support:

On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:
"When MSI is off, each interrupt needs to be bounced through the io thread when it's set/cleared, so vhost-net causes more context switches and higher CPU utilization than userspace virtio which handles networking in the same thread."

Indeed, with MSI-X support, the Virtio spec indicates that the ISR Status field is unused.

I understand that vhost does not emulate a complete virtio PCI adapter but only manages virtqueue operations. However, I don't have a clear view of what is performed by Qemu and what is performed by the vhost thread.

Could someone enlighten me on this point, and maybe give some clues for an implementation of vhost with irqfd and without MSI support?

Thanks a lot in advance.
Best regards.
Rémy

From: kvmarm-boun...@lists.cs.columbia.edu [mailto:kvmarm-boun...@lists.cs.columbia.edu] On behalf of Yingshiuan Pan
Sent: Friday, 15 August 2014 09:25
To: Li Liu
Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org; qemu-devel
Subject: Re: [Qemu-devel] The status about vhost-net on kvm-arm?

Hi, Li,

It's ok, I did get those mails from the mailing list. I guess it was because I did not subscribe to some of the mailing lists.

Currently, I do not have any plan to renew my patchset: since I have resigned from my previous company, I do not have a Cortex-A15 platform to test/verify.

I'm fine with that, it would be great if you or someone can take it and improve it.
Thanks.

Best Regards,
Yingshiuan Pan

2014-08-15 11:04 GMT+08:00 Li Liu john.li...@huawei.com:

Hi Ying-Shiuan Pan,

I don't know why your mail was missing from my mailbox. Sorry about that. The results of vhost-net performance have been attached in another mail.

Do you have a plan to renew your patchset to support irqfd? If not, we will try to finish it based on yours.

On 2014/8/14 11:50, Li Liu wrote:
On 2014/8/13 19:25, Nikolay Nikolaev wrote:
On Wed, Aug 13, 2014 at 12:10 PM, Nikolay Nikolaev n.nikol...@virtualopensystems.com wrote:
On Tue, Aug 12, 2014 at 6:47 PM, Nikolay Nikolaev n.nikol...@virtualopensystems.com wrote:

Hello,

On Tue, Aug 12, 2014 at 5:41 AM, Li Liu john.li...@huawei.com wrote:

Hi all,

Can anyone tell me the current status of vhost-net on kvm-arm?

Half a year has passed since Isa Ansharullah asked this question:
http://www.spinics.net/lists/kvm-arm/msg08152.html

I have found two patches which provide the kvm-arm support of eventfd and irqfd:

1) [RFC PATCH 0/4] ARM: KVM: Enable the ioeventfd capability of KVM on ARM
   http://lists.gnu.org/archive/html/qemu-devel/2014-01/msg01770.html

2) [RFC,v3] ARM: KVM: add irqfd and irq routing support
   https://patches.linaro.org/32261/

And there's a rough patch for qemu to support eventfd from Ying-Shiuan Pan:

[Qemu-devel] [PATCH 0/4] ioeventfd support for virtio-mmio
https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg00715.html

But there are no comments on this patch, and I can find nothing about qemu supporting irqfd. Have I lost track of something?

If nobody is trying to fix it, we plan to complete virtio-mmio support for irqfd and multiqueue.

We at Virtual Open Systems did some work and tested vhost-net on ARM back in March.
The setup was based on:
- host kernel with our ioeventfd patches: http://www.spinics.net/lists/kvm-arm/msg08413.html
- qemu with the aforementioned patches from Ying-Shiuan Pan https://lists.gnu.org/archive/html/qemu-devel/2014-02/msg00715.html

The testbed was an ARM Chromebook with Exynos 5250, using a 1Gbps USB3 Ethernet adapter connected to a 1Gbps switch. I can't find the actual numbers but I remember that with multiple streams the gain was clearly seen. Note that it used the minimum required ioeventfd implementation and not irqfd.

I guess it is feasible to think that it all can be put together and rebased + the recent irqfd work. One can achieve even better performance (because of the irqfd).

Managed to replicate the setup with the old versions we used in March:

Single stream from another machine to the chromebook with 1Gbps USB3 Ethernet adapter.
iperf -c address -P 1 -i 1 -p 5001 -f k -t 10
to HOST: 858316 Kbits/sec
to GUEST: 761563 Kbits/sec
to GUEST vhost=off: 508150 Kbits/sec

10 parallel streams
iperf -c address -P 10 -i 1 -p 5001 -f k -t 10
to HOST: 842420 Kbits/sec
to GUEST: 625144 Kbits/sec
to GUEST vhost=off: 425276 Kbits/sec

I
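
The register accesses Rémy describes at the top of this thread can be modeled in plain C. This is only a sketch, assuming a fake in-memory device (`struct fake_virtio_mmio` and its helpers are hypothetical names, not QEMU or kernel code); the two offsets are the virtio-mmio InterruptStatus and InterruptACK registers. It shows why a legacy (non-MSI) interrupt costs the guest two trapped accesses per interrupt unless something keeps the status register up to date.

```c
#include <stdint.h>

#define VIRTIO_MMIO_INTERRUPT_STATUS 0x060
#define VIRTIO_MMIO_INTERRUPT_ACK    0x064

/* Hypothetical stand-in for the emulated device: every access to it
 * represents an MMIO access that would trap out of the guest. */
struct fake_virtio_mmio {
    uint32_t isr;               /* pending interrupt bits */
    unsigned trapped_accesses;  /* how many times the guest trapped */
};

static uint32_t mmio_read(struct fake_virtio_mmio *d, uint32_t off)
{
    d->trapped_accesses++;
    return off == VIRTIO_MMIO_INTERRUPT_STATUS ? d->isr : 0;
}

static void mmio_write(struct fake_virtio_mmio *d, uint32_t off, uint32_t val)
{
    d->trapped_accesses++;
    if (off == VIRTIO_MMIO_INTERRUPT_ACK)
        d->isr &= ~val;         /* acknowledging clears the bits written */
}

/* Guest-side handler: read the status, then acknowledge what was seen. */
static uint32_t guest_irq_handler(struct fake_virtio_mmio *d)
{
    uint32_t status = mmio_read(d, VIRTIO_MMIO_INTERRUPT_STATUS);

    mmio_write(d, VIRTIO_MMIO_INTERRUPT_ACK, status);
    return status;
}
```

In this model the handler only sees a non-zero status if something set `isr` before the interrupt was injected, which is the piece reported missing when the interrupt comes from the vhost thread through irqfd.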
[Bug 86161] On KVM, Windows 7 32bit guests sometimes run into blue screen(0x0000005c) during reboot
https://bugzilla.kernel.org/show_bug.cgi?id=86161

GC Ngu ng...@qq.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|PROBLEM: On KVM, Windows 7  |On KVM, Windows 7 32bit
                   |32bit guests sometimes run  |guests sometimes run into
                   |into blue                   |blue screen(0x0000005c)
                   |screen(0x0000005c) during   |during reboot
                   |reboot                      |

--
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
A question about HTL VM-Exit handling time
Hi folks,

I ran a kernel build in the guest and used perf kvm to get the following VM-Exit results:

Analyze events for all VCPUs:

             VM-EXIT    Samples  Samples%     Time%   Min Time   Max Time          A

           MSR_WRITE    3613908    57.53%    18.97%        5us     1362us       9.73
                 HLT    1399747    22.28%    74.90%        5us   432448us      99.24
           CR_ACCESS     961203    15.30%     3.28%        4us      188us       6.33
  EXTERNAL_INTERRUPT     213821     3.40%     2.25%        4us     4089us      19.54
       EXCEPTION_NMI      25152     0.40%     0.12%        4us       71us       9.05
       EPT_MISCONFIG      20104     0.32%     0.15%        8us     5628us      13.74
               CPUID      19904     0.32%     0.07%        4us      220us       6.90
      IO_INSTRUCTION      17097     0.27%     0.20%       13us     1008us      22.08
   PAUSE_INSTRUCTION      10737     0.17%     0.05%        4us       53us       8.33
            MSR_READ         48     0.00%     0.00%        4us        8us       5.62

Total Samples:6281721, Total events handled time:185457820.41us.

I also did some other experiments with different workloads in the guest, and I got the same results in terms of HLT VM-Exit handling time. Does anyone know why the handling time for HLT VM-Exit is so high? Appreciate your help!

Thanks,
Feng
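
As a cross-check of the numbers above, the last column (its header is cut off to "A" in this archive) is consistent with an average time per exit in microseconds: the total handled time multiplied by a row's Time% share and divided by its sample count reproduces the value. A minimal sketch, where `avg_exit_time_us` is a made-up helper name and not part of perf:

```c
/* Average handling time per exit (us), derived from the Time% column. */
static double avg_exit_time_us(double total_time_us, double time_percent,
                               unsigned long samples)
{
    return total_time_us * (time_percent / 100.0) / (double)samples;
}
```

For the HLT row, `avg_exit_time_us(185457820.41, 74.90, 1399747)` comes out at roughly 99.2, which matches the 99.24 printed in the table within rounding of the percentage.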
A question about HLT VM-Exit handling time
Correct the typo in the subject.

-----Original Message-----
From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Wu, Feng
Sent: Thursday, October 16, 2014 4:16 PM
To: kvm@vger.kernel.org
Cc: Xiao Guangrong
Subject: A question about HTL VM-Exit handling time

Hi folks,

I ran a kernel build in the guest and used perf kvm to get the following VM-Exit results:

Analyze events for all VCPUs:

             VM-EXIT    Samples  Samples%     Time%   Min Time   Max Time          A

           MSR_WRITE    3613908    57.53%    18.97%        5us     1362us       9.73
                 HLT    1399747    22.28%    74.90%        5us   432448us      99.24
           CR_ACCESS     961203    15.30%     3.28%        4us      188us       6.33
  EXTERNAL_INTERRUPT     213821     3.40%     2.25%        4us     4089us      19.54
       EXCEPTION_NMI      25152     0.40%     0.12%        4us       71us       9.05
       EPT_MISCONFIG      20104     0.32%     0.15%        8us     5628us      13.74
               CPUID      19904     0.32%     0.07%        4us      220us       6.90
      IO_INSTRUCTION      17097     0.27%     0.20%       13us     1008us      22.08
   PAUSE_INSTRUCTION      10737     0.17%     0.05%        4us       53us       8.33
            MSR_READ         48     0.00%     0.00%        4us        8us       5.62

Total Samples:6281721, Total events handled time:185457820.41us.

I also did some other experiments with different workloads in the guest, and I got the same results in terms of HLT VM-Exit handling time. Does anyone know why the handling time for HLT VM-Exit is so high? Appreciate your help!

Thanks,
Feng
Re: [RFC PATCH] arm/arm64: KVM: Fix BE accesses to GICv2 EISR and ELRSR regs
Hi Victor,

On Thu, Oct 16, 2014 at 1:54 AM, Victor Kamensky victor.kamen...@linaro.org wrote:
On 14 October 2014 08:21, Victor Kamensky victor.kamen...@linaro.org wrote:
On 14 October 2014 02:47, Marc Zyngier marc.zyng...@arm.com wrote:
On Sun, Sep 28 2014 at 03:04:26 PM, Christoffer Dall christoffer.d...@linaro.org wrote:

The EISR and ELRSR registers are 32-bit registers on GICv2, and we store these as an array of two such registers on the vgic vcpu struct. However, we access them as a single 64-bit value or as a bitmap pointer in the generic vgic code, which breaks BE support.

Instead, store them as u64 values on the vgic structure and do the word-swapping in the assembly code, which already handles the byte order for BE systems.

Signed-off-by: Christoffer Dall christoffer.d...@linaro.org

(still going through my email backlog, hence the delay).

This looks like a valuable fix. Haven't had a chance to try it (no BE setup at hand) but maybe Victor can help reproducing this?

I'll give it a spin.

Tested-by: Victor Kamensky victor.kamen...@linaro.org

Tested on v3.17 + this fix on TC2 (V7) and Mustang (V8) with a BE kvm host, tried different combinations of BE/LE V7/V8 guests. All looks good.

Only with the latest qemu in BE V8 mode on v3.17 without this fix was I able to reproduce the issue that Will spotted. With kvmtool and older qemu, the V8 BE code never hit the vgic_v2_set_lr function, so that is why we did not run into it before. I guess the pl011 fix in qemu mentioned by 1f2bb4acc125 uncovered the vgic_v2_set_lr code path and this BE issue. With this patch it works fine now.

Thanks for the detailed testing and explanation. I'll apply this one to next.

-Christoffer
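
The byte-order problem discussed in this thread can be reproduced in user space. The sketch below is not the vgic code; all function names are made up for illustration. It contrasts building a 64-bit value explicitly from two 32-bit words with loading the two-word array as one 64-bit value, which gives a result that depends on host endianness and needs the two halves swapped on a big-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Endian-independent: word 0 is the low half, word 1 the high half. */
static uint64_t combine_words(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* What a plain 64-bit load of a u32[2] array yields on this host. */
static uint64_t raw_load(const uint32_t regs[2])
{
    uint64_t v;

    memcpy(&v, regs, sizeof(v));
    return v;
}

/* Swap the two 32-bit halves of a 64-bit value. */
static uint64_t swap_words(uint64_t v)
{
    return (v << 32) | (v >> 32);
}

/* Runtime probe of the host byte order. */
static int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 0;
}
```

On a little-endian host `raw_load()` already equals `combine_words()`; on a big-endian host the two words land in the opposite halves, so a word swap is needed before the value can be treated as a 64-bit bitmap.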
[RFC PATCH v2 1/4] vfio: platform: add device tree info API and skeleton
This patch introduced the API to return device tree info about a PLATFORM device (if described by a device tree) and the skeleton of the implementation for VFIO_PLATFORM. Information about any device node bound by VFIO_PLATFORM should be queried via the introduced ioctl VFIO_DEVICE_GET_DEVTREE_INFO.

The proposed API allows to get a list of strings with available property names, and then allows to query each property. Note that the properties are not indexed numerically, so they are always accessed by property name. The user needs to know the data type of the property he is accessing.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/Makefile                |  3 +-
 drivers/vfio/platform/devtree.c               | 70 +++
 drivers/vfio/platform/vfio_platform_common.c  | 39 +++
 drivers/vfio/platform/vfio_platform_private.h |  6 +++
 include/uapi/linux/vfio.h                     | 26 ++
 5 files changed, 143 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/platform/devtree.c

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
index 81de144..99f3ba1 100644
--- a/drivers/vfio/platform/Makefile
+++ b/drivers/vfio/platform/Makefile
@@ -1,5 +1,6 @@
-vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o
+vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o \
+			devtree.o
 
 obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
diff --git a/drivers/vfio/platform/devtree.c b/drivers/vfio/platform/devtree.c
new file mode 100644
index 0000000..c057be3
--- /dev/null
+++ b/drivers/vfio/platform/devtree.c
@@ -0,0 +1,70 @@
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <linux/of.h>
+#include <linux/platform_device.h>
+#include "vfio_platform_private.h"
+
+static int devtree_get_prop_list(struct device_node *np, unsigned *lenp,
+				 void __user *datap, unsigned long datasz)
+{
+	return -EINVAL;
+}
+
+static int devtree_get_strings(struct device_node *np,
+			       char *name, unsigned *lenp,
+			       void __user *datap, unsigned long datasz)
+{
+	return -EINVAL;
+}
+
+static int devtree_get_uint(struct device_node *np, char *name,
+			    uint32_t type, unsigned *lenp,
+			    void __user *datap, unsigned long datasz)
+{
+	return -EINVAL;
+}
+
+int vfio_platform_devtree_info(struct device_node *np,
+			       uint32_t type, unsigned *lenp,
+			       void __user *datap, unsigned long datasz)
+{
+	char *name;
+	long namesz;
+	int ret;
+
+	if (type == VFIO_DEVTREE_PROP_LIST) {
+		return devtree_get_prop_list(np, lenp, datap, datasz);
+	}
+
+	namesz = strnlen_user(datap, datasz);
+	if (!namesz)
+		return -EFAULT;
+	if (namesz > datasz)
+		return -EINVAL;
+
+	name = kzalloc(namesz, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+	if (strncpy_from_user(name, datap, namesz) <= 0) {
+		kfree(name);
+		return -EFAULT;
+	}
+
+	switch (type) {
+	case VFIO_DEVTREE_TYPE_STRINGS:
+		ret = devtree_get_strings(np, name, lenp, datap, datasz);
+		break;
+
+	case VFIO_DEVTREE_TYPE_U32:
+	case VFIO_DEVTREE_TYPE_U16:
+	case VFIO_DEVTREE_TYPE_U8:
+		ret = devtree_get_uint(np, name, type, lenp, datap, datasz);
+		break;
+
+	default:
+		ret = -EINVAL;
+	}
+
+	kfree(name);
+	return ret;
+}
diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c
index 2a6c665..bfbee2f 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -24,6 +24,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/io.h>
+#include <linux/of.h>
 
 #include "vfio_platform_private.h"
 
@@ -244,6 +245,34 @@ static long vfio_platform_ioctl(void *device_data,
 
 		return ret;
 
+	} else if (cmd == VFIO_DEVICE_GET_DEVTREE_INFO) {
+		struct vfio_devtree_info info;
+		void __user *datap;
+		unsigned long datasz;
+		int ret;
+
+		if (!vdev->of_node)
+			return -EINVAL;
+
+		minsz = offsetofend(struct vfio_devtree_info, length);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		datap = (void __user *) arg + minsz;
+		datasz = info.argsz - minsz;
+
+		ret = vfio_platform_devtree_info(vdev->of_node, info.type,
+
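
The ioctl handler above splits the user buffer into a fixed header and a variable payload using `argsz` and `minsz`. A user-space sketch of that arithmetic follows. The struct layout is an assumption: the real `struct vfio_devtree_info` lives in the uapi part of the patch, which is not shown in this archive, so `struct devtree_info_sketch`, `OFFSETOFEND` and `split_payload` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Same idea as the kernel's offsetofend(): first byte past a member. */
#define OFFSETOFEND(type, member) \
    (offsetof(type, member) + sizeof(((type *)0)->member))

/* Hypothetical layout, for illustration only. */
struct devtree_info_sketch {
    uint32_t argsz;   /* total size of the buffer passed by the caller */
    uint32_t flags;
    uint32_t type;    /* which kind of query */
    uint32_t length;  /* filled in with the size of the returned data */
    uint8_t  data[];  /* property name in, property value out */
};

/* Locate the payload that follows the fixed header.
 * Returns 0 on success, -1 if argsz cannot even hold the header. */
static int split_payload(void *arg, uint32_t argsz,
                         void **datap, size_t *datasz)
{
    const size_t minsz = OFFSETOFEND(struct devtree_info_sketch, length);

    if (argsz < minsz)
        return -1;

    *datap = (uint8_t *)arg + minsz;
    *datasz = argsz - minsz;
    return 0;
}
```

Checking `argsz` against `minsz` before subtracting is what keeps `datasz` from wrapping around to a huge unsigned value.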
[RFC PATCH v2 4/4] vfio: platform: devtree: return arrays of u32, u16, or u8 data
Certain properties of a device tree node are accessible as an array of unsigned integers, either u32, u16, or u8. Let the VFIO user query this type of device node properties.

Accessing u64 arrays is not yet implemented in this RFC.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/devtree.c | 55 -
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/platform/devtree.c b/drivers/vfio/platform/devtree.c
index 6d25f97..17f55d4 100644
--- a/drivers/vfio/platform/devtree.c
+++ b/drivers/vfio/platform/devtree.c
@@ -97,7 +97,60 @@ static int devtree_get_uint(struct device_node *np, char *name,
 			    uint32_t type, unsigned *lenp,
 			    void __user *datap, unsigned long datasz)
 {
-	return -EINVAL;
+	int ret, n;
+	size_t sz;
+	u8 *out;
+	int (*func)(const struct device_node *, const char *, void *, size_t)
+		= NULL;
+
+	switch (type) {
+	case VFIO_DEVTREE_TYPE_U32:
+		sz = sizeof(u32);
+		func = (int (*)(const struct device_node *,
+				const char *, void *, size_t))
+			of_property_read_u32_array;
+		break;
+	case VFIO_DEVTREE_TYPE_U16:
+		sz = sizeof(u16);
+		func = (int (*)(const struct device_node *,
+				const char *, void *, size_t))
+			of_property_read_u16_array;
+		break;
+	case VFIO_DEVTREE_TYPE_U8:
+		sz = sizeof(u8);
+		func = (int (*)(const struct device_node *,
+				const char *, void *, size_t))
+			of_property_read_u8_array;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	n = of_property_count_elems_of_size(np, name, sz);
+	if (n < 0)
+		return n;
+
+	if (lenp)
+		*lenp = n * sz;
+
+	if (n * sz > datasz)
+		return -EAGAIN;
+
+	out = kcalloc(n, sz, GFP_KERNEL);
+	if (!out)
+		return -EFAULT;
+
+	ret = func(np, name, out, n);
+	if (ret)
+		goto out;
+
+	if (copy_to_user(datap, out, n * sz))
+		ret = -EFAULT;
+
+out:
+	kfree(out);
+	return ret;
 }
 
 int vfio_platform_devtree_info(struct device_node *np,
--
2.1.1
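
For readers unfamiliar with what "an array of u32" means for a device tree property: property values are stored as big-endian cells, and helpers such as `of_property_read_u32_array` hand back host-order integers. The sketch below imitates that conversion in user space on a raw byte buffer; `be32_at` and `decode_u32_cells` are hypothetical names, not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one big-endian 32-bit cell, independent of host byte order. */
static uint32_t be32_at(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode a raw property value into host-order u32 values.
 * Returns the number of cells decoded, or -1 if the length is not a
 * multiple of 4 or the output array is too small. */
static int decode_u32_cells(const uint8_t *raw, size_t len,
                            uint32_t *out, size_t max_cells)
{
    size_t n = len / 4;

    if (len % 4 != 0 || n > max_cells)
        return -1;

    for (size_t i = 0; i < n; i++)
        out[i] = be32_at(raw + 4 * i);

    return (int)n;
}
```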
[RFC PATCH v2 2/4] vfio: platform: devtree: return available property names
The available properties of a device are not indexed numerically, instead they are accessible by property name. Passing type = VFIO_DEVTREE_PROP_LIST to VFIO_DEVICE_GET_DEVTREE_INFO, returns a list of strings with the available properties that the VFIO user can access.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/devtree.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/platform/devtree.c b/drivers/vfio/platform/devtree.c
index c057be3..032ee16 100644
--- a/drivers/vfio/platform/devtree.c
+++ b/drivers/vfio/platform/devtree.c
@@ -7,7 +7,42 @@
 static int devtree_get_prop_list(struct device_node *np, unsigned *lenp,
 				 void __user *datap, unsigned long datasz)
 {
-	return -EINVAL;
+	struct property *prop;
+	int len = 0, sz;
+	int ret = 0;
+
+	for_each_property_of_node(np, prop) {
+		sz = strlen(prop->name) + 1;
+
+		if (datasz < sz) {
+			ret = -EAGAIN;
+			break;
+		}
+
+		if (copy_to_user(datap, prop->name, sz))
+			return -EFAULT;
+
+		datap += sz;
+		datasz -= sz;
+		len += sz;
+	}
+
+	/* if overflow occurs, calculate remaining length */
+	while (prop) {
+		len += strlen(prop->name) + 1;
+		prop = prop->next;
+	}
+
+	/* we expose the full_name in addition to the usual properties */
+	len += sz = strlen("full_name") + 1;
+	if (datasz < sz) {
+		ret = -EAGAIN;
+	} else if (copy_to_user(datap, "full_name", sz))
+		return -EFAULT;
+
+	*lenp = len;
+
+	return ret;
 }
 
 static int devtree_get_strings(struct device_node *np,
--
2.1.1
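
The pattern this patch relies on (copy what fits, always report the full length needed, and signal a short buffer with -EAGAIN so the caller can retry with a larger one) can be sketched outside the kernel. This is an illustration only: `pack_names` is a hypothetical helper working on plain memory rather than `copy_to_user`, and it is not a transcription of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Pack NUL-terminated names back to back into buf.
 * *lenp always receives the total size needed for all names.
 * Returns 0 if everything fit, -EAGAIN if buf was too small. */
static int pack_names(const char *const *names, size_t count,
                      char *buf, size_t bufsz, size_t *lenp)
{
    size_t needed = 0, used = 0;
    int ret = 0;

    for (size_t i = 0; i < count; i++) {
        size_t sz = strlen(names[i]) + 1;   /* include the NUL */

        needed += sz;
        if (ret == 0 && sz <= bufsz - used) {
            memcpy(buf + used, names[i], sz);
            used += sz;
        } else {
            ret = -EAGAIN;                  /* stop copying, keep counting */
        }
    }

    *lenp = needed;
    return ret;
}
```

A caller would typically call once with a small buffer, read the required length back, allocate that much, and call again.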
[RFC PATCH v2 3/4] vfio: platform: devtree: access property as a list of strings
Certain device tree properties (e.g. the device node name, the compatible string), are available as a list of strings (separated by the null terminating character). Let the VFIO user query this type of properties.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/devtree.c | 43 -
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/platform/devtree.c b/drivers/vfio/platform/devtree.c
index 032ee16..6d25f97 100644
--- a/drivers/vfio/platform/devtree.c
+++ b/drivers/vfio/platform/devtree.c
@@ -45,11 +45,52 @@ static int devtree_get_prop_list(struct device_node *np, unsigned *lenp,
 	return ret;
 }
 
+static int devtree_get_full_name(struct device_node *np, unsigned *lenp,
+				 void __user *datap, unsigned long datasz)
+{
+	int len = strlen(np->full_name) + 1;
+
+	if (lenp)
+		*lenp = len;
+
+	if (len > datasz)
+		return -EAGAIN;
+
+	if (copy_to_user(datap, np->full_name, len))
+		return -EFAULT;
+
+	return 0;
+}
+
 static int devtree_get_strings(struct device_node *np,
 			       char *name, unsigned *lenp,
 			       void __user *datap, unsigned long datasz)
 {
-	return -EINVAL;
+	struct property *prop;
+	int len;
+
+	prop = of_find_property(np, name, &len);
+
+	if (!prop) {
+		/* special case full_name as a property that is not on the fdt,
+		 * but we wish to return to the user as it includes the full
+		 * path of the device */
+		if (!strcmp(name, "full_name"))
+			return devtree_get_full_name(np, lenp, datap, datasz);
+		else
+			return -EINVAL;
+	}
+
+	if (lenp)
+		*lenp = len;
+
+	if (len > datasz)
+		return -EAGAIN;
+
+	if (copy_to_user(datap, prop->value, len))
+		return -EFAULT;
+
+	return 0;
 }
 
 static int devtree_get_uint(struct device_node *np, char *name,
--
2.1.1
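
On the receiving side, a buffer holding a list of strings separated by NUL characters has to be walked by hand. A minimal user-space sketch of such a walk, assuming the caller knows the returned length (`count_strings` is a hypothetical helper, and the compatible strings used below are just sample data):

```c
#include <stddef.h>
#include <string.h>

/* Count the complete (NUL-terminated) strings packed in buf[0..len). */
static size_t count_strings(const char *buf, size_t len)
{
    const char *s = buf;
    const char *end = buf + len;
    size_t n = 0;

    while (s < end) {
        const char *nul = memchr(s, '\0', (size_t)(end - s));

        if (!nul)
            break;      /* unterminated tail: do not count it */
        n++;
        s = nul + 1;    /* next string starts right after the NUL */
    }
    return n;
}
```

For a value such as `"arm,pl330\0arm,primecell"` (24 bytes including the final NUL) this reports two strings; the same loop body is where a caller would print or compare each entry.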
[PATCH v12 11/11] pvqspinlock, x86: Enable PV qspinlock for XEN
This patch adds the necessary XEN specific code to allow XEN to support the CPU halting and kicking operations needed by the queue spinlock PV code.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/xen/spinlock.c |  149 +--
 kernel/Kconfig.locks    |    2 +-
 2 files changed, 145 insertions(+), 6 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d1b6a32..8edc197 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,12 @@
 #include "xen-ops.h"
 #include "debugfs.h"
 
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifndef CONFIG_QUEUE_SPINLOCK
+
 enum xen_contention_stat {
 	TAKEN_SLOW,
 	TAKEN_SLOW_PICKUP,
@@ -100,12 +106,9 @@ struct xen_lock_waiting {
 	__ticket_t want;
 };
 
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
 static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
 static cpumask_t waiting_cpus;
 
-static bool xen_pvspin = true;
 __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 {
 	int irq = __this_cpu_read(lock_kicker_irq);
@@ -213,6 +216,118 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
 	}
 }
 
+#else /* CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_XEN_DEBUG_FS
+static u32 kick_nohlt_stats;	/* Kick but not halt count	*/
+static u32 halt_qhead_stats;	/* Queue head halting count	*/
+static u32 halt_qnode_stats;	/* Queue node halting count	*/
+static u32 halt_abort_stats;	/* Halting abort count		*/
+static u32 wake_kick_stats;	/* Wakeup by kicking count	*/
+static u32 wake_spur_stats;	/* Spurious wakeup count	*/
+static u64 time_blocked;	/* Total blocking time		*/
+
+static inline void xen_halt_stats(enum pv_lock_stats type)
+{
+	if (type == PV_HALT_QHEAD)
+		add_smp(&halt_qhead_stats, 1);
+	else if (type == PV_HALT_QNODE)
+		add_smp(&halt_qnode_stats, 1);
+	else /* type == PV_HALT_ABORT */
+		add_smp(&halt_abort_stats, 1);
+}
+
+void xen_lock_stats(enum pv_lock_stats type)
+{
+	if (type == PV_WAKE_KICKED)
+		add_smp(&wake_kick_stats, 1);
+	else if (type == PV_WAKE_SPURIOUS)
+		add_smp(&wake_spur_stats, 1);
+	else /* type == PV_KICK_NOHALT */
+		add_smp(&kick_nohlt_stats, 1);
+}
+PV_CALLEE_SAVE_REGS_THUNK(xen_lock_stats);
+
+static inline u64 spin_time_start(void)
+{
+	return sched_clock();
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+	u64 delta;
+
+	delta = sched_clock() - start;
+	add_smp(&time_blocked, delta);
+}
+#else /* CONFIG_XEN_DEBUG_FS */
+static inline void xen_halt_stats(enum pv_lock_stats type)
+{
+}
+
+static inline u64 spin_time_start(void)
+{
+	return 0;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+}
+#endif /* CONFIG_XEN_DEBUG_FS */
+
+void xen_kick_cpu(int cpu)
+{
+	xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+}
+PV_CALLEE_SAVE_REGS_THUNK(xen_kick_cpu);
+
+/*
+ * Halt the current CPU & release it back to the host
+ */
+void xen_halt_cpu(u8 *lockbyte)
+{
+	int irq = __this_cpu_read(lock_kicker_irq);
+	unsigned long flags;
+	u64 start;
+
+	/* If kicker interrupts not initialized yet, just spin */
+	if (irq == -1)
+		return;
+
+	/*
+	 * Make sure an interrupt handler can't upset things in a
+	 * partially setup state.
+	 */
+	local_irq_save(flags);
+	start = spin_time_start();
+
+	xen_halt_stats(lockbyte ? PV_HALT_QHEAD : PV_HALT_QNODE);
+	/* clear pending */
+	xen_clear_irq_pending(irq);
+
+	/* Allow interrupts while blocked */
+	local_irq_restore(flags);
+	/*
+	 * Don't halt if the lock is now available
+	 */
+	if (lockbyte && !ACCESS_ONCE(*lockbyte)) {
+		xen_halt_stats(PV_HALT_ABORT);
+		return;
+	}
+	/*
+	 * If an interrupt happens here, it will leave the wakeup irq
+	 * pending, which will cause xen_poll_irq() to return
+	 * immediately.
+	 */
+
+	/* Block until irq becomes pending (or perhaps a spurious wakeup) */
+	xen_poll_irq(irq);
+	spin_time_accum_blocked(start);
+}
+PV_CALLEE_SAVE_REGS_THUNK(xen_halt_cpu);
+
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
 	BUG();
@@ -258,7 +373,6 @@ void xen_uninit_lock_cpu(int cpu)
 	per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -275,8 +389,17 @@ void __init xen_init_spinlocks(void)
[PATCH v12 10/11] pvqspinlock, x86: Enable PV qspinlock for KVM
This patch adds the necessary KVM specific code to allow KVM to support the CPU halting and kicking operations needed by the queue spinlock PV code.

Two KVM guests of 20 CPU cores (2 nodes) were created for performance testing in one of the following two configurations:
 1) Only 1 VM is active
 2) Both VMs are active and they share the same 20 physical CPUs (200% overcommit)

The tests run included the disk workload of the AIM7 benchmark on both ext4 and xfs RAM disks at 3000 users on a 3.17 based kernel. The ebizzy -m test and futextest were also run and their performance data were recorded. With two VMs running, the idle=poll kernel option was added to simulate a busy guest. If PV qspinlock is not enabled, unfairlock will be used automatically in a guest.

                AIM7 XFS Disk Test (no overcommit)
  kernel          JPM       Real Time   Sys Time    Usr Time
  -----           ---       ---------   --------    --------
  PV ticketlock   2542373     7.08        98.95       5.44
  PV qspinlock    2549575     7.06        98.63       5.40
  unfairlock      2616279     6.91        97.05       5.42

                AIM7 XFS Disk Test (200% overcommit)
  kernel          JPM       Real Time   Sys Time    Usr Time
  -----           ---       ---------   --------    --------
  PV ticketlock    644468    27.93       415.22       6.33
  PV qspinlock     645624    27.88       419.84       0.39
  unfairlock       695518    25.88       377.40       4.09

                AIM7 EXT4 Disk Test (no overcommit)
  kernel          JPM       Real Time   Sys Time    Usr Time
  -----           ---       ---------   --------    --------
  PV ticketlock   1995565     9.02       103.67       5.76
  PV qspinlock    2011173     8.95       102.15       5.40
  unfairlock      2066590     8.71        98.13       5.46

                AIM7 EXT4 Disk Test (200% overcommit)
  kernel          JPM       Real Time   Sys Time    Usr Time
  -----           ---       ---------   --------    --------
  PV ticketlock    478341    37.63       495.81      30.78
  PV qspinlock     474058    37.97       475.74      30.95
  unfairlock       560224    32.13       398.43      26.27

For the AIM7 disk workload, both PV ticketlock and qspinlock have about the same performance. The unfairlock performs slightly better than the PV lock.

                EBIZZY-m Test (no overcommit)
  kernel          Rec/s     Real Time   Sys Time    Usr Time
  -----           -----     ---------   --------    --------
  PV ticketlock    3255      10.00       60.65       3.62
  PV qspinlock     3318      10.00       54.27       3.60
  unfairlock       2833      10.00       26.66       3.09

                EBIZZY-m Test (200% overcommit)
  kernel          Rec/s     Real Time   Sys Time    Usr Time
  -----           -----     ---------   --------    --------
  PV ticketlock     841      10.00       71.03       2.37
  PV qspinlock      834      10.00       68.27       2.39
  unfairlock        865      10.00       27.08       1.51

                futextest (no overcommit)
  kernel          kops/s
  -----           ------
  PV ticketlock    11523
  PV qspinlock     12328
  unfairlock        9478

                futextest (200% overcommit)
  kernel          kops/s
  -----           ------
  PV ticketlock     7276
  PV qspinlock      7095
  unfairlock        5614

The ebizzy and futextest have much higher spinlock contention than the AIM7 disk workload. In this case, the unfairlock performs worse than both the PV ticketlock and qspinlock. The performance of the 2 PV locks is comparable.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/kernel/kvm.c |  138 -
 kernel/Kconfig.locks  |    2 +-
 2 files changed, 138 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index bc11fb5..9fb9015 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -560,7 +560,7 @@ arch_initcall(activate_jump_labels);
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 /* Kick a cpu by its apicid. Used to wake up a halted vcpu */
-static void kvm_kick_cpu(int cpu)
+void kvm_kick_cpu(int cpu)
 {
 	int apicid;
 	unsigned long flags = 0;
@@ -568,7 +568,9 @@ static void kvm_kick_cpu(int cpu)
 	apicid = per_cpu(x86_cpu_to_apicid, cpu);
 	kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
+PV_CALLEE_SAVE_REGS_THUNK(kvm_kick_cpu);
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
 	TAKEN_SLOW,
 	TAKEN_SLOW_PICKUP,
@@ -796,6 +798,132 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
 		}
 	}
 }
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
+#ifdef CONFIG_KVM_DEBUG_FS
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+static u32 kick_nohlt_stats;	/* Kick but
[PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
This patch adds para-virtualization support to the queue spinlock code base with minimal impact to the native case. There are some minor code changes in the generic qspinlock.c file which should be usable in other architectures. The other code changes are specific to x86 processors and so are all put under the arch/x86 directory.

On the lock side, there are a couple of jump labels and 2 paravirt callee saved calls that default to NOPs and some register move instructions. So the performance impact should be minimal.

Since enabling paravirt spinlock will disable unlock function inlining, a jump label can be added to the unlock function without adding patch sites all over the kernel.

The actual paravirt code comes in 5 parts:

 - init_node; this initializes the extra data members required for PV state. PV state data is kept 1 cacheline ahead of the regular data.

 - link_and_wait_node; this replaces the regular MCS queuing code. CPU halting can happen if the wait is too long.

 - wait_head; this waits until the lock is available and the CPU will be halted if the wait is too long.

 - wait_check; this is called after acquiring the lock to see if the next queue head CPU is halted. If this is the case, the lock bit is changed to indicate the queue head will have to be kicked on unlock.

 - queue_unlock; this routine has a jump label to check if paravirt is enabled. If yes, it has to do an atomic cmpxchg to clear the lock bit or call the slowpath function to kick the queue head cpu.

Tracking the head is done in two parts, firstly the pv_wait_head will store its cpu number in whichever node is pointed to by the tail part of the lock word. Secondly, pv_link_and_wait_node() will propagate the existing head from the old to the new tail node.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/paravirt.h       |   20 ++
 arch/x86/include/asm/paravirt_types.h |   20 ++
 arch/x86/include/asm/pvqspinlock.h    |  403 +
 arch/x86/include/asm/qspinlock.h      |   44 -
 arch/x86/kernel/paravirt-spinlocks.c  |    6 +
 kernel/locking/qspinlock.c            |   72 ++-
 6 files changed, 558 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/include/asm/pvqspinlock.h

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index cd6e161..3b041db 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,25 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_kick_cpu(int cpu)
+{
+	PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu);
+}
+
+static __always_inline void
+pv_lockwait(u8 *lockbyte)
+{
+	PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte);
+}
+
+static __always_inline void pv_lockstat(enum pv_lock_stats type)
+{
+	PVOP_VCALLEE1(pv_lock_ops.lockstat, type);
+}
+
+#else
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
 							__ticket_t ticket)
 {
@@ -723,6 +742,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
 {
 	PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
+#endif
 
 #endif
 
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..49e4b76 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -326,6 +326,9 @@ struct pv_mmu_ops {
 			   phys_addr_t phys, pgprot_t flags);
 };
 
+struct mcs_spinlock;
+struct qspinlock;
+
 struct arch_spinlock;
 #ifdef CONFIG_SMP
 #include <asm/spinlock_types.h>
@@ -333,9 +336,26 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+enum pv_lock_stats {
+	PV_HALT_QHEAD,		/* Queue head halting	    */
+	PV_HALT_QNODE,		/* Other queue node halting */
+	PV_HALT_ABORT,		/* Halting aborted	    */
+	PV_WAKE_KICKED,		/* Wakeup by kicking	    */
+	PV_WAKE_SPURIOUS,	/* Spurious wakeup	    */
+	PV_KICK_NOHALT		/* Kick but CPU not halted  */
+};
+#endif
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+	struct paravirt_callee_save kick_cpu;
+	struct paravirt_callee_save lockstat;
+	struct paravirt_callee_save lockwait;
+#else
 	struct paravirt_callee_save lock_spinning;
 	void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/pvqspinlock.h b/arch/x86/include/asm/pvqspinlock.h
new file mode 100644
index 0000000..d424252
--- /dev/null
+++ b/arch/x86/include/asm/pvqspinlock.h
@@ -0,0 +1,403 @@
+#ifndef _ASM_X86_PVQSPINLOCK_H
+#define _ASM_X86_PVQSPINLOCK_H
+
+/*
+ *
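
(Editorial illustration, not part of the patch.) The queue_unlock step described above can be sketched with C11 atomics. Everything here is an assumption made for the sake of a runnable example: the lock-byte values, `struct sketch_lock`, and `kick_queue_head` are hypothetical stand-ins, and the real code lives in the pvqspinlock.h file added by this patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock-byte values, chosen only for this sketch. */
#define SKETCH_LOCKED         1u  /* held, nobody halted            */
#define SKETCH_LOCKED_HALTED  3u  /* held, queue head CPU is halted */

struct sketch_lock {
    _Atomic uint8_t byte;   /* the lock byte                        */
    unsigned kicks;         /* counts simulated kicks (test aid)    */
};

/* Stand-in for the hypervisor call that wakes the halted queue head. */
static void kick_queue_head(struct sketch_lock *l)
{
    l->kicks++;
}

/* Returns true if the fast path released the lock. */
static bool sketch_unlock(struct sketch_lock *l)
{
    uint8_t expected = SKETCH_LOCKED;

    /* Fast path: the byte still says "plain locked", so clear it. */
    if (atomic_compare_exchange_strong(&l->byte, &expected, 0))
        return true;

    /* Slow path: the byte was changed to flag a halted queue head,
     * so release the lock and kick that CPU. */
    atomic_store(&l->byte, 0);
    kick_queue_head(l);
    return false;
}
```

The cmpxchg is what makes the check and the release a single step: a waiter that marks itself halted just before the unlock either wins the race (and gets kicked) or sees the lock already free.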
[PATCH v12 08/11] qspinlock, x86: Rename paravirt_ticketlocks_enabled
This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/include/asm/spinlock.h      |    4 ++--
 arch/x86/kernel/kvm.c                |    2 +-
 arch/x86/kernel/paravirt-spinlocks.c |    4 ++--
 arch/x86/xen/spinlock.c              |    2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 5899483..928751e 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -39,7 +39,7 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD	(1 << 15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
 #ifdef CONFIG_QUEUE_SPINLOCK
@@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	if (TICKET_SLOWPATH_FLAG &&
-	    static_key_false(&paravirt_ticketlocks_enabled)) {
+	    static_key_false(&paravirt_spinlocks_enabled)) {
 		arch_spinlock_t prev;
 
 		prev = *lock;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3dd8e2c..bc11fb5 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -819,7 +819,7 @@ static __init int kvm_spinlock_init_jump(void)
 	if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
 		return 0;
 
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+	static_key_slow_inc(&paravirt_spinlocks_enabled);
 	printk(KERN_INFO "KVM setup paravirtual spinlock\n");
 
 	return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index bbb6c73..e434f24 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 0ba5f3b..d1b6a32 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jump(void)
 	if (!xen_domain())
 		return 0;
 
-	static_key_slow_inc(&paravirt_ticketlocks_enabled);
+	static_key_slow_inc(&paravirt_spinlocks_enabled);
 	return 0;
 }
 early_initcall(xen_init_spinlocks_jump);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v12 07/11] qspinlock: Revert to test-and-set on hypervisors
From: Peter Zijlstra pet...@infradead.org

When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/qspinlock.h |   14 ++++++++++++++
 include/asm-generic/qspinlock.h  |    7 +++++++
 kernel/locking/qspinlock.c       |    3 +++
 3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index a6a8762..05a77fe 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 
 #ifndef CONFIG_X86_PPRO_FENCE
@@ -20,6 +21,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 
 #endif /* !CONFIG_X86_PPRO_FENCE */
 
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+		return false;
+
+	while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+		cpu_relax();
+
+	return true;
+}
+
 #include <asm-generic/qspinlock.h>
 
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e8a7ae8..a53a7bb 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -98,6 +98,13 @@ static __always_inline void queue_spin_unlock(struct qspinlock *lock)
 }
 #endif
 
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+	return false;
+}
+#endif
+
 /*
  * Initializier
  */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fb0e988..1c1926a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -257,6 +257,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 
 	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+	if (virt_queue_spin_lock(lock))
+		return;
+
 	/*
 	 * wait for in-progress pending-locked hand-overs
 	 *
-- 
1.7.1
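For readers following along outside the kernel tree, the test-and-set fallback described in this patch can be sketched in userspace with C11 atomics. This is only an illustration, not the kernel code: struct tas_lock, the tas_* helpers and Q_LOCKED_VAL are names made up for the example, and the busy-wait loop stands in for the kernel's cpu_relax().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's _Q_LOCKED_VAL. */
#define Q_LOCKED_VAL 1u

/* Illustrative lock word; the kernel uses struct qspinlock instead. */
struct tas_lock {
	atomic_uint val;
};

/* One attempt: succeeds only if the whole word is 0 (unlocked). */
static bool tas_trylock(struct tas_lock *lock)
{
	unsigned int expected = 0;

	return atomic_compare_exchange_strong_explicit(&lock->val, &expected,
						       Q_LOCKED_VAL,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Spin until the compare-and-swap from 0 to locked succeeds. */
static void tas_lock_acquire(struct tas_lock *lock)
{
	while (!tas_trylock(lock))
		;	/* the kernel version calls cpu_relax() here */
}

/* Release the lock by storing 0 with release ordering. */
static void tas_unlock(struct tas_lock *lock)
{
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}
```

There is no queue in this scheme, so a waiter that gets preempted by the hypervisor cannot stall the waiters behind it; the price is that fairness is lost.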
[PATCH v12 06/11] qspinlock: Use a simple write to grab the lock
Currently, atomic_cmpxchg() is used to get the lock. However, this is
not really necessary if there is more than one task in the queue and
the queue head doesn't need to reset the tail code. For that case, a
simple write to set the lock bit is enough as the queue head will be
the only one eligible to get the lock as long as it checks that both
the lock and pending bits are not set. The current pending bit waiting
code will ensure that the bit will not be set as soon as the tail code
in the lock is set.

With that change, there is a slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine as shown in the tables below.

		[Standalone/Embedded - same node]
  # of tasks	Before patch	After patch	%Change
  ----------	------------	-----------	-------
       3	 2324/2321	 2248/2265	-3%/-2%
       4	 2890/2896	 2819/2831	-2%/-2%
       5	 3611/3595	 3522/3512	-2%/-2%
       6	 4281/4276	 4173/4160	-3%/-3%
       7	 5018/5001	 4875/4861	-3%/-3%
       8	 5759/5750	 5563/5568	-3%/-3%

		[Standalone/Embedded - different nodes]
  # of tasks	Before patch	After patch	%Change
  ----------	------------	-----------	-------
       3	12242/12237	12087/12093	-1%/-1%
       4	10688/10696	10507/10521	-2%/-2%

It was also found that this change produced a much bigger performance
improvement in the newer IvyBridge-EX chip and was essential to closing
the performance gap between the ticket spinlock and queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

                AIM7 XFS Disk Test
  kernel         JPM    Real Time   Sys Time    Usr Time
  ------         ---    ---------   --------    --------
  ticketlock   5678233     3.17       96.61       5.81
  qspinlock    5750799     3.13       94.83       5.97

                AIM7 EXT4 Disk Test
  kernel         JPM    Real Time   Sys Time    Usr Time
  ------         ---    ---------   --------    --------
  ticketlock   1114551    16.15      509.72       7.11
  qspinlock    2184466     8.24      232.99       6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.
The ebizzy -m test was also run with the following results:

  kernel       records/s  Real Time   Sys Time    Usr Time
  ------       ---------  ---------   --------    --------
  ticketlock     2075       10.00      216.35       3.49
  qspinlock      3023       10.00      198.20       4.80

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 kernel/locking/qspinlock.c |   59
 1 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7c127b4..fb0e988 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -103,24 +103,33 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
 	union {
 		atomic_t val;
-		struct {
 #ifdef __LITTLE_ENDIAN
+		u8	locked;
+		struct {
 			u16	locked_pending;
 			u16	tail;
+		};
 #else
+		struct {
 			u16	tail;
 			u16	locked_pending;
-#endif
 		};
+		struct {
+			u8	reserved[3];
+			u8	locked;
+		};
+#endif
 	};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -207,6 +216,19 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 -> *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue
[PATCH v12 05/11] qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   13
 kernel/locking/qspinlock.c            |   71
 2 files changed, 83 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 88d647c..01b46df 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ *     8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  *     8: pending
  *  9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS		8
+#else
 #define _Q_PENDING_BITS		1
+#endif
 #define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)
 #define
_Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 48bd2ad..7c127b4 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -22,6 +22,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -54,6 +55,10 @@ * node; whereby avoiding the need to carry a node from lock to unlock, and * preserving existing lock API. This also makes the unlock code simpler and * faster. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. + * */ #include mcs_spinlock.h @@ -94,6 +99,64 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + * + * Lock stealing is not allowed if this function is used. 
+ */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + ACCESS_ONCE(l-locked_pending) = _Q_LOCKED_VAL; +} + +/* + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + struct __qspinlock *l = (void *)lock; + + return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; +} + +#else /* _Q_PENDING_BITS == 8 */ + /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -141,6 +204,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) }
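To make the NR_CPUS < 16K layout from this patch concrete, the following userspace sketch recomputes the masks from the bit ranges listed in the qspinlock_types.h comment (locked byte 0-7, pending byte 8-15, tail index 16-17, tail cpu 18-31). The Q_* names and the helper function are hypothetical stand-ins for the kernel's _Q_* macros, chosen only for illustration.

```c
#include <stdint.h>

/* Build a mask of `bits` ones starting at bit `offset`. */
#define Q_SET_MASK(bits, offset) ((((uint32_t)1 << (bits)) - 1) << (offset))

/* Layout for NR_CPUS < 16K, following the comment in the patch. */
enum {
	Q_LOCKED_OFFSET   = 0,
	Q_LOCKED_BITS     = 8,
	Q_PENDING_OFFSET  = Q_LOCKED_OFFSET + Q_LOCKED_BITS,	 /* 8  */
	Q_PENDING_BITS    = 8,
	Q_TAIL_IDX_OFFSET = Q_PENDING_OFFSET + Q_PENDING_BITS,	 /* 16 */
	Q_TAIL_IDX_BITS   = 2,
	Q_TAIL_CPU_OFFSET = Q_TAIL_IDX_OFFSET + Q_TAIL_IDX_BITS, /* 18 */
	Q_TAIL_CPU_BITS   = 32 - Q_TAIL_CPU_OFFSET		 /* 14 */
};

#define Q_LOCKED_MASK	Q_SET_MASK(Q_LOCKED_BITS, Q_LOCKED_OFFSET)
#define Q_PENDING_MASK	Q_SET_MASK(Q_PENDING_BITS, Q_PENDING_OFFSET)
#define Q_TAIL_IDX_MASK	Q_SET_MASK(Q_TAIL_IDX_BITS, Q_TAIL_IDX_OFFSET)
#define Q_TAIL_CPU_MASK	Q_SET_MASK(Q_TAIL_CPU_BITS, Q_TAIL_CPU_OFFSET)
#define Q_TAIL_MASK	(Q_TAIL_IDX_MASK | Q_TAIL_CPU_MASK)
#define Q_LOCKED_VAL	((uint32_t)1 << Q_LOCKED_OFFSET)

/*
 * Value-level view of the *,1,0 -> *,0,1 transition: writing
 * Q_LOCKED_VAL into the low 16 bits clears pending and sets locked
 * while leaving the tail half of the word untouched.
 */
static uint32_t clear_pending_set_locked_value(uint32_t val)
{
	return (val & Q_TAIL_MASK) | Q_LOCKED_VAL;
}
```

Because the tail occupies exactly the upper 16 bits in this layout, the kernel can update the lower and upper halves with independent 16-bit operations, which is what the patch relies on.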
[PATCH v12 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock
This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent, so queue spinlock is not enabled there.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimizations which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/Kconfig                      |    1 +
 arch/x86/include/asm/qspinlock.h      |   25 +++++++++++++++++++++++++
 arch/x86/include/asm/spinlock.h       |    5 +++++
 arch/x86/include/asm/spinlock_types.h |    4 ++++
 4 files changed, 35 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fad4aa6..da42708 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -123,6 +123,7 @@ config X86
 	select MODULES_USE_ELF_RELA if X86_64
 	select CLONE_BACKWARDS if X86_32
 	select ARCH_USE_BUILTIN_BSWAP
+	select ARCH_USE_QUEUE_SPINLOCK
 	select ARCH_USE_QUEUE_RWLOCK
 	select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
 	select OLD_SIGACTION if X86_32
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 0000000..a6a8762
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,25 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#ifndef CONFIG_X86_PPRO_FENCE
+
+#define	queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * An effective smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	barrier();
+	ACCESS_ONCE(*(u8 *)lock) = 0;
+}
+
+#endif /* !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 9295016..5899483 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -180,6 +184,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 {
 	arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 5f9d757..5d654a1 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
 	union {
 		__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm-generic/qrwlock_types.h>
 
-- 
1.7.1
[PATCH v12 01/11] qspinlock: A simple generic 4-byte queue spinlock
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities. By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock.h | 118 +++ include/asm-generic/qspinlock_types.h | 58 + kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 + kernel/locking/mcs_spinlock.h |1 + kernel/locking/qspinlock.c| 207 + 6 files changed, 392 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 000..e8a7ae8 --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,118 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(lock-val); +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? + * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + * + * N.B. 
Whenever there are tasks waiting for the lock, it is considered + * locked wrt the lockref code to avoid lock stealing by the lockref + * code and change things underneath the lock. This also allows some + * optimizations to be applied without conflict with lockref. + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !atomic_read(lock.val); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(lock-val) ~_Q_LOCKED_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!atomic_read(lock-val) + (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0)) + return 1; + return 0; +} + +extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + u32 val; + +
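The 24-bit tail encoding mentioned in the commit message (CPU number plus queue node index, leaving the low byte for the lock) can be sketched as below. The bit positions follow the layout this patch documents (bits 8-9 tail index, bits 10-31 tail cpu + 1); the macro names and the out-parameter form of decode_tail() are assumptions made for a standalone example, since the kernel's decode_tail() returns a per-cpu node pointer instead.

```c
#include <stdint.h>

/* Patch 01 layout: bits 0-7 locked, 8-9 tail index, 10-31 tail cpu (+1). */
#define TAIL_IDX_OFFSET	8
#define TAIL_IDX_BITS	2
#define TAIL_IDX_MASK	(((1u << TAIL_IDX_BITS) - 1) << TAIL_IDX_OFFSET)
#define TAIL_CPU_OFFSET	(TAIL_IDX_OFFSET + TAIL_IDX_BITS)

/*
 * Pack (cpu, idx) into the upper 24 bits of the lock word. The CPU
 * number is stored as cpu + 1 so that a tail of 0 means "no queue".
 */
static uint32_t encode_tail(unsigned int cpu, unsigned int idx)
{
	return ((uint32_t)(cpu + 1) << TAIL_CPU_OFFSET) |
	       ((uint32_t)idx << TAIL_IDX_OFFSET);
}

/* Recover the CPU number and per-cpu node index from a tail code. */
static void decode_tail(uint32_t tail, unsigned int *cpu, unsigned int *idx)
{
	*cpu = (tail >> TAIL_CPU_OFFSET) - 1;
	*idx = (tail & TAIL_IDX_MASK) >> TAIL_IDX_OFFSET;
}
```

The two index bits are enough because, as the commit message notes, a CPU can be waiting on at most 4 locks at once (task, soft IRQ, hard IRQ and NMI context).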
[PATCH v12 04/11] qspinlock: Extract out code snippets for the next patch
This is a preparatory patch that extracts out the following 2 code snippets to prepare for the next performance optimization patch. 1) the logic for the exchange of new and previous tail code words into a new xchg_tail() function. 2) the logic for clearing the pending bit and setting the locked bit into a new clear_pending_set_locked() function. This patch also simplifies the trylock operation before queuing by calling queue_spin_trylock() directly. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock_types.h |2 + kernel/locking/qspinlock.c| 91 +--- 2 files changed, 62 insertions(+), 31 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 4196694..88d647c 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -58,6 +58,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 226b11d..48bd2ad 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -95,6 +95,54 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * clear_pending_set_locked - take ownership and clear the pending bit. 
+ * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + u32 new, old; + + for (;;) { + new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } +} + +/** + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + u32 old, new, val = atomic_read(lock-val); + + for (;;) { + new = (val _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + return old; +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word @@ -176,15 +224,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) * * *,1,0 - *,0,1 */ - for (;;) { - new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + clear_pending_set_locked(lock, val); return; /* @@ -201,37 +241,26 @@ queue: node-next = NULL; /* -* We have already touched the queueing cacheline; don't bother with -* pending stuff. -* -* trylock || xchg(lock, node) -* -* 0,0,0 - 0,0,1 ; no tail, not locked - no tail, locked. -* p,y,x - n,y,x ; tail was p - tail is n; preserving locked. +* We touched a (possibly) cold cacheline in the per-cpu queue node; +* attempt the trylock once more in the hope someone let go while we +* weren't watching. 
*/ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; /* -* we won the trylock; forget about queueing. +* We have already touched the queueing cacheline; don't bother with +* pending stuff. +* +* p,*,* - n,*,* */ - if (new == _Q_LOCKED_VAL) - goto release; + old = xchg_tail(lock, tail); /* * if there was a previous node; link it and wait until reaching the * head of the waitqueue. */ - if (old ~_Q_LOCKED_PENDING_MASK) { + if (old _Q_TAIL_MASK) { prev =
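The xchg_tail() helper factored out by this patch is a compare-and-swap retry loop that replaces only the tail bits of the lock word. A userspace approximation with C11 atomics might look like the sketch below; it assumes the one-pending-bit layout (locked byte plus pending bit in the low 9 bits), and LOCKED_PENDING_MASK and the _Atomic uint32_t lock word are stand-ins for the kernel's _Q_LOCKED_PENDING_MASK and struct qspinlock.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Locked byte (bits 0-7) plus a single pending bit (bit 8). */
#define LOCKED_PENDING_MASK 0x1ffu

/*
 * Install a new tail code while preserving the locked and pending
 * bits, and return the previous value of the whole lock word.
 *
 * p,*,* -> n,*,*
 */
static uint32_t xchg_tail(_Atomic uint32_t *lockword, uint32_t tail)
{
	uint32_t val = atomic_load_explicit(lockword, memory_order_relaxed);

	/* On failure the CAS reloads val, so the new value is recomputed. */
	while (!atomic_compare_exchange_weak(lockword, &val,
					     (val & LOCKED_PENDING_MASK) | tail))
		;

	return val;
}
```

A caller would then test the returned word against the tail mask to decide whether there was a previous queue node to link behind.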
[PATCH v12 03/11] qspinlock: Add pending bit
From: Peter Zijlstra pet...@infradead.org

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.

Signed-off-by: Peter Zijlstra pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   12
 kernel/locking/qspinlock.c            |  119
 2 files changed, 107 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 67a2110..4196694 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ *     8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define	_Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
 				      << _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS		8
 #define _Q_LOCKED_MASK		_Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS		1
+#define _Q_PENDING_MASK		_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS	2
 #define _Q_TAIL_IDX_MASK	_Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK	_Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL		(1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL		(1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff
--git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index c114076..226b11d 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -92,24 +92,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) return per_cpu_ptr(mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word * - * (queue tail, lock value) - * - * fast :slow : unlock - *: : - * uncontended (0,0) --:-- (0,1) :-- (*,0) - *: | ^./ : - *: v \ | : - * uncontended:(n,x) --+-- (n,0) | : - * queue: | ^--' | : - *: v | : - * contended :(*,x) --+-- (*,0) - (*,1) ---' : - * queue: ^--' : + * (queue tail, pending bit, lock value) * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +-- (0,1,0) \ | : + * : | ^--' | | : + * : v | | : + * uncontended :(n,x,y) +-- (n,0,0) --' | : + * queue : | ^--' | : + * : v | : + * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -' : + * queue : ^--' : */ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) { @@ -119,6 +123,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + /* +* wait for in-progress pending-locked hand-overs +* +* 0,1,0 - 0,0,1 +*/ + if (val == _Q_PENDING_VAL) { + while ((val = atomic_read(lock-val)) == _Q_PENDING_VAL) +
[PATCH v12 00/11] qspinlock: a 4-byte queue spinlock with PV support
v11->v12:
 - Based on PeterZ's version of the qspinlock patch
   (https://lkml.org/lkml/2014/6/15/63).
 - Incorporated many of the review comments from Konrad Wilk and
   Paolo Bonzini.
 - The pvqspinlock code is largely from my previous version with
   PeterZ's way of going from queue tail to head and his idea of
   using callee saved calls to KVM and XEN code.

v10->v11:
 - Use a simple test-and-set unfair lock to simplify the code,
   but performance may suffer a bit for large guests with many CPUs.
 - Take out Raghavendra KT's test results as the unfair lock changes
   may render some of his results invalid.
 - Add PV support without increasing the size of the core queue node
   structure.
 - Other minor changes to address some of the feedback comments.

v9->v10:
 - Make some minor changes to qspinlock.c to accommodate review feedback.
 - Change author to PeterZ for 2 of the patches.
 - Include Raghavendra KT's test results in patch 18.

v8->v9:
 - Integrate PeterZ's version of the queue spinlock patch with some
   modification:
   http://lkml.kernel.org/r/20140310154236.038181...@infradead.org
 - Break the more complex patches into smaller ones to ease review effort.
 - Fix a race condition in the PV qspinlock code.

v7->v8:
 - Remove one unneeded atomic operation from the slowpath, thus
   improving performance.
 - Simplify some of the code and add more comments.
 - Test for X86_FEATURE_HYPERVISOR CPU feature bit to enable/disable
   unfair lock.
 - Reduce unfair lock slowpath lock stealing frequency depending
   on its distance from the queue head.
 - Add performance data for IvyBridge-EX CPU.

v6->v7:
 - Remove an atomic operation from the 2-task contending code
 - Shorten the names of some macros
 - Make the queue waiter attempt to steal the lock when unfair lock is
   enabled.
 - Remove lock holder kick from the PV code and fix a race condition
 - Run the unfair lock PV code on overcommitted KVM guests to collect
   performance data.

v5->v6:
 - Change the optimized 2-task contending code to make it fairer at the
   expense of a bit of performance.
 - Add a patch to support unfair queue spinlock for Xen.
 - Modify the PV qspinlock code to follow what was done in the PV
   ticketlock.
 - Add performance data for the unfair lock as well as the PV
   support code.

v4->v5:
 - Move the optimized 2-task contending code to the generic file to
   enable more architectures to use it without code duplication.
 - Address some of the style-related comments by PeterZ.
 - Allow the use of unfair queue spinlock in a real para-virtualized
   execution environment.
 - Add para-virtualization support to the qspinlock code by ensuring
   that the lock holder and queue head stay alive as much as possible.

v3->v4:
 - Remove debugging code and fix a configuration error
 - Simplify the qspinlock structure and streamline the code to make it
   perform a bit better
 - Add an x86 version of asm/qspinlock.h for holding x86 specific
   optimization.
 - Add an optimized x86 code path for 2 contending tasks to improve
   low contention performance.

v2->v3:
 - Simplify the code by using numerous mode only without an unfair option.
 - Use the latest smp_load_acquire()/smp_store_release() barriers.
 - Move the queue spinlock code to kernel/locking.
 - Make the use of queue spinlock the default for x86-64 without user
   configuration.
 - Additional performance tuning.

v1->v2:
 - Add some more comments to document what the code does.
 - Add a numerous CPU mode to support >= 16K CPUs
 - Add a configuration option to allow lock stealing which can further
   improve performance in many cases.
 - Enable wakeup of queue head CPU at unlock time for non-numerous
   CPU mode.

This patch set has 3 different sections:
 1) Patches 1-6: Introduces a queue-based spinlock implementation that
    can replace the default ticket spinlock without increasing the
    size of the spinlock data structure. As a result, critical kernel
    data structures that embed spinlock won't increase in size and
    break data alignments.
 2) Patch 7: Enables the use of unfair lock in a virtual guest. This
    can resolve some of the locking related performance issues due to
    the fact that the next CPU to get the lock may have been scheduled
    out for a period of time.
 3) Patches 8-11: Enable qspinlock para-virtualization support by
    halting the waiting CPUs after spinning for a certain amount of
    time. The unlock code will detect a sleeping waiter and wake it
    up. This is essentially the same logic as the PV ticketlock code.

The queue spinlock has slightly better performance than the ticket
spinlock in the uncontended case. Its performance can be much better
with moderate to heavy contention. This patch has the potential of
improving the performance of all the workloads that have moderate to
heavy spinlock contention.

The queue spinlock is especially suitable for NUMA machines with at
least 2 sockets. Though even at
Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors
On 10/15/2014 01:40 PM, Rusty Russell wrote:
> Jason Wang <jasow...@redhat.com> writes:
>> Below should be useful for some experiments Jason is doing.
>> I thought I'd send it out for early review/feedback.
>>
>> event idx feature allows us to defer interrupts until
>> a specific # of descriptors were used.
>> Sometimes it might be useful to get an interrupt after
>> a specific descriptor, regardless.
>> This adds a descriptor flag for this, and an API
>> to create an urgent output descriptor.
>> This is still an RFC:
>> we'll need a feature bit for drivers to detect this,
>> but we've run out of feature bits for virtio 0.X.
>> For experimentation purposes, drivers can assume
>> this is set, or add a driver-specific feature bit.
>>
>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
>> Signed-off-by: Jason Wang <jasow...@redhat.com>
>
> The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
> networking (which tends to take packets in order) couldn't we just set
> the event counter to give us a tx interrupt at the packet we want?
>
> Cheers,
> Rusty.

Yes, we could. The recent RFC that enables tx interrupts uses this.
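Rusty's alternative relies on the event index arithmetic that virtio already has. As a rough sketch only: need_event() below copies the comparison done by vring_need_event() in the Linux virtio_ring.h uapi header, while tx_interrupt_wanted() and its parameter names are made up here for illustration and are not part of any patch in this thread.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Same arithmetic as vring_need_event(): the device should raise an
 * interrupt if the used index moved from old_idx to new_idx and, in
 * doing so, passed event_idx. All values are free-running 16-bit
 * counters, so the subtractions wrap on purpose.
 */
static bool need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/*
 * Hypothetical helper: the driver wants a tx interrupt once the packet
 * that lands in used-ring position wanted_idx has been consumed, so it
 * publishes wanted_idx as its used_event value. The device evaluates
 * the check after each batch of completions.
 */
static bool tx_interrupt_wanted(uint16_t wanted_idx, uint16_t used_before,
				uint16_t used_after)
{
	return need_event(wanted_idx, used_after, used_before);
}
```

With wanted_idx = 5, a batch that moves the used index from 3 to 6 triggers the interrupt, while a batch from 3 to 5 does not, because entry 5 has not been used yet.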
[RFC 07/11] powerpc: kvm: the stopper func to cease secondary hwthread
To enter the guest, the primary hwthread schedules the stopper func on
the secondary threads and forces them into NAP mode. On exit to the
host, the secondary threads are hardcoded to restore the stack, then
switch back to the stopper func, i.e. the host.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c            | 15 +++++++++++++++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ba258c8..4348abd 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1486,6 +1486,21 @@ static void kvmppc_remove_runnable(struct kvmppc_vcore *vc,
 	list_del(&vcpu->arch.run_list);
 }
 
+#ifdef KVMPPC_ENABLE_SECONDARY
+
+extern void kvmppc_secondary_stopper_enter();
+
+static int kvmppc_secondary_stopper(void *data)
+{
+	int cpu =smp_processor_id();
+	struct paca_struct *lpaca = get_paca();
+	BUG_ON(!(cpu%thread_per_core));
+
+	kvmppc_secondary_stopper_enter();
+}
+
+#endif
+
 static int kvmppc_grab_hwthread(int cpu)
 {
 	struct paca_struct *tpaca;
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index d5594b0..254038b 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -349,7 +349,41 @@ kvm_do_nap:
 
 #ifdef PPCKVM_ENABLE_SECONDARY
 kvm_secondary_exit_trampoline:
+
+	/* all register is free to use, later kvmppc_secondary_stopper_exit set up them*/
+	//loop-wait for the primary to signal that host env is ready
+
+	LOAD_REG_ADDR(r5, kvmppc_secondary_stopper_exit)
+	/* fixme, load msr from lpaca stack */
+	li	r6, MSR_IR | MSR_DR
+	mtsrr0	r5
+	mtsrr1	r6
+	RFI
+
+_GLOBAL_TOC(kvmppc_secondary_stopper_enter)
+	mflr	r0
+	std	r0, PPC_LR_STKOFF(r1)
+	stdu	r1, -112(r1)
+
+	/* fixme: store other register such as msr */
+
+	/* prevent us to enter kernel */
+	li	r0, 1
+	stb	r0, HSTATE_HWTHREAD_REQ(r13)
+	/* tell the primary that we are ready */
+	li	r0,KVM_HWTHREAD_IN_KERNEL
+	stb	r0,HSTATE_HWTHREAD_STATE(r13)
+	nap
 	b	.
+
+/* enter with vmode */
+kvmppc_secondary_stopper_exit:
+	/* fixme, restore the stack which we store on lpaca */
+
+	ld	r0, 112+PPC_LR_STKOFF(r1)
+	addi	r1, r1, 112
+	mtlr	r0
+	blr
 #endif
 
 /******************************************************************************
--
1.8.3.1
[RFC 09/11] powerpc: kvm: handle time base on secondary hwthread
(This is a placeholder patch.)
We need to store the host time base on the secondary hwthread.
Later, when switching back, we need to reprogram it with the elapsed
time added.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 89ea16c..a817ba6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -371,6 +371,8 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)
 
 	/* fixme: store other register such as msr */
 
+	/* fixme: store the tb, and set it as MAX, so we cease the tick on secondary */
+
 	/* prevent us to enter kernel */
 	li	r0, 1
 	stb	r0, HSTATE_HWTHREAD_REQ(r13)
@@ -382,6 +384,10 @@ _GLOBAL_TOC(kvmppc_secondary_stopper_enter)
 
 /* enter with vmode */
 kvmppc_secondary_stopper_exit:
+	/* fixme: restore the tb, with the orig val plus time elapse
+	 * so we can fire the hrtimer as soon as possible
+	 */
+
 	/* fixme, restore the stack which we store on lpaca */
 
 	ld	r0, 112+PPC_LR_STKOFF(r1)
--
1.8.3.1
[RFC 05/11] sched: introduce stop_cpus_async() to schedule special tsk on cpu
The proto will be:
	cpu1				cpuX
	stop_cpus_async()
					bring cpuX to a special state
					signal flag and trapped
	check for flag

This function helps powerpc reuse the cpu_stopper_task scheme to force
the secondary hwthreads into the NAP state, in which a cpu does not run
any longer until the master cpu tells it to go.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 include/linux/stop_machine.h |  2 ++
 kernel/stop_machine.c        | 25 ++++++++++++++++++++-----
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index d2abbdb..871c1bf 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -32,6 +32,8 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *
 void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
 			 struct cpu_stop_work *work_buf);
 int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
+int stop_cpus_async(const struct cpumask *cpumask, cpu_stop_fn_t fn,
+	void *arg);
 int try_stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
 
 #else	/* CONFIG_SMP */
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 695f0c6..d26fd6a 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -354,13 +354,15 @@ static void queue_stop_cpus_work(const struct cpumask *cpumask,
 }
 
 static int __stop_cpus(const struct cpumask *cpumask,
-		       cpu_stop_fn_t fn, void *arg)
+		       cpu_stop_fn_t fn, void *arg, bool sync)
 {
 	struct cpu_stop_done done;
 
-	cpu_stop_init_done(&done, cpumask_weight(cpumask));
+	if (sync)
+		cpu_stop_init_done(&done, cpumask_weight(cpumask));
 	queue_stop_cpus_work(cpumask, fn, arg, &done);
-	wait_for_completion(&done.completion);
+	if (sync)
+		wait_for_completion(&done.completion);
 	return done.executed ? done.ret : -ENOENT;
 }
 
@@ -398,7 +400,20 @@ int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
 
 	/* static works are used, process one request at a time */
 	mutex_lock(&stop_cpus_mutex);
-	ret = __stop_cpus(cpumask, fn, arg);
+	ret = __stop_cpus(cpumask, fn, arg, true);
+	mutex_unlock(&stop_cpus_mutex);
+	return ret;
+}
+
+/* similar to stop_cpus(), but not wait for the ack. */
+int stop_cpus_async(const struct cpumask *cpumask, cpu_stop_fn_t fn,
+	void *arg)
+{
+	int ret;
+
+	/* static works are used, process one request at a time */
+	mutex_lock(&stop_cpus_mutex);
+	ret = __stop_cpus(cpumask, fn, arg, false);
 	mutex_unlock(&stop_cpus_mutex);
 	return ret;
 }
@@ -428,7 +443,7 @@ int try_stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
 	/* static works are used, process one request at a time */
 	if (!mutex_trylock(&stop_cpus_mutex))
 		return -EAGAIN;
-	ret = __stop_cpus(cpumask, fn, arg);
+	ret = __stop_cpus(cpumask, fn, arg, true);
 	mutex_unlock(&stop_cpus_mutex);
 	return ret;
 }
--
1.8.3.1
[RFC 03/11] powerpc: kvm: add interface to control kvm function on a core
When kvm is enabled on a core, we migrate all external irqs to the
primary thread, since the kvmirq logic is currently handled by the
primary hwthread.

Todo: this patch lacks re-enabling of irqbalance when kvm is disabled
on the core.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/sysfs.c            | 39 +++++++++++++++++++++++++++++++++++++++
 arch/powerpc/sysdev/xics/xics-common.c | 12 ++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 67fd2fd..a2595dd 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -552,6 +552,45 @@ static void sysfs_create_dscr_default(void)
 	if (cpu_has_feature(CPU_FTR_DSCR))
 		err = device_create_file(cpu_subsys.dev_root, &dev_attr_dscr_default);
 }
+
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+#define NR_CORES (CONFIG_NR_CPUS/threads_per_core)
+static DECLARE_BITMAP(kvm_on_core, NR_CORES) __read_mostly
+
+static ssize_t show_kvm_enable(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+}
+
+static ssize_t __used store_kvm_enable(struct device *dev,
+		struct device_attribute *attr, const char *buf,
+		size_t count)
+{
+	struct cpumask stop_cpus;
+	unsigned long core, thr;
+
+	sscanf(buf, "%lx", &core);
+	if (core > NR_CORES)
+		return -1;
+	if (!test_bit(core, kvm_on_core))
+		for (thr = 1; thr < threads_per_core; thr++)
+			if (cpu_online(thr * threads_per_core + thr))
+				cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);
+
+	stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
+	set_bit(core, kvm_on_core);
+	return count;
+}
+
+static DEVICE_ATTR(kvm_enable, 0600,
+		show_kvm_enable, store_kvm_enable);
+
+static void sysfs_create_kvm_enable(void)
+{
+	device_create_file(cpu_subsys.dev_root, &dev_attr_kvm_enable);
+}
+#endif
+
 #endif /* CONFIG_PPC64 */
 
 #ifdef HAS_PPC_PMC_PA6T
diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c
index fe0cca4..68b33d8 100644
--- a/arch/powerpc/sysdev/xics/xics-common.c
+++ b/arch/powerpc/sysdev/xics/xics-common.c
@@ -258,6 +258,18 @@ unlock:
 		raw_spin_unlock_irqrestore(&desc->lock, flags);
 	}
 }
+
+int xics_migrate_irqs_away_secondary(void *data)
+{
+	int cpu = smp_processor_id();
+	if(cpu%thread_per_core != 0) {
+		WARN(condition, format...);
+		return 0;
+	}
+	/* In fact, if we can migrate the primary, it will be more fine */
+	xics_migrate_irqs_away();
+	return 0;
+}
 #endif /* CONFIG_HOTPLUG_CPU */
 
 #ifdef CONFIG_SMP
--
1.8.3.1
[RFC 04/11] powerpc: kvm: introduce a kthread on primary thread to anti tickless
(This patch is a placeholder.)

If only one vcpu thread is ready (the other vcpu threads can wait for
it to execute), the primary thread can enter tickless mode. The primary
then keeps running, so the secondaries have no opportunity to exit to
the host, even if they have other tasks on them.

Introduce a kthread (anti_tickless) on the primary, so that when there
is only one vcpu thread on the primary, the secondaries can resort to
anti_tickless to keep the primary out of tickless mode.
(I thought that the anti_tickless thread can go to NAP, so we can let
the secondary run.)

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/sysfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index a2595dd..f0b110e 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -575,9 +575,11 @@ static ssize_t __used store_kvm_enable(struct device *dev,
 	if (!test_bit(core, kvm_on_core))
 		for (thr = 1; thr < threads_per_core; thr++)
 			if (cpu_online(thr * threads_per_core + thr))
-				cpumask_set_cpu(thr * threads_per_core + thr, &stop_cpus);
+				cpumask_set_cpu(core * threads_per_core + thr, &stop_cpus);
 
 	stop_machine(xics_migrate_irqs_away_secondary, NULL, &stop_cpus);
+	/* fixme, create a kthread on primary hwthread to handle tickless mode */
+	//kthread_create_on_cpu(prevent_tickless, NULL, core * threads_per_core, "ppckvm_prevent_tickless");
 	set_bit(core, kvm_on_core);
 	return count;
 }
--
1.8.3.1
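The one-line fix in this patch depends on how hardware threads are numbered within a core: thread thr of core `core` is CPU `core * threads_per_core + thr`, and thread 0 is the primary. A small userspace sketch of that indexing follows. THREADS_PER_CORE, primary_cpu_of_core() and secondary_mask_of_core() are illustrative names, not kernel symbols, and a 64-bit word stands in for struct cpumask.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's threads_per_core (SMT8 here). */
#define THREADS_PER_CORE 8u

/* First (primary) hardware thread of a core. */
static unsigned int primary_cpu_of_core(unsigned int core)
{
	return core * THREADS_PER_CORE;
}

/*
 * Bitmask of the secondary hardware threads of 'core': every thread of
 * the core except thread 0. A plain 64-bit word replaces struct cpumask,
 * so this sketch only covers CPU numbers 0..63 (cores 0..7).
 */
static uint64_t secondary_mask_of_core(unsigned int core)
{
	uint64_t mask = 0;
	unsigned int thr;

	for (thr = 1; thr < THREADS_PER_CORE; thr++)
		mask |= UINT64_C(1) << (primary_cpu_of_core(core) + thr);
	return mask;
}
```

For core 1 this selects CPUs 9 to 15 and leaves CPU 8, the primary, out of the mask.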
[RFC 11/11] powerpc: kvm: Kconfig add an option for enabling secondary hwthread
Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 602eb51..de38566 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -93,6 +93,10 @@ config KVM_BOOK3S_64_HV
 
 	  If unsure, say N.
 
+config KVMPPC_ENABLE_SECONDARY
+	tristate "KVM support for running on secondary hwthread in host"
+	depends on KVM_BOOK3S_64_HV
+
 config KVM_BOOK3S_64_PR
 	tristate "KVM support without using hypervisor mode in host"
 	depends on KVM_BOOK3S_64
--
1.8.3.1
[RFC 06/11] powerpc: kvm: introduce online in paca to indicate whether cpu is needed by host
Nowadays, powerKVM runs with the secondary hwthreads offline. Although
we can bring all secondary hwthreads online later, we still preserve
this behavior for a dedicated KVM environment. Achieve this by setting
paca->online to false.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/paca.h         |  3 +++
 arch/powerpc/kernel/asm-offsets.c       |  3 +++
 arch/powerpc/kernel/smp.c               |  3 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 12 ++++++++++++
 4 files changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index a5139ea..67c2500 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -84,6 +84,9 @@ struct paca_struct {
 	u8 cpu_start;			/* At startup, processor spins until */
 					/* this becomes non-zero. */
 	u8 kexec_state;		/* set when kexec down has irqs off */
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+	u8 online;
+#endif
 #ifdef CONFIG_PPC_STD_MMU_64
 	struct slb_shadow *slb_shadow_ptr;
 	struct dtl_entry *dispatch_log;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 9d7dede..0faa8fe 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -182,6 +182,9 @@ int main(void)
 	DEFINE(PACATOC, offsetof(struct paca_struct, kernel_toc));
 	DEFINE(PACAKBASE, offsetof(struct paca_struct, kernelbase));
 	DEFINE(PACAKMSR, offsetof(struct paca_struct, kernel_msr));
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+	DEFINE(PACAONLINE, offsetof(struct paca_struct, online));
+#endif
 	DEFINE(PACASOFTIRQEN, offsetof(struct paca_struct, soft_enabled));
 	DEFINE(PACAIRQHAPPENED, offsetof(struct paca_struct, irq_happened));
 	DEFINE(PACACONTEXTID, offsetof(struct paca_struct, context.id));
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a0738af..4c3843e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -736,6 +736,9 @@ void start_secondary(void *unused)
 
 	cpu_startup_entry(CPUHP_ONLINE);
 
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+	get_paca()->online = true;
+#endif
 	BUG();
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index f0c4db7..d5594b0 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -322,6 +322,13 @@ kvm_no_guest:
 	li	r0, KVM_HWTHREAD_IN_NAP
 	stb	r0, HSTATE_HWTHREAD_STATE(r13)
 kvm_do_nap:
+#ifdef PPCKVM_ENABLE_SECONDARY
+	/* check the cpu is needed by host or not */
+	ld	r2, PACAONLINE(r13)
+	ld	r3, 0
+	cmp	r2, r3
+	bne	kvm_secondary_exit_trampoline
+#endif
 	/* Clear the runlatch bit before napping */
 	mfspr	r2, SPRN_CTRLF
 	clrrdi	r2, r2, 1
@@ -340,6 +347,11 @@ kvm_do_nap:
 	nap
 	b	.
 
+#ifdef PPCKVM_ENABLE_SECONDARY
+kvm_secondary_exit_trampoline:
+	b	.
+#endif
+
 /******************************************************************************
  *                                                                            *
  *                               Entry code                                   *
--
1.8.3.1
[RFC 08/11] powerpc: kvm: add a flag in vcore to sync primary with secondary hwthread
The secondary thread can only jump back to the host once the primary
has set up the env. Add a host_ready field in kvmppc_vcore to sync this
action.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/kvm_host.h     |  3 +++
 arch/powerpc/kernel/asm-offsets.c       |  3 +++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++++++++++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 9a3355e..1310e03 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -305,6 +305,9 @@ struct kvmppc_vcore {
 	u32 arch_compat;
 	ulong pcr;
 	ulong dpdes;		/* doorbell state (POWER8) */
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+	u8 host_ready;
+#endif
 	void *mpp_buffer; /* Micro Partition Prefetch buffer */
 	bool mpp_buffer_is_valid;
 };
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 0faa8fe..9c04ac2 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -562,6 +562,9 @@ int main(void)
 	DEFINE(VCORE_LPCR, offsetof(struct kvmppc_vcore, lpcr));
 	DEFINE(VCORE_PCR, offsetof(struct kvmppc_vcore, pcr));
 	DEFINE(VCORE_DPDES, offsetof(struct kvmppc_vcore, dpdes));
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+	DEFINE(VCORE_HOST_READY, offsetof(struct kvmppc_vcore, host_ready));
+#endif
 	DEFINE(VCPU_SLB_E, offsetof(struct kvmppc_slb, orige));
 	DEFINE(VCPU_SLB_V, offsetof(struct kvmppc_slb, origv));
 	DEFINE(VCPU_SLB_SIZE, sizeof(struct kvmppc_slb));
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 254038b..89ea16c 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -351,7 +351,11 @@ kvm_do_nap:
 kvm_secondary_exit_trampoline:
 
 	/* all register is free to use, later kvmppc_secondary_stopper_exit set up them*/
-	//loop-wait for the primary to signal that host env is ready
+	/* wait until the primary to set up host env */
+	ld	r5, HSTATE_KVM_VCORE(r13)
+	ld	r0, VCORE_HOST_READY(r5)
+	cmp	r0, //primary is ready?
+	bne	kvm_secondary_exit_trampoline
 
 	LOAD_REG_ADDR(r5, kvmppc_secondary_stopper_exit)
 	/* fixme, load msr from lpaca stack */
@@ -1821,6 +1825,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 	li	r0, KVM_GUEST_MODE_NONE
 	stb	r0, HSTATE_IN_GUEST(r13)
 
+#ifdef PPCKVM_ENABLE_SECONDARY
+	/* signal the secondary that host env is ready */
+	li	r0, 1
+	stb	r0, VCORE_HOST_READY(r5)
+#endif
 	ld	r0, 112+PPC_LR_STKOFF(r1)
 	addi	r1, r1, 112
 	mtlr	r0
--
1.8.3.1
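The handshake described here, where the primary sets host_ready and the secondary polls it before jumping back to the host, can be modelled in userspace with C11 atomics. This is only a sketch under stated assumptions: struct vcore_model and both function names are invented for illustration, and the release/acquire ordering is a choice made for this model, not something the RFC specifies.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the host_ready byte in the vcore. */
struct vcore_model {
	atomic_uchar host_ready;
};

/* Primary side: publish that the host environment has been set up. */
static void primary_signal_host_ready(struct vcore_model *vc)
{
	atomic_store_explicit(&vc->host_ready, 1, memory_order_release);
}

/*
 * Secondary side: poll the flag up to max_spins times. Returns true as
 * soon as the flag is seen set, false if the spin budget runs out. The
 * bound exists only so the sketch can be exercised from one thread; the
 * trampoline in the patch branches back without any limit.
 */
static bool secondary_wait_host_ready(struct vcore_model *vc,
				      unsigned long max_spins)
{
	while (max_spins--) {
		if (atomic_load_explicit(&vc->host_ready, memory_order_acquire))
			return true;
	}
	return false;
}
```

The acquire load pairs with the release store, so whatever the primary wrote before setting the flag is visible to the secondary once it observes host_ready as set.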
[RFC 02/11] powerpc: kvm: ensure vcpu-thread run only on primary hwthread
When a vcpu thread runs for the first time, it makes sure that it
sticks to the primary thread.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/kvm_host.h |  3 +++
 arch/powerpc/kvm/book3s_hv.c        | 17 +++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 98d9dd5..9a3355e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -666,6 +666,9 @@ struct kvm_vcpu_arch {
 	spinlock_t tbacct_lock;
 	u64 busy_stolen;
 	u64 busy_preempt;
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+	bool cpu_selected;
+#endif
 #endif
 };
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 27cced9..ba258c8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1909,6 +1909,23 @@ static int kvmppc_vcpu_run_hv(struct kvm_run *run, struct kvm_vcpu *vcpu)
 {
 	int r;
 	int srcu_idx;
+#ifdef CONFIG_KVMPPC_ENABLE_SECONDARY
+	int cpu = smp_processor_id();
+	int target_cpu;
+	unsigned int cpu;
+	struct task_struct *p = current;
+
+	if (unlikely(!vcpu->arch.cpu_selected)) {
+		vcpu->arch.cpu_selected = true;
+		for (cpu = 0; cpu < NR_CPUS; cpu+=threads_per_core) {
+			cpumask_set_cpu(cpu, &p->sys_allowed);
+		}
+		if (cpu%threads_per_core != 0) {
+			target_cpu = cpu/threads_per_core*threads_per_core;
+			migrate_task_to(current, target_cpu);
+		}
+	}
+#endif
 
 	if (!vcpu->arch.sane) {
 		run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
--
1.8.3.1
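The two pieces of arithmetic in this hunk, allowing only every threads_per_core-th CPU and rounding the current CPU down to its primary, can be checked in isolation. The sketch below is a userspace model: THREADS_PER_CORE, primary_thread_of() and primary_threads_mask() are hypothetical names, and a 64-bit word takes the place of struct cpumask.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's threads_per_core (SMT8 here). */
#define THREADS_PER_CORE 8

/* Round a CPU number down to the primary (first) thread of its core. */
static int primary_thread_of(int cpu)
{
	return cpu / THREADS_PER_CORE * THREADS_PER_CORE;
}

/*
 * Bitmask with one bit set per primary thread among CPUs 0..nr_cpus-1.
 * A 64-bit word stands in for struct cpumask, so nr_cpus must be <= 64.
 */
static uint64_t primary_threads_mask(int nr_cpus)
{
	uint64_t mask = 0;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu += THREADS_PER_CORE)
		mask |= UINT64_C(1) << cpu;
	return mask;
}
```

With SMT8, CPU 13 rounds down to CPU 8, and a 32-CPU system allows CPUs 0, 8, 16 and 24.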
[RFC 10/11] powerpc: kvm: on_primary_thread() force the secondary threads into NAP mode
The primary hwthread stops the scheduler on the secondary hwthreads by
bringing them into NAP. After that, the secondaries are ready for the
guest.

Signed-off-by: Liu Ping Fan <pingf...@linux.vnet.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4348abd..7896c31 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1593,15 +1593,22 @@ static int on_primary_thread(void)
 {
 	int cpu = smp_processor_id();
 	int thr;
+	struct cpumask msk;
 
 	/* Are we on a primary subcore? */
 	if (cpu_thread_in_subcore(cpu))
 		return 0;
 
 	thr = 0;
+#ifdef KVMPPC_ENABLE_SECONDARY
+	while (++thr < threads_per_subcore)
+		cpumask_set_cpu(thr, &msk);
+	stop_cpus_async(&msk, kvmppc_secondary_stopper, NULL);
+#else
 	while (++thr < threads_per_subcore)
 		if (cpu_online(cpu + thr))
 			return 0;
+#endif
 
 	/* Grab all hw threads so they can't go into the kernel */
 	for (thr = 1; thr < threads_per_subcore; ++thr) {
--
1.8.3.1