Re: [PATCH kvm-unit-tests 00/15] arm64: initial drop
Andrew Jones drjo...@redhat.com writes: On Fri, Dec 12, 2014 at 01:44:52PM +0100, Paolo Bonzini wrote: snip On 10/12/2014 20:59, Andrew Jones wrote: This series adds support for aarch64 to the kvm-unit-tests framework, bringing it to the same level as the arm support. In the process a few tweaks to the arm support were made, as one of the main goals was to share as much code as possible between the two. Patches 01 : A fix for the script runner. We need this one for arm regardless of the aarch64 support. 02-03: Fixes to the arm support. The bugs fixed weren't visible until running on aarch64. 04-07: Prep the arm framework for the bare minimal initial drop 08 : The bare minimal initial drop 09 : Add vector support to the minimal drop 10-12: Prep the arm framework for enabling the mmu on aarch64 13-14: Prep the aarch64 framework for enabling the mmu 15 : Enables the mmu on aarch64 These patches are also available here https://github.com/rhdrjones/kvm-unit-tests/tree/arm64/initial-drop snip Ping? I'd like to send more patches building on this series soon. Hopefully others (/me looks at Alex Bennee) do too :-) I am looking at them now. -- Alex Bennée -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On 05/01/2015 20:17, Marcelo Tosatti wrote: But there is no guarantee that vCPU-N has updated its pvti when vCPU-M resumes guest instruction execution. You're right. So the cost this patch removes is mainly from __getcpu (==RDTSCP?) ? Perhaps you can use Gleb's idea to stick vcpu id into version field ? Or just replace __getcpu with rdtscp. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On 05/01/2015 23:48, Marcelo Tosatti wrote: But there is no guarantee that vCPU-N has updated its pvti when vCPU-M resumes guest instruction execution. Still confused. So we can freeze all vCPUs in the host, then update pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM doesn't increment the version pre-update, and we can return completely bogus results. Yes. But then the getcpu test would fail (1-0). Even if you have an ABA situation (1-0-1), it's okay because the pvti that is fetched is the one returned by the first getcpu. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 01/20] vfio/platform: initial skeleton of VFIO support for platform devices
This patch forms the common skeleton code for platform devices support with VFIO. This will include the core functionality of VFIO_PLATFORM, however binding to the device and discovering the device resources will be done with the help of a separate file where any Linux platform bus specific code will reside. This will allow us to implement support for also discovering AMBA devices and their resources, but still reuse a large part of the VFIO_PLATFORM implementation. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 121 ++ drivers/vfio/platform/vfio_platform_private.h | 36 2 files changed, 157 insertions(+) create mode 100644 drivers/vfio/platform/vfio_platform_common.c create mode 100644 drivers/vfio/platform/vfio_platform_private.h diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c new file mode 100644 index 000..34d023b --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -0,0 +1,121 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/device.h +#include linux/iommu.h +#include linux/module.h +#include linux/mutex.h +#include linux/slab.h +#include linux/types.h +#include linux/vfio.h + +#include vfio_platform_private.h + +static void vfio_platform_release(void *device_data) +{ + module_put(THIS_MODULE); +} + +static int vfio_platform_open(void *device_data) +{ + if (!try_module_get(THIS_MODULE)) + return -ENODEV; + + return 0; +} + +static long vfio_platform_ioctl(void *device_data, + unsigned int cmd, unsigned long arg) +{ + if (cmd == VFIO_DEVICE_GET_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_GET_REGION_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_SET_IRQS) + return -EINVAL; + + else if (cmd == VFIO_DEVICE_RESET) + return -EINVAL; + + return -ENOTTY; +} + +static ssize_t vfio_platform_read(void *device_data, char __user *buf, + size_t count, loff_t *ppos) +{ + return -EINVAL; +} + +static ssize_t vfio_platform_write(void *device_data, const char __user *buf, + size_t count, loff_t *ppos) +{ + return -EINVAL; +} + +static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma) +{ + return -EINVAL; +} + +static const struct vfio_device_ops vfio_platform_ops = { + .name = vfio-platform, + .open = vfio_platform_open, + .release= vfio_platform_release, + .ioctl = vfio_platform_ioctl, + .read = vfio_platform_read, + .write = vfio_platform_write, + .mmap = vfio_platform_mmap, +}; + +int vfio_platform_probe_common(struct vfio_platform_device *vdev, + struct device *dev) +{ + struct iommu_group *group; + int ret; + + if (!vdev) + return -EINVAL; + + group = iommu_group_get(dev); + if (!group) { + pr_err(VFIO: No IOMMU group for device %s\n, vdev-name); + return -EINVAL; + } + + ret = vfio_add_group_dev(dev, vfio_platform_ops, vdev); + if (ret) { + iommu_group_put(group); + return ret; + } + + return 0; +} +EXPORT_SYMBOL_GPL(vfio_platform_probe_common); + +struct vfio_platform_device *vfio_platform_remove_common(struct device *dev) +{ + struct vfio_platform_device *vdev; + + vdev = vfio_del_group_dev(dev); + if (vdev) + iommu_group_put(dev-iommu_group); + + return vdev; +} +EXPORT_SYMBOL_GPL(vfio_platform_remove_common); diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h new file mode 100644 index 000..062b92d --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_private.h @@ -0,0 +1,36 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + *
[PATCH v11 06/20] vfio/platform: return info for bound device
A VFIO userspace driver will start by opening the VFIO device that corresponds to an IOMMU group, and will use the ioctl interface to get the basic device info, such as number of memory regions and interrupts, and their properties. This patch enables the VFIO_DEVICE_GET_INFO ioctl call. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 23 --- 1 file changed, 20 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 34d023b..862b43b 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -38,10 +38,27 @@ static int vfio_platform_open(void *device_data) static long vfio_platform_ioctl(void *device_data, unsigned int cmd, unsigned long arg) { - if (cmd == VFIO_DEVICE_GET_INFO) - return -EINVAL; + struct vfio_platform_device *vdev = device_data; + unsigned long minsz; + + if (cmd == VFIO_DEVICE_GET_INFO) { + struct vfio_device_info info; + + minsz = offsetofend(struct vfio_device_info, num_irqs); + + if (copy_from_user(info, (void __user *)arg, minsz)) + return -EFAULT; + + if (info.argsz minsz) + return -EINVAL; + + info.flags = vdev-flags; + info.num_regions = 0; + info.num_irqs = 0; + + return copy_to_user((void __user *)arg, info, minsz); - else if (cmd == VFIO_DEVICE_GET_REGION_INFO) + } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) return -EINVAL; else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 16/20] vfio: add local lock for virqfd instead of depending on VFIO PCI
The Virqfd code needs to keep accesses to any struct *virqfd safe, but this comes into play only when creating or destroying eventfds, so sharing the same spinlock with the VFIO bus driver is not necessary. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/vfio_pci_intrs.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index a5378d5..b35bc16 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -44,6 +44,7 @@ struct virqfd { }; static struct workqueue_struct *vfio_irqfd_cleanup_wq; +DEFINE_SPINLOCK(virqfd_lock); int __init vfio_virqfd_init(void) { @@ -80,21 +81,21 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) if (flags POLLHUP) { unsigned long flags; - spin_lock_irqsave(virqfd-vdev-irqlock, flags); + spin_lock_irqsave(virqfd_lock, flags); /* * The eventfd is closing, if the virqfd has not yet been * queued for release, as determined by testing whether the -* vdev pointer to it is still valid, queue it now. As +* virqfd pointer to it is still valid, queue it now. As * with kvm irqfds, we know we won't race against the virqfd -* going away because we hold wqh-lock to get here. +* going away because we hold the lock to get here. */ if (*(virqfd-pvirqfd) == virqfd) { *(virqfd-pvirqfd) = NULL; virqfd_deactivate(virqfd); } - spin_unlock_irqrestore(virqfd-vdev-irqlock, flags); + spin_unlock_irqrestore(virqfd_lock, flags); } return 0; @@ -170,16 +171,16 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev, * we update the pointer to the virqfd under lock to avoid * pushing multiple jobs to release the same virqfd. */ - spin_lock_irq(vdev-irqlock); + spin_lock_irq(virqfd_lock); if (*pvirqfd) { - spin_unlock_irq(vdev-irqlock); + spin_unlock_irq(virqfd_lock); ret = -EBUSY; goto err_busy; } *pvirqfd = virqfd; - spin_unlock_irq(vdev-irqlock); + spin_unlock_irq(virqfd_lock); /* * Install our own custom wake-up handling so we are notified via @@ -217,18 +218,18 @@ err_fd: } EXPORT_SYMBOL_GPL(vfio_virqfd_enable); -void vfio_virqfd_disable(struct vfio_pci_device *vdev, struct virqfd **pvirqfd) +void vfio_virqfd_disable(struct virqfd **pvirqfd) { unsigned long flags; - spin_lock_irqsave(vdev-irqlock, flags); + spin_lock_irqsave(virqfd_lock, flags); if (*pvirqfd) { virqfd_deactivate(*pvirqfd); *pvirqfd = NULL; } - spin_unlock_irqrestore(vdev-irqlock, flags); + spin_unlock_irqrestore(virqfd_lock, flags); /* * Block until we know all outstanding shutdown jobs have completed. @@ -441,8 +442,8 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd) static void vfio_intx_disable(struct vfio_pci_device *vdev) { vfio_intx_set_signal(vdev, -1); - vfio_virqfd_disable(vdev, vdev-ctx[0].unmask); - vfio_virqfd_disable(vdev, vdev-ctx[0].mask); + vfio_virqfd_disable(vdev-ctx[0].unmask); + vfio_virqfd_disable(vdev-ctx[0].mask); vdev-irq_type = VFIO_PCI_NUM_IRQS; vdev-num_ctx = 0; kfree(vdev-ctx); @@ -606,8 +607,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix) vfio_msi_set_block(vdev, 0, vdev-num_ctx, NULL, msix); for (i = 0; i vdev-num_ctx; i++) { - vfio_virqfd_disable(vdev, vdev-ctx[i].unmask); - vfio_virqfd_disable(vdev, vdev-ctx[i].mask); + vfio_virqfd_disable(vdev-ctx[i].unmask); + vfio_virqfd_disable(vdev-ctx[i].mask); } if (msix) { @@ -645,7 +646,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev, vfio_send_intx_eventfd, NULL, vdev-ctx[0].unmask, fd); - vfio_virqfd_disable(vdev, vdev-ctx[0].unmask); + vfio_virqfd_disable(vdev-ctx[0].unmask); } return 0; -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 13/20] vfio/platform: support for level sensitive interrupts
Level sensitive interrupts are exposed as maskable and automasked interrupts and are masked and disabled automatically when they fire. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_irq.c | 99 ++- drivers/vfio/platform/vfio_platform_private.h | 2 + 2 files changed, 98 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index 4896581..b3602cb 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -23,12 +23,59 @@ #include vfio_platform_private.h +static void vfio_platform_mask(struct vfio_platform_irq *irq_ctx) +{ + unsigned long flags; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (!irq_ctx-masked) { + disable_irq_nosync(irq_ctx-hwirq); + irq_ctx-masked = true; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); +} + static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev, unsigned index, unsigned start, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + if (start != 0 || count != 1) + return -EINVAL; + + if (!(vdev-irqs[index].flags VFIO_IRQ_INFO_MASKABLE)) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) + return -EINVAL; /* not implemented yet */ + + if (flags VFIO_IRQ_SET_DATA_NONE) { + vfio_platform_mask(vdev-irqs[index]); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t mask = *(uint8_t *)data; + + if (mask) + vfio_platform_mask(vdev-irqs[index]); + } + + return 0; +} + +static void vfio_platform_unmask(struct vfio_platform_irq *irq_ctx) +{ + unsigned long flags; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (irq_ctx-masked) { + enable_irq(irq_ctx-hwirq); + irq_ctx-masked = false; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); } static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, @@ -36,7 +83,50 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + if (start != 0 || count != 1) + return -EINVAL; + + if (!(vdev-irqs[index].flags VFIO_IRQ_INFO_MASKABLE)) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) + return -EINVAL; /* not implemented yet */ + + if (flags VFIO_IRQ_SET_DATA_NONE) { + vfio_platform_unmask(vdev-irqs[index]); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t unmask = *(uint8_t *)data; + + if (unmask) + vfio_platform_unmask(vdev-irqs[index]); + } + + return 0; +} + +static irqreturn_t vfio_automasked_irq_handler(int irq, void *dev_id) +{ + struct vfio_platform_irq *irq_ctx = dev_id; + unsigned long flags; + int ret = IRQ_NONE; + + spin_lock_irqsave(irq_ctx-lock, flags); + + if (!irq_ctx-masked) { + ret = IRQ_HANDLED; + + /* automask maskable interrupts */ + disable_irq_nosync(irq_ctx-hwirq); + irq_ctx-masked = true; + } + + spin_unlock_irqrestore(irq_ctx-lock, flags); + + if (ret == IRQ_HANDLED) + eventfd_signal(irq_ctx-trigger, 1); + + return ret; } static irqreturn_t vfio_irq_handler(int irq, void *dev_id) @@ -106,7 +196,7 @@ static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, irq_handler_t handler; if (vdev-irqs[index].flags VFIO_IRQ_INFO_AUTOMASKED) - return -EINVAL; /* not implemented */ + handler = vfio_automasked_irq_handler; else handler = vfio_irq_handler; @@ -178,6 +268,8 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) if (hwirq 0) goto err; + spin_lock_init(vdev-irqs[i].lock); + vdev-irqs[i].flags = VFIO_IRQ_INFO_EVENTFD; if (irq_get_trigger_type(hwirq) IRQ_TYPE_LEVEL_MASK) @@ -186,6 +278,7 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) vdev-irqs[i].count = 1; vdev-irqs[i].hwirq = hwirq; + vdev-irqs[i].masked = false; } vdev-num_irqs = cnt; diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h index b705f17..eef6d1b 100644 --- a/drivers/vfio/platform/vfio_platform_private.h
[PATCH v11 20/20] vfio/platform: implement IRQ masking/unmasking via an eventfd
With this patch the VFIO user will be able to set an eventfd that can be used in order to mask and unmask IRQs of platform devices. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_irq.c | 47 --- drivers/vfio/platform/vfio_platform_private.h | 2 ++ 2 files changed, 45 insertions(+), 4 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index b3602cb..6ade36b 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -37,6 +37,15 @@ static void vfio_platform_mask(struct vfio_platform_irq *irq_ctx) spin_unlock_irqrestore(irq_ctx-lock, flags); } +static int vfio_platform_mask_handler(void *opaque, void *unused) +{ + struct vfio_platform_irq *irq_ctx = opaque; + + vfio_platform_mask(irq_ctx); + + return 0; +} + static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev, unsigned index, unsigned start, unsigned count, uint32_t flags, @@ -48,8 +57,18 @@ static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev, if (!(vdev-irqs[index].flags VFIO_IRQ_INFO_MASKABLE)) return -EINVAL; - if (flags VFIO_IRQ_SET_DATA_EVENTFD) - return -EINVAL; /* not implemented yet */ + if (flags VFIO_IRQ_SET_DATA_EVENTFD) { + int32_t fd = *(int32_t *)data; + + if (fd = 0) + return vfio_virqfd_enable((void *) vdev-irqs[index], + vfio_platform_mask_handler, + NULL, NULL, + vdev-irqs[index].mask, fd); + + vfio_virqfd_disable(vdev-irqs[index].mask); + return 0; + } if (flags VFIO_IRQ_SET_DATA_NONE) { vfio_platform_mask(vdev-irqs[index]); @@ -78,6 +97,15 @@ static void vfio_platform_unmask(struct vfio_platform_irq *irq_ctx) spin_unlock_irqrestore(irq_ctx-lock, flags); } +static int vfio_platform_unmask_handler(void *opaque, void *unused) +{ + struct vfio_platform_irq *irq_ctx = opaque; + + vfio_platform_unmask(irq_ctx); + + return 0; +} + static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, unsigned index, unsigned start, unsigned count, uint32_t flags, @@ -89,8 +117,19 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, if (!(vdev-irqs[index].flags VFIO_IRQ_INFO_MASKABLE)) return -EINVAL; - if (flags VFIO_IRQ_SET_DATA_EVENTFD) - return -EINVAL; /* not implemented yet */ + if (flags VFIO_IRQ_SET_DATA_EVENTFD) { + int32_t fd = *(int32_t *)data; + + if (fd = 0) + return vfio_virqfd_enable((void *) vdev-irqs[index], + vfio_platform_unmask_handler, + NULL, NULL, + vdev-irqs[index].unmask, + fd); + + vfio_virqfd_disable(vdev-irqs[index].unmask); + return 0; + } if (flags VFIO_IRQ_SET_DATA_NONE) { vfio_platform_unmask(vdev-irqs[index]); diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h index eef6d1b..c3d5b4b 100644 --- a/drivers/vfio/platform/vfio_platform_private.h +++ b/drivers/vfio/platform/vfio_platform_private.h @@ -32,6 +32,8 @@ struct vfio_platform_irq { struct eventfd_ctx *trigger; boolmasked; spinlock_t lock; + struct virqfd *unmask; + struct virqfd *mask; }; struct vfio_platform_region { -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 18/20] vfio: move eventfd support code for VFIO_PCI to a separate file
The virqfd functionality that is used by VFIO_PCI to implement interrupt masking and unmasking via an eventfd, is generic enough and can be reused by another driver. Move it to a separate file in order to allow the code to be shared. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/Makefile | 3 +- drivers/vfio/pci/vfio_pci_intrs.c | 215 drivers/vfio/pci/vfio_pci_private.h | 3 - drivers/vfio/virqfd.c | 213 +++ include/linux/vfio.h| 27 + 5 files changed, 242 insertions(+), 219 deletions(-) create mode 100644 drivers/vfio/virqfd.c diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 1310792..c7c8644 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -1,4 +1,5 @@ -vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o +vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o \ + ../virqfd.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index 5b5fc23..de4befc 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -19,228 +19,13 @@ #include linux/msi.h #include linux/pci.h #include linux/file.h -#include linux/poll.h #include linux/vfio.h #include linux/wait.h -#include linux/workqueue.h #include linux/slab.h #include vfio_pci_private.h /* - * IRQfd - generic - */ -struct virqfd { - void*opaque; - struct eventfd_ctx *eventfd; - int (*handler)(void *, void *); - void(*thread)(void *, void *); - void*data; - struct work_struct inject; - wait_queue_twait; - poll_table pt; - struct work_struct shutdown; - struct virqfd **pvirqfd; -}; - -static struct workqueue_struct *vfio_irqfd_cleanup_wq; -DEFINE_SPINLOCK(virqfd_lock); - -int __init vfio_virqfd_init(void) -{ - vfio_irqfd_cleanup_wq = - create_singlethread_workqueue(vfio-irqfd-cleanup); - if (!vfio_irqfd_cleanup_wq) - return -ENOMEM; - - return 0; -} - -void vfio_virqfd_exit(void) -{ - destroy_workqueue(vfio_irqfd_cleanup_wq); -} - -static void virqfd_deactivate(struct virqfd *virqfd) -{ - queue_work(vfio_irqfd_cleanup_wq, virqfd-shutdown); -} - -static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) -{ - struct virqfd *virqfd = container_of(wait, struct virqfd, wait); - unsigned long flags = (unsigned long)key; - - if (flags POLLIN) { - /* An event has been signaled, call function */ - if ((!virqfd-handler || -virqfd-handler(virqfd-opaque, virqfd-data)) - virqfd-thread) - schedule_work(virqfd-inject); - } - - if (flags POLLHUP) { - unsigned long flags; - spin_lock_irqsave(virqfd_lock, flags); - - /* -* The eventfd is closing, if the virqfd has not yet been -* queued for release, as determined by testing whether the -* virqfd pointer to it is still valid, queue it now. As -* with kvm irqfds, we know we won't race against the virqfd -* going away because we hold the lock to get here. -*/ - if (*(virqfd-pvirqfd) == virqfd) { - *(virqfd-pvirqfd) = NULL; - virqfd_deactivate(virqfd); - } - - spin_unlock_irqrestore(virqfd_lock, flags); - } - - return 0; -} - -static void virqfd_ptable_queue_proc(struct file *file, -wait_queue_head_t *wqh, poll_table *pt) -{ - struct virqfd *virqfd = container_of(pt, struct virqfd, pt); - add_wait_queue(wqh, virqfd-wait); -} - -static void virqfd_shutdown(struct work_struct *work) -{ - struct virqfd *virqfd = container_of(work, struct virqfd, shutdown); - u64 cnt; - - eventfd_ctx_remove_wait_queue(virqfd-eventfd, virqfd-wait, cnt); - flush_work(virqfd-inject); - eventfd_ctx_put(virqfd-eventfd); - - kfree(virqfd); -} - -static void virqfd_inject(struct work_struct *work) -{ - struct virqfd *virqfd = container_of(work, struct virqfd, inject); - if (virqfd-thread) - virqfd-thread(virqfd-opaque, virqfd-data); -} - -int vfio_virqfd_enable(void *opaque, - int (*handler)(void *, void *), - void (*thread)(void *, void *), - void *data, struct virqfd **pvirqfd, int fd) -{ - struct fd irqfd; - struct eventfd_ctx *ctx; - struct virqfd *virqfd; - int ret
[PATCH v11 17/20] vfio: pass an opaque pointer on virqfd initialization
VFIO_PCI passes the VFIO device structure *vdev via eventfd to the handler that implements masking/unmasking of IRQs via an eventfd. We can replace it in the virqfd infrastructure with an opaque type so we can make use of the mechanism from other VFIO bus drivers. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/vfio_pci_intrs.c | 30 -- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index b35bc16..5b5fc23 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -31,10 +31,10 @@ * IRQfd - generic */ struct virqfd { - struct vfio_pci_device *vdev; + void*opaque; struct eventfd_ctx *eventfd; - int (*handler)(struct vfio_pci_device *, void *); - void(*thread)(struct vfio_pci_device *, void *); + int (*handler)(void *, void *); + void(*thread)(void *, void *); void*data; struct work_struct inject; wait_queue_twait; @@ -74,7 +74,7 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) if (flags POLLIN) { /* An event has been signaled, call function */ if ((!virqfd-handler || -virqfd-handler(virqfd-vdev, virqfd-data)) +virqfd-handler(virqfd-opaque, virqfd-data)) virqfd-thread) schedule_work(virqfd-inject); } @@ -124,12 +124,12 @@ static void virqfd_inject(struct work_struct *work) { struct virqfd *virqfd = container_of(work, struct virqfd, inject); if (virqfd-thread) - virqfd-thread(virqfd-vdev, virqfd-data); + virqfd-thread(virqfd-opaque, virqfd-data); } -int vfio_virqfd_enable(struct vfio_pci_device *vdev, - int (*handler)(struct vfio_pci_device *, void *), - void (*thread)(struct vfio_pci_device *, void *), +int vfio_virqfd_enable(void *opaque, + int (*handler)(void *, void *), + void (*thread)(void *, void *), void *data, struct virqfd **pvirqfd, int fd) { struct fd irqfd; @@ -143,7 +143,7 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev, return -ENOMEM; virqfd-pvirqfd = pvirqfd; - virqfd-vdev = vdev; + virqfd-opaque = opaque; virqfd-handler = handler; virqfd-thread = thread; virqfd-data = data; @@ -196,7 +196,7 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev, * before we registered and trigger it as if we didn't miss it. */ if (events POLLIN) { - if ((!handler || handler(vdev, data)) thread) + if ((!handler || handler(opaque, data)) thread) schedule_work(virqfd-inject); } @@ -243,8 +243,10 @@ EXPORT_SYMBOL_GPL(vfio_virqfd_disable); /* * INTx */ -static void vfio_send_intx_eventfd(struct vfio_pci_device *vdev, void *unused) +static void vfio_send_intx_eventfd(void *opaque, void *unused) { + struct vfio_pci_device *vdev = opaque; + if (likely(is_intx(vdev) !vdev-virq_disabled)) eventfd_signal(vdev-ctx[0].trigger, 1); } @@ -287,9 +289,9 @@ void vfio_pci_intx_mask(struct vfio_pci_device *vdev) * a signal is necessary, which can then be handled via a work queue * or directly depending on the caller. */ -static int vfio_pci_intx_unmask_handler(struct vfio_pci_device *vdev, - void *unused) +static int vfio_pci_intx_unmask_handler(void *opaque, void *unused) { + struct vfio_pci_device *vdev = opaque; struct pci_dev *pdev = vdev-pdev; unsigned long flags; int ret = 0; @@ -641,7 +643,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev, } else if (flags VFIO_IRQ_SET_DATA_EVENTFD) { int32_t fd = *(int32_t *)data; if (fd = 0) - return vfio_virqfd_enable(vdev, + return vfio_virqfd_enable((void *) vdev, vfio_pci_intx_unmask_handler, vfio_send_intx_eventfd, NULL, vdev-ctx[0].unmask, fd); -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 12/20] vfio/platform: trigger an interrupt via eventfd
This patch allows to set an eventfd for a platform device's interrupt, and also to trigger the interrupt eventfd from userspace for testing. Level sensitive interrupts are marked as maskable and are handled in a later patch. Edge triggered interrupts are not advertised as maskable and are implemented here using a simple and efficient IRQ handler. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_irq.c | 102 +- drivers/vfio/platform/vfio_platform_private.h | 2 + 2 files changed, 102 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index df5c919..4896581 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -39,12 +39,100 @@ static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, return -EINVAL; } +static irqreturn_t vfio_irq_handler(int irq, void *dev_id) +{ + struct vfio_platform_irq *irq_ctx = dev_id; + + eventfd_signal(irq_ctx-trigger, 1); + + return IRQ_HANDLED; +} + +static int vfio_set_trigger(struct vfio_platform_device *vdev, int index, + int fd, irq_handler_t handler) +{ + struct vfio_platform_irq *irq = vdev-irqs[index]; + struct eventfd_ctx *trigger; + unsigned long flags; + int ret; + + if (irq-trigger) { + free_irq(irq-hwirq, irq); + kfree(irq-name); + eventfd_ctx_put(irq-trigger); + irq-trigger = NULL; + } + + if (fd 0) /* Disable only */ + return 0; + + irq-name = kasprintf(GFP_KERNEL, vfio-irq[%d](%s), + irq-hwirq, vdev-name); + if (!irq-name) + return -ENOMEM; + + trigger = eventfd_ctx_fdget(fd); + if (IS_ERR(trigger)) { + kfree(irq-name); + return PTR_ERR(trigger); + } + + irq-trigger = trigger; + + ret = request_irq(irq-hwirq, handler, 0, irq-name, irq); + if (ret) { + kfree(irq-name); + eventfd_ctx_put(trigger); + irq-trigger = NULL; + return ret; + } + + /* if the IRQ has been masked by the user before setting an eventfd, +* then we need to make sure it is properly disabled */ + spin_lock_irqsave(irq-lock, flags); + if (irq-masked) + disable_irq_nosync(irq-hwirq); + spin_unlock_irqrestore(irq-lock, flags); + + return 0; +} + static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, unsigned index, unsigned start, unsigned count, uint32_t flags, void *data) { - return -EINVAL; + struct vfio_platform_irq *irq = vdev-irqs[index]; + irq_handler_t handler; + + if (vdev-irqs[index].flags VFIO_IRQ_INFO_AUTOMASKED) + return -EINVAL; /* not implemented */ + else + handler = vfio_irq_handler; + + if (!count (flags VFIO_IRQ_SET_DATA_NONE)) + return vfio_set_trigger(vdev, index, -1, handler); + + if (start != 0 || count != 1) + return -EINVAL; + + if (flags VFIO_IRQ_SET_DATA_EVENTFD) { + int32_t fd = *(int32_t *)data; + + return vfio_set_trigger(vdev, index, fd, handler); + } + + if (flags VFIO_IRQ_SET_DATA_NONE) { + handler(irq-hwirq, irq); + + } else if (flags VFIO_IRQ_SET_DATA_BOOL) { + uint8_t trigger = *(uint8_t *)data; + + if (trigger) + handler(irq-hwirq, irq); + } + + return 0; } int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev, @@ -90,7 +178,12 @@ int vfio_platform_irq_init(struct vfio_platform_device *vdev) if (hwirq 0) goto err; - vdev-irqs[i].flags = 0; + vdev-irqs[i].flags = VFIO_IRQ_INFO_EVENTFD; + + if (irq_get_trigger_type(hwirq) IRQ_TYPE_LEVEL_MASK) + vdev-irqs[i].flags |= VFIO_IRQ_INFO_MASKABLE + | VFIO_IRQ_INFO_AUTOMASKED; + vdev-irqs[i].count = 1; vdev-irqs[i].hwirq = hwirq; } @@ -105,6 +198,11 @@ err: void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev) { + int i; + + for (i = 0; i vdev-num_irqs; i++) + vfio_set_trigger(vdev, i, -1, NULL); + vdev-num_irqs = 0; kfree(vdev-irqs); } diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h index 61f3ed4..b705f17 100644 --- a/drivers/vfio/platform/vfio_platform_private.h +++
[PATCH v11 14/20] vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export
We want to reuse virqfd functionality in multiple VFIO drivers; before moving these functions to core VFIO, add the vfio_ prefix to the virqfd_enable and virqfd_disable functions, and export them so they can be used from other modules. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/vfio_pci_intrs.c | 30 -- drivers/vfio/pci/vfio_pci_private.h | 4 ++-- 2 files changed, 18 insertions(+), 16 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index e8d695b..0a41833d 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -126,10 +126,10 @@ static void virqfd_inject(struct work_struct *work) virqfd-thread(virqfd-vdev, virqfd-data); } -static int virqfd_enable(struct vfio_pci_device *vdev, -int (*handler)(struct vfio_pci_device *, void *), -void (*thread)(struct vfio_pci_device *, void *), -void *data, struct virqfd **pvirqfd, int fd) +int vfio_virqfd_enable(struct vfio_pci_device *vdev, + int (*handler)(struct vfio_pci_device *, void *), + void (*thread)(struct vfio_pci_device *, void *), + void *data, struct virqfd **pvirqfd, int fd) { struct fd irqfd; struct eventfd_ctx *ctx; @@ -215,9 +215,9 @@ err_fd: return ret; } +EXPORT_SYMBOL_GPL(vfio_virqfd_enable); -static void virqfd_disable(struct vfio_pci_device *vdev, - struct virqfd **pvirqfd) +void vfio_virqfd_disable(struct vfio_pci_device *vdev, struct virqfd **pvirqfd) { unsigned long flags; @@ -237,6 +237,7 @@ static void virqfd_disable(struct vfio_pci_device *vdev, */ flush_workqueue(vfio_irqfd_cleanup_wq); } +EXPORT_SYMBOL_GPL(vfio_virqfd_disable); /* * INTx @@ -440,8 +441,8 @@ static int vfio_intx_set_signal(struct vfio_pci_device *vdev, int fd) static void vfio_intx_disable(struct vfio_pci_device *vdev) { vfio_intx_set_signal(vdev, -1); - virqfd_disable(vdev, vdev-ctx[0].unmask); - virqfd_disable(vdev, vdev-ctx[0].mask); + vfio_virqfd_disable(vdev, vdev-ctx[0].unmask); + vfio_virqfd_disable(vdev, vdev-ctx[0].mask); vdev-irq_type = VFIO_PCI_NUM_IRQS; vdev-num_ctx = 0; kfree(vdev-ctx); @@ -605,8 +606,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, bool msix) vfio_msi_set_block(vdev, 0, vdev-num_ctx, NULL, msix); for (i = 0; i vdev-num_ctx; i++) { - virqfd_disable(vdev, vdev-ctx[i].unmask); - virqfd_disable(vdev, vdev-ctx[i].mask); + vfio_virqfd_disable(vdev, vdev-ctx[i].unmask); + vfio_virqfd_disable(vdev, vdev-ctx[i].mask); } if (msix) { @@ -639,11 +640,12 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device *vdev, } else if (flags VFIO_IRQ_SET_DATA_EVENTFD) { int32_t fd = *(int32_t *)data; if (fd = 0) - return virqfd_enable(vdev, vfio_pci_intx_unmask_handler, -vfio_send_intx_eventfd, NULL, -vdev-ctx[0].unmask, fd); + return vfio_virqfd_enable(vdev, + vfio_pci_intx_unmask_handler, + vfio_send_intx_eventfd, NULL, + vdev-ctx[0].unmask, fd); - virqfd_disable(vdev, vdev-ctx[0].unmask); + vfio_virqfd_disable(vdev, vdev-ctx[0].unmask); } return 0; diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h index 671c17a..2e2f0ea 100644 --- a/drivers/vfio/pci/vfio_pci_private.h +++ b/drivers/vfio/pci/vfio_pci_private.h @@ -86,8 +86,8 @@ extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf, extern int vfio_pci_init_perm_bits(void); extern void vfio_pci_uninit_perm_bits(void); -extern int vfio_pci_virqfd_init(void); -extern void vfio_pci_virqfd_exit(void); +extern int vfio_virqfd_init(void); +extern void vfio_virqfd_exit(void); extern int vfio_config_init(struct vfio_pci_device *vdev); extern void vfio_config_free(struct vfio_pci_device *vdev); -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 11/20] vfio/platform: initial interrupts support code
This patch is a skeleton for the VFIO_DEVICE_SET_IRQS IOCTL, around which most IRQ functionality is implemented in VFIO. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 52 +-- drivers/vfio/platform/vfio_platform_irq.c | 59 +++ drivers/vfio/platform/vfio_platform_private.h | 7 3 files changed, 115 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index cf7bb08..a532a25 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -204,10 +204,54 @@ static long vfio_platform_ioctl(void *device_data, return copy_to_user((void __user *)arg, info, minsz); - } else if (cmd == VFIO_DEVICE_SET_IRQS) - return -EINVAL; + } else if (cmd == VFIO_DEVICE_SET_IRQS) { + struct vfio_irq_set hdr; + u8 *data = NULL; + int ret = 0; + + minsz = offsetofend(struct vfio_irq_set, count); + + if (copy_from_user(hdr, (void __user *)arg, minsz)) + return -EFAULT; + + if (hdr.argsz minsz) + return -EINVAL; + + if (hdr.index = vdev-num_irqs) + return -EINVAL; + + if (hdr.flags ~(VFIO_IRQ_SET_DATA_TYPE_MASK | + VFIO_IRQ_SET_ACTION_TYPE_MASK)) + return -EINVAL; - else if (cmd == VFIO_DEVICE_RESET) + if (!(hdr.flags VFIO_IRQ_SET_DATA_NONE)) { + size_t size; + + if (hdr.flags VFIO_IRQ_SET_DATA_BOOL) + size = sizeof(uint8_t); + else if (hdr.flags VFIO_IRQ_SET_DATA_EVENTFD) + size = sizeof(int32_t); + else + return -EINVAL; + + if (hdr.argsz - minsz size) + return -EINVAL; + + data = memdup_user((void __user *)(arg + minsz), size); + if (IS_ERR(data)) + return PTR_ERR(data); + } + + mutex_lock(vdev-igate); + + ret = vfio_platform_set_irqs_ioctl(vdev, hdr.flags, hdr.index, + hdr.start, hdr.count, data); + mutex_unlock(vdev-igate); + kfree(data); + + return ret; + + } else if (cmd == VFIO_DEVICE_RESET) return -EINVAL; return -ENOTTY; @@ -457,6 +501,8 @@ int vfio_platform_probe_common(struct vfio_platform_device *vdev, return ret; } + mutex_init(vdev-igate); + return 0; } EXPORT_SYMBOL_GPL(vfio_platform_probe_common); diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c index c6c3ec1..df5c919 100644 --- a/drivers/vfio/platform/vfio_platform_irq.c +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -23,6 +23,56 @@ #include vfio_platform_private.h +static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev, + unsigned index, unsigned start, + unsigned count, uint32_t flags, + void *data) +{ + return -EINVAL; +} + +static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev, + unsigned index, unsigned start, + unsigned count, uint32_t flags, + void *data) +{ + return -EINVAL; +} + +static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev, +unsigned index, unsigned start, +unsigned count, uint32_t flags, +void *data) +{ + return -EINVAL; +} + +int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev, +uint32_t flags, unsigned index, unsigned start, +unsigned count, void *data) +{ + int (*func)(struct vfio_platform_device *vdev, unsigned index, + unsigned start, unsigned count, uint32_t flags, + void *data) = NULL; + + switch (flags VFIO_IRQ_SET_ACTION_TYPE_MASK) { + case VFIO_IRQ_SET_ACTION_MASK: + func = vfio_platform_set_irq_mask; + break; + case VFIO_IRQ_SET_ACTION_UNMASK: + func = vfio_platform_set_irq_unmask; + break; + case VFIO_IRQ_SET_ACTION_TRIGGER: + func = vfio_platform_set_irq_trigger; + break; + }
[PATCH v11 08/20] vfio/platform: read and write support for the device fd
VFIO returns a file descriptor which we can use to manipulate the memory regions of the device. Usually, the user will mmap memory regions that are addressable on page boundaries, however for memory regions where this is not the case we cannot provide mmap functionality due to security concerns. For this reason we also allow to use read and write functions to the file descriptor pointing to the memory regions. We implement this functionality only for MMIO regions of platform devices; PIO regions are not being handled at this point. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 150 ++ drivers/vfio/platform/vfio_platform_private.h | 1 + 2 files changed, 151 insertions(+) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 2a4613c..fda4c30 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -50,6 +50,10 @@ static int vfio_platform_regions_init(struct vfio_platform_device *vdev) switch (resource_type(res)) { case IORESOURCE_MEM: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_MMIO; + vdev-regions[i].flags |= VFIO_REGION_INFO_FLAG_READ; + if (!(res-flags IORESOURCE_READONLY)) + vdev-regions[i].flags |= + VFIO_REGION_INFO_FLAG_WRITE; break; case IORESOURCE_IO: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; @@ -69,6 +73,11 @@ err: static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev) { + int i; + + for (i = 0; i vdev-num_regions; i++) + iounmap(vdev-regions[i].ioaddr); + vdev-num_regions = 0; kfree(vdev-regions); } @@ -171,15 +180,156 @@ static long vfio_platform_ioctl(void *device_data, return -ENOTTY; } +static ssize_t vfio_platform_read_mmio(struct vfio_platform_region reg, + char __user *buf, size_t count, + loff_t off) +{ + unsigned int done = 0; + + if (!reg.ioaddr) { + reg.ioaddr = + ioremap_nocache(reg.addr, reg.size); + + if (!reg.ioaddr) + return -ENOMEM; + } + + while (count) { + size_t filled; + + if (count = 4 !(off % 4)) { + u32 val; + + val = ioread32(reg.ioaddr + off); + if (copy_to_user(buf, val, 4)) + goto err; + + filled = 4; + } else if (count = 2 !(off % 2)) { + u16 val; + + val = ioread16(reg.ioaddr + off); + if (copy_to_user(buf, val, 2)) + goto err; + + filled = 2; + } else { + u8 val; + + val = ioread8(reg.ioaddr + off); + if (copy_to_user(buf, val, 1)) + goto err; + + filled = 1; + } + + + count -= filled; + done += filled; + off += filled; + buf += filled; + } + + return done; +err: + return -EFAULT; +} + static ssize_t vfio_platform_read(void *device_data, char __user *buf, size_t count, loff_t *ppos) { + struct vfio_platform_device *vdev = device_data; + unsigned int index = VFIO_PLATFORM_OFFSET_TO_INDEX(*ppos); + loff_t off = *ppos VFIO_PLATFORM_OFFSET_MASK; + + if (index = vdev-num_regions) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_READ)) + return -EINVAL; + + if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_MMIO) + return vfio_platform_read_mmio(vdev-regions[index], + buf, count, off); + else if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_PIO) + return -EINVAL; /* not implemented */ + return -EINVAL; } +static ssize_t vfio_platform_write_mmio(struct vfio_platform_region reg, + const char __user *buf, size_t count, + loff_t off) +{ + unsigned int done = 0; + + if (!reg.ioaddr) { + reg.ioaddr = + ioremap_nocache(reg.addr, reg.size); + + if (!reg.ioaddr) + return -ENOMEM; + } + + while (count) { + size_t filled; + + if (count = 4 !(off % 4)) { +
[PATCH v11 10/20] vfio/platform: return IRQ info
Return information for the interrupts exposed by the device. This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs and enables VFIO_DEVICE_GET_IRQ_INFO. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/Makefile| 2 +- drivers/vfio/platform/vfio_platform_common.c | 31 +--- drivers/vfio/platform/vfio_platform_irq.c | 51 +++ drivers/vfio/platform/vfio_platform_private.h | 10 ++ 4 files changed, 89 insertions(+), 5 deletions(-) create mode 100644 drivers/vfio/platform/vfio_platform_irq.c diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile index 1957170..81de144 100644 --- a/drivers/vfio/platform/Makefile +++ b/drivers/vfio/platform/Makefile @@ -1,5 +1,5 @@ -vfio-platform-y := vfio_platform.o vfio_platform_common.o +vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 6bf78ee..cf7bb08 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -100,6 +100,7 @@ static void vfio_platform_release(void *device_data) if (!(--vdev-refcnt)) { vfio_platform_regions_cleanup(vdev); + vfio_platform_irq_cleanup(vdev); } mutex_unlock(driver_lock); @@ -121,6 +122,10 @@ static int vfio_platform_open(void *device_data) ret = vfio_platform_regions_init(vdev); if (ret) goto err_reg; + + ret = vfio_platform_irq_init(vdev); + if (ret) + goto err_irq; } vdev-refcnt++; @@ -128,6 +133,8 @@ static int vfio_platform_open(void *device_data) mutex_unlock(driver_lock); return 0; +err_irq: + vfio_platform_regions_cleanup(vdev); err_reg: mutex_unlock(driver_lock); module_put(THIS_MODULE); @@ -153,7 +160,7 @@ static long vfio_platform_ioctl(void *device_data, info.flags = vdev-flags; info.num_regions = vdev-num_regions; - info.num_irqs = 0; + info.num_irqs = vdev-num_irqs; return copy_to_user((void __user *)arg, info, minsz); @@ -178,10 +185,26 @@ static long vfio_platform_ioctl(void *device_data, return copy_to_user((void __user *)arg, info, minsz); - } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) - return -EINVAL; + } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) { + struct vfio_irq_info info; + + minsz = offsetofend(struct vfio_irq_info, count); + + if (copy_from_user(info, (void __user *)arg, minsz)) + return -EFAULT; + + if (info.argsz minsz) + return -EINVAL; + + if (info.index = vdev-num_irqs) + return -EINVAL; + + info.flags = vdev-irqs[info.index].flags; + info.count = vdev-irqs[info.index].count; + + return copy_to_user((void __user *)arg, info, minsz); - else if (cmd == VFIO_DEVICE_SET_IRQS) + } else if (cmd == VFIO_DEVICE_SET_IRQS) return -EINVAL; else if (cmd == VFIO_DEVICE_RESET) diff --git a/drivers/vfio/platform/vfio_platform_irq.c b/drivers/vfio/platform/vfio_platform_irq.c new file mode 100644 index 000..c6c3ec1 --- /dev/null +++ b/drivers/vfio/platform/vfio_platform_irq.c @@ -0,0 +1,51 @@ +/* + * VFIO platform devices interrupt handling + * + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/eventfd.h +#include linux/interrupt.h +#include linux/slab.h +#include linux/types.h +#include linux/vfio.h +#include linux/irq.h + +#include vfio_platform_private.h + +int vfio_platform_irq_init(struct vfio_platform_device *vdev) +{ + int cnt = 0, i; + + while (vdev-get_irq(vdev, cnt) = 0) + cnt++; + + vdev-irqs = kcalloc(cnt, sizeof(struct vfio_platform_irq), GFP_KERNEL); + if (!vdev-irqs) + return -ENOMEM; + + for (i = 0; i cnt; i++) { + vdev-irqs[i].flags = 0; + vdev-irqs[i].count = 1; + } + + vdev-num_irqs = cnt; + + return 0; +} + +void vfio_platform_irq_cleanup(struct
[PATCH v11 09/20] vfio/platform: support MMAP of MMIO regions
Allow to memory map the MMIO regions of the device so userspace can directly access them. PIO regions are not being handled at this point. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 65 1 file changed, 65 insertions(+) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index fda4c30..6bf78ee 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -54,6 +54,16 @@ static int vfio_platform_regions_init(struct vfio_platform_device *vdev) if (!(res-flags IORESOURCE_READONLY)) vdev-regions[i].flags |= VFIO_REGION_INFO_FLAG_WRITE; + + /* +* Only regions addressed with PAGE granularity may be +* MMAPed securely. +*/ + if (!(vdev-regions[i].addr ~PAGE_MASK) + !(vdev-regions[i].size ~PAGE_MASK)) + vdev-regions[i].flags |= + VFIO_REGION_INFO_FLAG_MMAP; + break; case IORESOURCE_IO: vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; @@ -333,8 +343,63 @@ static ssize_t vfio_platform_write(void *device_data, const char __user *buf, return -EINVAL; } +static int vfio_platform_mmap_mmio(struct vfio_platform_region region, + struct vm_area_struct *vma) +{ + u64 req_len, pgoff, req_start; + + req_len = vma-vm_end - vma-vm_start; + pgoff = vma-vm_pgoff + ((1U (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT)) - 1); + req_start = pgoff PAGE_SHIFT; + + if (region.size PAGE_SIZE || req_start + req_len region.size) + return -EINVAL; + + vma-vm_page_prot = pgprot_noncached(vma-vm_page_prot); + vma-vm_pgoff = (region.addr PAGE_SHIFT) + pgoff; + + return remap_pfn_range(vma, vma-vm_start, vma-vm_pgoff, + req_len, vma-vm_page_prot); +} + static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma) { + struct vfio_platform_device *vdev = device_data; + unsigned int index; + + index = vma-vm_pgoff (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT); + + if (vma-vm_end vma-vm_start) + return -EINVAL; + if (!(vma-vm_flags VM_SHARED)) + return -EINVAL; + if (index = vdev-num_regions) + return -EINVAL; + if (vma-vm_start ~PAGE_MASK) + return -EINVAL; + if (vma-vm_end ~PAGE_MASK) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_MMAP)) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_READ) +(vma-vm_flags VM_READ)) + return -EINVAL; + + if (!(vdev-regions[index].flags VFIO_REGION_INFO_FLAG_WRITE) +(vma-vm_flags VM_WRITE)) + return -EINVAL; + + vma-vm_private_data = vdev; + + if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_MMIO) + return vfio_platform_mmap_mmio(vdev-regions[index], vma); + + else if (vdev-regions[index].type VFIO_PLATFORM_REGION_TYPE_PIO) + return -EINVAL; /* not implemented */ + return -EINVAL; } -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 07/20] vfio/platform: return info for device memory mapped IO regions
This patch enables the IOCTLs VFIO_DEVICE_GET_REGION_INFO ioctl call, which allows the user to learn about the available MMIO resources of a device. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform_common.c | 106 +- drivers/vfio/platform/vfio_platform_private.h | 22 ++ 2 files changed, 124 insertions(+), 4 deletions(-) diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c index 862b43b..2a4613c 100644 --- a/drivers/vfio/platform/vfio_platform_common.c +++ b/drivers/vfio/platform/vfio_platform_common.c @@ -22,17 +22,97 @@ #include vfio_platform_private.h +static DEFINE_MUTEX(driver_lock); + +static int vfio_platform_regions_init(struct vfio_platform_device *vdev) +{ + int cnt = 0, i; + + while (vdev-get_resource(vdev, cnt)) + cnt++; + + vdev-regions = kcalloc(cnt, sizeof(struct vfio_platform_region), + GFP_KERNEL); + if (!vdev-regions) + return -ENOMEM; + + for (i = 0; i cnt; i++) { + struct resource *res = + vdev-get_resource(vdev, i); + + if (!res) + goto err; + + vdev-regions[i].addr = res-start; + vdev-regions[i].size = resource_size(res); + vdev-regions[i].flags = 0; + + switch (resource_type(res)) { + case IORESOURCE_MEM: + vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_MMIO; + break; + case IORESOURCE_IO: + vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO; + break; + default: + goto err; + } + } + + vdev-num_regions = cnt; + + return 0; +err: + kfree(vdev-regions); + return -EINVAL; +} + +static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev) +{ + vdev-num_regions = 0; + kfree(vdev-regions); +} + static void vfio_platform_release(void *device_data) { + struct vfio_platform_device *vdev = device_data; + + mutex_lock(driver_lock); + + if (!(--vdev-refcnt)) { + vfio_platform_regions_cleanup(vdev); + } + + mutex_unlock(driver_lock); + module_put(THIS_MODULE); } static int vfio_platform_open(void *device_data) { + struct vfio_platform_device *vdev = device_data; + int ret; + if (!try_module_get(THIS_MODULE)) return -ENODEV; + mutex_lock(driver_lock); + + if (!vdev-refcnt) { + ret = vfio_platform_regions_init(vdev); + if (ret) + goto err_reg; + } + + vdev-refcnt++; + + mutex_unlock(driver_lock); return 0; + +err_reg: + mutex_unlock(driver_lock); + module_put(THIS_MODULE); + return ret; } static long vfio_platform_ioctl(void *device_data, @@ -53,15 +133,33 @@ static long vfio_platform_ioctl(void *device_data, return -EINVAL; info.flags = vdev-flags; - info.num_regions = 0; + info.num_regions = vdev-num_regions; info.num_irqs = 0; return copy_to_user((void __user *)arg, info, minsz); - } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) - return -EINVAL; + } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) { + struct vfio_region_info info; + + minsz = offsetofend(struct vfio_region_info, offset); + + if (copy_from_user(info, (void __user *)arg, minsz)) + return -EFAULT; + + if (info.argsz minsz) + return -EINVAL; + + if (info.index = vdev-num_regions) + return -EINVAL; + + /* map offset to the physical address */ + info.offset = VFIO_PLATFORM_INDEX_TO_OFFSET(info.index); + info.size = vdev-regions[info.index].size; + info.flags = vdev-regions[info.index].flags; + + return copy_to_user((void __user *)arg, info, minsz); - else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) + } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) return -EINVAL; else if (cmd == VFIO_DEVICE_SET_IRQS) diff --git a/drivers/vfio/platform/vfio_platform_private.h b/drivers/vfio/platform/vfio_platform_private.h index 062b92d..b24729f 100644 --- a/drivers/vfio/platform/vfio_platform_private.h +++ b/drivers/vfio/platform/vfio_platform_private.h @@ -15,7 +15,29 @@ #ifndef VFIO_PLATFORM_PRIVATE_H #define VFIO_PLATFORM_PRIVATE_H +#define VFIO_PLATFORM_OFFSET_SHIFT 40 +#define VFIO_PLATFORM_OFFSET_MASK (((u64)(1) VFIO_PLATFORM_OFFSET_SHIFT) - 1) + +#define VFIO_PLATFORM_OFFSET_TO_INDEX(off)
[PATCH v11 03/20] vfio: platform: add the VFIO PLATFORM module to Kconfig
Enable building the VFIO PLATFORM driver that allows to use Linux platform devices with VFIO. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/Kconfig | 1 + drivers/vfio/Makefile | 1 + drivers/vfio/platform/Kconfig | 9 + drivers/vfio/platform/Makefile | 4 4 files changed, 15 insertions(+) create mode 100644 drivers/vfio/platform/Kconfig create mode 100644 drivers/vfio/platform/Makefile diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index a0abe04..962fb80 100644 --- a/drivers/vfio/Kconfig +++ b/drivers/vfio/Kconfig @@ -27,3 +27,4 @@ menuconfig VFIO If you don't know what to do here, say N. source drivers/vfio/pci/Kconfig +source drivers/vfio/platform/Kconfig diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index 0b035b1..dadf0ca 100644 --- a/drivers/vfio/Makefile +++ b/drivers/vfio/Makefile @@ -3,3 +3,4 @@ obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o obj-$(CONFIG_VFIO_PCI) += pci/ +obj-$(CONFIG_VFIO_PLATFORM) += platform/ diff --git a/drivers/vfio/platform/Kconfig b/drivers/vfio/platform/Kconfig new file mode 100644 index 000..c51af17 --- /dev/null +++ b/drivers/vfio/platform/Kconfig @@ -0,0 +1,9 @@ +config VFIO_PLATFORM + tristate VFIO support for platform devices + depends on VFIO EVENTFD ARM + help + Support for platform devices with VFIO. This is required to make + use of platform devices present on the system using the VFIO + framework. + + If you don't know what to do here, say N. diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile new file mode 100644 index 000..279862b --- /dev/null +++ b/drivers/vfio/platform/Makefile @@ -0,0 +1,4 @@ + +vfio-platform-y := vfio_platform.o vfio_platform_common.o + +obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 05/20] vfio: amba: add the VFIO for AMBA devices module to Kconfig
Enable building the VFIO AMBA driver. VFIO_AMBA depends on VFIO_PLATFORM, since it is sharing a portion of the code, and it is essentially implemented as a platform device whose resources are discovered via AMBA specific APIs in the kernel. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/Kconfig | 10 ++ drivers/vfio/platform/Makefile | 4 2 files changed, 14 insertions(+) diff --git a/drivers/vfio/platform/Kconfig b/drivers/vfio/platform/Kconfig index c51af17..c0a3bff 100644 --- a/drivers/vfio/platform/Kconfig +++ b/drivers/vfio/platform/Kconfig @@ -7,3 +7,13 @@ config VFIO_PLATFORM framework. If you don't know what to do here, say N. + +config VFIO_AMBA + tristate VFIO support for AMBA devices + depends on VFIO_PLATFORM ARM_AMBA + help + Support for ARM AMBA devices with VFIO. This is required to make + use of ARM AMBA devices present on the system using the VFIO + framework. + + If you don't know what to do here, say N. diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile index 279862b..1957170 100644 --- a/drivers/vfio/platform/Makefile +++ b/drivers/vfio/platform/Makefile @@ -2,3 +2,7 @@ vfio-platform-y := vfio_platform.o vfio_platform_common.o obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o + +vfio-amba-y := vfio_amba.o + +obj-$(CONFIG_VFIO_AMBA) += vfio-amba.o -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 04/20] vfio: amba: VFIO support for AMBA devices
Add support for discovering AMBA devices with VFIO and handle them similarly to Linux platform devices. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_amba.c | 115 ++ include/uapi/linux/vfio.h | 1 + 2 files changed, 116 insertions(+) create mode 100644 drivers/vfio/platform/vfio_amba.c diff --git a/drivers/vfio/platform/vfio_amba.c b/drivers/vfio/platform/vfio_amba.c new file mode 100644 index 000..ff0331f --- /dev/null +++ b/drivers/vfio/platform/vfio_amba.c @@ -0,0 +1,115 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/module.h +#include linux/slab.h +#include linux/vfio.h +#include linux/amba/bus.h + +#include vfio_platform_private.h + +#define DRIVER_VERSION 0.10 +#define DRIVER_AUTHOR Antonios Motakis a.mota...@virtualopensystems.com +#define DRIVER_DESC VFIO for AMBA devices - User Level meta-driver + +/* probing devices from the AMBA bus */ + +static struct resource *get_amba_resource(struct vfio_platform_device *vdev, + int i) +{ + struct amba_device *adev = (struct amba_device *) vdev-opaque; + + if (i == 0) + return adev-res; + + return NULL; +} + +static int get_amba_irq(struct vfio_platform_device *vdev, int i) +{ + struct amba_device *adev = (struct amba_device *) vdev-opaque; + int ret = 0; + + if (i AMBA_NR_IRQS) + ret = adev-irq[i]; + + /* zero is an unset IRQ for AMBA devices */ + return ret ? ret : -ENXIO; +} + +static int vfio_amba_probe(struct amba_device *adev, const struct amba_id *id) +{ + struct vfio_platform_device *vdev; + int ret; + + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); + if (!vdev) + return -ENOMEM; + + vdev-name = kasprintf(GFP_KERNEL, vfio-amba-%08x, adev-periphid); + if (!vdev-name) { + kfree(vdev); + return -ENOMEM; + } + + vdev-opaque = (void *) adev; + vdev-flags = VFIO_DEVICE_FLAGS_AMBA; + vdev-get_resource = get_amba_resource; + vdev-get_irq = get_amba_irq; + + ret = vfio_platform_probe_common(vdev, adev-dev); + if (ret) { + kfree(vdev-name); + kfree(vdev); + } + + return ret; +} + +static int vfio_amba_remove(struct amba_device *adev) +{ + struct vfio_platform_device *vdev; + + vdev = vfio_platform_remove_common(adev-dev); + if (vdev) { + kfree(vdev-name); + kfree(vdev); + return 0; + } + + return -EINVAL; +} + +static struct amba_id pl330_ids[] = { + { 0, 0 }, +}; + +MODULE_DEVICE_TABLE(amba, pl330_ids); + +static struct amba_driver vfio_amba_driver = { + .probe = vfio_amba_probe, + .remove = vfio_amba_remove, + .id_table = pl330_ids, + .drv = { + .name = vfio-amba, + .owner = THIS_MODULE, + }, +}; + +module_amba_driver(vfio_amba_driver); + +MODULE_VERSION(DRIVER_VERSION); +MODULE_LICENSE(GPL v2); +MODULE_AUTHOR(DRIVER_AUTHOR); +MODULE_DESCRIPTION(DRIVER_DESC); diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 4e93a97..544d3d8 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -160,6 +160,7 @@ struct vfio_device_info { #define VFIO_DEVICE_FLAGS_RESET(1 0)/* Device supports reset */ #define VFIO_DEVICE_FLAGS_PCI (1 1)/* vfio-pci device */ #define VFIO_DEVICE_FLAGS_PLATFORM (1 2)/* vfio-platform device */ +#define VFIO_DEVICE_FLAGS_AMBA (1 3) /* vfio-amba device */ __u32 num_regions;/* Max region index + 1 */ __u32 num_irqs; /* Max IRQ index + 1 */ }; -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 02/20] vfio: platform: probe to devices on the platform bus
Driver to bind to Linux platform devices, and callbacks to discover their resources to be used by the main VFIO PLATFORM code. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/platform/vfio_platform.c | 103 ++ include/uapi/linux/vfio.h | 1 + 2 files changed, 104 insertions(+) create mode 100644 drivers/vfio/platform/vfio_platform.c diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c new file mode 100644 index 000..cef645c --- /dev/null +++ b/drivers/vfio/platform/vfio_platform.c @@ -0,0 +1,103 @@ +/* + * Copyright (C) 2013 - Virtual Open Systems + * Author: Antonios Motakis a.mota...@virtualopensystems.com + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License, version 2, as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#include linux/module.h +#include linux/slab.h +#include linux/vfio.h +#include linux/platform_device.h + +#include vfio_platform_private.h + +#define DRIVER_VERSION 0.10 +#define DRIVER_AUTHOR Antonios Motakis a.mota...@virtualopensystems.com +#define DRIVER_DESC VFIO for platform devices - User Level meta-driver + +/* probing devices from the linux platform bus */ + +static struct resource *get_platform_resource(struct vfio_platform_device *vdev, + int num) +{ + struct platform_device *dev = (struct platform_device *) vdev-opaque; + int i; + + for (i = 0; i dev-num_resources; i++) { + struct resource *r = dev-resource[i]; + + if (resource_type(r) (IORESOURCE_MEM|IORESOURCE_IO)) { + if (!num) + return r; + + num--; + } + } + return NULL; +} + +static int get_platform_irq(struct vfio_platform_device *vdev, int i) +{ + struct platform_device *pdev = (struct platform_device *) vdev-opaque; + + return platform_get_irq(pdev, i); +} + +static int vfio_platform_probe(struct platform_device *pdev) +{ + struct vfio_platform_device *vdev; + int ret; + + vdev = kzalloc(sizeof(*vdev), GFP_KERNEL); + if (!vdev) + return -ENOMEM; + + vdev-opaque = (void *) pdev; + vdev-name = pdev-name; + vdev-flags = VFIO_DEVICE_FLAGS_PLATFORM; + vdev-get_resource = get_platform_resource; + vdev-get_irq = get_platform_irq; + + ret = vfio_platform_probe_common(vdev, pdev-dev); + if (ret) + kfree(vdev); + + return ret; +} + +static int vfio_platform_remove(struct platform_device *pdev) +{ + struct vfio_platform_device *vdev; + + vdev = vfio_platform_remove_common(pdev-dev); + if (vdev) { + kfree(vdev); + return 0; + } + + return -EINVAL; +} + +static struct platform_driver vfio_platform_driver = { + .probe = vfio_platform_probe, + .remove = vfio_platform_remove, + .driver = { + .name = vfio-platform, + .owner = THIS_MODULE, + }, +}; + +module_platform_driver(vfio_platform_driver); + +MODULE_VERSION(DRIVER_VERSION); +MODULE_LICENSE(GPL v2); +MODULE_AUTHOR(DRIVER_AUTHOR); +MODULE_DESCRIPTION(DRIVER_DESC); diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 9ade02b..4e93a97 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -159,6 +159,7 @@ struct vfio_device_info { __u32 flags; #define VFIO_DEVICE_FLAGS_RESET(1 0)/* Device supports reset */ #define VFIO_DEVICE_FLAGS_PCI (1 1)/* vfio-pci device */ +#define VFIO_DEVICE_FLAGS_PLATFORM (1 2)/* vfio-platform device */ __u32 num_regions;/* Max region index + 1 */ __u32 num_irqs; /* Max IRQ index + 1 */ }; -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 15/20] vfio: virqfd: rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit
The functions vfio_pci_virqfd_init and vfio_pci_virqfd_exit are not really PCI specific, since we plan to reuse the virqfd code with more VFIO drivers in addition to VFIO_PCI. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/pci/vfio_pci.c | 6 +++--- drivers/vfio/pci/vfio_pci_intrs.c | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index 9558da3..fc4308c 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1012,7 +1012,7 @@ put_devs: static void __exit vfio_pci_cleanup(void) { pci_unregister_driver(vfio_pci_driver); - vfio_pci_virqfd_exit(); + vfio_virqfd_exit(); vfio_pci_uninit_perm_bits(); } @@ -1026,7 +1026,7 @@ static int __init vfio_pci_init(void) return ret; /* Start the virqfd cleanup handler */ - ret = vfio_pci_virqfd_init(); + ret = vfio_virqfd_init(); if (ret) goto out_virqfd; @@ -1038,7 +1038,7 @@ static int __init vfio_pci_init(void) return 0; out_driver: - vfio_pci_virqfd_exit(); + vfio_virqfd_exit(); out_virqfd: vfio_pci_uninit_perm_bits(); return ret; diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c index 0a41833d..a5378d5 100644 --- a/drivers/vfio/pci/vfio_pci_intrs.c +++ b/drivers/vfio/pci/vfio_pci_intrs.c @@ -45,7 +45,7 @@ struct virqfd { static struct workqueue_struct *vfio_irqfd_cleanup_wq; -int __init vfio_pci_virqfd_init(void) +int __init vfio_virqfd_init(void) { vfio_irqfd_cleanup_wq = create_singlethread_workqueue(vfio-irqfd-cleanup); @@ -55,7 +55,7 @@ int __init vfio_pci_virqfd_init(void) return 0; } -void vfio_pci_virqfd_exit(void) +void vfio_virqfd_exit(void) { destroy_workqueue(vfio_irqfd_cleanup_wq); } -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v11 19/20] vfio: initialize the virqfd workqueue in VFIO generic code
Now we have finally completely decoupled virqfd from VFIO_PCI. We can initialize it from the VFIO generic code, in order to safely use it from multiple independent VFIO bus drivers. Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com --- drivers/vfio/Makefile | 4 +++- drivers/vfio/pci/Makefile | 3 +-- drivers/vfio/pci/vfio_pci.c | 8 drivers/vfio/vfio.c | 8 4 files changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index dadf0ca..d798b09 100644 --- a/drivers/vfio/Makefile +++ b/drivers/vfio/Makefile @@ -1,4 +1,6 @@ -obj-$(CONFIG_VFIO) += vfio.o +vfio_core-y := vfio.o virqfd.o + +obj-$(CONFIG_VFIO) += vfio_core.o obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index c7c8644..1310792 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -1,5 +1,4 @@ -vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o \ - ../virqfd.o +vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index fc4308c..8d156d7 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1012,7 +1012,6 @@ put_devs: static void __exit vfio_pci_cleanup(void) { pci_unregister_driver(vfio_pci_driver); - vfio_virqfd_exit(); vfio_pci_uninit_perm_bits(); } @@ -1025,11 +1024,6 @@ static int __init vfio_pci_init(void) if (ret) return ret; - /* Start the virqfd cleanup handler */ - ret = vfio_virqfd_init(); - if (ret) - goto out_virqfd; - /* Register and scan for devices */ ret = pci_register_driver(vfio_pci_driver); if (ret) @@ -1038,8 +1032,6 @@ static int __init vfio_pci_init(void) return 0; out_driver: - vfio_virqfd_exit(); -out_virqfd: vfio_pci_uninit_perm_bits(); return ret; } diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index f018d8d..8e84471 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -1464,6 +1464,11 @@ static int __init vfio_init(void) if (ret) goto err_cdev_add; + /* Start the virqfd cleanup handler used by some VFIO bus drivers */ + ret = vfio_virqfd_init(); + if (ret) + goto err_virqfd; + pr_info(DRIVER_DESC version: DRIVER_VERSION \n); /* @@ -1476,6 +1481,8 @@ static int __init vfio_init(void) return 0; +err_virqfd: + cdev_del(vfio.group_cdev); err_cdev_add: unregister_chrdev_region(vfio.group_devt, MINORMASK); err_alloc_chrdev: @@ -1490,6 +1497,7 @@ static void __exit vfio_cleanup(void) { WARN_ON(!list_empty(vfio.group_list)); + vfio_virqfd_exit(); idr_destroy(vfio.group_idr); cdev_del(vfio.group_cdev); unregister_chrdev_region(vfio.group_devt, MINORMASK); -- 2.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On 06/01/2015 09:42, Paolo Bonzini wrote: Still confused. So we can freeze all vCPUs in the host, then update pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM doesn't increment the version pre-update, and we can return completely bogus results. Yes. But then the getcpu test would fail (1-0). Even if you have an ABA situation (1-0-1), it's okay because the pvti that is fetched is the one returned by the first getcpu. ... this case of partial update of pvti, which is caught by the version field, if of course different from the other (extremely unlikely) that Andy pointed out. That is when the getcpus are done on the same vCPU, but the rdtsc is another. That one can be fixed by rdtscp, like do { // get a consistent (pvti, v, tsc) tuple do { cpu = get_cpu(); pvti = get_pvti(cpu); v = pvti-version ~1; // also acts as rmb(); rdtsc_barrier(); tsc = rdtscp(cpu1); // control dependency, no need for rdtsc_barrier? } while(cpu != cpu1); // ... compute nanoseconds from pvti and tsc ... rmb(); } while(v != pvti-version); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvmtool: Add parameter to specifiy number of threads in thread_pool
With current code the number of threads added to the thread_pool equals number of online CPUs. IMHO on systems with many CPUs this is overkill for guests that just have a few vCPUs and/or if a guest is pinned to a subset of host CPUs. E.g. of a system with 48 cores # numactl -C 4,5,7,8 ./lkvm run -c 2 -m 256 -k paravirt -d rootfs.ext3 ... # ps -La | grep threadpool-work | wc -l 48 Don't change default behaviour (for sake of compatibility) but introduce a new parameter (-t or --threads) that allows to specify number of threads to be created for the thread_pool: # numactl -C 4,5,7,8 ./lkvm run -c 2 -m 256 --threads 4 -k paravirt -d ... # ps -La | grep threadpool-work | wc -l 4 Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com --- tools/kvm/builtin-run.c|2 ++ tools/kvm/include/kvm/kvm-config.h |1 + tools/kvm/util/threadpool.c|5 - 3 files changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c index 1ee75ad..86de53d 100644 --- a/tools/kvm/builtin-run.c +++ b/tools/kvm/builtin-run.c @@ -131,6 +131,8 @@ void kvm_run_set_wrapper_sandbox(void) rootfs), \ OPT_STRING('\0', hugetlbfs, (cfg)-hugetlbfs_path, path, \ Hugetlbfs path), \ + OPT_INTEGER('t', threads, (cfg)-nrthreads, \ +Number of threads in thread_pool), \ \ OPT_GROUP(Kernel options:), \ OPT_STRING('k', kernel, (cfg)-kernel_filename, kernel,\ diff --git a/tools/kvm/include/kvm/kvm-config.h b/tools/kvm/include/kvm/kvm-config.h index 386fa8c..9cc50f5 100644 --- a/tools/kvm/include/kvm/kvm-config.h +++ b/tools/kvm/include/kvm/kvm-config.h @@ -27,6 +27,7 @@ struct kvm_config { int active_console; int debug_iodelay; int nrcpus; + int nrthreads; const char *kernel_cmdline; const char *kernel_filename; const char *vmlinux_filename; diff --git a/tools/kvm/util/threadpool.c b/tools/kvm/util/threadpool.c index e64aa26..620fdbd 100644 --- a/tools/kvm/util/threadpool.c +++ b/tools/kvm/util/threadpool.c @@ -124,7 +124,10 @@ static int thread_pool__addthread(void) int thread_pool__init(struct kvm *kvm) { unsigned long i; - unsigned int thread_count = sysconf(_SC_NPROCESSORS_ONLN); + unsigned int thread_count; + + thread_count = kvm-cfg.nrthreads ? kvm-cfg.nrthreads : + sysconf(_SC_NPROCESSORS_ONLN); running = true; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvmtool: Set names for virtio-net-ctrl and term-poll threads
Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com --- tools/kvm/term.c |2 ++ tools/kvm/virtio/net.c |2 ++ 2 files changed, 4 insertions(+) diff --git a/tools/kvm/term.c b/tools/kvm/term.c index 1b8131a..5754510 100644 --- a/tools/kvm/term.c +++ b/tools/kvm/term.c @@ -104,6 +104,8 @@ static void *term_poll_thread_loop(void *param) int i; + kvm__set_thread_name(term-poll); + for (i = 0; i TERM_MAX_DEVS; i++) { fds[i].fd = term_fds[i][TERM_FD_IN]; fds[i].events = POLLIN; diff --git a/tools/kvm/virtio/net.c b/tools/kvm/virtio/net.c index ecdb94e..fed5095 100644 --- a/tools/kvm/virtio/net.c +++ b/tools/kvm/virtio/net.c @@ -233,6 +233,8 @@ static void *virtio_net_ctrl_thread(void *p) struct virtio_net_ctrl_hdr *ctrl; virtio_net_ctrl_ack *ack; + kvm__set_thread_name(virtio-net-ctrl); + while (1) { mutex_lock(ndev-io_lock[id]); if (!virt_queue__available(vq)) -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote: On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote: The pvclock vdso code was too abstracted to understand easily and excessively paranoid. Simplify it for a huge speedup. This opens the door for additional simplifications, as the vdso no longer accesses the pvti for any vcpu other than vcpu 0. Before, vclock_gettime using kvm-clock took about 64ns on my machine. With this change, it takes 19ns, which is almost as fast as the pure TSC implementation. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/vdso/vclock_gettime.c | 82 -- 1 file changed, 47 insertions(+), 35 deletions(-) diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c index 9793322751e0..f2e0396d5629 100644 --- a/arch/x86/vdso/vclock_gettime.c +++ b/arch/x86/vdso/vclock_gettime.c @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) static notrace cycle_t vread_pvclock(int *mode) { - const struct pvclock_vsyscall_time_info *pvti; + const struct pvclock_vcpu_time_info *pvti = get_pvti(0)-pvti; cycle_t ret; - u64 last; - u32 version; - u8 flags; - unsigned cpu, cpu1; - + u64 tsc, pvti_tsc; + u64 last, delta, pvti_system_time; + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift; /* - * Note: hypervisor must guarantee that: - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info. - * 2. that per-CPU pvclock time info is updated if the - *underlying CPU changes. - * 3. that version is increased whenever underlying CPU - *changes. + * Note: The kernel and hypervisor must guarantee that cpu ID + * number maps 1:1 to per-CPU pvclock time info. + * + * Because the hypervisor is entirely unaware of guest userspace + * preemption, it cannot guarantee that per-CPU pvclock time + * info is updated if the underlying CPU changes or that that + * version is increased whenever underlying CPU changes. + * + * On KVM, we are guaranteed that pvti updates for any vCPU are + * atomic as seen by *all* vCPUs. This is an even stronger + * guarantee than we get with a normal seqlock. * + * On Xen, we don't appear to have that guarantee, but Xen still + * supplies a valid seqlock using the version field. + + * We only do pvclock vdso timing at all if + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to + * mean that all vCPUs have matching pvti and that the TSC is + * synced, so we can just look at vCPU 0's pvti. */ Can Xen guarantee that ? I think so, vacuously. Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT at all. I have no idea going forward, though. Xen people? The person who would know of the top of his head is Dan Magenheimer, who is now enjoy retirement :-( I will have to dig in the code to answer this - that will take a bit of time sadly (I am sick this week). - do { - cpu = __getcpu() VGETCPU_CPU_MASK; - /* TODO: We can put vcpu id into higher bits of pvti.version. - * This will save a couple of cycles by getting rid of - * __getcpu() calls (Gleb). - */ - - pvti = get_pvti(cpu); - - version = __pvclock_read_cycles(pvti-pvti, ret, flags); - - /* - * Test we're still on the cpu as well as the version. - * We could have been migrated just after the first - * vgetcpu but before fetching the version, so we - * wouldn't notice a version change. - */ - cpu1 = __getcpu() VGETCPU_CPU_MASK; - } while (unlikely(cpu != cpu1 || - (pvti-pvti.version 1) || - pvti-pvti.version != version)); - - if (unlikely(!(flags PVCLOCK_TSC_STABLE_BIT))) + + if (unlikely(!(pvti-flags PVCLOCK_TSC_STABLE_BIT))) { *mode = VCLOCK_NONE; + return 0; + } This check must be performed after reading a stable pvti. We can even read it in the middle, guarded by the version checks. I'll do that for v2. + + do { + version = pvti-version; + + /* This is also a read barrier, so we'll read version first. */ + rdtsc_barrier(); + tsc = __native_read_tsc(); + + pvti_tsc_to_system_mul = pvti-tsc_to_system_mul; + pvti_tsc_shift = pvti-tsc_shift; + pvti_system_time = pvti-system_time; + pvti_tsc = pvti-tsc_timestamp; + +
[ÞATCH] kvmtool, mips: Support more than 256 MB guest memory
Two guest memory regions need to be defined and two mem= parameters need to be passed to guest kernel to support more than 256 MB. Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com --- tools/kvm/mips/include/kvm/kvm-arch.h | 10 + tools/kvm/mips/kvm.c | 36 +++-- 2 files changed, 40 insertions(+), 6 deletions(-) diff --git a/tools/kvm/mips/include/kvm/kvm-arch.h b/tools/kvm/mips/include/kvm/kvm-arch.h index 7eadbf4..97bbf34 100644 --- a/tools/kvm/mips/include/kvm/kvm-arch.h +++ b/tools/kvm/mips/include/kvm/kvm-arch.h @@ -1,10 +1,20 @@ #ifndef KVM__KVM_ARCH_H #define KVM__KVM_ARCH_H + +/* + * Guest memory map is: + * 0x-0x0fff : System RAM + * 0x1000-0x1fff : I/O (defined by KVM_MMIO_START and KVM_MMIO_SIZE) + * 0x2000-...: System RAM + * See also kvm__init_ram(). + */ + #define KVM_MMIO_START 0x1000 #define KVM_PCI_CFG_AREA KVM_MMIO_START #define KVM_PCI_MMIO_AREA (KVM_MMIO_START + 0x100) #define KVM_VIRTIO_MMIO_AREA (KVM_MMIO_START + 0x200) +#define KVM_MMIO_SIZE 0x1000 /* * Just for reference. This and the above corresponds to what's used diff --git a/tools/kvm/mips/kvm.c b/tools/kvm/mips/kvm.c index fc0428b..10b441b 100644 --- a/tools/kvm/mips/kvm.c +++ b/tools/kvm/mips/kvm.c @@ -22,11 +22,28 @@ void kvm__init_ram(struct kvm *kvm) u64 phys_start, phys_size; void*host_mem; - phys_start = 0; - phys_size = kvm-ram_size; - host_mem = kvm-ram_start; + if (kvm-ram_size = KVM_MMIO_START) { + /* one region for all memory */ + phys_start = 0; + phys_size = kvm-ram_size; + host_mem = kvm-ram_start; - kvm__register_mem(kvm, phys_start, phys_size, host_mem); + kvm__register_mem(kvm, phys_start, phys_size, host_mem); + } else { + /* one region for memory that fits below MMIO range */ + phys_start = 0; + phys_size = KVM_MMIO_SIZE; + host_mem = kvm-ram_start; + + kvm__register_mem(kvm, phys_start, phys_size, host_mem); + + /* one region for rest of memory */ + phys_start = KVM_MMIO_START + KVM_MMIO_SIZE; + phys_size = kvm-ram_size - 0x1000; + host_mem = kvm-ram_start + KVM_MMIO_SIZE; + + kvm__register_mem(kvm, phys_start, phys_size, host_mem); + } } void kvm__arch_delete_ram(struct kvm *kvm) @@ -108,8 +125,15 @@ static void kvm__mips_install_cmdline(struct kvm *kvm) u64 argv_offset = argv_start; u64 argc = 0; - sprintf(p + cmdline_offset, mem=0x%llx@0 , -(unsigned long long)kvm-ram_size); + + if ((u64) kvm-ram_size = KVM_MMIO_SIZE) + sprintf(p + cmdline_offset, mem=0x%llx@0 , + (unsigned long long)kvm-ram_size); + else + sprintf(p + cmdline_offset, mem=0x%llx@0 mem=0x%llx@0x%llx , + (unsigned long long)KVM_MMIO_START, + (unsigned long long)kvm-ram_size - KVM_MMIO_START, + (unsigned long long)(KVM_MMIO_START + KVM_MMIO_SIZE)); strcat(p + cmdline_offset, kvm-cfg.real_cmdline); /* maximum size is 2K */ -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
On Tue, Jan 06, 2015 at 10:43:57AM -0800, Andy Lutomirski wrote: Sure, but the code would be simpler if we shoved that value in the EFLAGS slot. There probably is some reason for that but it's not like we can change it :-) Hmm. I added and pushed a test for fork, but that didn't turn anything up. And I don't see any bugs in the code. I booted 3.18 plus this patch with context tracking forced on on my laptop, and something seems to have gone wrong. Well, I ran it on -rc2 + tip/master and it did trigger sporadically yesterday but was unable to trigger something today. I'll redo the whole games tomorrow. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote: On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote: On Jan 6, 2015 4:01 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 06/01/2015 09:42, Paolo Bonzini wrote: Still confused. So we can freeze all vCPUs in the host, then update pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM doesn't increment the version pre-update, and we can return completely bogus results. Yes. But then the getcpu test would fail (1-0). Even if you have an ABA situation (1-0-1), it's okay because the pvti that is fetched is the one returned by the first getcpu. ... this case of partial update of pvti, which is caught by the version field, if of course different from the other (extremely unlikely) that Andy pointed out. That is when the getcpus are done on the same vCPU, but the rdtsc is another. That one can be fixed by rdtscp, like do { // get a consistent (pvti, v, tsc) tuple do { cpu = get_cpu(); pvti = get_pvti(cpu); v = pvti-version ~1; // also acts as rmb(); rdtsc_barrier(); tsc = rdtscp(cpu1); Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD specified it that way and both AMD and Intel implement it correctly. (rdtsc, on the other hand, definitely needs the barrier beforehand.) // control dependency, no need for rdtsc_barrier? } while(cpu != cpu1); // ... compute nanoseconds from pvti and tsc ... rmb(); } while(v != pvti-version); Still no good. We can migrate a bunch of times so we see the same CPU all three times and *still* don't get a consistent read, unless we play nasty games with lots of version checks (I have a patch for that, but I don't like it very much). The patch is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d but I don't like it. Thus far, I've been told unambiguously that a guest can't observe pvti while it's being written, and I think you're now telling me that this isn't true and that a guest *can* observe pvti while it's being written while the low bit of the version field is not set. If so, this is rather strongly incompatible with the spec in the KVM docs. I don't suppose that you and Marcelo could agree on what the actual semantics that KVM provides are and could write it down in a way that people who haven't spent a long time staring at the request code understand? And maybe you could even fix the implementation while you're at it if the implementation is, indeed, broken. I have ugly patches to fix it here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=3b718a050cba52563d831febc2e1ca184c02bac0 but I'm not thrilled with them. --Andy I suppose that separating the version write from the rest of the pvclock structure is sufficient, as that would guarantee the writes are not reordered even with fast string REP MOVS. Thanks for catching this Andy! Don't you stil need: version++; write the rest; version++; with possible smp_wmb() in there to keep the compiler from messing around? Correct. Could just as well follow the protocol and use odd/even, which is what your patch does. What is the point with the new flags bit though? Also, if you do this, can you also make setting and clearing STABLE_BIT properly atomic across all vCPUs? Or at least do something like setting it last and clearing it first on vPCU 0? If the version seqlock works properly across vCPUs, why do you need STABLE_BIT properly atomic ? Please define what you mean by properly atomic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
On Jan 6, 2015 7:34 AM, Borislav Petkov b...@alien8.de wrote: On Mon, Jan 05, 2015 at 12:31:15PM -0800, Andy Lutomirski wrote: Do you have context tracking on? Yap, it is enabled for whatever reason: CONFIG_CONTEXT_TRACKING=y CONFIG_CONTEXT_TRACKING_FORCE=y CONFIG_HAVE_CONTEXT_TRACKING=y I'll boot a kernel like this on bare metal and see what shakes loose. I assume that's in the historical tree? Yeah. [ 180.059170] ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x6 frozen [ 180.066873] ata1.00: failed command: WRITE FPDMA QUEUED [ 180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 4096 out [ 180.072158] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) That's really weird. The only thing I can think of is that somehow we returned to user mode without enabling interrupts. Right, considering FIXUP_TOP_OF_STACK is used in a bunch of cases in entry_64.S, no wonder it corrupts something. This leads me to wonder: why do we save eflags in the R11 pt_regs slot? That: If executed in 64-bit mode, SYSRET loads the lower-32 RFLAGS bits from R11[31:0] and clears the upper 32 RFLAGS bits. Sure, but the code would be simpler if we shoved that value in the EFLAGS slot. This seems entirely backwards, not to mention that it accounts for two instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK for no apparently reason whatsoever. Can you send the full output from syscall_exit_regs_64 from here: https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e: with the patch applied (assuming it even gets that far for you)? I see results like: [NOTE]syscall : orig RCX = 1 ss = 2b orig_ss = 6b flags = 217 orig_flags = 217 which seems fine. ./syscall_exit_regs_64 [OK]int80 : AX = ffda [OK]int80 4000: AX = ffda [OK]syscall : RCX = 400962 RIP = 400962 [OK]syscall : AX = ffda [NOTE] syscall : orig RCX = 1 ss = 2b orig_ss = 6b flags = 217 orig_flags = 217 [OK]syscall 4000: RCX = 400962 RIP = 400962 [FAIL] syscall 4000: AX = fff7 [NOTE] syscall 4000: orig RCX = 1 ss = 2b orig_ss = 6b flags = 217 orig_flags = 217 [OK]syscall(): ret = -1, errno = 38 Are you seeing this with the whole series applied or with only this patch? I applied this patch only and started seeing those. Then I booted in the previous kernel and tried to repro but it didn't trigger. I'll try hammering on the kernel *without* your patch to see whether I can trigger it somehow... Hmm. I added and pushed a test for fork, but that didn't turn anything up. And I don't see any bugs in the code. I booted 3.18 plus this patch with context tracking forced on on my laptop, and something seems to have gone wrong. --Andy -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote: On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote: On Jan 6, 2015 4:01 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 06/01/2015 09:42, Paolo Bonzini wrote: Still confused. So we can freeze all vCPUs in the host, then update pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM doesn't increment the version pre-update, and we can return completely bogus results. Yes. But then the getcpu test would fail (1-0). Even if you have an ABA situation (1-0-1), it's okay because the pvti that is fetched is the one returned by the first getcpu. ... this case of partial update of pvti, which is caught by the version field, if of course different from the other (extremely unlikely) that Andy pointed out. That is when the getcpus are done on the same vCPU, but the rdtsc is another. That one can be fixed by rdtscp, like do { // get a consistent (pvti, v, tsc) tuple do { cpu = get_cpu(); pvti = get_pvti(cpu); v = pvti-version ~1; // also acts as rmb(); rdtsc_barrier(); tsc = rdtscp(cpu1); Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD specified it that way and both AMD and Intel implement it correctly. (rdtsc, on the other hand, definitely needs the barrier beforehand.) // control dependency, no need for rdtsc_barrier? } while(cpu != cpu1); // ... compute nanoseconds from pvti and tsc ... rmb(); } while(v != pvti-version); Still no good. We can migrate a bunch of times so we see the same CPU all three times and *still* don't get a consistent read, unless we play nasty games with lots of version checks (I have a patch for that, but I don't like it very much). The patch is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d but I don't like it. Thus far, I've been told unambiguously that a guest can't observe pvti while it's being written, and I think you're now telling me that this isn't true and that a guest *can* observe pvti while it's being written while the low bit of the version field is not set. If so, this is rather strongly incompatible with the spec in the KVM docs. I don't suppose that you and Marcelo could agree on what the actual semantics that KVM provides are and could write it down in a way that people who haven't spent a long time staring at the request code understand? And maybe you could even fix the implementation while you're at it if the implementation is, indeed, broken. I have ugly patches to fix it here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=3b718a050cba52563d831febc2e1ca184c02bac0 but I'm not thrilled with them. --Andy I suppose that separating the version write from the rest of the pvclock structure is sufficient, as that would guarantee the writes are not reordered even with fast string REP MOVS. Thanks for catching this Andy! Don't you stil need: version++; write the rest; version++; with possible smp_wmb() in there to keep the compiler from messing around? Correct. Could just as well follow the protocol and use odd/even, which is what your patch does. What is the point with the new flags bit though? To try to work around the problem on old hosts. I'm not at all convinced that this is worthwhile or that it helps, though. Also, if you do this, can you also make setting and clearing STABLE_BIT properly atomic across all vCPUs? Or at least do something like setting it last and clearing it first on vPCU 0? If the version seqlock works properly across vCPUs, why do you need STABLE_BIT properly atomic ? Please define what you mean by properly atomic. I'd like to be able to rely using vCPU 0's pvti even from other vCPUs in the vdso if the stable bit is set. That means that the host should avoid doing things like migrating the guest, clearing the stable bit for vCPU 1, resuming vCPU 1, and waiting long enough to clear the stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data and return a bad timestamp. Maybe this scenario is impossible, but getting rid of any getcpu-like operation in the vdso has really nice benefits. It's faster and it lets us guarantee that the vdso's pvti data fits in a single page. The latter means that we can easily make it work like the hpet mapping, which gets us 32-bit support and will *finally* let us turn off user access to the fixmap if vsyscall=none.
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote: What is the point with the new flags bit though? To try to work around the problem on old hosts. I'm not at all convinced that this is worthwhile or that it helps, though. Don't think so. Just fix the host bug. Also, if you do this, can you also make setting and clearing STABLE_BIT properly atomic across all vCPUs? Or at least do something like setting it last and clearing it first on vPCU 0? If the version seqlock works properly across vCPUs, why do you need STABLE_BIT properly atomic ? Please define what you mean by properly atomic. I'd like to be able to rely using vCPU 0's pvti even from other vCPUs in the vdso if the stable bit is set. That means that the host should avoid doing things like migrating the guest, clearing the stable bit for vCPU 1, resuming vCPU 1, and waiting long enough to clear the stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data and return a bad timestamp. Maybe this scenario is impossible, but getting rid of any getcpu-like operation in the vdso has really nice benefits. You can park every vCPU in host while updating vCPU-0's timestamp. See kvm_gen_update_masterclock: + /* no guest entries from this point */ + pvclock_update_vm_gtod_copy(kvm); - touch guest memory + /* guest entries allowed */ + kvm_for_each_vcpu(i, vcpu, kvm) + clear_bit(KVM_REQ_MCLOCK_INPROGRESS, vcpu-requests); It's faster and it lets us guarantee that the vdso's pvti data fits in a single page. The latter means that we can easily make it work like the hpet mapping, which gets us 32-bit support and will *finally* let us turn off user access to the fixmap if vsyscall=none. (We can, of course, still do this if the pvti data needs to be an array, but it's messier.) --Andy -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v8 34/50] vhost/net: virtio 1.0 byte swap
On Mon, 2014-12-01 at 18:05 +0200, Michael S. Tsirkin wrote: I had to add an explicit tag to suppress compiler warning: gcc isn't smart enough to notice that len is always initialized since function is called with size 0. I'm getting a panic inside a guest when this change is applied on the host. I identified this patch via bisect and confirmed by reverting it from v3.19-rc2. Guest is centos6. Thanks, Alex commit 8b38694a2dc8b18374310df50174f1e4376d6824 Author: Michael S. Tsirkin m...@redhat.com Date: Fri Oct 24 14:19:48 2014 +0300 vhost/net: virtio 1.0 byte swap I had to add an explicit tag to suppress compiler warning: gcc isn't smart enough to notice that len is always initialized since function is called with size 0. Signed-off-by: Michael S. Tsirkin m...@redhat.com Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com XML chunk: interface type='direct' mac address='52:54:00:64:f3:34'/ source dev='iscsinet0' mode='bridge'/ model type='virtio'/ address type='pci' domain='0x' bus='0x00' slot='0x03' function='0x0'/ /interface Panic log: 1BUG: unable to handle kernel NULL pointer dereference at 0010 1IP: [a0079469] virtnet_poll+0x4f9/0x910 [virtio_net] 4PGD 1aa2f4067 PUD 1aa2f5067 PMD 0 4Oops: [#1] SMP 4last sysfs file: /sys/devices/pci:00/:00:03.0/virtio0/net/eth9/ifindex 4CPU 0 4Modules linked in: 8021q garp stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput microcode snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc igbvf nvidia(P)(U) i2c_core tg3 ptp pps_core virtio_balloon virtio_net virtio_console ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] 4 4Pid: 1374, comm: NetworkManager Tainted: P --- 2.6.32-431.23.3.el6.centos.plus.x86_64 #1 QEMU Standard PC (i440FX + PIIX, 1996) 4RIP: 0010:[a0079469] [a0079469] virtnet_poll+0x4f9/0x910 [virtio_net] 4RSP: 0018:880028203e48 EFLAGS: 00010246 4RAX: 8801a3383d00 RBX: 8801a6aaf480 RCX: 8801aa20b6e0 4RDX: 00c0 RSI: 8801a3383c00 RDI: 8801a3383cc0 4RBP: 880028203ed8 R08: 009e R09: 8801aa1d800c 4R10: 0218 R11: R12: 8801aa20b6e0 4R13: R14: R15: 4FS: 7febf114d800() GS:88002820() knlGS: 4CS: 0010 DS: ES: CR0: 80050033 4CR2: 0010 CR3: 0001aa793000 CR4: 06f0 4DR0: DR1: DR2: 4DR3: DR6: 0ff0 DR7: 0400 4Process NetworkManager (pid: 1374, threadinfo 8801a74ba000, task 8801a8d56040) 4Stack: 4 8801aa1d8000 009e 8801aa20b6e0 8801aa20b718 4d 8801aa20b780 8801aa1d800c 8801a6aaf4b8 8801aa20b020 4d 0080 8801aa20b708 0001 1f5981a830c8 4Call Trace: 4 IRQ 4 [8146ae33] net_rx_action+0x103/0x2f0 4 [8107a5f1] __do_softirq+0xc1/0x1e0 4 [8100c30c] ? call_softirq+0x1c/0x30 4 [8100c30c] call_softirq+0x1c/0x30 4 EOI 4 [8100fa75] ? do_softirq+0x65/0xa0 4 [8107b2ea] local_bh_enable+0x9a/0xb0 4 [a007813a] virtnet_napi_enable+0x4a/0x60 [virtio_net] 4 [a0078ebf] virtnet_open+0x4f/0x60 [virtio_net] 4 [81467691] dev_open+0xa1/0x100 4 [81466751] dev_change_flags+0xa1/0x1d0 4 [81474a59] do_setlink+0x169/0x8b0 4 [814770b6] ? rtnl_fill_ifinfo+0x946/0xcb0 4 [812a3d24] ? nla_parse+0x34/0x110 4 [8147659e] rtnl_setlink+0xee/0x130 4 [81475b67] rtnetlink_rcv_msg+0x2d7/0x340 4 [81231e14] ? socket_has_perm+0x74/0x90 4 [81475890] ? rtnetlink_rcv_msg+0x0/0x340 4 [814910a9] netlink_rcv_skb+0xa9/0xd0 4 [81475875] rtnetlink_rcv+0x25/0x40 4 [81490cdb] netlink_unicast+0x2db/0x320 4 [81491750] netlink_sendmsg+0x2c0/0x3d0 4 [814520c3] sock_sendmsg+0x123/0x150 4 [81453d73] ? sock_recvmsg+0x133/0x160 4 [8109afa0] ? autoremove_wake_function+0x0/0x40 4 [81136941] ? lru_cache_add_lru+0x21/0x40 4 [8115522d] ? page_add_new_anon_rmap+0x9d/0xf0 4 [8114aeef] ? handle_pte_fault+0x4af/0xb00 4 [81451f14] ? move_addr_to_kernel+0x64/0x70 4 [814538b6] __sys_sendmsg+0x406/0x420 4 [8104a98c] ? __do_page_fault+0x1ec/0x480 4 [814523d9] ? sys_sendto+0x139/0x190 4 [8103ea6c] ? kvm_clock_read+0x1c/0x20 4 [81453ad9] sys_sendmsg+0x49/0x90 4 [8100b072] system_call_fastpath+0x16/0x1b 4Code: 83 e0
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Tue, Jan 6, 2015 at 12:20 PM, Marcelo Tosatti mtosa...@redhat.com wrote: On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote: What is the point with the new flags bit though? To try to work around the problem on old hosts. I'm not at all convinced that this is worthwhile or that it helps, though. Don't think so. Just fix the host bug. Also, if you do this, can you also make setting and clearing STABLE_BIT properly atomic across all vCPUs? Or at least do something like setting it last and clearing it first on vPCU 0? If the version seqlock works properly across vCPUs, why do you need STABLE_BIT properly atomic ? Please define what you mean by properly atomic. I'd like to be able to rely using vCPU 0's pvti even from other vCPUs in the vdso if the stable bit is set. That means that the host should avoid doing things like migrating the guest, clearing the stable bit for vCPU 1, resuming vCPU 1, and waiting long enough to clear the stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data and return a bad timestamp. Maybe this scenario is impossible, but getting rid of any getcpu-like operation in the vdso has really nice benefits. You can park every vCPU in host while updating vCPU-0's timestamp. See kvm_gen_update_masterclock: + /* no guest entries from this point */ + pvclock_update_vm_gtod_copy(kvm); - touch guest memory + /* guest entries allowed */ + kvm_for_each_vcpu(i, vcpu, kvm) + clear_bit(KVM_REQ_MCLOCK_INPROGRESS, vcpu-requests); Can we do that easily? It looks like we're holding a spinlock in there. Could we make pvclock_gtod_sync_lock into a mutex? We could also add something to explicitly prevent any of the guests from entering until we're updated all of them, but that would hurt performance even more. It would be kind of nice if we could avoid serializing all CPUs entirely, though. For example, if we could increment all the versions, then write all the pvtis, then increment all the versions again from a single function, then everything is atomic for a correctly behaving guest, but if the guest isn't actually reading the time, then it doesn't stall. (Also, we should really have a cpu_relax in all of these loops, IMO, so that pause loop exiting can take effect.) --Andy It's faster and it lets us guarantee that the vdso's pvti data fits in a single page. The latter means that we can easily make it work like the hpet mapping, which gets us 32-bit support and will *finally* let us turn off user access to the fixmap if vsyscall=none. (We can, of course, still do this if the pvti data needs to be an array, but it's messier.) --Andy -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] KVM: introduce kvm_check_device
On Tue, 2015-01-06 at 16:12 +, Andre Przywara wrote: While we can easily register and unregister KVM devices, there is currently no easy way of checking whether a device has been registered. Introduce kvm_check_device() for that purpose and use it in two existing functions. Also change the return code for an invalid type number from ENOSPC to EINVAL. This function will be later used by another patch set to check whether a KVM_CREATE_IRQCHIP ioctl is valid. You're checking whether a device type has been registered, not the device itself -- so could you make it kvm_check_device_type()? Signed-off-by: Andre Przywara andre.przyw...@arm.com --- Hi, can people comment whether there is an easier way to detect KVM device registration _outside_ of virt/kvm/kvm_main.c? Using the KVM_CREATE_DEVICE_TEST flag sounds like a fit, but kvm_ioctl_create_device() isn't exported. Out of curiosity, why do you need to test it from inside the kernel but outside kvm_main.c? -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
On Mon, Jan 05, 2015 at 12:31:15PM -0800, Andy Lutomirski wrote: Do you have context tracking on? Yap, it is enabled for whatever reason: CONFIG_CONTEXT_TRACKING=y CONFIG_CONTEXT_TRACKING_FORCE=y CONFIG_HAVE_CONTEXT_TRACKING=y I assume that's in the historical tree? Yeah. [ 180.059170] ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 0x6 frozen [ 180.066873] ata1.00: failed command: WRITE FPDMA QUEUED [ 180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 4096 out [ 180.072158] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) That's really weird. The only thing I can think of is that somehow we returned to user mode without enabling interrupts. Right, considering FIXUP_TOP_OF_STACK is used in a bunch of cases in entry_64.S, no wonder it corrupts something. This leads me to wonder: why do we save eflags in the R11 pt_regs slot? That: If executed in 64-bit mode, SYSRET loads the lower-32 RFLAGS bits from R11[31:0] and clears the upper 32 RFLAGS bits. This seems entirely backwards, not to mention that it accounts for two instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK for no apparently reason whatsoever. Can you send the full output from syscall_exit_regs_64 from here: https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e: with the patch applied (assuming it even gets that far for you)? I see results like: [NOTE]syscall : orig RCX = 1 ss = 2b orig_ss = 6b flags = 217 orig_flags = 217 which seems fine. ./syscall_exit_regs_64 [OK]int80 : AX = ffda [OK]int80 4000: AX = ffda [OK]syscall : RCX = 400962 RIP = 400962 [OK]syscall : AX = ffda [NOTE] syscall : orig RCX = 1 ss = 2b orig_ss = 6b flags = 217 orig_flags = 217 [OK]syscall 4000: RCX = 400962 RIP = 400962 [FAIL] syscall 4000: AX = fff7 [NOTE] syscall 4000: orig RCX = 1 ss = 2b orig_ss = 6b flags = 217 orig_flags = 217 [OK]syscall(): ret = -1, errno = 38 Are you seeing this with the whole series applied or with only this patch? I applied this patch only and started seeing those. Then I booted in the previous kernel and tried to repro but it didn't trigger. I'll try hammering on the kernel *without* your patch to see whether I can trigger it somehow... -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On 06/01/2015 17:56, Andy Lutomirski wrote: Still no good. We can migrate a bunch of times so we see the same CPU all three times There are no three times. The CPU you see here: // ... compute nanoseconds from pvti and tsc ... rmb(); } while(v != pvti-version); is the same you read here: cpu = get_cpu(); The algorithm is: 1) get a consistent (cpu, version, tsc) 1.a) get cpu 1.b) get pvti[cpu]-version, ignoring low bit 1.c) get (tsc, cpu) 1.d) if cpu from 1.a and 1.c do not match, loop 1.e) if pvti[cpu] was being updated, we'll loop later 2) compute nanoseconds from pvti[cpu] and tsc 3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't match, retry. As long as the CPU is consistent between get_cpu() and rdtscp(), there is no problem with migration, because pvti is always accessed for that one CPU. If there were any problem, it would be caught by the version check. Writing it down with two nested do...whiles makes it clearer IMHO. and *still* don't get a consistent read, unless we play nasty games with lots of version checks (I have a patch for that, but I don't like it very much). The patch is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d but I don't like it. Thus far, I've been told unambiguously that a guest can't observe pvti while it's being written, and I think you're now telling me that this isn't true and that a guest *can* observe pvti while it's being written while the low bit of the version field is not set. If so, this is rather strongly incompatible with the spec in the KVM docs. Where am I saying that? Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On 06/01/2015 19:26, Andy Lutomirski wrote: Don't you stil need: version++; write the rest; version++; with possible smp_wmb() in there to keep the compiler from messing around? No, see my other reply. Separating the version write is a real bug, but that should be all that it's needed. Also, if you do this, can you also make setting and clearing STABLE_BIT properly atomic across all vCPUs? Or at least do something like setting it last and clearing it first on vPCU 0? That would be nice if you want to make the pvclock area fit in a single page. However, it would have to be a separate flag bit, or a separate CPUID feature. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Tue, Jan 6, 2015 at 9:38 PM, Paolo Bonzini pbonz...@redhat.com wrote: On 06/01/2015 17:56, Andy Lutomirski wrote: Still no good. We can migrate a bunch of times so we see the same CPU all three times There are no three times. The CPU you see here: // ... compute nanoseconds from pvti and tsc ... rmb(); } while(v != pvti-version); is the same you read here: cpu = get_cpu(); The algorithm is: I still don't see why this is safe, and I think that the issue is that you left out part of the loop. 1) get a consistent (cpu, version, tsc) 1.a) get cpu Suppose we observe cpu 0. 1.b) get pvti[cpu]-version, ignoring low bit Missing step, presumably here: read pvti[cpu]-tsc_timestamp, scale, etc. This could all execute on vCPU 1. We could read values that are inconsistent with each other. 1.c) get (tsc, cpu) Now we could end up back on vCPU 0. 1.d) if cpu from 1.a and 1.c do not match, loop 1.e) if pvti[cpu] was being updated, we'll loop later 2) compute nanoseconds from pvti[cpu] and tsc 3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't match, retry. As long as the CPU is consistent between get_cpu() and rdtscp(), there is no problem with migration, because pvti is always accessed for that one CPU. If there were any problem, it would be caught by the version check. Writing it down with two nested do...whiles makes it clearer IMHO. Why exactly would it be caught by the version check? My ugly patch tries to make the argument that, at any point at which we observe ourselves to be on a given vCPU, that vCPU won't be updating pvti. That means that, if version doesn't change between two consecutive observations that we're on that vCPU, then we're okay. This IMO sucks. It's fragile, it's hard to make a coherent argument about correctness, and it requires at least two getcpu-like operations to read the time. Those operations are *slow*. One is much better than two, and zero is much better than one. and *still* don't get a consistent read, unless we play nasty games with lots of version checks (I have a patch for that, but I don't like it very much). The patch is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d but I don't like it. Thus far, I've been told unambiguously that a guest can't observe pvti while it's being written, and I think you're now telling me that this isn't true and that a guest *can* observe pvti while it's being written while the low bit of the version field is not set. If so, this is rather strongly incompatible with the spec in the KVM docs. Where am I saying that? I thought the conclusion from what you and Marcelo pointed out about the code was that, once the first vCPU updated its pvti, it could start running guest code while the other vCPUs are still updating pvti, so its guest code can observe the other vCPUs mid-update. Also, if you do this, can you also make setting and clearing STABLE_BIT properly atomic across all vCPUs? Or at least do something like setting it last and clearing it first on vPCU 0? That would be nice if you want to make the pvclock area fit in a single page. However, it would have to be a separate flag bit, or a separate CPUID feature. It would be nice to have. Although I think that fixing the host to increment version pre-update and post-update may actually be good enough. Is there any case in which it would fail in practice if we made that fix and always looked at pvti 0? --Andy Paolo -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] KVM: introduce kvm_check_device
While we can easily register and unregister KVM devices, there is currently no easy way of checking whether a device has been registered. Introduce kvm_check_device() for that purpose and use it in two existing functions. Also change the return code for an invalid type number from ENOSPC to EINVAL. This function will be later used by another patch set to check whether a KVM_CREATE_IRQCHIP ioctl is valid. Signed-off-by: Andre Przywara andre.przyw...@arm.com --- Hi, can people comment whether there is an easier way to detect KVM device registration _outside_ of virt/kvm/kvm_main.c? Using the KVM_CREATE_DEVICE_TEST flag sounds like a fit, but kvm_ioctl_create_device() isn't exported. Also is my return code semantics too exerted? Cheers, Andre. include/linux/kvm_host.h |1 + virt/kvm/kvm_main.c | 33 + 2 files changed, 22 insertions(+), 12 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 83d770e..78e94ef 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1038,6 +1038,7 @@ void kvm_device_get(struct kvm_device *dev); void kvm_device_put(struct kvm_device *dev); struct kvm_device *kvm_device_from_filp(struct file *filp); int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type); +int kvm_check_device(u32 type); void kvm_unregister_device_ops(u32 type); extern struct kvm_device_ops kvm_mpic_ops; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 1cc6e2e..42604a9 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2341,13 +2341,24 @@ static struct kvm_device_ops *kvm_device_ops_table[KVM_DEV_TYPE_MAX] = { #endif }; -int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type) +int kvm_check_device(u32 type) { if (type = ARRAY_SIZE(kvm_device_ops_table)) - return -ENOSPC; + return -EINVAL; - if (kvm_device_ops_table[type] != NULL) - return -EEXIST; + if (kvm_device_ops_table[type] == NULL) + return -ENODEV; + + return -EEXIST; +} + +int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type) +{ + int ret; + + ret = kvm_check_device(type); + if (ret != -ENODEV) + return ret; kvm_device_ops_table[type] = ops; return 0; @@ -2364,19 +2375,17 @@ static int kvm_ioctl_create_device(struct kvm *kvm, { struct kvm_device_ops *ops = NULL; struct kvm_device *dev; - bool test = cd-flags KVM_CREATE_DEVICE_TEST; int ret; - if (cd-type = ARRAY_SIZE(kvm_device_ops_table)) - return -ENODEV; - - ops = kvm_device_ops_table[cd-type]; - if (ops == NULL) - return -ENODEV; + ret = kvm_check_device(cd-type); + if (ret != -EEXIST) + return ret; - if (test) + if (cd-flags KVM_CREATE_DEVICE_TEST) return 0; + ops = kvm_device_ops_table[cd-type]; + dev = kzalloc(sizeof(*dev), GFP_KERNEL); if (!dev) return -ENOMEM; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti mtosa...@redhat.com wrote: On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote: On Jan 6, 2015 4:01 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 06/01/2015 09:42, Paolo Bonzini wrote: Still confused. So we can freeze all vCPUs in the host, then update pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM doesn't increment the version pre-update, and we can return completely bogus results. Yes. But then the getcpu test would fail (1-0). Even if you have an ABA situation (1-0-1), it's okay because the pvti that is fetched is the one returned by the first getcpu. ... this case of partial update of pvti, which is caught by the version field, if of course different from the other (extremely unlikely) that Andy pointed out. That is when the getcpus are done on the same vCPU, but the rdtsc is another. That one can be fixed by rdtscp, like do { // get a consistent (pvti, v, tsc) tuple do { cpu = get_cpu(); pvti = get_pvti(cpu); v = pvti-version ~1; // also acts as rmb(); rdtsc_barrier(); tsc = rdtscp(cpu1); Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD specified it that way and both AMD and Intel implement it correctly. (rdtsc, on the other hand, definitely needs the barrier beforehand.) // control dependency, no need for rdtsc_barrier? } while(cpu != cpu1); // ... compute nanoseconds from pvti and tsc ... rmb(); } while(v != pvti-version); Still no good. We can migrate a bunch of times so we see the same CPU all three times and *still* don't get a consistent read, unless we play nasty games with lots of version checks (I have a patch for that, but I don't like it very much). The patch is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d but I don't like it. Thus far, I've been told unambiguously that a guest can't observe pvti while it's being written, and I think you're now telling me that this isn't true and that a guest *can* observe pvti while it's being written while the low bit of the version field is not set. If so, this is rather strongly incompatible with the spec in the KVM docs. I don't suppose that you and Marcelo could agree on what the actual semantics that KVM provides are and could write it down in a way that people who haven't spent a long time staring at the request code understand? And maybe you could even fix the implementation while you're at it if the implementation is, indeed, broken. I have ugly patches to fix it here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=3b718a050cba52563d831febc2e1ca184c02bac0 but I'm not thrilled with them. --Andy I suppose that separating the version write from the rest of the pvclock structure is sufficient, as that would guarantee the writes are not reordered even with fast string REP MOVS. Thanks for catching this Andy! Don't you stil need: version++; write the rest; version++; with possible smp_wmb() in there to keep the compiler from messing around? Also, if you do this, can you also make setting and clearing STABLE_BIT properly atomic across all vCPUs? Or at least do something like setting it last and clearing it first on vPCU 0? --Andy -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Jan 6, 2015 4:01 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 06/01/2015 09:42, Paolo Bonzini wrote: Still confused. So we can freeze all vCPUs in the host, then update pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM doesn't increment the version pre-update, and we can return completely bogus results. Yes. But then the getcpu test would fail (1-0). Even if you have an ABA situation (1-0-1), it's okay because the pvti that is fetched is the one returned by the first getcpu. ... this case of partial update of pvti, which is caught by the version field, if of course different from the other (extremely unlikely) that Andy pointed out. That is when the getcpus are done on the same vCPU, but the rdtsc is another. That one can be fixed by rdtscp, like do { // get a consistent (pvti, v, tsc) tuple do { cpu = get_cpu(); pvti = get_pvti(cpu); v = pvti-version ~1; // also acts as rmb(); rdtsc_barrier(); tsc = rdtscp(cpu1); Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD specified it that way and both AMD and Intel implement it correctly. (rdtsc, on the other hand, definitely needs the barrier beforehand.) // control dependency, no need for rdtsc_barrier? } while(cpu != cpu1); // ... compute nanoseconds from pvti and tsc ... rmb(); } while(v != pvti-version); Still no good. We can migrate a bunch of times so we see the same CPU all three times and *still* don't get a consistent read, unless we play nasty games with lots of version checks (I have a patch for that, but I don't like it very much). The patch is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d but I don't like it. Thus far, I've been told unambiguously that a guest can't observe pvti while it's being written, and I think you're now telling me that this isn't true and that a guest *can* observe pvti while it's being written while the low bit of the version field is not set. If so, this is rather strongly incompatible with the spec in the KVM docs. I don't suppose that you and Marcelo could agree on what the actual semantics that KVM provides are and could write it down in a way that people who haven't spent a long time staring at the request code understand? And maybe you could even fix the implementation while you're at it if the implementation is, indeed, broken. I have ugly patches to fix it here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=3b718a050cba52563d831febc2e1ca184c02bac0 but I'm not thrilled with them. --Andy -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote: On Jan 6, 2015 4:01 AM, Paolo Bonzini pbonz...@redhat.com wrote: On 06/01/2015 09:42, Paolo Bonzini wrote: Still confused. So we can freeze all vCPUs in the host, then update pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM doesn't increment the version pre-update, and we can return completely bogus results. Yes. But then the getcpu test would fail (1-0). Even if you have an ABA situation (1-0-1), it's okay because the pvti that is fetched is the one returned by the first getcpu. ... this case of partial update of pvti, which is caught by the version field, if of course different from the other (extremely unlikely) that Andy pointed out. That is when the getcpus are done on the same vCPU, but the rdtsc is another. That one can be fixed by rdtscp, like do { // get a consistent (pvti, v, tsc) tuple do { cpu = get_cpu(); pvti = get_pvti(cpu); v = pvti-version ~1; // also acts as rmb(); rdtsc_barrier(); tsc = rdtscp(cpu1); Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD specified it that way and both AMD and Intel implement it correctly. (rdtsc, on the other hand, definitely needs the barrier beforehand.) // control dependency, no need for rdtsc_barrier? } while(cpu != cpu1); // ... compute nanoseconds from pvti and tsc ... rmb(); } while(v != pvti-version); Still no good. We can migrate a bunch of times so we see the same CPU all three times and *still* don't get a consistent read, unless we play nasty games with lots of version checks (I have a patch for that, but I don't like it very much). The patch is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d but I don't like it. Thus far, I've been told unambiguously that a guest can't observe pvti while it's being written, and I think you're now telling me that this isn't true and that a guest *can* observe pvti while it's being written while the low bit of the version field is not set. If so, this is rather strongly incompatible with the spec in the KVM docs. I don't suppose that you and Marcelo could agree on what the actual semantics that KVM provides are and could write it down in a way that people who haven't spent a long time staring at the request code understand? And maybe you could even fix the implementation while you're at it if the implementation is, indeed, broken. I have ugly patches to fix it here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=3b718a050cba52563d831febc2e1ca184c02bac0 but I'm not thrilled with them. --Andy I suppose that separating the version write from the rest of the pvclock structure is sufficient, as that would guarantee the writes are not reordered even with fast string REP MOVS. Thanks for catching this Andy! -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html