date:20150106



On 05/01/2015 20:17, Marcelo Tosatti wrote:
 But there is no guarantee that vCPU-N has updated its pvti when
 vCPU-M resumes guest instruction execution.

You're right.

 So the cost this patch removes is mainly from __getcpu (==RDTSCP?) ?
 Perhaps you can use Gleb's idea to stick vcpu id into version field ?

Or just replace __getcpu with rdtscp.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader



On 05/01/2015 23:48, Marcelo Tosatti wrote:
   But there is no guarantee that vCPU-N has updated its pvti when
   vCPU-M resumes guest instruction execution.
  
  Still confused.  So we can freeze all vCPUs in the host, then update
  pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
  a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
  doesn't increment the version pre-update, and we can return completely
  bogus results.
 Yes.

But then the getcpu test would fail (1-0).  Even if you have an ABA
situation (1-0-1), it's okay because the pvti that is fetched is the
one returned by the first getcpu.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 01/20] vfio/platform: initial skeleton of VFIO support for platform devices

This patch forms the common skeleton code for platform devices support
with VFIO. This will include the core functionality of VFIO_PLATFORM,
however binding to the device and discovering the device resources will
be done with the help of a separate file where any Linux platform bus
specific code will reside.

This will allow us to implement support for also discovering AMBA devices
and their resources, but still reuse a large part of the VFIO_PLATFORM
implementation.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_common.c  | 121 ++
 drivers/vfio/platform/vfio_platform_private.h |  36 
 2 files changed, 157 insertions(+)
 create mode 100644 drivers/vfio/platform/vfio_platform_common.c
 create mode 100644 drivers/vfio/platform/vfio_platform_private.h

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
new file mode 100644
index 000..34d023b
--- /dev/null
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -0,0 +1,121 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis a.mota...@virtualopensystems.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include linux/device.h
+#include linux/iommu.h
+#include linux/module.h
+#include linux/mutex.h
+#include linux/slab.h
+#include linux/types.h
+#include linux/vfio.h
+
+#include vfio_platform_private.h
+
+static void vfio_platform_release(void *device_data)
+{
+   module_put(THIS_MODULE);
+}
+
+static int vfio_platform_open(void *device_data)
+{
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   return 0;
+}
+
+static long vfio_platform_ioctl(void *device_data,
+   unsigned int cmd, unsigned long arg)
+{
+   if (cmd == VFIO_DEVICE_GET_INFO)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_SET_IRQS)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_RESET)
+   return -EINVAL;
+
+   return -ENOTTY;
+}
+
+static ssize_t vfio_platform_read(void *device_data, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+   return -EINVAL;
+}
+
+static ssize_t vfio_platform_write(void *device_data, const char __user *buf,
+  size_t count, loff_t *ppos)
+{
+   return -EINVAL;
+}
+
+static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma)
+{
+   return -EINVAL;
+}
+
+static const struct vfio_device_ops vfio_platform_ops = {
+   .name   = vfio-platform,
+   .open   = vfio_platform_open,
+   .release= vfio_platform_release,
+   .ioctl  = vfio_platform_ioctl,
+   .read   = vfio_platform_read,
+   .write  = vfio_platform_write,
+   .mmap   = vfio_platform_mmap,
+};
+
+int vfio_platform_probe_common(struct vfio_platform_device *vdev,
+  struct device *dev)
+{
+   struct iommu_group *group;
+   int ret;
+
+   if (!vdev)
+   return -EINVAL;
+
+   group = iommu_group_get(dev);
+   if (!group) {
+   pr_err(VFIO: No IOMMU group for device %s\n, vdev-name);
+   return -EINVAL;
+   }
+
+   ret = vfio_add_group_dev(dev, vfio_platform_ops, vdev);
+   if (ret) {
+   iommu_group_put(group);
+   return ret;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_platform_probe_common);
+
+struct vfio_platform_device *vfio_platform_remove_common(struct device *dev)
+{
+   struct vfio_platform_device *vdev;
+
+   vdev = vfio_del_group_dev(dev);
+   if (vdev)
+   iommu_group_put(dev-iommu_group);
+
+   return vdev;
+}
+EXPORT_SYMBOL_GPL(vfio_platform_remove_common);
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
new file mode 100644
index 000..062b92d
--- /dev/null
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis a.mota...@virtualopensystems.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ *

[PATCH v11 06/20] vfio/platform: return info for bound device

A VFIO userspace driver will start by opening the VFIO device
that corresponds to an IOMMU group, and will use the ioctl interface
to get the basic device info, such as number of memory regions and
interrupts, and their properties. This patch enables the
VFIO_DEVICE_GET_INFO ioctl call.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_common.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 34d023b..862b43b 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -38,10 +38,27 @@ static int vfio_platform_open(void *device_data)
 static long vfio_platform_ioctl(void *device_data,
unsigned int cmd, unsigned long arg)
 {
-   if (cmd == VFIO_DEVICE_GET_INFO)
-   return -EINVAL;
+   struct vfio_platform_device *vdev = device_data;
+   unsigned long minsz;
+
+   if (cmd == VFIO_DEVICE_GET_INFO) {
+   struct vfio_device_info info;
+
+   minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+   if (copy_from_user(info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz  minsz)
+   return -EINVAL;
+
+   info.flags = vdev-flags;
+   info.num_regions = 0;
+   info.num_irqs = 0;
+
+   return copy_to_user((void __user *)arg, info, minsz);
 
-   else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
+   } else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
return -EINVAL;
 
else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 16/20] vfio: add local lock for virqfd instead of depending on VFIO PCI

The Virqfd code needs to keep accesses to any struct *virqfd safe, but
this comes into play only when creating or destroying eventfds, so sharing
the same spinlock with the VFIO bus driver is not necessary.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/pci/vfio_pci_intrs.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index a5378d5..b35bc16 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -44,6 +44,7 @@ struct virqfd {
 };
 
 static struct workqueue_struct *vfio_irqfd_cleanup_wq;
+DEFINE_SPINLOCK(virqfd_lock);
 
 int __init vfio_virqfd_init(void)
 {
@@ -80,21 +81,21 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, 
int sync, void *key)
 
if (flags  POLLHUP) {
unsigned long flags;
-   spin_lock_irqsave(virqfd-vdev-irqlock, flags);
+   spin_lock_irqsave(virqfd_lock, flags);
 
/*
 * The eventfd is closing, if the virqfd has not yet been
 * queued for release, as determined by testing whether the
-* vdev pointer to it is still valid, queue it now.  As
+* virqfd pointer to it is still valid, queue it now.  As
 * with kvm irqfds, we know we won't race against the virqfd
-* going away because we hold wqh-lock to get here.
+* going away because we hold the lock to get here.
 */
if (*(virqfd-pvirqfd) == virqfd) {
*(virqfd-pvirqfd) = NULL;
virqfd_deactivate(virqfd);
}
 
-   spin_unlock_irqrestore(virqfd-vdev-irqlock, flags);
+   spin_unlock_irqrestore(virqfd_lock, flags);
}
 
return 0;
@@ -170,16 +171,16 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev,
 * we update the pointer to the virqfd under lock to avoid
 * pushing multiple jobs to release the same virqfd.
 */
-   spin_lock_irq(vdev-irqlock);
+   spin_lock_irq(virqfd_lock);
 
if (*pvirqfd) {
-   spin_unlock_irq(vdev-irqlock);
+   spin_unlock_irq(virqfd_lock);
ret = -EBUSY;
goto err_busy;
}
*pvirqfd = virqfd;
 
-   spin_unlock_irq(vdev-irqlock);
+   spin_unlock_irq(virqfd_lock);
 
/*
 * Install our own custom wake-up handling so we are notified via
@@ -217,18 +218,18 @@ err_fd:
 }
 EXPORT_SYMBOL_GPL(vfio_virqfd_enable);
 
-void vfio_virqfd_disable(struct vfio_pci_device *vdev, struct virqfd **pvirqfd)
+void vfio_virqfd_disable(struct virqfd **pvirqfd)
 {
unsigned long flags;
 
-   spin_lock_irqsave(vdev-irqlock, flags);
+   spin_lock_irqsave(virqfd_lock, flags);
 
if (*pvirqfd) {
virqfd_deactivate(*pvirqfd);
*pvirqfd = NULL;
}
 
-   spin_unlock_irqrestore(vdev-irqlock, flags);
+   spin_unlock_irqrestore(virqfd_lock, flags);
 
/*
 * Block until we know all outstanding shutdown jobs have completed.
@@ -441,8 +442,8 @@ static int vfio_intx_set_signal(struct vfio_pci_device 
*vdev, int fd)
 static void vfio_intx_disable(struct vfio_pci_device *vdev)
 {
vfio_intx_set_signal(vdev, -1);
-   vfio_virqfd_disable(vdev, vdev-ctx[0].unmask);
-   vfio_virqfd_disable(vdev, vdev-ctx[0].mask);
+   vfio_virqfd_disable(vdev-ctx[0].unmask);
+   vfio_virqfd_disable(vdev-ctx[0].mask);
vdev-irq_type = VFIO_PCI_NUM_IRQS;
vdev-num_ctx = 0;
kfree(vdev-ctx);
@@ -606,8 +607,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, 
bool msix)
vfio_msi_set_block(vdev, 0, vdev-num_ctx, NULL, msix);
 
for (i = 0; i  vdev-num_ctx; i++) {
-   vfio_virqfd_disable(vdev, vdev-ctx[i].unmask);
-   vfio_virqfd_disable(vdev, vdev-ctx[i].mask);
+   vfio_virqfd_disable(vdev-ctx[i].unmask);
+   vfio_virqfd_disable(vdev-ctx[i].mask);
}
 
if (msix) {
@@ -645,7 +646,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device 
*vdev,
  vfio_send_intx_eventfd, NULL,
  vdev-ctx[0].unmask, fd);
 
-   vfio_virqfd_disable(vdev, vdev-ctx[0].unmask);
+   vfio_virqfd_disable(vdev-ctx[0].unmask);
}
 
return 0;
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 13/20] vfio/platform: support for level sensitive interrupts

Level sensitive interrupts are exposed as maskable and automasked
interrupts and are masked and disabled automatically when they fire.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_irq.c | 99 ++-
 drivers/vfio/platform/vfio_platform_private.h |  2 +
 2 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
b/drivers/vfio/platform/vfio_platform_irq.c
index 4896581..b3602cb 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -23,12 +23,59 @@
 
 #include vfio_platform_private.h
 
+static void vfio_platform_mask(struct vfio_platform_irq *irq_ctx)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(irq_ctx-lock, flags);
+
+   if (!irq_ctx-masked) {
+   disable_irq_nosync(irq_ctx-hwirq);
+   irq_ctx-masked = true;
+   }
+
+   spin_unlock_irqrestore(irq_ctx-lock, flags);
+}
+
 static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
  unsigned index, unsigned start,
  unsigned count, uint32_t flags,
  void *data)
 {
-   return -EINVAL;
+   if (start != 0 || count != 1)
+   return -EINVAL;
+
+   if (!(vdev-irqs[index].flags  VFIO_IRQ_INFO_MASKABLE))
+   return -EINVAL;
+
+   if (flags  VFIO_IRQ_SET_DATA_EVENTFD)
+   return -EINVAL; /* not implemented yet */
+
+   if (flags  VFIO_IRQ_SET_DATA_NONE) {
+   vfio_platform_mask(vdev-irqs[index]);
+
+   } else if (flags  VFIO_IRQ_SET_DATA_BOOL) {
+   uint8_t mask = *(uint8_t *)data;
+
+   if (mask)
+   vfio_platform_mask(vdev-irqs[index]);
+   }
+
+   return 0;
+}
+
+static void vfio_platform_unmask(struct vfio_platform_irq *irq_ctx)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(irq_ctx-lock, flags);
+
+   if (irq_ctx-masked) {
+   enable_irq(irq_ctx-hwirq);
+   irq_ctx-masked = false;
+   }
+
+   spin_unlock_irqrestore(irq_ctx-lock, flags);
 }
 
 static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
@@ -36,7 +83,50 @@ static int vfio_platform_set_irq_unmask(struct 
vfio_platform_device *vdev,
unsigned count, uint32_t flags,
void *data)
 {
-   return -EINVAL;
+   if (start != 0 || count != 1)
+   return -EINVAL;
+
+   if (!(vdev-irqs[index].flags  VFIO_IRQ_INFO_MASKABLE))
+   return -EINVAL;
+
+   if (flags  VFIO_IRQ_SET_DATA_EVENTFD)
+   return -EINVAL; /* not implemented yet */
+
+   if (flags  VFIO_IRQ_SET_DATA_NONE) {
+   vfio_platform_unmask(vdev-irqs[index]);
+
+   } else if (flags  VFIO_IRQ_SET_DATA_BOOL) {
+   uint8_t unmask = *(uint8_t *)data;
+
+   if (unmask)
+   vfio_platform_unmask(vdev-irqs[index]);
+   }
+
+   return 0;
+}
+
+static irqreturn_t vfio_automasked_irq_handler(int irq, void *dev_id)
+{
+   struct vfio_platform_irq *irq_ctx = dev_id;
+   unsigned long flags;
+   int ret = IRQ_NONE;
+
+   spin_lock_irqsave(irq_ctx-lock, flags);
+
+   if (!irq_ctx-masked) {
+   ret = IRQ_HANDLED;
+
+   /* automask maskable interrupts */
+   disable_irq_nosync(irq_ctx-hwirq);
+   irq_ctx-masked = true;
+   }
+
+   spin_unlock_irqrestore(irq_ctx-lock, flags);
+
+   if (ret == IRQ_HANDLED)
+   eventfd_signal(irq_ctx-trigger, 1);
+
+   return ret;
 }
 
 static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
@@ -106,7 +196,7 @@ static int vfio_platform_set_irq_trigger(struct 
vfio_platform_device *vdev,
irq_handler_t handler;
 
if (vdev-irqs[index].flags  VFIO_IRQ_INFO_AUTOMASKED)
-   return -EINVAL; /* not implemented */
+   handler = vfio_automasked_irq_handler;
else
handler = vfio_irq_handler;
 
@@ -178,6 +268,8 @@ int vfio_platform_irq_init(struct vfio_platform_device 
*vdev)
if (hwirq  0)
goto err;
 
+   spin_lock_init(vdev-irqs[i].lock);
+
vdev-irqs[i].flags = VFIO_IRQ_INFO_EVENTFD;
 
if (irq_get_trigger_type(hwirq)  IRQ_TYPE_LEVEL_MASK)
@@ -186,6 +278,7 @@ int vfio_platform_irq_init(struct vfio_platform_device 
*vdev)
 
vdev-irqs[i].count = 1;
vdev-irqs[i].hwirq = hwirq;
+   vdev-irqs[i].masked = false;
}
 
vdev-num_irqs = cnt;
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index b705f17..eef6d1b 100644
--- a/drivers/vfio/platform/vfio_platform_private.h

[PATCH v11 20/20] vfio/platform: implement IRQ masking/unmasking via an eventfd

With this patch the VFIO user will be able to set an eventfd that can be
used in order to mask and unmask IRQs of platform devices.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_irq.c | 47 ---
 drivers/vfio/platform/vfio_platform_private.h |  2 ++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
b/drivers/vfio/platform/vfio_platform_irq.c
index b3602cb..6ade36b 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -37,6 +37,15 @@ static void vfio_platform_mask(struct vfio_platform_irq 
*irq_ctx)
spin_unlock_irqrestore(irq_ctx-lock, flags);
 }
 
+static int vfio_platform_mask_handler(void *opaque, void *unused)
+{
+   struct vfio_platform_irq *irq_ctx = opaque;
+
+   vfio_platform_mask(irq_ctx);
+
+   return 0;
+}
+
 static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
  unsigned index, unsigned start,
  unsigned count, uint32_t flags,
@@ -48,8 +57,18 @@ static int vfio_platform_set_irq_mask(struct 
vfio_platform_device *vdev,
if (!(vdev-irqs[index].flags  VFIO_IRQ_INFO_MASKABLE))
return -EINVAL;
 
-   if (flags  VFIO_IRQ_SET_DATA_EVENTFD)
-   return -EINVAL; /* not implemented yet */
+   if (flags  VFIO_IRQ_SET_DATA_EVENTFD) {
+   int32_t fd = *(int32_t *)data;
+
+   if (fd = 0)
+   return vfio_virqfd_enable((void *) vdev-irqs[index],
+ vfio_platform_mask_handler,
+ NULL, NULL,
+ vdev-irqs[index].mask, fd);
+
+   vfio_virqfd_disable(vdev-irqs[index].mask);
+   return 0;
+   }
 
if (flags  VFIO_IRQ_SET_DATA_NONE) {
vfio_platform_mask(vdev-irqs[index]);
@@ -78,6 +97,15 @@ static void vfio_platform_unmask(struct vfio_platform_irq 
*irq_ctx)
spin_unlock_irqrestore(irq_ctx-lock, flags);
 }
 
+static int vfio_platform_unmask_handler(void *opaque, void *unused)
+{
+   struct vfio_platform_irq *irq_ctx = opaque;
+
+   vfio_platform_unmask(irq_ctx);
+
+   return 0;
+}
+
 static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
unsigned index, unsigned start,
unsigned count, uint32_t flags,
@@ -89,8 +117,19 @@ static int vfio_platform_set_irq_unmask(struct 
vfio_platform_device *vdev,
if (!(vdev-irqs[index].flags  VFIO_IRQ_INFO_MASKABLE))
return -EINVAL;
 
-   if (flags  VFIO_IRQ_SET_DATA_EVENTFD)
-   return -EINVAL; /* not implemented yet */
+   if (flags  VFIO_IRQ_SET_DATA_EVENTFD) {
+   int32_t fd = *(int32_t *)data;
+
+   if (fd = 0)
+   return vfio_virqfd_enable((void *) vdev-irqs[index],
+ vfio_platform_unmask_handler,
+ NULL, NULL,
+ vdev-irqs[index].unmask,
+ fd);
+
+   vfio_virqfd_disable(vdev-irqs[index].unmask);
+   return 0;
+   }
 
if (flags  VFIO_IRQ_SET_DATA_NONE) {
vfio_platform_unmask(vdev-irqs[index]);
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index eef6d1b..c3d5b4b 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -32,6 +32,8 @@ struct vfio_platform_irq {
struct eventfd_ctx  *trigger;
boolmasked;
spinlock_t  lock;
+   struct virqfd   *unmask;
+   struct virqfd   *mask;
 };
 
 struct vfio_platform_region {
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 18/20] vfio: move eventfd support code for VFIO_PCI to a separate file

The virqfd functionality that is used by VFIO_PCI to implement interrupt
masking and unmasking via an eventfd, is generic enough and can be reused
by another driver. Move it to a separate file in order to allow the code
to be shared.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/pci/Makefile   |   3 +-
 drivers/vfio/pci/vfio_pci_intrs.c   | 215 
 drivers/vfio/pci/vfio_pci_private.h |   3 -
 drivers/vfio/virqfd.c   | 213 +++
 include/linux/vfio.h|  27 +
 5 files changed, 242 insertions(+), 219 deletions(-)
 create mode 100644 drivers/vfio/virqfd.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 1310792..c7c8644 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,4 +1,5 @@
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o \
+ ../virqfd.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 5b5fc23..de4befc 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -19,228 +19,13 @@
 #include linux/msi.h
 #include linux/pci.h
 #include linux/file.h
-#include linux/poll.h
 #include linux/vfio.h
 #include linux/wait.h
-#include linux/workqueue.h
 #include linux/slab.h
 
 #include vfio_pci_private.h
 
 /*
- * IRQfd - generic
- */
-struct virqfd {
-   void*opaque;
-   struct eventfd_ctx  *eventfd;
-   int (*handler)(void *, void *);
-   void(*thread)(void *, void *);
-   void*data;
-   struct work_struct  inject;
-   wait_queue_twait;
-   poll_table  pt;
-   struct work_struct  shutdown;
-   struct virqfd   **pvirqfd;
-};
-
-static struct workqueue_struct *vfio_irqfd_cleanup_wq;
-DEFINE_SPINLOCK(virqfd_lock);
-
-int __init vfio_virqfd_init(void)
-{
-   vfio_irqfd_cleanup_wq =
-   create_singlethread_workqueue(vfio-irqfd-cleanup);
-   if (!vfio_irqfd_cleanup_wq)
-   return -ENOMEM;
-
-   return 0;
-}
-
-void vfio_virqfd_exit(void)
-{
-   destroy_workqueue(vfio_irqfd_cleanup_wq);
-}
-
-static void virqfd_deactivate(struct virqfd *virqfd)
-{
-   queue_work(vfio_irqfd_cleanup_wq, virqfd-shutdown);
-}
-
-static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void 
*key)
-{
-   struct virqfd *virqfd = container_of(wait, struct virqfd, wait);
-   unsigned long flags = (unsigned long)key;
-
-   if (flags  POLLIN) {
-   /* An event has been signaled, call function */
-   if ((!virqfd-handler ||
-virqfd-handler(virqfd-opaque, virqfd-data)) 
-   virqfd-thread)
-   schedule_work(virqfd-inject);
-   }
-
-   if (flags  POLLHUP) {
-   unsigned long flags;
-   spin_lock_irqsave(virqfd_lock, flags);
-
-   /*
-* The eventfd is closing, if the virqfd has not yet been
-* queued for release, as determined by testing whether the
-* virqfd pointer to it is still valid, queue it now.  As
-* with kvm irqfds, we know we won't race against the virqfd
-* going away because we hold the lock to get here.
-*/
-   if (*(virqfd-pvirqfd) == virqfd) {
-   *(virqfd-pvirqfd) = NULL;
-   virqfd_deactivate(virqfd);
-   }
-
-   spin_unlock_irqrestore(virqfd_lock, flags);
-   }
-
-   return 0;
-}
-
-static void virqfd_ptable_queue_proc(struct file *file,
-wait_queue_head_t *wqh, poll_table *pt)
-{
-   struct virqfd *virqfd = container_of(pt, struct virqfd, pt);
-   add_wait_queue(wqh, virqfd-wait);
-}
-
-static void virqfd_shutdown(struct work_struct *work)
-{
-   struct virqfd *virqfd = container_of(work, struct virqfd, shutdown);
-   u64 cnt;
-
-   eventfd_ctx_remove_wait_queue(virqfd-eventfd, virqfd-wait, cnt);
-   flush_work(virqfd-inject);
-   eventfd_ctx_put(virqfd-eventfd);
-
-   kfree(virqfd);
-}
-
-static void virqfd_inject(struct work_struct *work)
-{
-   struct virqfd *virqfd = container_of(work, struct virqfd, inject);
-   if (virqfd-thread)
-   virqfd-thread(virqfd-opaque, virqfd-data);
-}
-
-int vfio_virqfd_enable(void *opaque,
-  int (*handler)(void *, void *),
-  void (*thread)(void *, void *),
-  void *data, struct virqfd **pvirqfd, int fd)
-{
-   struct fd irqfd;
-   struct eventfd_ctx *ctx;
-   struct virqfd *virqfd;
-   int ret

[PATCH v11 17/20] vfio: pass an opaque pointer on virqfd initialization

VFIO_PCI passes the VFIO device structure *vdev via eventfd to the handler
that implements masking/unmasking of IRQs via an eventfd. We can replace
it in the virqfd infrastructure with an opaque type so we can make use
of the mechanism from other VFIO bus drivers.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/pci/vfio_pci_intrs.c | 30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index b35bc16..5b5fc23 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -31,10 +31,10 @@
  * IRQfd - generic
  */
 struct virqfd {
-   struct vfio_pci_device  *vdev;
+   void*opaque;
struct eventfd_ctx  *eventfd;
-   int (*handler)(struct vfio_pci_device *, void *);
-   void(*thread)(struct vfio_pci_device *, void *);
+   int (*handler)(void *, void *);
+   void(*thread)(void *, void *);
void*data;
struct work_struct  inject;
wait_queue_twait;
@@ -74,7 +74,7 @@ static int virqfd_wakeup(wait_queue_t *wait, unsigned mode, 
int sync, void *key)
if (flags  POLLIN) {
/* An event has been signaled, call function */
if ((!virqfd-handler ||
-virqfd-handler(virqfd-vdev, virqfd-data)) 
+virqfd-handler(virqfd-opaque, virqfd-data)) 
virqfd-thread)
schedule_work(virqfd-inject);
}
@@ -124,12 +124,12 @@ static void virqfd_inject(struct work_struct *work)
 {
struct virqfd *virqfd = container_of(work, struct virqfd, inject);
if (virqfd-thread)
-   virqfd-thread(virqfd-vdev, virqfd-data);
+   virqfd-thread(virqfd-opaque, virqfd-data);
 }
 
-int vfio_virqfd_enable(struct vfio_pci_device *vdev,
-  int (*handler)(struct vfio_pci_device *, void *),
-  void (*thread)(struct vfio_pci_device *, void *),
+int vfio_virqfd_enable(void *opaque,
+  int (*handler)(void *, void *),
+  void (*thread)(void *, void *),
   void *data, struct virqfd **pvirqfd, int fd)
 {
struct fd irqfd;
@@ -143,7 +143,7 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev,
return -ENOMEM;
 
virqfd-pvirqfd = pvirqfd;
-   virqfd-vdev = vdev;
+   virqfd-opaque = opaque;
virqfd-handler = handler;
virqfd-thread = thread;
virqfd-data = data;
@@ -196,7 +196,7 @@ int vfio_virqfd_enable(struct vfio_pci_device *vdev,
 * before we registered and trigger it as if we didn't miss it.
 */
if (events  POLLIN) {
-   if ((!handler || handler(vdev, data))  thread)
+   if ((!handler || handler(opaque, data))  thread)
schedule_work(virqfd-inject);
}
 
@@ -243,8 +243,10 @@ EXPORT_SYMBOL_GPL(vfio_virqfd_disable);
 /*
  * INTx
  */
-static void vfio_send_intx_eventfd(struct vfio_pci_device *vdev, void *unused)
+static void vfio_send_intx_eventfd(void *opaque, void *unused)
 {
+   struct vfio_pci_device *vdev = opaque;
+
if (likely(is_intx(vdev)  !vdev-virq_disabled))
eventfd_signal(vdev-ctx[0].trigger, 1);
 }
@@ -287,9 +289,9 @@ void vfio_pci_intx_mask(struct vfio_pci_device *vdev)
  * a signal is necessary, which can then be handled via a work queue
  * or directly depending on the caller.
  */
-static int vfio_pci_intx_unmask_handler(struct vfio_pci_device *vdev,
-   void *unused)
+static int vfio_pci_intx_unmask_handler(void *opaque, void *unused)
 {
+   struct vfio_pci_device *vdev = opaque;
struct pci_dev *pdev = vdev-pdev;
unsigned long flags;
int ret = 0;
@@ -641,7 +643,7 @@ static int vfio_pci_set_intx_unmask(struct vfio_pci_device 
*vdev,
} else if (flags  VFIO_IRQ_SET_DATA_EVENTFD) {
int32_t fd = *(int32_t *)data;
if (fd = 0)
-   return vfio_virqfd_enable(vdev,
+   return vfio_virqfd_enable((void *) vdev,
  vfio_pci_intx_unmask_handler,
  vfio_send_intx_eventfd, NULL,
  vdev-ctx[0].unmask, fd);
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 12/20] vfio/platform: trigger an interrupt via eventfd

This patch allows to set an eventfd for a platform device's interrupt,
and also to trigger the interrupt eventfd from userspace for testing.
Level sensitive interrupts are marked as maskable and are handled in
a later patch. Edge triggered interrupts are not advertised as maskable
and are implemented here using a simple and efficient IRQ handler.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_irq.c | 102 +-
 drivers/vfio/platform/vfio_platform_private.h |   2 +
 2 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
b/drivers/vfio/platform/vfio_platform_irq.c
index df5c919..4896581 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -39,12 +39,100 @@ static int vfio_platform_set_irq_unmask(struct 
vfio_platform_device *vdev,
return -EINVAL;
 }
 
+static irqreturn_t vfio_irq_handler(int irq, void *dev_id)
+{
+   struct vfio_platform_irq *irq_ctx = dev_id;
+
+   eventfd_signal(irq_ctx-trigger, 1);
+
+   return IRQ_HANDLED;
+}
+
+static int vfio_set_trigger(struct vfio_platform_device *vdev, int index,
+   int fd, irq_handler_t handler)
+{
+   struct vfio_platform_irq *irq = vdev-irqs[index];
+   struct eventfd_ctx *trigger;
+   unsigned long flags;
+   int ret;
+
+   if (irq-trigger) {
+   free_irq(irq-hwirq, irq);
+   kfree(irq-name);
+   eventfd_ctx_put(irq-trigger);
+   irq-trigger = NULL;
+   }
+
+   if (fd  0) /* Disable only */
+   return 0;
+
+   irq-name = kasprintf(GFP_KERNEL, vfio-irq[%d](%s),
+   irq-hwirq, vdev-name);
+   if (!irq-name)
+   return -ENOMEM;
+
+   trigger = eventfd_ctx_fdget(fd);
+   if (IS_ERR(trigger)) {
+   kfree(irq-name);
+   return PTR_ERR(trigger);
+   }
+
+   irq-trigger = trigger;
+
+   ret = request_irq(irq-hwirq, handler, 0, irq-name, irq);
+   if (ret) {
+   kfree(irq-name);
+   eventfd_ctx_put(trigger);
+   irq-trigger = NULL;
+   return ret;
+   }
+
+   /* if the IRQ has been masked by the user before setting an eventfd,
+* then we need to make sure it is properly disabled */
+   spin_lock_irqsave(irq-lock, flags);
+   if (irq-masked)
+   disable_irq_nosync(irq-hwirq);
+   spin_unlock_irqrestore(irq-lock, flags);
+
+   return 0;
+}
+
 static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev,
 unsigned index, unsigned start,
 unsigned count, uint32_t flags,
 void *data)
 {
-   return -EINVAL;
+   struct vfio_platform_irq *irq = vdev-irqs[index];
+   irq_handler_t handler;
+
+   if (vdev-irqs[index].flags  VFIO_IRQ_INFO_AUTOMASKED)
+   return -EINVAL; /* not implemented */
+   else
+   handler = vfio_irq_handler;
+
+   if (!count  (flags  VFIO_IRQ_SET_DATA_NONE))
+   return vfio_set_trigger(vdev, index, -1, handler);
+
+   if (start != 0 || count != 1)
+   return -EINVAL;
+
+   if (flags  VFIO_IRQ_SET_DATA_EVENTFD) {
+   int32_t fd = *(int32_t *)data;
+
+   return vfio_set_trigger(vdev, index, fd, handler);
+   }
+
+   if (flags  VFIO_IRQ_SET_DATA_NONE) {
+   handler(irq-hwirq, irq);
+
+   } else if (flags  VFIO_IRQ_SET_DATA_BOOL) {
+   uint8_t trigger = *(uint8_t *)data;
+
+   if (trigger)
+   handler(irq-hwirq, irq);
+   }
+
+   return 0;
 }
 
 int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev,
@@ -90,7 +178,12 @@ int vfio_platform_irq_init(struct vfio_platform_device 
*vdev)
if (hwirq  0)
goto err;
 
-   vdev-irqs[i].flags = 0;
+   vdev-irqs[i].flags = VFIO_IRQ_INFO_EVENTFD;
+
+   if (irq_get_trigger_type(hwirq)  IRQ_TYPE_LEVEL_MASK)
+   vdev-irqs[i].flags |= VFIO_IRQ_INFO_MASKABLE
+   | VFIO_IRQ_INFO_AUTOMASKED;
+
vdev-irqs[i].count = 1;
vdev-irqs[i].hwirq = hwirq;
}
@@ -105,6 +198,11 @@ err:
 
 void vfio_platform_irq_cleanup(struct vfio_platform_device *vdev)
 {
+   int i;
+
+   for (i = 0; i  vdev-num_irqs; i++)
+   vfio_set_trigger(vdev, i, -1, NULL);
+
vdev-num_irqs = 0;
kfree(vdev-irqs);
 }
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index 61f3ed4..b705f17 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++

[PATCH v11 14/20] vfio: add a vfio_ prefix to virqfd_enable and virqfd_disable and export

We want to reuse virqfd functionality in multiple VFIO drivers; before
moving these functions to core VFIO, add the vfio_ prefix to the
virqfd_enable and virqfd_disable functions, and export them so they can
be used from other modules.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/pci/vfio_pci_intrs.c   | 30 --
 drivers/vfio/pci/vfio_pci_private.h |  4 ++--
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index e8d695b..0a41833d 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -126,10 +126,10 @@ static void virqfd_inject(struct work_struct *work)
virqfd-thread(virqfd-vdev, virqfd-data);
 }
 
-static int virqfd_enable(struct vfio_pci_device *vdev,
-int (*handler)(struct vfio_pci_device *, void *),
-void (*thread)(struct vfio_pci_device *, void *),
-void *data, struct virqfd **pvirqfd, int fd)
+int vfio_virqfd_enable(struct vfio_pci_device *vdev,
+  int (*handler)(struct vfio_pci_device *, void *),
+  void (*thread)(struct vfio_pci_device *, void *),
+  void *data, struct virqfd **pvirqfd, int fd)
 {
struct fd irqfd;
struct eventfd_ctx *ctx;
@@ -215,9 +215,9 @@ err_fd:
 
return ret;
 }
+EXPORT_SYMBOL_GPL(vfio_virqfd_enable);
 
-static void virqfd_disable(struct vfio_pci_device *vdev,
-  struct virqfd **pvirqfd)
+void vfio_virqfd_disable(struct vfio_pci_device *vdev, struct virqfd **pvirqfd)
 {
unsigned long flags;
 
@@ -237,6 +237,7 @@ static void virqfd_disable(struct vfio_pci_device *vdev,
 */
flush_workqueue(vfio_irqfd_cleanup_wq);
 }
+EXPORT_SYMBOL_GPL(vfio_virqfd_disable);
 
 /*
  * INTx
@@ -440,8 +441,8 @@ static int vfio_intx_set_signal(struct vfio_pci_device 
*vdev, int fd)
 static void vfio_intx_disable(struct vfio_pci_device *vdev)
 {
vfio_intx_set_signal(vdev, -1);
-   virqfd_disable(vdev, vdev-ctx[0].unmask);
-   virqfd_disable(vdev, vdev-ctx[0].mask);
+   vfio_virqfd_disable(vdev, vdev-ctx[0].unmask);
+   vfio_virqfd_disable(vdev, vdev-ctx[0].mask);
vdev-irq_type = VFIO_PCI_NUM_IRQS;
vdev-num_ctx = 0;
kfree(vdev-ctx);
@@ -605,8 +606,8 @@ static void vfio_msi_disable(struct vfio_pci_device *vdev, 
bool msix)
vfio_msi_set_block(vdev, 0, vdev-num_ctx, NULL, msix);
 
for (i = 0; i  vdev-num_ctx; i++) {
-   virqfd_disable(vdev, vdev-ctx[i].unmask);
-   virqfd_disable(vdev, vdev-ctx[i].mask);
+   vfio_virqfd_disable(vdev, vdev-ctx[i].unmask);
+   vfio_virqfd_disable(vdev, vdev-ctx[i].mask);
}
 
if (msix) {
@@ -639,11 +640,12 @@ static int vfio_pci_set_intx_unmask(struct 
vfio_pci_device *vdev,
} else if (flags  VFIO_IRQ_SET_DATA_EVENTFD) {
int32_t fd = *(int32_t *)data;
if (fd = 0)
-   return virqfd_enable(vdev, vfio_pci_intx_unmask_handler,
-vfio_send_intx_eventfd, NULL,
-vdev-ctx[0].unmask, fd);
+   return vfio_virqfd_enable(vdev,
+ vfio_pci_intx_unmask_handler,
+ vfio_send_intx_eventfd, NULL,
+ vdev-ctx[0].unmask, fd);
 
-   virqfd_disable(vdev, vdev-ctx[0].unmask);
+   vfio_virqfd_disable(vdev, vdev-ctx[0].unmask);
}
 
return 0;
diff --git a/drivers/vfio/pci/vfio_pci_private.h 
b/drivers/vfio/pci/vfio_pci_private.h
index 671c17a..2e2f0ea 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -86,8 +86,8 @@ extern ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, 
char __user *buf,
 extern int vfio_pci_init_perm_bits(void);
 extern void vfio_pci_uninit_perm_bits(void);
 
-extern int vfio_pci_virqfd_init(void);
-extern void vfio_pci_virqfd_exit(void);
+extern int vfio_virqfd_init(void);
+extern void vfio_virqfd_exit(void);
 
 extern int vfio_config_init(struct vfio_pci_device *vdev);
 extern void vfio_config_free(struct vfio_pci_device *vdev);
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 11/20] vfio/platform: initial interrupts support code

This patch is a skeleton for the VFIO_DEVICE_SET_IRQS IOCTL, around which
most IRQ functionality is implemented in VFIO.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_common.c  | 52 +--
 drivers/vfio/platform/vfio_platform_irq.c | 59 +++
 drivers/vfio/platform/vfio_platform_private.h |  7 
 3 files changed, 115 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index cf7bb08..a532a25 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -204,10 +204,54 @@ static long vfio_platform_ioctl(void *device_data,
 
return copy_to_user((void __user *)arg, info, minsz);
 
-   } else if (cmd == VFIO_DEVICE_SET_IRQS)
-   return -EINVAL;
+   } else if (cmd == VFIO_DEVICE_SET_IRQS) {
+   struct vfio_irq_set hdr;
+   u8 *data = NULL;
+   int ret = 0;
+
+   minsz = offsetofend(struct vfio_irq_set, count);
+
+   if (copy_from_user(hdr, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (hdr.argsz  minsz)
+   return -EINVAL;
+
+   if (hdr.index = vdev-num_irqs)
+   return -EINVAL;
+
+   if (hdr.flags  ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+ VFIO_IRQ_SET_ACTION_TYPE_MASK))
+   return -EINVAL;
 
-   else if (cmd == VFIO_DEVICE_RESET)
+   if (!(hdr.flags  VFIO_IRQ_SET_DATA_NONE)) {
+   size_t size;
+
+   if (hdr.flags  VFIO_IRQ_SET_DATA_BOOL)
+   size = sizeof(uint8_t);
+   else if (hdr.flags  VFIO_IRQ_SET_DATA_EVENTFD)
+   size = sizeof(int32_t);
+   else
+   return -EINVAL;
+
+   if (hdr.argsz - minsz  size)
+   return -EINVAL;
+
+   data = memdup_user((void __user *)(arg + minsz), size);
+   if (IS_ERR(data))
+   return PTR_ERR(data);
+   }
+
+   mutex_lock(vdev-igate);
+
+   ret = vfio_platform_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
+  hdr.start, hdr.count, data);
+   mutex_unlock(vdev-igate);
+   kfree(data);
+
+   return ret;
+
+   } else if (cmd == VFIO_DEVICE_RESET)
return -EINVAL;
 
return -ENOTTY;
@@ -457,6 +501,8 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
return ret;
}
 
+   mutex_init(vdev-igate);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(vfio_platform_probe_common);
diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
b/drivers/vfio/platform/vfio_platform_irq.c
index c6c3ec1..df5c919 100644
--- a/drivers/vfio/platform/vfio_platform_irq.c
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -23,6 +23,56 @@
 
 #include vfio_platform_private.h
 
+static int vfio_platform_set_irq_mask(struct vfio_platform_device *vdev,
+ unsigned index, unsigned start,
+ unsigned count, uint32_t flags,
+ void *data)
+{
+   return -EINVAL;
+}
+
+static int vfio_platform_set_irq_unmask(struct vfio_platform_device *vdev,
+   unsigned index, unsigned start,
+   unsigned count, uint32_t flags,
+   void *data)
+{
+   return -EINVAL;
+}
+
+static int vfio_platform_set_irq_trigger(struct vfio_platform_device *vdev,
+unsigned index, unsigned start,
+unsigned count, uint32_t flags,
+void *data)
+{
+   return -EINVAL;
+}
+
+int vfio_platform_set_irqs_ioctl(struct vfio_platform_device *vdev,
+uint32_t flags, unsigned index, unsigned start,
+unsigned count, void *data)
+{
+   int (*func)(struct vfio_platform_device *vdev, unsigned index,
+   unsigned start, unsigned count, uint32_t flags,
+   void *data) = NULL;
+
+   switch (flags  VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+   case VFIO_IRQ_SET_ACTION_MASK:
+   func = vfio_platform_set_irq_mask;
+   break;
+   case VFIO_IRQ_SET_ACTION_UNMASK:
+   func = vfio_platform_set_irq_unmask;
+   break;
+   case VFIO_IRQ_SET_ACTION_TRIGGER:
+   func = vfio_platform_set_irq_trigger;
+   break;
+   }

[PATCH v11 08/20] vfio/platform: read and write support for the device fd

VFIO returns a file descriptor which we can use to manipulate the memory
regions of the device. Usually, the user will mmap memory regions that are
addressable on page boundaries, however for memory regions where this is
not the case we cannot provide mmap functionality due to security concerns.
For this reason we also allow to use read and write functions to the file
descriptor pointing to the memory regions.

We implement this functionality only for MMIO regions of platform devices;
PIO regions are not being handled at this point.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_common.c  | 150 ++
 drivers/vfio/platform/vfio_platform_private.h |   1 +
 2 files changed, 151 insertions(+)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 2a4613c..fda4c30 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -50,6 +50,10 @@ static int vfio_platform_regions_init(struct 
vfio_platform_device *vdev)
switch (resource_type(res)) {
case IORESOURCE_MEM:
vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_MMIO;
+   vdev-regions[i].flags |= VFIO_REGION_INFO_FLAG_READ;
+   if (!(res-flags  IORESOURCE_READONLY))
+   vdev-regions[i].flags |=
+   VFIO_REGION_INFO_FLAG_WRITE;
break;
case IORESOURCE_IO:
vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO;
@@ -69,6 +73,11 @@ err:
 
 static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev)
 {
+   int i;
+
+   for (i = 0; i  vdev-num_regions; i++)
+   iounmap(vdev-regions[i].ioaddr);
+
vdev-num_regions = 0;
kfree(vdev-regions);
 }
@@ -171,15 +180,156 @@ static long vfio_platform_ioctl(void *device_data,
return -ENOTTY;
 }
 
+static ssize_t vfio_platform_read_mmio(struct vfio_platform_region reg,
+  char __user *buf, size_t count,
+  loff_t off)
+{
+   unsigned int done = 0;
+
+   if (!reg.ioaddr) {
+   reg.ioaddr =
+   ioremap_nocache(reg.addr, reg.size);
+
+   if (!reg.ioaddr)
+   return -ENOMEM;
+   }
+
+   while (count) {
+   size_t filled;
+
+   if (count = 4  !(off % 4)) {
+   u32 val;
+
+   val = ioread32(reg.ioaddr + off);
+   if (copy_to_user(buf, val, 4))
+   goto err;
+
+   filled = 4;
+   } else if (count = 2  !(off % 2)) {
+   u16 val;
+
+   val = ioread16(reg.ioaddr + off);
+   if (copy_to_user(buf, val, 2))
+   goto err;
+
+   filled = 2;
+   } else {
+   u8 val;
+
+   val = ioread8(reg.ioaddr + off);
+   if (copy_to_user(buf, val, 1))
+   goto err;
+
+   filled = 1;
+   }
+
+
+   count -= filled;
+   done += filled;
+   off += filled;
+   buf += filled;
+   }
+
+   return done;
+err:
+   return -EFAULT;
+}
+
 static ssize_t vfio_platform_read(void *device_data, char __user *buf,
  size_t count, loff_t *ppos)
 {
+   struct vfio_platform_device *vdev = device_data;
+   unsigned int index = VFIO_PLATFORM_OFFSET_TO_INDEX(*ppos);
+   loff_t off = *ppos  VFIO_PLATFORM_OFFSET_MASK;
+
+   if (index = vdev-num_regions)
+   return -EINVAL;
+
+   if (!(vdev-regions[index].flags  VFIO_REGION_INFO_FLAG_READ))
+   return -EINVAL;
+
+   if (vdev-regions[index].type  VFIO_PLATFORM_REGION_TYPE_MMIO)
+   return vfio_platform_read_mmio(vdev-regions[index],
+   buf, count, off);
+   else if (vdev-regions[index].type  VFIO_PLATFORM_REGION_TYPE_PIO)
+   return -EINVAL; /* not implemented */
+
return -EINVAL;
 }
 
+static ssize_t vfio_platform_write_mmio(struct vfio_platform_region reg,
+   const char __user *buf, size_t count,
+   loff_t off)
+{
+   unsigned int done = 0;
+
+   if (!reg.ioaddr) {
+   reg.ioaddr =
+   ioremap_nocache(reg.addr, reg.size);
+
+   if (!reg.ioaddr)
+   return -ENOMEM;
+   }
+
+   while (count) {
+   size_t filled;
+
+   if (count = 4  !(off % 4)) {
+

[PATCH v11 10/20] vfio/platform: return IRQ info

Return information for the interrupts exposed by the device.
This patch extends VFIO_DEVICE_GET_INFO with the number of IRQs
and enables VFIO_DEVICE_GET_IRQ_INFO.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/Makefile|  2 +-
 drivers/vfio/platform/vfio_platform_common.c  | 31 +---
 drivers/vfio/platform/vfio_platform_irq.c | 51 +++
 drivers/vfio/platform/vfio_platform_private.h | 10 ++
 4 files changed, 89 insertions(+), 5 deletions(-)
 create mode 100644 drivers/vfio/platform/vfio_platform_irq.c

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
index 1957170..81de144 100644
--- a/drivers/vfio/platform/Makefile
+++ b/drivers/vfio/platform/Makefile
@@ -1,5 +1,5 @@
 
-vfio-platform-y := vfio_platform.o vfio_platform_common.o
+vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o
 
 obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
 
diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 6bf78ee..cf7bb08 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -100,6 +100,7 @@ static void vfio_platform_release(void *device_data)
 
if (!(--vdev-refcnt)) {
vfio_platform_regions_cleanup(vdev);
+   vfio_platform_irq_cleanup(vdev);
}
 
mutex_unlock(driver_lock);
@@ -121,6 +122,10 @@ static int vfio_platform_open(void *device_data)
ret = vfio_platform_regions_init(vdev);
if (ret)
goto err_reg;
+
+   ret = vfio_platform_irq_init(vdev);
+   if (ret)
+   goto err_irq;
}
 
vdev-refcnt++;
@@ -128,6 +133,8 @@ static int vfio_platform_open(void *device_data)
mutex_unlock(driver_lock);
return 0;
 
+err_irq:
+   vfio_platform_regions_cleanup(vdev);
 err_reg:
mutex_unlock(driver_lock);
module_put(THIS_MODULE);
@@ -153,7 +160,7 @@ static long vfio_platform_ioctl(void *device_data,
 
info.flags = vdev-flags;
info.num_regions = vdev-num_regions;
-   info.num_irqs = 0;
+   info.num_irqs = vdev-num_irqs;
 
return copy_to_user((void __user *)arg, info, minsz);
 
@@ -178,10 +185,26 @@ static long vfio_platform_ioctl(void *device_data,
 
return copy_to_user((void __user *)arg, info, minsz);
 
-   } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
-   return -EINVAL;
+   } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+   struct vfio_irq_info info;
+
+   minsz = offsetofend(struct vfio_irq_info, count);
+
+   if (copy_from_user(info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz  minsz)
+   return -EINVAL;
+
+   if (info.index = vdev-num_irqs)
+   return -EINVAL;
+
+   info.flags = vdev-irqs[info.index].flags;
+   info.count = vdev-irqs[info.index].count;
+
+   return copy_to_user((void __user *)arg, info, minsz);
 
-   else if (cmd == VFIO_DEVICE_SET_IRQS)
+   } else if (cmd == VFIO_DEVICE_SET_IRQS)
return -EINVAL;
 
else if (cmd == VFIO_DEVICE_RESET)
diff --git a/drivers/vfio/platform/vfio_platform_irq.c 
b/drivers/vfio/platform/vfio_platform_irq.c
new file mode 100644
index 000..c6c3ec1
--- /dev/null
+++ b/drivers/vfio/platform/vfio_platform_irq.c
@@ -0,0 +1,51 @@
+/*
+ * VFIO platform devices interrupt handling
+ *
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis a.mota...@virtualopensystems.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include linux/eventfd.h
+#include linux/interrupt.h
+#include linux/slab.h
+#include linux/types.h
+#include linux/vfio.h
+#include linux/irq.h
+
+#include vfio_platform_private.h
+
+int vfio_platform_irq_init(struct vfio_platform_device *vdev)
+{
+   int cnt = 0, i;
+
+   while (vdev-get_irq(vdev, cnt) = 0)
+   cnt++;
+
+   vdev-irqs = kcalloc(cnt, sizeof(struct vfio_platform_irq), GFP_KERNEL);
+   if (!vdev-irqs)
+   return -ENOMEM;
+
+   for (i = 0; i  cnt; i++) {
+   vdev-irqs[i].flags = 0;
+   vdev-irqs[i].count = 1;
+   }
+
+   vdev-num_irqs = cnt;
+
+   return 0;
+}
+
+void vfio_platform_irq_cleanup(struct

[PATCH v11 09/20] vfio/platform: support MMAP of MMIO regions

Allow to memory map the MMIO regions of the device so userspace can
directly access them. PIO regions are not being handled at this point.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_common.c | 65 
 1 file changed, 65 insertions(+)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index fda4c30..6bf78ee 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -54,6 +54,16 @@ static int vfio_platform_regions_init(struct 
vfio_platform_device *vdev)
if (!(res-flags  IORESOURCE_READONLY))
vdev-regions[i].flags |=
VFIO_REGION_INFO_FLAG_WRITE;
+
+   /*
+* Only regions addressed with PAGE granularity may be
+* MMAPed securely.
+*/
+   if (!(vdev-regions[i].addr  ~PAGE_MASK) 
+   !(vdev-regions[i].size  ~PAGE_MASK))
+   vdev-regions[i].flags |=
+   VFIO_REGION_INFO_FLAG_MMAP;
+
break;
case IORESOURCE_IO:
vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO;
@@ -333,8 +343,63 @@ static ssize_t vfio_platform_write(void *device_data, 
const char __user *buf,
return -EINVAL;
 }
 
+static int vfio_platform_mmap_mmio(struct vfio_platform_region region,
+  struct vm_area_struct *vma)
+{
+   u64 req_len, pgoff, req_start;
+
+   req_len = vma-vm_end - vma-vm_start;
+   pgoff = vma-vm_pgoff 
+   ((1U  (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+   req_start = pgoff  PAGE_SHIFT;
+
+   if (region.size  PAGE_SIZE || req_start + req_len  region.size)
+   return -EINVAL;
+
+   vma-vm_page_prot = pgprot_noncached(vma-vm_page_prot);
+   vma-vm_pgoff = (region.addr  PAGE_SHIFT) + pgoff;
+
+   return remap_pfn_range(vma, vma-vm_start, vma-vm_pgoff,
+  req_len, vma-vm_page_prot);
+}
+
 static int vfio_platform_mmap(void *device_data, struct vm_area_struct *vma)
 {
+   struct vfio_platform_device *vdev = device_data;
+   unsigned int index;
+
+   index = vma-vm_pgoff  (VFIO_PLATFORM_OFFSET_SHIFT - PAGE_SHIFT);
+
+   if (vma-vm_end  vma-vm_start)
+   return -EINVAL;
+   if (!(vma-vm_flags  VM_SHARED))
+   return -EINVAL;
+   if (index = vdev-num_regions)
+   return -EINVAL;
+   if (vma-vm_start  ~PAGE_MASK)
+   return -EINVAL;
+   if (vma-vm_end  ~PAGE_MASK)
+   return -EINVAL;
+
+   if (!(vdev-regions[index].flags  VFIO_REGION_INFO_FLAG_MMAP))
+   return -EINVAL;
+
+   if (!(vdev-regions[index].flags  VFIO_REGION_INFO_FLAG_READ)
+(vma-vm_flags  VM_READ))
+   return -EINVAL;
+
+   if (!(vdev-regions[index].flags  VFIO_REGION_INFO_FLAG_WRITE)
+(vma-vm_flags  VM_WRITE))
+   return -EINVAL;
+
+   vma-vm_private_data = vdev;
+
+   if (vdev-regions[index].type  VFIO_PLATFORM_REGION_TYPE_MMIO)
+   return vfio_platform_mmap_mmio(vdev-regions[index], vma);
+
+   else if (vdev-regions[index].type  VFIO_PLATFORM_REGION_TYPE_PIO)
+   return -EINVAL; /* not implemented */
+
return -EINVAL;
 }
 
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 07/20] vfio/platform: return info for device memory mapped IO regions

This patch enables the IOCTLs VFIO_DEVICE_GET_REGION_INFO ioctl call,
which allows the user to learn about the available MMIO resources of
a device.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform_common.c  | 106 +-
 drivers/vfio/platform/vfio_platform_private.h |  22 ++
 2 files changed, 124 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 862b43b..2a4613c 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -22,17 +22,97 @@
 
 #include vfio_platform_private.h
 
+static DEFINE_MUTEX(driver_lock);
+
+static int vfio_platform_regions_init(struct vfio_platform_device *vdev)
+{
+   int cnt = 0, i;
+
+   while (vdev-get_resource(vdev, cnt))
+   cnt++;
+
+   vdev-regions = kcalloc(cnt, sizeof(struct vfio_platform_region),
+   GFP_KERNEL);
+   if (!vdev-regions)
+   return -ENOMEM;
+
+   for (i = 0; i  cnt;  i++) {
+   struct resource *res =
+   vdev-get_resource(vdev, i);
+
+   if (!res)
+   goto err;
+
+   vdev-regions[i].addr = res-start;
+   vdev-regions[i].size = resource_size(res);
+   vdev-regions[i].flags = 0;
+
+   switch (resource_type(res)) {
+   case IORESOURCE_MEM:
+   vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_MMIO;
+   break;
+   case IORESOURCE_IO:
+   vdev-regions[i].type = VFIO_PLATFORM_REGION_TYPE_PIO;
+   break;
+   default:
+   goto err;
+   }
+   }
+
+   vdev-num_regions = cnt;
+
+   return 0;
+err:
+   kfree(vdev-regions);
+   return -EINVAL;
+}
+
+static void vfio_platform_regions_cleanup(struct vfio_platform_device *vdev)
+{
+   vdev-num_regions = 0;
+   kfree(vdev-regions);
+}
+
 static void vfio_platform_release(void *device_data)
 {
+   struct vfio_platform_device *vdev = device_data;
+
+   mutex_lock(driver_lock);
+
+   if (!(--vdev-refcnt)) {
+   vfio_platform_regions_cleanup(vdev);
+   }
+
+   mutex_unlock(driver_lock);
+
module_put(THIS_MODULE);
 }
 
 static int vfio_platform_open(void *device_data)
 {
+   struct vfio_platform_device *vdev = device_data;
+   int ret;
+
if (!try_module_get(THIS_MODULE))
return -ENODEV;
 
+   mutex_lock(driver_lock);
+
+   if (!vdev-refcnt) {
+   ret = vfio_platform_regions_init(vdev);
+   if (ret)
+   goto err_reg;
+   }
+
+   vdev-refcnt++;
+
+   mutex_unlock(driver_lock);
return 0;
+
+err_reg:
+   mutex_unlock(driver_lock);
+   module_put(THIS_MODULE);
+   return ret;
 }
 
 static long vfio_platform_ioctl(void *device_data,
@@ -53,15 +133,33 @@ static long vfio_platform_ioctl(void *device_data,
return -EINVAL;
 
info.flags = vdev-flags;
-   info.num_regions = 0;
+   info.num_regions = vdev-num_regions;
info.num_irqs = 0;
 
return copy_to_user((void __user *)arg, info, minsz);
 
-   } else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
-   return -EINVAL;
+   } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+   struct vfio_region_info info;
+
+   minsz = offsetofend(struct vfio_region_info, offset);
+
+   if (copy_from_user(info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz  minsz)
+   return -EINVAL;
+
+   if (info.index = vdev-num_regions)
+   return -EINVAL;
+
+   /* map offset to the physical address  */
+   info.offset = VFIO_PLATFORM_INDEX_TO_OFFSET(info.index);
+   info.size = vdev-regions[info.index].size;
+   info.flags = vdev-regions[info.index].flags;
+
+   return copy_to_user((void __user *)arg, info, minsz);
 
-   else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
+   } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
return -EINVAL;
 
else if (cmd == VFIO_DEVICE_SET_IRQS)
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index 062b92d..b24729f 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -15,7 +15,29 @@
 #ifndef VFIO_PLATFORM_PRIVATE_H
 #define VFIO_PLATFORM_PRIVATE_H
 
+#define VFIO_PLATFORM_OFFSET_SHIFT   40
+#define VFIO_PLATFORM_OFFSET_MASK (((u64)(1)  VFIO_PLATFORM_OFFSET_SHIFT) - 
1)
+
+#define VFIO_PLATFORM_OFFSET_TO_INDEX(off)

[PATCH v11 03/20] vfio: platform: add the VFIO PLATFORM module to Kconfig

Enable building the VFIO PLATFORM driver that allows to use Linux platform
devices with VFIO.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/Kconfig   | 1 +
 drivers/vfio/Makefile  | 1 +
 drivers/vfio/platform/Kconfig  | 9 +
 drivers/vfio/platform/Makefile | 4 
 4 files changed, 15 insertions(+)
 create mode 100644 drivers/vfio/platform/Kconfig
 create mode 100644 drivers/vfio/platform/Makefile

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index a0abe04..962fb80 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -27,3 +27,4 @@ menuconfig VFIO
  If you don't know what to do here, say N.
 
 source drivers/vfio/pci/Kconfig
+source drivers/vfio/platform/Kconfig
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 0b035b1..dadf0ca 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
+obj-$(CONFIG_VFIO_PLATFORM) += platform/
diff --git a/drivers/vfio/platform/Kconfig b/drivers/vfio/platform/Kconfig
new file mode 100644
index 000..c51af17
--- /dev/null
+++ b/drivers/vfio/platform/Kconfig
@@ -0,0 +1,9 @@
+config VFIO_PLATFORM
+   tristate VFIO support for platform devices
+   depends on VFIO  EVENTFD  ARM
+   help
+ Support for platform devices with VFIO. This is required to make
+ use of platform devices present on the system using the VFIO
+ framework.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
new file mode 100644
index 000..279862b
--- /dev/null
+++ b/drivers/vfio/platform/Makefile
@@ -0,0 +1,4 @@
+
+vfio-platform-y := vfio_platform.o vfio_platform_common.o
+
+obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 05/20] vfio: amba: add the VFIO for AMBA devices module to Kconfig

Enable building the VFIO AMBA driver. VFIO_AMBA depends on VFIO_PLATFORM,
since it is sharing a portion of the code, and it is essentially implemented
as a platform device whose resources are discovered via AMBA specific APIs
in the kernel.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/Kconfig  | 10 ++
 drivers/vfio/platform/Makefile |  4 
 2 files changed, 14 insertions(+)

diff --git a/drivers/vfio/platform/Kconfig b/drivers/vfio/platform/Kconfig
index c51af17..c0a3bff 100644
--- a/drivers/vfio/platform/Kconfig
+++ b/drivers/vfio/platform/Kconfig
@@ -7,3 +7,13 @@ config VFIO_PLATFORM
  framework.
 
  If you don't know what to do here, say N.
+
+config VFIO_AMBA
+   tristate VFIO support for AMBA devices
+   depends on VFIO_PLATFORM  ARM_AMBA
+   help
+ Support for ARM AMBA devices with VFIO. This is required to make
+ use of ARM AMBA devices present on the system using the VFIO
+ framework.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
index 279862b..1957170 100644
--- a/drivers/vfio/platform/Makefile
+++ b/drivers/vfio/platform/Makefile
@@ -2,3 +2,7 @@
 vfio-platform-y := vfio_platform.o vfio_platform_common.o
 
 obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
+
+vfio-amba-y := vfio_amba.o
+
+obj-$(CONFIG_VFIO_AMBA) += vfio-amba.o
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 04/20] vfio: amba: VFIO support for AMBA devices

Add support for discovering AMBA devices with VFIO and handle them
similarly to Linux platform devices.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_amba.c | 115 ++
 include/uapi/linux/vfio.h |   1 +
 2 files changed, 116 insertions(+)
 create mode 100644 drivers/vfio/platform/vfio_amba.c

diff --git a/drivers/vfio/platform/vfio_amba.c 
b/drivers/vfio/platform/vfio_amba.c
new file mode 100644
index 000..ff0331f
--- /dev/null
+++ b/drivers/vfio/platform/vfio_amba.c
@@ -0,0 +1,115 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis a.mota...@virtualopensystems.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include linux/module.h
+#include linux/slab.h
+#include linux/vfio.h
+#include linux/amba/bus.h
+
+#include vfio_platform_private.h
+
+#define DRIVER_VERSION  0.10
+#define DRIVER_AUTHOR   Antonios Motakis a.mota...@virtualopensystems.com
+#define DRIVER_DESC VFIO for AMBA devices - User Level meta-driver
+
+/* probing devices from the AMBA bus */
+
+static struct resource *get_amba_resource(struct vfio_platform_device *vdev,
+ int i)
+{
+   struct amba_device *adev = (struct amba_device *) vdev-opaque;
+
+   if (i == 0)
+   return adev-res;
+
+   return NULL;
+}
+
+static int get_amba_irq(struct vfio_platform_device *vdev, int i)
+{
+   struct amba_device *adev = (struct amba_device *) vdev-opaque;
+   int ret = 0;
+
+   if (i  AMBA_NR_IRQS)
+   ret = adev-irq[i];
+
+   /* zero is an unset IRQ for AMBA devices */
+   return ret ? ret : -ENXIO;
+}
+
+static int vfio_amba_probe(struct amba_device *adev, const struct amba_id *id)
+{
+   struct vfio_platform_device *vdev;
+   int ret;
+
+   vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+   if (!vdev)
+   return -ENOMEM;
+
+   vdev-name = kasprintf(GFP_KERNEL, vfio-amba-%08x, adev-periphid);
+   if (!vdev-name) {
+   kfree(vdev);
+   return -ENOMEM;
+   }
+
+   vdev-opaque = (void *) adev;
+   vdev-flags = VFIO_DEVICE_FLAGS_AMBA;
+   vdev-get_resource = get_amba_resource;
+   vdev-get_irq = get_amba_irq;
+
+   ret = vfio_platform_probe_common(vdev, adev-dev);
+   if (ret) {
+   kfree(vdev-name);
+   kfree(vdev);
+   }
+
+   return ret;
+}
+
+static int vfio_amba_remove(struct amba_device *adev)
+{
+   struct vfio_platform_device *vdev;
+
+   vdev = vfio_platform_remove_common(adev-dev);
+   if (vdev) {
+   kfree(vdev-name);
+   kfree(vdev);
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+static struct amba_id pl330_ids[] = {
+   { 0, 0 },
+};
+
+MODULE_DEVICE_TABLE(amba, pl330_ids);
+
+static struct amba_driver vfio_amba_driver = {
+   .probe = vfio_amba_probe,
+   .remove = vfio_amba_remove,
+   .id_table = pl330_ids,
+   .drv = {
+   .name = vfio-amba,
+   .owner = THIS_MODULE,
+   },
+};
+
+module_amba_driver(vfio_amba_driver);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE(GPL v2);
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 4e93a97..544d3d8 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -160,6 +160,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_RESET(1  0)/* Device supports 
reset */
 #define VFIO_DEVICE_FLAGS_PCI  (1  1)/* vfio-pci device */
 #define VFIO_DEVICE_FLAGS_PLATFORM (1  2)/* vfio-platform device */
+#define VFIO_DEVICE_FLAGS_AMBA  (1  3)   /* vfio-amba device */
__u32   num_regions;/* Max region index + 1 */
__u32   num_irqs;   /* Max IRQ index + 1 */
 };
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 02/20] vfio: platform: probe to devices on the platform bus

Driver to bind to Linux platform devices, and callbacks to discover their
resources to be used by the main VFIO PLATFORM code.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/platform/vfio_platform.c | 103 ++
 include/uapi/linux/vfio.h |   1 +
 2 files changed, 104 insertions(+)
 create mode 100644 drivers/vfio/platform/vfio_platform.c

diff --git a/drivers/vfio/platform/vfio_platform.c 
b/drivers/vfio/platform/vfio_platform.c
new file mode 100644
index 000..cef645c
--- /dev/null
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -0,0 +1,103 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis a.mota...@virtualopensystems.com
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include linux/module.h
+#include linux/slab.h
+#include linux/vfio.h
+#include linux/platform_device.h
+
+#include vfio_platform_private.h
+
+#define DRIVER_VERSION  0.10
+#define DRIVER_AUTHOR   Antonios Motakis a.mota...@virtualopensystems.com
+#define DRIVER_DESC VFIO for platform devices - User Level meta-driver
+
+/* probing devices from the linux platform bus */
+
+static struct resource *get_platform_resource(struct vfio_platform_device 
*vdev,
+ int num)
+{
+   struct platform_device *dev = (struct platform_device *) vdev-opaque;
+   int i;
+
+   for (i = 0; i  dev-num_resources; i++) {
+   struct resource *r = dev-resource[i];
+
+   if (resource_type(r)  (IORESOURCE_MEM|IORESOURCE_IO)) {
+   if (!num)
+   return r;
+
+   num--;
+   }
+   }
+   return NULL;
+}
+
+static int get_platform_irq(struct vfio_platform_device *vdev, int i)
+{
+   struct platform_device *pdev = (struct platform_device *) vdev-opaque;
+
+   return platform_get_irq(pdev, i);
+}
+
+static int vfio_platform_probe(struct platform_device *pdev)
+{
+   struct vfio_platform_device *vdev;
+   int ret;
+
+   vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+   if (!vdev)
+   return -ENOMEM;
+
+   vdev-opaque = (void *) pdev;
+   vdev-name = pdev-name;
+   vdev-flags = VFIO_DEVICE_FLAGS_PLATFORM;
+   vdev-get_resource = get_platform_resource;
+   vdev-get_irq = get_platform_irq;
+
+   ret = vfio_platform_probe_common(vdev, pdev-dev);
+   if (ret)
+   kfree(vdev);
+
+   return ret;
+}
+
+static int vfio_platform_remove(struct platform_device *pdev)
+{
+   struct vfio_platform_device *vdev;
+
+   vdev = vfio_platform_remove_common(pdev-dev);
+   if (vdev) {
+   kfree(vdev);
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+static struct platform_driver vfio_platform_driver = {
+   .probe  = vfio_platform_probe,
+   .remove = vfio_platform_remove,
+   .driver = {
+   .name   = vfio-platform,
+   .owner  = THIS_MODULE,
+   },
+};
+
+module_platform_driver(vfio_platform_driver);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE(GPL v2);
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9ade02b..4e93a97 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -159,6 +159,7 @@ struct vfio_device_info {
__u32   flags;
 #define VFIO_DEVICE_FLAGS_RESET(1  0)/* Device supports 
reset */
 #define VFIO_DEVICE_FLAGS_PCI  (1  1)/* vfio-pci device */
+#define VFIO_DEVICE_FLAGS_PLATFORM (1  2)/* vfio-platform device */
__u32   num_regions;/* Max region index + 1 */
__u32   num_irqs;   /* Max IRQ index + 1 */
 };
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 15/20] vfio: virqfd: rename vfio_pci_virqfd_init and vfio_pci_virqfd_exit

The functions vfio_pci_virqfd_init and vfio_pci_virqfd_exit are not really
PCI specific, since we plan to reuse the virqfd code with more VFIO drivers
in addition to VFIO_PCI.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/pci/vfio_pci.c   | 6 +++---
 drivers/vfio/pci/vfio_pci_intrs.c | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 9558da3..fc4308c 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1012,7 +1012,7 @@ put_devs:
 static void __exit vfio_pci_cleanup(void)
 {
pci_unregister_driver(vfio_pci_driver);
-   vfio_pci_virqfd_exit();
+   vfio_virqfd_exit();
vfio_pci_uninit_perm_bits();
 }
 
@@ -1026,7 +1026,7 @@ static int __init vfio_pci_init(void)
return ret;
 
/* Start the virqfd cleanup handler */
-   ret = vfio_pci_virqfd_init();
+   ret = vfio_virqfd_init();
if (ret)
goto out_virqfd;
 
@@ -1038,7 +1038,7 @@ static int __init vfio_pci_init(void)
return 0;
 
 out_driver:
-   vfio_pci_virqfd_exit();
+   vfio_virqfd_exit();
 out_virqfd:
vfio_pci_uninit_perm_bits();
return ret;
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 0a41833d..a5378d5 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -45,7 +45,7 @@ struct virqfd {
 
 static struct workqueue_struct *vfio_irqfd_cleanup_wq;
 
-int __init vfio_pci_virqfd_init(void)
+int __init vfio_virqfd_init(void)
 {
vfio_irqfd_cleanup_wq =
create_singlethread_workqueue(vfio-irqfd-cleanup);
@@ -55,7 +55,7 @@ int __init vfio_pci_virqfd_init(void)
return 0;
 }
 
-void vfio_pci_virqfd_exit(void)
+void vfio_virqfd_exit(void)
 {
destroy_workqueue(vfio_irqfd_cleanup_wq);
 }
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v11 19/20] vfio: initialize the virqfd workqueue in VFIO generic code

Now we have finally completely decoupled virqfd from VFIO_PCI. We can
initialize it from the VFIO generic code, in order to safely use it from
multiple independent VFIO bus drivers.

Signed-off-by: Antonios Motakis a.mota...@virtualopensystems.com
---
 drivers/vfio/Makefile   | 4 +++-
 drivers/vfio/pci/Makefile   | 3 +--
 drivers/vfio/pci/vfio_pci.c | 8 
 drivers/vfio/vfio.c | 8 
 4 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index dadf0ca..d798b09 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,4 +1,6 @@
-obj-$(CONFIG_VFIO) += vfio.o
+vfio_core-y := vfio.o virqfd.o
+
+obj-$(CONFIG_VFIO) += vfio_core.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index c7c8644..1310792 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,5 +1,4 @@
 
-vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o \
- ../virqfd.o
+vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index fc4308c..8d156d7 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1012,7 +1012,6 @@ put_devs:
 static void __exit vfio_pci_cleanup(void)
 {
pci_unregister_driver(vfio_pci_driver);
-   vfio_virqfd_exit();
vfio_pci_uninit_perm_bits();
 }
 
@@ -1025,11 +1024,6 @@ static int __init vfio_pci_init(void)
if (ret)
return ret;
 
-   /* Start the virqfd cleanup handler */
-   ret = vfio_virqfd_init();
-   if (ret)
-   goto out_virqfd;
-
/* Register and scan for devices */
ret = pci_register_driver(vfio_pci_driver);
if (ret)
@@ -1038,8 +1032,6 @@ static int __init vfio_pci_init(void)
return 0;
 
 out_driver:
-   vfio_virqfd_exit();
-out_virqfd:
vfio_pci_uninit_perm_bits();
return ret;
 }
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index f018d8d..8e84471 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1464,6 +1464,11 @@ static int __init vfio_init(void)
if (ret)
goto err_cdev_add;
 
+   /* Start the virqfd cleanup handler used by some VFIO bus drivers */
+   ret = vfio_virqfd_init();
+   if (ret)
+   goto err_virqfd;
+
pr_info(DRIVER_DESC  version:  DRIVER_VERSION \n);
 
/*
@@ -1476,6 +1481,8 @@ static int __init vfio_init(void)
 
return 0;
 
+err_virqfd:
+   cdev_del(vfio.group_cdev);
 err_cdev_add:
unregister_chrdev_region(vfio.group_devt, MINORMASK);
 err_alloc_chrdev:
@@ -1490,6 +1497,7 @@ static void __exit vfio_cleanup(void)
 {
WARN_ON(!list_empty(vfio.group_list));
 
+   vfio_virqfd_exit();
idr_destroy(vfio.group_idr);
cdev_del(vfio.group_cdev);
unregister_chrdev_region(vfio.group_devt, MINORMASK);
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader



On 06/01/2015 09:42, Paolo Bonzini wrote:
   Still confused.  So we can freeze all vCPUs in the host, then update
   pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we have
   a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
   doesn't increment the version pre-update, and we can return completely
   bogus results.
  Yes.
 But then the getcpu test would fail (1-0).  Even if you have an ABA
 situation (1-0-1), it's okay because the pvti that is fetched is the
 one returned by the first getcpu.

... this case of partial update of pvti, which is caught by the version
field, if of course different from the other (extremely unlikely) that
Andy pointed out.  That is when the getcpus are done on the same vCPU,
but the rdtsc is another.

That one can be fixed by rdtscp, like

do {
// get a consistent (pvti, v, tsc) tuple
do {
cpu = get_cpu();
pvti = get_pvti(cpu);
v = pvti-version  ~1;
// also acts as rmb();
rdtsc_barrier();
tsc = rdtscp(cpu1);
// control dependency, no need for rdtsc_barrier?
} while(cpu != cpu1);

// ... compute nanoseconds from pvti and tsc ...
rmb();
}   while(v != pvti-version);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] kvmtool: Add parameter to specifiy number of threads in thread_pool

2015-01-06 Thread Andreas Herrmann

With current code the number of threads added to the thread_pool
equals number of online CPUs. IMHO on systems with many CPUs this is
overkill for guests that just have a few vCPUs and/or if a guest is
pinned to a subset of host CPUs. E.g. of a system with 48 cores

 # numactl -C 4,5,7,8 ./lkvm run -c 2 -m 256 -k paravirt -d rootfs.ext3 ...
 # ps -La | grep threadpool-work | wc -l
 48

Don't change default behaviour (for sake of compatibility) but
introduce a new parameter (-t or --threads) that allows to specify
number of threads to be created for the thread_pool:

 # numactl -C 4,5,7,8 ./lkvm run -c 2 -m 256 --threads 4 -k paravirt -d ...
 # ps -La | grep threadpool-work | wc -l
 4

Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/builtin-run.c|2 ++
 tools/kvm/include/kvm/kvm-config.h |1 +
 tools/kvm/util/threadpool.c|5 -
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index 1ee75ad..86de53d 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -131,6 +131,8 @@ void kvm_run_set_wrapper_sandbox(void)
 rootfs), \
OPT_STRING('\0', hugetlbfs, (cfg)-hugetlbfs_path, path,   \
Hugetlbfs path),  \
+   OPT_INTEGER('t', threads, (cfg)-nrthreads,  \
+Number of threads in thread_pool),   \
\
OPT_GROUP(Kernel options:),   \
OPT_STRING('k', kernel, (cfg)-kernel_filename, kernel,\
diff --git a/tools/kvm/include/kvm/kvm-config.h 
b/tools/kvm/include/kvm/kvm-config.h
index 386fa8c..9cc50f5 100644
--- a/tools/kvm/include/kvm/kvm-config.h
+++ b/tools/kvm/include/kvm/kvm-config.h
@@ -27,6 +27,7 @@ struct kvm_config {
int active_console;
int debug_iodelay;
int nrcpus;
+   int nrthreads;
const char *kernel_cmdline;
const char *kernel_filename;
const char *vmlinux_filename;
diff --git a/tools/kvm/util/threadpool.c b/tools/kvm/util/threadpool.c
index e64aa26..620fdbd 100644
--- a/tools/kvm/util/threadpool.c
+++ b/tools/kvm/util/threadpool.c
@@ -124,7 +124,10 @@ static int thread_pool__addthread(void)
 int thread_pool__init(struct kvm *kvm)
 {
unsigned long i;
-   unsigned int thread_count = sysconf(_SC_NPROCESSORS_ONLN);
+   unsigned int thread_count;
+
+   thread_count = kvm-cfg.nrthreads ? kvm-cfg.nrthreads :
+   sysconf(_SC_NPROCESSORS_ONLN);
 
running = true;
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] kvmtool: Set names for virtio-net-ctrl and term-poll threads

2015-01-06 Thread Andreas Herrmann


Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/term.c   |2 ++
 tools/kvm/virtio/net.c |2 ++
 2 files changed, 4 insertions(+)

diff --git a/tools/kvm/term.c b/tools/kvm/term.c
index 1b8131a..5754510 100644
--- a/tools/kvm/term.c
+++ b/tools/kvm/term.c
@@ -104,6 +104,8 @@ static void *term_poll_thread_loop(void *param)
 
int i;
 
+   kvm__set_thread_name(term-poll);
+
for (i = 0; i  TERM_MAX_DEVS; i++) {
fds[i].fd = term_fds[i][TERM_FD_IN];
fds[i].events = POLLIN;
diff --git a/tools/kvm/virtio/net.c b/tools/kvm/virtio/net.c
index ecdb94e..fed5095 100644
--- a/tools/kvm/virtio/net.c
+++ b/tools/kvm/virtio/net.c
@@ -233,6 +233,8 @@ static void *virtio_net_ctrl_thread(void *p)
struct virtio_net_ctrl_hdr *ctrl;
virtio_net_ctrl_ack *ack;
 
+   kvm__set_thread_name(virtio-net-ctrl);
+
while (1) {
mutex_lock(ndev-io_lock[id]);
if (!virt_queue__available(vq))
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Xen-devel] [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-06 Thread Konrad Rzeszutek Wilk

On Mon, Jan 05, 2015 at 10:56:07AM -0800, Andy Lutomirski wrote:
 On Mon, Jan 5, 2015 at 7:25 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
  On Mon, Dec 22, 2014 at 04:39:57PM -0800, Andy Lutomirski wrote:
  The pvclock vdso code was too abstracted to understand easily and
  excessively paranoid.  Simplify it for a huge speedup.
 
  This opens the door for additional simplifications, as the vdso no
  longer accesses the pvti for any vcpu other than vcpu 0.
 
  Before, vclock_gettime using kvm-clock took about 64ns on my machine.
  With this change, it takes 19ns, which is almost as fast as the pure TSC
  implementation.
 
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  ---
   arch/x86/vdso/vclock_gettime.c | 82 
  --
   1 file changed, 47 insertions(+), 35 deletions(-)
 
  diff --git a/arch/x86/vdso/vclock_gettime.c 
  b/arch/x86/vdso/vclock_gettime.c
  index 9793322751e0..f2e0396d5629 100644
  --- a/arch/x86/vdso/vclock_gettime.c
  +++ b/arch/x86/vdso/vclock_gettime.c
  @@ -78,47 +78,59 @@ static notrace const struct pvclock_vsyscall_time_info 
  *get_pvti(int cpu)
 
   static notrace cycle_t vread_pvclock(int *mode)
   {
  - const struct pvclock_vsyscall_time_info *pvti;
  + const struct pvclock_vcpu_time_info *pvti = get_pvti(0)-pvti;
cycle_t ret;
  - u64 last;
  - u32 version;
  - u8 flags;
  - unsigned cpu, cpu1;
  -
  + u64 tsc, pvti_tsc;
  + u64 last, delta, pvti_system_time;
  + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift;
 
/*
  -  * Note: hypervisor must guarantee that:
  -  * 1. cpu ID number maps 1:1 to per-CPU pvclock time info.
  -  * 2. that per-CPU pvclock time info is updated if the
  -  *underlying CPU changes.
  -  * 3. that version is increased whenever underlying CPU
  -  *changes.
  +  * Note: The kernel and hypervisor must guarantee that cpu ID
  +  * number maps 1:1 to per-CPU pvclock time info.
  +  *
  +  * Because the hypervisor is entirely unaware of guest userspace
  +  * preemption, it cannot guarantee that per-CPU pvclock time
  +  * info is updated if the underlying CPU changes or that that
  +  * version is increased whenever underlying CPU changes.
  +  *
  +  * On KVM, we are guaranteed that pvti updates for any vCPU are
  +  * atomic as seen by *all* vCPUs.  This is an even stronger
  +  * guarantee than we get with a normal seqlock.
 *
  +  * On Xen, we don't appear to have that guarantee, but Xen still
  +  * supplies a valid seqlock using the version field.
  +
  +  * We only do pvclock vdso timing at all if
  +  * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
  +  * mean that all vCPUs have matching pvti and that the TSC is
  +  * synced, so we can just look at vCPU 0's pvti.
 */
 
  Can Xen guarantee that ?
 
 I think so, vacuously.  Xen doesn't seem to set PVCLOCK_TSC_STABLE_BIT
 at all.  I have no idea going forward, though.
 
 Xen people?

The person who would know of the top of his head is Dan Magenheimer, who
is now enjoy retirement :-(

I will have to dig in the code to answer this - that will take a bit of time
sadly (I am sick this week).
 
 
  - do {
  - cpu = __getcpu()  VGETCPU_CPU_MASK;
  - /* TODO: We can put vcpu id into higher bits of pvti.version.
  -  * This will save a couple of cycles by getting rid of
  -  * __getcpu() calls (Gleb).
  -  */
  -
  - pvti = get_pvti(cpu);
  -
  - version = __pvclock_read_cycles(pvti-pvti, ret, flags);
  -
  - /*
  -  * Test we're still on the cpu as well as the version.
  -  * We could have been migrated just after the first
  -  * vgetcpu but before fetching the version, so we
  -  * wouldn't notice a version change.
  -  */
  - cpu1 = __getcpu()  VGETCPU_CPU_MASK;
  - } while (unlikely(cpu != cpu1 ||
  -   (pvti-pvti.version  1) ||
  -   pvti-pvti.version != version));
  -
  - if (unlikely(!(flags  PVCLOCK_TSC_STABLE_BIT)))
  +
  + if (unlikely(!(pvti-flags  PVCLOCK_TSC_STABLE_BIT))) {
*mode = VCLOCK_NONE;
  + return 0;
  + }
 
  This check must be performed after reading a stable pvti.
 
 
 We can even read it in the middle, guarded by the version checks.
 I'll do that for v2.
 
  +
  + do {
  + version = pvti-version;
  +
  + /* This is also a read barrier, so we'll read version first. 
  */
  + rdtsc_barrier();
  + tsc = __native_read_tsc();
  +
  + pvti_tsc_to_system_mul = pvti-tsc_to_system_mul;
  + pvti_tsc_shift = pvti-tsc_shift;
  + pvti_system_time = pvti-system_time;
  + pvti_tsc = pvti-tsc_timestamp;
  +
  +

[ÞATCH] kvmtool, mips: Support more than 256 MB guest memory

2015-01-06 Thread Andreas Herrmann

Two guest memory regions need to be defined and two mem= parameters
need to be passed to guest kernel to support more than 256 MB.

Signed-off-by: Andreas Herrmann andreas.herrm...@caviumnetworks.com
---
 tools/kvm/mips/include/kvm/kvm-arch.h |   10 +
 tools/kvm/mips/kvm.c  |   36 +++--
 2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/tools/kvm/mips/include/kvm/kvm-arch.h 
b/tools/kvm/mips/include/kvm/kvm-arch.h
index 7eadbf4..97bbf34 100644
--- a/tools/kvm/mips/include/kvm/kvm-arch.h
+++ b/tools/kvm/mips/include/kvm/kvm-arch.h
@@ -1,10 +1,20 @@
 #ifndef KVM__KVM_ARCH_H
 #define KVM__KVM_ARCH_H
 
+
+/*
+ * Guest memory map is:
+ *   0x-0x0fff : System RAM
+ *   0x1000-0x1fff : I/O (defined by KVM_MMIO_START and KVM_MMIO_SIZE)
+ *   0x2000-...: System RAM
+ * See also kvm__init_ram().
+ */
+
 #define KVM_MMIO_START 0x1000
 #define KVM_PCI_CFG_AREA   KVM_MMIO_START
 #define KVM_PCI_MMIO_AREA  (KVM_MMIO_START + 0x100)
 #define KVM_VIRTIO_MMIO_AREA   (KVM_MMIO_START + 0x200)
+#define KVM_MMIO_SIZE  0x1000
 
 /*
  * Just for reference. This and the above corresponds to what's used
diff --git a/tools/kvm/mips/kvm.c b/tools/kvm/mips/kvm.c
index fc0428b..10b441b 100644
--- a/tools/kvm/mips/kvm.c
+++ b/tools/kvm/mips/kvm.c
@@ -22,11 +22,28 @@ void kvm__init_ram(struct kvm *kvm)
u64 phys_start, phys_size;
void*host_mem;
 
-   phys_start = 0;
-   phys_size  = kvm-ram_size;
-   host_mem   = kvm-ram_start;
+   if (kvm-ram_size = KVM_MMIO_START) {
+   /* one region for all memory */
+   phys_start = 0;
+   phys_size  = kvm-ram_size;
+   host_mem   = kvm-ram_start;
 
-   kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+   kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+   } else {
+   /* one region for memory that fits below MMIO range */
+   phys_start = 0;
+   phys_size  = KVM_MMIO_SIZE;
+   host_mem   = kvm-ram_start;
+
+   kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+
+   /* one region for rest of memory */
+   phys_start = KVM_MMIO_START + KVM_MMIO_SIZE;
+   phys_size  = kvm-ram_size - 0x1000;
+   host_mem   = kvm-ram_start + KVM_MMIO_SIZE;
+
+   kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+   }
 }
 
 void kvm__arch_delete_ram(struct kvm *kvm)
@@ -108,8 +125,15 @@ static void kvm__mips_install_cmdline(struct kvm *kvm)
u64 argv_offset = argv_start;
u64 argc = 0;
 
-   sprintf(p + cmdline_offset, mem=0x%llx@0 ,
-(unsigned long long)kvm-ram_size);
+
+   if ((u64) kvm-ram_size = KVM_MMIO_SIZE)
+   sprintf(p + cmdline_offset, mem=0x%llx@0 ,
+   (unsigned long long)kvm-ram_size);
+   else
+   sprintf(p + cmdline_offset, mem=0x%llx@0 mem=0x%llx@0x%llx ,
+   (unsigned long long)KVM_MMIO_START,
+   (unsigned long long)kvm-ram_size - KVM_MMIO_START,
+   (unsigned long long)(KVM_MMIO_START + KVM_MMIO_SIZE));
 
strcat(p + cmdline_offset, kvm-cfg.real_cmdline); /* maximum size is 
2K */
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls

2015-01-06 Thread Borislav Petkov

On Tue, Jan 06, 2015 at 10:43:57AM -0800, Andy Lutomirski wrote:
 Sure, but the code would be simpler if we shoved that value in the
 EFLAGS slot.

There probably is some reason for that but it's not like we can change
it :-)

 Hmm.  I added and pushed a test for fork, but that didn't turn
 anything up.  And I don't see any bugs in the code.
 
 I booted 3.18 plus this patch with context tracking forced on on my
 laptop, and something seems to have gone wrong.

Well, I ran it on -rc2 + tip/master and it did trigger sporadically
yesterday but was unable to trigger something today.

I'll redo the whole games tomorrow.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-06 Thread Marcelo Tosatti

On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
On Jan 6, 2015 4:01 AM, Paolo Bonzini pbonz...@redhat.com wrote:

On 06/01/2015 09:42, Paolo Bonzini wrote:
Still confused. So we can freeze all vCPUs in the host, then
update
pvti 1, then resume vCPU 1, then update pvti 0? In that case, we
have
a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
doesn't increment the version pre-update, and we can return
completely
bogus results.
Yes.
But then the getcpu test would fail (1-0). Even if you have an ABA
situation (1-0-1), it's okay because the pvti that is fetched is the
one returned by the first getcpu.

... this case of partial update of pvti, which is caught by the version
field, if of course different from the other (extremely unlikely) that
Andy pointed out. That is when the getcpus are done on the same vCPU,
but the rdtsc is another.

That one can be fixed by rdtscp, like

do {
// get a consistent (pvti, v, tsc) tuple
do {
cpu = get_cpu();
pvti = get_pvti(cpu);
v = pvti-version ~1;
// also acts as rmb();
rdtsc_barrier();
tsc = rdtscp(cpu1);

Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD
specified it that way and both AMD and Intel implement it correctly.
(rdtsc, on the other hand, definitely needs the barrier beforehand.)

// control dependency, no need for rdtsc_barrier?
} while(cpu != cpu1);

// ... compute nanoseconds from pvti and tsc ...
rmb();
} while(v != pvti-version);

Still no good. We can migrate a bunch of times so we see the same CPU
all three times and *still* don't get a consistent read, unless we
play nasty games with lots of version checks (I have a patch for that,
but I don't like it very much). The patch is here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d

but I don't like it.

Thus far, I've been told unambiguously that a guest can't observe pvti
while it's being written, and I think you're now telling me that this
isn't true and that a guest *can* observe pvti while it's being
written while the low bit of the version field is not set. If so,
this is rather strongly incompatible with the spec in the KVM docs.

I don't suppose that you and Marcelo could agree on what the actual
semantics that KVM provides are and could write it down in a way that
people who haven't spent a long time staring at the request code
understand? And maybe you could even fix the implementation while
you're at it if the implementation is, indeed, broken. I have ugly
patches to fix it here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=3b718a050cba52563d831febc2e1ca184c02bac0

but I'm not thrilled with them.

--Andy

I suppose that separating the version write from the rest of the pvclock
structure is sufficient, as that would guarantee the writes are not
reordered even with fast string REP MOVS.

Thanks for catching this Andy!

Don't you stil need:

version++;
write the rest;
version++;

with possible smp_wmb() in there to keep the compiler from messing around?

Correct. Could just as well follow the protocol and use odd/even, which
is what your patch does.

What is the point with the new flags bit though?

Also, if you do this, can you also make setting and clearing
STABLE_BIT properly atomic across all vCPUs? Or at least do something
like setting it last and clearing it first on vPCU 0?

If the version seqlock works properly across vCPUs, why do you need
STABLE_BIT properly atomic ?

Please define what you mean by properly atomic.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls

On Jan 6, 2015 7:34 AM, Borislav Petkov b...@alien8.de wrote:

 On Mon, Jan 05, 2015 at 12:31:15PM -0800, Andy Lutomirski wrote:
  Do you have context tracking on?

 Yap, it is enabled for whatever reason:
 CONFIG_CONTEXT_TRACKING=y
 CONFIG_CONTEXT_TRACKING_FORCE=y
 CONFIG_HAVE_CONTEXT_TRACKING=y

I'll boot a kernel like this on bare metal and see what shakes loose.


  I assume that's in the historical tree?

 Yeah.

   [  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 
   action 0x6 frozen
   [  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
   [  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 
   4096 out
   [  180.072158]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
   (timeout)
 
  That's really weird.  The only thing I can think of is that somehow we
  returned to user mode without enabling interrupts.

 Right, considering FIXUP_TOP_OF_STACK is used in a bunch of cases in
 entry_64.S, no wonder it corrupts something.

  This leads me to wonder: why do we save eflags in the R11 pt_regs
  slot?

 That: If executed in 64-bit mode, SYSRET loads the lower-32 RFLAGS bits
 from R11[31:0] and clears the upper 32 RFLAGS bits.

Sure, but the code would be simpler if we shoved that value in the EFLAGS slot.


  This seems entirely backwards, not to mention that it accounts for two
  instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK
  for no apparently reason whatsoever.

  Can you send the full output from syscall_exit_regs_64 from here:
 
  https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e:
 
  with the patch applied (assuming it even gets that far for you)?  I
  see results like:
 
  [NOTE]syscall : orig RCX = 1  ss = 2b  orig_ss = 6b  flags =
  217  orig_flags = 217
 
  which seems fine.

 ./syscall_exit_regs_64
 [OK]int80 : AX = ffda
 [OK]int80 4000: AX = ffda
 [OK]syscall : RCX = 400962  RIP = 400962
 [OK]syscall : AX = ffda
 [NOTE]  syscall : orig RCX = 1  ss = 2b  orig_ss = 6b  flags = 217  
 orig_flags = 217
 [OK]syscall 4000: RCX = 400962  RIP = 400962
 [FAIL]  syscall 4000: AX = fff7
 [NOTE]  syscall 4000: orig RCX = 1  ss = 2b  orig_ss = 6b  flags = 217  
 orig_flags = 217
 [OK]syscall(): ret = -1, errno = 38

  Are you seeing this with the whole series applied or with only this patch?

 I applied this patch only and started seeing those. Then I booted in the
 previous kernel and tried to repro but it didn't trigger.

 I'll try hammering on the kernel *without* your patch to see whether I
 can trigger it somehow...

Hmm.  I added and pushed a test for fork, but that didn't turn
anything up.  And I don't see any bugs in the code.

I booted 3.18 plus this patch with context tracking forced on on my
laptop, and something seems to have gone wrong.

--Andy


 --
 Regards/Gruss,
 Boris.

 Sent from a fat crate under my desk. Formatting is fine.
 --
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
On Jan 6, 2015 4:01 AM, Paolo Bonzini pbonz...@redhat.com wrote:

That one can be fixed by rdtscp, like

do {
// get a consistent (pvti, v, tsc) tuple
do {
cpu = get_cpu();
pvti = get_pvti(cpu);
v = pvti-version ~1;
// also acts as rmb();
rdtsc_barrier();
tsc = rdtscp(cpu1);

// control dependency, no need for rdtsc_barrier?
} while(cpu != cpu1);

// ... compute nanoseconds from pvti and tsc ...
rmb();
} while(v != pvti-version);

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d

but I don't like it.

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=3b718a050cba52563d831febc2e1ca184c02bac0

but I'm not thrilled with them.

--Andy

I suppose that separating the version write from the rest of the pvclock
structure is sufficient, as that would guarantee the writes are not
reordered even with fast string REP MOVS.

Thanks for catching this Andy!

Don't you stil need:

version++;
write the rest;
version++;

with possible smp_wmb() in there to keep the compiler from messing around?

Correct. Could just as well follow the protocol and use odd/even, which
is what your patch does.

What is the point with the new flags bit though?

To try to work around the problem on old hosts. I'm not at all
convinced that this is worthwhile or that it helps, though.

Also, if you do this, can you also make setting and clearing
STABLE_BIT properly atomic across all vCPUs? Or at least do something
like setting it last and clearing it first on vPCU 0?

If the version seqlock works properly across vCPUs, why do you need
STABLE_BIT properly atomic ?

Please define what you mean by properly atomic.

I'd like to be able to rely using vCPU 0's pvti even from other vCPUs
in the vdso if the stable bit is set. That means that the host should
avoid doing things like migrating the guest, clearing the stable bit
for vCPU 1, resuming vCPU 1, and waiting long enough to clear the
stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data
and return a bad timestamp.

Maybe this scenario is impossible, but getting rid of any getcpu-like
operation in the vdso has really nice benefits. It's faster and it
lets us guarantee that the vdso's pvti data fits in a single page.
The latter means that we can easily make it work like the hpet
mapping, which gets us 32-bit support and will *finally* let us turn
off user access to the fixmap if vsyscall=none.

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

2015-01-06 Thread Marcelo Tosatti

On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
  What is the point with the new flags bit though?
 
 To try to work around the problem on old hosts.  I'm not at all
 convinced that this is worthwhile or that it helps, though.

Don't think so. Just fix the host bug.

  Also, if you do this, can you also make setting and clearing
  STABLE_BIT properly atomic across all vCPUs?  Or at least do something
  like setting it last and clearing it first on vPCU 0?
 
  If the version seqlock works properly across vCPUs, why do you need
  STABLE_BIT properly atomic ?
 
  Please define what you mean by properly atomic.
 
 
 I'd like to be able to rely using vCPU 0's pvti even from other vCPUs
 in the vdso if the stable bit is set.  That means that the host should
 avoid doing things like migrating the guest, clearing the stable bit
 for vCPU 1, resuming vCPU 1, and waiting long enough to clear the
 stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data
 and return a bad timestamp.
 
 Maybe this scenario is impossible, but getting rid of any getcpu-like
 operation in the vdso has really nice benefits. 

You can park every vCPU in host while updating vCPU-0's timestamp.

See kvm_gen_update_masterclock:

+   /* no guest entries from this point */
+   pvclock_update_vm_gtod_copy(kvm);

- touch guest memory

+   /* guest entries allowed */
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   clear_bit(KVM_REQ_MCLOCK_INPROGRESS, vcpu-requests);

  It's faster and it
 lets us guarantee that the vdso's pvti data fits in a single page.
 The latter means that we can easily make it work like the hpet
 mapping, which gets us 32-bit support and will *finally* let us turn
 off user access to the fixmap if vsyscall=none.
 
 (We can, of course, still do this if the pvti data needs to be an
 array, but it's messier.)
 
 --Andy
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v8 34/50] vhost/net: virtio 1.0 byte swap

2015-01-06 Thread Alex Williamson

On Mon, 2014-12-01 at 18:05 +0200, Michael S. Tsirkin wrote:
 I had to add an explicit tag to suppress compiler warning:
 gcc isn't smart enough to notice that
 len is always initialized since function is called with size  0.

I'm getting a panic inside a guest when this change is applied on the
host.  I identified this patch via bisect and confirmed by reverting it
from v3.19-rc2.  Guest is centos6.  Thanks,

Alex

commit 8b38694a2dc8b18374310df50174f1e4376d6824
Author: Michael S. Tsirkin m...@redhat.com
Date:   Fri Oct 24 14:19:48 2014 +0300

vhost/net: virtio 1.0 byte swap

I had to add an explicit tag to suppress compiler warning:
gcc isn't smart enough to notice that
len is always initialized since function is called with size  0.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com

XML chunk:

interface type='direct'
  mac address='52:54:00:64:f3:34'/
  source dev='iscsinet0' mode='bridge'/
  model type='virtio'/
  address type='pci' domain='0x' bus='0x00' slot='0x03' 
function='0x0'/
/interface

Panic log:

1BUG: unable to handle kernel NULL pointer dereference at 0010
1IP: [a0079469] virtnet_poll+0x4f9/0x910 [virtio_net]
4PGD 1aa2f4067 PUD 1aa2f5067 PMD 0 
4Oops:  [#1] SMP 
4last sysfs file: 
/sys/devices/pci:00/:00:03.0/virtio0/net/eth9/ifindex
4CPU 0 
4Modules linked in: 8021q garp stp llc ipt_REJECT nf_conntrack_ipv4 
nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput 
microcode snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq 
snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc igbvf 
nvidia(P)(U) i2c_core tg3 ptp pps_core virtio_balloon virtio_net virtio_console 
ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi 
ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: 
speedstep_lib]
4
4Pid: 1374, comm: NetworkManager Tainted: P   ---
2.6.32-431.23.3.el6.centos.plus.x86_64 #1 QEMU Standard PC (i440FX + PIIX, 1996)
4RIP: 0010:[a0079469]  [a0079469] 
virtnet_poll+0x4f9/0x910 [virtio_net]
4RSP: 0018:880028203e48  EFLAGS: 00010246
4RAX: 8801a3383d00 RBX: 8801a6aaf480 RCX: 8801aa20b6e0
4RDX: 00c0 RSI: 8801a3383c00 RDI: 8801a3383cc0
4RBP: 880028203ed8 R08: 009e R09: 8801aa1d800c
4R10: 0218 R11:  R12: 8801aa20b6e0
4R13:  R14:  R15: 
4FS:  7febf114d800() GS:88002820() knlGS:
4CS:  0010 DS:  ES:  CR0: 80050033
4CR2: 0010 CR3: 0001aa793000 CR4: 06f0
4DR0:  DR1:  DR2: 
4DR3:  DR6: 0ff0 DR7: 0400
4Process NetworkManager (pid: 1374, threadinfo 8801a74ba000, task 
8801a8d56040)
4Stack:
4 8801aa1d8000 009e 8801aa20b6e0 8801aa20b718
4d 8801aa20b780 8801aa1d800c 8801a6aaf4b8 8801aa20b020
4d 0080 8801aa20b708 0001 1f5981a830c8
4Call Trace:
4 IRQ 
4 [8146ae33] net_rx_action+0x103/0x2f0
4 [8107a5f1] __do_softirq+0xc1/0x1e0
4 [8100c30c] ? call_softirq+0x1c/0x30
4 [8100c30c] call_softirq+0x1c/0x30
4 EOI 
4 [8100fa75] ? do_softirq+0x65/0xa0
4 [8107b2ea] local_bh_enable+0x9a/0xb0
4 [a007813a] virtnet_napi_enable+0x4a/0x60 [virtio_net]
4 [a0078ebf] virtnet_open+0x4f/0x60 [virtio_net]
4 [81467691] dev_open+0xa1/0x100
4 [81466751] dev_change_flags+0xa1/0x1d0
4 [81474a59] do_setlink+0x169/0x8b0
4 [814770b6] ? rtnl_fill_ifinfo+0x946/0xcb0
4 [812a3d24] ? nla_parse+0x34/0x110
4 [8147659e] rtnl_setlink+0xee/0x130
4 [81475b67] rtnetlink_rcv_msg+0x2d7/0x340
4 [81231e14] ? socket_has_perm+0x74/0x90
4 [81475890] ? rtnetlink_rcv_msg+0x0/0x340
4 [814910a9] netlink_rcv_skb+0xa9/0xd0
4 [81475875] rtnetlink_rcv+0x25/0x40
4 [81490cdb] netlink_unicast+0x2db/0x320
4 [81491750] netlink_sendmsg+0x2c0/0x3d0
4 [814520c3] sock_sendmsg+0x123/0x150
4 [81453d73] ? sock_recvmsg+0x133/0x160
4 [8109afa0] ? autoremove_wake_function+0x0/0x40
4 [81136941] ? lru_cache_add_lru+0x21/0x40
4 [8115522d] ? page_add_new_anon_rmap+0x9d/0xf0
4 [8114aeef] ? handle_pte_fault+0x4af/0xb00
4 [81451f14] ? move_addr_to_kernel+0x64/0x70
4 [814538b6] __sys_sendmsg+0x406/0x420
4 [8104a98c] ? __do_page_fault+0x1ec/0x480
4 [814523d9] ? sys_sendto+0x139/0x190
4 [8103ea6c] ? kvm_clock_read+0x1c/0x20
4 [81453ad9] sys_sendmsg+0x49/0x90
4 [8100b072] system_call_fastpath+0x16/0x1b
4Code: 83 e0

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

On Tue, Jan 6, 2015 at 12:20 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
  What is the point with the new flags bit though?

 To try to work around the problem on old hosts.  I'm not at all
 convinced that this is worthwhile or that it helps, though.

 Don't think so. Just fix the host bug.

  Also, if you do this, can you also make setting and clearing
  STABLE_BIT properly atomic across all vCPUs?  Or at least do something
  like setting it last and clearing it first on vPCU 0?
 
  If the version seqlock works properly across vCPUs, why do you need
  STABLE_BIT properly atomic ?
 
  Please define what you mean by properly atomic.
 

 I'd like to be able to rely using vCPU 0's pvti even from other vCPUs
 in the vdso if the stable bit is set.  That means that the host should
 avoid doing things like migrating the guest, clearing the stable bit
 for vCPU 1, resuming vCPU 1, and waiting long enough to clear the
 stable bit for vCPU 0 that vCPU 1's vdso code could see invalid data
 and return a bad timestamp.

 Maybe this scenario is impossible, but getting rid of any getcpu-like
 operation in the vdso has really nice benefits.

 You can park every vCPU in host while updating vCPU-0's timestamp.

 See kvm_gen_update_masterclock:

 +   /* no guest entries from this point */
 +   pvclock_update_vm_gtod_copy(kvm);

 - touch guest memory

 +   /* guest entries allowed */
 +   kvm_for_each_vcpu(i, vcpu, kvm)
 +   clear_bit(KVM_REQ_MCLOCK_INPROGRESS, vcpu-requests);


Can we do that easily?  It looks like we're holding a spinlock in
there.  Could we make pvclock_gtod_sync_lock into a mutex?

We could also add something to explicitly prevent any of the guests
from entering until we're updated all of them, but that would hurt
performance even more.  It would be kind of nice if we could avoid
serializing all CPUs entirely, though.  For example, if we could
increment all the versions, then write all the pvtis, then increment
all the versions again from a single function, then everything is
atomic for a correctly behaving guest, but if the guest isn't actually
reading the time, then it doesn't stall.

(Also, we should really have a cpu_relax in all of these loops, IMO,
so that pause loop exiting can take effect.)

--Andy

  It's faster and it
 lets us guarantee that the vdso's pvti data fits in a single page.
 The latter means that we can easily make it work like the hpet
 mapping, which gets us 32-bit support and will *finally* let us turn
 off user access to the fixmap if vsyscall=none.

 (We can, of course, still do this if the pvti data needs to be an
 array, but it's messier.)

 --Andy



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] KVM: introduce kvm_check_device

2015-01-06 Thread Scott Wood

On Tue, 2015-01-06 at 16:12 +, Andre Przywara wrote:
 While we can easily register and unregister KVM devices, there is
 currently no easy way of checking whether a device has been
 registered.
 Introduce kvm_check_device() for that purpose and use it in two
 existing functions. Also change the return code for an invalid
 type number from ENOSPC to EINVAL.
 This function will be later used by another patch set to check
 whether a KVM_CREATE_IRQCHIP ioctl is valid.

You're checking whether a device type has been registered, not the
device itself -- so could you make it kvm_check_device_type()?

 Signed-off-by: Andre Przywara andre.przyw...@arm.com
 ---
 Hi,
 
 can people comment whether there is an easier way to detect KVM
 device registration _outside_ of virt/kvm/kvm_main.c? Using the
 KVM_CREATE_DEVICE_TEST flag sounds like a fit, but
 kvm_ioctl_create_device() isn't exported.

Out of curiosity, why do you need to test it from inside the kernel but
outside kvm_main.c?

-Scott


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls

2015-01-06 Thread Borislav Petkov

On Mon, Jan 05, 2015 at 12:31:15PM -0800, Andy Lutomirski wrote:
 Do you have context tracking on?

Yap, it is enabled for whatever reason:
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_FORCE=y
CONFIG_HAVE_CONTEXT_TRACKING=y

 I assume that's in the historical tree?

Yeah.

  [  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fff SErr 0x0 action 
  0x6 frozen
  [  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
  [  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 
  4096 out
  [  180.072158]  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 
  (timeout)
 
 That's really weird.  The only thing I can think of is that somehow we
 returned to user mode without enabling interrupts.

Right, considering FIXUP_TOP_OF_STACK is used in a bunch of cases in
entry_64.S, no wonder it corrupts something.

 This leads me to wonder: why do we save eflags in the R11 pt_regs
 slot?

That: If executed in 64-bit mode, SYSRET loads the lower-32 RFLAGS bits
from R11[31:0] and clears the upper 32 RFLAGS bits.

 This seems entirely backwards, not to mention that it accounts for two
 instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK
 for no apparently reason whatsoever.

 Can you send the full output from syscall_exit_regs_64 from here:
 
 https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e:
 
 with the patch applied (assuming it even gets that far for you)?  I
 see results like:
 
 [NOTE]syscall : orig RCX = 1  ss = 2b  orig_ss = 6b  flags =
 217  orig_flags = 217
 
 which seems fine.

./syscall_exit_regs_64
[OK]int80 : AX = ffda
[OK]int80 4000: AX = ffda
[OK]syscall : RCX = 400962  RIP = 400962
[OK]syscall : AX = ffda
[NOTE]  syscall : orig RCX = 1  ss = 2b  orig_ss = 6b  flags = 217  
orig_flags = 217
[OK]syscall 4000: RCX = 400962  RIP = 400962
[FAIL]  syscall 4000: AX = fff7
[NOTE]  syscall 4000: orig RCX = 1  ss = 2b  orig_ss = 6b  flags = 217  
orig_flags = 217
[OK]syscall(): ret = -1, errno = 38

 Are you seeing this with the whole series applied or with only this patch?

I applied this patch only and started seeing those. Then I booted in the
previous kernel and tried to repro but it didn't trigger.

I'll try hammering on the kernel *without* your patch to see whether I
can trigger it somehow...

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader



On 06/01/2015 17:56, Andy Lutomirski wrote:
 Still no good.  We can migrate a bunch of times so we see the same CPU
 all three times

There are no three times.  The CPU you see here:

 
 
 // ... compute nanoseconds from pvti and tsc ...
 rmb();
 }   while(v != pvti-version);

is the same you read here:

 cpu = get_cpu();

The algorithm is:

1) get a consistent (cpu, version, tsc)

   1.a) get cpu
   1.b) get pvti[cpu]-version, ignoring low bit
   1.c) get (tsc, cpu)
   1.d) if cpu from 1.a and 1.c do not match, loop
   1.e) if pvti[cpu] was being updated, we'll loop later

2) compute nanoseconds from pvti[cpu] and tsc

3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't
match, retry.

As long as the CPU is consistent between get_cpu() and rdtscp(), there
is no problem with migration, because pvti is always accessed for that
one CPU.  If there were any problem, it would be caught by the version
check.  Writing it down with two nested do...whiles makes it clearer IMHO.

 and *still* don't get a consistent read, unless we
 play nasty games with lots of version checks (I have a patch for that,
 but I don't like it very much).  The patch is here:
 
 https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d
 
 but I don't like it.
 
 Thus far, I've been told unambiguously that a guest can't observe pvti
 while it's being written, and I think you're now telling me that this
 isn't true and that a guest *can* observe pvti while it's being
 written while the low bit of the version field is not set.  If so,
 this is rather strongly incompatible with the spec in the KVM docs.

Where am I saying that?

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader



On 06/01/2015 19:26, Andy Lutomirski wrote:
 Don't you stil need:
 
 version++;
 write the rest;
 version++;
 
 with possible smp_wmb() in there to keep the compiler from messing around?

No, see my other reply.

Separating the version write is a real bug, but that should be all that
it's needed.

 Also, if you do this, can you also make setting and clearing
 STABLE_BIT properly atomic across all vCPUs?  Or at least do something
 like setting it last and clearing it first on vPCU 0?

That would be nice if you want to make the pvclock area fit in a single
page.  However, it would have to be a separate flag bit, or a separate
CPUID feature.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

On Tue, Jan 6, 2015 at 9:38 PM, Paolo Bonzini pbonz...@redhat.com wrote:

On 06/01/2015 17:56, Andy Lutomirski wrote:
Still no good. We can migrate a bunch of times so we see the same CPU
all three times

There are no three times. The CPU you see here:

// ... compute nanoseconds from pvti and tsc ...
rmb();
} while(v != pvti-version);

is the same you read here:

cpu = get_cpu();

The algorithm is:

I still don't see why this is safe, and I think that the issue is that
you left out part of the loop.

1) get a consistent (cpu, version, tsc)

1.a) get cpu

Suppose we observe cpu 0.

1.b) get pvti[cpu]-version, ignoring low bit

Missing step, presumably here: read pvti[cpu]-tsc_timestamp, scale,
etc. This could all execute on vCPU 1. We could read values that are
inconsistent with each other.

1.c) get (tsc, cpu)

Now we could end up back on vCPU 0.

1.d) if cpu from 1.a and 1.c do not match, loop
1.e) if pvti[cpu] was being updated, we'll loop later

2) compute nanoseconds from pvti[cpu] and tsc

3) if pvti[cpu] changed under our feet during (2), i.e. version doesn't
match, retry.

As long as the CPU is consistent between get_cpu() and rdtscp(), there
is no problem with migration, because pvti is always accessed for that
one CPU. If there were any problem, it would be caught by the version
check. Writing it down with two nested do...whiles makes it clearer IMHO.

Why exactly would it be caught by the version check?

My ugly patch tries to make the argument that, at any point at which
we observe ourselves to be on a given vCPU, that vCPU won't be
updating pvti. That means that, if version doesn't change between two
consecutive observations that we're on that vCPU, then we're okay.
This IMO sucks. It's fragile, it's hard to make a coherent argument
about correctness, and it requires at least two getcpu-like operations
to read the time. Those operations are *slow*. One is much better
than two, and zero is much better than one.

and *still* don't get a consistent read, unless we
play nasty games with lots of version checks (I have a patch for that,
but I don't like it very much). The patch is here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d

but I don't like it.

Where am I saying that?

I thought the conclusion from what you and Marcelo pointed out about
the code was that, once the first vCPU updated its pvti, it could
start running guest code while the other vCPUs are still updating
pvti, so its guest code can observe the other vCPUs mid-update.

Also, if you do this, can you also make setting and clearing
STABLE_BIT properly atomic across all vCPUs? Or at least do something
like setting it last and clearing it first on vPCU 0?

That would be nice if you want to make the pvclock area fit in a single
page. However, it would have to be a separate flag bit, or a separate
CPUID feature.

It would be nice to have. Although I think that fixing the host to
increment version pre-update and post-update may actually be good
enough. Is there any case in which it would fail in practice if we
made that fix and always looked at pvti 0?

--Andy

Paolo

--
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

[RFC PATCH] KVM: introduce kvm_check_device

2015-01-06 Thread Andre Przywara

While we can easily register and unregister KVM devices, there is
currently no easy way of checking whether a device has been
registered.
Introduce kvm_check_device() for that purpose and use it in two
existing functions. Also change the return code for an invalid
type number from ENOSPC to EINVAL.
This function will be later used by another patch set to check
whether a KVM_CREATE_IRQCHIP ioctl is valid.

Signed-off-by: Andre Przywara andre.przyw...@arm.com
---
Hi,

can people comment whether there is an easier way to detect KVM
device registration _outside_ of virt/kvm/kvm_main.c? Using the
KVM_CREATE_DEVICE_TEST flag sounds like a fit, but
kvm_ioctl_create_device() isn't exported.
Also is my return code semantics too exerted?

Cheers,
Andre.

 include/linux/kvm_host.h |1 +
 virt/kvm/kvm_main.c  |   33 +
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 83d770e..78e94ef 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1038,6 +1038,7 @@ void kvm_device_get(struct kvm_device *dev);
 void kvm_device_put(struct kvm_device *dev);
 struct kvm_device *kvm_device_from_filp(struct file *filp);
 int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type);
+int kvm_check_device(u32 type);
 void kvm_unregister_device_ops(u32 type);
 
 extern struct kvm_device_ops kvm_mpic_ops;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1cc6e2e..42604a9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2341,13 +2341,24 @@ static struct kvm_device_ops 
*kvm_device_ops_table[KVM_DEV_TYPE_MAX] = {
 #endif
 };
 
-int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type)
+int kvm_check_device(u32 type)
 {
if (type = ARRAY_SIZE(kvm_device_ops_table))
-   return -ENOSPC;
+   return -EINVAL;
 
-   if (kvm_device_ops_table[type] != NULL)
-   return -EEXIST;
+   if (kvm_device_ops_table[type] == NULL)
+   return -ENODEV;
+
+   return -EEXIST;
+}
+
+int kvm_register_device_ops(struct kvm_device_ops *ops, u32 type)
+{
+   int ret;
+
+   ret = kvm_check_device(type);
+   if (ret != -ENODEV)
+   return ret;
 
kvm_device_ops_table[type] = ops;
return 0;
@@ -2364,19 +2375,17 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 {
struct kvm_device_ops *ops = NULL;
struct kvm_device *dev;
-   bool test = cd-flags  KVM_CREATE_DEVICE_TEST;
int ret;
 
-   if (cd-type = ARRAY_SIZE(kvm_device_ops_table))
-   return -ENODEV;
-
-   ops = kvm_device_ops_table[cd-type];
-   if (ops == NULL)
-   return -ENODEV;
+   ret = kvm_check_device(cd-type);
+   if (ret != -EEXIST)
+   return ret;
 
-   if (test)
+   if (cd-flags  KVM_CREATE_DEVICE_TEST)
return 0;
 
+   ops = kvm_device_ops_table[cd-type];
+
dev = kzalloc(sizeof(*dev), GFP_KERNEL);
if (!dev)
return -ENOMEM;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader

On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
 On Jan 6, 2015 4:01 AM, Paolo Bonzini pbonz...@redhat.com wrote:
 
 
 
  On 06/01/2015 09:42, Paolo Bonzini wrote:
 Still confused.  So we can freeze all vCPUs in the host, then update
 pvti 1, then resume vCPU 1, then update pvti 0?  In that case, we 
 have
 a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
 doesn't increment the version pre-update, and we can return 
 completely
 bogus results.
Yes.
   But then the getcpu test would fail (1-0).  Even if you have an ABA
   situation (1-0-1), it's okay because the pvti that is fetched is the
   one returned by the first getcpu.
 
  ... this case of partial update of pvti, which is caught by the version
  field, if of course different from the other (extremely unlikely) that
  Andy pointed out.  That is when the getcpus are done on the same vCPU,
  but the rdtsc is another.
 
  That one can be fixed by rdtscp, like
 
  do {
  // get a consistent (pvti, v, tsc) tuple
  do {
  cpu = get_cpu();
  pvti = get_pvti(cpu);
  v = pvti-version  ~1;
  // also acts as rmb();
  rdtsc_barrier();
  tsc = rdtscp(cpu1);

 Off-topic note: rdtscp doesn't need a barrier at all.  AIUI AMD
 specified it that way and both AMD and Intel implement it correctly.
 (rdtsc, on the other hand, definitely needs the barrier beforehand.)

  // control dependency, no need for rdtsc_barrier?
  } while(cpu != cpu1);
 
  // ... compute nanoseconds from pvti and tsc ...
  rmb();
  }   while(v != pvti-version);

 Still no good.  We can migrate a bunch of times so we see the same CPU
 all three times and *still* don't get a consistent read, unless we
 play nasty games with lots of version checks (I have a patch for that,
 but I don't like it very much).  The patch is here:

 https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=a69754dc5ff33f5187162b5338854ad23dd7be8d

 but I don't like it.

 Thus far, I've been told unambiguously that a guest can't observe pvti
 while it's being written, and I think you're now telling me that this
 isn't true and that a guest *can* observe pvti while it's being
 written while the low bit of the version field is not set.  If so,
 this is rather strongly incompatible with the spec in the KVM docs.

 I don't suppose that you and Marcelo could agree on what the actual
 semantics that KVM provides are and could write it down in a way that
 people who haven't spent a long time staring at the request code
 understand?  And maybe you could even fix the implementation while
 you're at it if the implementation is, indeed, broken.  I have ugly
 patches to fix it here:

 https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoiaid=3b718a050cba52563d831febc2e1ca184c02bac0

 but I'm not thrilled with them.

 --Andy

 I suppose that separating the version write from the rest of the pvclock
 structure is sufficient, as that would guarantee the writes are not
 reordered even with fast string REP MOVS.

 Thanks for catching this Andy!


Don't you stil need:

version++;
write the rest;
version++;

with possible smp_wmb() in there to keep the compiler from messing around?

Also, if you do this, can you also make setting and clearing
STABLE_BIT properly atomic across all vCPUs?  Or at least do something
like setting it last and clearing it first on vPCU 0?

--Andy
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader