[PATCH v5 4/7] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-28 Thread Eric Auger
This patch adds the reset function registration/unregistration.
This is handled through the module_vfio_reset_handler macro. This
latter also defines a MODULE_ALIAS which simplifies the load from
vfio-platform.

Signed-off-by: Eric Auger 
Reviewed-by: Arnd Bergmann 

---

v3 -> v4:
- I restored the EXPORT_SYMBOL which will be removed when switching the
  lookup method
- Add Arnd R-b.

v2 -> v3:
- do not include vfio_platform_reset_private.h anymore (removed)
- remove pr_info
- rework commit message

v1 -> v2:
- uses the module_vfio_reset_handler macro
- add pr_info on vfio reset
- do not export vfio_platform_calxedaxgmac_reset symbol anymore
---
 drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c 
b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
index 619dc7d..80718f2 100644
--- a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
+++ b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
@@ -30,8 +30,6 @@
 #define DRIVER_AUTHOR   "Eric Auger "
 #define DRIVER_DESC "Reset support for Calxeda xgmac vfio platform device"
 
-#define CALXEDAXGMAC_COMPAT "calxeda,hb-xgmac"
-
 /* XGMAC Register definitions */
 #define XGMAC_CONTROL   0x  /* MAC Configuration */
 
@@ -80,6 +78,8 @@ int vfio_platform_calxedaxgmac_reset(struct 
vfio_platform_device *vdev)
 }
 EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset);
 
+module_vfio_reset_handler("calxeda,hb-xgmac", 
vfio_platform_calxedaxgmac_reset);
+
 MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR(DRIVER_AUTHOR);
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Michael S. Tsirkin
On Wed, Oct 28, 2015 at 05:36:53PM +0900, Benjamin Herrenschmidt wrote:
> On Wed, 2015-10-28 at 16:40 +0900, Christian Borntraeger wrote:
> > We have discussed that at kernel summit. I will try to implement a dummy 
> > dma_ops for
> > s390 that does 1:1 mapping and Ben will look into doing some quirk to 
> > handle "old"
> > code in addition to also make it possible to mark devices as iommu bypass 
> > (IIRC,
> > via device tree, Ben?)
> 
> Something like that yes. I'll look into it when I'm back home.
> 
> Cheers,
> Ben.

OK so I guess that means we should prefer a transport-specific
interface in virtio-pci then.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4] VFIO: platform: reset: AMD xgbe reset module

2015-10-28 Thread Eric Auger
This patch introduces a module that registers and implements a low-level
reset function for the AMD XGBE device.

it performs the following actions:
- reset the PHY
- disable auto-negotiation
- disable & clear auto-negotiation IRQ
- soft-reset the MAC

Those tiny pieces of code are inherited from the native xgbe driver.

Signed-off-by: Eric Auger 
Reviewed-by: Arnd Bergmann 

---

Applies on top of [PATCH v5 0/7] VFIO platform reset module rework

v3 -> v4:
- add Arnd's R-b

v2 -> v3:
- in Kconfig, add empty line between the 2 options
- remove DRIVER_VERSION, DRIVER_AUTHOR and DRIVER_DESC and put
  strings directly in MODULE macros

v1 -> v2:
- uses module_vfio_reset_handler macro
---
 drivers/vfio/platform/reset/Kconfig|   8 ++
 drivers/vfio/platform/reset/Makefile   |   2 +
 .../vfio/platform/reset/vfio_platform_amdxgbe.c| 127 +
 3 files changed, 137 insertions(+)
 create mode 100644 drivers/vfio/platform/reset/vfio_platform_amdxgbe.c

diff --git a/drivers/vfio/platform/reset/Kconfig 
b/drivers/vfio/platform/reset/Kconfig
index 746b96b..705 100644
--- a/drivers/vfio/platform/reset/Kconfig
+++ b/drivers/vfio/platform/reset/Kconfig
@@ -5,3 +5,11 @@ config VFIO_PLATFORM_CALXEDAXGMAC_RESET
  Enables the VFIO platform driver to handle reset for Calxeda xgmac
 
  If you don't know what to do here, say N.
+
+config VFIO_PLATFORM_AMDXGBE_RESET
+   tristate "VFIO support for AMD XGBE reset"
+   depends on VFIO_PLATFORM
+   help
+ Enables the VFIO platform driver to handle reset for AMD XGBE
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/platform/reset/Makefile 
b/drivers/vfio/platform/reset/Makefile
index 2a486af..93f4e23 100644
--- a/drivers/vfio/platform/reset/Makefile
+++ b/drivers/vfio/platform/reset/Makefile
@@ -1,5 +1,7 @@
 vfio-platform-calxedaxgmac-y := vfio_platform_calxedaxgmac.o
+vfio-platform-amdxgbe-y := vfio_platform_amdxgbe.o
 
 ccflags-y += -Idrivers/vfio/platform
 
 obj-$(CONFIG_VFIO_PLATFORM_CALXEDAXGMAC_RESET) += vfio-platform-calxedaxgmac.o
+obj-$(CONFIG_VFIO_PLATFORM_AMDXGBE_RESET) += vfio-platform-amdxgbe.o
diff --git a/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c 
b/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c
new file mode 100644
index 000..1636e22
--- /dev/null
+++ b/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c
@@ -0,0 +1,127 @@
+/*
+ * VFIO platform driver specialized for AMD xgbe reset
+ * reset code is inherited from AMD xgbe native driver
+ *
+ * Copyright (c) 2015 Linaro Ltd.
+ *  www.linaro.org
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program.  If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vfio_platform_private.h"
+
+#define DMA_MR 0x3000
+#define MAC_VR 0x0110
+#define DMA_ISR0x3008
+#define MAC_ISR0x00b0
+#define PCS_MMD_SELECT 0xff
+#define MDIO_AN_INT0x8002
+#define MDIO_AN_INTMASK0x8001
+
+static unsigned int xmdio_read(void *ioaddr, unsigned int mmd,
+  unsigned int reg)
+{
+   unsigned int mmd_address, value;
+
+   mmd_address = (mmd << 16) | ((reg) & 0x);
+   iowrite32(mmd_address >> 8, ioaddr + (PCS_MMD_SELECT << 2));
+   value = ioread32(ioaddr + ((mmd_address & 0xff) << 2));
+   return value;
+}
+
+static void xmdio_write(void *ioaddr, unsigned int mmd,
+   unsigned int reg, unsigned int value)
+{
+   unsigned int mmd_address;
+
+   mmd_address = (mmd << 16) | ((reg) & 0x);
+   iowrite32(mmd_address >> 8, ioaddr + (PCS_MMD_SELECT << 2));
+   iowrite32(value, ioaddr + ((mmd_address & 0xff) << 2));
+}
+
+int vfio_platform_amdxgbe_reset(struct vfio_platform_device *vdev)
+{
+   struct vfio_platform_region xgmac_regs = vdev->regions[0];
+   struct vfio_platform_region xpcs_regs = vdev->regions[1];
+   u32 dma_mr_value, pcs_value, value;
+   unsigned int count;
+
+   if (!xgmac_regs.ioaddr) {
+   xgmac_regs.ioaddr =
+   ioremap_nocache(xgmac_regs.addr, xgmac_regs.size);
+   if (!xgmac_regs.ioaddr)
+   return -ENOMEM;
+   }
+   if (!xpcs_regs.ioaddr) {
+   xpcs_regs.ioaddr =
+  

[PATCH v5 2/7] vfio: platform: add capability to register a reset function

2015-10-28 Thread Eric Auger
In preparation for subsequent changes in reset function lookup,
lets introduce a dynamic list of reset combos (compat string,
reset module, reset function). The list can be populated/voided with
vfio_platform_register/unregister_reset. Those are not yet used in
this patch.

Signed-off-by: Eric Auger 
Reviewed-by: Arnd Bergmann 

---

v4 -> v5:
- add Arnd's R-b

v3 -> v4:
- __vfio_platform_register_reset does not return any value anymore
- vfio_platform_unregister_reset also takes the reset function pointer
  as parameter

v2 -> v3:
- use goto out to have a single mutex_unlock
- implement vfio_platform_register_reset as a macro (suggested by Arnd)
- move reset_node struct declaration back to vfio_platform_private.h
- vfio_platform_unregister_reset does not return any value anymore

v1 -> v2:
- reset_list becomes static
- vfio_platform_register/unregister_reset take a const char * as compat
- fix node leak
- add reset_lock to protect the reset list manipulation
- move vfio_platform_reset_node declaration in vfio_platform_common.c
---
 drivers/vfio/platform/vfio_platform_common.c  | 27 +++
 drivers/vfio/platform/vfio_platform_private.h | 20 
 2 files changed, 47 insertions(+)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 184e9d2..3b7e52c 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -27,6 +27,7 @@
 #define DRIVER_AUTHOR   "Antonios Motakis "
 #define DRIVER_DESC "VFIO platform base module"
 
+static LIST_HEAD(reset_list);
 static DEFINE_MUTEX(driver_lock);
 
 static const struct vfio_platform_reset_combo reset_lookup_table[] = {
@@ -578,6 +579,32 @@ struct vfio_platform_device 
*vfio_platform_remove_common(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(vfio_platform_remove_common);
 
+void __vfio_platform_register_reset(struct vfio_platform_reset_node *node)
+{
+   mutex_lock(_lock);
+   list_add(>link, _list);
+   mutex_unlock(_lock);
+}
+EXPORT_SYMBOL_GPL(__vfio_platform_register_reset);
+
+void vfio_platform_unregister_reset(const char *compat,
+   vfio_platform_reset_fn_t fn)
+{
+   struct vfio_platform_reset_node *iter, *temp;
+
+   mutex_lock(_lock);
+   list_for_each_entry_safe(iter, temp, _list, link) {
+   if (!strcmp(iter->compat, compat) && (iter->reset == fn)) {
+   list_del(>link);
+   break;
+   }
+   }
+
+   mutex_unlock(_lock);
+
+}
+EXPORT_SYMBOL_GPL(vfio_platform_unregister_reset);
+
 MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR(DRIVER_AUTHOR);
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index 7128690..c563940 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -71,6 +71,15 @@ struct vfio_platform_device {
int (*reset)(struct vfio_platform_device *vdev);
 };
 
+typedef int (*vfio_platform_reset_fn_t)(struct vfio_platform_device *vdev);
+
+struct vfio_platform_reset_node {
+   struct list_head link;
+   char *compat;
+   struct module *owner;
+   vfio_platform_reset_fn_t reset;
+};
+
 struct vfio_platform_reset_combo {
const char *compat;
const char *reset_function_name;
@@ -90,4 +99,15 @@ extern int vfio_platform_set_irqs_ioctl(struct 
vfio_platform_device *vdev,
unsigned start, unsigned count,
void *data);
 
+extern void __vfio_platform_register_reset(struct vfio_platform_reset_node *n);
+extern void vfio_platform_unregister_reset(const char *compat,
+  vfio_platform_reset_fn_t fn);
+#define vfio_platform_register_reset(__compat, __reset)\
+static struct vfio_platform_reset_node __reset ## _node = {\
+   .owner = THIS_MODULE,   \
+   .compat = __compat, \
+   .reset = __reset,   \
+}; \
+__vfio_platform_register_reset(&__reset ## _node)
+
 #endif /* VFIO_PLATFORM_PRIVATE_H */
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 3/7] vfio: platform: introduce module_vfio_reset_handler macro

2015-10-28 Thread Eric Auger
The module_vfio_reset_handler macro
- define a module alias
- implement module init/exit function which respectively registers
  and unregisters the reset function.

Signed-off-by: Eric Auger 
Reviewed-by: Arnd Bergmann 

---
v4 -> v5:
- add Arnd's R-b

v3 -> v4:
- pass reset to vfio_platform_unregister_reset

v2 -> v3:
- use vfio_platform_register_reset macro

v1 -> v2:
- remove vfio_platform_reset_private.h and move back the macro to
  vfio_platform_private.h header: removed reset_module_register &
  unregister (symbol_get)
- defines the module_vfio_reset_handler macro as suggested by Arnd
  (formerly in vfio_platform_reset_private.h)
---
 drivers/vfio/platform/vfio_platform_private.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index c563940..fd262be 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -110,4 +110,18 @@ static struct vfio_platform_reset_node __reset ## _node = 
{\
 }; \
 __vfio_platform_register_reset(&__reset ## _node)
 
+#define module_vfio_reset_handler(compat, reset)   \
+MODULE_ALIAS("vfio-reset:" compat);\
+static int __init reset ## _module_init(void)  \
+{  \
+   vfio_platform_register_reset(compat, reset);\
+   return 0;   \
+}; \
+static void __exit reset ## _module_exit(void) \
+{  \
+   vfio_platform_unregister_reset(compat, reset);  \
+}; \
+module_init(reset ## _module_init);\
+module_exit(reset ## _module_exit)
+
 #endif /* VFIO_PLATFORM_PRIVATE_H */
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Michael S. Tsirkin
On Wed, Oct 28, 2015 at 05:09:47PM +0900, David Woodhouse wrote:
> On Wed, 2015-10-28 at 16:40 +0900, Christian Borntraeger wrote:
> > Am 28.10.2015 um 16:17 schrieb Michael S. Tsirkin:
> > > On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
> > > > This switches virtio to use the DMA API unconditionally.  I'm sure
> > > > it breaks things, but it seems to work on x86 using virtio-pci, with
> > > > and without Xen, and using both the modern 1.0 variant and the
> > > > legacy variant.
> > > 
> > > I'm very glad to see work on this making progress.
> > > 
> > > I suspect we'll have to find a way to make this optional though, and
> > > keep doing the non-DMA API thing with old devices.  And I've been
> > > debating with myself whether a pci specific thing or a feature bit is
> > > preferable.
> > > 
> > 
> > We have discussed that at kernel summit. I will try to implement a dummy 
> > dma_ops for
> > s390 that does 1:1 mapping and Ben will look into doing some quirk to 
> > handle "old"
> > code in addition to also make it possible to mark devices as iommu bypass 
> > (IIRC,
> > via device tree, Ben?)
> 
> Right. You never eschew the DMA API in the *driver* — you just expect
> the DMA API to do the right thing for devices which don't need
> translation (with platforms using per-device dma_ops and generally
> getting their act together).
> We're pushing that on the platforms where it's currently an issue,
> including Power, SPARC and S390.
> 
> -- 
> dwmw2
> 
> 

Well APIs are just that - internal kernel APIs.
If the only user of an API is virtio, we can strick the
code in virtio.h just as well.
I think controlling this dynamically and not statically
in e.g. devicetree is important though.

E.g. on intel x86, there's an option iommu=pt which does the 1:1
thing for devices when used by kernel, but enables
the iommu if used by userspace/VMs.

Something like this would be needed for other platforms IMHO.

And given that
1. virtio seems the only user so far
2. supporting this per device seems like something that
   might become useful in the future
maybe we'd better make this part of virtio transports.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 13:10 +0200, Shamir Rabinovitch wrote:
> On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > > +For systems with IOMMU it is assumed all DMA translations use the 
> > > > IOMMU.
> > 
> > Not entirely true. We have per-device dma_ops on a most architectures
> > already, and we were just talking about the need to add them to
> > POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> > map virtio devices too.
> 
> SPARC has it's implementation under arch/sparc for dma_ops (sun4v_dma_ops).
> 
> Some drivers use IOMMU under SPARC for example ixgbe (Intel 10G ETH).
> Some, like IB, suffer from IOMMU MAP setup/tear-down & limited address range.
> On SPARC IOMMU bypass is not total bypass of the IOMMU but rather much simple
> translation that does not require any complex translations tables.

We have an option in the Intel IOMMU for pass-through mode too, which
basically *is* a total bypass. In practice, what's the difference
between that and a "simple translation that does not require any
[translation]"? We set up a full 1:1 mapping of all memory, and then
the map/unmap methods become no-ops.

Currently we have no way to request that mode on a per-device basis; we
only have 'iommu=pt' on the command line to set it for *all* devices.
But performance-sensitive devices might want it, while we keep doing
proper translation for others.

> > 
> > As we look at that (and make the per-device dma_ops a generic thing
> > rather than per-arch), we should probably look at folding your case in
> > too.
> 
> Whether to use IOMMU or not for DMA is up to the driver. The above example
> show real situation where one driver can use IOMMU and the other can't. It
> seems that the device cannot know when to use IOMMU bypass and when not.
> Even in the driver we can decide to DMA map some buffers using IOMMU
> translation and some as IOMMU bypass.

In practice there are multiple possibilities — there are cases where
you *must* use an IOMMU and do full translation, and there is no option
of a bypass. There are cases where there just isn't an IOMMU (and
sometimes that's a per-device fact, like with virtio). And there are
cases where you *can* use the IOMMU, but if you ask nicely you can get
away without it.

My point in linking up these two threads is that we should contemplate
all of those use cases and come up with something that addresses it
all.

> Do you agree that we need this attribute in the generic DMA API? 

Yeah, I think it can be useful.

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 13:35 +0200, Michael S. Tsirkin wrote:
> E.g. on intel x86, there's an option iommu=pt which does the 1:1
> thing for devices when used by kernel, but enables
> the iommu if used by userspace/VMs.

That's none of your business.

You call the DMA API when you do DMA. That's all there is to it.

If the IOMMU happens to be in passthrough mode, or your device happens
to not to be routed through an IOMMU today, then I/O virtual address
you get back from the DMA API will look a *lot* like the physical
address you asked the DMA to map. You might think there's no IOMMU. We
couldn't possibly comment.

Use the DMA API. Always. Let the platform worry about whether it
actually needs to *do* anything or not.

-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


Re: virtio assisted migration

2015-10-28 Thread Dave Young
On 10/28/15 at 10:27am, Stefan Hajnoczi wrote:
> On Sat, Oct 24, 2015 at 05:21:01PM +0800, Dave Young wrote:
> > * block device
> > For block storage migration, the problem is similar as memory. The original
> > migration does not consider storage usage ratio it just copy all the 
> > sectors.
> 
> This is equivalent to issuing discard requests.  File systems can
> already do that.
> 
> The block/mirror.c code checks if sectors are allocated so this should
> be possible to achieve this today (with guest cooperation).

Yes, I did using hole punching in my code, I remember I did it on top of some
old thread from Christoph Hellwig about puncing hole, not sure if they have been
merged in qemu upstream.

I lost my original patch thus I just put tarballs on the website, but seems I
successfully generated the diff with official qemu-kvm-0.13.0.

I have just put the hack patch of qemu to the url:
http://people.redhat.com/ruyang/kvm/qemu-kvm-patches/qemu-kvm-0.13.0-block-migration-sparse.patch
Another two patches for virtio qemu host side to work stay same directory as
 well.

Thanks
Dave
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Miller
From: David Woodhouse 
Date: Wed, 28 Oct 2015 22:31:50 +0900

> On Wed, 2015-10-28 at 13:10 +0200, Shamir Rabinovitch wrote:
>> On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
>> > > > +For systems with IOMMU it is assumed all DMA translations use the 
>> > > > IOMMU.
>> > 
>> > Not entirely true. We have per-device dma_ops on a most architectures
>> > already, and we were just talking about the need to add them to
>> > POWER/SPARC too, because we need to avoid trying to use the IOMMU to
>> > map virtio devices too.
>> 
>> SPARC has it's implementation under arch/sparc for dma_ops (sun4v_dma_ops).
>> 
>> Some drivers use IOMMU under SPARC for example ixgbe (Intel 10G ETH).
>> Some, like IB, suffer from IOMMU MAP setup/tear-down & limited address range.
>> On SPARC IOMMU bypass is not total bypass of the IOMMU but rather much simple
>> translation that does not require any complex translations tables.
> 
> We have an option in the Intel IOMMU for pass-through mode too, which
> basically *is* a total bypass. In practice, what's the difference
> between that and a "simple translation that does not require any
> [translation]"? We set up a full 1:1 mapping of all memory, and then
> the map/unmap methods become no-ops.
> 
> Currently we have no way to request that mode on a per-device basis; we
> only have 'iommu=pt' on the command line to set it for *all* devices.
> But performance-sensitive devices might want it, while we keep doing
> proper translation for others.

In the sparc64 case, the 64-bit DMA address space is divided into
IOMMU translated and non-IOMMU translated.

You just set the high bits differently depending upon what you want.

So a device could use both IOMMU translated and bypass accesses at the
same time.  While seemingly interesting, I do not recommend we provide
this kind of flexibility in our DMA interfaces.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 1/7] vfio: platform: introduce vfio-platform-base module

2015-10-28 Thread Eric Auger
To prepare for vfio platform reset rework let's build
vfio_platform_common.c and vfio_platform_irq.c in a separate
module from vfio-platform and vfio-amba. This makes possible
to have separate module inits and works around a race between
platform driver init and vfio reset module init: that way we
make sure symbols exported by base are available when vfio-platform
driver gets probed.

The open/release being implemented in the base module, the ref
count is applied to the parent module instead.

Signed-off-by: Eric Auger 
Suggested-by: Arnd Bergmann 
Reviewed-by: Arnd Bergmann 

---
v3 -> v4:
- add Arnd R-b

v3: creation
---
 drivers/vfio/platform/Makefile|  6 --
 drivers/vfio/platform/vfio_amba.c |  1 +
 drivers/vfio/platform/vfio_platform.c |  1 +
 drivers/vfio/platform/vfio_platform_common.c  | 13 +++--
 drivers/vfio/platform/vfio_platform_private.h |  1 +
 5 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
index 9ce8afe..41a6224 100644
--- a/drivers/vfio/platform/Makefile
+++ b/drivers/vfio/platform/Makefile
@@ -1,10 +1,12 @@
-
-vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o
+vfio-platform-base-y := vfio_platform_common.o vfio_platform_irq.o
+vfio-platform-y := vfio_platform.o
 
 obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
+obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform-base.o
 obj-$(CONFIG_VFIO_PLATFORM) += reset/
 
 vfio-amba-y := vfio_amba.o
 
 obj-$(CONFIG_VFIO_AMBA) += vfio-amba.o
+obj-$(CONFIG_VFIO_AMBA) += vfio-platform-base.o
 obj-$(CONFIG_VFIO_AMBA) += reset/
diff --git a/drivers/vfio/platform/vfio_amba.c 
b/drivers/vfio/platform/vfio_amba.c
index ff0331f..a66479b 100644
--- a/drivers/vfio/platform/vfio_amba.c
+++ b/drivers/vfio/platform/vfio_amba.c
@@ -67,6 +67,7 @@ static int vfio_amba_probe(struct amba_device *adev, const 
struct amba_id *id)
vdev->flags = VFIO_DEVICE_FLAGS_AMBA;
vdev->get_resource = get_amba_resource;
vdev->get_irq = get_amba_irq;
+   vdev->parent_module = THIS_MODULE;
 
ret = vfio_platform_probe_common(vdev, >dev);
if (ret) {
diff --git a/drivers/vfio/platform/vfio_platform.c 
b/drivers/vfio/platform/vfio_platform.c
index cef645c..f1625dc 100644
--- a/drivers/vfio/platform/vfio_platform.c
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -65,6 +65,7 @@ static int vfio_platform_probe(struct platform_device *pdev)
vdev->flags = VFIO_DEVICE_FLAGS_PLATFORM;
vdev->get_resource = get_platform_resource;
vdev->get_irq = get_platform_irq;
+   vdev->parent_module = THIS_MODULE;
 
ret = vfio_platform_probe_common(vdev, >dev);
if (ret)
diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index e43efb5..184e9d2 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -23,6 +23,10 @@
 
 #include "vfio_platform_private.h"
 
+#define DRIVER_VERSION  "0.10"
+#define DRIVER_AUTHOR   "Antonios Motakis "
+#define DRIVER_DESC "VFIO platform base module"
+
 static DEFINE_MUTEX(driver_lock);
 
 static const struct vfio_platform_reset_combo reset_lookup_table[] = {
@@ -146,7 +150,7 @@ static void vfio_platform_release(void *device_data)
 
mutex_unlock(_lock);
 
-   module_put(THIS_MODULE);
+   module_put(vdev->parent_module);
 }
 
 static int vfio_platform_open(void *device_data)
@@ -154,7 +158,7 @@ static int vfio_platform_open(void *device_data)
struct vfio_platform_device *vdev = device_data;
int ret;
 
-   if (!try_module_get(THIS_MODULE))
+   if (!try_module_get(vdev->parent_module))
return -ENODEV;
 
mutex_lock(_lock);
@@ -573,3 +577,8 @@ struct vfio_platform_device 
*vfio_platform_remove_common(struct device *dev)
return vdev;
 }
 EXPORT_SYMBOL_GPL(vfio_platform_remove_common);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index 1c9b3d5..7128690 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -56,6 +56,7 @@ struct vfio_platform_device {
u32 num_irqs;
int refcnt;
struct mutexigate;
+   struct module   *parent_module;
 
/*
 * These fields should be filled by the bus specific binder
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 0/7] VFIO platform reset module rework

2015-10-28 Thread Eric Auger
This series fixes the current implementation by getting rid of the
usage of __symbol_get which caused a compilation issue with
CONFIG_MODULES disabled. On top of this, the usage of MODULE_ALIAS makes
possible to add a new reset module without being obliged to update the
framework. The new implementation relies on the reset module registering
its reset function to the vfio-platform driver.

The series is available at

https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.3-rc7-rework-v5

Best Regards

Eric

v4 -> v5:
- no code change
- only added Arnd's new R-b

v3 -> v4:
- Remove the EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset) later
  in [6/7], to keep the functionality working all along the series
- Add Arnd R-b (I dared to keep them despite the above change)
- vfio_platform_unregister_reset gets the reset function to do a double
  check on the compat and the function pointer too
- __vfio_platform_register_reset turned to 'void'

v2 -> v3:
- use driver_mutex instead of reset_mutex
- style fixes: single mutex_unlock
- use static nodes; vfio_platform_register_reset now is a macro
- vfio_platform_reset_private.h removed since reset_module_(un)register
  disappear. No use of symbol_get anymore.
- new patch introducing vfio-platform-base
- reset look-up moved back at vfio-platform probe time
- new patch featuring dev_info/dev_warn

v1 -> v2:
* in vfio_platform_common.c:
  - move reset lookup at load time and put reset at release: this is to
prevent a race between the 2 load module loads
  - reset_list becomes static
  - vfio_platform_register/unregister_reset take a const char * as compat
  - fix node link
  - remove old combo struct and cleanup proto of vfio_platform_get_reset
  - add mutex to protect the reset list
* in calxeda xgmac reset module
  - introduce vfio_platform_reset_private.h
  - use module_vfio_reset_handler macro
  - do not export vfio_platform_calxedaxgmac_reset symbol anymore
  - add a pr_info to show the device is reset by vfio reset module


Eric Auger (7):
  vfio: platform: introduce vfio-platform-base module
  vfio: platform: add capability to register a reset function
  vfio: platform: introduce module_vfio_reset_handler macro
  vfio: platform: reset: calxedaxgmac: add reset function registration
  vfio: platform: add compat in vfio_platform_device
  vfio: platform: use list of registered reset function
  vfio: platform: add dev_info on device reset

 drivers/vfio/platform/Makefile |   6 +-
 .../platform/reset/vfio_platform_calxedaxgmac.c|   5 +-
 drivers/vfio/platform/vfio_amba.c  |   1 +
 drivers/vfio/platform/vfio_platform.c  |   1 +
 drivers/vfio/platform/vfio_platform_common.c   | 119 +++--
 drivers/vfio/platform/vfio_platform_private.h  |  40 ++-
 6 files changed, 130 insertions(+), 42 deletions(-)

-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Eric Auger
Current vfio_pgsize_bitmap code hides the supported IOMMU page
sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
does not support PAGE_SIZE page, the alignment check on map/unmap
is done with larger page sizes, if any. This can fail although
mapping could be done with pages smaller than PAGE_SIZE.

vfio_pgsize_bitmap is modified to expose the IOMMU page sizes,
supported by all domains, even those smaller than PAGE_SIZE. The
alignment check on map is performed against PAGE_SIZE if the minimum
IOMMU size is less than PAGE_SIZE or against the min page size greater
than PAGE_SIZE.

Signed-off-by: Eric Auger 

---

This was tested on AMD Seattle with 64kB page host. ARM MMU 401
currently expose 4kB, 2MB and 1GB page support. With a 64kB page host,
the map/unmap check is done against 2MB. Some alignment check fail
so VFIO_IOMMU_MAP_DMA fail while we could map using 4kB IOMMU page
size.
---
 drivers/vfio/vfio_iommu_type1.c | 25 +++--
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 57d8c37..13fb974 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
struct vfio_dma *dma)
 static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 {
struct vfio_domain *domain;
-   unsigned long bitmap = PAGE_MASK;
+   unsigned long bitmap = ULONG_MAX;
 
mutex_lock(>lock);
list_for_each_entry(domain, >domain_list, next)
@@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu 
*iommu)
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 struct vfio_iommu_type1_dma_unmap *unmap)
 {
-   uint64_t mask;
struct vfio_dma *dma;
size_t unmapped = 0;
int ret = 0;
+   unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
+   unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
+   PAGE_SIZE : min_pagesz;
 
-   mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
-
-   if (unmap->iova & mask)
+   if (!IS_ALIGNED(unmap->iova, requested_alignment))
return -EINVAL;
-   if (!unmap->size || unmap->size & mask)
+   if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
return -EINVAL;
 
-   WARN_ON(mask & PAGE_MASK);
-
mutex_lock(>lock);
 
/*
@@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
size_t size = map->size;
long npage;
int ret = 0, prot = 0;
-   uint64_t mask;
struct vfio_dma *dma;
unsigned long pfn;
+   unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
+   unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
+   PAGE_SIZE : min_pagesz;
 
/* Verify that none of our __u64 fields overflow */
if (map->size != size || map->vaddr != vaddr || map->iova != iova)
return -EINVAL;
 
-   mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
-
-   WARN_ON(mask & PAGE_MASK);
-
/* READ/WRITE from device perspective */
if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
prot |= IOMMU_WRITE;
if (map->flags & VFIO_DMA_MAP_FLAG_READ)
prot |= IOMMU_READ;
 
-   if (!prot || !size || (size | iova | vaddr) & mask)
+   if (!prot || !size ||
+   !IS_ALIGNED(size | iova | vaddr, requested_alignment))
return -EINVAL;
 
/* Don't allow IOVA or virtual address wrap */
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread Shamir Rabinovitch
On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.
> 
> Not entirely true. We have per-device dma_ops on a most architectures
> already, and we were just talking about the need to add them to
> POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> map virtio devices too.

SPARC has it's implementation under arch/sparc for dma_ops (sun4v_dma_ops).

Some drivers use IOMMU under SPARC for example ixgbe (Intel 10G ETH).
Some, like IB, suffer from IOMMU MAP setup/tear-down & limited address range.
On SPARC IOMMU bypass is not total bypass of the IOMMU but rather much simple
translation that does not require any complex translations tables.

> 
> As we look at that (and make the per-device dma_ops a generic thing
> rather than per-arch), we should probably look at folding your case in
> too.

Whether to use IOMMU or not for DMA is up to the driver. The above example
show real situation where one driver can use IOMMU and the other can't. It
seems that the device cannot know when to use IOMMU bypass and when not.
Even in the driver we can decide to DMA map some buffers using IOMMU
translation and some as IOMMU bypass.

Do you agree that we need this attribute in the generic DMA API? 

> 
> -- 
> dwmw2
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v5 0/7] KVM: arm64: Implement API for vGICv3 live migration

2015-10-28 Thread Pavel Fedin
 Hello!

> > v4 => v5:
> > - Adapted to new API by Peter Maydell, Marc Zyngier and Christoffer Dall.
> >   Acked-by's on the documentation were dropped, just in case, because i
> >   slightly adjusted it. Additionally, i merged all doc updates into one
> >   patch.
> 
> Could you tell us what you changed in the doc patch from the version
> that got sent out with the acks, please?

 Sorry, completely forgot to answer this one...

 The major differences / things to review are:

1. Both GICv2 and GICv3 use the same value of KVM_DEV_ARM_VGIC_CPUID_MASK, 
which is extended to 32 bits. So GICv2 attribute layout also looks like:
--- cut ---
bits: | 63     32  |  31   0 |
values:   |  vcpu_index|  offset |
--- cut ---

2. KVM_DEV_ARM_VGIC_CPU_SYSREGS documentation originally says (error code 
description):
--- cut ---
-EBUSY: One or more VCPUs are running
--- cut ---
 While my version says "VCPU is running". Since this is CPU interface, it does 
not affect other CPUs, so for simplicity i check only current vCPU in my code.

 That's all. Just i'm maybe too careful about fundamentals...

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 5/7] vfio: platform: add compat in vfio_platform_device

2015-10-28 Thread Eric Auger
Let's retrieve the compatibility string on probe and store it
in the vfio_platform_device struct

Signed-off-by: Eric Auger 

---

v2 -> v3:
- populate compat after vdev check
---
 drivers/vfio/platform/vfio_platform_common.c  | 15 ---
 drivers/vfio/platform/vfio_platform_private.h |  1 +
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 3b7e52c..f2d41a0 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -41,16 +41,11 @@ static const struct vfio_platform_reset_combo 
reset_lookup_table[] = {
 static void vfio_platform_get_reset(struct vfio_platform_device *vdev,
struct device *dev)
 {
-   const char *compat;
int (*reset)(struct vfio_platform_device *);
-   int ret, i;
-
-   ret = device_property_read_string(dev, "compatible", );
-   if (ret)
-   return;
+   int i;
 
for (i = 0 ; i < ARRAY_SIZE(reset_lookup_table); i++) {
-   if (!strcmp(reset_lookup_table[i].compat, compat)) {
+   if (!strcmp(reset_lookup_table[i].compat, vdev->compat)) {
request_module(reset_lookup_table[i].module_name);
reset = __symbol_get(
reset_lookup_table[i].reset_function_name);
@@ -544,6 +539,12 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
if (!vdev)
return -EINVAL;
 
+   ret = device_property_read_string(dev, "compatible", >compat);
+   if (ret) {
+   pr_err("VFIO: cannot retrieve compat for %s\n", vdev->name);
+   return -EINVAL;
+   }
+
group = iommu_group_get(dev);
if (!group) {
pr_err("VFIO: No IOMMU group for device %s\n", vdev->name);
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index fd262be..415310f 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -57,6 +57,7 @@ struct vfio_platform_device {
int refcnt;
struct mutexigate;
struct module   *parent_module;
+   const char  *compat;
 
/*
 * These fields should be filled by the bus specific binder
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 7/7] vfio: platform: add dev_info on device reset

2015-10-28 Thread Eric Auger
It might be helpful for the end-user to check the device reset
function was found by the vfio platform reset framework.

Lets store a pointer to the struct device in vfio_platform_device
and trace when the reset function is called or not found.

Signed-off-by: Eric Auger 

---

v3: creation
---
 drivers/vfio/platform/vfio_platform_common.c  | 14 --
 drivers/vfio/platform/vfio_platform_private.h |  1 +
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index f74836a..376d289 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -144,8 +144,12 @@ static void vfio_platform_release(void *device_data)
mutex_lock(_lock);
 
if (!(--vdev->refcnt)) {
-   if (vdev->reset)
+   if (vdev->reset) {
+   dev_info(vdev->device, "reset\n");
vdev->reset(vdev);
+   } else {
+   dev_warn(vdev->device, "no reset function found!\n");
+   }
vfio_platform_regions_cleanup(vdev);
vfio_platform_irq_cleanup(vdev);
}
@@ -174,8 +178,12 @@ static int vfio_platform_open(void *device_data)
if (ret)
goto err_irq;
 
-   if (vdev->reset)
+   if (vdev->reset) {
+   dev_info(vdev->device, "reset\n");
vdev->reset(vdev);
+   } else {
+   dev_warn(vdev->device, "no reset function found!\n");
+   }
}
 
vdev->refcnt++;
@@ -551,6 +559,8 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
return -EINVAL;
}
 
+   vdev->device = dev;
+
group = iommu_group_get(dev);
if (!group) {
pr_err("VFIO: No IOMMU group for device %s\n", vdev->name);
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index d1b0668..42816dd 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -59,6 +59,7 @@ struct vfio_platform_device {
struct module   *parent_module;
const char  *compat;
struct module   *reset_module;
+   struct device   *device;
 
/*
 * These fields should be filled by the bus specific binder
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 6/7] vfio: platform: use list of registered reset function

2015-10-28 Thread Eric Auger
Remove the static lookup table and use the dynamic list of registered
reset functions instead. Also load the reset module through its alias.
The reset struct module pointer is stored in vfio_platform_device.

We also remove the useless struct device pointer parameter in
vfio_platform_get_reset.

This patch fixes the issue related to the usage of __symbol_get, which
besides from being moot, prevented compilation with CONFIG_MODULES
disabled.

Also usage of MODULE_ALIAS makes possible to add a new reset module
without needing to update the framework. This was suggested by Arnd.

Signed-off-by: Eric Auger 
Reported-by: Arnd Bergmann 
Reviewed-by: Arnd Bergmann 

---

v3 -> v4:
- add Arnd R-b.
- Remove the EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset) here

v2 -> v3:
- remove clear of vfio_platform_device reset_module and reset
  in vfio_platform_put_reset
- single unlock in vfio_platform_lookup_reset
- use driver_lock instead of reset_lock

v1 -> v2:
- use reset_lock in vfio_platform_lookup_reset
- remove vfio_platform_reset_combo declaration
- remove struct device *dev parameter in vfio_platform_get_reset
- set reset_module and reset to NULL in put function
---
 .../platform/reset/vfio_platform_calxedaxgmac.c|  1 -
 drivers/vfio/platform/vfio_platform_common.c   | 52 --
 drivers/vfio/platform/vfio_platform_private.h  |  7 +--
 3 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c 
b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
index 80718f2..640f5d8 100644
--- a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
+++ b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
@@ -76,7 +76,6 @@ int vfio_platform_calxedaxgmac_reset(struct 
vfio_platform_device *vdev)
 
return 0;
 }
-EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset);
 
 module_vfio_reset_handler("calxeda,hb-xgmac", 
vfio_platform_calxedaxgmac_reset);
 
diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index f2d41a0..f74836a 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -30,37 +30,43 @@
 static LIST_HEAD(reset_list);
 static DEFINE_MUTEX(driver_lock);
 
-static const struct vfio_platform_reset_combo reset_lookup_table[] = {
-   {
-   .compat = "calxeda,hb-xgmac",
-   .reset_function_name = "vfio_platform_calxedaxgmac_reset",
-   .module_name = "vfio-platform-calxedaxgmac",
-   },
-};
-
-static void vfio_platform_get_reset(struct vfio_platform_device *vdev,
-   struct device *dev)
+static vfio_platform_reset_fn_t vfio_platform_lookup_reset(const char *compat,
+   struct module **module)
 {
-   int (*reset)(struct vfio_platform_device *);
-   int i;
+   struct vfio_platform_reset_node *iter;
+   vfio_platform_reset_fn_t reset_fn = NULL;
 
-   for (i = 0 ; i < ARRAY_SIZE(reset_lookup_table); i++) {
-   if (!strcmp(reset_lookup_table[i].compat, vdev->compat)) {
-   request_module(reset_lookup_table[i].module_name);
-   reset = __symbol_get(
-   reset_lookup_table[i].reset_function_name);
-   if (reset) {
-   vdev->reset = reset;
-   return;
-   }
+   mutex_lock(_lock);
+   list_for_each_entry(iter, _list, link) {
+   if (!strcmp(iter->compat, compat) &&
+   try_module_get(iter->owner)) {
+   *module = iter->owner;
+   reset_fn = iter->reset;
+   break;
}
}
+   mutex_unlock(_lock);
+   return reset_fn;
+}
+
+static void vfio_platform_get_reset(struct vfio_platform_device *vdev)
+{
+   char modname[256];
+
+   vdev->reset = vfio_platform_lookup_reset(vdev->compat,
+   >reset_module);
+   if (!vdev->reset) {
+   snprintf(modname, 256, "vfio-reset:%s", vdev->compat);
+   request_module(modname);
+   vdev->reset = vfio_platform_lookup_reset(vdev->compat,
+>reset_module);
+   }
 }
 
 static void vfio_platform_put_reset(struct vfio_platform_device *vdev)
 {
if (vdev->reset)
-   symbol_put_addr(vdev->reset);
+   module_put(vdev->reset_module);
 }
 
 static int vfio_platform_regions_init(struct vfio_platform_device *vdev)
@@ -557,7 +563,7 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
return ret;
}
 
-   vfio_platform_get_reset(vdev, dev);
+   vfio_platform_get_reset(vdev);
 

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 13:23 +0200, Michael S. Tsirkin wrote:
> On Wed, Oct 28, 2015 at 05:36:53PM +0900, Benjamin Herrenschmidt
> wrote:
> > On Wed, 2015-10-28 at 16:40 +0900, Christian Borntraeger wrote:
> > > We have discussed that at kernel summit. I will try to implement
> > > a dummy dma_ops for
> > > s390 that does 1:1 mapping and Ben will look into doing some
> > > quirk to handle "old"
> > > code in addition to also make it possible to mark devices as
> > > iommu bypass (IIRC,
> > > via device tree, Ben?)
> > 
> > Something like that yes. I'll look into it when I'm back home.
> > 
> > Cheers,
> > Ben.
> 
> OK so I guess that means we should prefer a transport-specific
> interface in virtio-pci then.

Why?

-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


[PATCH v5 13/33] pc-dimm: make pc_existing_dimms_capacity static and rename it

2015-10-28 Thread Xiao Guangrong
pc_existing_dimms_capacity() can be static since it is not used out of
pc-dimm.c and drop the pc_ prefix to prepare the work which abstracts
dimm device type from pc-dimm

Signed-off-by: Xiao Guangrong 
---
 hw/mem/pc-dimm.c | 73 
 include/hw/mem/pc-dimm.h |  1 -
 2 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 2bae994..425f627 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -32,6 +32,38 @@ typedef struct pc_dimms_capacity {
  Error**errp;
 } pc_dimms_capacity;
 
+static int existing_dimms_capacity_internal(Object *obj, void *opaque)
+{
+pc_dimms_capacity *cap = opaque;
+uint64_t *size = >size;
+
+if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
+DeviceState *dev = DEVICE(obj);
+
+if (dev->realized) {
+(*size) += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
+cap->errp);
+}
+
+if (cap->errp && *cap->errp) {
+return 1;
+}
+}
+object_child_foreach(obj, existing_dimms_capacity_internal, opaque);
+return 0;
+}
+
+static uint64_t existing_dimms_capacity(Error **errp)
+{
+pc_dimms_capacity cap;
+
+cap.size = 0;
+cap.errp = errp;
+
+existing_dimms_capacity_internal(qdev_get_machine(), );
+return cap.size;
+}
+
 void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
  MemoryRegion *mr, uint64_t align, bool gap,
  Error **errp)
@@ -40,7 +72,7 @@ void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState 
*hpms,
 MachineState *machine = MACHINE(qdev_get_machine());
 PCDIMMDevice *dimm = PC_DIMM(dev);
 Error *local_err = NULL;
-uint64_t existing_dimms_capacity = 0;
+uint64_t dimms_capacity = 0;
 uint64_t addr;
 
 addr = object_property_get_int(OBJECT(dimm), PC_DIMM_ADDR_PROP, 
_err);
@@ -56,17 +88,16 @@ void pc_dimm_memory_plug(DeviceState *dev, 
MemoryHotplugState *hpms,
 goto out;
 }
 
-existing_dimms_capacity = pc_existing_dimms_capacity(_err);
+dimms_capacity = existing_dimms_capacity(_err);
 if (local_err) {
 goto out;
 }
 
-if (existing_dimms_capacity + memory_region_size(mr) >
+if (dimms_capacity + memory_region_size(mr) >
 machine->maxram_size - machine->ram_size) {
 error_setg(_err, "not enough space, currently 0x%" PRIx64
" in use of total hot pluggable 0x" RAM_ADDR_FMT,
-   existing_dimms_capacity,
-   machine->maxram_size - machine->ram_size);
+   dimms_capacity, machine->maxram_size - machine->ram_size);
 goto out;
 }
 
@@ -121,38 +152,6 @@ void pc_dimm_memory_unplug(DeviceState *dev, 
MemoryHotplugState *hpms,
 vmstate_unregister_ram(mr, dev);
 }
 
-static int pc_existing_dimms_capacity_internal(Object *obj, void *opaque)
-{
-pc_dimms_capacity *cap = opaque;
-uint64_t *size = >size;
-
-if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
-DeviceState *dev = DEVICE(obj);
-
-if (dev->realized) {
-(*size) += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
-cap->errp);
-}
-
-if (cap->errp && *cap->errp) {
-return 1;
-}
-}
-object_child_foreach(obj, pc_existing_dimms_capacity_internal, opaque);
-return 0;
-}
-
-uint64_t pc_existing_dimms_capacity(Error **errp)
-{
-pc_dimms_capacity cap;
-
-cap.size = 0;
-cap.errp = errp;
-
-pc_existing_dimms_capacity_internal(qdev_get_machine(), );
-return cap.size;
-}
-
 int qmp_pc_dimm_device_list(Object *obj, void *opaque)
 {
 MemoryDeviceInfoList ***prev = opaque;
diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index 15590f1..c1e5774 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -87,7 +87,6 @@ uint64_t pc_dimm_get_free_addr(uint64_t address_space_start,
 int pc_dimm_get_free_slot(const int *hint, int max_slots, Error **errp);
 
 int qmp_pc_dimm_device_list(Object *obj, void *opaque);
-uint64_t pc_existing_dimms_capacity(Error **errp);
 void pc_dimm_memory_plug(DeviceState *dev, MemoryHotplugState *hpms,
  MemoryRegion *mr, uint64_t align, bool gap,
  Error **errp);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 11/33] hostmem-file: use whole file size if possible

2015-10-28 Thread Xiao Guangrong
Use the whole file size if @size is not specified which is useful
if we want to directly pass a file to guest

Signed-off-by: Xiao Guangrong 
---
 backends/hostmem-file.c | 48 
 1 file changed, 44 insertions(+), 4 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 9097a57..e1bc9ff 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -9,6 +9,9 @@
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
  */
+#include 
+#include 
+
 #include "qemu-common.h"
 #include "sysemu/hostmem.h"
 #include "sysemu/sysemu.h"
@@ -33,20 +36,57 @@ struct HostMemoryBackendFile {
 char *mem_path;
 };
 
+static uint64_t get_file_size(const char *file)
+{
+struct stat stat_buf;
+uint64_t size = 0;
+int fd;
+
+fd = open(file, O_RDONLY);
+if (fd < 0) {
+return 0;
+}
+
+if (stat(file, _buf) < 0) {
+goto exit;
+}
+
+if ((S_ISBLK(stat_buf.st_mode)) && !ioctl(fd, BLKGETSIZE64, )) {
+goto exit;
+}
+
+size = lseek(fd, 0, SEEK_END);
+if (size == -1) {
+size = 0;
+}
+exit:
+close(fd);
+return size;
+}
+
 static void
 file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
 {
 HostMemoryBackendFile *fb = MEMORY_BACKEND_FILE(backend);
 
-if (!backend->size) {
-error_setg(errp, "can't create backend with size 0");
-return;
-}
 if (!fb->mem_path) {
 error_setg(errp, "mem-path property not set");
 return;
 }
 
+if (!backend->size) {
+/*
+ * use the whole file size if @size is not specified.
+ */
+backend->size = get_file_size(fb->mem_path);
+}
+
+if (!backend->size) {
+error_setg(errp, "failed to get file size for %s, can't create "
+ "backend on it", mem_path);
+return;
+}
+
 backend->force_prealloc = mem_prealloc;
 memory_region_init_ram_from_file(>mr, OBJECT(backend),
  object_get_canonical_path(OBJECT(backend)),
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 30/33] nvdimm acpi: support Set Namespace Label Data function

2015-10-28 Thread Xiao Guangrong
Function 6 is used to set Namespace Label Data

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 5b621ed..5e72ca8 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -572,6 +572,44 @@ exit:
 nvdimm_dsm_write_status(out, status);
 }
 
+/*
+ * DSM Spec Rev1 4.6 Set Namespace Label Data (Function Index 6).
+ */
+static void nvdimm_dsm_func_set_label_data(NVDIMMDevice *nvdimm,
+   nvdimm_dsm_in *in, GArray *out)
+{
+NVDIMMClass *nvc = NVDIMM_GET_CLASS(nvdimm);
+nvdimm_func_in_set_label_data *set_label_data = >func_set_label_data;
+uint32_t status;
+
+le32_to_cpus(_label_data->offset);
+le32_to_cpus(_label_data->length);
+
+nvdimm_debug("Write Label Data: offset %#x length %#x.\n",
+ set_label_data->offset, set_label_data->length);
+
+if (nvdimm->label_size < set_label_data->offset + set_label_data->length) {
+nvdimm_debug("position %#x is beyond label data (len = %#lx).\n",
+ set_label_data->offset + set_label_data->length,
+ nvdimm->label_size);
+status = NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+goto exit;
+}
+
+if (set_label_data->length > nvdimm_get_max_xfer_label_size()) {
+nvdimm_debug("set length (%#x) is larger than max_xfer (%#x).\n",
+ set_label_data->length, nvdimm_get_max_xfer_label_size());
+status = NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+goto exit;
+}
+
+status = NVDIMM_DSM_STATUS_SUCCESS;
+nvc->write_label_data(nvdimm, set_label_data->in_buf,
+  set_label_data->length, set_label_data->offset);
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
 static void nvdimm_dsm_device(nvdimm_dsm_in *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -602,6 +640,9 @@ static void nvdimm_dsm_device(nvdimm_dsm_in *in, GArray 
*out)
 case 0x5 /* Get Namespace Label Data */:
 nvdimm_dsm_func_get_label_data(nvdimm, in, out);
 goto free;
+case 0x6 /* Set Namespace Label Data */:
+nvdimm_dsm_func_set_label_data(nvdimm, in, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 04/33] acpi: add aml_concatenate

2015-10-28 Thread Xiao Guangrong
Implement Concatenate term which is used by NVDIMM _DSM method
in later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 14 ++
 include/hw/acpi/aml-build.h |  1 +
 2 files changed, 15 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 9fe5e7b..efc06ab 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1164,6 +1164,20 @@ Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, 
const char *name)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefConcat */
+Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target)
+{
+Aml *var = aml_opcode(0x73 /* ConcatOp */);
+aml_append(var, source1);
+aml_append(var, source2);
+
+if (target) {
+aml_append(var, target);
+}
+
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 7e1c43b..325782d 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -277,6 +277,7 @@ Aml *aml_unicode(const char *str);
 Aml *aml_derefof(Aml *arg);
 Aml *aml_sizeof(Aml *arg);
 Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
+Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 01/33] acpi: add aml_derefof

2015-10-28 Thread Xiao Guangrong
Implement DeRefOf term which is used by NVDIMM _DSM method in later patch

Reviewed-by: Igor Mammedov 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 8 
 include/hw/acpi/aml-build.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 0d4b324..cbd53f4 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1135,6 +1135,14 @@ Aml *aml_unicode(const char *str)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefDerefOf */
+Aml *aml_derefof(Aml *arg)
+{
+Aml *var = aml_opcode(0x83 /* DerefOfOp */);
+aml_append(var, arg);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 1b632dc..5a03d33 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -274,6 +274,7 @@ Aml *aml_create_dword_field(Aml *srcbuf, Aml *index, const 
char *name);
 Aml *aml_varpackage(uint32_t num_elements);
 Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
+Aml *aml_derefof(Aml *arg);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 00/33] implement vNVDIMM

2015-10-28 Thread Xiao Guangrong
This patchset can be found at:
  https://github.com/xiaogr/qemu.git nvdimm-v5

It is based on pci branch on Michael's tree and the top commit is:
commit 04040096 (tests: re-enable vhost-user-test).

Changelog in v5:
- changes from Michael's comments:
  1) prefix nvdimm_ to everything in NVDIMM source files
  2) make parsing _DSM Arg3 more clear
  3) comment style fix
  5) drop single used definition
  6) fix dirty dsm buffer lost due to memory write happened on host
  7) check dsm buffer if it is big enough to contain input data
  8) use build_append_int_noprefix to store single value to GArray

- changes from Michael's and Igor's comments:
  1) introduce 'nvdimm-support' parameter to control nvdimm
 enablement and it is disabled for 2.4 and its earlier versions
 to make live migration compatible
  2) only reserve 1 RAM page and 4 bytes IO Port for NVDIMM ACPI
 virtualization

- changes from Stefan's comments:
  1) do endian adjustment for the buffer length

- changes from Bharata B Rao's comments:
  1) fix compile on ppc

- others:
  1) the buffer length is directly got from IO read rather than got
 from dsm memory
  2) fix dirty label data lost due to memory write happened on host

Changelog in v4:
- changes from Michael's comments:
  1) show the message, "Memory is not allocated from HugeTlbfs", if file
 based memory is not allocated from hugetlbfs.
  2) introduce function, acpi_get_nvdimm_state(), to get NVDIMMState
 from Machine.
  3) statically define UUID and make its operation more clear
  4) use GArray to build device structures to avoid potential buffer
 overflow
  4) improve comments in the code
  5) improve code style

- changes from Igor's comments:
  1) add NVDIMM ACPI spec document
  2) use serialized method to avoid Mutex
  3) move NVDIMM ACPI's code to hw/acpi/nvdimm.c
  4) introduce a common ASL method used by _DSM for all devices to reduce
 ACPI size
  5) handle UUID in ACPI AML code. BTW, i'd keep handling revision in QEMU
 it's better to upgrade QEMU to support Rev2 in the future

- changes from Stefan's comments:
  1) copy input data from DSM memory to local buffer to avoid potential
 issues as DSM memory is visible to guest. Output data is handled
 in a similar way

- changes from Dan's comments:
  1) drop static namespace as Linux has already supported label-less
 nvdimm devices

- changes from Vladimir's comments:
  1) print better message, "failed to get file size for %s, can't create
 backend on it", if any file operation filed to obtain file size

- others:
  create a git repo on github.com for better review/test

Also, thanks for Eric Blake's review on QAPI's side.

Thank all of you to review this patchset.

Changelog in v3:
There is huge change in this version, thank Igor, Stefan, Paolo, Eduardo,
Michael for their valuable comments, the patchset finally gets better shape.
- changes from Igor's comments:
  1) abstract dimm device type from pc-dimm and create nvdimm device based on
 dimm, then it uses memory backend device as nvdimm's memory and NUMA has
 easily been implemented.
  2) let file-backend device support any kind of filesystem not only for
 hugetlbfs and let it work on file not only for directory which is
 achieved by extending 'mem-path' - if it's a directory then it works as
 current behavior, otherwise if it's file then directly allocates memory
 from it.
  3) we figure out a unused memory hole below 4G that is 0xFF0 ~ 
 0xFFF0, this range is large enough for NVDIMM ACPI as build 64-bit
 ACPI SSDT/DSDT table will break windows XP.
 BTW, only make SSDT.rev = 2 can not work since the width is only depended
 on DSDT.rev based on 19.6.28 DefinitionBlock (Declare Definition Block)
 in ACPI spec:
| Note: For compatibility with ACPI versions before ACPI 2.0, the bit 
| width of Integer objects is dependent on the ComplianceRevision of the DSDT.
| If the ComplianceRevision is less than 2, all integers are restricted to 32 
| bits. Otherwise, full 64-bit integers are used. The version of the DSDT sets 
| the global integer width for all integers, including integers in SSDTs.
  4) use the lowest ACPI spec version to document AML terms.
  5) use "nvdimm" as nvdimm device name instead of "pc-nvdimm"

- changes from Stefan's comments:
  1) do not do endian adjustment in-place since _DSM memory is visible to guest
  2) use target platform's target page size instead of fixed PAGE_SIZE
 definition
  3) lots of code style improvement and typo fixes.
  4) live migration fix
- changes from Paolo's comments:
  1) improve the name of memory region
  
- other changes:
  1) return exact buffer size for _DSM method instead of the page size.
  2) introduce mutex in NVDIMM ACPI as the _DSM memory is shared by all nvdimm
 devices.
  3) NUMA support
  4) implement _FIT method
  5) rename "configdata" to "reserve-label-data"
  6) simplify _DSM arg3 determination
  7) main 

[PATCH v5 27/33] nvdimm acpi: support function 0

2015-10-28 Thread Xiao Guangrong
__DSM is defined in ACPI 6.0: 9.14.1 _DSM (Device Specific Method)

Function 0 is a query function. We do not support any function on root
device and only 3 functions are support for NVDIMM device, Get Namespace
Label Size, Get Namespace Label Data and Set Namespace Label Data, that
means we currently only allow to access device's Label Namespace

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c |   2 +-
 hw/acpi/nvdimm.c| 156 +++-
 include/hw/acpi/aml-build.h |   1 +
 3 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 8bee8b2..90229c5 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -231,7 +231,7 @@ static void build_extop_package(GArray *package, uint8_t op)
 build_prepend_byte(package, 0x5B); /* ExtOpPrefix */
 }
 
-static void build_append_int_noprefix(GArray *table, uint64_t value, int size)
+void build_append_int_noprefix(GArray *table, uint64_t value, int size)
 {
 int i;
 
diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 69de4f6..8efa640 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -212,6 +212,22 @@ static uint32_t nvdimm_slot_to_dcr_index(int slot)
 return nvdimm_slot_to_spa_index(slot) + 1;
 }
 
+static NVDIMMDevice
+*nvdimm_get_device_by_handle(GSList *list, uint32_t handle)
+{
+for (; list; list = list->next) {
+NVDIMMDevice *nvdimm = list->data;
+int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
+   NULL);
+
+if (nvdimm_slot_to_handle(slot) == handle) {
+return nvdimm;
+}
+}
+
+return NULL;
+}
+
 /* ACPI 6.0: 5.2.25.1 System Physical Address Range Structure */
 static void
 nvdimm_build_structure_spa(GArray *structures, NVDIMMDevice *nvdimm)
@@ -368,6 +384,29 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+/* define NVDIMM DSM return status codes according to DSM Spec Rev1. */
+enum {
+/* Common return status codes. */
+/* Success */
+NVDIMM_DSM_STATUS_SUCCESS = 0,
+/* Not Supported */
+NVDIMM_DSM_STATUS_NOT_SUPPORTED = 1,
+
+/* NVDIMM Root Device _DSM function return status codes*/
+/* Invalid Input Parameters */
+NVDIMM_DSM_ROOT_DEV_STATUS_INVALID_PARAS = 2,
+/* Function-Specific Error */
+NVDIMM_DSM_ROOT_DEV_STATUS_FUNCTION_SPECIFIC_ERROR = 3,
+
+/* NVDIMM Device (non-root) _DSM function return status codes*/
+/* Non-Existing Memory Device */
+NVDIMM_DSM_DEV_STATUS_NON_EXISTING_MEM_DEV = 2,
+/* Invalid Input Parameters */
+NVDIMM_DSM_DEV_STATUS_INVALID_PARAS = 3,
+/* Vendor Specific Error */
+NVDIMM_DSM_DEV_STATUS_VENDOR_SPECIFIC_ERROR = 4,
+};
+
 struct nvdimm_dsm_in {
 uint32_t handle;
 uint32_t revision;
@@ -377,10 +416,125 @@ struct nvdimm_dsm_in {
 } QEMU_PACKED;
 typedef struct nvdimm_dsm_in nvdimm_dsm_in;
 
+static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
+{
+status = cpu_to_le32(status);
+build_append_int_noprefix(out, status, sizeof(status));
+}
+
+static void nvdimm_dsm_root(nvdimm_dsm_in *in, GArray *out)
+{
+uint32_t status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+
+/*
+ * Query command implemented per ACPI Specification, it is defined in
+ * ACPI 6.0: 9.14.1 _DSM (Device Specific Method).
+ */
+if (in->function == 0x0) {
+/*
+ * Set it to zero to indicate no function is supported for NVDIMM
+ * root.
+ */
+uint64_t cmd_list = cpu_to_le64(0);
+
+build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
+return;
+}
+
+nvdimm_debug("Return status %#x.\n", status);
+nvdimm_dsm_write_status(out, status);
+}
+
+static void nvdimm_dsm_device(nvdimm_dsm_in *in, GArray *out)
+{
+GSList *list = nvdimm_get_plugged_device_list();
+NVDIMMDevice *nvdimm = nvdimm_get_device_by_handle(list, in->handle);
+uint32_t status = NVDIMM_DSM_DEV_STATUS_NON_EXISTING_MEM_DEV;
+uint64_t cmd_list;
+
+if (!nvdimm) {
+goto set_status_free;
+}
+
+/* Encode DSM function according to DSM Spec Rev1. */
+switch (in->function) {
+/* see comments in nvdimm_dsm_root(). */
+case 0x0:
+cmd_list = cpu_to_le64(0x1 /* Bit 0 indicates whether there is
+  support for any functions other
+  than function 0.
+*/   |
+   1 << 4 /* Get Namespace Label Size */ |
+   1 << 5 /* Get Namespace Label Data */ |
+   1 << 6 /* Set Namespace Label Data */);
+build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
+goto free;
+default:
+status = 

[PATCH v5 05/33] acpi: add aml_object_type

2015-10-28 Thread Xiao Guangrong
Implement ObjectType which is used by NVDIMM _DSM method in
later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 8 
 include/hw/acpi/aml-build.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index efc06ab..9f792ab 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1178,6 +1178,14 @@ Aml *aml_concatenate(Aml *source1, Aml *source2, Aml 
*target)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefObjectType */
+Aml *aml_object_type(Aml *object)
+{
+Aml *var = aml_opcode(0x8E /* ObjectTypeOp */);
+aml_append(var, object);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 325782d..5b8a118 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -278,6 +278,7 @@ Aml *aml_derefof(Aml *arg);
 Aml *aml_sizeof(Aml *arg);
 Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
 Aml *aml_concatenate(Aml *source1, Aml *source2, Aml *target);
+Aml *aml_object_type(Aml *object);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 20/33] dimm: introduce realize callback

2015-10-28 Thread Xiao Guangrong
nvdimm need check if the backend memory is large enough to contain label
data and init its memory region when the device is realized, so introduce
realize callback which is called after common dimm has been realize

Signed-off-by: Xiao Guangrong 
---
 hw/mem/dimm.c | 5 +
 include/hw/mem/dimm.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 478cacd..3d06cb9 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -429,6 +429,7 @@ static void dimm_init(Object *obj)
 static void dimm_realize(DeviceState *dev, Error **errp)
 {
 DIMMDevice *dimm = DIMM(dev);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(dimm);
 
 if (!dimm->hostmem) {
 error_setg(errp, "'" DIMM_MEMDEV_PROP "' property is not set");
@@ -441,6 +442,10 @@ static void dimm_realize(DeviceState *dev, Error **errp)
dimm->node, nb_numa_nodes ? nb_numa_nodes : 1);
 return;
 }
+
+if (ddc->realize) {
+ddc->realize(dimm, errp);
+}
 }
 
 static void dimm_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/mem/dimm.h b/include/hw/mem/dimm.h
index 84a62ed..663288d 100644
--- a/include/hw/mem/dimm.h
+++ b/include/hw/mem/dimm.h
@@ -65,6 +65,7 @@ typedef struct DIMMDeviceClass {
 DeviceClass parent_class;
 
 /* public */
+void (*realize)(DIMMDevice *dimm, Error **errp);
 MemoryRegion *(*get_memory_region)(DIMMDevice *dimm);
 } DIMMDeviceClass;
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 06/33] acpi: add aml_method_serialized

2015-10-28 Thread Xiao Guangrong
It avoid explicit Mutex and will be used by NVDIMM ACPI

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 26 --
 include/hw/acpi/aml-build.h |  1 +
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 9f792ab..8bee8b2 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -696,14 +696,36 @@ Aml *aml_while(Aml *predicate)
 }
 
 /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefMethod */
-Aml *aml_method(const char *name, int arg_count)
+static Aml *__aml_method(const char *name, int arg_count, bool serialized)
 {
 Aml *var = aml_bundle(0x14 /* MethodOp */, AML_PACKAGE);
+int methodflags;
+
+/*
+ * MethodFlags:
+ *   bit 0-2: ArgCount (0-7)
+ *   bit 3: SerializeFlag
+ * 0: NotSerialized
+ * 1: Serialized
+ *   bit 4-7: reserved (must be 0)
+ */
+assert(!(arg_count & ~7));
+methodflags = arg_count | (serialized << 3);
 build_append_namestring(var->buf, "%s", name);
-build_append_byte(var->buf, arg_count); /* MethodFlags: ArgCount */
+build_append_byte(var->buf, methodflags);
 return var;
 }
 
+Aml *aml_method(const char *name, int arg_count)
+{
+return __aml_method(name, arg_count, false);
+}
+
+Aml *aml_method_serialized(const char *name, int arg_count)
+{
+return __aml_method(name, arg_count, true);
+}
+
 /* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefDevice */
 Aml *aml_device(const char *name_format, ...)
 {
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 5b8a118..00cf40e 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -263,6 +263,7 @@ Aml *aml_qword_memory(AmlDecode dec, AmlMinFixed min_fixed,
 Aml *aml_scope(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
 Aml *aml_device(const char *name_format, ...) GCC_FMT_ATTR(1, 2);
 Aml *aml_method(const char *name, int arg_count);
+Aml *aml_method_serialized(const char *name, int arg_count);
 Aml *aml_if(Aml *predicate);
 Aml *aml_else(void);
 Aml *aml_while(Aml *predicate);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] x86: context_tracking: avoid irq_save/irq_restore on kernel entry and exit

2015-10-28 Thread Paolo Bonzini


On 28/10/2015 06:22, Andy Lutomirski wrote:
>> > called by guest_enter and guest_exit.
>> >
>> > Use the previously introduced __context_tracking_entry and
>> > __context_tracking_exit.
> x86 isn't ready for this yet.  We could do a quick-and-dirty fix with
> explicit IRQs-on-and-off much protected by the static key, or we could
> just wait until I finish the syscall cleanup.  I favor the latter, but
> you're all welcome to do the former and I'll review it.

Or we could just do save/restore for the only call that doesn't ensure
that interrupts are disabled (syscall_trace_phase1 or whatever it's called).

But two days from the merge window, I also favor waiting until 4.5.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Michael S. Tsirkin
On Wed, Oct 28, 2015 at 10:37:56PM +0900, David Woodhouse wrote:
> On Wed, 2015-10-28 at 13:23 +0200, Michael S. Tsirkin wrote:
> > On Wed, Oct 28, 2015 at 05:36:53PM +0900, Benjamin Herrenschmidt
> > wrote:
> > > On Wed, 2015-10-28 at 16:40 +0900, Christian Borntraeger wrote:
> > > > We have discussed that at kernel summit. I will try to implement
> > > > a dummy dma_ops for
> > > > s390 that does 1:1 mapping and Ben will look into doing some
> > > > quirk to handle "old"
> > > > code in addition to also make it possible to mark devices as
> > > > iommu bypass (IIRC,
> > > > via device tree, Ben?)
> > > 
> > > Something like that yes. I'll look into it when I'm back home.
> > > 
> > > Cheers,
> > > Ben.
> > 
> > OK so I guess that means we should prefer a transport-specific
> > interface in virtio-pci then.
> 
> Why?

Because you said you are doing something device tree specific for ARM,
aren't you?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 07/33] util: introduce qemu_file_get_page_size()

2015-10-28 Thread Xiao Guangrong
There are three places use the some logic to get the page size on
the file path or file fd

This patch introduces qemu_file_get_page_size() to unify the code

Signed-off-by: Xiao Guangrong 
---
 include/qemu/osdep.h |  1 +
 target-ppc/kvm.c | 21 +++--
 util/oslib-posix.c   | 16 
 util/oslib-win32.c   |  5 +
 4 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index ef21efb..9c8c0c4 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -286,4 +286,5 @@ void os_mem_prealloc(int fd, char *area, size_t sz);
 
 int qemu_read_password(char *buf, int buf_size);
 
+size_t qemu_file_get_page_size(const char *mem_path);
 #endif
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 7276299..0b68a7a 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -306,28 +306,13 @@ static void kvm_get_smmu_info(PowerPCCPU *cpu, struct 
kvm_ppc_smmu_info *info)
 
 static long gethugepagesize(const char *mem_path)
 {
-struct statfs fs;
-int ret;
-
-do {
-ret = statfs(mem_path, );
-} while (ret != 0 && errno == EINTR);
+long size = qemu_file_get_page_size(mem_path);
 
-if (ret != 0) {
-fprintf(stderr, "Couldn't statfs() memory path: %s\n",
-strerror(errno));
+if (!size) {
 exit(1);
 }
 
-#define HUGETLBFS_MAGIC   0x958458f6
-
-if (fs.f_type != HUGETLBFS_MAGIC) {
-/* Explicit mempath, but it's ordinary pages */
-return getpagesize();
-}
-
-/* It's hugepage, return the huge page size */
-return fs.f_bsize;
+return size;
 }
 
 static int find_max_supported_pagesize(Object *obj, void *opaque)
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 892d2d8..32b4d1f 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -360,6 +360,22 @@ static size_t fd_getpagesize(int fd)
 return getpagesize();
 }
 
+size_t qemu_file_get_page_size(const char *path)
+{
+size_t size = 0;
+int fd = qemu_open(path, O_RDONLY);
+
+if (fd < 0) {
+fprintf(stderr, "Could not open %s.\n", path);
+goto exit;
+}
+
+size = fd_getpagesize(fd);
+qemu_close(fd);
+exit:
+return size;
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory)
 {
 int ret;
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index 08f5a9c..1ff1fae 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -462,6 +462,11 @@ size_t getpagesize(void)
 return system_info.dwPageSize;
 }
 
+size_t qemu_file_get_page_size(const char *path)
+{
+return getpagesize();
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory)
 {
 int i;
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 07:07 -0700, David Miller wrote:
> In the sparc64 case, the 64-bit DMA address space is divided into
> IOMMU translated and non-IOMMU translated.
> 
> You just set the high bits differently depending upon what you want.

Wait, does that mean a (rogue) device could *always* get full access to
physical memory just by setting the high bits appropriately? That
mapping is *always* available?

> So a device could use both IOMMU translated and bypass accesses at 
> the same time.  While seemingly interesting, I do not recommend we 
> provide this kind of flexibility in our DMA interfaces.

Now I could understand this if the answer to my question above was
'no'. We absolutely want the *security* all the time, and we don't want
the device to be able to do stupid stuff. But if the answer was 'yes'
then we take the map/unmap performance hit for... *what* benefit?

On Intel we have the passthrough as an *option* and I have the same
initial reaction — "Hell no, we want the security". But I concede the
performance motivation for it, and I'm not *dead* set against
permitting it.

If I tolerate a per-device request for passthrough mode, that might
prevent people from disabling the IOMMU or putting it into passthrough
mode *entirely*. So actually, I'm *improving* security...

I think it makes sense to allow performance sensitive device drivers to
*request* a passthrough mode. The platform can reserve the right to
refuse, if either the IOMMU hardware doesn't support that, or we're in
a paranoid mode (with iommu=always or something on the command line).


-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


[PATCH v5 22/33] docs: add NVDIMM ACPI documentation

2015-10-28 Thread Xiao Guangrong
It describes the basic concepts of NVDIMM ACPI and the interface
between QEMU and the ACPI BIOS

Signed-off-by: Xiao Guangrong 
---
 docs/specs/acpi_nvdimm.txt | 179 +
 1 file changed, 179 insertions(+)
 create mode 100644 docs/specs/acpi_nvdimm.txt

diff --git a/docs/specs/acpi_nvdimm.txt b/docs/specs/acpi_nvdimm.txt
new file mode 100644
index 000..cc5db2c
--- /dev/null
+++ b/docs/specs/acpi_nvdimm.txt
@@ -0,0 +1,179 @@
+QEMU<->ACPI BIOS NVDIMM interface
+-
+
+QEMU supports NVDIMM via ACPI. This document describes the basic concepts of
+NVDIMM ACPI and the interface between QEMU and the ACPI BIOS.
+
+NVDIMM ACPI Background
+--
+NVDIMM is introduced in ACPI 6.0 which defines an NVDIMM root device under
+_SB scope with a _HID of “ACPI0012”. For each NVDIMM present or intended
+to be supported by platform, platform firmware also exposes an ACPI
+Namespace Device under the root device.
+
+The NVDIMM child devices under the NVDIMM root device are defined with _ADR
+corresponding to the NFIT device handle. The NVDIMM root device and the
+NVDIMM devices can have device specific methods (_DSM) to provide additional
+functions specific to a particular NVDIMM implementation.
+
+This is an example from ACPI 6.0, a platform contains one NVDIMM:
+
+Scope (\_SB){
+   Device (NVDR) // Root device
+   {
+  Name (_HID, “ACPI0012”)
+  Method (_STA) {...}
+  Method (_FIT) {...}
+  Method (_DSM, ...) {...}
+  Device (NVD)
+  {
+ Name(_ADR, h) //where h is NFIT Device Handle for this NVDIMM
+ Method (_DSM, ...) {...}
+  }
+   }
+}
+
+Methods supported on both NVDIMM root device and NVDIMM device are
+1) _STA(Status)
+   It returns the current status of a device, which can be one of the
+   following: enabled, disabled, or removed.
+
+   Arguments: None
+
+   Return Value:
+   It returns an An Integer which is defined as followings:
+   Bit [0] – Set if the device is present.
+   Bit [1] – Set if the device is enabled and decoding its resources.
+   Bit [2] – Set if the device should be shown in the UI.
+   Bit [3] – Set if the device is functioning properly (cleared if device
+ failed its diagnostics).
+   Bit [4] – Set if the battery is present.
+   Bits [31:5] – Reserved (must be cleared).
+
+2) _DSM (Device Specific Method)
+   It is a control method that enables devices to provide device specific
+   control functions that are consumed by the device driver.
+   The NVDIMM DSM specification can be found at:
+http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+
+   Arguments:
+   Arg0 – A Buffer containing a UUID (16 Bytes)
+   Arg1 – An Integer containing the Revision ID (4 Bytes)
+   Arg2 – An Integer containing the Function Index (4 Bytes)
+   Arg3 – A package containing parameters for the function specified by the
+  UUID, Revision ID, and Function Index
+
+   Return Value:
+   If Function Index = 0, a Buffer containing a function index bitfield.
+   Otherwise, the return value and type depends on the UUID, revision ID
+   and function index which are described in the DSM specification.
+
+Methods on NVDIMM ROOT Device
+_FIT(Firmware Interface Table)
+   It evaluates to a buffer returning data in the format of a series of NFIT
+   Type Structure.
+
+   Arguments: None
+
+   Return Value:
+   A Buffer containing a list of NFIT Type structure entries.
+
+   The detailed definition of the structure can be found at ACPI 6.0: 5.2.25
+   NVDIMM Firmware Interface Table (NFIT).
+
+QEMU NVDIMM Implemention
+
+QEMU reserves a page starting from 0xFF0 and 4 bytes IO Port starting
+from 0x0a18 for NVDIMM ACPI.
+
+Memory 0xFF0 - 0xFF00FFF:
+   This page is RAM-based and it is used to transfer data between _DSM
+   method and QEMU. If ACPI has control, this pages is owned by ACPI which
+   writes _DSM input data to it, otherwise, it is owned by QEMU which
+   emulates _DSM access and writes the output data to it.
+
+   ACPI Writes _DSM Input Data:
+   [0xFF0 - 0xFF3]: 4 bytes, NVDIMM Devcie Handle, 0 is reserved
+for NVDIMM Root device.
+   [0xFF4 - 0xFF7]: 4 bytes, Revision ID, that is the Arg1 of _DSM
+method.
+   [0xFF8 - 0xFFB]: 4 bytes. Function Index, that is the Arg2 of
+_DSM method.
+   [0xFFC - 0xFF00FFF]: 4084 bytes, the Arg3 of _DSM method
+
+   QEMU Writes Output Data:
+   [0xFF0 - 0xFF00FFF]: the DSM return result filled by QEMU
+
+IO Port 0x0a18 - 0xa1b:
+   ACPI uses it to transfer control from guest to QEMU and read the size
+   of return result filled by QEMU
+
+   Read Access:
+   [0x0a18 - 0xa1b]: 4 bytes, the buffer size of _DSM output data.
+
+_DSM process diagram:
+-
+The page, 0xFF0 - 0xFF00FFF, is used by _DSM Virtualization.
+

Re: [PATCH v4 17/33] dimm: abstract dimm device from pc-dimm

2015-10-28 Thread Xiao Guangrong



On 10/24/2015 11:20 AM, Bharata B Rao wrote:


  CONFIG_ACPI_X86_ICH=y
+CONFIG_DIMM=y


Same change needs to be done in default-configs/ppc64-softmmu.mak too.


I have fixed it in v5 which cat be found at:
   http://marc.info/?l=kvm=144604272221080=2

Bharata, thank you very much for your review.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 14/33] pc-dimm: drop the prefix of pc-dimm

2015-10-28 Thread Xiao Guangrong
This patch is generated by this script:

find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
| xargs sed -i "s/PC_DIMM/DIMM/g"

find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
| xargs sed -i "s/PCDIMM/DIMM/g"

find ./ -name "*.[ch]" -o -name "*.json" -o -name "trace-events" -type f \
| xargs sed -i "s/pc_dimm/dimm/g"

find ./ -name "trace-events" -type f | xargs sed -i "s/pc-dimm/dimm/g"

It prepares the work which abstracts dimm device type for both pc-dimm and
nvdimm

Signed-off-by: Xiao Guangrong 
---
 hmp.c   |   2 +-
 hw/acpi/ich9.c  |   6 +-
 hw/acpi/memory_hotplug.c|  16 ++---
 hw/acpi/piix4.c |   6 +-
 hw/i386/pc.c|  32 -
 hw/mem/pc-dimm.c| 148 
 hw/ppc/spapr.c  |  18 ++---
 include/hw/mem/pc-dimm.h|  62 -
 numa.c  |   2 +-
 qapi-schema.json|   8 +--
 qmp.c   |   2 +-
 stubs/qmp_pc_dimm_device_list.c |   2 +-
 trace-events|   8 +--
 13 files changed, 156 insertions(+), 156 deletions(-)

diff --git a/hmp.c b/hmp.c
index 5048eee..5c617d2 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1952,7 +1952,7 @@ void hmp_info_memory_devices(Monitor *mon, const QDict 
*qdict)
 MemoryDeviceInfoList *info_list = qmp_query_memory_devices();
 MemoryDeviceInfoList *info;
 MemoryDeviceInfo *value;
-PCDIMMDeviceInfo *di;
+DIMMDeviceInfo *di;
 
 for (info = info_list; info; info = info->next) {
 value = info->value;
diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
index 1c7fcfa..b0d6a67 100644
--- a/hw/acpi/ich9.c
+++ b/hw/acpi/ich9.c
@@ -440,7 +440,7 @@ void ich9_pm_add_properties(Object *obj, ICH9LPCPMRegs *pm, 
Error **errp)
 void ich9_pm_device_plug_cb(ICH9LPCPMRegs *pm, DeviceState *dev, Error **errp)
 {
 if (pm->acpi_memory_hotplug.is_enabled &&
-object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
 acpi_memory_plug_cb(>acpi_regs, pm->irq, >acpi_memory_hotplug,
 dev, errp);
 } else if (object_dynamic_cast(OBJECT(dev), TYPE_CPU)) {
@@ -455,7 +455,7 @@ void ich9_pm_device_unplug_request_cb(ICH9LPCPMRegs *pm, 
DeviceState *dev,
   Error **errp)
 {
 if (pm->acpi_memory_hotplug.is_enabled &&
-object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
 acpi_memory_unplug_request_cb(>acpi_regs, pm->irq,
   >acpi_memory_hotplug, dev, errp);
 } else {
@@ -468,7 +468,7 @@ void ich9_pm_device_unplug_cb(ICH9LPCPMRegs *pm, 
DeviceState *dev,
   Error **errp)
 {
 if (pm->acpi_memory_hotplug.is_enabled &&
-object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM)) {
+object_dynamic_cast(OBJECT(dev), TYPE_DIMM)) {
 acpi_memory_unplug_cb(>acpi_memory_hotplug, dev, errp);
 } else {
 error_setg(errp, "acpi: device unplug for not supported device"
diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
index 2ff0d5c..1f6 100644
--- a/hw/acpi/memory_hotplug.c
+++ b/hw/acpi/memory_hotplug.c
@@ -54,23 +54,23 @@ static uint64_t acpi_memory_hotplug_read(void *opaque, 
hwaddr addr,
 o = OBJECT(mdev->dimm);
 switch (addr) {
 case 0x0: /* Lo part of phys address where DIMM is mapped */
-val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) : 0;
+val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) : 0;
 trace_mhp_acpi_read_addr_lo(mem_st->selector, val);
 break;
 case 0x4: /* Hi part of phys address where DIMM is mapped */
-val = o ? object_property_get_int(o, PC_DIMM_ADDR_PROP, NULL) >> 32 : 
0;
+val = o ? object_property_get_int(o, DIMM_ADDR_PROP, NULL) >> 32 : 0;
 trace_mhp_acpi_read_addr_hi(mem_st->selector, val);
 break;
 case 0x8: /* Lo part of DIMM size */
-val = o ? object_property_get_int(o, PC_DIMM_SIZE_PROP, NULL) : 0;
+val = o ? object_property_get_int(o, DIMM_SIZE_PROP, NULL) : 0;
 trace_mhp_acpi_read_size_lo(mem_st->selector, val);
 break;
 case 0xc: /* Hi part of DIMM size */
-val = o ? object_property_get_int(o, PC_DIMM_SIZE_PROP, NULL) >> 32 : 
0;
+val = o ? object_property_get_int(o, DIMM_SIZE_PROP, NULL) >> 32 : 0;
 trace_mhp_acpi_read_size_hi(mem_st->selector, val);
 break;
 case 0x10: /* node proximity for _PXM method */
-val = o ? object_property_get_int(o, PC_DIMM_NODE_PROP, NULL) : 0;
+val = o ? object_property_get_int(o, DIMM_NODE_PROP, NULL) : 0;
 trace_mhp_acpi_read_pxm(mem_st->selector, val);
 break;
 case 0x14: /* pack and return is_* 

[PATCH v5 15/33] stubs: rename qmp_pc_dimm_device_list.c

2015-10-28 Thread Xiao Guangrong
Rename qmp_pc_dimm_device_list.c to qmp_dimm_device_list.c

Signed-off-by: Xiao Guangrong 
---
 stubs/Makefile.objs | 2 +-
 stubs/{qmp_pc_dimm_device_list.c => qmp_dimm_device_list.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename stubs/{qmp_pc_dimm_device_list.c => qmp_dimm_device_list.c} (100%)

diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index ce6ce11..e28af50 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -37,6 +37,6 @@ stub-obj-y += vmstate.o
 stub-obj-$(CONFIG_WIN32) += fd-register.o
 stub-obj-y += cpus.o
 stub-obj-y += kvm.o
-stub-obj-y += qmp_pc_dimm_device_list.o
+stub-obj-y += qmp_dimm_device_list.o
 stub-obj-y += target-monitor-defs.o
 stub-obj-y += vhost.o
diff --git a/stubs/qmp_pc_dimm_device_list.c b/stubs/qmp_dimm_device_list.c
similarity index 100%
rename from stubs/qmp_pc_dimm_device_list.c
rename to stubs/qmp_dimm_device_list.c
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Michael S. Tsirkin
On Wed, Oct 28, 2015 at 11:13:29PM +0900, David Woodhouse wrote:
> On Wed, 2015-10-28 at 16:05 +0200, Michael S. Tsirkin wrote:
> > 
> > Short answer - platforms need a way to discover, and express different
> > security requirements of different devices.
> 
> Sure. PLATFORMS need that. Do not let it go anywhere near your device
> drivers. Including the virtio drivers.

But would there be any users of this outside the virtio subsystem?
If no, maybe virtio core is a logical place to keep this.

> > If they continue to lack that, we'll need a custom API in virtio,
> > and while this seems a bit less elegant, I would not see that as
> > the end of the world at all, there are not that many virtio drivers.
> 
> No. If they continue to lack that, we fix them. This is a *platform*
> issue. The DMA API shall do the right thing. Do not second-guess it.
> 
> 
>  (From the other mail)

I don't have a problem with extending DMA API to address
more usecases.

> > > > OK so I guess that means we should prefer a transport-specific
> > > > interface in virtio-pci then.
> > >
> > > Why?
> > 
> > Because you said you are doing something device tree specific for 
> > ARM, aren't you?
> 
> Nonono. The ARM platform code might do that, and the DMA API on ARM
> *might* give you I/O virtual addresses that look a lot like the
> physical addresses you asked it to map. That's none of your business.
> Drivers use DMA API. No more talky.

Well for virtio they don't ATM. And 1:1 mapping makes perfect sense
for the wast majority of users, so I can't switch them over
until the DMA API actually addresses all existing usecases.


> -- 
> dwmw2
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Will Deacon
On Wed, Oct 28, 2015 at 01:12:45PM +, Eric Auger wrote:
> Current vfio_pgsize_bitmap code hides the supported IOMMU page
> sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
> does not support PAGE_SIZE page, the alignment check on map/unmap
> is done with larger page sizes, if any. This can fail although
> mapping could be done with pages smaller than PAGE_SIZE.
> 
> vfio_pgsize_bitmap is modified to expose the IOMMU page sizes,
> supported by all domains, even those smaller than PAGE_SIZE. The
> alignment check on map is performed against PAGE_SIZE if the minimum
> IOMMU size is less than PAGE_SIZE or against the min page size greater
> than PAGE_SIZE.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> This was tested on AMD Seattle with 64kB page host. ARM MMU 401
> currently expose 4kB, 2MB and 1GB page support. With a 64kB page host,
> the map/unmap check is done against 2MB. Some alignment check fail
> so VFIO_IOMMU_MAP_DMA fail while we could map using 4kB IOMMU page
> size.
> ---
>  drivers/vfio/vfio_iommu_type1.c | 25 +++--
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 57d8c37..13fb974 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> struct vfio_dma *dma)
>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  {
>   struct vfio_domain *domain;
> - unsigned long bitmap = PAGE_MASK;
> + unsigned long bitmap = ULONG_MAX;
>  
>   mutex_lock(>lock);
>   list_for_each_entry(domain, >domain_list, next)
> @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
> vfio_iommu *iommu)
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> - uint64_t mask;
>   struct vfio_dma *dma;
>   size_t unmapped = 0;
>   int ret = 0;
> + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> + PAGE_SIZE : min_pagesz;

max_t ?

>  
> - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> -
> - if (unmap->iova & mask)
> + if (!IS_ALIGNED(unmap->iova, requested_alignment))
>   return -EINVAL;
> - if (!unmap->size || unmap->size & mask)
> + if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
>   return -EINVAL;
>  
> - WARN_ON(mask & PAGE_MASK);
> -
>   mutex_lock(>lock);
>  
>   /*
> @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>   size_t size = map->size;
>   long npage;
>   int ret = 0, prot = 0;
> - uint64_t mask;
>   struct vfio_dma *dma;
>   unsigned long pfn;
> + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> + PAGE_SIZE : min_pagesz;

Same code here. Perhaps you need another function to get the alignment?

Otherwise, this looks pretty straightforward to me; iommu_map will take
care of splitting up the requests to the IOMMU driver so they are in the
right chunks.

Will
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Michael S. Tsirkin
On Wed, Oct 28, 2015 at 10:35:27PM +0900, David Woodhouse wrote:
> On Wed, 2015-10-28 at 13:35 +0200, Michael S. Tsirkin wrote:
> > E.g. on intel x86, there's an option iommu=pt which does the 1:1
> > thing for devices when used by kernel, but enables
> > the iommu if used by userspace/VMs.
> 
> That's none of your business.
> 
> You call the DMA API when you do DMA. That's all there is to it.
> 
> If the IOMMU happens to be in passthrough mode, or your device happens
> to not to be routed through an IOMMU today, then I/O virtual address
> you get back from the DMA API will look a *lot* like the physical
> address you asked the DMA to map. You might think there's no IOMMU. We
> couldn't possibly comment.
> 
> Use the DMA API. Always. Let the platform worry about whether it
> actually needs to *do* anything or not.
> -- 
> dwmw2
> 
> 

Short answer - platforms need a way to discover, and express different
security requirements of different devices.  If they continue to lack
that, we'll need a custom API in virtio, and while this seems a bit less
elegant, I would not see that as the end of the world at all, there are
not that many virtio drivers.

And hey - that's just an internal API.  We can change it later at a whim.

Long answer - PV is weird. It's not always the same as real hardware.

For PV, it's generally hypervisor doing writes into memory.

If it's monolitic with device emulation in same memory space as the
hypervisor (e.g. in the case of the current QEMU, or using vhost in host
kernel), then you gain *no security* by "restricting" it by means of the
IOMMU - the IOMMU is part of the same hypervisor.

If it is modular with device emulation in a separate memory space (e.g.
in case of Xen, or vhost-user in modern QEMU) then you do gain security:
the part emulating the IOMMU limits the part doing DMA.

In both cases for assigned devices, it is always modular in a sense, so
you do gain security since that is restricted by the hardware IOMMU.

The way things are set up at the moment, it's mostly global,
with iommu=pt on intel being a kind of exception.
We need host/guest and API interfaces that are more nuanced than that.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 28/33] nvdimm acpi: support Get Namespace Label Size function

2015-10-28 Thread Xiao Guangrong
Function 4 is used to get Namespace label size

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 87 +++-
 1 file changed, 86 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 8efa640..72203d2 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -407,15 +407,48 @@ enum {
 NVDIMM_DSM_DEV_STATUS_VENDOR_SPECIFIC_ERROR = 4,
 };
 
+struct nvdimm_func_in_get_label_data {
+uint32_t offset; /* the offset in the namespace label data area. */
+uint32_t length; /* the size of data is to be read via the function. */
+} QEMU_PACKED;
+typedef struct nvdimm_func_in_get_label_data nvdimm_func_in_get_label_data;
+
+struct nvdimm_func_in_set_label_data {
+uint32_t offset; /* the offset in the namespace label data area. */
+uint32_t length; /* the size of data is to be written via the function. */
+uint8_t in_buf[0]; /* the data written to label data area. */
+} QEMU_PACKED;
+typedef struct nvdimm_func_in_set_label_data nvdimm_func_in_set_label_data;
+
 struct nvdimm_dsm_in {
 uint32_t handle;
 uint32_t revision;
 uint32_t function;
/* the remaining size in the page is used by arg3. */
-uint8_t arg3[0];
+union {
+uint8_t arg3[0];
+nvdimm_func_in_set_label_data func_set_label_data;
+};
 } QEMU_PACKED;
 typedef struct nvdimm_dsm_in nvdimm_dsm_in;
 
+struct nvdimm_func_out_label_size {
+uint32_t status; /* return status code. */
+uint32_t label_size; /* the size of label data area. */
+/*
+ * Maximum size of the namespace label data length supported by
+ * the platform in Get/Set Namespace Label Data functions.
+ */
+uint32_t max_xfer;
+} QEMU_PACKED;
+typedef struct nvdimm_func_out_label_size nvdimm_func_out_label_size;
+
+struct nvdimm_func_out_get_label_data {
+uint32_t status;/*return status code. */
+uint8_t out_buf[0]; /* the data got via Get Namesapce Label function. */
+} QEMU_PACKED;
+typedef struct nvdimm_func_out_get_label_data nvdimm_func_out_get_label_data;
+
 static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
 {
 status = cpu_to_le32(status);
@@ -445,6 +478,55 @@ static void nvdimm_dsm_root(nvdimm_dsm_in *in, GArray *out)
 nvdimm_dsm_write_status(out, status);
 }
 
+/*
+ * the max transfer size is the max size transferred by both a
+ * 'Get Namespace Label Data' function and a 'Set Namespace Label Data'
+ * function.
+ */
+static uint32_t nvdimm_get_max_xfer_label_size(void)
+{
+nvdimm_dsm_in *in;
+uint32_t max_get_size, max_set_size, dsm_memory_size = getpagesize();
+
+/*
+ * the max data ACPI can read one time which is transferred by
+ * the response of 'Get Namespace Label Data' function.
+ */
+max_get_size = dsm_memory_size - sizeof(nvdimm_func_out_get_label_data);
+
+/*
+ * the max data ACPI can write one time which is transferred by
+ * 'Set Namespace Label Data' function.
+ */
+max_set_size = dsm_memory_size - offsetof(nvdimm_dsm_in, arg3) -
+   sizeof(in->func_set_label_data);
+
+return MIN(max_get_size, max_set_size);
+}
+
+/*
+ * DSM Spec Rev1 4.4 Get Namespace Label Size (Function Index 4).
+ *
+ * It gets the size of Namespace Label data area and the max data size
+ * that Get/Set Namespace Label Data functions can transfer.
+ */
+static void nvdimm_dsm_func_label_size(NVDIMMDevice *nvdimm, GArray *out)
+{
+nvdimm_func_out_label_size func_label_size;
+uint32_t label_size, mxfer;
+
+label_size = nvdimm->label_size;
+mxfer = nvdimm_get_max_xfer_label_size();
+
+nvdimm_debug("label_size %#x, max_xfer %#x.\n", label_size, mxfer);
+
+func_label_size.status = cpu_to_le32(NVDIMM_DSM_STATUS_SUCCESS);
+func_label_size.label_size = cpu_to_le32(label_size);
+func_label_size.max_xfer = cpu_to_le32(mxfer);
+
+g_array_append_vals(out, _label_size, sizeof(func_label_size));
+}
+
 static void nvdimm_dsm_device(nvdimm_dsm_in *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -469,6 +551,9 @@ static void nvdimm_dsm_device(nvdimm_dsm_in *in, GArray 
*out)
1 << 6 /* Set Namespace Label Data */);
 build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
 goto free;
+case 0x4 /* Get Namespace Label Size */:
+nvdimm_dsm_func_label_size(nvdimm, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 03/33] acpi: add aml_create_field

2015-10-28 Thread Xiao Guangrong
Implement CreateField term which is used by NVDIMM _DSM method in later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 13 +
 include/hw/acpi/aml-build.h |  1 +
 2 files changed, 14 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index a72214d..9fe5e7b 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1151,6 +1151,19 @@ Aml *aml_sizeof(Aml *arg)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.2 Named Objects Encoding: DefCreateField */
+Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name)
+{
+Aml *var = aml_alloc();
+build_append_byte(var->buf, 0x5B); /* ExtOpPrefix */
+build_append_byte(var->buf, 0x13); /* CreateFieldOp */
+aml_append(var, srcbuf);
+aml_append(var, index);
+aml_append(var, len);
+build_append_namestring(var->buf, "%s", name);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 7296efb..7e1c43b 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -276,6 +276,7 @@ Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
 Aml *aml_derefof(Aml *arg);
 Aml *aml_sizeof(Aml *arg);
+Aml *aml_create_field(Aml *srcbuf, Aml *index, Aml *len, const char *name);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 26/33] nvdimm acpi: save arg3 for NVDIMM device _DSM method

2015-10-28 Thread Xiao Guangrong
Check if the input Arg3 is valid then store it into dsm_in if needed

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 8412be3..69de4f6 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -524,13 +524,38 @@ static void nvdimm_build_acpi_devices(GSList 
*device_list, Aml *sb_scope)
 
 method = aml_method_serialized("NCAL", 4);
 {
-Aml *buffer_size = aml_local(0);
+Aml *ifctx, *pckg, *buffer_size = aml_local(0);
 
 aml_append(method, aml_store(aml_arg(0), aml_name("HDLE")));
 aml_append(method, aml_store(aml_arg(1), aml_name("REVS")));
 aml_append(method, aml_store(aml_arg(2), aml_name("FUNC")));
 
 /*
+ * The fourth parameter (Arg3) of _DSM is a package which contains
+ * a buffer, the layout of the buffer is specified by UUID (Arg0),
+ * Revision ID (Arg1) and Function Index (Arg2) which are documented
+ * in the DSM Spec.
+ */
+pckg = aml_arg(3);
+ifctx = aml_if(aml_and(aml_equal(aml_object_type(pckg),
+ aml_int(4 /* Package */)),
+   aml_equal(aml_sizeof(pckg),
+ aml_int(1;
+{
+Aml *pckg_index, *pckg_buf;
+
+pckg_index = aml_local(2);
+pckg_buf = aml_local(3);
+
+aml_append(ifctx, aml_store(aml_index(pckg, aml_int(0)),
+pckg_index));
+aml_append(ifctx, aml_store(aml_derefof(pckg_index),
+pckg_buf));
+aml_append(ifctx, aml_store(pckg_buf, aml_name("ARG3")));
+}
+aml_append(method, ifctx);
+
+/*
  * transfer control to QEMU and the buffer size filled by
  * QEMU is returned.
  */
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 31/33] nvdimm: allow using whole backend memory as pmem

2015-10-28 Thread Xiao Guangrong
Introduce a parameter, named "reserve-label-data", if it is
false which indicates that QEMU does not reserve any region
on the backend memory to support label data. It is a
'label-less' NVDIMM device mode that linux will use whole
memory on the device as a single namesapce

This is useful for the users who want to pass whole nvdimm
device and make its data completely be visible to guest

The parameter is false on default

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c| 20 
 hw/mem/nvdimm.c | 37 -
 include/hw/mem/nvdimm.h |  6 ++
 3 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 5e72ca8..c5a50ea 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -545,6 +545,13 @@ static void nvdimm_dsm_func_get_label_data(NVDIMMDevice 
*nvdimm,
 nvdimm_debug("Read Label Data: offset %#x length %#x.\n",
  get_label_data->offset, get_label_data->length);
 
+if (!nvdimm->reserve_label_data) {
+nvdimm_debug("read label request on the device without "
+ "label data reserved.\n");
+status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+goto exit;
+}
+
 if (nvdimm->label_size < get_label_data->offset + get_label_data->length) {
 nvdimm_debug("position %#x is beyond label data (len = %#lx).\n",
  get_label_data->offset + get_label_data->length,
@@ -588,6 +595,13 @@ static void nvdimm_dsm_func_set_label_data(NVDIMMDevice 
*nvdimm,
 nvdimm_debug("Write Label Data: offset %#x length %#x.\n",
  set_label_data->offset, set_label_data->length);
 
+if (!nvdimm->reserve_label_data) {
+nvdimm_debug("write label request on the device without "
+ "label data reserved.\n");
+status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
+goto exit;
+}
+
 if (nvdimm->label_size < set_label_data->offset + set_label_data->length) {
 nvdimm_debug("position %#x is beyond label data (len = %#lx).\n",
  set_label_data->offset + set_label_data->length,
@@ -632,6 +646,12 @@ static void nvdimm_dsm_device(nvdimm_dsm_in *in, GArray 
*out)
1 << 4 /* Get Namespace Label Size */ |
1 << 5 /* Get Namespace Label Data */ |
1 << 6 /* Set Namespace Label Data */);
+
+/* no function support if the device does not have label data. */
+if (!nvdimm->reserve_label_data) {
+cmd_list = cpu_to_le64(0);
+}
+
 build_append_int_noprefix(out, cmd_list, sizeof(cmd_list));
 goto free;
 case 0x4 /* Get Namespace Label Size */:
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
index 825d664..27dbfbd 100644
--- a/hw/mem/nvdimm.c
+++ b/hw/mem/nvdimm.c
@@ -36,14 +36,15 @@ static void nvdimm_realize(DIMMDevice *dimm, Error **errp)
 {
 MemoryRegion *mr;
 NVDIMMDevice *nvdimm = NVDIMM(dimm);
-uint64_t size;
+uint64_t reserved_label_size, size;
 
 nvdimm->label_size = MIN_NAMESPACE_LABEL_SIZE;
+reserved_label_size = nvdimm->reserve_label_data ? nvdimm->label_size : 0;
 
 mr = host_memory_backend_get_memory(dimm->hostmem, errp);
 size = memory_region_size(mr);
 
-if (size <= nvdimm->label_size) {
+if (size <= reserved_label_size) {
 char *path = 
object_get_canonical_path_component(OBJECT(dimm->hostmem));
 error_setg(errp, "the size of memdev %s (0x%" PRIx64 ") is too small"
" to contain nvdimm namespace label (0x%" PRIx64 ")", path,
@@ -52,9 +53,12 @@ static void nvdimm_realize(DIMMDevice *dimm, Error **errp)
 }
 
 memory_region_init_alias(>nvdimm_mr, OBJECT(dimm), "nvdimm-memory",
- mr, 0, size - nvdimm->label_size);
-nvdimm->label_data = memory_region_get_ram_ptr(mr) +
- memory_region_size(>nvdimm_mr);
+ mr, 0, size - reserved_label_size);
+
+if (reserved_label_size) {
+nvdimm->label_data = memory_region_get_ram_ptr(mr) +
+ memory_region_size(>nvdimm_mr);
+}
 }
 
 static void nvdimm_read_label_data(NVDIMMDevice *nvdimm, void *buf,
@@ -97,10 +101,33 @@ static void nvdimm_class_init(ObjectClass *oc, void *data)
 nvc->write_label_data = nvdimm_write_label_data;
 }
 
+static bool nvdimm_get_reserve_label_data(Object *obj, Error **errp)
+{
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+return nvdimm->reserve_label_data;
+}
+
+static void
+nvdimm_set_reserve_label_data(Object *obj, bool value, Error **errp)
+{
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+nvdimm->reserve_label_data = value;
+}
+
+static void nvdimm_init(Object *obj)
+{
+object_property_add_bool(obj, "reserve-label-data",
+ nvdimm_get_reserve_label_data,
+ 

[PATCH v5 09/33] exec: allow file_ram_alloc to work on file

2015-10-28 Thread Xiao Guangrong
Currently, file_ram_alloc() only works on directory - it creates a file
under @path and do mmap on it

This patch tries to allow it to work on file directly, if @path is a
directory it works as before, otherwise it treats @path as the target
file then directly allocate memory from it

Signed-off-by: Xiao Guangrong 
---
 exec.c | 80 ++
 1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/exec.c b/exec.c
index d2a3357..09e9938 100644
--- a/exec.c
+++ b/exec.c
@@ -1157,14 +1157,60 @@ void qemu_mutex_unlock_ramlist(void)
 }
 
 #ifdef __linux__
+static bool path_is_dir(const char *path)
+{
+struct stat fs;
+
+return stat(path, ) == 0 && S_ISDIR(fs.st_mode);
+}
+
+static int open_file_path(RAMBlock *block, const char *path, size_t size)
+{
+char *filename;
+char *sanitized_name;
+char *c;
+int fd;
+
+if (!path_is_dir(path)) {
+int flags = (block->flags & RAM_SHARED) ? O_RDWR : O_RDONLY;
+
+flags |= O_EXCL;
+return open(path, flags);
+}
+
+/* Make name safe to use with mkstemp by replacing '/' with '_'. */
+sanitized_name = g_strdup(memory_region_name(block->mr));
+for (c = sanitized_name; *c != '\0'; c++) {
+if (*c == '/') {
+*c = '_';
+}
+}
+filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
+   sanitized_name);
+g_free(sanitized_name);
+fd = mkstemp(filename);
+if (fd >= 0) {
+unlink(filename);
+/*
+ * ftruncate is not supported by hugetlbfs in older
+ * hosts, so don't bother bailing out on errors.
+ * If anything goes wrong with it under other filesystems,
+ * mmap will fail.
+ */
+if (ftruncate(fd, size)) {
+perror("ftruncate");
+}
+}
+g_free(filename);
+
+return fd;
+}
+
 static void *file_ram_alloc(RAMBlock *block,
 ram_addr_t memory,
 const char *path,
 Error **errp)
 {
-char *filename;
-char *sanitized_name;
-char *c;
 void *area;
 int fd;
 uint64_t pagesize;
@@ -1194,38 +1240,14 @@ static void *file_ram_alloc(RAMBlock *block,
 goto error;
 }
 
-/* Make name safe to use with mkstemp by replacing '/' with '_'. */
-sanitized_name = g_strdup(memory_region_name(block->mr));
-for (c = sanitized_name; *c != '\0'; c++) {
-if (*c == '/')
-*c = '_';
-}
-
-filename = g_strdup_printf("%s/qemu_back_mem.%s.XX", path,
-   sanitized_name);
-g_free(sanitized_name);
+memory = ROUND_UP(memory, pagesize);
 
-fd = mkstemp(filename);
+fd = open_file_path(block, path, memory);
 if (fd < 0) {
 error_setg_errno(errp, errno,
  "unable to create backing store for path %s", path);
-g_free(filename);
 goto error;
 }
-unlink(filename);
-g_free(filename);
-
-memory = ROUND_UP(memory, pagesize);
-
-/*
- * ftruncate is not supported by hugetlbfs in older
- * hosts, so don't bother bailing out on errors.
- * If anything goes wrong with it under other filesystems,
- * mmap will fail.
- */
-if (ftruncate(fd, memory)) {
-perror("ftruncate");
-}
 
 area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
 if (area == MAP_FAILED) {
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 19/33] dimm: keep the state of the whole backend memory

2015-10-28 Thread Xiao Guangrong
QEMU keeps the state of memory of dimm device during live migration,
however, it is not enough for nvdimm device as its memory does not
contain its label data, so that we should protect the whole backend
memory instead

Signed-off-by: Xiao Guangrong 
---
 hw/mem/dimm.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 9e0403a..478cacd 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -135,9 +135,16 @@ void dimm_memory_plug(DeviceState *dev, MemoryHotplugState 
*hpms,
 }
 
 memory_region_add_subregion(>mr, addr - hpms->base, mr);
-vmstate_register_ram(mr, dev);
 numa_set_mem_node_id(addr, memory_region_size(mr), dimm->node);
 
+/*
+ * save the state only for @mr is not enough as it does not contain
+ * the label data of NVDIMM device, so that we keep the state of
+ * whole hostmem instead.
+ */
+vmstate_register_ram(host_memory_backend_get_memory(dimm->hostmem, errp),
+ dev);
+
 out:
 error_propagate(errp, local_err);
 }
@@ -146,10 +153,13 @@ void dimm_memory_unplug(DeviceState *dev, 
MemoryHotplugState *hpms,
MemoryRegion *mr)
 {
 DIMMDevice *dimm = DIMM(dev);
+MemoryRegion *backend_mr;
+
+backend_mr = host_memory_backend_get_memory(dimm->hostmem, _abort);
 
 numa_unset_mem_node_id(dimm->addr, memory_region_size(mr), dimm->node);
 memory_region_del_subregion(>mr, mr);
-vmstate_unregister_ram(mr, dev);
+vmstate_unregister_ram(backend_mr, dev);
 }
 
 int qmp_dimm_device_list(Object *obj, void *opaque)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 02/33] acpi: add aml_sizeof

2015-10-28 Thread Xiao Guangrong
Implement SizeOf term which is used by NVDIMM _DSM method in later patch

Reviewed-by: Igor Mammedov 
Signed-off-by: Xiao Guangrong 
---
 hw/acpi/aml-build.c | 8 
 include/hw/acpi/aml-build.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index cbd53f4..a72214d 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1143,6 +1143,14 @@ Aml *aml_derefof(Aml *arg)
 return var;
 }
 
+/* ACPI 1.0b: 16.2.5.4 Type 2 Opcodes Encoding: DefSizeOf */
+Aml *aml_sizeof(Aml *arg)
+{
+Aml *var = aml_opcode(0x87 /* SizeOfOp */);
+aml_append(var, arg);
+return var;
+}
+
 void
 build_header(GArray *linker, GArray *table_data,
  AcpiTableHeader *h, const char *sig, int len, uint8_t rev)
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 5a03d33..7296efb 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -275,6 +275,7 @@ Aml *aml_varpackage(uint32_t num_elements);
 Aml *aml_touuid(const char *uuid);
 Aml *aml_unicode(const char *str);
 Aml *aml_derefof(Aml *arg);
+Aml *aml_sizeof(Aml *arg);
 
 void
 build_header(GArray *linker, GArray *table_data,
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 21/33] nvdimm: implement NVDIMM device abstract

2015-10-28 Thread Xiao Guangrong
Introduce "nvdimm" device which is based on dimm device type

128K memory region which is the minimum namespace label size
required by NVDIMM Namespace Spec locates at the end of
backend memory device is reserved for label data

We can use "-m 1G,maxmem=100G,slots=10 -object memory-backend-file,
id=mem1,size=1G,mem-path=/dev/pmem0 -device nvdimm,memdev=mem1" to
create NVDIMM device for guest

Signed-off-by: Xiao Guangrong 
---
 default-configs/i386-softmmu.mak   |   1 +
 default-configs/x86_64-softmmu.mak |   1 +
 hw/acpi/memory_hotplug.c   |   6 ++
 hw/mem/Makefile.objs   |   1 +
 hw/mem/nvdimm.c| 113 +
 include/hw/mem/nvdimm.h|  83 +++
 6 files changed, 205 insertions(+)
 create mode 100644 hw/mem/nvdimm.c
 create mode 100644 include/hw/mem/nvdimm.h

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 3ece8bb..4e84a1c 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -47,6 +47,7 @@ CONFIG_APIC=y
 CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
+CONFIG_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index 92ea7c1..e877a86 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -47,6 +47,7 @@ CONFIG_APIC=y
 CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
+CONFIG_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
index e232641..92cd973 100644
--- a/hw/acpi/memory_hotplug.c
+++ b/hw/acpi/memory_hotplug.c
@@ -1,6 +1,7 @@
 #include "hw/acpi/memory_hotplug.h"
 #include "hw/acpi/pc-hotplug.h"
 #include "hw/mem/dimm.h"
+#include "hw/mem/nvdimm.h"
 #include "hw/boards.h"
 #include "hw/qdev-core.h"
 #include "trace.h"
@@ -231,6 +232,11 @@ void acpi_memory_plug_cb(ACPIREGS *ar, qemu_irq irq, 
MemHotplugState *mem_st,
 {
 MemStatus *mdev;
 
+/* Currently, NVDIMM hotplug has not been supported yet. */
+if (object_dynamic_cast(OBJECT(dev), TYPE_NVDIMM)) {
+return;
+}
+
 mdev = acpi_memory_slot_status(mem_st, dev, errp);
 if (!mdev) {
 return;
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index cebb4b1..12d9b72 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1,2 +1,3 @@
 common-obj-$(CONFIG_DIMM) += dimm.o
 common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
+common-obj-$(CONFIG_NVDIMM) += nvdimm.o
diff --git a/hw/mem/nvdimm.c b/hw/mem/nvdimm.c
new file mode 100644
index 000..825d664
--- /dev/null
+++ b/hw/mem/nvdimm.c
@@ -0,0 +1,113 @@
+/*
+ * Non-Volatile Dual In-line Memory Module Virtualization Implementation
+ *
+ * Copyright(C) 2015 Intel Corporation.
+ *
+ * Author:
+ *  Xiao Guangrong 
+ *
+ * Currently, it only supports PMEM Virtualization.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "qapi/visitor.h"
+#include "hw/mem/nvdimm.h"
+
+static MemoryRegion *nvdimm_get_memory_region(DIMMDevice *dimm)
+{
+NVDIMMDevice *nvdimm = NVDIMM(dimm);
+
+return memory_region_size(>nvdimm_mr) ? >nvdimm_mr : NULL;
+}
+
+static void nvdimm_realize(DIMMDevice *dimm, Error **errp)
+{
+MemoryRegion *mr;
+NVDIMMDevice *nvdimm = NVDIMM(dimm);
+uint64_t size;
+
+nvdimm->label_size = MIN_NAMESPACE_LABEL_SIZE;
+
+mr = host_memory_backend_get_memory(dimm->hostmem, errp);
+size = memory_region_size(mr);
+
+if (size <= nvdimm->label_size) {
+char *path = 
object_get_canonical_path_component(OBJECT(dimm->hostmem));
+error_setg(errp, "the size of memdev %s (0x%" PRIx64 ") is too small"
+   " to contain nvdimm namespace label (0x%" PRIx64 ")", path,
+   memory_region_size(mr), nvdimm->label_size);
+return;
+}
+
+memory_region_init_alias(>nvdimm_mr, OBJECT(dimm), "nvdimm-memory",
+ mr, 0, size - nvdimm->label_size);
+nvdimm->label_data = memory_region_get_ram_ptr(mr) +
+ memory_region_size(>nvdimm_mr);
+}
+
+static void nvdimm_read_label_data(NVDIMMDevice *nvdimm, void *buf,
+   

[PATCH v5 33/33] nvdimm: add maintain info

2015-10-28 Thread Xiao Guangrong
Add NVDIMM maintainer

Signed-off-by: Xiao Guangrong 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9bd2b8f..a3c38cc 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -868,6 +868,13 @@ M: Jiri Pirko 
 S: Maintained
 F: hw/net/rocker/
 
+NVDIMM
+M: Xiao Guangrong 
+S: Maintained
+F: hw/acpi/nvdimm.c
+F: hw/mem/nvdimm.c
+F: include/hw/mem/nvdimm.h
+
 Subsystems
 --
 Audio
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 23/33] nvdimm acpi: init the resource used by NVDIMM ACPI

2015-10-28 Thread Xiao Guangrong
A page staring from 0xFF0 and IO port 0x0a18 - 0xa1b in guest are
reserved for NVDIMM ACPI emulation, refer to docs/specs/acpi_nvdimm.txt
for detailed design

A parameter, 'nvdimm-support', is introduced for PIIX4_PM and ICH9-LPC
that controls if nvdimm support is enabled, it is true on default and
it is false on 2.4 and its earlier version to keep compatibility

Signed-off-by: Xiao Guangrong 
---
 default-configs/i386-softmmu.mak |  1 +
 default-configs/mips-softmmu.mak |  1 +
 default-configs/mips64-softmmu.mak   |  1 +
 default-configs/mips64el-softmmu.mak |  1 +
 default-configs/mipsel-softmmu.mak   |  1 +
 default-configs/x86_64-softmmu.mak   |  1 +
 hw/acpi/Makefile.objs|  1 +
 hw/acpi/ich9.c   | 24 ++
 hw/acpi/nvdimm.c | 63 
 hw/acpi/piix4.c  | 27 
 include/hw/acpi/ich9.h   |  3 ++
 include/hw/i386/pc.h | 10 ++
 include/hw/mem/nvdimm.h  | 34 +++
 13 files changed, 161 insertions(+), 7 deletions(-)
 create mode 100644 hw/acpi/nvdimm.c

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 4e84a1c..51e71d4 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -48,6 +48,7 @@ CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
 CONFIG_NVDIMM=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/default-configs/mips-softmmu.mak b/default-configs/mips-softmmu.mak
index 44467c3..6b8b70e 100644
--- a/default-configs/mips-softmmu.mak
+++ b/default-configs/mips-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mips64-softmmu.mak 
b/default-configs/mips64-softmmu.mak
index 66ed5f9..ea820f6 100644
--- a/default-configs/mips64-softmmu.mak
+++ b/default-configs/mips64-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mips64el-softmmu.mak 
b/default-configs/mips64el-softmmu.mak
index bfca2b2..8993851 100644
--- a/default-configs/mips64el-softmmu.mak
+++ b/default-configs/mips64el-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/mipsel-softmmu.mak 
b/default-configs/mipsel-softmmu.mak
index 0162ef0..87ab964 100644
--- a/default-configs/mipsel-softmmu.mak
+++ b/default-configs/mipsel-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
 CONFIG_I8257=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index e877a86..0a7dc10 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -48,6 +48,7 @@ CONFIG_IOAPIC=y
 CONFIG_PVPANIC=y
 CONFIG_MEM_HOTPLUG=y
 CONFIG_NVDIMM=y
+CONFIG_ACPI_NVDIMM=y
 CONFIG_XIO3130=y
 CONFIG_IOH3420=y
 CONFIG_I82801B11=y
diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
index 7d3230c..095597f 100644
--- a/hw/acpi/Makefile.objs
+++ b/hw/acpi/Makefile.objs
@@ -2,6 +2,7 @@ common-obj-$(CONFIG_ACPI_X86) += core.o piix4.o pcihp.o
 common-obj-$(CONFIG_ACPI_X86_ICH) += ich9.o tco.o
 common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu_hotplug.o
 common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
+common-obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
 common-obj-$(CONFIG_ACPI) += acpi_interface.o
 common-obj-$(CONFIG_ACPI) += bios-linker-loader.o
 common-obj-$(CONFIG_ACPI) += aml-build.o
diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
index 1e9ae20..603c1bd 100644
--- a/hw/acpi/ich9.c
+++ b/hw/acpi/ich9.c
@@ -280,6 +280,12 @@ void ich9_pm_init(PCIDevice *lpc_pci, ICH9LPCPMRegs *pm,
 acpi_memory_hotplug_init(pci_address_space_io(lpc_pci), 
OBJECT(lpc_pci),
  >acpi_memory_hotplug);
 }
+
+if (pm->acpi_nvdimm_state.is_enabled) {
+nvdimm_init_acpi_state(pci_address_space(lpc_pci),
+   pci_address_space_io(lpc_pci), OBJECT(lpc_pci),
+   >acpi_nvdimm_state);
+}
 }
 
 static void ich9_pm_get_gpe0_blk(Object *obj, Visitor *v,
@@ -307,6 +313,20 @@ static void ich9_pm_set_memory_hotplug_support(Object 
*obj, bool value,
 s->pm.acpi_memory_hotplug.is_enabled = value;
 }
 
+static bool ich9_pm_get_nvdimm_support(Object *obj, Error **errp)
+{
+ICH9LPCState *s = ICH9_LPC_DEVICE(obj);
+
+return s->pm.acpi_nvdimm_state.is_enabled;
+}
+
+static void 

[PATCH v5 18/33] dimm: get mapped memory region from DIMMDeviceClass->get_memory_region

2015-10-28 Thread Xiao Guangrong
Curretly, the memory region of backed memory is directly mapped to
guest's address space, however, it is not true for nvdimm device

This patch let dimm device realize this fact and use
DIMMDeviceClass->get_memory_region method to get the mapped memory
region

Signed-off-by: Xiao Guangrong 
---
 hw/mem/dimm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 23d5daa..9e0403a 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -380,8 +380,9 @@ static void dimm_get_size(Object *obj, Visitor *v, void 
*opaque,
 int64_t value;
 MemoryRegion *mr;
 DIMMDevice *dimm = DIMM(obj);
+DIMMDeviceClass *ddc = DIMM_GET_CLASS(obj);
 
-mr = host_memory_backend_get_memory(dimm->hostmem, errp);
+mr = ddc->get_memory_region(dimm);
 value = memory_region_size(mr);
 
 visit_type_int(v, , name, errp);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 16:05 +0200, Michael S. Tsirkin wrote:
> 
> Short answer - platforms need a way to discover, and express different
> security requirements of different devices.

Sure. PLATFORMS need that. Do not let it go anywhere near your device
drivers. Including the virtio drivers.

> If they continue to lack that, we'll need a custom API in virtio,
> and while this seems a bit less elegant, I would not see that as
> the end of the world at all, there are not that many virtio drivers.

No. If they continue to lack that, we fix them. This is a *platform*
issue. The DMA API shall do the right thing. Do not second-guess it.


 (From the other mail)
> > > OK so I guess that means we should prefer a transport-specific
> > > interface in virtio-pci then.
> >
> > Why?
> 
> Because you said you are doing something device tree specific for 
> ARM, aren't you?

Nonono. The ARM platform code might do that, and the DMA API on ARM
*might* give you I/O virtual addresses that look a lot like the
physical addresses you asked it to map. That's none of your business.
Drivers use DMA API. No more talky.

-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


[PATCH v5 16/33] pc-dimm: rename pc-dimm.c and pc-dimm.h

2015-10-28 Thread Xiao Guangrong
Rename:
   pc-dimm.c => dimm.c
   pc-dimm.h => dimm.h

It prepares the work which abstracts dimm device type for both pc-dimm and
nvdimm

Signed-off-by: Xiao Guangrong 
---
 hw/Makefile.objs | 2 +-
 hw/acpi/ich9.c   | 2 +-
 hw/acpi/memory_hotplug.c | 4 ++--
 hw/acpi/piix4.c  | 2 +-
 hw/i386/pc.c | 2 +-
 hw/mem/Makefile.objs | 2 +-
 hw/mem/{pc-dimm.c => dimm.c} | 2 +-
 hw/ppc/spapr.c   | 2 +-
 include/hw/i386/pc.h | 2 +-
 include/hw/mem/{pc-dimm.h => dimm.h} | 0
 include/hw/ppc/spapr.h   | 2 +-
 numa.c   | 2 +-
 qmp.c| 2 +-
 stubs/qmp_dimm_device_list.c | 2 +-
 14 files changed, 14 insertions(+), 14 deletions(-)
 rename hw/mem/{pc-dimm.c => dimm.c} (99%)
 rename include/hw/mem/{pc-dimm.h => dimm.h} (100%)

diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index 7e7c241..12ecda9 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -30,8 +30,8 @@ devices-dirs-$(CONFIG_SOFTMMU) += vfio/
 devices-dirs-$(CONFIG_VIRTIO) += virtio/
 devices-dirs-$(CONFIG_SOFTMMU) += watchdog/
 devices-dirs-$(CONFIG_SOFTMMU) += xen/
-devices-dirs-$(CONFIG_MEM_HOTPLUG) += mem/
 devices-dirs-$(CONFIG_SMBIOS) += smbios/
+devices-dirs-y += mem/
 devices-dirs-y += core/
 common-obj-y += $(devices-dirs-y)
 obj-y += $(devices-dirs-y)
diff --git a/hw/acpi/ich9.c b/hw/acpi/ich9.c
index b0d6a67..1e9ae20 100644
--- a/hw/acpi/ich9.c
+++ b/hw/acpi/ich9.c
@@ -35,7 +35,7 @@
 #include "exec/address-spaces.h"
 
 #include "hw/i386/ich9.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 
 //#define DEBUG
 
diff --git a/hw/acpi/memory_hotplug.c b/hw/acpi/memory_hotplug.c
index 1f6..e232641 100644
--- a/hw/acpi/memory_hotplug.c
+++ b/hw/acpi/memory_hotplug.c
@@ -1,6 +1,6 @@
 #include "hw/acpi/memory_hotplug.h"
 #include "hw/acpi/pc-hotplug.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 #include "hw/boards.h"
 #include "hw/qdev-core.h"
 #include "trace.h"
@@ -148,7 +148,7 @@ static void acpi_memory_hotplug_write(void *opaque, hwaddr 
addr, uint64_t data,
 
 dev = DEVICE(mdev->dimm);
 hotplug_ctrl = qdev_get_hotplug_handler(dev);
-/* call pc-dimm unplug cb */
+/* call dimm unplug cb */
 hotplug_handler_unplug(hotplug_ctrl, dev, _err);
 if (local_err) {
 trace_mhp_acpi_dimm_delete_failed(mem_st->selector);
diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
index 0b2cb6e..b2f5b2c 100644
--- a/hw/acpi/piix4.c
+++ b/hw/acpi/piix4.c
@@ -33,7 +33,7 @@
 #include "hw/acpi/pcihp.h"
 #include "hw/acpi/cpu_hotplug.h"
 #include "hw/hotplug.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 #include "hw/acpi/memory_hotplug.h"
 #include "hw/acpi/acpi_dev_interface.h"
 #include "hw/xen/xen.h"
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index d8732f3..b8584d7 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -62,7 +62,7 @@
 #include "hw/boards.h"
 #include "hw/pci/pci_host.h"
 #include "acpi-build.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 #include "qapi/visitor.h"
 #include "qapi-visit.h"
 
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index b000fb4..7563ef5 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1 +1 @@
-common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
+common-obj-$(CONFIG_MEM_HOTPLUG) += dimm.o
diff --git a/hw/mem/pc-dimm.c b/hw/mem/dimm.c
similarity index 99%
rename from hw/mem/pc-dimm.c
rename to hw/mem/dimm.c
index 51f737f..6c1ea98 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/dimm.c
@@ -18,7 +18,7 @@
  * License along with this library; if not, see 
  */
 
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 #include "qemu/config-file.h"
 #include "qapi/visitor.h"
 #include "qemu/range.h"
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 4fb91a5..171fa77 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2138,7 +2138,7 @@ static void spapr_machine_device_plug(HotplugHandler 
*hotplug_dev,
  *
  * - Memory gets hotplugged to a different node than what the user
  *   specified.
- * - Since pc-dimm subsystem in QEMU still thinks that memory belongs
+ * - Since dimm subsystem in QEMU still thinks that memory belongs
  *   to memory-less node, a reboot will set things accordingly
  *   and the previously hotplugged memory now ends in the right node.
  *   This appears as if some memory moved from one node to another.
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index c5961d7..7dfb50f 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -16,7 +16,7 @@
 #include "hw/pci/pci.h"
 #include "hw/boards.h"
 #include "hw/compat.h"
-#include "hw/mem/pc-dimm.h"
+#include "hw/mem/dimm.h"
 
 #define HPET_INTCAP "hpet-intcap"
 
diff --git 

[PATCH v5 32/33] nvdimm acpi: support _FIT method

2015-10-28 Thread Xiao Guangrong
FIT buffer is not completely mapped into guest address space, so a new
function, Read FIT, function index 0x, is reserved by QEMU to
read the piece of FIT buffer. The buffer is concatenated before _FIT
return

Refer to docs/specs/acpi-nvdimm.txt for detailed design

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 168 +--
 1 file changed, 164 insertions(+), 4 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index c5a50ea..e1ae6c5 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -384,6 +384,18 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+/*
+ * define UUID for NVDIMM Root Device according to Chapter 3 DSM Interface
+ * for NVDIMM Root Device - Example in DSM Spec Rev1.
+ */
+#define NVDIMM_DSM_ROOT_UUID "2F10E7A4-9E91-11E4-89D3-123B93F75CBA"
+
+/*
+ * Read FIT Function, which is a QEMU internal use only function, more detail
+ * refer to docs/specs/acpi_nvdimm.txt
+ */
+#define NVDIMM_DSM_FUNC_READ_FIT 0x
+
 /* define NVDIMM DSM return status codes according to DSM Spec Rev1. */
 enum {
 /* Common return status codes. */
@@ -420,6 +432,11 @@ struct nvdimm_func_in_set_label_data {
 } QEMU_PACKED;
 typedef struct nvdimm_func_in_set_label_data nvdimm_func_in_set_label_data;
 
+struct nvdimm_func_in_read_fit {
+uint32_t offset; /* fit offset */
+} QEMU_PACKED;
+typedef struct nvdimm_func_in_read_fit nvdimm_func_in_read_fit;
+
 struct nvdimm_dsm_in {
 uint32_t handle;
 uint32_t revision;
@@ -429,6 +446,7 @@ struct nvdimm_dsm_in {
 uint8_t arg3[0];
 nvdimm_func_in_set_label_data func_set_label_data;
 nvdimm_func_in_get_label_data func_get_label_data;
+nvdimm_func_in_read_fit func_read_fit;
 };
 } QEMU_PACKED;
 typedef struct nvdimm_dsm_in nvdimm_dsm_in;
@@ -450,13 +468,71 @@ struct nvdimm_func_out_get_label_data {
 } QEMU_PACKED;
 typedef struct nvdimm_func_out_get_label_data nvdimm_func_out_get_label_data;
 
+struct nvdimm_func_out_read_fit {
+uint32_t status;/* return status code. */
+uint32_t length;/* the length of fit data we read. */
+uint8_t fit_data[0]; /* fit data. */
+} QEMU_PACKED;
+typedef struct nvdimm_func_out_read_fit nvdimm_func_out_read_fit;
+
 static void nvdimm_dsm_write_status(GArray *out, uint32_t status)
 {
 status = cpu_to_le32(status);
 build_append_int_noprefix(out, status, sizeof(status));
 }
 
-static void nvdimm_dsm_root(nvdimm_dsm_in *in, GArray *out)
+/* Build fit memory which is presented to guest via _FIT method. */
+static void nvdimm_build_fit(AcpiNVDIMMState *state)
+{
+if (!state->fit) {
+GSList *device_list = nvdimm_get_plugged_device_list();
+
+nvdimm_debug("Rebuild FIT...\n");
+state->fit = nvdimm_build_device_structure(device_list);
+g_slist_free(device_list);
+}
+}
+
+/* Read FIT data, defined in docs/specs/acpi_nvdimm.txt. */
+static void nvdimm_dsm_func_read_fit(AcpiNVDIMMState *state,
+ nvdimm_dsm_in *in, GArray *out)
+{
+nvdimm_func_in_read_fit *read_fit = >func_read_fit;
+nvdimm_func_out_read_fit fit_out;
+uint32_t read_length = getpagesize() - sizeof(nvdimm_func_out_read_fit);
+uint32_t status = NVDIMM_DSM_ROOT_DEV_STATUS_INVALID_PARAS;
+
+nvdimm_build_fit(state);
+
+le32_to_cpus(_fit->offset);
+
+nvdimm_debug("Read FIT offset %#x.\n", read_fit->offset);
+
+if (read_fit->offset > state->fit->len) {
+nvdimm_debug("offset %#x is beyond fit size (%#x).\n",
+ read_fit->offset, state->fit->len);
+goto exit;
+}
+
+read_length = MIN(read_length, state->fit->len - read_fit->offset);
+nvdimm_debug("read length %#x.\n", read_length);
+
+fit_out.status = cpu_to_le32(NVDIMM_DSM_STATUS_SUCCESS);
+fit_out.length = cpu_to_le32(read_length);
+g_array_append_vals(out, _out, sizeof(fit_out));
+
+if (read_length) {
+g_array_append_vals(out, state->fit->data + read_fit->offset,
+read_length);
+}
+return;
+
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
+static void nvdimm_dsm_root(AcpiNVDIMMState *state, nvdimm_dsm_in *in,
+GArray *out)
 {
 uint32_t status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 
@@ -475,6 +551,10 @@ static void nvdimm_dsm_root(nvdimm_dsm_in *in, GArray *out)
 return;
 }
 
+if (in->function == NVDIMM_DSM_FUNC_READ_FIT /* FIT Read */) {
+return nvdimm_dsm_func_read_fit(state, in, out);
+}
+
 nvdimm_debug("Return status %#x.\n", status);
 nvdimm_dsm_write_status(out, status);
 }
@@ -713,7 +793,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 
 /* Handle 0 is reserved for NVDIMM Root Device. */
 if (!in->handle) {
-nvdimm_dsm_root(in, out);
+nvdimm_dsm_root(state, in, out);

[PATCH v5 29/33] nvdimm acpi: support Get Namespace Label Data function

2015-10-28 Thread Xiao Guangrong
Function 5 is used to get Namespace Label Data

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 48 
 1 file changed, 48 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 72203d2..5b621ed 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -428,6 +428,7 @@ struct nvdimm_dsm_in {
 union {
 uint8_t arg3[0];
 nvdimm_func_in_set_label_data func_set_label_data;
+nvdimm_func_in_get_label_data func_get_label_data;
 };
 } QEMU_PACKED;
 typedef struct nvdimm_dsm_in nvdimm_dsm_in;
@@ -527,6 +528,50 @@ static void nvdimm_dsm_func_label_size(NVDIMMDevice 
*nvdimm, GArray *out)
 g_array_append_vals(out, _label_size, sizeof(func_label_size));
 }
 
+/*
+ * DSM Spec Rev1 4.5 Get Namespace Label Data (Function Index 5).
+ */
+static void nvdimm_dsm_func_get_label_data(NVDIMMDevice *nvdimm,
+   nvdimm_dsm_in *in, GArray *out)
+{
+NVDIMMClass *nvc = NVDIMM_GET_CLASS(nvdimm);
+nvdimm_func_in_get_label_data *get_label_data = >func_get_label_data;
+void *buf;
+uint32_t status = NVDIMM_DSM_STATUS_SUCCESS;
+
+le32_to_cpus(_label_data->offset);
+le32_to_cpus(_label_data->length);
+
+nvdimm_debug("Read Label Data: offset %#x length %#x.\n",
+ get_label_data->offset, get_label_data->length);
+
+if (nvdimm->label_size < get_label_data->offset + get_label_data->length) {
+nvdimm_debug("position %#x is beyond label data (len = %#lx).\n",
+ get_label_data->offset + get_label_data->length,
+ nvdimm->label_size);
+status = NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+goto exit;
+}
+
+if (get_label_data->length > nvdimm_get_max_xfer_label_size()) {
+nvdimm_debug("get length (%#x) is larger than max_xfer (%#x).\n",
+ get_label_data->length, nvdimm_get_max_xfer_label_size());
+status = NVDIMM_DSM_DEV_STATUS_INVALID_PARAS;
+goto exit;
+}
+
+/* write nvdimm_func_out_get_label_data.status. */
+nvdimm_dsm_write_status(out, status);
+/* write nvdimm_func_out_get_label_data.out_buf. */
+buf = acpi_data_push(out, get_label_data->length);
+nvc->read_label_data(nvdimm, buf, get_label_data->length,
+ get_label_data->offset);
+return;
+
+exit:
+nvdimm_dsm_write_status(out, status);
+}
+
 static void nvdimm_dsm_device(nvdimm_dsm_in *in, GArray *out)
 {
 GSList *list = nvdimm_get_plugged_device_list();
@@ -554,6 +599,9 @@ static void nvdimm_dsm_device(nvdimm_dsm_in *in, GArray 
*out)
 case 0x4 /* Get Namespace Label Size */:
 nvdimm_dsm_func_label_size(nvdimm, out);
 goto free;
+case 0x5 /* Get Namespace Label Data */:
+nvdimm_dsm_func_get_label_data(nvdimm, in, out);
+goto free;
 default:
 status = NVDIMM_DSM_STATUS_NOT_SUPPORTED;
 };
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 12/33] pc-dimm: remove DEFAULT_PC_DIMMSIZE

2015-10-28 Thread Xiao Guangrong
It's not used any more

Signed-off-by: Xiao Guangrong 
---
 include/hw/mem/pc-dimm.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/hw/mem/pc-dimm.h b/include/hw/mem/pc-dimm.h
index c1ee7b0..15590f1 100644
--- a/include/hw/mem/pc-dimm.h
+++ b/include/hw/mem/pc-dimm.h
@@ -20,8 +20,6 @@
 #include "sysemu/hostmem.h"
 #include "hw/qdev.h"
 
-#define DEFAULT_PC_DIMMSIZE (1024*1024*1024)
-
 #define TYPE_PC_DIMM "pc-dimm"
 #define PC_DIMM(obj) \
 OBJECT_CHECK(PCDIMMDevice, (obj), TYPE_PC_DIMM)
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 10/33] hostmem-file: clean up memory allocation

2015-10-28 Thread Xiao Guangrong
- hostmem-file.c is compiled only if CONFIG_LINUX is enabled so that is
  unnecessary to do the same check in the source file

- the interface, HostMemoryBackendClass->alloc(), is not called many
  times, do not need to check if the memory-region is initialized

Signed-off-by: Xiao Guangrong 
---
 backends/hostmem-file.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index e9b6d21..9097a57 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -46,17 +46,12 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error 
**errp)
 error_setg(errp, "mem-path property not set");
 return;
 }
-#ifndef CONFIG_LINUX
-error_setg(errp, "-mem-path not supported on this host");
-#else
-if (!memory_region_size(>mr)) {
-backend->force_prealloc = mem_prealloc;
-memory_region_init_ram_from_file(>mr, OBJECT(backend),
+
+backend->force_prealloc = mem_prealloc;
+memory_region_init_ram_from_file(>mr, OBJECT(backend),
  object_get_canonical_path(OBJECT(backend)),
  backend->size, fb->share,
  fb->mem_path, errp);
-}
-#endif
 }
 
 static void
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 25/33] nvdimm acpi: build ACPI nvdimm devices

2015-10-28 Thread Xiao Guangrong
NVDIMM devices is defined in ACPI 6.0 9.20 NVDIMM Devices

There is a root device under \_SB and specified NVDIMM devices are under the
root device. Each NVDIMM device has _ADR which returns its handle used to
associate MEMDEV structure in NFIT

We reserve handle 0 for root device. In this patch, we save handle, handle,
arg1 and arg2 to dsm memory. Arg3 is conditionally saved in later patch

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c | 184 +++
 1 file changed, 184 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 3fc82e1..8412be3 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -368,6 +368,15 @@ static void nvdimm_build_nfit(GSList *device_list, GArray 
*table_offsets,
 g_array_free(structures, true);
 }
 
+struct nvdimm_dsm_in {
+uint32_t handle;
+uint32_t revision;
+uint32_t function;
+   /* the remaining size in the page is used by arg3. */
+uint8_t arg3[0];
+} QEMU_PACKED;
+typedef struct nvdimm_dsm_in nvdimm_dsm_in;
+
 static uint64_t
 nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 {
@@ -377,6 +386,7 @@ nvdimm_dsm_read(void *opaque, hwaddr addr, unsigned size)
 static void
 nvdimm_dsm_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
 {
+fprintf(stderr, "BUG: we never write DSM notification IO Port.\n");
 }
 
 static const MemoryRegionOps nvdimm_dsm_ops = {
@@ -402,6 +412,179 @@ void nvdimm_init_acpi_state(MemoryRegion *memory, 
MemoryRegion *io,
 memory_region_add_subregion(io, NVDIMM_ACPI_IO_BASE, >io_mr);
 }
 
+#define BUILD_STA_METHOD(_dev_, _method_)  \
+do {   \
+_method_ = aml_method("_STA", 0);  \
+aml_append(_method_, aml_return(aml_int(0x0f)));   \
+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define BUILD_DSM_METHOD(_dev_, _method_, _handle_, _uuid_)\
+do {   \
+Aml *ifctx, *uuid; \
+_method_ = aml_method("_DSM", 4);  \
+/* check UUID if it is we expect, return the errorcode if not.*/   \
+uuid = aml_touuid(_uuid_); \
+ifctx = aml_if(aml_lnot(aml_equal(aml_arg(0), uuid))); \
+aml_append(ifctx, aml_return(aml_int(1 /* Not Supported */))); \
+aml_append(method, ifctx); \
+aml_append(method, aml_return(aml_call4("NCAL", aml_int(_handle_), \
+   aml_arg(1), aml_arg(2), aml_arg(3;  \
+aml_append(_dev_, _method_);   \
+} while (0)
+
+#define BUILD_FIELD_UNIT_SIZE(_field_, _byte_, _name_) \
+aml_append(_field_, aml_named_field(_name_, (_byte_) * BITS_PER_BYTE))
+
+#define BUILD_FIELD_UNIT_STRUCT(_field_, _s_, _f_, _name_) \
+BUILD_FIELD_UNIT_SIZE(_field_, sizeof(typeof_field(_s_, _f_)), _name_)
+
+static void build_nvdimm_devices(GSList *device_list, Aml *root_dev)
+{
+for (; device_list; device_list = device_list->next) {
+NVDIMMDevice *nvdimm = device_list->data;
+int slot = object_property_get_int(OBJECT(nvdimm), DIMM_SLOT_PROP,
+   NULL);
+uint32_t handle = nvdimm_slot_to_handle(slot);
+Aml *dev, *method;
+
+dev = aml_device("NV%02X", slot);
+aml_append(dev, aml_name_decl("_ADR", aml_int(handle)));
+
+BUILD_STA_METHOD(dev, method);
+
+/*
+ * Chapter 4: _DSM Interface for NVDIMM Device (non-root) - Example
+ * in DSM Spec Rev1.
+ */
+BUILD_DSM_METHOD(dev, method,
+ handle /* NVDIMM Device Handle */,
+ "4309AC30-0D11-11E4-9191-0800200C9A66"
+ /* UUID for NVDIMM Devices. */);
+
+aml_append(root_dev, dev);
+}
+}
+
+static void nvdimm_build_acpi_devices(GSList *device_list, Aml *sb_scope)
+{
+Aml *dev, *method, *field;
+uint64_t page_size = getpagesize();
+
+dev = aml_device("NVDR");
+aml_append(dev, aml_name_decl("_HID", aml_string("ACPI0012")));
+
+/* map DSM memory and IO into ACPI namespace. */
+aml_append(dev, aml_operation_region("NPIO", AML_SYSTEM_IO,
+   NVDIMM_ACPI_IO_BASE, NVDIMM_ACPI_IO_LEN));
+aml_append(dev, aml_operation_region("NRAM", AML_SYSTEM_MEMORY,
+   NVDIMM_ACPI_MEM_BASE, page_size));
+
+/*
+ * DSM notifier:
+ * @NOTI: Read it will notify QEMU that _DSM method is being
+ *called and the parameters can be found in nvdimm_dsm_in.
+ *The value 

[PATCH v5 08/33] exec: allow memory to be allocated from any kind of path

2015-10-28 Thread Xiao Guangrong
Currently file_ram_alloc() is designed for hugetlbfs, however, the memory
of nvdimm can come from either raw pmem device eg, /dev/pmem, or the file
locates at DAX enabled filesystem

So this patch let it work on any kind of path

Signed-off-by: Xiao Guangrong 
---
 exec.c | 56 +---
 1 file changed, 17 insertions(+), 39 deletions(-)

diff --git a/exec.c b/exec.c
index 4505dc7..d2a3357 100644
--- a/exec.c
+++ b/exec.c
@@ -1157,32 +1157,6 @@ void qemu_mutex_unlock_ramlist(void)
 }
 
 #ifdef __linux__
-
-#include 
-
-#define HUGETLBFS_MAGIC   0x958458f6
-
-static long gethugepagesize(const char *path, Error **errp)
-{
-struct statfs fs;
-int ret;
-
-do {
-ret = statfs(path, );
-} while (ret != 0 && errno == EINTR);
-
-if (ret != 0) {
-error_setg_errno(errp, errno, "failed to get page size of file %s",
- path);
-return 0;
-}
-
-if (fs.f_type != HUGETLBFS_MAGIC)
-fprintf(stderr, "Warning: path not on HugeTLBFS: %s\n", path);
-
-return fs.f_bsize;
-}
-
 static void *file_ram_alloc(RAMBlock *block,
 ram_addr_t memory,
 const char *path,
@@ -1193,20 +1167,24 @@ static void *file_ram_alloc(RAMBlock *block,
 char *c;
 void *area;
 int fd;
-uint64_t hpagesize;
-Error *local_err = NULL;
+uint64_t pagesize;
 
-hpagesize = gethugepagesize(path, _err);
-if (local_err) {
-error_propagate(errp, local_err);
+pagesize = qemu_file_get_page_size(path);
+if (!pagesize) {
+error_setg(errp, "can't get page size for %s", path);
 goto error;
 }
-block->mr->align = hpagesize;
 
-if (memory < hpagesize) {
+if (pagesize == getpagesize()) {
+fprintf(stderr, "Memory is not allocated from HugeTlbfs.\n");
+}
+
+block->mr->align = pagesize;
+
+if (memory < pagesize) {
 error_setg(errp, "memory size 0x" RAM_ADDR_FMT " must be equal to "
-   "or larger than huge page size 0x%" PRIx64,
-   memory, hpagesize);
+   "or larger than page size 0x%" PRIx64,
+   memory, pagesize);
 goto error;
 }
 
@@ -1230,14 +1208,14 @@ static void *file_ram_alloc(RAMBlock *block,
 fd = mkstemp(filename);
 if (fd < 0) {
 error_setg_errno(errp, errno,
- "unable to create backing store for hugepages");
+ "unable to create backing store for path %s", path);
 g_free(filename);
 goto error;
 }
 unlink(filename);
 g_free(filename);
 
-memory = ROUND_UP(memory, hpagesize);
+memory = ROUND_UP(memory, pagesize);
 
 /*
  * ftruncate is not supported by hugetlbfs in older
@@ -1249,10 +1227,10 @@ static void *file_ram_alloc(RAMBlock *block,
 perror("ftruncate");
 }
 
-area = qemu_ram_mmap(fd, memory, hpagesize, block->flags & RAM_SHARED);
+area = qemu_ram_mmap(fd, memory, pagesize, block->flags & RAM_SHARED);
 if (area == MAP_FAILED) {
 error_setg_errno(errp, errno,
- "unable to map backing store for hugepages");
+ "unable to map backing store for path %s", path);
 close(fd);
 goto error;
 }
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 24/33] nvdimm acpi: build ACPI NFIT table

2015-10-28 Thread Xiao Guangrong
NFIT is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)

Currently, we only support PMEM mode. Each device has 3 structures:
- SPA structure, defines the PMEM region info

- MEM DEV structure, it has the @handle which is used to associate specified
  ACPI NVDIMM  device we will introduce in later patch.
  Also we can happily ignored the memory device's interleave, the real
  nvdimm hardware access is hidden behind host

- DCR structure, it defines vendor ID used to associate specified vendor
  nvdimm driver. Since we only implement PMEM mode this time, Command
  window and Data window are not needed

Signed-off-by: Xiao Guangrong 
---
 hw/acpi/nvdimm.c| 355 
 hw/i386/acpi-build.c|   6 +
 include/hw/mem/nvdimm.h |  10 ++
 3 files changed, 371 insertions(+)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 647a5dd..3fc82e1 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -26,8 +26,348 @@
  * License along with this library; if not, see 
  */
 
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/aml-build.h"
 #include "hw/mem/nvdimm.h"
 
+static int nvdimm_plugged_device_list(Object *obj, void *opaque)
+{
+GSList **list = opaque;
+
+if (object_dynamic_cast(obj, TYPE_NVDIMM)) {
+NVDIMMDevice *nvdimm = NVDIMM(obj);
+
+if (memory_region_is_mapped(>nvdimm_mr)) {
+*list = g_slist_append(*list, DEVICE(obj));
+}
+}
+
+object_child_foreach(obj, nvdimm_plugged_device_list, opaque);
+return 0;
+}
+
+/*
+ * inquire plugged NVDIMM devices and link them into the list which is
+ * returned to the caller.
+ *
+ * Note: it is the caller's responsibility to free the list to avoid
+ * memory leak.
+ */
+static GSList *nvdimm_get_plugged_device_list(void)
+{
+GSList *list = NULL;
+
+object_child_foreach(qdev_get_machine(), nvdimm_plugged_device_list,
+ );
+return list;
+}
+
+#define NVDIMM_UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7) \
+   { (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
+ (b) & 0xff, ((b) >> 8) & 0xff, (c) & 0xff, ((c) >> 8) & 0xff,  \
+ (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }
+/*
+ * define Byte Addressable Persistent Memory (PM) Region according to
+ * ACPI 6.0: 5.2.25.1 System Physical Address Range Structure.
+ */
+static const uint8_t nvdimm_nfit_spa_uuid_pm[] =
+  NVDIMM_UUID_LE(0x66f0d379, 0xb4f3, 0x4074, 0xac, 0x43, 0x0d, 0x33,
+ 0x18, 0xb7, 0x8c, 0xdb);
+
+/*
+ * NVDIMM Firmware Interface Table
+ * @signature: "NFIT"
+ *
+ * It provides information that allows OSPM to enumerate NVDIMM present in
+ * the platform and associate system physical address ranges created by the
+ * NVDIMMs.
+ *
+ * It is defined in ACPI 6.0: 5.2.25 NVDIMM Firmware Interface Table (NFIT)
+ */
+struct nvdimm_nfit {
+ACPI_TABLE_HEADER_DEF
+uint32_t reserved;
+} QEMU_PACKED;
+typedef struct nvdimm_nfit nvdimm_nfit;
+
+/*
+ * define NFIT structures according to ACPI 6.0: 5.2.25 NVDIMM Firmware
+ * Interface Table (NFIT).
+ */
+
+/*
+ * System Physical Address Range Structure
+ *
+ * It describes the system physical address ranges occupied by NVDIMMs and
+ * the types of the regions.
+ */
+struct nvdimm_nfit_spa {
+uint16_t type;
+uint16_t length;
+uint16_t spa_index;
+uint16_t flags;
+uint32_t reserved;
+uint32_t proximity_domain;
+uint8_t type_guid[16];
+uint64_t spa_base;
+uint64_t spa_length;
+uint64_t mem_attr;
+} QEMU_PACKED;
+typedef struct nvdimm_nfit_spa nvdimm_nfit_spa;
+
+/*
+ * Memory Device to System Physical Address Range Mapping Structure
+ *
+ * It enables identifying each NVDIMM region and the corresponding SPA
+ * describing the memory interleave
+ */
+struct nvdimm_nfit_memdev {
+uint16_t type;
+uint16_t length;
+uint32_t nfit_handle;
+uint16_t phys_id;
+uint16_t region_id;
+uint16_t spa_index;
+uint16_t dcr_index;
+uint64_t region_len;
+uint64_t region_offset;
+uint64_t region_dpa;
+uint16_t interleave_index;
+uint16_t interleave_ways;
+uint16_t flags;
+uint16_t reserved;
+} QEMU_PACKED;
+typedef struct nvdimm_nfit_memdev nvdimm_nfit_memdev;
+
+/*
+ * NVDIMM Control Region Structure
+ *
+ * It describes the NVDIMM and if applicable, Block Control Window.
+ */
+struct nvdimm_nfit_dcr {
+uint16_t type;
+uint16_t length;
+uint16_t dcr_index;
+uint16_t vendor_id;
+uint16_t device_id;
+uint16_t revision_id;
+uint16_t sub_vendor_id;
+uint16_t sub_device_id;
+uint16_t sub_revision_id;
+uint8_t reserved[6];
+uint32_t serial_number;
+uint16_t fic;
+uint16_t num_bcw;
+uint64_t bcw_size;
+uint64_t cmd_offset;
+uint64_t cmd_size;
+uint64_t status_offset;
+uint64_t status_size;
+uint16_t flags;
+uint8_t reserved2[6];

[PATCH v5 17/33] dimm: abstract dimm device from pc-dimm

2015-10-28 Thread Xiao Guangrong
A base device, dimm, is abstracted from pc-dimm, so that we can
build nvdimm device based on dimm in the later patch

Signed-off-by: Xiao Guangrong 
---
 default-configs/i386-softmmu.mak   |  1 +
 default-configs/ppc64-softmmu.mak  |  1 +
 default-configs/x86_64-softmmu.mak |  1 +
 hw/mem/Makefile.objs   |  3 ++-
 hw/mem/dimm.c  | 11 ++---
 hw/mem/pc-dimm.c   | 46 ++
 include/hw/mem/dimm.h  |  4 ++--
 include/hw/mem/pc-dimm.h   |  7 ++
 8 files changed, 62 insertions(+), 12 deletions(-)
 create mode 100644 hw/mem/pc-dimm.c
 create mode 100644 include/hw/mem/pc-dimm.h

diff --git a/default-configs/i386-softmmu.mak b/default-configs/i386-softmmu.mak
index 43c96d1..3ece8bb 100644
--- a/default-configs/i386-softmmu.mak
+++ b/default-configs/i386-softmmu.mak
@@ -18,6 +18,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_X86_ICH=y
+CONFIG_DIMM=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
diff --git a/default-configs/ppc64-softmmu.mak 
b/default-configs/ppc64-softmmu.mak
index e77cb1a..a1a9c36 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -53,3 +53,4 @@ CONFIG_XICS_KVM=$(and $(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_MC146818RTC=y
 CONFIG_ISA_TESTDEV=y
 CONFIG_MEM_HOTPLUG=y
+CONFIG_DIMM=y
diff --git a/default-configs/x86_64-softmmu.mak 
b/default-configs/x86_64-softmmu.mak
index dfb8095..92ea7c1 100644
--- a/default-configs/x86_64-softmmu.mak
+++ b/default-configs/x86_64-softmmu.mak
@@ -18,6 +18,7 @@ CONFIG_FDC=y
 CONFIG_ACPI=y
 CONFIG_ACPI_X86=y
 CONFIG_ACPI_X86_ICH=y
+CONFIG_DIMM=y
 CONFIG_ACPI_MEMORY_HOTPLUG=y
 CONFIG_ACPI_CPU_HOTPLUG=y
 CONFIG_APM=y
diff --git a/hw/mem/Makefile.objs b/hw/mem/Makefile.objs
index 7563ef5..cebb4b1 100644
--- a/hw/mem/Makefile.objs
+++ b/hw/mem/Makefile.objs
@@ -1 +1,2 @@
-common-obj-$(CONFIG_MEM_HOTPLUG) += dimm.o
+common-obj-$(CONFIG_DIMM) += dimm.o
+common-obj-$(CONFIG_MEM_HOTPLUG) += pc-dimm.o
diff --git a/hw/mem/dimm.c b/hw/mem/dimm.c
index 6c1ea98..23d5daa 100644
--- a/hw/mem/dimm.c
+++ b/hw/mem/dimm.c
@@ -1,5 +1,5 @@
 /*
- * Dimm device for Memory Hotplug
+ * Dimm device abstraction
  *
  * Copyright ProfitBricks GmbH 2012
  * Copyright (C) 2014 Red Hat Inc
@@ -432,21 +432,13 @@ static void dimm_realize(DeviceState *dev, Error **errp)
 }
 }
 
-static MemoryRegion *dimm_get_memory_region(DIMMDevice *dimm)
-{
-return host_memory_backend_get_memory(dimm->hostmem, _abort);
-}
-
 static void dimm_class_init(ObjectClass *oc, void *data)
 {
 DeviceClass *dc = DEVICE_CLASS(oc);
-DIMMDeviceClass *ddc = DIMM_CLASS(oc);
 
 dc->realize = dimm_realize;
 dc->props = dimm_properties;
 dc->desc = "DIMM memory module";
-
-ddc->get_memory_region = dimm_get_memory_region;
 }
 
 static TypeInfo dimm_info = {
@@ -456,6 +448,7 @@ static TypeInfo dimm_info = {
 .instance_init = dimm_init,
 .class_init= dimm_class_init,
 .class_size= sizeof(DIMMDeviceClass),
+.abstract  = true,
 };
 
 static void dimm_register_types(void)
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
new file mode 100644
index 000..38323e9
--- /dev/null
+++ b/hw/mem/pc-dimm.c
@@ -0,0 +1,46 @@
+/*
+ * Dimm device for Memory Hotplug
+ *
+ * Copyright ProfitBricks GmbH 2012
+ * Copyright (C) 2014 Red Hat Inc
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see 
+ */
+
+#include "hw/mem/pc-dimm.h"
+
+static MemoryRegion *pc_dimm_get_memory_region(DIMMDevice *dimm)
+{
+return host_memory_backend_get_memory(dimm->hostmem, _abort);
+}
+
+static void pc_dimm_class_init(ObjectClass *oc, void *data)
+{
+DIMMDeviceClass *ddc = DIMM_CLASS(oc);
+
+ddc->get_memory_region = pc_dimm_get_memory_region;
+}
+
+static TypeInfo pc_dimm_info = {
+.name  = TYPE_PC_DIMM,
+.parent= TYPE_DIMM,
+.class_init= pc_dimm_class_init,
+};
+
+static void pc_dimm_register_types(void)
+{
+type_register_static(_dimm_info);
+}
+
+type_init(pc_dimm_register_types)
diff --git a/include/hw/mem/dimm.h b/include/hw/mem/dimm.h
index 5ddbf08..84a62ed 100644
--- a/include/hw/mem/dimm.h
+++ b/include/hw/mem/dimm.h
@@ -1,5 +1,5 @@
 /*
- * PC DIMM device
+ * Dimm device abstraction
  *
  * Copyright ProfitBricks GmbH 2012
  * 

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 16:22 +0200, Michael S. Tsirkin wrote:
> On Wed, Oct 28, 2015 at 11:13:29PM +0900, David Woodhouse wrote:
> > On Wed, 2015-10-28 at 16:05 +0200, Michael S. Tsirkin wrote:
> > > 
> > > Short answer - platforms need a way to discover, and express
> > > different
> > > security requirements of different devices.
> > 
> > Sure. PLATFORMS need that. Do not let it go anywhere near your
> > device
> > drivers. Including the virtio drivers.
> 
> But would there be any users of this outside the virtio subsystem?
> If no, maybe virtio core is a logical place to keep this.

Users of what? DMA API ops which basically do nothing? Sure — there are
*plenty* of cases where there isn't actually an IOMMU in active use and
the DMA API just returns the same address it was given.

Obviously that happens in platforms without an IOMMU, but it also
happens in cases where an IOMMU exists but is in passthrough mode, and
it also happens in cases where an IOMMU exists somewhere in the system
but only translates for *other* devices.

In all cases, drivers must just use the DMA API and *it* is responsible
for doing the right thing.

> I don't have a problem with extending DMA API to address
> more usecases.

No, this isn't an extension. This is fixing a bug, on certain platforms
where the DMA API has currently done the wrong thing.

We have historically worked around that bug by introducing *another*
bug, which is not to *use* the DMA API in the virtio driver.

Sure, we can co-ordinate those two bug-fixes. But let's not talk about
them as anything other than bug-fixes.

> > Drivers use DMA API. No more talky.
> 
> Well for virtio they don't ATM. And 1:1 mapping makes perfect sense
> for the wast majority of users, so I can't switch them over
> until the DMA API actually addresses all existing usecases.

That's still not your business; it's the platform's. And there are
hardware implementations of the virtio protocols on real PCI cards. And
we have the option of doing IOMMU translation for the virtio devices
even in a virtual machine. Just don't get involved.

-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] KVM: x86: fix eflags state following processor init/reset

2015-10-28 Thread Nadav Amit
Here are my 5 cents. Note that vmx_vcpu_reset calls:

vmcs_writel(GUEST_RFLAGS, 0x02);

(And the RFLAGS value is not cached by KVM, so no consistency problem should
occur.)

You may want to change the value into constant or call a wrapper function
for setting RFLAGS, but I don’t see something broken in the functionality.

Regards,
Nadav

Wanpeng Li  wrote:

> Ping, :-)
> On 10/21/15 2:28 PM, Wanpeng Li wrote:
>> Reference SDM 3.4.3:
>> 
>> Following initialization of the processor (either by asserting the
>> RESET pin or the INIT pin), the state of the EFLAGS register is
>> 0002H.
>> 
>> However, the eflags fixed bit is not set and other bits are also not
>> cleared during the init/reset in kvm.
>> 
>> This patch fix it by set eflags register to 0002H following
>> initialization of the processor.
>> 
>> Signed-off-by: Wanpeng Li 
>> ---
>>  arch/x86/kvm/vmx.c | 1 +
>>  1 file changed, 1 insertion(+)
>> 
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index b680c2e..326f6ea 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -4935,6 +4935,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool 
>> init_event)
>>  vmx_set_efer(vcpu, 0);
>>  vmx_fpu_activate(vcpu);
>>  update_exception_bitmap(vcpu);
>> +vmx_set_rflags(vcpu, X86_EFLAGS_FIXED);
>>  vpid_sync_context(vmx->vpid);
>>  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] virtio_ring: Support DMA APIs

2015-10-28 Thread Christian Borntraeger
Am 28.10.2015 um 15:04 schrieb David Woodhouse:
> On Wed, 2015-10-28 at 14:52 +0900, Christian Borntraeger wrote:
>>0059b25a: e3201024   lg  %r2,32(%r1)
>>   #0059b260: e310b0a4   lg  %r1,160(%r11)
>>   >0059b266: 4810100c   lh  %r1,12(%r1)
> 
> Precisely what is that? Is that loading ->archdata.dev_ops (offset 160)
> from a device and then trying to find the ->unmap_page method therein?
> 

qemu 2.4 works, and current qemu master has a broken gdb serverI am going 
to fix
this first.
What I can tell, that the problem goes away if I disable virtio 1.0 so I assume
that things are fine when Andy sends a v4 with the endianess fixes.

Christian

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] s390/virtio: use noop dma ops

2015-10-28 Thread Cornelia Huck
On Wed, 28 Oct 2015 09:43:34 +0900
Joerg Roedel  wrote:

> On Tue, Oct 27, 2015 at 11:48:51PM +0100, Christian Borntraeger wrote:
> > @@ -1093,6 +1094,7 @@ static void virtio_ccw_auto_online(void *data, 
> > async_cookie_t cookie)
> > struct ccw_device *cdev = data;
> > int ret;
> >  
> > +   cdev->dev.archdata.dma_ops = _noop_ops;
> > ret = ccw_device_set_online(cdev);
> > if (ret)
> > dev_warn(>dev, "Failed to set online: %d\n", ret);
> 
> Hmm, drivers usually don't deal with setting the dma_ops for their
> devices, as they depend on the platform and not so much on the device
> itself.
> 
> Can you do this special-case handling from device independent platform
> code, where you also setup dma_ops for other devices?

Hm, maybe at the bus level?

pci devices get s390_dma_ops, ccw devices get the noop dma ops (just a
bit of dead weight for non-virtio ccw devices, I guess).

The old style s390-virtio devices are the odd ones around, but I'd like
to invest the least time possible there to keep them going.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Woodhouse
On Sun, 2015-10-25 at 09:07 -0700, Shamir Rabinovitch wrote:
> 
> +
> +DMA_ATTR_IOMMU_BYPASS
> +-
> +
> > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.

Not entirely true. We have per-device dma_ops on a most architectures
already, and we were just talking about the need to add them to
POWER/SPARC too, because we need to avoid trying to use the IOMMU to
map virtio devices too.

As we look at that (and make the per-device dma_ops a generic thing
rather than per-arch), we should probably look at folding your case in
too.

-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


[PATCH v3 2/3] virtio_ring: Support DMA APIs

2015-10-28 Thread Andy Lutomirski
virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/Kconfig   |   2 +-
 drivers/virtio/virtio_ring.c | 187 +++
 tools/virtio/linux/dma-mapping.h |  17 
 3 files changed, 169 insertions(+), 37 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
-   depends on HAS_IOMEM
+   depends on HAS_IOMEM && HAS_DMA
select VIRTIO
---help---
 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..6962ea37ade0 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+   void *data; /* Data for callback. */
+   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
struct virtqueue vq;
 
/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
ktime_t last_add_time;
 #endif
 
-   /* Tokens for callbacks. */
-   void *data[];
+   /* Per-descriptor state. */
+   struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+   return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+  struct scatterlist *sg,
+  enum dma_data_direction direction)
+{
+   /*
+* We can't use dma_map_sg, because we don't use scatterlists in
+* the way it expects (we don't guarantee that the scatterlist
+* will exist for the lifetime of the mapping).
+*/
+   return dma_map_page(vring_dma_dev(vq),
+   sg_page(sg), sg->offset, sg->length,
+   direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+  void *cpu_addr, size_t size,
+  enum dma_data_direction direction)
+{
+   return dma_map_single(vring_dma_dev(vq),
+ cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+   struct vring_desc *desc)
+{
+   u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 unsigned int total_sg, gfp_t gfp)
 {
@@ -131,7 +198,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
struct 

[PATCH v3 3/3] virtio_pci: Use the DMA API

2015-10-28 Thread Andy Lutomirski
This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++
 drivers/virtio/virtio_pci_modern.c | 34 --
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
/* the number of entries in the queue */
int num;
 
-   /* the virtual address of the ring queue */
+   /* the ring queue */
void *queue;
+   dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
info->msix_vector = msix_vec;
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+   info->queue = dma_zalloc_coherent(_dev->pci_dev->dev, size,
+ >queue_dma_addr,
+ GFP_KERNEL);
if (info->queue == NULL)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
vring_del_virtqueue(vq);
 out_activate_queue:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device 
*vp_dev)
return -ENODEV;
}
 
+   rc = dma_set_mask_and_coherent(_dev->dev, DMA_BIT_MASK(64));
+   if (rc)
+   rc = dma_set_mask_and_coherent(_dev->dev,
+   DMA_BIT_MASK(32));
+   if (rc)
+   dev_warn(_dev->dev, "Failed to enable 64-bit or 32-bit DMA. 
 Trying to continue, but this might not work.\n");
+
rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
if (rc)
return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+  int *num, dma_addr_t *dma_addr)
 {
void *pages;
 
/* TODO: allocate each queue chunk individually */
for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-   pages = alloc_pages_exact(vring_pci_size(*num),
- GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+   pages = dma_zalloc_coherent(
+   _dev->pci_dev->dev, vring_pci_size(*num),
+   dma_addr, GFP_KERNEL|__GFP_NOWARN);
if (pages)
return pages;
}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
return NULL;
 
/* Try to get a single page. You are my only hope! */
-   return alloc_pages_exact(vring_pci_size(*num), GFP_KERNEL|__GFP_ZERO);
+   return dma_zalloc_coherent(
+   _dev->pci_dev->dev, 

[PATCH v3 1/3] virtio_net: Stop doing DMA from the stack

2015-10-28 Thread Andy Lutomirski
From: Andy Lutomirski 

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

Cc: "Michael S. Tsirkin" 
Cc: virtualizat...@lists.linux-foundation.org
Reviewed-by: Joerg Roedel 
Signed-off-by: Andy Lutomirski 
---

Hi Michael and DaveM-

This is a prerequisite for the virtio DMA fixing project.  It works
as a standalone patch, though.  Would it make sense to apply it to
an appropriate networking tree now?

(This is unchanged from v2.)

drivers/net/virtio_net.c | 53 
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..4f10f8a58811 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
 struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
-   struct virtio_net_ctrl_hdr ctrl;
-   virtio_net_ctrl_ack status = ~0;
+
+   struct {
+   struct virtio_net_ctrl_hdr ctrl;
+   virtio_net_ctrl_ack status;
+   } *buf;
+
unsigned out_num = 0, tmp;
+   bool ret;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-   ctrl.class = class;
-   ctrl.cmd = cmd;
+   buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+   if (!buf)
+   return false;
+   buf->status = ~0;
+
+   buf->ctrl.class = class;
+   buf->ctrl.cmd = cmd;
/* Add header */
-   sg_init_one(, , sizeof(ctrl));
+   sg_init_one(, >ctrl, sizeof(buf->ctrl));
sgs[out_num++] = 
 
if (out)
sgs[out_num++] = out;
 
/* Add return status. */
-   sg_init_one(, , sizeof(status));
+   sg_init_one(, >status, sizeof(buf->status));
sgs[out_num] = 
 
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-   if (unlikely(!virtqueue_kick(vi->cvq)))
-   return status == VIRTIO_NET_OK;
+   if (unlikely(!virtqueue_kick(vi->cvq))) {
+   ret = (buf->status == VIRTIO_NET_OK);
+   goto out;
+   }
 
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info 
*vi, u8 class, u8 cmd,
   !virtqueue_is_broken(vi->cvq))
cpu_relax();
 
-   return status == VIRTIO_NET_OK;
+   ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+   kfree(buf);
+   return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg[2];
-   u8 promisc, allmulti;
+   u8 *cmdbyte;
struct virtio_net_ctrl_mac *mac_data;
struct netdev_hw_addr *ha;
int uc_count;
@@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
-   promisc = ((dev->flags & IFF_PROMISC) != 0);
-   allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+   cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+   if (!cmdbyte)
+   return;
 
-   sg_init_one(sg, , sizeof(promisc));
+   sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+   *cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(>dev, "Failed to %sable promisc mode.\n",
-promisc ? "en" : "dis");
-
-   sg_init_one(sg, , sizeof(allmulti));
+*cmdbyte ? "en" : "dis");
 
+   *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(>dev, "Failed to %sable allmulti mode.\n",
-allmulti ? "en" : "dis");
+*cmdbyte ? "en" : "dis");
+
+   kfree(cmdbyte);
 
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Andy Lutomirski
This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it seems to work on x86 using virtio-pci, with
and without Xen, and using both the modern 1.0 variant and the
legacy variant.

Changes from v2:
 - Fix really embarrassing bug.  This version actually works.

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c   |  53 +++
 drivers/virtio/Kconfig |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 +--
 drivers/virtio/virtio_ring.c   | 187 ++---
 tools/virtio/linux/dma-mapping.h   |  17 
 7 files changed, 246 insertions(+), 69 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread David Woodhouse
On Tue, 2015-10-27 at 23:38 -0700, Andy Lutomirski wrote:
> 
> Changes from v2:
>  - Fix really embarrassing bug.  This version actually works.

So embarrassing you didn't want to tell us what it was? ...

--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -292,7 +292,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
vq, desc, total_sg * sizeof(struct vring_desc),
DMA_TO_DEVICE);
 
-   if (vring_mapping_error(vq, vq->vring.desc[head].addr))
+   if (vring_mapping_error(vq, addr))
goto unmap_release;
 
vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev, 
VRING_DESC_F_INDIRECT);

That wasn't going to be the reason for Christian's failure, was it?


-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH v2 1/3] virtio_net: Stop doing DMA from the stack

2015-10-28 Thread Michael S. Tsirkin
On Tue, Oct 27, 2015 at 10:30:19PM -0700, Andy Lutomirski wrote:
> From: Andy Lutomirski 
> 
> Once virtio starts using the DMA API, we won't be able to safely DMA
> from the stack.  virtio-net does a couple of config DMA requests
> from small stack buffers -- switch to using dynamically-allocated
> memory.
> 
> This should have no effect on any performance-critical code paths.
> 
> Cc: net...@vger.kernel.org
> Cc: "Michael S. Tsirkin" 
> Cc: virtualizat...@lists.linux-foundation.org
> Reviewed-by: Joerg Roedel 
> Signed-off-by: Andy Lutomirski 
> ---
> 
> Hi Michael and DaveM-
> 
> This is a prerequisite for the virtio DMA fixing project.  It works
> as a standalone patch, though.  Would it make sense to apply it to
> an appropriate networking tree now?
> 
>  drivers/net/virtio_net.c | 53 
> 
>  1 file changed, 36 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d8838dedb7a4..4f10f8a58811 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info 
> *vi, u8 class, u8 cmd,
>struct scatterlist *out)
>  {
>   struct scatterlist *sgs[4], hdr, stat;
> - struct virtio_net_ctrl_hdr ctrl;
> - virtio_net_ctrl_ack status = ~0;
> +
> + struct {
> + struct virtio_net_ctrl_hdr ctrl;
> + virtio_net_ctrl_ack status;
> + } *buf;
> +
>   unsigned out_num = 0, tmp;
> + bool ret;
>  
>   /* Caller should know better */
>   BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
>  
> - ctrl.class = class;
> - ctrl.cmd = cmd;
> + buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
> + if (!buf)
> + return false;

This is problematic. The command is never retried, the error
is propagated to userspace.

> + buf->status = ~0;
> +
> + buf->ctrl.class = class;
> + buf->ctrl.cmd = cmd;
>   /* Add header */
> - sg_init_one(, , sizeof(ctrl));
> + sg_init_one(, >ctrl, sizeof(buf->ctrl));
>   sgs[out_num++] = 
>  
>   if (out)
>   sgs[out_num++] = out;
>  
>   /* Add return status. */
> - sg_init_one(, , sizeof(status));
> + sg_init_one(, >status, sizeof(buf->status));
>   sgs[out_num] = 
>  
>   BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
>   virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
>  
> - if (unlikely(!virtqueue_kick(vi->cvq)))
> - return status == VIRTIO_NET_OK;
> + if (unlikely(!virtqueue_kick(vi->cvq))) {
> + ret = (buf->status == VIRTIO_NET_OK);
> + goto out;
> + }
>  
>   /* Spin for a response, the kick causes an ioport write, trapping
>* into the hypervisor, so the request should be handled immediately.
> @@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info 
> *vi, u8 class, u8 cmd,
>  !virtqueue_is_broken(vi->cvq))
>   cpu_relax();
>  
> - return status == VIRTIO_NET_OK;
> + ret = (buf->status == VIRTIO_NET_OK);
> +
> +out:
> + kfree(buf);
> + return ret;
>  }
>  
>  static int virtnet_set_mac_address(struct net_device *dev, void *p)
> @@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
>  {
>   struct virtnet_info *vi = netdev_priv(dev);
>   struct scatterlist sg[2];
> - u8 promisc, allmulti;
> + u8 *cmdbyte;
>   struct virtio_net_ctrl_mac *mac_data;
>   struct netdev_hw_addr *ha;
>   int uc_count;
> @@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device 
> *dev)
>   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
>   return;
>  
> - promisc = ((dev->flags & IFF_PROMISC) != 0);
> - allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
> + cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
> + if (!cmdbyte)
> + return;

Here the error is ignored, rx mode will be incorrect.
OTOH it looks like that's already the case.

>  
> - sg_init_one(sg, , sizeof(promisc));
> + sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
>  
> + *cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
>   if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
> VIRTIO_NET_CTRL_RX_PROMISC, sg))
>   dev_warn(>dev, "Failed to %sable promisc mode.\n",
> -  promisc ? "en" : "dis");
> -
> - sg_init_one(sg, , sizeof(allmulti));
> +  *cmdbyte ? "en" : "dis");
>  
> + *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
>   if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
> VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
>   dev_warn(>dev, "Failed to %sable allmulti mode.\n",
> -  allmulti ? "en" : "dis");
> + 

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Christian Borntraeger
Am 28.10.2015 um 16:17 schrieb Michael S. Tsirkin:
> On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
>> This switches virtio to use the DMA API unconditionally.  I'm sure
>> it breaks things, but it seems to work on x86 using virtio-pci, with
>> and without Xen, and using both the modern 1.0 variant and the
>> legacy variant.
> 
> I'm very glad to see work on this making progress.
> 
> I suspect we'll have to find a way to make this optional though, and
> keep doing the non-DMA API thing with old devices.  And I've been
> debating with myself whether a pci specific thing or a feature bit is
> preferable.
> 

We have discussed that at kernel summit. I will try to implement a dummy 
dma_ops for
s390 that does 1:1 mapping and Ben will look into doing some quirk to handle 
"old"
code in addition to also make it possible to mark devices as iommu bypass (IIRC,
via device tree, Ben?)

Christian



> Thoughts?
> 
>> Changes from v2:
>>  - Fix really embarrassing bug.  This version actually works.
>>
>> Changes from v1:
>>  - Fix an endian conversion error causing a BUG to hit.
>>  - Fix a DMA ordering issue (swiotlb=force works now).
>>  - Minor cleanups.
>>
>> Andy Lutomirski (3):
>>   virtio_net: Stop doing DMA from the stack
>>   virtio_ring: Support DMA APIs
>>   virtio_pci: Use the DMA API
>>
>>  drivers/net/virtio_net.c   |  53 +++
>>  drivers/virtio/Kconfig |   2 +-
>>  drivers/virtio/virtio_pci_common.h |   3 +-
>>  drivers/virtio/virtio_pci_legacy.c |  19 +++-
>>  drivers/virtio/virtio_pci_modern.c |  34 +--
>>  drivers/virtio/virtio_ring.c   | 187 
>> ++---
>>  tools/virtio/linux/dma-mapping.h   |  17 
>>  7 files changed, 246 insertions(+), 69 deletions(-)
>>  create mode 100644 tools/virtio/linux/dma-mapping.h
>>
>> -- 
>> 2.4.3
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] virtio_ring: Support DMA APIs

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 14:52 +0900, Christian Borntraeger wrote:
>0059b25a: e3201024   lg  %r2,32(%r1)
>   #0059b260: e310b0a4   lg  %r1,160(%r11)
>   >0059b266: 4810100c   lh  %r1,12(%r1)

Precisely what is that? Is that loading ->archdata.dev_ops (offset 160)
from a device and then trying to find the ->unmap_page method therein?

-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH v3 1/3] virtio_net: Stop doing DMA from the stack

2015-10-28 Thread Michael S. Tsirkin
On Tue, Oct 27, 2015 at 11:38:58PM -0700, Andy Lutomirski wrote:
> From: Andy Lutomirski 
> 
> Once virtio starts using the DMA API, we won't be able to safely DMA
> from the stack.  virtio-net does a couple of config DMA requests
> from small stack buffers -- switch to using dynamically-allocated
> memory.
> 
> This should have no effect on any performance-critical code paths.
> 
> Cc: "Michael S. Tsirkin" 
> Cc: virtualizat...@lists.linux-foundation.org
> Reviewed-by: Joerg Roedel 
> Signed-off-by: Andy Lutomirski 

Same issues as v2 (I only saw v3 now).
I've proposed an alternative patch.

> ---
> 
> Hi Michael and DaveM-
> 
> This is a prerequisite for the virtio DMA fixing project.  It works
> as a standalone patch, though.  Would it make sense to apply it to
> an appropriate networking tree now?
> 
> (This is unchanged from v2.)
> 
> drivers/net/virtio_net.c | 53 
>  1 file changed, 36 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d8838dedb7a4..4f10f8a58811 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info 
> *vi, u8 class, u8 cmd,
>struct scatterlist *out)
>  {
>   struct scatterlist *sgs[4], hdr, stat;
> - struct virtio_net_ctrl_hdr ctrl;
> - virtio_net_ctrl_ack status = ~0;
> +
> + struct {
> + struct virtio_net_ctrl_hdr ctrl;
> + virtio_net_ctrl_ack status;
> + } *buf;
> +
>   unsigned out_num = 0, tmp;
> + bool ret;
>  
>   /* Caller should know better */
>   BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
>  
> - ctrl.class = class;
> - ctrl.cmd = cmd;
> + buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
> + if (!buf)
> + return false;
> + buf->status = ~0;
> +
> + buf->ctrl.class = class;
> + buf->ctrl.cmd = cmd;
>   /* Add header */
> - sg_init_one(, , sizeof(ctrl));
> + sg_init_one(, >ctrl, sizeof(buf->ctrl));
>   sgs[out_num++] = 
>  
>   if (out)
>   sgs[out_num++] = out;
>  
>   /* Add return status. */
> - sg_init_one(, , sizeof(status));
> + sg_init_one(, >status, sizeof(buf->status));
>   sgs[out_num] = 
>  
>   BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
>   virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
>  
> - if (unlikely(!virtqueue_kick(vi->cvq)))
> - return status == VIRTIO_NET_OK;
> + if (unlikely(!virtqueue_kick(vi->cvq))) {
> + ret = (buf->status == VIRTIO_NET_OK);
> + goto out;
> + }
>  
>   /* Spin for a response, the kick causes an ioport write, trapping
>* into the hypervisor, so the request should be handled immediately.
> @@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info 
> *vi, u8 class, u8 cmd,
>  !virtqueue_is_broken(vi->cvq))
>   cpu_relax();
>  
> - return status == VIRTIO_NET_OK;
> + ret = (buf->status == VIRTIO_NET_OK);
> +
> +out:
> + kfree(buf);
> + return ret;
>  }
>  
>  static int virtnet_set_mac_address(struct net_device *dev, void *p)
> @@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
>  {
>   struct virtnet_info *vi = netdev_priv(dev);
>   struct scatterlist sg[2];
> - u8 promisc, allmulti;
> + u8 *cmdbyte;
>   struct virtio_net_ctrl_mac *mac_data;
>   struct netdev_hw_addr *ha;
>   int uc_count;
> @@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device 
> *dev)
>   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
>   return;
>  
> - promisc = ((dev->flags & IFF_PROMISC) != 0);
> - allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
> + cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
> + if (!cmdbyte)
> + return;
>  
> - sg_init_one(sg, , sizeof(promisc));
> + sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
>  
> + *cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
>   if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
> VIRTIO_NET_CTRL_RX_PROMISC, sg))
>   dev_warn(>dev, "Failed to %sable promisc mode.\n",
> -  promisc ? "en" : "dis");
> -
> - sg_init_one(sg, , sizeof(allmulti));
> +  *cmdbyte ? "en" : "dis");
>  
> + *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
>   if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
> VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
>   dev_warn(>dev, "Failed to %sable allmulti mode.\n",
> -  allmulti ? "en" : "dis");
> +  *cmdbyte ? "en" : "dis");
> +
> + kfree(cmdbyte);
>  
>   uc_count = netdev_uc_count(dev);
>   

Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 11:53 PM, David Woodhouse  wrote:
> On Tue, 2015-10-27 at 23:38 -0700, Andy Lutomirski wrote:
>>
>> Changes from v2:
>>  - Fix really embarrassing bug.  This version actually works.
>
> So embarrassing you didn't want to tell us what it was? ...

Shhh, it's a secret!

I somehow managed to test-boot a different kernel than I thought I was booting.

>
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -292,7 +292,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> vq, desc, total_sg * sizeof(struct vring_desc),
> DMA_TO_DEVICE);
>
> -   if (vring_mapping_error(vq, vq->vring.desc[head].addr))
> +   if (vring_mapping_error(vq, addr))
> goto unmap_release;
>
> vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev, 
> VRING_DESC_F_INDIRECT);
>
> That wasn't going to be the reason for Christian's failure, was it?
>

Not obviously, but it's possible.  Now that I'm staring at it, I have
some more big-endian issues, so there'll be a v4.  I'll also play with
Michael's thing.  Expect a long delay, though -- my flight's about to
leave.

The readme notwithstanding, virtme (https://github.com/amluto/virtme)
actually has s390x support, so I can try to debug when I get home.
I'm not about to try doing this on a laptop :)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Michael S. Tsirkin
On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
> This switches virtio to use the DMA API unconditionally.  I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.

I'm very glad to see work on this making progress.

I suspect we'll have to find a way to make this optional though, and
keep doing the non-DMA API thing with old devices.  And I've been
debating with myself whether a pci specific thing or a feature bit is
preferable.

Thoughts?

> Changes from v2:
>  - Fix really embarrassing bug.  This version actually works.
> 
> Changes from v1:
>  - Fix an endian conversion error causing a BUG to hit.
>  - Fix a DMA ordering issue (swiotlb=force works now).
>  - Minor cleanups.
> 
> Andy Lutomirski (3):
>   virtio_net: Stop doing DMA from the stack
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
> 
>  drivers/net/virtio_net.c   |  53 +++
>  drivers/virtio/Kconfig |   2 +-
>  drivers/virtio/virtio_pci_common.h |   3 +-
>  drivers/virtio/virtio_pci_legacy.c |  19 +++-
>  drivers/virtio/virtio_pci_modern.c |  34 +--
>  drivers/virtio/virtio_ring.c   | 187 
> ++---
>  tools/virtio/linux/dma-mapping.h   |  17 
>  7 files changed, 246 insertions(+), 69 deletions(-)
>  create mode 100644 tools/virtio/linux/dma-mapping.h
> 
> -- 
> 2.4.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 16:40 +0900, Christian Borntraeger wrote:
> Am 28.10.2015 um 16:17 schrieb Michael S. Tsirkin:
> > On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
> > > This switches virtio to use the DMA API unconditionally.  I'm sure
> > > it breaks things, but it seems to work on x86 using virtio-pci, with
> > > and without Xen, and using both the modern 1.0 variant and the
> > > legacy variant.
> > 
> > I'm very glad to see work on this making progress.
> > 
> > I suspect we'll have to find a way to make this optional though, and
> > keep doing the non-DMA API thing with old devices.  And I've been
> > debating with myself whether a pci specific thing or a feature bit is
> > preferable.
> > 
> 
> We have discussed that at kernel summit. I will try to implement a dummy 
> dma_ops for
> s390 that does 1:1 mapping and Ben will look into doing some quirk to handle 
> "old"
> code in addition to also make it possible to mark devices as iommu bypass 
> (IIRC,
> via device tree, Ben?)

Right. You never eschew the DMA API in the *driver* — you just expect
the DMA API to do the right thing for devices which don't need
translation (with platforms using per-device dma_ops and generally
getting their act together).

We're pushing that on the platforms where it's currently an issue,
including Power, SPARC and S390.

-- 
dwmw2




smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Benjamin Herrenschmidt
On Wed, 2015-10-28 at 16:40 +0900, Christian Borntraeger wrote:
> We have discussed that at kernel summit. I will try to implement a dummy 
> dma_ops for
> s390 that does 1:1 mapping and Ben will look into doing some quirk to handle 
> "old"
> code in addition to also make it possible to mark devices as iommu bypass 
> (IIRC,
> via device tree, Ben?)

Something like that yes. I'll look into it when I'm back home.

Cheers,
Ben.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 01:44 +0100, Paolo Bonzini wrote:
> 
> On 27/10/2015 22:26, Yunhong Jiang wrote:
> >> > On RT kernels however can you call eventfd_signal from interrupt
> >> > context?  You cannot call spin_lock_irqsave (which can sleep) from a
> >> > non-threaded interrupt handler, can you?  You would need a raw spin lock.
> > Thanks for pointing this out. Yes, we can't call spin_lock_irqsave on RT 
> > kernel. Will do this way on next patch. But not sure if it's overkill to 
> > use 
> > raw_spinlock there since the eventfd_signal is used by other caller also.
> 
> No, I don't think you can use raw_spinlock there.  The problem is not
> just eventfd_signal, it is especially wake_up_locked_poll.  You cannot
> convert the whole workqueue infrastructure to use raw_spinlock.
> 
> Alex, would it make sense to use the IRQ bypass infrastructure always,
> not just for VT-d, to do the MSI injection directly from the VFIO
> interrupt handler and bypass the eventfd?  Basically this would add an
> RCU-protected list of consumers matching the token to struct
> irq_bypass_producer, and a
> 
>   int (*inject)(struct irq_bypass_consumer *);
> 
> callback to struct irq_bypass_consumer.  If any callback returns true,
> the eventfd is not signaled.  The KVM implementation would be like this
> (compare with virt/kvm/eventfd.c):
> 
>   /* Extracted out of irqfd_wakeup */
>   static int
>   irqfd_wakeup_pollin(struct kvm_kernel_irqfd *irqfd)
>   {
>   ...
>   }
> 
>   /* Extracted out of irqfd_wakeup */
>   static int
>   irqfd_wakeup_pollhup(struct kvm_kernel_irqfd *irqfd)
>   {
>   ...
>   }
> 
>   static int
>   irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync,
>void *key)
>   {
>   struct _irqfd *irqfd = container_of(wait,
>   struct _irqfd, wait);
>   unsigned long flags = (unsigned long)key;
> 
>   if (flags & POLLIN)
>   irqfd_wakeup_pollin(irqfd);
>   if (flags & POLLHUP)
>   irqfd_wakeup_pollhup(irqfd);
> 
>   return 0;
>   }
> 
>   static int kvm_arch_irq_bypass_inject(
>   struct irq_bypass_consumer *cons)
>   {
>   struct kvm_kernel_irqfd *irqfd =
>   container_of(cons, struct kvm_kernel_irqfd,
>consumer); 
> 
>   irqfd_wakeup_pollin(irqfd);
>   }
> 
> Or do you think it would be a hack?  The latency improvement might
> actually be even better than what Yunhong is already reporting.

Yeah, that might be a good idea, it's probably more plausible than
making the eventfd_signal() code friendly to call from hard interrupt
context.  On the vfio side can we use request_threaded_irq() directly
for this?  Making the hard irq handler return IRQ_HANDLED if we can use
the irq bypass manager or IRQ_WAKE_THREAD if we need to use the eventfd.
I think we need some way to get back to irq thread context to use
eventfd_signal().  Would we ever not want to use the direct bypass
manager path if available?  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 27/33] nvdimm acpi: support function 0

2015-10-28 Thread Stefan Hajnoczi
On Wed, Oct 28, 2015 at 10:26:25PM +, Xiao Guangrong wrote:
> __DSM is defined in ACPI 6.0: 9.14.1 _DSM (Device Specific Method)
> 
> Function 0 is a query function. We do not support any function on root
> device and only 3 functions are support for NVDIMM device, Get Namespace
> Label Size, Get Namespace Label Data and Set Namespace Label Data, that
> means we currently only allow to access device's Label Namespace
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  hw/acpi/aml-build.c |   2 +-
>  hw/acpi/nvdimm.c| 156 
> +++-
>  include/hw/acpi/aml-build.h |   1 +
>  3 files changed, 157 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v5 30/33] nvdimm acpi: support Set Namespace Label Data function

2015-10-28 Thread Stefan Hajnoczi
On Wed, Oct 28, 2015 at 10:26:28PM +, Xiao Guangrong wrote:
> +static void nvdimm_dsm_func_set_label_data(NVDIMMDevice *nvdimm,
> +   nvdimm_dsm_in *in, GArray *out)
> +{
> +NVDIMMClass *nvc = NVDIMM_GET_CLASS(nvdimm);
> +nvdimm_func_in_set_label_data *set_label_data = >func_set_label_data;
> +uint32_t status;
> +
> +le32_to_cpus(_label_data->offset);
> +le32_to_cpus(_label_data->length);
> +
> +nvdimm_debug("Write Label Data: offset %#x length %#x.\n",
> + set_label_data->offset, set_label_data->length);
> +
> +if (nvdimm->label_size < set_label_data->offset + 
> set_label_data->length) {

Integer overflow.


signature.asc
Description: PGP signature


Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
> Current vfio_pgsize_bitmap code hides the supported IOMMU page
> sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
> does not support PAGE_SIZE page, the alignment check on map/unmap
> is done with larger page sizes, if any. This can fail although
> mapping could be done with pages smaller than PAGE_SIZE.
> 
> vfio_pgsize_bitmap is modified to expose the IOMMU page sizes,
> supported by all domains, even those smaller than PAGE_SIZE. The
> alignment check on map is performed against PAGE_SIZE if the minimum
> IOMMU size is less than PAGE_SIZE or against the min page size greater
> than PAGE_SIZE.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> This was tested on AMD Seattle with 64kB page host. ARM MMU 401
> currently expose 4kB, 2MB and 1GB page support. With a 64kB page host,
> the map/unmap check is done against 2MB. Some alignment check fail
> so VFIO_IOMMU_MAP_DMA fail while we could map using 4kB IOMMU page
> size.
> ---
>  drivers/vfio/vfio_iommu_type1.c | 25 +++--
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 57d8c37..13fb974 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> struct vfio_dma *dma)
>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  {
>   struct vfio_domain *domain;
> - unsigned long bitmap = PAGE_MASK;
> + unsigned long bitmap = ULONG_MAX;

Isn't this and removing the WARN_ON()s the only real change in this
patch?  The rest looks like conversion to use IS_ALIGNED and the
following test, that I don't really understand...

>  
>   mutex_lock(>lock);
>   list_for_each_entry(domain, >domain_list, next)
> @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
> vfio_iommu *iommu)
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> - uint64_t mask;
>   struct vfio_dma *dma;
>   size_t unmapped = 0;
>   int ret = 0;
> + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> + PAGE_SIZE : min_pagesz;

This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
care to cap alignment at PAGE_SIZE?

> - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> -
> - if (unmap->iova & mask)
> + if (!IS_ALIGNED(unmap->iova, requested_alignment))
>   return -EINVAL;
> - if (!unmap->size || unmap->size & mask)
> + if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
>   return -EINVAL;
>  
> - WARN_ON(mask & PAGE_MASK);
> -
>   mutex_lock(>lock);
>  
>   /*
> @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>   size_t size = map->size;
>   long npage;
>   int ret = 0, prot = 0;
> - uint64_t mask;
>   struct vfio_dma *dma;
>   unsigned long pfn;
> + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> + PAGE_SIZE : min_pagesz;
>  
>   /* Verify that none of our __u64 fields overflow */
>   if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>   return -EINVAL;
>  
> - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> -
> - WARN_ON(mask & PAGE_MASK);
> -
>   /* READ/WRITE from device perspective */
>   if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
>   prot |= IOMMU_WRITE;
>   if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>   prot |= IOMMU_READ;
>  
> - if (!prot || !size || (size | iova | vaddr) & mask)
> + if (!prot || !size ||
> + !IS_ALIGNED(size | iova | vaddr, requested_alignment))
>   return -EINVAL;
>  
>   /* Don't allow IOVA or virtual address wrap */

This is mostly ignoring the problems with sub-PAGE_SIZE mappings.  For
instance, we can only pin on PAGE_SIZE and therefore we only do
accounting on PAGE_SIZE, so if the user does 4K mappings across your 64K
page, that page gets pinned and accounted 16 times.  Are we going to
tell users that their locked memory limit needs to be 16x now?  The rest
of the code would need an audit as well to see what other sub-page bugs
might be hiding.  Thanks,

Alex



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/3] virtio DMA API core stuff

2015-10-28 Thread Michael S. Tsirkin
On Wed, Oct 28, 2015 at 11:32:34PM +0900, David Woodhouse wrote:
> > I don't have a problem with extending DMA API to address
> > more usecases.
> 
> No, this isn't an extension. This is fixing a bug, on certain platforms
> where the DMA API has currently done the wrong thing.
> 
> We have historically worked around that bug by introducing *another*
> bug, which is not to *use* the DMA API in the virtio driver.
> 
> Sure, we can co-ordinate those two bug-fixes. But let's not talk about
> them as anything other than bug-fixes.

It was pretty practical not to use it. All virtio devices at the time
without exception bypassed the IOMMU, so it was a question of omitting a
couple of function calls in virtio versus hacking on DMA implementation
on multiple platforms. We have more policy options now, so I agree it's
time to revisit this.

But for me, the most important thing is that we do coordinate.

> > > Drivers use DMA API. No more talky.
> > 
> > Well for virtio they don't ATM. And 1:1 mapping makes perfect sense
> > for the wast majority of users, so I can't switch them over
> > until the DMA API actually addresses all existing usecases.
> 
> That's still not your business; it's the platform's. And there are
> hardware implementations of the virtio protocols on real PCI cards. And
> we have the option of doing IOMMU translation for the virtio devices
> even in a virtual machine. Just don't get involved.
> 
> -- 
> dwmw2
> 
> 

I'm involved anyway, it's possible not to put all the code in the virtio
subsystem in guest though.  But I suspect we'll need to find a way for
non-linux drivers within guest to work correctly too, and they might
have trouble poking at things at the system level.  So possibly virtio
subsystem will have to tell platform "this device wants to bypass IOMMU"
and then DMA API does the right thing.

I'll look into this after my vacation ~1.5 weeks from now.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 29/33] nvdimm acpi: support Get Namespace Label Data function

2015-10-28 Thread Stefan Hajnoczi
On Wed, Oct 28, 2015 at 10:26:27PM +, Xiao Guangrong wrote:
> +static void nvdimm_dsm_func_get_label_data(NVDIMMDevice *nvdimm,
> +   nvdimm_dsm_in *in, GArray *out)
> +{
> +NVDIMMClass *nvc = NVDIMM_GET_CLASS(nvdimm);
> +nvdimm_func_in_get_label_data *get_label_data = >func_get_label_data;
> +void *buf;
> +uint32_t status = NVDIMM_DSM_STATUS_SUCCESS;
> +
> +le32_to_cpus(_label_data->offset);
> +le32_to_cpus(_label_data->length);
> +
> +nvdimm_debug("Read Label Data: offset %#x length %#x.\n",
> + get_label_data->offset, get_label_data->length);
> +
> +if (nvdimm->label_size < get_label_data->offset + 
> get_label_data->length) {

Integer overflow isn't handled here and it's unclear if that can cause
problems later on.  It's safest to catch it right away instead of
relying on nvc->read_label_data() to check again.


signature.asc
Description: PGP signature


Re: [PATCH v5 28/33] nvdimm acpi: support Get Namespace Label Size function

2015-10-28 Thread Stefan Hajnoczi
On Wed, Oct 28, 2015 at 10:26:26PM +, Xiao Guangrong wrote:
> +struct nvdimm_func_in_get_label_data {
> +uint32_t offset; /* the offset in the namespace label data area. */
> +uint32_t length; /* the size of data is to be read via the function. */
> +} QEMU_PACKED;
> +typedef struct nvdimm_func_in_get_label_data nvdimm_func_in_get_label_data;

./CODING_STYLE "3. Naming":

  Structured type names are in CamelCase; harder to type but standing
  out.

I'm surprised that scripts/checkpatch.pl didn't warning about this.

> +/*
> + * the max transfer size is the max size transferred by both a
> + * 'Get Namespace Label Data' function and a 'Set Namespace Label Data'
> + * function.
> + */
> +static uint32_t nvdimm_get_max_xfer_label_size(void)
> +{
> +nvdimm_dsm_in *in;
> +uint32_t max_get_size, max_set_size, dsm_memory_size = getpagesize();

Why is the host's page size relevant here?  Did you mean
TARGET_PAGE_SIZE?


signature.asc
Description: PGP signature


Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Eric Auger
Hi Alex,
On 10/28/2015 05:27 PM, Alex Williamson wrote:
> On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
>> Current vfio_pgsize_bitmap code hides the supported IOMMU page
>> sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
>> does not support PAGE_SIZE page, the alignment check on map/unmap
>> is done with larger page sizes, if any. This can fail although
>> mapping could be done with pages smaller than PAGE_SIZE.
>>
>> vfio_pgsize_bitmap is modified to expose the IOMMU page sizes,
>> supported by all domains, even those smaller than PAGE_SIZE. The
>> alignment check on map is performed against PAGE_SIZE if the minimum
>> IOMMU size is less than PAGE_SIZE or against the min page size greater
>> than PAGE_SIZE.
>>
>> Signed-off-by: Eric Auger 
>>
>> ---
>>
>> This was tested on AMD Seattle with 64kB page host. ARM MMU 401
>> currently expose 4kB, 2MB and 1GB page support. With a 64kB page host,
>> the map/unmap check is done against 2MB. Some alignment check fail
>> so VFIO_IOMMU_MAP_DMA fail while we could map using 4kB IOMMU page
>> size.
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 25 +++--
>>  1 file changed, 11 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>> b/drivers/vfio/vfio_iommu_type1.c
>> index 57d8c37..13fb974 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
>> struct vfio_dma *dma)
>>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>  {
>>  struct vfio_domain *domain;
>> -unsigned long bitmap = PAGE_MASK;
>> +unsigned long bitmap = ULONG_MAX;
> 
> Isn't this and removing the WARN_ON()s the only real change in this
> patch?  The rest looks like conversion to use IS_ALIGNED and the
> following test, that I don't really understand...
Yes basically you're right.
> 
>>  
>>  mutex_lock(>lock);
>>  list_for_each_entry(domain, >domain_list, next)
>> @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
>> vfio_iommu *iommu)
>>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>   struct vfio_iommu_type1_dma_unmap *unmap)
>>  {
>> -uint64_t mask;
>>  struct vfio_dma *dma;
>>  size_t unmapped = 0;
>>  int ret = 0;
>> +unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
>> +unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
>> +PAGE_SIZE : min_pagesz;
> 
> This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
> care to cap alignment at PAGE_SIZE?
My intent in this patch isn't to allow the user-space to map/unmap
sub-PAGE_SIZE buffers. The new test makes sure the mapped area is bigger
or equal than a host page whatever the supported page sizes.

I noticed that chunk construction, pinning and other many things are
based on PAGE_SIZE and far be it from me to change that code! I want to
keep that minimal granularity for all those computation.

However on iommu side, I would like to rely on the fact the iommu driver
is clever enough to choose the right page size and even to choose a size
that is smaller than PAGE_SIZE if this latter is not supported.
> 
>> -mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>> -
>> -if (unmap->iova & mask)
>> +if (!IS_ALIGNED(unmap->iova, requested_alignment))
>>  return -EINVAL;
>> -if (!unmap->size || unmap->size & mask)
>> +if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
>>  return -EINVAL;
>>  
>> -WARN_ON(mask & PAGE_MASK);
>> -
>>  mutex_lock(>lock);
>>  
>>  /*
>> @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  size_t size = map->size;
>>  long npage;
>>  int ret = 0, prot = 0;
>> -uint64_t mask;
>>  struct vfio_dma *dma;
>>  unsigned long pfn;
>> +unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
>> +unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
>> +PAGE_SIZE : min_pagesz;
>>  
>>  /* Verify that none of our __u64 fields overflow */
>>  if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>>  return -EINVAL;
>>  
>> -mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>> -
>> -WARN_ON(mask & PAGE_MASK);
>> -
>>  /* READ/WRITE from device perspective */
>>  if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
>>  prot |= IOMMU_WRITE;
>>  if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>>  prot |= IOMMU_READ;
>>  
>> -if (!prot || !size || (size | iova | vaddr) & mask)
>> +if (!prot || !size ||
>> +!IS_ALIGNED(size | iova | vaddr, requested_alignment))
>>  return -EINVAL;
>>  
>>  /* Don't allow IOVA or virtual address wrap */
> 
> This is mostly 

Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 18:10 +0100, Eric Auger wrote:
> Hi Alex,
> On 10/28/2015 05:27 PM, Alex Williamson wrote:
> > On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
> >> Current vfio_pgsize_bitmap code hides the supported IOMMU page
> >> sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
> >> does not support PAGE_SIZE page, the alignment check on map/unmap
> >> is done with larger page sizes, if any. This can fail although
> >> mapping could be done with pages smaller than PAGE_SIZE.
> >>
> >> vfio_pgsize_bitmap is modified to expose the IOMMU page sizes,
> >> supported by all domains, even those smaller than PAGE_SIZE. The
> >> alignment check on map is performed against PAGE_SIZE if the minimum
> >> IOMMU size is less than PAGE_SIZE or against the min page size greater
> >> than PAGE_SIZE.
> >>
> >> Signed-off-by: Eric Auger 
> >>
> >> ---
> >>
> >> This was tested on AMD Seattle with 64kB page host. ARM MMU 401
> >> currently expose 4kB, 2MB and 1GB page support. With a 64kB page host,
> >> the map/unmap check is done against 2MB. Some alignment check fail
> >> so VFIO_IOMMU_MAP_DMA fail while we could map using 4kB IOMMU page
> >> size.
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 25 +++--
> >>  1 file changed, 11 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> >> b/drivers/vfio/vfio_iommu_type1.c
> >> index 57d8c37..13fb974 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> >> struct vfio_dma *dma)
> >>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >>  {
> >>struct vfio_domain *domain;
> >> -  unsigned long bitmap = PAGE_MASK;
> >> +  unsigned long bitmap = ULONG_MAX;
> > 
> > Isn't this and removing the WARN_ON()s the only real change in this
> > patch?  The rest looks like conversion to use IS_ALIGNED and the
> > following test, that I don't really understand...
> Yes basically you're right.


Ok, so with hopefully correcting my understand of what this does, isn't
this effectively the same:

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 57d8c37..7db4f5a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -403,13 +403,19 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, stru
 static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 {
struct vfio_domain *domain;
-   unsigned long bitmap = PAGE_MASK;
+   unsigned long bitmap = ULONG_MAX;
 
mutex_lock(>lock);
list_for_each_entry(domain, >domain_list, next)
bitmap &= domain->domain->ops->pgsize_bitmap;
mutex_unlock(>lock);
 
+   /* Some comment about how the IOMMU API splits requests */
+   if (bitmap & ~PAGE_MASK) {
+   bitmap &= PAGE_MASK;
+   bitmap |= PAGE_SIZE;
+   }
+
return bitmap;
 }
 
This would also expose to the user that we're accepting PAGE_SIZE, which
we weren't before, so it was not quite right to just let them do it
anyway.  I don't think we even need to get rid of the WARN_ONs, do we?
Thanks,

Alex

> > 
> >>  
> >>mutex_lock(>lock);
> >>list_for_each_entry(domain, >domain_list, next)
> >> @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
> >> vfio_iommu *iommu)
> >>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >> struct vfio_iommu_type1_dma_unmap *unmap)
> >>  {
> >> -  uint64_t mask;
> >>struct vfio_dma *dma;
> >>size_t unmapped = 0;
> >>int ret = 0;
> >> +  unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> >> +  unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> >> +  PAGE_SIZE : min_pagesz;
> > 
> > This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
> > care to cap alignment at PAGE_SIZE?
> My intent in this patch isn't to allow the user-space to map/unmap
> sub-PAGE_SIZE buffers. The new test makes sure the mapped area is bigger
> or equal than a host page whatever the supported page sizes.
> 
> I noticed that chunk construction, pinning and other many things are
> based on PAGE_SIZE and far be it from me to change that code! I want to
> keep that minimal granularity for all those computation.
> 
> However on iommu side, I would like to rely on the fact the iommu driver
> is clever enough to choose the right page size and even to choose a size
> that is smaller than PAGE_SIZE if this latter is not supported.
> > 
> >> -  mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >> -
> >> -  if (unmap->iova & mask)
> >> +  if (!IS_ALIGNED(unmap->iova, requested_alignment))
> >>return -EINVAL;
> >> -  if (!unmap->size || unmap->size & mask)
> >> +  if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
> >>   

Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Eric Auger
On 10/28/2015 06:37 PM, Alex Williamson wrote:
> On Wed, 2015-10-28 at 18:10 +0100, Eric Auger wrote:
>> Hi Alex,
>> On 10/28/2015 05:27 PM, Alex Williamson wrote:
>>> On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
 Current vfio_pgsize_bitmap code hides the supported IOMMU page
 sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
 does not support PAGE_SIZE page, the alignment check on map/unmap
 is done with larger page sizes, if any. This can fail although
 mapping could be done with pages smaller than PAGE_SIZE.

 vfio_pgsize_bitmap is modified to expose the IOMMU page sizes,
 supported by all domains, even those smaller than PAGE_SIZE. The
 alignment check on map is performed against PAGE_SIZE if the minimum
 IOMMU size is less than PAGE_SIZE or against the min page size greater
 than PAGE_SIZE.

 Signed-off-by: Eric Auger 

 ---

 This was tested on AMD Seattle with 64kB page host. ARM MMU 401
 currently expose 4kB, 2MB and 1GB page support. With a 64kB page host,
 the map/unmap check is done against 2MB. Some alignment check fail
 so VFIO_IOMMU_MAP_DMA fail while we could map using 4kB IOMMU page
 size.
 ---
  drivers/vfio/vfio_iommu_type1.c | 25 +++--
  1 file changed, 11 insertions(+), 14 deletions(-)

 diff --git a/drivers/vfio/vfio_iommu_type1.c 
 b/drivers/vfio/vfio_iommu_type1.c
 index 57d8c37..13fb974 100644
 --- a/drivers/vfio/vfio_iommu_type1.c
 +++ b/drivers/vfio/vfio_iommu_type1.c
 @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
 struct vfio_dma *dma)
  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
  {
struct vfio_domain *domain;
 -  unsigned long bitmap = PAGE_MASK;
 +  unsigned long bitmap = ULONG_MAX;
>>>
>>> Isn't this and removing the WARN_ON()s the only real change in this
>>> patch?  The rest looks like conversion to use IS_ALIGNED and the
>>> following test, that I don't really understand...
>> Yes basically you're right.
> 
> 
> Ok, so with hopefully correcting my understand of what this does, isn't
> this effectively the same:
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 57d8c37..7db4f5a 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -403,13 +403,19 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> stru
>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  {
> struct vfio_domain *domain;
> -   unsigned long bitmap = PAGE_MASK;
> +   unsigned long bitmap = ULONG_MAX;
>  
> mutex_lock(>lock);
> list_for_each_entry(domain, >domain_list, next)
> bitmap &= domain->domain->ops->pgsize_bitmap;
> mutex_unlock(>lock);
>  
> +   /* Some comment about how the IOMMU API splits requests */
> +   if (bitmap & ~PAGE_MASK) {
> +   bitmap &= PAGE_MASK;
> +   bitmap |= PAGE_SIZE;
> +   }
> +
> return bitmap;
>  }
Yes, to me it is indeed the same
>  
> This would also expose to the user that we're accepting PAGE_SIZE, which
> we weren't before, so it was not quite right to just let them do it
> anyway.  I don't think we even need to get rid of the WARN_ONs, do we?
> Thanks,

The end-user might be afraid of those latter. Personally I would get rid
of them but that's definitively up to you.

Just let me know and I will respin.

Best Regards

Eric

> 
> Alex
> 
>>>
  
mutex_lock(>lock);
list_for_each_entry(domain, >domain_list, next)
 @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
 vfio_iommu *iommu)
  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 struct vfio_iommu_type1_dma_unmap *unmap)
  {
 -  uint64_t mask;
struct vfio_dma *dma;
size_t unmapped = 0;
int ret = 0;
 +  unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
 +  unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
 +  PAGE_SIZE : min_pagesz;
>>>
>>> This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
>>> care to cap alignment at PAGE_SIZE?
>> My intent in this patch isn't to allow the user-space to map/unmap
>> sub-PAGE_SIZE buffers. The new test makes sure the mapped area is bigger
>> or equal than a host page whatever the supported page sizes.
>>
>> I noticed that chunk construction, pinning and other many things are
>> based on PAGE_SIZE and far be it from me to change that code! I want to
>> keep that minimal granularity for all those computation.
>>
>> However on iommu side, I would like to rely on the fact the iommu driver
>> is clever enough to choose the right page size and even to choose a size
>> that is smaller than PAGE_SIZE if this latter is not 

Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Paolo Bonzini


On 28/10/2015 17:00, Alex Williamson wrote:
> > Alex, would it make sense to use the IRQ bypass infrastructure always,
> > not just for VT-d, to do the MSI injection directly from the VFIO
> > interrupt handler and bypass the eventfd?  Basically this would add an
> > RCU-protected list of consumers matching the token to struct
> > irq_bypass_producer, and a
> > 
> > int (*inject)(struct irq_bypass_consumer *);
> > 
> > callback to struct irq_bypass_consumer.  If any callback returns true,
> > the eventfd is not signaled.
>
> Yeah, that might be a good idea, it's probably more plausible than
> making the eventfd_signal() code friendly to call from hard interrupt
> context.  On the vfio side can we use request_threaded_irq() directly
> for this?

I don't know if that gives you a non-threaded IRQ with the real-time
kernel...  CCing Marcelo to get some insight.

> Making the hard irq handler return IRQ_HANDLED if we can use
> the irq bypass manager or IRQ_WAKE_THREAD if we need to use the eventfd.
> I think we need some way to get back to irq thread context to use
> eventfd_signal().

The irqfd is already able to schedule a work item, because it runs with
interrupts disabled, so I think we can always return IRQ_HANDLED.

There's another little complication.  Right now, only x86 has
kvm_set_msi_inatomic.  We should merge kvm_set_msi_inatomic,
kvm_set_irq_inatomic and kvm_arch_set_irq.

Some cleanups are needed there; the flow between the functions is really
badly structured because the API grew somewhat by accretion.  I'll get
to it next week or on the way back to Italy.

> Would we ever not want to use the direct bypass
> manager path if available?  Thanks,

I don't think so.  KVM always registers itself as a consumer, even if
there is no VT-d posted interrupts.  add_producer simply returns -EINVAL
then.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Eric Auger
Hi Will,
On 10/28/2015 06:14 PM, Will Deacon wrote:
> On Wed, Oct 28, 2015 at 10:27:28AM -0600, Alex Williamson wrote:
>> On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>>> b/drivers/vfio/vfio_iommu_type1.c
>>> index 57d8c37..13fb974 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
>>> struct vfio_dma *dma)
>>>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>>  {
>>> struct vfio_domain *domain;
>>> -   unsigned long bitmap = PAGE_MASK;
>>> +   unsigned long bitmap = ULONG_MAX;
>>
>> Isn't this and removing the WARN_ON()s the only real change in this
>> patch?  The rest looks like conversion to use IS_ALIGNED and the
>> following test, that I don't really understand...
>>
>>>  
>>> mutex_lock(>lock);
>>> list_for_each_entry(domain, >domain_list, next)
>>> @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
>>> vfio_iommu *iommu)
>>>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>  struct vfio_iommu_type1_dma_unmap *unmap)
>>>  {
>>> -   uint64_t mask;
>>> struct vfio_dma *dma;
>>> size_t unmapped = 0;
>>> int ret = 0;
>>> +   unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
>>> +   unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
>>> +   PAGE_SIZE : min_pagesz;
>>
>> This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
>> care to cap alignment at PAGE_SIZE?
> 
> Eric can clarify, but I think the intention here is to have VFIO continue
> doing things in PAGE_SIZE chunks precisely so that we don't have to rework
> all of the pinning code etc.
That's my intention indeed ;-)

Thanks

Eric
 The IOMMU API can then deal with the smaller
> page size.

> 
>>> -   mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>>> -
>>> -   if (unmap->iova & mask)
>>> +   if (!IS_ALIGNED(unmap->iova, requested_alignment))
>>> return -EINVAL;
>>> -   if (!unmap->size || unmap->size & mask)
>>> +   if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
>>> return -EINVAL;
>>>  
>>> -   WARN_ON(mask & PAGE_MASK);
>>> -
>>> mutex_lock(>lock);
>>>  
>>> /*
>>> @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>> size_t size = map->size;
>>> long npage;
>>> int ret = 0, prot = 0;
>>> -   uint64_t mask;
>>> struct vfio_dma *dma;
>>> unsigned long pfn;
>>> +   unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
>>> +   unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
>>> +   PAGE_SIZE : min_pagesz;
>>>  
>>> /* Verify that none of our __u64 fields overflow */
>>> if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>>> return -EINVAL;
>>>  
>>> -   mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
>>> -
>>> -   WARN_ON(mask & PAGE_MASK);
>>> -
>>> /* READ/WRITE from device perspective */
>>> if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
>>> prot |= IOMMU_WRITE;
>>> if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>>> prot |= IOMMU_READ;
>>>  
>>> -   if (!prot || !size || (size | iova | vaddr) & mask)
>>> +   if (!prot || !size ||
>>> +   !IS_ALIGNED(size | iova | vaddr, requested_alignment))
>>> return -EINVAL;
>>>  
>>> /* Don't allow IOVA or virtual address wrap */
>>
>> This is mostly ignoring the problems with sub-PAGE_SIZE mappings.  For
>> instance, we can only pin on PAGE_SIZE and therefore we only do
>> accounting on PAGE_SIZE, so if the user does 4K mappings across your 64K
>> page, that page gets pinned and accounted 16 times.  Are we going to
>> tell users that their locked memory limit needs to be 16x now?  The rest
>> of the code would need an audit as well to see what other sub-page bugs
>> might be hiding.  Thanks,
> 
> I don't see that. The pinning all happens the same in VFIO, which can
> then happily pass a 64k region to iommu_map. iommu_map will then call
> ->map in 4k chunks on the IOMMU driver ops.
> 
> Will
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Eric Auger
On 10/28/2015 06:55 PM, Will Deacon wrote:
> On Wed, Oct 28, 2015 at 06:48:41PM +0100, Eric Auger wrote:
>> On 10/28/2015 06:37 PM, Alex Williamson wrote:
>>> Ok, so with hopefully correcting my understand of what this does, isn't
>>> this effectively the same:
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
>>> b/drivers/vfio/vfio_iommu_type1.c
>>> index 57d8c37..7db4f5a 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -403,13 +403,19 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
>>> stru
>>>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>>>  {
>>> struct vfio_domain *domain;
>>> -   unsigned long bitmap = PAGE_MASK;
>>> +   unsigned long bitmap = ULONG_MAX;
>>>  
>>> mutex_lock(>lock);
>>> list_for_each_entry(domain, >domain_list, next)
>>> bitmap &= domain->domain->ops->pgsize_bitmap;
>>> mutex_unlock(>lock);
>>>  
>>> +   /* Some comment about how the IOMMU API splits requests */
>>> +   if (bitmap & ~PAGE_MASK) {
>>> +   bitmap &= PAGE_MASK;
>>> +   bitmap |= PAGE_SIZE;
>>> +   }
>>> +
>>> return bitmap;
>>>  }
>> Yes, to me it is indeed the same
>>>  
>>> This would also expose to the user that we're accepting PAGE_SIZE, which
>>> we weren't before, so it was not quite right to just let them do it
>>> anyway.  I don't think we even need to get rid of the WARN_ONs, do we?
>>> Thanks,
>>
>> The end-user might be afraid of those latter. Personally I would get rid
>> of them but that's definitively up to you.
> 
> I think Alex's point is that the WARN_ON's won't trigger with this patch,
> because he clears those lower bits in the bitmap.
ah yes sure!

Thanks

Eric
> 
> Will
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 8/9] kvm/x86: Hyper-V synthetic interrupt controller

2015-10-28 Thread Paolo Bonzini
Hi Andrey,

just one question.  Is kvm_arch_set_irq actually needed?  I think
everything should work fine without it.  Can you check?  If so, I can
remove it myself and revert the patch that introduced the hook.

Paolo

On 22/10/2015 18:09, Andrey Smetanin wrote:
> SynIC (synthetic interrupt controller) is a lapic extension,
> which is controlled via MSRs and maintains for each vCPU
>  - 16 synthetic interrupt "lines" (SINT's); each can be configured to
>trigger a specific interrupt vector optionally with auto-EOI
>semantics
>  - a message page in the guest memory with 16 256-byte per-SINT message
>slots
>  - an event flag page in the guest memory with 16 2048-bit per-SINT
>event flag areas
> 
> The host triggers a SINT whenever it delivers a new message to the
> corresponding slot or flips an event flag bit in the corresponding area.
> The guest informs the host that it can try delivering a message by
> explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
> MSR.
> 
> The userspace (qemu) triggers interrupts and receives EOM notifications
> via irqfd with resampler; for that, a GSI is allocated for each
> configured SINT, and irq_routing api is extended to support GSI-SINT
> mapping.
> 
> Signed-off-by: Andrey Smetanin 
> Reviewed-by: Roman Kagan 
> Signed-off-by: Denis V. Lunev 
> CC: Vitaly Kuznetsov 
> CC: "K. Y. Srinivasan" 
> CC: Gleb Natapov 
> CC: Paolo Bonzini 
> CC: Roman Kagan 
> 
> Changes v3:
> * added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
> docs
> 
> Changes v2:
> * do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
> * add Hyper-V SynIC vectors into EOI exit bitmap
> * Hyper-V SyniIC SINT msr write logic simplified
> ---
>  Documentation/virtual/kvm/api.txt |  14 ++
>  arch/x86/include/asm/kvm_host.h   |  14 ++
>  arch/x86/kvm/hyperv.c | 297 
> ++
>  arch/x86/kvm/hyperv.h |  21 +++
>  arch/x86/kvm/irq_comm.c   |  34 +
>  arch/x86/kvm/lapic.c  |  18 ++-
>  arch/x86/kvm/lapic.h  |   5 +
>  arch/x86/kvm/x86.c|  12 +-
>  include/linux/kvm_host.h  |   6 +
>  include/uapi/linux/kvm.h  |   8 +
>  10 files changed, 421 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 092ee9f..8710418 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1451,6 +1451,7 @@ struct kvm_irq_routing_entry {
>   struct kvm_irq_routing_irqchip irqchip;
>   struct kvm_irq_routing_msi msi;
>   struct kvm_irq_routing_s390_adapter adapter;
> + struct kvm_irq_routing_hv_sint hv_sint;
>   __u32 pad[8];
>   } u;
>  };
> @@ -1459,6 +1460,7 @@ struct kvm_irq_routing_entry {
>  #define KVM_IRQ_ROUTING_IRQCHIP 1
>  #define KVM_IRQ_ROUTING_MSI 2
>  #define KVM_IRQ_ROUTING_S390_ADAPTER 3
> +#define KVM_IRQ_ROUTING_HV_SINT 4
>  
>  No flags are specified so far, the corresponding field must be set to zero.
>  
> @@ -1482,6 +1484,10 @@ struct kvm_irq_routing_s390_adapter {
>   __u32 adapter_id;
>  };
>  
> +struct kvm_irq_routing_hv_sint {
> + __u32 vcpu;
> + __u32 sint;
> +};
>  
>  4.53 KVM_ASSIGN_SET_MSIX_NR (deprecated)
>  
> @@ -3685,3 +3691,11 @@ available, means that that the kernel has an 
> implementation of the
>  H_RANDOM hypercall backed by a hardware random-number generator.
>  If present, the kernel H_RANDOM handler can be enabled for guest use
>  with the KVM_CAP_PPC_ENABLE_HCALL capability.
> +
> +8.2 KVM_CAP_HYPERV_SYNIC
> +
> +Architectures: x86
> +This capability, if KVM_CHECK_EXTENSION indicates that it is
> +available, means that that the kernel has an implementation of the
> +Hyper-V Synthetic interrupt controller(SynIC). SynIC is used to
> +support Windows Hyper-V based guest paravirt drivers(VMBus).
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 3c6327d..8434f88 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -374,10 +375,23 @@ struct kvm_mtrr {
>   struct list_head head;
>  };
>  
> +/* Hyper-V synthetic interrupt controller (SynIC)*/
> +struct kvm_vcpu_hv_synic {
> + u64 version;
> + u64 control;
> + u64 msg_page;
> + u64 evt_page;
> + atomic64_t sint[HV_SYNIC_SINT_COUNT];
> + atomic_t sint_to_gsi[HV_SYNIC_SINT_COUNT];
> + DECLARE_BITMAP(auto_eoi_bitmap, 256);
> + DECLARE_BITMAP(vec_bitmap, 256);
> +};
> +
>  /* Hyper-V per vcpu emulation context */
>  struct kvm_vcpu_hv {
>   u64 hv_vapic;
>   s64 runtime_offset;
> + struct 

Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Eric Auger
Alex,
On 10/28/2015 06:28 PM, Alex Williamson wrote:
> On Wed, 2015-10-28 at 17:14 +, Will Deacon wrote:
>> On Wed, Oct 28, 2015 at 10:27:28AM -0600, Alex Williamson wrote:
>>> On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
 diff --git a/drivers/vfio/vfio_iommu_type1.c 
 b/drivers/vfio/vfio_iommu_type1.c
 index 57d8c37..13fb974 100644
 --- a/drivers/vfio/vfio_iommu_type1.c
 +++ b/drivers/vfio/vfio_iommu_type1.c
 @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
 struct vfio_dma *dma)
  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
  {
struct vfio_domain *domain;
 -  unsigned long bitmap = PAGE_MASK;
 +  unsigned long bitmap = ULONG_MAX;
>>>
>>> Isn't this and removing the WARN_ON()s the only real change in this
>>> patch?  The rest looks like conversion to use IS_ALIGNED and the
>>> following test, that I don't really understand...
>>>
  
mutex_lock(>lock);
list_for_each_entry(domain, >domain_list, next)
 @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
 vfio_iommu *iommu)
  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 struct vfio_iommu_type1_dma_unmap *unmap)
  {
 -  uint64_t mask;
struct vfio_dma *dma;
size_t unmapped = 0;
int ret = 0;
 +  unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
 +  unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
 +  PAGE_SIZE : min_pagesz;
>>>
>>> This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
>>> care to cap alignment at PAGE_SIZE?
>>
>> Eric can clarify, but I think the intention here is to have VFIO continue
>> doing things in PAGE_SIZE chunks precisely so that we don't have to rework
>> all of the pinning code etc. The IOMMU API can then deal with the smaller
>> page size.
> 
> Gak, I read this wrong.  So really we're just artificially adding
> PAGE_SIZE as a supported IOMMU size so long as the IOMMU support
> something smaller than PAGE_SIZE, where PAGE_SIZE is obviously a
> multiple of that smaller size.  Ok, but should we just do this once in
> vfio_pgsize_bitmap()?  This is exactly why VT-d just reports ~(4k - 1)
> for the iommu bitmap.
Yes I can do this in vfio_pgsize_bitmap if you prefer.

Thanks

Eric

> 
 -  mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
 -
 -  if (unmap->iova & mask)
 +  if (!IS_ALIGNED(unmap->iova, requested_alignment))
return -EINVAL;
 -  if (!unmap->size || unmap->size & mask)
 +  if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
return -EINVAL;
  
 -  WARN_ON(mask & PAGE_MASK);
 -
mutex_lock(>lock);
  
/*
 @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
size_t size = map->size;
long npage;
int ret = 0, prot = 0;
 -  uint64_t mask;
struct vfio_dma *dma;
unsigned long pfn;
 +  unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
 +  unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
 +  PAGE_SIZE : min_pagesz;
  
/* Verify that none of our __u64 fields overflow */
if (map->size != size || map->vaddr != vaddr || map->iova != iova)
return -EINVAL;
  
 -  mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
 -
 -  WARN_ON(mask & PAGE_MASK);
 -
/* READ/WRITE from device perspective */
if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
prot |= IOMMU_WRITE;
if (map->flags & VFIO_DMA_MAP_FLAG_READ)
prot |= IOMMU_READ;
  
 -  if (!prot || !size || (size | iova | vaddr) & mask)
 +  if (!prot || !size ||
 +  !IS_ALIGNED(size | iova | vaddr, requested_alignment))
return -EINVAL;
  
/* Don't allow IOVA or virtual address wrap */
>>>
>>> This is mostly ignoring the problems with sub-PAGE_SIZE mappings.  For
>>> instance, we can only pin on PAGE_SIZE and therefore we only do
>>> accounting on PAGE_SIZE, so if the user does 4K mappings across your 64K
>>> page, that page gets pinned and accounted 16 times.  Are we going to
>>> tell users that their locked memory limit needs to be 16x now?  The rest
>>> of the code would need an audit as well to see what other sub-page bugs
>>> might be hiding.  Thanks,
>>
>> I don't see that. The pinning all happens the same in VFIO, which can
>> then happily pass a 64k region to iommu_map. iommu_map will then call
>> ->map in 4k chunks on the IOMMU driver ops.
> 
> Yep, I see now that this isn't doing sub-page mappings.  Thanks,
> 
> Alex
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info 

Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 10:50 -0700, Yunhong Jiang wrote:
> On Wed, Oct 28, 2015 at 01:44:55AM +0100, Paolo Bonzini wrote:
> > 
> > 
> > On 27/10/2015 22:26, Yunhong Jiang wrote:
> > >> > On RT kernels however can you call eventfd_signal from interrupt
> > >> > context?  You cannot call spin_lock_irqsave (which can sleep) from a
> > >> > non-threaded interrupt handler, can you?  You would need a raw spin 
> > >> > lock.
> > > Thanks for pointing this out. Yes, we can't call spin_lock_irqsave on RT 
> > > kernel. Will do this way on next patch. But not sure if it's overkill to 
> > > use 
> > > raw_spinlock there since the eventfd_signal is used by other caller also.
> > 
> > No, I don't think you can use raw_spinlock there.  The problem is not
> > just eventfd_signal, it is especially wake_up_locked_poll.  You cannot
> > convert the whole workqueue infrastructure to use raw_spinlock.
> 
> You mean the waitqueue, instead of workqueue, right? One choice is to change 
> the eventfd to use simple wait queue, which is raw_spinlock. But use simple 
> waitqueue on eventfd may in fact impact real time latency if not in this 
> scenario.
> 
> > 
> > Alex, would it make sense to use the IRQ bypass infrastructure always,
> > not just for VT-d, to do the MSI injection directly from the VFIO
> > interrupt handler and bypass the eventfd?  Basically this would add an
> > RCU-protected list of consumers matching the token to struct
> > irq_bypass_producer, and a
> > 
> > int (*inject)(struct irq_bypass_consumer *);
> > 
> > callback to struct irq_bypass_consumer.  If any callback returns true,
> > the eventfd is not signaled.  The KVM implementation would be like this
> > (compare with virt/kvm/eventfd.c):
> > 
> > /* Extracted out of irqfd_wakeup */
> > static int
> > irqfd_wakeup_pollin(struct kvm_kernel_irqfd *irqfd)
> > {
> > ...
> > }
> > 
> > /* Extracted out of irqfd_wakeup */
> > static int
> > irqfd_wakeup_pollhup(struct kvm_kernel_irqfd *irqfd)
> > {
> > ...
> > }
> > 
> > static int
> > irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> >  void *key)
> > {
> > struct _irqfd *irqfd = container_of(wait,
> > struct _irqfd, wait);
> > unsigned long flags = (unsigned long)key;
> > 
> > if (flags & POLLIN)
> > irqfd_wakeup_pollin(irqfd);
> > if (flags & POLLHUP)
> > irqfd_wakeup_pollhup(irqfd);
> > 
> > return 0;
> > }
> > 
> > static int kvm_arch_irq_bypass_inject(
> > struct irq_bypass_consumer *cons)
> > {
> > struct kvm_kernel_irqfd *irqfd =
> > container_of(cons, struct kvm_kernel_irqfd,
> >  consumer); 
> > 
> > irqfd_wakeup_pollin(irqfd);
> > }
> > 
> This is a good idea IMHO. So for MSI interrupt, the 
> kvm_arch_irq_bypass_inject will be used, and the irqfd_wakeup will not be 
> invoked anymore, am I right?
> 
> I noticed the irq bypass manager is not merged yet, are there any git branch 
> for it?

It's in linux-next via the kvm.git next branch:

git://git.kernel.org/pub/scm/virt/kvm/kvm.git

Thanks,
Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Paolo Bonzini


On 28/10/2015 18:50, Yunhong Jiang wrote:
> > No, I don't think you can use raw_spinlock there.  The problem is not
> > just eventfd_signal, it is especially wake_up_locked_poll.  You cannot
> > convert the whole workqueue infrastructure to use raw_spinlock.
> 
> You mean the waitqueue, instead of workqueue, right?

Yes.

> One choice is to change 
> the eventfd to use simple wait queue, which is raw_spinlock. But use simple 
> waitqueue on eventfd may in fact impact real time latency if not in this 
> scenario.

Userspace can put an arbitrary amount of tasks on the work queue, so
it's not possible to use a simple wait queue.  It would also touch
multiple subsystems, so it's much better to bypass the eventfd completely.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Yunhong Jiang
On Wed, Oct 28, 2015 at 01:44:55AM +0100, Paolo Bonzini wrote:
> 
> 
> On 27/10/2015 22:26, Yunhong Jiang wrote:
> >> > On RT kernels however can you call eventfd_signal from interrupt
> >> > context?  You cannot call spin_lock_irqsave (which can sleep) from a
> >> > non-threaded interrupt handler, can you?  You would need a raw spin lock.
> > Thanks for pointing this out. Yes, we can't call spin_lock_irqsave on RT 
> > kernel. Will do this way on next patch. But not sure if it's overkill to 
> > use 
> > raw_spinlock there since the eventfd_signal is used by other caller also.
> 
> No, I don't think you can use raw_spinlock there.  The problem is not
> just eventfd_signal, it is especially wake_up_locked_poll.  You cannot
> convert the whole workqueue infrastructure to use raw_spinlock.

You mean the waitqueue, instead of workqueue, right? One choice is to change 
the eventfd to use simple wait queue, which is raw_spinlock. But use simple 
waitqueue on eventfd may in fact impact real time latency if not in this 
scenario.

> 
> Alex, would it make sense to use the IRQ bypass infrastructure always,
> not just for VT-d, to do the MSI injection directly from the VFIO
> interrupt handler and bypass the eventfd?  Basically this would add an
> RCU-protected list of consumers matching the token to struct
> irq_bypass_producer, and a
> 
>   int (*inject)(struct irq_bypass_consumer *);
> 
> callback to struct irq_bypass_consumer.  If any callback returns true,
> the eventfd is not signaled.  The KVM implementation would be like this
> (compare with virt/kvm/eventfd.c):
> 
>   /* Extracted out of irqfd_wakeup */
>   static int
>   irqfd_wakeup_pollin(struct kvm_kernel_irqfd *irqfd)
>   {
>   ...
>   }
> 
>   /* Extracted out of irqfd_wakeup */
>   static int
>   irqfd_wakeup_pollhup(struct kvm_kernel_irqfd *irqfd)
>   {
>   ...
>   }
> 
>   static int
>   irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync,
>void *key)
>   {
>   struct _irqfd *irqfd = container_of(wait,
>   struct _irqfd, wait);
>   unsigned long flags = (unsigned long)key;
> 
>   if (flags & POLLIN)
>   irqfd_wakeup_pollin(irqfd);
>   if (flags & POLLHUP)
>   irqfd_wakeup_pollhup(irqfd);
> 
>   return 0;
>   }
> 
>   static int kvm_arch_irq_bypass_inject(
>   struct irq_bypass_consumer *cons)
>   {
>   struct kvm_kernel_irqfd *irqfd =
>   container_of(cons, struct kvm_kernel_irqfd,
>consumer); 
> 
>   irqfd_wakeup_pollin(irqfd);
>   }
> 
This is a good idea IMHO. So for MSI interrupt, the 
kvm_arch_irq_bypass_inject will be used, and the irqfd_wakeup will not be 
invoked anymore, am I right?

I noticed the irq bypass manager is not merged yet, are there any git branch 
for it?

> Or do you think it would be a hack?  The latency improvement might
> actually be even better than what Yunhong is already reporting.

I will be glad to try it.

Thanks
--jyh

> 
> Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Will Deacon
On Wed, Oct 28, 2015 at 10:27:28AM -0600, Alex Williamson wrote:
> On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 57d8c37..13fb974 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> > struct vfio_dma *dma)
> >  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >  {
> > struct vfio_domain *domain;
> > -   unsigned long bitmap = PAGE_MASK;
> > +   unsigned long bitmap = ULONG_MAX;
> 
> Isn't this and removing the WARN_ON()s the only real change in this
> patch?  The rest looks like conversion to use IS_ALIGNED and the
> following test, that I don't really understand...
> 
> >  
> > mutex_lock(>lock);
> > list_for_each_entry(domain, >domain_list, next)
> > @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
> > vfio_iommu *iommu)
> >  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >  struct vfio_iommu_type1_dma_unmap *unmap)
> >  {
> > -   uint64_t mask;
> > struct vfio_dma *dma;
> > size_t unmapped = 0;
> > int ret = 0;
> > +   unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> > +   unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> > +   PAGE_SIZE : min_pagesz;
> 
> This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
> care to cap alignment at PAGE_SIZE?

Eric can clarify, but I think the intention here is to have VFIO continue
doing things in PAGE_SIZE chunks precisely so that we don't have to rework
all of the pinning code etc. The IOMMU API can then deal with the smaller
page size.

> > -   mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > -
> > -   if (unmap->iova & mask)
> > +   if (!IS_ALIGNED(unmap->iova, requested_alignment))
> > return -EINVAL;
> > -   if (!unmap->size || unmap->size & mask)
> > +   if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
> > return -EINVAL;
> >  
> > -   WARN_ON(mask & PAGE_MASK);
> > -
> > mutex_lock(>lock);
> >  
> > /*
> > @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > size_t size = map->size;
> > long npage;
> > int ret = 0, prot = 0;
> > -   uint64_t mask;
> > struct vfio_dma *dma;
> > unsigned long pfn;
> > +   unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> > +   unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> > +   PAGE_SIZE : min_pagesz;
> >  
> > /* Verify that none of our __u64 fields overflow */
> > if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> > return -EINVAL;
> >  
> > -   mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > -
> > -   WARN_ON(mask & PAGE_MASK);
> > -
> > /* READ/WRITE from device perspective */
> > if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> > prot |= IOMMU_WRITE;
> > if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> > prot |= IOMMU_READ;
> >  
> > -   if (!prot || !size || (size | iova | vaddr) & mask)
> > +   if (!prot || !size ||
> > +   !IS_ALIGNED(size | iova | vaddr, requested_alignment))
> > return -EINVAL;
> >  
> > /* Don't allow IOVA or virtual address wrap */
> 
> This is mostly ignoring the problems with sub-PAGE_SIZE mappings.  For
> instance, we can only pin on PAGE_SIZE and therefore we only do
> accounting on PAGE_SIZE, so if the user does 4K mappings across your 64K
> page, that page gets pinned and accounted 16 times.  Are we going to
> tell users that their locked memory limit needs to be 16x now?  The rest
> of the code would need an audit as well to see what other sub-page bugs
> might be hiding.  Thanks,

I don't see that. The pinning all happens the same in VFIO, which can
then happily pass a 64k region to iommu_map. iommu_map will then call
->map in 4k chunks on the IOMMU driver ops.

Will
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 17:14 +, Will Deacon wrote:
> On Wed, Oct 28, 2015 at 10:27:28AM -0600, Alex Williamson wrote:
> > On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > > b/drivers/vfio/vfio_iommu_type1.c
> > > index 57d8c37..13fb974 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> > > struct vfio_dma *dma)
> > >  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > >  {
> > >   struct vfio_domain *domain;
> > > - unsigned long bitmap = PAGE_MASK;
> > > + unsigned long bitmap = ULONG_MAX;
> > 
> > Isn't this and removing the WARN_ON()s the only real change in this
> > patch?  The rest looks like conversion to use IS_ALIGNED and the
> > following test, that I don't really understand...
> > 
> > >  
> > >   mutex_lock(>lock);
> > >   list_for_each_entry(domain, >domain_list, next)
> > > @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
> > > vfio_iommu *iommu)
> > >  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >struct vfio_iommu_type1_dma_unmap *unmap)
> > >  {
> > > - uint64_t mask;
> > >   struct vfio_dma *dma;
> > >   size_t unmapped = 0;
> > >   int ret = 0;
> > > + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> > > + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> > > + PAGE_SIZE : min_pagesz;
> > 
> > This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
> > care to cap alignment at PAGE_SIZE?
> 
> Eric can clarify, but I think the intention here is to have VFIO continue
> doing things in PAGE_SIZE chunks precisely so that we don't have to rework
> all of the pinning code etc. The IOMMU API can then deal with the smaller
> page size.

Gak, I read this wrong.  So really we're just artificially adding
PAGE_SIZE as a supported IOMMU size so long as the IOMMU support
something smaller than PAGE_SIZE, where PAGE_SIZE is obviously a
multiple of that smaller size.  Ok, but should we just do this once in
vfio_pgsize_bitmap()?  This is exactly why VT-d just reports ~(4k - 1)
for the iommu bitmap.

> > > - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > > -
> > > - if (unmap->iova & mask)
> > > + if (!IS_ALIGNED(unmap->iova, requested_alignment))
> > >   return -EINVAL;
> > > - if (!unmap->size || unmap->size & mask)
> > > + if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
> > >   return -EINVAL;
> > >  
> > > - WARN_ON(mask & PAGE_MASK);
> > > -
> > >   mutex_lock(>lock);
> > >  
> > >   /*
> > > @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >   size_t size = map->size;
> > >   long npage;
> > >   int ret = 0, prot = 0;
> > > - uint64_t mask;
> > >   struct vfio_dma *dma;
> > >   unsigned long pfn;
> > > + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> > > + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> > > + PAGE_SIZE : min_pagesz;
> > >  
> > >   /* Verify that none of our __u64 fields overflow */
> > >   if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> > >   return -EINVAL;
> > >  
> > > - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > > -
> > > - WARN_ON(mask & PAGE_MASK);
> > > -
> > >   /* READ/WRITE from device perspective */
> > >   if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> > >   prot |= IOMMU_WRITE;
> > >   if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> > >   prot |= IOMMU_READ;
> > >  
> > > - if (!prot || !size || (size | iova | vaddr) & mask)
> > > + if (!prot || !size ||
> > > + !IS_ALIGNED(size | iova | vaddr, requested_alignment))
> > >   return -EINVAL;
> > >  
> > >   /* Don't allow IOVA or virtual address wrap */
> > 
> > This is mostly ignoring the problems with sub-PAGE_SIZE mappings.  For
> > instance, we can only pin on PAGE_SIZE and therefore we only do
> > accounting on PAGE_SIZE, so if the user does 4K mappings across your 64K
> > page, that page gets pinned and accounted 16 times.  Are we going to
> > tell users that their locked memory limit needs to be 16x now?  The rest
> > of the code would need an audit as well to see what other sub-page bugs
> > might be hiding.  Thanks,
> 
> I don't see that. The pinning all happens the same in VFIO, which can
> then happily pass a 64k region to iommu_map. iommu_map will then call
> ->map in 4k chunks on the IOMMU driver ops.

Yep, I see now that this isn't doing sub-page mappings.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Will Deacon
On Wed, Oct 28, 2015 at 06:48:41PM +0100, Eric Auger wrote:
> On 10/28/2015 06:37 PM, Alex Williamson wrote:
> > Ok, so with hopefully correcting my understand of what this does, isn't
> > this effectively the same:
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 57d8c37..7db4f5a 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -403,13 +403,19 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> > stru
> >  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >  {
> > struct vfio_domain *domain;
> > -   unsigned long bitmap = PAGE_MASK;
> > +   unsigned long bitmap = ULONG_MAX;
> >  
> > mutex_lock(>lock);
> > list_for_each_entry(domain, >domain_list, next)
> > bitmap &= domain->domain->ops->pgsize_bitmap;
> > mutex_unlock(>lock);
> >  
> > +   /* Some comment about how the IOMMU API splits requests */
> > +   if (bitmap & ~PAGE_MASK) {
> > +   bitmap &= PAGE_MASK;
> > +   bitmap |= PAGE_SIZE;
> > +   }
> > +
> > return bitmap;
> >  }
> Yes, to me it is indeed the same
> >  
> > This would also expose to the user that we're accepting PAGE_SIZE, which
> > we weren't before, so it was not quite right to just let them do it
> > anyway.  I don't think we even need to get rid of the WARN_ONs, do we?
> > Thanks,
> 
> The end-user might be afraid of those latter. Personally I would get rid
> of them but that's definitively up to you.

I think Alex's point is that the WARN_ON's won't trigger with this patch,
because he clears those lower bits in the bitmap.

Will
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >