Re: [PATCH 2/7] Initial skeleton of VFIO support for Device Tree based devices

2013-10-29 Thread Don Dutile

On 10/29/2013 07:47 AM, Alex Williamson wrote:

On Mon, 2013-10-28 at 21:29 -0400, Don Dutile wrote:

On 09/30/2013 11:37 AM, Bhushan Bharat-R65777 wrote:




-Original Message-
From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
boun...@lists.linux-foundation.org] On Behalf Of Antonios Motakis
Sent: Monday, September 30, 2013 8:59 PM
To: kvm...@lists.cs.columbia.edu; alex.william...@redhat.com
Cc: linux-samsung-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Yoder
Stuart-B08248; io...@lists.linux-foundation.org; Antonios Motakis;
t...@virtualopensystems.com
Subject: [PATCH 2/7] Initial skeleton of VFIO support for Device Tree based
devices

Platform devices in the Linux kernel are usually managed by the DT interface.
This patch forms the base to support these kind of devices with VFIO.

Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
---
   drivers/vfio/Kconfig |  11 +++
   drivers/vfio/Makefile|   1 +
   drivers/vfio/vfio_platform.c | 187 +++
   include/uapi/linux/vfio.h|   1 +
   4 files changed, 200 insertions(+)
   create mode 100644 drivers/vfio/vfio_platform.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 1f84eda..35254b7
100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -13,4 +13,15 @@ menuconfig VFIO

  If you don't know what to do here, say N.

+config VFIO_PLATFORM
+   tristate "VFIO support for device tree based platform devices"
+   depends on VFIO && EVENTFD && OF
+   help
+ Support for platform devices with VFIO. This is required to make
+ use of platform devices present on device tree nodes using the VFIO
+ framework. Devices that are not described in the device tree cannot
+ be used by this driver.
+
+ If you don't know what to do here, say N.
+
   source "drivers/vfio/pci/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
2398d4a..575c8dd 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
   obj-$(CONFIG_VFIO) += vfio.o
   obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
   obj-$(CONFIG_VFIO_PCI) += pci/
+obj-$(CONFIG_VFIO_PLATFORM) += vfio_platform.o
diff --git a/drivers/vfio/vfio_platform.c b/drivers/vfio/vfio_platform.c new


We can make this parallel to PCI, something like 
drivers/vfio/platform/platform.c


pls, no.  'platform' is too generic, and it really means 'arm-dt' ... so can
move it to the arch/arm space, and have its kconfig conditional on ARM && VFIO.
if kept under drivers/vfio, then use a better directory name that ties it to 
arm-dt.
thanks.


The intention is that vfio platform device support is not arm-dt
specific.  This is to be used by both arm and embedded ppc.  The devices
we intend to support with them are known as platform drivers in the
kernel, thus the name.  I suppose the question remains whether the
interface here is really generic for any platform device or whether
we're making an interface specifically for device
tree platform devices, or if those are one and the same.  In any case,
arm-dt is certainly not the answer.

Alex


I thought that was the intention until I saw this use in platform.c:
static const struct of_device_id vfio_platform_match[] = {

So, of_device_id hit me as DT-specific, and thus, the file should have
a name that implies it's not as platform-generic as one may want/expect.
I agree that the 'arm' part can be dropped, but it's not of-agnostic atm.



file mode 100644 index 000..b9686b0
--- /dev/null
+++ b/drivers/vfio/vfio_platform.c
@@ -0,0 +1,187 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis <a.mota...@virtualopensystems.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Antonios Motakis <a.mota...@virtualopensystems.com>"
+#define DRIVER_DESC "VFIO Device Tree devices - User Level meta-driver"
+
+struct vfio_platform_device {
+   struct platform_device  *pdev;
+};
+
+static

Re: [PATCH 1/7] VFIO_IOMMU_TYPE1 workaround to build for platform devices

2013-10-28 Thread Don Dutile

On 10/02/2013 08:14 AM, Alexander Graf wrote:


On 01.10.2013, at 21:21, Yoder Stuart-B08248 wrote:


static int __init vfio_iommu_type1_init(void)
{
-   if (!iommu_present(&pci_bus_type))
+#ifdef CONFIG_PCI
+   if (iommu_present(&pci_bus_type)) {
+   iommu_bus_type = &pci_bus_type;
+   /* For PCI targets, IOMMU_CAP_INTR_REMAP is required */
+   require_cap_intr_remap = true;
+   }
+#endif
+   if (!iommu_bus_type && iommu_present(&platform_bus_type))
+   iommu_bus_type = &platform_bus_type;
+
+   if (!iommu_bus_type)
   return -ENODEV;

   return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);


Is it possible to have a system with both PCI and platform devices?  How
would you support that?  Thanks,


It most certainly is a requirement to support both.  This is how
all of our (FSL) SoCs will expect to work.


I thought the PCI bus emits a cookie that the system wide IOMMU can then use to 
differentiate the origin of the transaction? So the same IOMMU can be used for 
PCI as well as platform routing.


*can* be the same IOMMU, yes;
have to, no, so there can be multiple IOMMUs of different types.



Alex

___
iommu mailing list
io...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/7] Initial skeleton of VFIO support for Device Tree based devices

2013-10-28 Thread Don Dutile

On 09/30/2013 11:37 AM, Bhushan Bharat-R65777 wrote:




-Original Message-
From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
boun...@lists.linux-foundation.org] On Behalf Of Antonios Motakis
Sent: Monday, September 30, 2013 8:59 PM
To: kvm...@lists.cs.columbia.edu; alex.william...@redhat.com
Cc: linux-samsung-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de; Yoder
Stuart-B08248; io...@lists.linux-foundation.org; Antonios Motakis;
t...@virtualopensystems.com
Subject: [PATCH 2/7] Initial skeleton of VFIO support for Device Tree based
devices

Platform devices in the Linux kernel are usually managed by the DT interface.
This patch forms the base to support these kind of devices with VFIO.

Signed-off-by: Antonios Motakis <a.mota...@virtualopensystems.com>
---
  drivers/vfio/Kconfig |  11 +++
  drivers/vfio/Makefile|   1 +
  drivers/vfio/vfio_platform.c | 187 +++
  include/uapi/linux/vfio.h|   1 +
  4 files changed, 200 insertions(+)
  create mode 100644 drivers/vfio/vfio_platform.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 1f84eda..35254b7
100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -13,4 +13,15 @@ menuconfig VFIO

  If you don't know what to do here, say N.

+config VFIO_PLATFORM
+   tristate "VFIO support for device tree based platform devices"
+   depends on VFIO && EVENTFD && OF
+   help
+ Support for platform devices with VFIO. This is required to make
+ use of platform devices present on device tree nodes using the VFIO
+ framework. Devices that are not described in the device tree cannot
+ be used by this driver.
+
+ If you don't know what to do here, say N.
+
  source "drivers/vfio/pci/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
2398d4a..575c8dd 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
  obj-$(CONFIG_VFIO) += vfio.o
  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
  obj-$(CONFIG_VFIO_PCI) += pci/
+obj-$(CONFIG_VFIO_PLATFORM) += vfio_platform.o
diff --git a/drivers/vfio/vfio_platform.c b/drivers/vfio/vfio_platform.c new


We can make this parallel to PCI, something like 
drivers/vfio/platform/platform.c


pls, no.  'platform' is too generic, and it really means 'arm-dt' ... so can
move it to the arch/arm space, and have its kconfig conditional on ARM && VFIO.
if kept under drivers/vfio, then use a better directory name that ties it to 
arm-dt.
thanks.
- Don



-Bharat


file mode 100644 index 000..b9686b0
--- /dev/null
+++ b/drivers/vfio/vfio_platform.c
@@ -0,0 +1,187 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis <a.mota...@virtualopensystems.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Antonios Motakis <a.mota...@virtualopensystems.com>"
+#define DRIVER_DESC "VFIO Device Tree devices - User Level meta-driver"
+
+struct vfio_platform_device {
+   struct platform_device  *pdev;
+};
+
+static void vfio_platform_release(void *device_data)
+{
+   module_put(THIS_MODULE);
+}
+
+static int vfio_platform_open(void *device_data)
+{
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   return 0;
+}
+
+static long vfio_platform_ioctl(void *device_data,
+  unsigned int cmd, unsigned long arg)
+{
+   struct vfio_platform_device *vdev = device_data;
+   unsigned long minsz;
+
+   if (cmd == VFIO_DEVICE_GET_INFO) {
+   struct vfio_device_info info;
+
+   minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+   if (copy_from_user(&info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz < minsz)
+   return -EINVAL;
+
+   info.flags = VFIO_DEVICE_FLAGS_PLATFORM;
+   info.num_regions = 0;
+   info.num_irqs = 0;
+
+   return copy_to_user((void __user 

Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-07-23 Thread Don Dutile

On 06/26/2013 03:03 PM, Alex Williamson wrote:

On Mon, 2013-06-24 at 11:43 -0600, Bjorn Helgaas wrote:

On Wed, Jun 19, 2013 at 6:43 AM, Don Dutileddut...@redhat.com  wrote:

On 06/18/2013 10:52 PM, Bjorn Helgaas wrote:


On Tue, Jun 18, 2013 at 5:03 PM, Don Dutileddut...@redhat.com   wrote:


On 06/18/2013 06:22 PM, Alex Williamson wrote:



On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:



On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
alex.william...@redhat.comwrote:



On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:



On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:


...


Who do you expect to decide whether to use this option?  I think it
requires intimate knowledge of how the device works.

I think the benefit of using the option is that it makes assignment
of
devices to guests more flexible, which will make it attractive to
users.
But most users have no way of knowing whether it's actually *safe* to
use this.  So I worry that you're adding an easy way to pretend
isolation
exists when there's no good way of being confident that it actually
does.




...




I wonder if we should taint the kernel if this option is used (but not
for specific devices added to pci_dev_acs_enabled[]).  It would also
be nice if pci_dev_specific_acs_enabled() gave some indication in
dmesg for the specific devices you're hoping to add to
pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
so I'm not sure how we'd limit it to one message per device.



Right, setup vs use and getting single prints is a lot of extra code.
Tainting is troublesome for support, Don had some objections when I
suggested the same to him.


For RH GSS (Global Support Services), a 'taint' in the kernel printk
means
RH doesn't support that system.  The 'non-support' due to 'taint' being
printed
out in this case may be incorrect -- RH may support that use, at least
until
a more sufficient patched kernel is provided.
Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
'unleashed dog afoot' sure...



So ...  that's really a RH-specific support issue, and easily worked
around by RH adding a patch that turns off tainting.


sure. what's another patch to the thousands... :-/


It still sounds like a good idea to me for upstream, where use of this
option can very possibly lead to corruption or information leakage
between devices the user claimed were isolated, but in fact were not.


Did I miss something?  This patch provides a user-level/chosen override;
like all other overrides, (pci=realloc, etc.), it can lead to a failing
system.
IMO, this patch is no different.  If you want to tag this patch with taint,
then let's audit all the (PCI) overrides and taint them appropriately.
Taint should be reserved to changes to the kernel that were done outside
the development of the kernel, or with the explicit intent to circumvent
the normal operation of the kernel.  This patch provides a way to enable
ACS checking to succeed when the devices have not provided sufficiently
complete
ACS information.  i.e., it's a growth path for PCIe-ACS and its need for
proper support.


We're telling the kernel to assume something (the hardware provides
protection) that may not be true.  If that assumption turns out to be
false, the result is that a VM can be crashed or comprised by another
VM.

One difference I see is that this override can lead to a crash that
looks like random memory corruption and has no apparent connection to
the actual cause.  Most other overrides won't cause run-time crashes
(I think they're more likely to cause boot or device configuration
failures), and the dmesg log will probably have good clues as to the
reason.

But the possibility of compromise is probably even more serious,
because there would be no crash at all, and we'd have no indication
that VM A read or corrupted data in VM B.  I'm very concerned about
that, enough so that it's not clear to me that an override belongs in
the upstream kernel at all.

Yes, that would mean some hardware is not suitable for device
assignment.  That just sounds like if hardware manufacturers do their
homework and support ACS properly, their hardware is more useful for
virtualization than other hardware.  I don't see the problem with
that.


That's easy to say for someone that doesn't get caught trying to explain
this to users over and over.  In many cases devices don't do
peer-to-peer and missing ACS is an oversight.  I imagine that quite a
few vendors also see the ACS capability as a means to allow control of
ACS and therefore see it as a much larger investment than just providing
an empty ACS structure in config space to indicate the lack of
peer-to-peer.


+1 -- A good explanation why the workaround is needed.
In fact, until recently, we didn't understand that an empty ACS meant
'no p2p' and not just an empty ACS with no control to disable p2p. :-/


Even if we taint the kernel when this is enabled and add extremely
verbose 

Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-19 Thread Don Dutile

On 06/18/2013 10:52 PM, Bjorn Helgaas wrote:

On Tue, Jun 18, 2013 at 5:03 PM, Don Dutileddut...@redhat.com  wrote:

On 06/18/2013 06:22 PM, Alex Williamson wrote:


On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:


On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
alex.william...@redhat.com   wrote:


On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:


On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:

...

Who do you expect to decide whether to use this option?  I think it
requires intimate knowledge of how the device works.

I think the benefit of using the option is that it makes assignment of
devices to guests more flexible, which will make it attractive to
users.
But most users have no way of knowing whether it's actually *safe* to
use this.  So I worry that you're adding an easy way to pretend
isolation
exists when there's no good way of being confident that it actually
does.



...



I wonder if we should taint the kernel if this option is used (but not
for specific devices added to pci_dev_acs_enabled[]).  It would also
be nice if pci_dev_specific_acs_enabled() gave some indication in
dmesg for the specific devices you're hoping to add to
pci_dev_acs_enabled[].  It's not an enumeration-time quirk right now,
so I'm not sure how we'd limit it to one message per device.


Right, setup vs use and getting single prints is a lot of extra code.
Tainting is troublesome for support, Don had some objections when I
suggested the same to him.


For RH GSS (Global Support Services), a 'taint' in the kernel printk means
RH doesn't support that system.  The 'non-support' due to 'taint' being
printed
out in this case may be incorrect -- RH may support that use, at least until
a more sufficient patched kernel is provided.
Thus my dissension that 'taint' be output.  WARN is ok. 'driver beware',
'unleashed dog afoot' sure...


So ...  that's really a RH-specific support issue, and easily worked
around by RH adding a patch that turns off tainting.


sure. what's another patch to the thousands... :-/


It still sounds like a good idea to me for upstream, where use of this
option can very possibly lead to corruption or information leakage
between devices the user claimed were isolated, but in fact were not.

Bjorn

Did I miss something?  This patch provides a user-level/chosen override;
like all other overrides, (pci=realloc, etc.), it can lead to a failing system.
IMO, this patch is no different.  If you want to tag this patch with taint,
then let's audit all the (PCI) overrides and taint them appropriately.
Taint should be reserved to changes to the kernel that were done outside
the development of the kernel, or with the explicit intent to circumvent
the normal operation of the kernel.  This patch provides a way to enable
ACS checking to succeed when the devices have not provided sufficiently complete
ACS information.  i.e., it's a growth path for PCIe-ACS and its need for
proper support.


--
To unsubscribe from this list: send the line unsubscribe linux-pci in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-18 Thread Don Dutile

On 06/18/2013 06:22 PM, Alex Williamson wrote:

On Tue, 2013-06-18 at 15:31 -0600, Bjorn Helgaas wrote:

On Tue, Jun 18, 2013 at 12:20 PM, Alex Williamson
alex.william...@redhat.com  wrote:

On Tue, 2013-06-18 at 11:28 -0600, Bjorn Helgaas wrote:

On Thu, May 30, 2013 at 12:40:19PM -0600, Alex Williamson wrote:

PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
allows us to control whether transactions are allowed to be redirected
in various subnodes of a PCIe topology.  For instance, if two
endpoints are below a root port or downstream switch port, the
downstream port may optionally redirect transactions between the
devices, bypassing upstream devices.  The same can happen internally
on multifunction devices.  The transaction may never be visible to the
upstream devices.

One upstream device that we particularly care about is the IOMMU.  If
a redirection occurs in the topology below the IOMMU, then the IOMMU
cannot provide isolation between devices.  This is why the PCIe spec
encourages topologies to include ACS support.  Without it, we have to
assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

Unfortunately, far too many topologies do not support ACS to make this
a steadfast requirement.  Even the latest chipsets from Intel are only
sporadically supporting ACS.  We have trouble getting interconnect
vendors to include the PCIe spec required PCIe capability, let alone
suggested features.

Therefore, we need to add some flexibility.  The pcie_acs_override=
boot option lets users opt-in specific devices or sets of devices to
assume ACS support.


ACS support means the device provides an ACS capability and we
can change bits in the ACS Control Register to control how things
work.

You are really adding a way to assume this device always routes
peer-to-peer DMA upstream, which ultimately translates into assume
this device can be isolated from others via the IOMMU.  I think.


We've heard from one vendor that they support ACS with a NULL capability
structure, ie. the ACS PCIe capability exists, but reports no ACS
capabilities and allows no control.  This takes advantage of the if
supported style wording of the spec to imply that peer-to-peer is not
supported because the capability is not available.  This supported but
not controllable version is what we're trying to enable here.


I was wrong to say we can change bits in the control register.  All
the control register bits are actually required to be hardwired to
zero when the relevant functionality is not implemented.

The example you mention (a device with an ACS capability structure
that reports no supported capabilities and allows no control) sounds
perfectly legal as a description of a device that doesn't support
peer-to-peer, and I hope it doesn't require any user intervention
(e.g., this patch) or quirks to make it work.


It requires:

Subject: [PATCH v2 2/2] pci: Differentiate ACS controllable from enabled


The ACS P2P Request Redirect must be implemented by Functions that
support peer-to-peer traffic with other Functions.  This example
device doesn't support peer-to-peer traffic, so why would it implement
the ACS R bit?  In fact, it looks like the R bit (and all the other
bits) *must* be hardwired to zero in both capability and control
registers.


Right, if they don't support peer-to-peer then hardwiring capability and
control to zero should indicate that and is fixed by the patch
referenced above.


The "downstream" option assumes full ACS support
on root ports and downstream switch ports.  The "multifunction"
option assumes the subset of ACS features available on multifunction
endpoints and upstream switch ports are supported.  The "id:<vid>:<did>"
option enables ACS support on devices matching the provided vendor
and device IDs, allowing more strategic ACS overrides.  These options
may be combined in any order.  A maximum of 16 id specific overrides
are available.  It's suggested to use the most limited set of options
necessary to avoid completely disabling ACS across the topology.


Probably I'm confused by your use of assume full ACS support,


[on root ports and downstream ports]


  but I
don't understand the bit about completely disabling ACS.


[across the topology]


   If we use
more options than necessary, it seems like we'd assume more isolation
than really exists.  That sounds like pretending ACS is *enabled* in
more places than it really is.


Exactly.  I'm just trying to suggest that booting with
pcie_acs_override=downstream,multifunction is not generally recommended
because it effectively disables ACS checking across the topology and
assumes isolation where there may be none.  In other words, if
everything is overriding ACS checks, then we've disabled any benefit of
doing the checking in the first place (so I really mean disable
checking, not disable ACS).


Yep, the missing checking is what is confusing.  Also, I think it
would be good to make the implication more explicit -- using this
option makes the 

Re: [PATCH] pci: Enable overrides for missing ACS capabilities

2013-06-17 Thread Don Dutile

On 05/30/2013 02:40 PM, Alex Williamson wrote:

PCIe ACS (Access Control Services) is the PCIe 2.0+ feature that
allows us to control whether transactions are allowed to be redirected
in various subnodes of a PCIe topology.  For instance, if two
endpoints are below a root port or downstream switch port, the
downstream port may optionally redirect transactions between the
devices, bypassing upstream devices.  The same can happen internally
on multifunction devices.  The transaction may never be visible to the
upstream devices.

One upstream device that we particularly care about is the IOMMU.  If
a redirection occurs in the topology below the IOMMU, then the IOMMU
cannot provide isolation between devices.  This is why the PCIe spec
encourages topologies to include ACS support.  Without it, we have to
assume peer-to-peer DMA within a hierarchy can bypass IOMMU isolation.

Unfortunately, far too many topologies do not support ACS to make this
a steadfast requirement.  Even the latest chipsets from Intel are only
sporadically supporting ACS.  We have trouble getting interconnect
vendors to include the PCIe spec required PCIe capability, let alone
suggested features.

Therefore, we need to add some flexibility.  The pcie_acs_override=
boot option lets users opt-in specific devices or sets of devices to
assume ACS support.  The "downstream" option assumes full ACS support
on root ports and downstream switch ports.  The "multifunction"
option assumes the subset of ACS features available on multifunction
endpoints and upstream switch ports are supported.  The "id:<vid>:<did>"
option enables ACS support on devices matching the provided vendor
and device IDs, allowing more strategic ACS overrides.  These options
may be combined in any order.  A maximum of 16 id specific overrides
are available.  It's suggested to use the most limited set of options
necessary to avoid completely disabling ACS across the topology.

Note to hardware vendors, we have facilities to permanently quirk
specific devices which enforce isolation but not provide an ACS
capability.  Please contact me to have your devices added and save
your customers the hassle of this boot option.

Signed-off-by: Alex Williamson <alex.william...@redhat.com>
---
  Documentation/kernel-parameters.txt |   10 +++
  drivers/pci/quirks.c|  102 +++
  2 files changed, 112 insertions(+)



Feel free to add my ack.

I like the fact that all of this code is in quirks, and not sprinkled btwn pci core
& quirks.
As we have discovered, even common, ACS-compatible x86 chipsets are out there
w/o ACS caps.
Additionally, an unclear area of the spec has some vendors providing 'null ACS' 
caps,
which is supposed to be interpreted as no-peer-to-peer DMA; this latter
'feature' is being
brought up to PCI-SIG to get a note added to SRIOV spec in the ACS-related 
material to clarify.


diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 47bb23c..a60e6ad 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2349,6 +2349,16 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
nomsi   Do not use MSI for native PCIe PME signaling (this makes
all PCIe root ports use INTx for all services).

+   pcie_acs_override=
+   [PCIE] Override missing PCIe ACS support for:
+   downstream
+   All downstream ports - full ACS capabilities
+   multifunction
+   All multifunction devices - multifunction ACS subset
+   id:<vid>:<did>
+   Specific device - full ACS capabilities
+   Specified as vid:did (vendor/device ID) in hex
+
pcmv=   [HW,PCMCIA] BadgePAD 4

pd. [PARIDE]
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 0369fb6..c7609f6 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3292,11 +3292,113 @@ struct pci_dev *pci_get_dma_source(struct pci_dev *dev)
return pci_dev_get(dev);
  }

+static bool acs_on_downstream;
+static bool acs_on_multifunction;
+
+#define NUM_ACS_IDS 16
+struct acs_on_id {
+   unsigned short vendor;
+   unsigned short device;
+};

At first, I wasn't sure if vid/did would be sufficient, but convinced myself
it is: even if a variant with a differing sub-vid/did added/modified an
ACS cap, the ACS-cap-existence test in pcie_acs_overrides() will not exercise
any override for that variant, so both cases -- newer/sub-vid modified devices
w/ACS, and vid/did w/o ACS -- will be handled properly.
So, another reason for the ack.


+static struct acs_on_id acs_on_ids[NUM_ACS_IDS];
+static u8 max_acs_id;
+
+static __init int pcie_acs_override_setup(char *p)
+{
+   if (!p)
+   return -EINVAL;
+
+   while (*p) {
+   if (!strncmp(p, "downstream", 10))
+  

Re: virtio PCI on KVM without IO BARs

2013-04-29 Thread Don Dutile

On 02/28/2013 10:24 AM, Michael S. Tsirkin wrote:

OK we talked about this a while ago, here's
a summary and some proposals:
At the moment, virtio PCI uses IO BARs for all accesses.

The reason for IO use is the cost of different VM exit types
of transactions and their emulation on KVM on x86
(it would be trivial to use memory BARs on non x86 platforms
  if they don't have PIO).
Example benchmark (cycles per transaction):
(io access) outw 1737
(memory access) movw 4341
for comparison:
(hypercall access): vmcall 1566
(pv memory access) movw_fast 1817 (*explanation what this is below)

This creates a problem if we want to make virtio devices
proper PCI express devices with native hotplug support.
This is because each hotpluggable PCI express device always has
a PCI express port (port per device),
where each port is represented by a PCI to PCI bridge.
In turn, a PCI to PCI bridge claims a 4Kbyte aligned
range of IO addresses. This means that we can have at
most 15 such devices, this is a nasty limitation.

Another problem with PIO is support for physical virtio devices,
and nested virt: KVM currently programs all PIO accesses
to cause vm exit, so using this device in a VM will be slow.

So we really want to stop using IO BARs completely if at all possible,
but looking at the table above, switching to memory BAR and movw for
notifications will not work well.

Possible solutions:
1. hypercall instead of PIO
basically add a hypercall that gets an MMIO address/data
and does an MMIO write for us.
We'll want some capability in the device to let guest know
this is what it should do.
Pros: even faster than PIO
Cons: this won't help nested or assigned devices (won't hurt
  them either as it will be conditional on the capability above).
Cons: need host kernel support, which then has to be maintained
  forever, even if intel speeds up MMIO exits.

2. pv memory access
There are two reasons that memory access is slower:
- one is that it's handled as an EPT misconfiguration error
so handled by cpu slow path
- one is that we need to decode the x86 instruction in
software, to calculate address/data for the access.

We could agree that guests would use a specific instruction
for virtio accesses, and fast-path it specifically.
This is the pv memory access option above.
Pros: helps assigned devices and nested virt
Pros: easy to drop if hardware support is there
Cons: a bit slower than IO
Cons: need host kernel support

3. hypervisor assigned IO address
qemu can reserve IO addresses and assign to virtio devices.
2 bytes per device (for notification and ISR access) will be
enough. So we can reserve 4K and this gets us 2000 devices.
From KVM's perspective, nothing changes.
We'll want some capability in the device to let guest know
this is what it should do, and pass the io address.
One way to reserve the addresses is by using the bridge.
Pros: no need for host kernel support
Pros: regular PIO so fast
Cons: does not help assigned devices, breaks nested virt

Simply counting pros/cons, option 3 seems best. It's also the
easiest to implement.

Comments?


apologies for late response...

It seems that solution 1 would be the best option for the following reasons:

a) (nearly?) every virt technology out there (xen, kvm, vmware, hyperv)
   has pv drivers in the major OS's using virt (Windows, Linux),
   so having a hypercall table searched, initialized and used for
   fast virtio register access is trivially simple to do.

b) the support can be added with whatever pvdriver set is provided
   w/o impacting OS core support.

c) it's architecture neutral, or can be made architecture neutral.
   e.g., inb/outb & PCI ioport support is very different between x86 & non-x86.
   A hypercall interface would not have that dependency/difference.
 
d) it doesn't require new OS support in std/core areas for
   new standard(s), as another thread proposed; this kind of approach
   has a long time delay to get defined & implemented across OS's.
   In contrast, a hypercall-defined interface can be independent of standards
   bodies, and if built into a pvdriver core, can change and/or adapt rapidly,
   and have additional i/f mechanisms for version-levels, which enables
   cross-hypervisor(-version) migration.

e) the hypercall can be extended to do pv-specific hot add/remove,
   eliminating dependencies on emulation support of ACPI-hp or PCIe-hp,
   and simply(?) tracking core interfaces for hot-plug of (class) devices.

f) For migration, hypercall interfaces could be extended for better/faster
   migration as well (suspend/resume pv device).
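To make (a) and (d) above concrete, here is a minimal sketch of what a pvdriver-side hypercall table might look like. Every name here is hypothetical, and the actual vmcall/vmmcall instruction is replaced with a plain function pointer so the sketch stays self-contained:

```c
#include <stdint.h>

/* Hypothetical op numbers; a real table would be searched/negotiated
 * with the hypervisor at pvdriver-core init time, and the version
 * field is what enables cross-hypervisor(-version) migration checks. */
enum pv_op { PV_OP_REG_WRITE, PV_OP_REG_READ, PV_OP_MAX };

struct pv_hcall_table {
    uint32_t version;
    long (*op[PV_OP_MAX])(uint64_t addr, uint64_t data);
};

/* Stub standing in for a real vmcall/vmmcall instruction. */
static uint64_t last_addr, last_data;
static long stub_reg_write(uint64_t addr, uint64_t data)
{
    last_addr = addr;
    last_data = data;
    return 0;
}

static struct pv_hcall_table hcalls = {
    .version = 1,
    .op = { [PV_OP_REG_WRITE] = stub_reg_write },
};

/* Fast virtio register access: one indirect call into the table,
 * instead of a trapped PIO/MMIO instruction the host must decode. */
static long pv_reg_write(uint64_t addr, uint64_t data)
{
    return hcalls.op[PV_OP_REG_WRITE](addr, data);
}
```

The point of the table indirection is exactly (d): the op layout can grow or change per version without touching core OS code, only the pvdriver set.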

my (late) 5 cents (I'll admit it was more than 2 cents)... Don


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: DMAR and DRHD errors[DMAR:[fault reason 06] PTE Read access is not set] Vt-d intel_iommu

2012-12-17 Thread Don Dutile

On 12/15/2012 05:16 PM, Robert Hancock wrote:

On 12/14/2012 03:32 PM, Don Dutile wrote:

On 12/13/2012 04:50 AM, Jason Gao wrote:

Dear List:

Description of problem:
After installing CentOS 6.3 (RHEL 6.3) on my Dell R710 (latest
BIOS: Version 6.3.0, Release Date 07/24/2012) server and updating to the
latest kernel 2.6.32-279.14.1.el6.x86_64, I want to use the Intel
82576 ET Dual Port NIC's SR-IOV feature, assigning VFs to KVM guests.

I appended the kernel boot parameter intel_iommu=on; after boot I see the
following messages:

Dec 13 16:58:15 2 kernel: DRHD: handling fault status reg 2
Dec 13 16:58:15 2 kernel: DMAR:[DMA Read] Request device [03:00.0]
fault addr ffe65000
Dec 13 16:58:15 2 kernel: DMAR:[fault reason 06] PTE Read access is
not set
Dec 13 16:58:15 2 kernel: DRHD: handling fault status reg 102
Dec 13 16:58:15 2 kernel: DMAR:[DMA Read] Request device [03:00.0]
fault addr ffe8a000
Dec 13 16:58:15 2 kernel: DMAR:[fault reason 06] PTE Read access is
not set
Dec 13 16:58:15 2 kernel: scsi 0:0:32:0: Enclosure DP
BACKPLANE 1.07 PQ: 0 ANSI: 5
Dec 13 16:58:15 2 kernel: DRHD: handling fault status reg 202
Dec 13 16:58:15 2 kernel: DMAR:[DMA Read] Request device [03:00.0]
fault addr ffe89000
Dec 13 16:58:15 2 kernel: DMAR:[fault reason 06] PTE Read access is
not set

full dmesg detail:
http://pastebin.com/BzFQV0jU
lspci -vvv full detail:
http://pastebin.com/9rP2d1br


It's a production server, and I'm not sure if this is a critical
problem or how to fix it; any help would be greatly appreciated.


The DMAR table does not have an entry for this device to this region.
Once the driver reconfigs/resets the device to stop polling the BIOS-boot
cmd rings and use the (new) OS (DMA-mapped) rings, there's a period of time
during this transition when the hw is babbling away to an area that is no
longer mapped.


Maybe some kind of boot PCI quirk is needed to stop the device DMA activity 
before enabling the IOMMU?


No, lack of a *proper* RMRR for this device is the source of the problem;
that's why RMRRs exist -- so this transition state does not cause these
types of problems.




Re: DMAR and DRHD errors[DMAR:[fault reason 06] PTE Read access is not set] Vt-d intel_iommu

2012-12-14 Thread Don Dutile

On 12/13/2012 04:50 AM, Jason Gao wrote:

Dear List:

Description of problem:
After installing CentOS 6.3 (RHEL 6.3) on my Dell R710 (latest
BIOS: Version 6.3.0, Release Date 07/24/2012) server and updating to the
latest kernel 2.6.32-279.14.1.el6.x86_64, I want to use the Intel
82576 ET Dual Port NIC's SR-IOV feature, assigning VFs to KVM guests.

I appended the kernel boot parameter intel_iommu=on; after boot I see the
following messages:

Dec 13 16:58:15 2 kernel: DRHD: handling fault status reg 2
Dec 13 16:58:15 2 kernel: DMAR:[DMA Read] Request device [03:00.0]
fault addr ffe65000
Dec 13 16:58:15 2 kernel: DMAR:[fault reason 06] PTE Read access is not set
Dec 13 16:58:15 2 kernel: DRHD: handling fault status reg 102
Dec 13 16:58:15 2 kernel: DMAR:[DMA Read] Request device [03:00.0]
fault addr ffe8a000
Dec 13 16:58:15 2 kernel: DMAR:[fault reason 06] PTE Read access is not set
Dec 13 16:58:15 2 kernel: scsi 0:0:32:0: Enclosure DP
BACKPLANE1.07 PQ: 0 ANSI: 5
Dec 13 16:58:15 2 kernel: DRHD: handling fault status reg 202
Dec 13 16:58:15 2 kernel: DMAR:[DMA Read] Request device [03:00.0]
fault addr ffe89000
Dec 13 16:58:15 2 kernel: DMAR:[fault reason 06] PTE Read access is not set

full dmesg detail:
http://pastebin.com/BzFQV0jU
lspci -vvv full detail:
http://pastebin.com/9rP2d1br


It's a production server, and I'm not sure if this is a critical
problem or how to fix it; any help would be greatly appreciated.


The DMAR table does not have an entry for this device to this region.
Once the driver reconfigs/resets the device to stop polling the BIOS-boot
cmd rings and use the (new) OS (DMA-mapped) rings, there's a period of time
during this transition when the hw is babbling away to an area that is no
longer mapped.



Re: DMAR and DRHD errors[DMAR:[fault reason 06] PTE Read access is not set] Vt-d intel_iommu

2012-12-14 Thread Don Dutile

On 12/13/2012 09:01 PM, Jason Gao wrote:

On Fri, Dec 14, 2012 at 12:23 AM, Alex Williamson
alex.william...@redhat.com  wrote:


Device 03:00.0 is your raid controller:

03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 
04)

For some reason it's trying to read from ffe65000, ffe8a000, ffe89000,
ffe86000, ffe87000, ffe84000.  Those are in reserved memory regions, so
it's not reading an OS allocated buffer, which probably means it's some
kind of side-band communication with a management controller.  I'd guess
it's a BIOS bug and there should be an RMRR covering those accesses.
Thanks,


First of all, I want to know whether I can ignore these errors on the
production server, and whether these errors may affect the system.

By the way, when I removed intel_iommu=on from /etc/grub.conf, no
DMAR-related errors occur.


well, if you don't enable the IOMMU, then it won't have IOMMU faults! ;-)


It's a strange thing: three other Dell R710 servers have the same BIOS
version v6.3.0 and the same kernel 2.6.32-279.14.1 on RHEL6u3 (CentOS 6u3),
but these errors don't appear on those three servers.


mptsas or smi fw has to be different


Anyone have any idea for this ?

thanks


Re: DMAR and DRHD errors[DMAR:[fault reason 06] PTE Read access is not set] Vt-d intel_iommu

2012-12-14 Thread Don Dutile

On 12/13/2012 09:01 PM, Jason Gao wrote:

On Fri, Dec 14, 2012 at 12:23 AM, Alex Williamson
alex.william...@redhat.com  wrote:


Device 03:00.0 is your raid controller:

03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 
04)

For some reason it's trying to read from ffe65000, ffe8a000, ffe89000,
ffe86000, ffe87000, ffe84000.  Those are in reserved memory regions, so
it's not reading an OS allocated buffer, which probably means it's some
kind of side-band communication with a management controller.  I'd guess
it's a BIOS bug and there should be an RMRR covering those accesses.
Thanks,


First of all, I want to know whether I can ignore these errors on the
production server, and whether these errors may affect the system.

By the way, when I removed intel_iommu=on from /etc/grub.conf, no
DMAR-related errors occur.

It's a strange thing: three other Dell R710 servers have the same BIOS
version v6.3.0 and the same kernel 2.6.32-279.14.1 on RHEL6u3 (CentOS 6u3),
but these errors don't appear on those three servers.


forgot: did you check that all the bios settings are the same btwn
the 710 systems?


Anyone have any idea for this ?

thanks


Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources for SR-IOV)

2012-11-13 Thread Don Dutile

On 11/13/2012 11:04 AM, Li, Sibai wrote:




-Original Message-
From: Jason Gao [mailto:pkill.2...@gmail.com]
Sent: Tuesday, November 13, 2012 5:38 AM
To: bhelg...@google.com; Rose, Gregory V; Li, Sibai
Cc: ddut...@redhat.com; Kirsher, Jeffrey T; linux-kernel; netdev; kvm; e1000-
de...@lists.sourceforge.net; linux-...@vger.kernel.org; Yinghai Lu
Subject: Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources
for SR-IOV)

I'm very sorry for the delayed reply. SR-IOV now works for me in CentOS 6.3;
thank you all.


On Fri, Nov 9, 2012 at 11:26 PM, Bjorn Helgaasbhelg...@google.com  wrote:

Linux normally uses the resource assignments done by the BIOS, but it
is possible for the kernel to reassign those.  We don't have good
automatic support for that yet, but on a recent upstream kernel, you
can try pci=realloc.  I doubt this option is in CentOS 6.3, though


Thank you very much. I tried pci=realloc in CentOS 6.3, and now it works for me.



On Sat, Nov 10, 2012 at 2:08 AM, Li, Sibaisibai...@intel.com  wrote:

DellR710 with the latest BIOS should work fine for SR-IOV. My BIOS is
v.6.3.0 and release date is 07/24/2012 Please check if you configured

intel_iommu=on in the grub.conf file.

If you did, check your kernel .config file under Device Drivers-  IOMMU

Hardware support-enable Support for Intel IOMMU using DMA remapping
Devices, enable Intel DMA Remapping Devices by Default, enable Support for
Interrupt Remapping.

Thank you, Sibai. Our server is a Dell R710; its BIOS version is exactly
v6.3.0 with release date 07/24/2012, and I also configured intel_iommu=on in
the grub.conf file, but I can't find these IOMMU options under Device Drivers
in my

Sibai is referring to kernel config options.  RHEL6.3 has the IOMMU options
built into the kernel, but not enabled by default -- you have to add
'intel_iommu=on' to the kernel cmdline to enable the IOMMU. SR-IOV support
(CONFIG_PCI_IOV) is built into the RHEL6.3 kernel as well.


kernel (2.6.32-279) .config file. BTW, my OS is CentOS 6.3 (RHEL 6.3); although
the problem is solved, I'd like to know your OS version and kernel version.


I am using RHEL6.3 with unstable kernel 3.7.0-rc


Re: SR-IOV problem with Intel 82599EB (not enough MMIO resources for SR-IOV)

2012-11-12 Thread Don Dutile

On 11/09/2012 10:26 AM, Bjorn Helgaas wrote:

[+ linux-pci, Yinghai]

On Thu, Nov 8, 2012 at 8:59 PM, Jason Gaopkill.2...@gmail.com  wrote:

The BIOS in your machine doesn't support SR-IOV.  You'll need to ask the 
manufacturer for a BIOS upgrade, if in fact one is available.  Sometimes 
they're not.


Many thanks, Greg. My server is a Dell R710 with the latest BIOS version and
the SR-IOV option (SR-IOV Global Enable: Enabled) turned on. I'm confused:
does the R710 provide full support for SR-IOV, or is this a kernel or ixgbe
driver bug? I'm not sure where the problem lies; does anyone have any
experience with this?


Linux normally uses the resource assignments done by the BIOS, but it
is possible for the kernel to reassign those.  We don't have good
automatic support for that yet, but on a recent upstream kernel, you
can try pci=realloc.  I doubt this option is in CentOS 6.3, though.


Try moving the device into a different slot.
You may be trying it in a non-ARI slot in the 710; that is a problem
in RHEL6 (needing to realloc bus and/or mem-space).
A non-ARI slot will want to use one bus number per VF, which will
be problematic.
I know I've seen ixgbe's (& their VFs) working on a Dell 710,
but it may also be one of those systems that has a slot off the ICH10
with no ARI support.


If an upstream kernel with pci=realloc still doesn't work, please
post the entire dmesg log.


Re: [PATCH 0/4] PCI: Enable LTR/OBFF before device is used by driver

2012-06-08 Thread Don Dutile

On 06/08/2012 02:02 PM, Myron Stowe wrote:

On Fri, Jun 8, 2012 at 11:31 AM, Bjorn Helgaasbhelg...@google.com  wrote:

On Fri, Jun 8, 2012 at 1:01 AM, Xudong Haoxudong@intel.com  wrote:

This series of patches enables LTR and OBFF before the device is used by its
driver, and introduces a couple of functions to save/restore the LTR latency value.

Patch 1/4 introduce new function pci_obff_supported() as pci_ltr_support().

Patch 2/4 enable LTR(Latency tolerance reporting) before device is used by
driver.

Patch 3/4 enable OBFF(optimized buffer flush/fill) before device is used by
driver.

Patch 4/4 introduce a couple of functions pci_save_ltr_value() and
pci_restore_ltr_value() to save and restore LTR latency value, while device is
reset.


We need some justification for these patches.  Why do we want them?
Do they improve performance?  Reduce power consumption?  How have they
been tested?  How can we be confident that these features work
correctly on hardware in the field?  Should or could the BIOS enable
them itself, based on OEM testing and desire to support these
features?


I too am a little nervous about these changes due to Jesse's earlier response
(see http://marc.info/?l=linux-pcim=133372610102933w=2) where he indicated:
   Given how device specific these extensions are, I'd expect you'd need
to know about each specific device anyway, which is why I think the
control belongs in the driver.

Having these features enabled by default may be too aggressive.  Not saying it
is not correct - something you may be able to inform us about, especially since
you are with Intel - it just makes me nervous without further information.

Myron


+1; like AER, I prefer the enablement be in the driver; when/if the
feature has proven itself reliable, then the kernel can enable it by default.
In the case of both the kernel & driver doing an enable, it won't hurt.
If we want a hook to disable it by boot parameter, the kernel would have to
clear it on scan, and put the disable *after* driver probe.




Re: [PATCH 1/1] Enable LTR/OBFF before device is used by driver

2012-05-30 Thread Don Dutile

While you are making the other recommended fixes, could
you add/create a pci_obff_supported() function, like the pci_ltr_supported()
function, and more importantly, add it to the pci_disable_obff() function?
The latter does not check that DEVCAP2 is supported, and thus, could be
writing to non-existent (potentially private) cap space (i.e., V1 CAP devices,
which do exist in the marketplace).

The [enable,disable]_ltr functions do a good job of doing pci_ltr_supported()
checks, but the obff enable,disable functions should have similar calls.
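A rough sketch of the shape such a helper might take, loosely modeled on the existing pci_ltr_supported() check. Note that struct pci_dev and the config-space accessors are replaced here with trivial stand-ins so the outline is self-contained; only the DEVCAP2 mask value is taken from the real pci_regs.h:

```c
#include <stdbool.h>
#include <stdint.h>

/* OBFF supported-mechanism bits in Device Capabilities 2 (pci_regs.h). */
#define PCI_EXP_DEVCAP2_OBFF_MASK 0x000c0000

/* Stand-in for struct pci_dev; the real helper would use
 * pci_is_pcie() and pci_read_config_dword() on the DEVCAP2 offset. */
struct fake_pci_dev {
    bool is_pcie;
    uint32_t devcap2;   /* cached Device Capabilities 2 register */
};

/* Mirror of pci_ltr_supported()'s shape: bail unless the device is
 * PCIe and DEVCAP2 advertises OBFF.  pci_disable_obff() would call
 * this first, so it never writes DEVCTL2 on a device (e.g. a V1 cap
 * device) where that register space doesn't exist. */
static bool pci_obff_supported(struct fake_pci_dev *dev)
{
    if (!dev->is_pcie)
        return false;
    return (dev->devcap2 & PCI_EXP_DEVCAP2_OBFF_MASK) != 0;
}
```

The enable path would additionally pick between the WAKE-signal and message mechanisms based on which of the two mask bits is set.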


On 05/13/2012 10:48 PM, Xudong Hao wrote:

Enable LTR(Latency tolerance reporting) and OBFF(optimized buffer flush/fill) in
  pci_enable_device(), so that they are enabled before the device is used by 
driver.

Signed-off-by: Xudong Haoxudong@intel.com

---
  drivers/pci/pci.c |   29 +
  1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 111569c..2369883 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1134,6 +1134,31 @@ int pci_load_and_free_saved_state(struct pci_dev *dev,
  }
  EXPORT_SYMBOL_GPL(pci_load_and_free_saved_state);

+static void pci_enable_dev_caps(struct pci_dev *dev)
+{
+   /* set default value */
+   unsigned long type = PCI_EXP_OBFF_SIGNAL_ALWAYS;
+
+   /* LTR(Latency tolerance reporting) allows devices to send
+* messages to the root complex indicating their latency
+* tolerance for snooped & unsnooped memory transactions.
+*/
+   pci_enable_ltr(dev);
+
+   /* OBFF (optimized buffer flush/fill), where supported,
+* can help improve energy efficiency by giving devices
+* information about when interrupts and other activity
+* will have a reduced power impact.
+*/
+   pci_enable_obff(dev, type);
+}
+
+static void pci_disable_dev_caps(struct pci_dev *dev)
+{
+   pci_disable_obff(dev);
+   pci_disable_ltr(dev);
+}
+
  static int do_pci_enable_device(struct pci_dev *dev, int bars)
  {
 int err;
@@ -1146,6 +1171,9 @@ static int do_pci_enable_device(struct pci_dev *dev, int 
bars)
 return err;
 pci_fixup_device(pci_fixup_enable, dev);

+   /* Enable some device capabilities before the device is used by a driver. */
+   pci_enable_dev_caps(dev);
+
 return 0;
  }

@@ -1361,6 +1389,7 @@ static void do_pci_disable_device(struct pci_dev *dev)
 }

 pcibios_disable_device(dev);
+   pci_disable_dev_caps(dev);
  }

  /**
--
1.6.0.rc1



Re: [PATCH v2 09/13] vfio: x86 IOMMU implementation

2012-05-25 Thread Don Dutile

On 05/24/2012 06:46 PM, Alex Williamson wrote:

On Thu, 2012-05-24 at 17:38 -0400, Don Dutile wrote:

On 05/22/2012 01:05 AM, Alex Williamson wrote:

x86 is probably the wrong name for this VFIO IOMMU driver, but x86
is the primary target for it.  This driver support a very simple
usage model using the existing IOMMU API.  The IOMMU is expected to
support the full host address space with no special IOVA windows,
number of mappings restrictions, or unique processor target options.

Signed-off-by: Alex Williamsonalex.william...@redhat.com
---

   Documentation/ioctl/ioctl-number.txt |2
   drivers/vfio/Kconfig |6
   drivers/vfio/Makefile|2
   drivers/vfio/vfio.c  |7
   drivers/vfio/vfio_iommu_x86.c|  743 
++
   include/linux/vfio.h |   52 ++
   6 files changed, 811 insertions(+), 1 deletions(-)
   create mode 100644 drivers/vfio/vfio_iommu_x86.c

diff --git a/Documentation/ioctl/ioctl-number.txt 
b/Documentation/ioctl/ioctl-number.txt
index 111e30a..9d1694e 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,7 +88,7 @@ Code  Seq#(hex)   Include FileComments
and kernel/power/user.c
   '8'  all SNP8023 advanced NIC card
mailto:m...@solidum.com
-';'64-6F   linux/vfio.h
+';'64-72   linux/vfio.h
   '@'  00-0F   linux/radeonfb.hconflict!
   '@'  00-0F   drivers/video/aty/aty128fb.cconflict!
   'A'  00-1F   linux/apm_bios.hconflict!
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9acb1e7..bd88a30 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -1,6 +1,12 @@
+config VFIO_IOMMU_X86
+   tristate
+   depends on VFIO   X86
+   default n
+
   menuconfig VFIO
tristate VFIO Non-Privileged userspace driver framework
depends on IOMMU_API
+   select VFIO_IOMMU_X86 if X86
help
  VFIO provides a framework for secure userspace device drivers.
  See Documentation/vfio.txt for more details.


So a future refactoring that uses some chunk of this support
on a non-x86 machine could be a lot of useless renaming.

Why not rename vfio_iommu_x86 to something like vfio_iommu_no_iova
and just make it conditionally compiled on X86 (as you've done above in 
Kconfig's)?
Then if another arch can use it, or refactors the file to use
some of it, and splits x86 vs. other-arch into separate per-arch files,
or per-iova schemes, it's more descriptive and less disruptive?


Yep, the problem is how to concisely describe what we expect to support
here.  This file supports IOMMU API based usage of an IOMMU with
effectively no DMA window or mapping constraints, optimized for static
mapping of an address space.  What's a good name for that?  Maybe I
should follow the example of others and just call it a Type 1 IOMMU
implementation so the marketing material looks better!  ;-P  That may
honestly be better than calling it x86.  Thoughts?  Thanks,

Alex


I'll vote for 'type1' over 'x86' 
Add a comment in the file what a 'type1 IOMMU' is.
Then others can dupe format for typeX.




Re: [PATCH v2 03/13] iommu: IOMMU groups for VT-d and AMD-Vi

2012-05-24 Thread Don Dutile

On 05/22/2012 01:04 AM, Alex Williamson wrote:

Add back group support for AMD & Intel.  amd_iommu already tracks
devices and has init and uninit routines to manage groups.
intel-iommu does this on the fly, so we make use of the notifier
support built into iommu groups to create and remove groups.

Signed-off-by: Alex Williamsonalex.william...@redhat.com
---

  drivers/iommu/amd_iommu.c   |   28 +-
  drivers/iommu/intel-iommu.c |   46 +++
  2 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 32c00cd..b7e5ddf 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -256,9 +256,11 @@ static bool check_device(struct device *dev)

  static int iommu_init_device(struct device *dev)
  {
-   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_dev *dma_pdev, *pdev = to_pci_dev(dev);
struct iommu_dev_data *dev_data;
+   struct iommu_group *group;
u16 alias;
+   int ret;

if (dev->archdata.iommu)
return 0;
@@ -279,8 +281,30 @@ static int iommu_init_device(struct device *dev)
return -ENOTSUPP;
}
dev_data->alias_data = alias_data;
+
+   dma_pdev = pci_get_bus_and_slot(alias >> 8, alias & 0xff);
+   } else
+   dma_pdev = pdev;
+
+   if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
+   pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+   dma_pdev = pci_get_slot(pdev->bus,
+   PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+
+   group = iommu_group_get(&dma_pdev->dev);
+   if (!group) {
+   group = iommu_group_alloc();
+   if (IS_ERR(group))
+   return PTR_ERR(group);
}

+   ret = iommu_group_add_device(group, dev);
+
+   iommu_group_put(group);
+

do you want to do a put if there is a failure in the iommu_group_add_device()?

+   if (ret)
+   return ret;
+
if (pci_iommuv2_capable(pdev)) {
struct amd_iommu *iommu;

@@ -309,6 +333,8 @@ static void iommu_ignore_device(struct device *dev)

  static void iommu_uninit_device(struct device *dev)
  {
+   iommu_group_remove_device(dev);
+
/*
 * Nothing to do here - we keep dev_data around for unplugged devices
 * and reuse it when the device is re-plugged - not doing so would
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d4a0ff7..e63b33b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4087,6 +4087,50 @@ static int intel_iommu_domain_has_cap(struct 
iommu_domain *domain,
return 0;
  }

+static int intel_iommu_add_device(struct device *dev)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_dev *bridge, *dma_pdev = pdev;
+   struct iommu_group *group;
+   int ret;
+
+   if (!device_to_iommu(pci_domain_nr(pdev->bus),
+pdev->bus->number, pdev->devfn))
+   return -ENODEV;
+
+   bridge = pci_find_upstream_pcie_bridge(pdev);
+   if (bridge) {
+   if (pci_is_pcie(bridge))
+   dma_pdev = pci_get_domain_bus_and_slot(
+   pci_domain_nr(pdev->bus),
+   bridge->subordinate->number, 0);
+   else
+   dma_pdev = bridge;
+   }
+
+   if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
+   pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+   dma_pdev = pci_get_slot(pdev->bus,
+   PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+
+   group = iommu_group_get(&dma_pdev->dev);
+   if (!group) {
+   group = iommu_group_alloc();
+   if (IS_ERR(group))
+   return PTR_ERR(group);
+   }
+
+   ret = iommu_group_add_device(group, dev);
+

ditto.

+   iommu_group_put(group);
+   return ret;
+}
+
+static void intel_iommu_remove_device(struct device *dev)
+{
+   iommu_group_remove_device(dev);
+}
+
  static struct iommu_ops intel_iommu_ops = {
.domain_init= intel_iommu_domain_init,
.domain_destroy = intel_iommu_domain_destroy,
@@ -4096,6 +4140,8 @@ static struct iommu_ops intel_iommu_ops = {
.unmap  = intel_iommu_unmap,
.iova_to_phys   = intel_iommu_iova_to_phys,
.domain_has_cap = intel_iommu_domain_has_cap,
+   .add_device = intel_iommu_add_device,
+   .remove_device  = intel_iommu_remove_device,
.pgsize_bitmap  = INTEL_IOMMU_PGSIZES,
  };






Re: [PATCH v2 05/13] pci: Add ACS validation utility

2012-05-24 Thread Don Dutile

On 05/22/2012 01:05 AM, Alex Williamson wrote:

In a PCI environment, transactions aren't always required to reach
the root bus before being re-routed.  Intermediate switches between
an endpoint and the root bus can redirect DMA back downstream before
things like IOMMUs have a chance to intervene.  Legacy PCI is always
susceptible to this as it operates on a shared bus.  PCIe added a
new capability to describe and control this behavior, Access Control
Services, or ACS.  The utility function pci_acs_enabled() allows us
to test the ACS capabilities of an individual device against a set
of flags, while pci_acs_path_enabled() tests a complete path from
a given downstream device up to the specified upstream device.  We
also include the ability to add device-specific tests as it's
likely we'll see devices that do not implement ACS, but want to
indicate support for various capabilities in this space.

Signed-off-by: Alex Williamsonalex.william...@redhat.com
---

  drivers/pci/pci.c|   76 ++
  drivers/pci/quirks.c |   29 +++
  include/linux/pci.h  |   10 ++-
  3 files changed, 114 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 111569c..ab6c2a6 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2359,6 +2359,82 @@ void pci_enable_acs(struct pci_dev *dev)
  }

  /**
+ * pci_acs_enable - test ACS against required flags for a given device

typo:   ^^^ missing 'd'


+ * @pdev: device to test
+ * @acs_flags: required PCI ACS flags
+ *
+ * Return true if the device supports the provided flags.  Automatically
+ * filters out flags that are not implemented on multifunction devices.
+ */
+bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
+{
+   int pos;
+   u16 ctrl;
+
+   if (pci_dev_specific_acs_enabled(pdev, acs_flags))
+   return true;
+
+   if (!pci_is_pcie(pdev))
+   return false;
+
+   if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
+   pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+   if (!pos)
+   return false;
+
+   pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+   if ((ctrl & acs_flags) != acs_flags)
+   return false;
+   } else if (pdev->multifunction) {
+   /* Filter out flags not applicable to multifunction */
+   acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
+ PCI_ACS_EC | PCI_ACS_DT);
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+   if (!pos)
+   return false;
+
+   pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+   if ((ctrl & acs_flags) != acs_flags)
+   return false;
+   }
+
+   return true;

or, to reduce duplicated code (which compiler may do?):

/* Filter out flags not applicable to multifunction */
if (pdev->multifunction)
acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
  PCI_ACS_EC | PCI_ACS_DT);

if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT ||
pdev->multifunction) {
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
if (!pos)
return false;
pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
if ((ctrl & acs_flags) != acs_flags)
return false;
}

return true;

+}


But the above doesn't handle the case where the RC does not do
peer-to-peer between root ports. Per the ACS spec, such an RC's root ports
don't need to provide an ACS cap, since peer-to-peer port xfers aren't
allowed/enabled/supported, so by design the root port is ACS compliant.
ATM, an IOMMU-capable system is a pre-req for VFIO,
and all such systems have an ACS cap, but that may not always be true.


+EXPORT_SYMBOL_GPL(pci_acs_enabled);
+
+/**
+ * pci_acs_path_enable - test ACS flags from start to end in a hierarchy
+ * @start: starting downstream device
+ * @end: ending upstream device or NULL to search to the root bus
+ * @acs_flags: required flags
+ *
+ * Walk up a device tree from start to end testing PCI ACS support.  If
+ * any step along the way does not support the required flags, return false.
+ */
+bool pci_acs_path_enabled(struct pci_dev *start,
+ struct pci_dev *end, u16 acs_flags)
+{
+   struct pci_dev *pdev, *parent = start;
+
+   do {
+   pdev = parent;
+
+   if (!pci_acs_enabled(pdev, acs_flags))
+   return false;
+
+   if (pci_is_root_bus(pdev->bus))
+   return (end == NULL);

doesn't this mean that a caller can't pass the pdev of the root port?
I would think that 

Re: [PATCH v2 12/13] pci: Misc pci_reg additions

2012-05-24 Thread Don Dutile

On 05/22/2012 01:05 AM, Alex Williamson wrote:

Fill in many missing definitions and add sizeof fields for many
sections allowing for more extensive config parsing.

Signed-off-by: Alex Williamson <alex.william...@redhat.com>
---


overall, I'm very glad to see defines instead of hardcoded numbers in the code,
but


  include/linux/pci_regs.h |  112 +-
  1 files changed, 100 insertions(+), 12 deletions(-)

diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 4b608f5..379be84 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -26,6 +26,7 @@
   * Under PCI, each device has 256 bytes of configuration address space,
   * of which the first 64 bytes are standardized as follows:
   */
+#define PCI_STD_HEADER_SIZEOF  64
  #define PCI_VENDOR_ID 0x00/* 16 bits */
  #define PCI_DEVICE_ID 0x02/* 16 bits */
  #define PCI_COMMAND   0x04/* 16 bits */
@@ -209,9 +210,12 @@
  #define  PCI_CAP_ID_SHPC  0x0C/* PCI Standard Hot-Plug Controller */
  #define  PCI_CAP_ID_SSVID 0x0D/* Bridge subsystem vendor/device ID */
  #define  PCI_CAP_ID_AGP3  0x0E/* AGP Target PCI-PCI bridge */
+#define  PCI_CAP_ID_SECDEV 0x0F/* Secure Device */
  #define  PCI_CAP_ID_EXP   0x10/* PCI Express */
  #define  PCI_CAP_ID_MSIX  0x11/* MSI-X */
+#define  PCI_CAP_ID_SATA   0x12/* SATA Data/Index Conf. */
  #define  PCI_CAP_ID_AF0x13/* PCI Advanced Features */
+#define  PCI_CAP_ID_MAX   PCI_CAP_ID_AF
  #define PCI_CAP_LIST_NEXT 1   /* Next capability in the list */
  #define PCI_CAP_FLAGS 2   /* Capability defined flags (16 bits) */
  #define PCI_CAP_SIZEOF4
@@ -276,6 +280,7 @@
  #define  PCI_VPD_ADDR_MASK0x7fff  /* Address mask */
  #define  PCI_VPD_ADDR_F   0x8000  /* Write 0, 1 indicates 
completion */
  #define PCI_VPD_DATA  4   /* 32-bits of data returned here */
+#define PCI_CAP_VPD_SIZEOF 8

  /* Slot Identification */

@@ -297,8 +302,10 @@
  #define PCI_MSI_ADDRESS_HI8   /* Upper 32 bits (if 
PCI_MSI_FLAGS_64BIT set) */
  #define PCI_MSI_DATA_32   8   /* 16 bits of data for 32-bit 
devices */
  #define PCI_MSI_MASK_32   12  /* Mask bits register for 
32-bit devices */
+#define PCI_MSI_PENDING_32 16  /* Pending intrs for 32-bit devices */
  #define PCI_MSI_DATA_64   12  /* 16 bits of data for 64-bit 
devices */
  #define PCI_MSI_MASK_64   16  /* Mask bits register for 
64-bit devices */
+#define PCI_MSI_PENDING_64 20  /* Pending intrs for 64-bit devices */

  /* MSI-X registers */
  #define PCI_MSIX_FLAGS2
@@ -308,6 +315,7 @@
  #define PCI_MSIX_TABLE4
  #define PCI_MSIX_PBA  8
+#define  PCI_MSIX_FLAGS_BIRMASK   (7 << 0)
+#define PCI_CAP_MSIX_SIZEOF 12 /* size of MSIX registers */

  /* MSI-X entry's format */
  #define PCI_MSIX_ENTRY_SIZE   16
@@ -338,6 +346,7 @@
  #define  PCI_AF_CTRL_FLR  0x01
  #define PCI_AF_STATUS 5
  #define  PCI_AF_STATUS_TP 0x01
+#define PCI_CAP_AF_SIZEOF  6   /* size of AF registers */

  /* PCI-X registers */

@@ -374,6 +383,9 @@
  #define  PCI_X_STATUS_SPL_ERR 0x2000  /* Rcvd Split Completion Error 
Msg */
  #define  PCI_X_STATUS_266MHZ  0x4000  /* 266 MHz capable */
  #define  PCI_X_STATUS_533MHZ  0x8000  /* 533 MHz capable */
+#define PCI_X_ECC_CSR  8   /* ECC control and status */
+#define PCI_CAP_PCIX_SIZEOF_V0 8   /* size of registers for Version 0 */
+#define PCI_CAP_PCIX_SIZEOF_V12 24  /* size for Version 1 & 2 */

ew!
unlikely that version 12 will ever exist, but why not:
#define PCI_CAP_PCIX_SIZEOF_V1  24
#define PCI_CAP_PCIX_SIZEOF_V2  PCI_CAP_PCIX_SIZEOF_V1




  /* PCI Bridge Subsystem ID registers */

@@ -462,6 +474,7 @@
  #define  PCI_EXP_LNKSTA_DLLLA 0x2000  /* Data Link Layer Link Active */
  #define  PCI_EXP_LNKSTA_LBMS  0x4000  /* Link Bandwidth Management Status */
  #define  PCI_EXP_LNKSTA_LABS  0x8000  /* Link Autonomous Bandwidth Status */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1 20  /* v1 endpoints end here */
  #define PCI_EXP_SLTCAP20  /* Slot Capabilities */
  #define  PCI_EXP_SLTCAP_ABP   0x0001 /* Attention Button Present */
  #define  PCI_EXP_SLTCAP_PCP   0x0002 /* Power Controller Present */
@@ -521,6 +534,7 @@
  #define  PCI_EXP_OBFF_MSGA_EN 0x2000  /* OBFF enable with Message type A */
  #define  PCI_EXP_OBFF_MSGB_EN 0x4000  /* OBFF enable with Message type B */
  #define  PCI_EXP_OBFF_WAKE_EN 0x6000  /* OBFF using WAKE# signaling */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 44  /* v2 endpoints end here */
  #define PCI_EXP_LNKCTL2   48  /* Link Control 2 */
  #define PCI_EXP_SLTCTL2   56  /* Slot Control 2 */

@@ -529,23 

Re: [PATCH v2 00/13] IOMMU Groups + VFIO

2012-05-24 Thread Don Dutile

On 05/22/2012 01:04 AM, Alex Williamson wrote:

Version 2 incorporating acks and feedback from v1.  The PCI DMA quirk
and ACS check are reworked, sysfs iommu groups ABI Documentation
added as well as numerous other fixes, including patches from Alexey
Kardashevskiy towards supporting POWER usage of VFIO and IOMMU groups.

This series can be found here on top of 3.4:

git://github.com/awilliam/linux-vfio.git iommu-group-vfio-20120521

The Qemu tree has also been updated to Qemu 1.1 and can be found here:

git://github.com/awilliam/qemu-vfio.git iommu-group-vfio

I'd really like to make a push to get this in for 3.5, so let's talk
about how to do that across iommu, pci, and new driver.  Joerg, are
you sufficiently happy with the IOMMU group concept and code?  We'll
also need David Woodhouse buyin on the intel-iommu changes in patches
3 & 6.  Who needs to approve VFIO as a new driver, GregKH?  Bjorn,
I'd be happy to send the PCI changes as a series for you, but I
wonder if it makes sense to collect acks for them if you approve and
bundle them in with the associated code that needs them so you're
not left with unused code.  Let me know which you prefer.  If there
are better ways to do it, please let me know.  Thanks,

Alex

---

ack to 1,2,4,6,8,10 & 11.
provided some minor feedback on 3,9,12.
have to do final review of the big stuff, 7 & 13.


Alex Williamson (13):
   vfio: Add PCI device driver
   pci: Misc pci_reg additions
   pci: Create common pcibios_err_to_errno
   pci: export pci_user functions for use by other drivers
   vfio: x86 IOMMU implementation
   vfio: Add documentation
   vfio: VFIO core
   iommu: Make use of DMA quirking and ACS enabled check for groups
   pci: Add ACS validation utility
   pci: Add PCI DMA source ID quirk
   iommu: IOMMU groups for VT-d and AMD-Vi
   iommu: IOMMU Groups
   driver core: Add iommu_group tracking to struct device


  .../ABI/testing/sysfs-kernel-iommu_groups  |   14
  Documentation/ioctl/ioctl-number.txt   |1
  Documentation/vfio.txt |  315 
  MAINTAINERS|8
  drivers/Kconfig|2
  drivers/Makefile   |1
  drivers/iommu/amd_iommu.c  |   67 +
  drivers/iommu/intel-iommu.c|   87 +
  drivers/iommu/iommu.c  |  578 +++-
  drivers/pci/access.c   |6
  drivers/pci/pci.c  |   76 +
  drivers/pci/pci.h  |7
  drivers/pci/quirks.c   |   69 +
  drivers/vfio/Kconfig   |   16
  drivers/vfio/Makefile  |3
  drivers/vfio/pci/Kconfig   |8
  drivers/vfio/pci/Makefile  |4
  drivers/vfio/pci/vfio_pci.c|  557 +++
  drivers/vfio/pci/vfio_pci_config.c | 1522 
  drivers/vfio/pci/vfio_pci_intrs.c  |  724 ++
  drivers/vfio/pci/vfio_pci_private.h|   91 +
  drivers/vfio/pci/vfio_pci_rdwr.c   |  269 
  drivers/vfio/vfio.c| 1413 +++
  drivers/vfio/vfio_iommu_x86.c  |  743 ++
  drivers/xen/xen-pciback/conf_space.c   |6
  include/linux/device.h |2
  include/linux/iommu.h  |  104 +
  include/linux/pci.h|   49 +
  include/linux/pci_regs.h   |  112 +
  include/linux/vfio.h   |  444 ++
  30 files changed, 7182 insertions(+), 116 deletions(-)
  create mode 100644 Documentation/ABI/testing/sysfs-kernel-iommu_groups
  create mode 100644 Documentation/vfio.txt
  create mode 100644 drivers/vfio/Kconfig
  create mode 100644 drivers/vfio/Makefile
  create mode 100644 drivers/vfio/pci/Kconfig
  create mode 100644 drivers/vfio/pci/Makefile
  create mode 100644 drivers/vfio/pci/vfio_pci.c
  create mode 100644 drivers/vfio/pci/vfio_pci_config.c
  create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
  create mode 100644 drivers/vfio/pci/vfio_pci_private.h
  create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c
  create mode 100644 drivers/vfio/vfio.c
  create mode 100644 drivers/vfio/vfio_iommu_x86.c
  create mode 100644 include/linux/vfio.h


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/13] pci: New pci_acs_enabled()

2012-05-21 Thread Don Dutile

On 05/18/2012 10:47 PM, Alex Williamson wrote:

On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:

On 05/18/2012 06:02 PM, Alex Williamson wrote:

On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:

On 05/15/2012 05:09 PM, Alex Williamson wrote:

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
alex.william...@redhat.com wrote:

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
alex.william...@redhat.com wrote:

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.


The name pci_acs_enabled() sounds like it returns a boolean, but it doesn't.


Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);


+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.

Finally, with my email filters fixed, I can see this email... :)


Welcome back ;)


Indeed... and I recvd 3 copies of this reply,
so the pendulum has flipped the other direction... ;-)


Yep, the new API I'm working with is:

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
struct pci_dev *end, u16 acs_flags);


ok.


I'm not sure what a complete PCI ACS chain means.

The function starts from dev and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.


Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,


Is this redirection just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                  ^
                  |
         +--------+--------+
         |                 |
  +------+-----+    +------+-----+
  | Downstream |    | Downstream |
  |    Port    |    |    Port    |
  |   06:05.0  |    |   06:06.0  |
  +------+-----+    +------+-----+
         |                 |
  +------v-----+    +------v-----+
  |  Endpoint  |    |  Endpoint  |
  |   07:00.0  |    |   08:00.0  |
  +------------+    +------------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.


Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.


detail: PCIe up/downstream routing is really done by an internal switch;
   ACS forces the legacy, PCI base-limit address routing and *forces*
   the switch to always route the transaction from a downstream port
   to the upstream port.


I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.


Yep, that's my meager understanding as well.


+1


so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.


Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.


Yes, modulo this comment in libvirt source:

   /* if we have no parent, and this is the root bus, ACS doesn't come
* into play since devices on the root bus can't P2P without going
* through the root IOMMU.
*/


Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.


Would you mind looking for the paragraph that says this?  I'd rather
code this into the iommu driver callers than core PCI code if this is
just a platform standard.


In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
ACS upstream fwding: Must be implemented

Re: [PATCH 05/13] pci: New pci_acs_enabled()

2012-05-21 Thread Don Dutile

On 05/21/2012 10:59 AM, Alex Williamson wrote:

On Mon, 2012-05-21 at 09:31 -0400, Don Dutile wrote:

On 05/18/2012 10:47 PM, Alex Williamson wrote:

On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:

On 05/18/2012 06:02 PM, Alex Williamson wrote:

On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:

On 05/15/2012 05:09 PM, Alex Williamson wrote:

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
alex.william...@redhat.com wrote:

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
alex.william...@redhat.com wrote:

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.


The name pci_acs_enabled() sounds like it returns a boolean, but it doesn't.


Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);


+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.

Finally, with my email filters fixed, I can see this email... :)


Welcome back ;)


Indeed... and I recvd 3 copies of this reply,
so the pendulum has flipped the other direction... ;-)


Yep, the new API I'm working with is:

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
 struct pci_dev *end, u16 acs_flags);


ok.


I'm not sure what a complete PCI ACS chain means.

The function starts from dev and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.


Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,


Is this redirection just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                  ^
                  |
         +--------+--------+
         |                 |
  +------+-----+    +------+-----+
  | Downstream |    | Downstream |
  |    Port    |    |    Port    |
  |   06:05.0  |    |   06:06.0  |
  +------+-----+    +------+-----+
         |                 |
  +------v-----+    +------v-----+
  |  Endpoint  |    |  Endpoint  |
  |   07:00.0  |    |   08:00.0  |
  +------------+    +------------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.


Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.


detail: PCIe up/downstream routing is really done by an internal switch;
ACS forces the legacy, PCI base-limit address routing and *forces*
the switch to always route the transaction from a downstream port
to the upstream port.


I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.


Yep, that's my meager understanding as well.


+1


so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.


Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.


Yes, modulo this comment in libvirt source:

/* if we have no parent, and this is the root bus, ACS doesn't come
 * into play since devices on the root bus can't P2P without going
 * through the root IOMMU.
 */


Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.


Would you mind looking for the paragraph that says this?  I'd rather
code this into the iommu driver callers than core PCI code if this is
just

Re: RESEND3: Re: [PATCH 05/13] pci: New pci_acs_enabled()

2012-05-18 Thread Don Dutile

On 05/18/2012 06:02 PM, Alex Williamson wrote:

On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:

On 05/15/2012 05:09 PM, Alex Williamson wrote:

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
alex.william...@redhat.com   wrote:

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
alex.william...@redhat.com   wrote:

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.


The name pci_acs_enabled() sounds like it returns a boolean, but it doesn't.


Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);


+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.

Finally, with my email filters fixed, I can see this email... :)



Yep, the new API I'm working with is:

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
   struct pci_dev *end, u16 acs_flags);


ok.


I'm not sure what a complete PCI ACS chain means.

The function starts from dev and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.


Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,


Is this redirection just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                  ^
                  |
         +--------+--------+
         |                 |
  +------+-----+    +------+-----+
  | Downstream |    | Downstream |
  |    Port    |    |    Port    |
  |   06:05.0  |    |   06:06.0  |
  +------+-----+    +------+-----+
         |                 |
  +------v-----+    +------v-----+
  |  Endpoint  |    |  Endpoint  |
  |   07:00.0  |    |   08:00.0  |
  +------------+    +------------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.


Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.


detail: PCIe up/downstream routing is really done by an internal switch;
  ACS forces the legacy, PCI base-limit address routing and *forces*
  the switch to always route the transaction from a downstream port
  to the upstream port.


I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.


Yep, that's my meager understanding as well.


+1


so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.


Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.


Yes, modulo this comment in libvirt source:

  /* if we have no parent, and this is the root bus, ACS doesn't come
   * into play since devices on the root bus can't P2P without going
   * through the root IOMMU.
   */


Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.


Would you mind looking for the paragraph that says this?  I'd rather
code this into the iommu driver callers than core PCI code if this is
just a platform standard.


In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
ACS upstream fwding: Must be implemented by Root Ports if the RC supports
 Redirected Request Validation;
-- which means, if a Root port allows a peer-to-peer transaction to another
   one of its ports, then it has to support ACS.

So, this means that:
(a) if a Root

Re: [PATCH 05/13] pci: New pci_acs_enabled()

2012-05-16 Thread Don Dutile

On 05/15/2012 05:09 PM, Alex Williamson wrote:

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
alex.william...@redhat.com  wrote:

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
alex.william...@redhat.com  wrote:

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.


The name pci_acs_enabled() sounds like it returns a boolean, but it doesn't.


Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);


+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.


I'm not sure what a complete PCI ACS chain means.

The function starts from dev and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.


Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,


Is this redirection just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                  ^
                  |
         +--------+--------+
         |                 |
  +------+-----+    +------+-----+
  | Downstream |    | Downstream |
  |    Port    |    |    Port    |
  |   06:05.0  |    |   06:06.0  |
  +------+-----+    +------+-----+
         |                 |
  +------v-----+    +------v-----+
  |  Endpoint  |    |  Endpoint  |
  |   07:00.0  |    |   08:00.0  |
  +------------+    +------------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.


Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.


detail: PCIe up/downstream routing is really done by an internal switch;
ACS forces the legacy, PCI base-limit address routing and *forces*
the switch to always route the transaction from a downstream port
to the upstream port.


I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.


Yep, that's my meager understanding as well.


+1


so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.


Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.


Yes, modulo this comment in libvirt source:

 /* if we have no parent, and this is the root bus, ACS doesn't come
  * into play since devices on the root bus can't P2P without going
  * through the root IOMMU.
  */


Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.


So we assume that a redirect at the point of the iommu will factor in
iommu translation.


If so, I think you're trying to find out the closest upstream device X
such that everything leading to X has ACS enabled.  Every device below
X can DMA freely to other devices below X, so they would all have to
be in the same isolated group.


Yes


I tried to work through some examples to develop some intuition about this:


(inserting fixed url)

http://www.asciiflow.com/#3736558963405980039



pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
if 00:00.0 is PCIe or if RP has ACS?))


Hmm, the latter is the assumption above.  For the former, I think
libvirt was probably assuming that PCI devices must have a PCIe device
upstream from them because x86 doesn't have assignment friendly IOMMUs
except on PCIe.  I'll need to work on making that more generic.


pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)

Re: [AMD iommu] pci Failed to assign device hostdev0 : Device or resource busy

2011-12-19 Thread Don Dutile

On 12/13/2011 03:40 AM, Andreas Hartmann wrote:

Hello Don!


Andreas,

sorry for the delay, I don't follow this email list frequently (read: !daily).


Am Tue, 13 Dec 2011 01:21:41 +0100
schrieb Andreas Hartmann <andihartm...@01019freenet.de>:


Am Mon, 12 Dec 2011 13:36:36 -0500
schrieb Don Dutile <ddut...@redhat.com>:


On 12/12/2011 06:15 AM, Andreas Hartmann wrote:

Hello!

I've got a few questions to a problem, which already was analyzed here
sometime ago:
http://markmail.org/message/dspovwvzp3wtdrf6#query:+page:1+mid:i2oph4xwfmiknt3y+state:results

My situation is a bit different. I do have two PCI cards (a Linksys wlan
card and an intel e100 card). Each of these cards should be managed in
an own VM. I do have no problems with IRQ sharing (each device does have its 
own IRQ).

I'm using linux 3.0.6, kvm 0.15 and libvirt 0.9.7. Mainboard is: GA-990XA-UD3.


The problem is: both cards are behind a PCI-PCI bridge:

00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)


-[0000:00]-+-00.0  ATI Technologies Inc RD890 PCI to PCI bridge (external gfx0 
port B)
 +-00.2  ATI Technologies Inc Device 5a23
 +-02.0-[01]--+-00.0  ATI Technologies Inc Device 6759
 |\-00.1  ATI Technologies Inc Device aa90
 +-04.0-[02]00.0  Device 1b6f:7023
 +-09.0-[03]00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B 
PCI Express Gigabit Ethernet controller
 +-0a.0-[04]00.0  Device 1b6f:7023
 +-11.0  ATI Technologies Inc SB700/SB800 SATA Controller [AHCI 
mode]
 +-12.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
 +-12.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
 +-13.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
 +-13.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
 +-14.0  ATI Technologies Inc SBx00 SMBus Controller
 +-14.1  ATI Technologies Inc SB700/SB800 IDE Controller
 +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
 +-14.3  ATI Technologies Inc SB700/SB800 LPC host controller

 +-14.4-[05]--+-06.0  Intel Corporation 82557/8/9/0/1 Ethernet Pro 
100
 |\-07.0  RaLink RT2800 802.11n PCI


You cannot assign two devices behind a (legacy) PCI (not PCIe) bridge to two
different VMs because PCI devices don't provide a B:D.F in their transaction 
headers;
In this case, the PCIe-PCI bridge/switch prepends the bridge's B:D.F in front of
a transaction that is sourced by 05:06.0 or 05:07.0.

So, the devices cannot be isolated from each other's (mem) mapping domains
in the IOMMU, thus, libvirt prevents this (security) violation from being
enabled.


Ok. If I remove the intel card and put in instead a 32 bit PCIe card
like TP-Link TG-3468, I could assign each of these two cards to
different VMs.


KISS: Don't try to assign legacy PCI devices; just PCIe. ;-)

Where is TP-Link TG-3468 in lspci output below?  the AMD devices in 
18.0-18.5 ???


Is this correct?

[...]


1. Is there any way to get the devices into different VMs? (I can't put them to 
another PCI slot as there are just 2 PCI slots on the board.)

Only if the two slots are behind different PCIe-PCI bridges .


2. Is there any fix to prevent the host crash - maybe in a newer version of kvm 
or kernel?

don't assign the PCI bridge; libvirt does all the proper bind/unbinding of 
devices
for assigned devices, so the above echo steps are unnecessary if you use 
libvirt.


If I just remove the intel card, I should be able to assign the wlan
card to one of my VMs. I'll try that.


Meanwhile I checked this scenario: I removed the intel card and
rebooted. I got the following pci tree:

-[0000:00]-+-00.0  ATI Technologies Inc RD890 PCI to PCI bridge (external gfx0 
port B)
+-00.2  ATI Technologies Inc Device 5a23
+-02.0-[01]--+-00.0  ATI Technologies Inc Device 6759
|\-00.1  ATI Technologies Inc Device aa90
+-04.0-[02]00.0  Device 1b6f:7023
+-09.0-[03]00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B 
PCI Express Gigabit Ethernet controller
+-0a.0-[04]00.0  Device 1b6f:7023
+-11.0  ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
+-12.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
+-12.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
+-13.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
+-13.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
+-14.0  ATI Technologies Inc SBx00 SMBus Controller
+-14.1  ATI Technologies Inc SB700/SB800 IDE Controller
+-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
+-14.3  ATI Technologies Inc SB700/SB800 LPC host controller
+-14.4-[05]----07.0  RaLink RT2800 802.11n PCI

+-14.5  ATI 

Re: [AMD iommu] pci Failed to assign device hostdev0 : Device or resource busy

2011-12-12 Thread Don Dutile

On 12/12/2011 06:15 AM, Andreas Hartmann wrote:

Hello!

I've got a few questions about a problem, which was already analyzed here
some time ago:
http://markmail.org/message/dspovwvzp3wtdrf6#query:+page:1+mid:i2oph4xwfmiknt3y+state:results

My situation is a bit different. I do have two PCI cards (a Linksys wlan
card and an intel e100 card). Each of these cards should be managed in
its own VM. I have no problems with IRQ sharing (each device has its
own IRQ).

I'm using linux 3.0.6, kvm 0.15 and libvirt 0.9.7. Mainboard is: GA-990XA-UD3.


The problem is: both cards are behind a PCI-PCI bridge:

00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)


-[:00]-+-00.0  ATI Technologies Inc RD890 PCI to PCI bridge (external gfx0 port B)
+-00.2  ATI Technologies Inc Device 5a23
+-02.0-[01]--+-00.0  ATI Technologies Inc Device 6759
|            \-00.1  ATI Technologies Inc Device aa90
+-04.0-[02]----00.0  Device 1b6f:7023
+-09.0-[03]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
+-0a.0-[04]----00.0  Device 1b6f:7023
+-11.0  ATI Technologies Inc SB700/SB800 SATA Controller [AHCI mode]
+-12.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
+-12.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
+-13.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
+-13.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
+-14.0  ATI Technologies Inc SBx00 SMBus Controller
+-14.1  ATI Technologies Inc SB700/SB800 IDE Controller
+-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
+-14.3  ATI Technologies Inc SB700/SB800 LPC host controller

+-14.4-[05]--+-06.0  Intel Corporation 82557/8/9/0/1 Ethernet Pro 100
|            \-07.0  RaLink RT2800 802.11n PCI


You cannot assign two devices behind a (legacy) PCI (not PCIe) bridge to two
different VMs, because PCI devices don't provide a B:D.F in their transaction
headers; in this case, the PCIe-to-PCI bridge prepends its own B:D.F to any
transaction sourced by 05:06.0 or 05:07.0.

So the devices cannot be isolated from each other's (mem) mapping domains in
the IOMMU; thus, libvirt prevents this (security) violation from being
enabled.


+-14.5  ATI Technologies Inc SB700/SB800 USB OHCI2 Controller
+-15.0-[06]--
+-16.0  ATI Technologies Inc SB700/SB800 USB OHCI0 Controller
+-16.2  ATI Technologies Inc SB700/SB800 USB EHCI Controller
+-18.0  Advanced Micro Devices [AMD] Device 1600
+-18.1  Advanced Micro Devices [AMD] Device 1601
+-18.2  Advanced Micro Devices [AMD] Device 1602
+-18.3  Advanced Micro Devices [AMD] Device 1603
+-18.4  Advanced Micro Devices [AMD] Device 1604
\-18.5  Advanced Micro Devices [AMD] Device 1605

That's what can be seen during boot:

[0.621901] AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
[0.621906] AMD-Vi:mmio-addr: fec3
[0.622091] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:00.0 flags: 00
[0.622095] AMD-Vi:   DEV_RANGE_END   devid: 00:00.2
[0.622097] AMD-Vi:   DEV_SELECT  devid: 00:02.0 flags: 
00
[0.622100] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 01:00.0 flags: 00
[0.622102] AMD-Vi:   DEV_RANGE_END   devid: 01:00.1
[0.622105] AMD-Vi:   DEV_SELECT  devid: 00:04.0 flags: 
00
[0.622107] AMD-Vi:   DEV_SELECT  devid: 02:00.0 flags: 
00
[0.622109] AMD-Vi:   DEV_SELECT  devid: 00:09.0 flags: 
00
[0.622112] AMD-Vi:   DEV_SELECT  devid: 03:00.0 flags: 
00
[0.622114] AMD-Vi:   DEV_SELECT  devid: 00:0a.0 flags: 
00
[0.622117] AMD-Vi:   DEV_SELECT  devid: 04:00.0 flags: 
00
[0.622119] AMD-Vi:   DEV_SELECT  devid: 00:11.0 flags: 
00
[0.622122] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:12.0 flags: 00
[0.622124] AMD-Vi:   DEV_RANGE_END   devid: 00:12.2
[0.622127] AMD-Vi:   DEV_SELECT_RANGE_START  devid: 00:13.0 flags: 00
[0.622129] AMD-Vi:   DEV_RANGE_END   devid: 00:13.2
[0.622132] AMD-Vi:   DEV_SELECT  devid: 00:14.0 flags: 
d7
[0.622134] AMD-Vi:   DEV_SELECT  devid: 00:14.1 flags: 
00
[0.622137] AMD-Vi:   DEV_SELECT  devid: 00:14.2 flags: 
00
[0.622139] AMD-Vi:   DEV_SELECT  devid: 00:14.3 flags: 
00
[0.622141] AMD-Vi:   DEV_SELECT  devid: 00:14.4 flags: 
00
[0.622144] AMD-Vi:   DEV_ALIAS_RANGE devid: 05:00.0 flags: 
00 devid_to: 00:14.4
[0.622147] AMD-Vi:   DEV_RANGE_END   

Re: [PATCH] qemu-kvm: device assignment: add 82599 PCIe Cap struct quirk

2011-10-03 Thread Don Dutile

On 10/02/2011 05:56 AM, Michael S. Tsirkin wrote:

On Wed, Sep 28, 2011 at 08:20:33PM -0400, Donald Dutile wrote:

commit f9c29774d2174df6ffc20becec20928948198914
changed the PCIe Capability structure version check
from if  2 fail, to if ==1, size=x, if ==2, size=y,
else fail.
Turns out the 82599's VF has an errata where it's
PCIe Cap struct version is 0, which now fails device assignment
due to the else fallout, where before, it would blissfully work.

Add a quirk if version=0,  intel-82599, set size to version 2 struct.

Signed-off-by: Donald_Dutileddut...@redhat.com


Makes sense.

Nit: please use PCI_VENDOR_ID_INTEL instead of 0x8086 below.


Good idea; read more below...

---
  hw/device-assignment.c |   12 ++--
  1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 288f80c..ed2a883 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1261,12 +1261,20 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)

  if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
  uint8_t version, size;
-uint16_t type, devctl, lnkcap, lnksta;
+uint16_t type, devctl, lnkcap, lnksta, vendor, device;
  uint32_t devcap;

+vendor = pci_get_word(pci_dev->config + PCI_VENDOR_ID);
+device = pci_get_word(pci_dev->config + PCI_DEVICE_ID);
   version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
   version &= PCI_EXP_FLAGS_VERS;
-if (version == 1) {
+if (version == 0 && vendor == 0x8086 && device == 0x10ed) {


I'd also make version == 0 last test in the list
to stress the fact this is a device specific quirk,
but that's a matter of taste ...


Given this is the odd case, yes, I agree.
Will repost a new patch w/recommended changes.
Pushed this up due to the desire to fix sooner rather than later, but another
day won't matter given it's been this way for a couple of months!


+/*
+ * quirk for Intel 82599 VF with invalid PCIe capability version,
+ * should really be version 2 (same as PF)
+ */
+size = 0x3c;
+} else if (version == 1) {
  size = 0x14;
  } else if (version == 2) {
  /*
--
1.7.1


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/10] Driver core: Add iommu_ops to bus_type

2011-09-07 Thread Don Dutile

On 09/07/2011 03:44 PM, Greg KH wrote:

On Wed, Sep 07, 2011 at 09:19:19PM +0200, Joerg Roedel wrote:

Hi Greg,

the bus_set_iommu() function will be called by the IOMMU driver. There
can be different drivers for the same bus, depending on the hardware. On
PCI for example, there can be the Intel or the AMD IOMMU driver that
implement the iommu-api and that register for that bus.


Why are you pushing this down into the driver core?  What other busses
becides PCI use/need this?

If you can have a different IOMMU driver on the same bus, then wouldn't
this be a per-device thing instead of a per-bus thing?


And given the dma api takes a struct device *, it'd be more efficient
to be tied into the device structure.
The device structure would get iommu ops set by its parent (bus);
if a bus (segment) doesn't provide a unique/different/layered IOMMU
than the parent bus, it inherits the parent's iommu-ops.
Setting the iommu-ops in the root bus struct seeds the iommu-ops
for the (PCI) tree.

For Intel & AMD IOMMUs, in early pci (bios, root?) init, you would
seed the pci root busses with appropriate IOMMU support (based on
dmar/drhd & ivrs/ivhd data structures, respectively), and
then modify the PCI code to do the inheritance (PPB code inherits
unless a specific device driver for a given PPB vid-did loads a
different iommu-ops for that segment/branch).

This would enable different types of IOMMUs for different devices
(or PCI segments, or branches of PCI trees) that are designed for
different tasks -- simple IOMMUs for legacy devices; complicated,
io-page-faulting IOMMUs for plug-in, high-end devices on virtualizing
servers for PCI (SRIOV) endpoints.

and as Greg indicates, is only relevant to PCI.
The catch is that dev* has to be looked at for iommu support for dma-ops.




On Wed, Sep 07, 2011 at 11:47:50AM -0700, Greg KH wrote:

+#ifdef CONFIG_IOMMU_API
+int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
+{
+   if (bus->iommu_ops != NULL)
+   return -EBUSY;


Busy?


Yes, it signals to the IOMMU driver that another driver has already
registered for that bus. In the previous register_iommu() interface this
was just a BUG(), but I think returning an error to the caller is
better. It can be turned back into a BUG() if it is considered better,
though.


Can you ever have more than one IOMMU driver per bus?  If so, this seems
wrong (see above.)


+
+   bus->iommu_ops = ops;
+
+   /* Do IOMMU specific setup for this bus-type */
+   iommu_bus_init(bus, ops);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(bus_set_iommu);


I don't understand what this function is for, and who would call it.


It is called by the IOMMU driver.


Please provide kerneldoc that explains this.


Will do.


@@ -67,6 +68,9 @@ extern void bus_remove_file(struct bus_type *, struct 
bus_attribute *);
   * @resume:   Called to bring a device on this bus out of sleep mode.
   * @pm:   Power management operations of this bus, callback the 
specific
   *device driver's pm-ops.
+ * @iommu_ops   IOMMU specific operations for this bus, used to attach IOMMU
+ *  driver implementations to a bus and allow the driver to do
+ *  bus-specific setup


So why is this just not set by the bus itself, making the above function
not needed at all?


The IOMMUs are usually devices on the bus itself, so they are
initialized after the bus is set up and the devices on it are
populated.  So the function can not be called on bus initialization
because the IOMMU is not ready at this point.


Ok, that makes more sense, please state as much in the documentation.

thanks,

greg k-h
___
iommu mailing list
io...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/iommu




Re: kvm PCI assignment VFIO ramblings

2011-08-25 Thread Don Dutile

On 08/25/2011 06:54 AM, Roedel, Joerg wrote:

Hi Alex,

On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote:

Is this roughly what you're thinking of for the iommu_group component?
Adding a dev_to_group iommu ops callback lets us consolidate the sysfs
support in the iommu base.  Would AMD-Vi do something similar (or
exactly the same) for group #s?  Thanks,


The concept looks good, I have some comments, though. On AMD-Vi the
implementation would look a bit different because there is a
data-structure where the information can be gathered from, so no need for
PCI bus scanning there.


diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c
index 6e6b6a1..6b54c1a 100644
--- a/drivers/base/iommu.c
+++ b/drivers/base/iommu.c
@@ -17,20 +17,56 @@
   */

  #include <linux/bug.h>
+#include <linux/device.h>
  #include <linux/types.h>
  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/errno.h>
  #include <linux/iommu.h>
+#include <linux/pci.h>

  static struct iommu_ops *iommu_ops;

+static ssize_t show_iommu_group(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%lx", iommu_dev_to_group(dev));


Probably add a 0x prefix so userspace knows the format?


+}
+static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
+
+static int add_iommu_group(struct device *dev, void *unused)
+{
+   if (iommu_dev_to_group(dev) >= 0)
+   return device_create_file(dev, &dev_attr_iommu_group);
+
+   return 0;
+}
+
+static int device_notifier(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct device *dev = data;
+
+   if (action == BUS_NOTIFY_ADD_DEVICE)
+   return add_iommu_group(dev, NULL);
+
+   return 0;
+}
+
+static struct notifier_block device_nb = {
+   .notifier_call = device_notifier,
+};
+
  void register_iommu(struct iommu_ops *ops)
  {
if (iommu_ops)
BUG();

iommu_ops = ops;
+
+   /* FIXME - non-PCI, really want for_each_bus() */
+   bus_register_notifier(&pci_bus_type, &device_nb);
+   bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group);
  }


We need to solve this differently. ARM is starting to use the iommu-api
too and this definitly does not work there. One possible solution might
be to make the iommu-ops per-bus.


When you think of a system where there isn't just one bus-type
with iommu support, it makes more sense.
Additionally, it also allows the long-term architecture to use different types
of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs --
esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared
for direct-attach disk hba's.



  bool iommu_found(void)
@@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain,
  }
  EXPORT_SYMBOL_GPL(iommu_domain_has_cap);

+long iommu_dev_to_group(struct device *dev)
+{
+   if (iommu_ops->dev_to_group)
+   return iommu_ops->dev_to_group(dev);
+   return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_to_group);


Please rename this to iommu_device_group(). The dev_to_group name
suggests a conversion but it is actually just a property of the device.
Also the return type should not be long but something that fits into
32bit on all platforms. Since you use -ENODEV, probably s32 is a good
choice.


+
  int iommu_map(struct iommu_domain *domain, unsigned long iova,
  phys_addr_t paddr, int gfp_order, int prot)
  {
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index f02c34d..477259c 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -404,6 +404,7 @@ static int dmar_map_gfx = 1;
  static int dmar_forcedac;
  static int intel_iommu_strict;
  static int intel_iommu_superpage = 1;
+static int intel_iommu_no_mf_groups;

  #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1))
  static DEFINE_SPINLOCK(device_domain_lock);
@@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str)
printk(KERN_INFO
"Intel-IOMMU: disable supported super page\n");
intel_iommu_superpage = 0;
+   } else if (!strncmp(str, "no_mf_groups", 12)) {
+   printk(KERN_INFO
+   "Intel-IOMMU: disable separate groups for "
+   "multifunction devices\n");
+   intel_iommu_no_mf_groups = 1;


This should really be a global iommu option and not be VT-d specific.



str += strcspn(str, ",");
@@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct 
iommu_domain *domain,
return 0;
  }

+/* Group numbers are arbitrary.  Devices with the same group number
+ * indicate the iommu cannot differentiate between them.  To avoid
+ * tracking used groups we just use the seg|bus|devfn of the lowest
+ * level we're able to differentiate devices */
+static long intel_iommu_dev_to_group(struct 

Re: [PATCH v3] pci: correct pci config size default for cap version 2 endpoints

2011-07-25 Thread Don Dutile

On 07/24/2011 06:58 AM, Michael S. Tsirkin wrote:

On Sun, Jul 24, 2011 at 11:41:10AM +0300, Michael S. Tsirkin wrote:

On Sun, Jul 24, 2011 at 11:12:44AM +0300, Michael S. Tsirkin wrote:

On Fri, Jul 22, 2011 at 02:35:47PM -0700, Chris Wright wrote:

* Alex Williamson (alex.william...@redhat.com) wrote:

On Fri, 2011-07-22 at 14:24 -0700, Chris Wright wrote:

* Donald Dutile (ddut...@redhat.com) wrote:

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 36ad6b0..34db52e 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1419,16 +1419,18 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
  }

  if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
-uint8_t version;
+uint8_t version, size;
  uint16_t type, devctl, lnkcap, lnksta;
  uint32_t devcap;
-int size = 0x3c; /* version 2 size */

   version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
   version &= PCI_EXP_FLAGS_VERS;
   if (version == 1) {
   size = 0x14;
-} else if (version < 2) {
+} else if (version == 2) {
+/* don't include slot cap/stat/ctrl 2 regs; only support endpoints */
+size = 0x34;


That doesn't look correct to me.  The size is fixed, just that some
registers are Reserved Zero when they do not apply (e.g. endpoint only).


Apparently it can be interpreted differently.  In this case, we've seen
a tg3 device expose a v2 PCI express capability at offset 0xcc.  Using
0x3c bytes, we extend 8 bytes past the legacy config space area :(


Wow, that device sounds broken to me.  The spec is pretty clear.


Yes, I agree it's broken. Looks like something that
happens when a device is designed in parallel with the spec.

What bothers me is this patch seems to make devices that do behave
correctly out of spec (registers will be writeable by default) -
correct?

How about we check for overflow and only do the hacks
if it happens?

Also, the code to initialize slot and root control registers is still
there: it would seem that running it will corrupt memmory beyond the
config array?


I take this last bit back: registers we touch are at offset < 0x34.
Sorry about the noise. But the question about read-only registers
still stands.


Also, where does the magic 0x34 come from? I'm guessing this is
simply what's left till the end of the config space.
So let's be as conservative & specific as possible with
this hack:



I believe the spec leaves room for interpretation, and thus the
resulting 'broken' device.  As I read the spec, the size of the struct can be:

-- 0x2c for all devices, min., that are cap version 2 or higher.
-- 0x34 for devices with links, i.e., not a root-port-based device,
   i.e., a device not integrated into the root port, or, if it is,
   one that uses a serial link anyhow (doesn't strip out the 8b/10b
   serial link btwn device & internal root port)
-- 0x3c for devices with slots, i.e., a bridge with downstream slots,
   i.e., not an endpoint, i.e., a PCI(e) bridge.

Thus, 0x34 was chosen, since we don't support device assigning PCI bridges,
(not until MRIOV shows up, at least), and 0x34 fits the bug at hand, and
device cap/stat/control may be used/modified.

So, a 'hack' is not needed.  In fact, the 0x34 size is a bit of a hack,
since the case to use 0x2c could be ascertained by checking if the device
is a root port device, _and_ it's not using a serial link, but a perusal of
root port devices on a number of systems I looked at always had this structure
greater than 0x2c, so I figured the simple heuristic of 0x34 was sufficient.


/* A version 2 device was observed to only have a partial
  * implementation for the capability structure. Apparently, it doesn't
  * implement the registers from slot capability 2 and on (offset 0x34),
  * with the capability at offset 0xCC = 256 - 0x34. This is out of spec,
  * but let's try to support this. */
if (version == 2 && pos == 0xCC) {
size = 0x34;
}




--
MST




Re: [PATCH v3] pci: correct pci config size default for cap version 2 endpoints

2011-07-25 Thread Don Dutile

On 07/25/2011 04:20 PM, Alex Williamson wrote:

On Mon, 2011-07-25 at 15:37 -0400, Don Dutile wrote:

On 07/24/2011 06:58 AM, Michael S. Tsirkin wrote:

On Sun, Jul 24, 2011 at 11:41:10AM +0300, Michael S. Tsirkin wrote:

On Sun, Jul 24, 2011 at 11:12:44AM +0300, Michael S. Tsirkin wrote:

On Fri, Jul 22, 2011 at 02:35:47PM -0700, Chris Wright wrote:

* Alex Williamson (alex.william...@redhat.com) wrote:

On Fri, 2011-07-22 at 14:24 -0700, Chris Wright wrote:

* Donald Dutile (ddut...@redhat.com) wrote:

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 36ad6b0..34db52e 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1419,16 +1419,18 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
   }

   if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
-uint8_t version;
+uint8_t version, size;
   uint16_t type, devctl, lnkcap, lnksta;
   uint32_t devcap;
-int size = 0x3c; /* version 2 size */

    version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
    version &= PCI_EXP_FLAGS_VERS;
    if (version == 1) {
    size = 0x14;
-} else if (version < 2) {
+} else if (version == 2) {
+/* don't include slot cap/stat/ctrl 2 regs; only support endpoints */
+size = 0x34;


That doesn't look correct to me.  The size is fixed, just that some
registers are Reserved Zero when they do not apply (e.g. endpoint only).


Apparently it can be interpreted differently.  In this case, we've seen
a tg3 device expose a v2 PCI express capability at offset 0xcc.  Using
0x3c bytes, we extend 8 bytes past the legacy config space area :(


Wow, that device sounds broken to me.  The spec is pretty clear.


Yes, I agree it's broken. Looks like something that
happens when a device is designed in parallel with the spec.

What bothers me is this patch seems to make devices that do behave
correctly out of spec (registers will be writeable by default) -
correct?

How about we check for overflow and only do the hacks
if it happens?

Also, the code to initialize slot and root control registers is still
there: it would seem that running it will corrupt memory beyond the
config array?


I take this last bit back: registers we touch are at offset < 0x34.
Sorry about the noise. But the question about read-only registers
still stands.


Also, where does the magic 0x34 come from? I'm guessing this is
simply what's left till the end of the config space.
So let's be as conservative & specific as possible with
this hack:



I believe the spec leaves room for interpretation, and thus the
resulting 'broken' device.  As I read the spec, the size of the struct can be:

-- 0x2c for all devices, min., that are cap version 2 or higher.
-- 0x34 for devices with links, i.e., not a root-port-based device,
   i.e., a device not integrated into the root port, or, if it is,
   one that uses a serial link anyhow (doesn't strip out the 8b/10b
   serial link btwn device & internal root port)
-- 0x3c for devices with slots, i.e., a bridge with downstream slots,
   i.e., not an endpoint, i.e., a PCI(e) bridge.

Thus, 0x34 was chosen, since we don't support device assigning PCI bridges,
(not until MRIOV shows up, at least), and 0x34 fits the bug at hand, and
device cap/stat/control may be used/modified.

So, a 'hack' is not needed.  In fact, the 0x34 size is a bit of a hack,
since the case to use 0x2c could be ascertained by checking if the device
is a root port device, _and_ it's not using a serial link, but a perusal of
root port devices on a number of systems I looked at always had this structure
greater than 0x2c, so I figured the simple heuristic of 0x34 was sufficient.


/* A version 2 device was observed to only have a partial
   * implementation for the capability structure. Apparently, it doesn't
   * implement the registers from slot capability 2 and on (offset 0x34),
   * with the capability at offset 0xCC = 256 - 0x34. This is out of spec,
   * but let's try to support this. */
if (version == 2 && pos == 0xCC) {
size = 0x34;
}


The more I look at it, the more I think that maybe this is an especially
broken device and we shouldn't change the default for it.  BTW, the
programming reference for this device is here:

http://www.broadcom.com/collateral/pg/5761-PG100-R.pdf

They've burned up most of the capability area for vendor specific
registers, so there's not enough room for the full pci-e capability
structure.  I'd be fine with adding a test specifically for this.  As
you suggested on IRC, print some warning and cut pci-e back to 0x34 if
it extends past the legacy config space, and reject the device if it's
still too small.  Leave the default 0x3c since this is the only device
we've found with this problem.  Thanks,

Alex





I have

Re: [PATCH 1/2] pci: bounds check offsets into config_map

2011-07-21 Thread Don Dutile

On 07/21/2011 04:11 AM, Michael S. Tsirkin wrote:

On Wed, Jul 20, 2011 at 07:49:34PM -0400, Donald Dutile wrote:

Bounds check to avoid walking off config_map[]
and corrupting what is after it.

Signed-off-by: Donald Dutile <ddut...@redhat.com>
cc: Alex Williamson <alex.william...@redhat.com>
cc: Michael S. Tsirkin <m...@redhat.com>
---

  hw/pci.c |   16 ++--
  1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 5deaa04..9a7ff2d 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -2078,12 +2078,24 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
  }
  } else {
  int i;
+   uint32_t config_size = pci_config_size(pdev);


This is actually slightly wrong: even for express
devices, legacy capabilities can not spill out from
the low 256 bytes: the extended config space has its
own capability list always starting with the express
capability at offset 256.


Will just remove this checking & put it all in device-assignment.c


+
+   /* ensure don't walk off end of config_map[] */
+   if (offset > (config_size - size)) {


I prefer not to have () around math here.



ok.


+fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
+"Attempt to add PCI capability 0x%x at offset 0x%x, "
+"size 0x%x, which exceeds emulated PCI config space 0x%x\n",
+pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
+PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
+cap_id, offset, size, config_size);
+   return -EINVAL;
+   }

   for (i = offset; i < offset + size; i++) {
   if (pdev->config_map[i]) {
   fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
-"Attempt to add PCI capability %x at offset "
-"%x overlaps existing capability %x at offset %x\n",
+"Attempt to add PCI capability 0x%x at offset "
+"0x%x overlaps existing capability 0x%x at offset 0x%x\n",
   pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
   PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
   cap_id, offset, pdev->config_map[i], i);




Re: [PATCH V2] device-assignment pci: correct pci config size default for cap version 2 endpoints

2011-07-21 Thread Don Dutile

On 07/21/2011 01:02 PM, Alex Williamson wrote:

On Thu, 2011-07-21 at 12:39 -0400, Donald Dutile wrote:

v2: do a local boundary check with respect to the legacy PCI header length,
 and don't depend on it in pci_add_capability().
   : fix compilation, and change the second else to a simple else for all other cases.

Doing device assignment using a PCIe device with its
PCI Cap structure at offset 0xcc showed a problem in
the default size mapped for this cap-id.

The failure caused a corruption which might have gone unnoticed
otherwise.

Fix assigned_device_pci_cap_init() to set the default
size of PCIe Cap structure (cap-id 0x10) to 0x34 instead of 0x3c.
0x34 is default, min, for endpoint device with a cap version of 2.

Add a check in assigned_device_pci_cap_init() to ensure
size of Cap structure doesn't exceed legacy PCI header space,
which is where it must fit.

Signed-off-by: Donald Dutile <ddut...@redhat.com>
cc: Alex Williamson <alex.william...@redhat.com>
cc: Michael S. Tsirkin <m...@redhat.com>
---

  hw/device-assignment.c |   19 ---
  1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 36ad6b0..6bb8af7 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1419,21 +1419,34 @@ static int assigned_device_pci_cap_init(PCIDevice 
*pci_dev)
  }

  if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
-uint8_t version;
+uint8_t version, size;
  uint16_t type, devctl, lnkcap, lnksta;
  uint32_t devcap;
-int size = 0x3c; /* version 2 size */

   version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
   version &= PCI_EXP_FLAGS_VERS;
   if (version == 1) {
   size = 0x14;
-} else if (version < 2) {
+} else if (version == 2) {
+/* don't include slot cap/stat/ctrl 2 regs; only support endpoints */
+size = 0x34;
+} else {
   fprintf(stderr, "Unsupported PCI express capability version %d\n",
   version);
  return -EINVAL;
  }

+   /* make sure cap struct resides in legacy hdr space */
+   if (size > PCI_CONFIG_SPACE_SIZE - pos) {
+fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
+"Attempt to add PCI Cap Structure 0x%x at offset 0x%x, "
+"size 0x%x, which exceeds legacy PCI config space 0x%x\n",
+pci_find_domain(pci_dev->bus), pci_bus_num(pci_dev->bus),
+PCI_SLOT(pci_dev->devfn), PCI_FUNC(pci_dev->devfn),
+PCI_CAP_ID_EXP, pos, size, PCI_CONFIG_SPACE_SIZE);
+return -EINVAL;
+}
+


This is crazy, why would we only test this for PCI_CAP_ID_EXP?  If the
test is going to go in device-assignment, we need to wrap
pci_add_capability and do it for all callers.  However, maybe we can get
MST to take it in pci_add_capability() if we make the test more
complete...  something like:

if ((pos < 256 && size > 256 - pos) ||
 (pci_config_size() > 256 && pos >= 256 &&
  size > pci_config_size() - pos)) {
... badness

Thanks,

Alex


+1


   if ((ret = pci_add_capability(pci_dev, PCI_CAP_ID_EXP,
pos, size)) < 0) {
   return ret;









Re: [PATCH v4 6/9] KVM: Save/restore state of assigned PCI device

2010-11-09 Thread Don Dutile

Avi Kivity wrote:

On 11/09/2010 03:44 PM, Jan Kiszka wrote:


  Oh yes, I read the code but it didn't register.  Of course this change
  is quite necessary.

  (I understood you to mean that the PCI 2.3 reset doesn't reset
  everything, but that isn't what you said).

What the hardware makes out of the reset is even another story. No
guarantees I bet (isn't function-level reset an optional thing anyway?).


It is.


optional for PF's; mandatory for VFs.



Re: [PATCH v2] device-assignment: Better fd tracking

2010-07-08 Thread Don Dutile
Alex Williamson wrote:
 Commit 909bfdba fixed a hole with not closing resource file descriptors
 but we need to be more careful about tracking which are real fds,
 otherwise we might close fd 0, which doesn't work out so well for stdio.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
  v2: fix qemu style braces
 
  hw/device-assignment.c |   10 ++
  1 files changed, 6 insertions(+), 4 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 48ac73c..de9470c 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -694,6 +694,7 @@ again:
  
   rp = dev->regions + r;
   rp->valid = 0;
  +rp->resource_fd = -1;
  size = end - start + 1;
  flags = IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH;
   if (size == 0 || (flags & ~IORESOURCE_PREFETCH) == 0)
 @@ -785,7 +786,9 @@ static void free_assigned_device(AssignedDevice *dev)
   fprintf(stderr,
    "Failed to unmap assigned device region: %s\n",
    strerror(errno));
 -close(pci_region-resource_fd);
 +if (pci_region-resource_fd = 0) {
 +close(pci_region-resource_fd);
 +}
  }
   }
  }
 @@ -793,9 +796,8 @@ static void free_assigned_device(AssignedDevice *dev)
  if (dev-cap.available  ASSIGNED_DEVICE_CAP_MSIX)
  assigned_dev_unregister_msix_mmio(dev);
  
 -if (dev-real_device.config_fd) {
 +if (dev-real_device.config_fd = 0) {
  close(dev-real_device.config_fd);
 -dev-real_device.config_fd = 0;
  }
  
  #ifdef KVM_CAP_IRQ_ROUTING
 @@ -1415,7 +1417,7 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
  
  if (!dev-host.seg  !dev-host.bus  !dev-host.dev  
 !dev-host.func) {
  error_report(pci-assign: error: no host device specified);
 -goto out;
 +return -1;
  }
  
  if (get_real_device(dev, dev-host.seg, dev-host.bus,
 

Thanks for the reply on the other if checks.

Acked-by: Donald Dutile ddut...@redhat.com



Re: [PATCH] device-assignment: Better fd tracking

2010-07-07 Thread Don Dutile

Alex Williamson wrote:

Commit 909bfdba fixed a hole with not closing resource file descriptors
but we need to be more careful about tracking which are real fds,
otherwise we might close fd 0, which doesn't work out so well for stdio.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/device-assignment.c |   10 +-
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 48ac73c..3bcb63d 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -694,6 +694,7 @@ again:
 
     rp = dev->regions + r;
     rp->valid = 0;
+    rp->resource_fd = -1;
     size = end - start + 1;
     flags = IORESOURCE_IO | IORESOURCE_MEM | IORESOURCE_PREFETCH;
     if (size == 0 || (flags & ~IORESOURCE_PREFETCH) == 0)
@@ -785,7 +786,8 @@ static void free_assigned_device(AssignedDevice *dev)
             fprintf(stderr,
                     "Failed to unmap assigned device region: %s\n",
                     strerror(errno));
-            close(pci_region->resource_fd);
+            if (pci_region->resource_fd >= 0)
+                close(pci_region->resource_fd);
         }
     }
 }
@@ -793,10 +795,8 @@ static void free_assigned_device(AssignedDevice *dev)
     if (dev->cap.available & ASSIGNED_DEVICE_CAP_MSIX)
         assigned_dev_unregister_msix_mmio(dev);
 
-    if (dev->real_device.config_fd) {
+    if (dev->real_device.config_fd >= 0)
         close(dev->real_device.config_fd);
-    dev->real_device.config_fd = 0;
-}
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
     free_dev_irq_entries(dev);
@@ -1415,7 +1415,7 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
 
     if (!dev->host.seg && !dev->host.bus && !dev->host.dev && !dev->host.func) {
         error_report("pci-assign: error: no host device specified");
-        goto out;
+        return -1;
     }
 
     if (get_real_device(dev, dev->host.seg, dev->host.bus,




And why not change the "goto out" in the next two if checks to "return -1" as well,
to skip the free_assigned_device(dev) call for those two cases too?
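The idiom being questioned can be shown in miniature. This is a sketch with hypothetical names, not the qemu function: a failure before anything is acquired can simply return, while "goto out" only earns its keep once there is state for the out: label to unwind.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 0 if the first byte of the file at path can be read, -1 otherwise. */
static int init_device(const char *path)
{
    FILE *f;

    if (path == NULL || *path == '\0')
        return -1;                /* nothing acquired yet: plain return */

    f = fopen(path, "r");
    if (f == NULL)
        return -1;                /* still nothing to unwind */

    if (fgetc(f) == EOF)
        goto out;                 /* from here on, f must be released */

    fclose(f);
    return 0;
out:
    fclose(f);
    return -1;
}
```

Using "goto out" on the pre-acquisition paths is not wrong per se, but it forces the cleanup code to tolerate half-initialized state, which is exactly the fd-0 class of bug the patch fixes.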


