Re: [Bug 42679] New: DMA Read on Marvell 88SE9128 fails when Intel's IOMMU is on

2012-01-30 Thread Don Dutile

On 01/30/2012 03:59 PM, Andrew Morton wrote:


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sat, 28 Jan 2012 17:55:38 GMT
bugzilla-dae...@bugzilla.kernel.org wrote:


https://bugzilla.kernel.org/show_bug.cgi?id=42679


I don't know if this is a SATA issue or intel-iommu.  Could you guys
please take a look?


           Summary: DMA Read on Marvell 88SE9128 fails when Intel's IOMMU
                    is on
           Product: Memory Management
           Version: 2.5
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Other
        AssignedTo: a...@linux-foundation.org
        ReportedBy: pawel@gmail.com
        Regression: No


Created an attachment (id=72217)
  -->  (https://bugzilla.kernel.org/attachment.cgi?id=72217)
Output of `dmesg' command

I have an MSI Z68A-GD80 B3 motherboard, and when I try to enable Intel's IOMMU
(kernel booted with intel_iommu=on), the integrated Marvell 88SE9128 SATA
controller doesn't work.

To reproduce:
1. Compile and prepare kernel with Intel IOMMU support enabled
(CONFIG_INTEL_IOMMU=y).
2. Reboot the computer.
3. Enter BIOS and enable VT-d.
4. Boot the kernel with intel_iommu=on parameter.

Right after boot, kernel reports the following errors (SATA controller is at
0b:00.0):

[    2.639774] DRHD: handling fault status reg 3
[    2.639782] DMAR:[DMA Read] Request device [0b:00.1] fault addr fff0
[    2.639783] DMAR:[fault reason 02] Present bit in context entry is clear

After a while these entries appear:

[    7.625837] ata14.00: qc timeout (cmd 0xa1)
[    7.628341] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[    7.935483] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   17.908407] ata14.00: qc timeout (cmd 0xa1)
[   17.910935] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   17.912276] ata14: limiting SATA link speed to 1.5 Gbps
[   18.219077] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   48.134607] ata14.00: qc timeout (cmd 0xa1)
[   48.137508] ata14.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[   48.444646] ata14: SATA link up 1.5 Gbps (SStatus 113 SControl 310)

When there is a disk connected to the controller, it does not work. When there
is none, the computer starts normally, apart from a long boot delay caused,
presumably, by probing the device.

Since this is the secondary controller on these motherboards, you can eliminate
the symptoms by plugging the disk into one of the available ports of the
built-in Intel SATA controller and disabling the Marvell one in the BIOS. The
other work-around, if you need the eSATA capabilities of the latter, is to
disable VT-d technology, also in the BIOS.



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Well, the lspci dump in the bugzilla report doesn't show a device w/BDF=0b:00.1;
so, if the SATA device (which is 0b:00.0) is spitting out 0b:00.1 as the source
of any of its DMA packets, the IOMMU will fault on it, since 0b:00.1 didn't
request DMA mappings (0b:00.0 did).
I semi-recall someone else reporting this 'feature' on this list.
I wonder if a pci quirk has to filter this case (0b:00.0 on this system means
map for both 0b:00.0 & 0b:00.1 -- ick!).

Do another lspci -vvv to ensure that 0b:00.1 wasn't excluded from the list.
If it doesn't exist, then the problem is the SATA device using an
unknown/unrecognized BDF of 0b:00.1.


Re: [PATCH] intel_iommu,dmar: reserve mmio of IOMMU registers

2012-03-09 Thread Don Dutile

self-nak.
Found an iounmap() with a missing release_mem_region();
will post V2 shortly.

On 03/08/2012 06:51 PM, Donald Dutile wrote:

Intel-iommu initialization doesn't currently reserve the memory used
for the IOMMU registers. This can allow the pci resource allocator
to assign a device BAR to the same address as the IOMMU registers.
This can cause some not-so-nice side effects when the driver
ioremaps that region.

Signed-off-by: Donald Dutile
---
  drivers/iommu/dmar.c |   18 ++++++++++++++++--
  1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 35c1e17..1fcbd96 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -599,10 +599,16 @@ int alloc_iommu(struct dmar_drhd_unit *drhd)
iommu->seq_id = iommu_allocated++;
sprintf (iommu->name, "dmar%d", iommu->seq_id);

+   map_size = VTD_PAGE_SIZE;
+   if (!request_mem_region(drhd->reg_base_addr, map_size, iommu->name)) {
+   printk(KERN_ERR "IOMMU: can't reserve memory\n");
+   goto error;
+   }
+
iommu->reg = ioremap(drhd->reg_base_addr, VTD_PAGE_SIZE);
if (!iommu->reg) {
printk(KERN_ERR "IOMMU: can't map the region\n");
-   goto error;
+   goto err_release;
}
iommu->cap = dmar_readq(iommu->reg + DMAR_CAP_REG);
iommu->ecap = dmar_readq(iommu->reg + DMAR_ECAP_REG);
@@ -637,10 +643,16 @@ int alloc_iommu(struct dmar_drhd_unit *drhd)
map_size = VTD_PAGE_ALIGN(map_size);
	if (map_size > VTD_PAGE_SIZE) {
iounmap(iommu->reg);
+   release_mem_region(drhd->reg_base_addr, VTD_PAGE_SIZE);
+   if (!request_mem_region(drhd->reg_base_addr, map_size,
+   iommu->name)) {
+   printk(KERN_ERR "IOMMU: can't reserve memory\n");
+   goto error;
+   }
iommu->reg = ioremap(drhd->reg_base_addr, map_size);
if (!iommu->reg) {
printk(KERN_ERR "IOMMU: can't map the region\n");
-   goto error;
+   goto err_release;
}
}

@@ -659,6 +671,8 @@ int alloc_iommu(struct dmar_drhd_unit *drhd)

   err_unmap:
iounmap(iommu->reg);
+ err_release:
+   release_mem_region(drhd->reg_base_addr, VTD_PAGE_SIZE);
   error:
kfree(iommu);
return -1;




Re: [PATCH v2] intel_iommu,dmar: reserve mmio of IOMMU registers

2012-04-05 Thread Don Dutile

On 04/05/2012 04:58 PM, Chris Wright wrote:

* Donald Dutile (ddut...@redhat.com) wrote:

--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -581,7 +581,7 @@ int __init detect_intel_iommu(void)
  int alloc_iommu(struct dmar_drhd_unit *drhd)
  {
struct intel_iommu *iommu;
-   int map_size;
+   resource_size_t map_size;
u32 ver;
static int iommu_allocated = 0;
int agaw = 0;
@@ -599,10 +599,17 @@ int alloc_iommu(struct dmar_drhd_unit *drhd)
iommu->seq_id = iommu_allocated++;
sprintf (iommu->name, "dmar%d", iommu->seq_id);

-   iommu->reg = ioremap(drhd->reg_base_addr, VTD_PAGE_SIZE);


I think it'd be nice to create a helper function that does the request
and map.  This should include the map, read cap/ecap, calculate size and
possibly remap.  Would probably simplify the error cleanup too.


+   iommu->reg_phys = drhd->reg_base_addr;
+   iommu->reg_size = VTD_PAGE_SIZE;
+   if (!request_mem_region(iommu->reg_phys, iommu->reg_size, iommu->name)) {
+   printk(KERN_ERR "IOMMU: can't reserve memory\n");
+   goto error;
+   }
+
+   iommu->reg = ioremap(iommu->reg_phys, iommu->reg_size);
if (!iommu->reg) {
printk(KERN_ERR "IOMMU: can't map the region\n");
-   goto error;
+   goto err_release;
}
iommu->cap = dmar_readq(iommu->reg + DMAR_CAP_REG);
iommu->ecap = dmar_readq(iommu->reg + DMAR_ECAP_REG);
@@ -635,19 +642,26 @@ int alloc_iommu(struct dmar_drhd_unit *drhd)
map_size = max_t(int, ecap_max_iotlb_offset(iommu->ecap),
cap_max_fault_reg_offset(iommu->cap));
map_size = VTD_PAGE_ALIGN(map_size);
-   if (map_size > VTD_PAGE_SIZE) {
+   if (map_size > iommu->reg_size) {
iounmap(iommu->reg);
-   iommu->reg = ioremap(drhd->reg_base_addr, map_size);
+   release_mem_region(iommu->reg_phys, iommu->reg_size);
+   iommu->reg_size = map_size;
+   if (!request_mem_region(iommu->reg_phys, iommu->reg_size,
+   iommu->name)) {
+   printk(KERN_ERR "IOMMU: can't reserve memory\n");
+   goto error;
+   }
+   iommu->reg = ioremap(iommu->reg_phys, iommu->reg_size);
if (!iommu->reg) {
printk(KERN_ERR "IOMMU: can't map the region\n");
-   goto error;
+   goto err_release;
}
}

ver = readl(iommu->reg + DMAR_VER_REG);
pr_info("IOMMU %d: reg_base_addr %llx ver %d:%d cap %llx ecap %llx\n",
iommu->seq_id,
-   (unsigned long long)drhd->reg_base_addr,
+   (unsigned long long)iommu->reg_phys,
DMAR_VER_MAJOR(ver), DMAR_VER_MINOR(ver),
(unsigned long long)iommu->cap,
(unsigned long long)iommu->ecap);
@@ -659,6 +673,8 @@ int alloc_iommu(struct dmar_drhd_unit *drhd)

   err_unmap:
iounmap(iommu->reg);
+ err_release:
+   release_mem_region(iommu->reg_phys, iommu->reg_size);
   error:
kfree(iommu);
return -1;
@@ -671,8 +687,11 @@ void free_iommu(struct intel_iommu *iommu)

free_dmar_iommu(iommu);

-   if (iommu->reg)
+   if (iommu->reg) {
iounmap(iommu->reg);
+   release_mem_region(iommu->reg_phys, iommu->reg_size);
+   }
+
kfree(iommu);
  }

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index e6ca56d..c6d132b 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -308,6 +308,8 @@ enum {

  struct intel_iommu {
void __iomem*reg; /* Pointer to hardware regs, virtual addr */
+   resource_size_t reg_phys;
+   resource_size_t reg_size;


I'd make these u64


ok, will work up v3 tomorrow..

thanks for feedback!


Re: [PATCH 05/13] pci: New pci_acs_enabled()

2012-05-16 Thread Don Dutile

On 05/15/2012 05:09 PM, Alex Williamson wrote:

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
  wrote:

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
  wrote:

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.


The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.


Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);


+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.


I'm not sure what "a complete PCI ACS chain" means.

The function starts from "dev" and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.


Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,


Is this "redirection" just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                 ^
                 |
        +--------+--------+
        |                 |
 +------+-----+    +------+-----+
 | Downstream |    | Downstream |
 |    Port    |    |    Port    |
 |  06:05.0   |    |  06:06.0   |
 +------+-----+    +------+-----+
        |                 |
   +----v----+       +----v----+
   | Endpoint|       | Endpoint|
   | 07:00.0 |       | 08:00.0 |
   +---------+       +---------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.


Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.


detail: PCIe up/downstream routing is really done by an internal switch;
ACS forces the legacy, PCI base-limit address routing and *forces*
the switch to always route the transaction from a downstream port
to the upstream port.


I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.


Yep, that's my meager understanding as well.


+1


so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.


Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.


Yes, modulo this comment in libvirt source:

 /* if we have no parent, and this is the root bus, ACS doesn't come
  * into play since devices on the root bus can't P2P without going
  * through the root IOMMU.
  */


Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.


So we assume that a redirect at the point of the iommu will factor in
iommu translation.


If so, I think you're trying to find out the closest upstream device X
such that everything leading to X has ACS enabled.  Every device below
X can DMA freely to other devices below X, so they would all have to
be in the same isolated group.


Yes


I tried to work through some examples to develop some intuition about this:


(inserting fixed url)

http://www.asciiflow.com/#3736558963405980039



pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
if 00:00.0 is PCIe or if RP has ACS?))


Hmm, the latter is the assumption above.  For the former, I think
libvirt was probably assuming that PCI devices must have a PCIe device
upstream from them because x86 doesn't have assignment friendly IOMMUs
except on PCIe.  I'll need to work on making that more generic.


pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01

Re: RESEND3: Re: [PATCH 05/13] pci: New pci_acs_enabled()

2012-05-18 Thread Don Dutile

On 05/18/2012 06:02 PM, Alex Williamson wrote:

On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:

On 05/15/2012 05:09 PM, Alex Williamson wrote:

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
   wrote:

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
   wrote:

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.


The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.


Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);


+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.

Finally, with my email filters fixed, I can see this email... :)



Yep, the new API I'm working with is:

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
   struct pci_dev *end, u16 acs_flags);


ok.


I'm not sure what "a complete PCI ACS chain" means.

The function starts from "dev" and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.


Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,


Is this "redirection" just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                 ^
                 |
        +--------+--------+
        |                 |
 +------+-----+    +------+-----+
 | Downstream |    | Downstream |
 |    Port    |    |    Port    |
 |  06:05.0   |    |  06:06.0   |
 +------+-----+    +------+-----+
        |                 |
   +----v----+       +----v----+
   | Endpoint|       | Endpoint|
   | 07:00.0 |       | 08:00.0 |
   +---------+       +---------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.


Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.


detail: PCIe up/downstream routing is really done by an internal switch;
  ACS forces the legacy, PCI base-limit address routing and *forces*
  the switch to always route the transaction from a downstream port
  to the upstream port.


I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.


Yep, that's my meager understanding as well.


+1


so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.


Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.


Yes, modulo this comment in libvirt source:

  /* if we have no parent, and this is the root bus, ACS doesn't come
   * into play since devices on the root bus can't P2P without going
   * through the root IOMMU.
   */


Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.


Would you mind looking for the paragraph that says this?  I'd rather
code this into the iommu driver callers than core PCI code if this is
just a platform standard.


In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
ACS upstream fwding: Must be implemented by Root Ports if the RC supports
 Redirected Request Validation;
-- which means, if a Root port allows a peer-to-peer transaction to another
   one of its ports, the

Re: [PATCH 05/13] pci: New pci_acs_enabled()

2012-05-21 Thread Don Dutile

On 05/18/2012 10:47 PM, Alex Williamson wrote:

On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:

On 05/18/2012 06:02 PM, Alex Williamson wrote:

On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:

On 05/15/2012 05:09 PM, Alex Williamson wrote:

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
wrote:

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
wrote:

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.


The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.


Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);


+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.

Finally, with my email filters fixed, I can see this email... :)


Welcome back ;)


Indeed... and I recvd 3 copies of this reply,
so the pendulum has flipped the other direction... ;-)


Yep, the new API I'm working with is:

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
struct pci_dev *end, u16 acs_flags);


ok.


I'm not sure what "a complete PCI ACS chain" means.

The function starts from "dev" and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.


Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,


Is this "redirection" just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                 ^
                 |
        +--------+--------+
        |                 |
 +------+-----+    +------+-----+
 | Downstream |    | Downstream |
 |    Port    |    |    Port    |
 |  06:05.0   |    |  06:06.0   |
 +------+-----+    +------+-----+
        |                 |
   +----v----+       +----v----+
   | Endpoint|       | Endpoint|
   | 07:00.0 |       | 08:00.0 |
   +---------+       +---------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.


Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.


detail: PCIe up/downstream routing is really done by an internal switch;
   ACS forces the legacy, PCI base-limit address routing and *forces*
   the switch to always route the transaction from a downstream port
   to the upstream port.


I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.


Yep, that's my meager understanding as well.


+1


so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.


Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.


Yes, modulo this comment in libvirt source:

   /* if we have no parent, and this is the root bus, ACS doesn't come
* into play since devices on the root bus can't P2P without going
* through the root IOMMU.
*/


Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.


Would you mind looking for the paragraph that says this?  I'd rather
code this into the iommu driver callers than core PCI code if this is
just a platform standard.


In section 6.12.1.1 of

Re: [PATCH 05/13] pci: New pci_acs_enabled()

2012-05-21 Thread Don Dutile

On 05/21/2012 10:59 AM, Alex Williamson wrote:

On Mon, 2012-05-21 at 09:31 -0400, Don Dutile wrote:

On 05/18/2012 10:47 PM, Alex Williamson wrote:

On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:

On 05/18/2012 06:02 PM, Alex Williamson wrote:

On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:

On 05/15/2012 05:09 PM, Alex Williamson wrote:

On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:

On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
 wrote:

On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:

On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
 wrote:

In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.


The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.


Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);


+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.

Finally, with my email filters fixed, I can see this email... :)


Welcome back ;)


Indeed... and I recvd 3 copies of this reply,
so the pendulum has flipped the other direction... ;-)


Yep, the new API I'm working with is:

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
 struct pci_dev *end, u16 acs_flags);


ok.


I'm not sure what "a complete PCI ACS chain" means.

The function starts from "dev" and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.


Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation,


Is this "redirection" just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                 ^
                 |
        +--------+--------+
        |                 |
 +------+-----+    +------+-----+
 | Downstream |    | Downstream |
 |    Port    |    |    Port    |
 |  06:05.0   |    |  06:06.0   |
 +------+-----+    +------+-----+
        |                 |
   +----v----+       +----v----+
   | Endpoint|       | Endpoint|
   | 07:00.0 |       | 08:00.0 |
   +---------+       +---------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.


Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.


detail: PCIe up/downstream routing is really done by an internal switch;
ACS forces the legacy, PCI base-limit address routing and *forces*
the switch to always route the transaction from a downstream port
to the upstream port.


I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.


Yep, that's my meager understanding as well.


+1


so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.


Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.


Yes, modulo this comment in libvirt source:

/* if we have no parent, and this is the root bus, ACS doesn't come
 * into play since devices on the root bus can't P2P without going
 * through the root IOMMU.
 */


Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.


Would you mind looking for the paragraph that says this?  I'd rather
code this 

Re: [PATCH v2 03/13] iommu: IOMMU groups for VT-d and AMD-Vi

2012-05-24 Thread Don Dutile

On 05/22/2012 01:04 AM, Alex Williamson wrote:

Add back group support for AMD & Intel.  amd_iommu already tracks
devices and has init and uninit routines to manage groups.
intel-iommu does this on the fly, so we make use of the notifier
support built into iommu groups to create and remove groups.

Signed-off-by: Alex Williamson
---

  drivers/iommu/amd_iommu.c   |   28 +++++++++++++++++++++++++++-
  drivers/iommu/intel-iommu.c |   46 ++++++++++++++++++++++++++++++++++++++++++++++++
  2 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 32c00cd..b7e5ddf 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -256,9 +256,11 @@ static bool check_device(struct device *dev)

  static int iommu_init_device(struct device *dev)
  {
-   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_dev *dma_pdev, *pdev = to_pci_dev(dev);
struct iommu_dev_data *dev_data;
+   struct iommu_group *group;
u16 alias;
+   int ret;

if (dev->archdata.iommu)
return 0;
@@ -279,8 +281,30 @@ static int iommu_init_device(struct device *dev)
return -ENOTSUPP;
}
dev_data->alias_data = alias_data;
+
+   dma_pdev = pci_get_bus_and_slot(alias >> 8, alias & 0xff);
+   } else
+   dma_pdev = pdev;
+
+   if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
+   pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+   dma_pdev = pci_get_slot(pdev->bus,
+   PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+
+   group = iommu_group_get(&dma_pdev->dev);
+   if (!group) {
+   group = iommu_group_alloc();
+   if (IS_ERR(group))
+   return PTR_ERR(group);
}

+   ret = iommu_group_add_device(group, dev);
+
+   iommu_group_put(group);
+

do you want to do a put if there is a failure in the iommu_group_add_device()?

+   if (ret)
+   return ret;
+
if (pci_iommuv2_capable(pdev)) {
struct amd_iommu *iommu;

@@ -309,6 +333,8 @@ static void iommu_ignore_device(struct device *dev)

  static void iommu_uninit_device(struct device *dev)
  {
+   iommu_group_remove_device(dev);
+
/*
 * Nothing to do here - we keep dev_data around for unplugged devices
 * and reuse it when the device is re-plugged - not doing so would
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d4a0ff7..e63b33b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4087,6 +4087,50 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
return 0;
  }

+static int intel_iommu_add_device(struct device *dev)
+{
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct pci_dev *bridge, *dma_pdev = pdev;
+   struct iommu_group *group;
+   int ret;
+
+   if (!device_to_iommu(pci_domain_nr(pdev->bus),
+pdev->bus->number, pdev->devfn))
+   return -ENODEV;
+
+   bridge = pci_find_upstream_pcie_bridge(pdev);
+   if (bridge) {
+   if (pci_is_pcie(bridge))
+   dma_pdev = pci_get_domain_bus_and_slot(
+   pci_domain_nr(pdev->bus),
+   bridge->subordinate->number, 0);
+   else
+   dma_pdev = bridge;
+   }
+
+   if (!pdev->is_virtfn && PCI_FUNC(pdev->devfn) && iommu_group_mf &&
+   pdev->hdr_type == PCI_HEADER_TYPE_NORMAL)
+   dma_pdev = pci_get_slot(pdev->bus,
+   PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
+
+   group = iommu_group_get(&dma_pdev->dev);
+   if (!group) {
+   group = iommu_group_alloc();
+   if (IS_ERR(group))
+   return PTR_ERR(group);
+   }
+
+   ret = iommu_group_add_device(group, dev);
+

ditto.

+   iommu_group_put(group);
+   return ret;
+}
+
+static void intel_iommu_remove_device(struct device *dev)
+{
+   iommu_group_remove_device(dev);
+}
+
  static struct iommu_ops intel_iommu_ops = {
.domain_init= intel_iommu_domain_init,
.domain_destroy = intel_iommu_domain_destroy,
@@ -4096,6 +4140,8 @@ static struct iommu_ops intel_iommu_ops = {
.unmap  = intel_iommu_unmap,
.iova_to_phys   = intel_iommu_iova_to_phys,
.domain_has_cap = intel_iommu_domain_has_cap,
+   .add_device = intel_iommu_add_device,
+   .remove_device  = intel_iommu_remove_device,
.pgsize_bitmap  = INTEL_IOMMU_PGSIZES,
  };






Re: [PATCH v2 05/13] pci: Add ACS validation utility

2012-05-24 Thread Don Dutile

On 05/22/2012 01:05 AM, Alex Williamson wrote:

In a PCIe environment, transactions aren't always required to reach
the root bus before being re-routed.  Intermediate switches between
an endpoint and the root bus can redirect DMA back downstream before
things like IOMMUs have a chance to intervene.  Legacy PCI is always
susceptible to this as it operates on a shared bus.  PCIe added a
new capability to describe and control this behavior, Access Control
Services, or ACS.  The utility function pci_acs_enabled() allows us
to test the ACS capabilities of an individual device against a set
of flags while pci_acs_path_enabled() tests a complete path from
a given downstream device up to the specified upstream device.  We
also include the ability to add device specific tests as it's
likely we'll see devices that do not implement ACS, but want to
indicate support for various capabilities in this space.

Signed-off-by: Alex Williamson
---

  drivers/pci/pci.c|   76 ++
  drivers/pci/quirks.c |   29 +++
  include/linux/pci.h  |   10 ++-
  3 files changed, 114 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 111569c..ab6c2a6 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2359,6 +2359,82 @@ void pci_enable_acs(struct pci_dev *dev)
  }

  /**
+ * pci_acs_enable - test ACS against required flags for a given device

typo:   ^^^ missing 'd'


+ * @pdev: device to test
+ * @acs_flags: required PCI ACS flags
+ *
+ * Return true if the device supports the provided flags.  Automatically
+ * filters out flags that are not implemented on multifunction devices.
+ */
+bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
+{
+   int pos;
+   u16 ctrl;
+
+   if (pci_dev_specific_acs_enabled(pdev, acs_flags))
+   return true;
+
+   if (!pci_is_pcie(pdev))
+   return false;
+
+   if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
+   pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+   if (!pos)
+   return false;
+
+   pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+   if ((ctrl & acs_flags) != acs_flags)
+   return false;
+   } else if (pdev->multifunction) {
+   /* Filter out flags not applicable to multifunction */
+   acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
+ PCI_ACS_EC | PCI_ACS_DT);
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
+   if (!pos)
+   return false;
+
+   pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
+   if ((ctrl & acs_flags) != acs_flags)
+   return false;
+   }
+
+   return true;

or, to reduce duplicated code (which the compiler may do anyway?):

/* Filter out flags not applicable to multifunction */
if (pdev->multifunction)
acs_flags &= (PCI_ACS_RR | PCI_ACS_CR |
  PCI_ACS_EC | PCI_ACS_DT);

if (pdev->pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
pdev->pcie_type == PCI_EXP_TYPE_ROOT_PORT ||
pdev->multifunction) {
pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);
if (!pos)
return false;
pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
if ((ctrl & acs_flags) != acs_flags)
return false;
}

return true;

+}


But the above doesn't handle the case where the RC does not do
peer-to-peer btwn root ports. Per the ACS spec, such an RC's root ports
don't need to provide an ACS cap, since peer-to-peer port xfers aren't
allowed/enabled/supported, so by design the root port is ACS compliant.
ATM, an IOMMU-capable system is a pre-req for VFIO,
and all such systems have an ACS cap, but that may not always be true.
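To make the multifunction flag-filtering above concrete, here is a stand-alone toy model of the check at the heart of pci_acs_enabled(); the names mirror the patch, but everything below is a sketch, not the kernel headers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the ACS capability/control bits used by the patch */
#define PCI_ACS_SV 0x01  /* Source Validation */
#define PCI_ACS_RR 0x04  /* P2P Request Redirect */
#define PCI_ACS_CR 0x08  /* P2P Completion Redirect */
#define PCI_ACS_UF 0x10  /* Upstream Forwarding */
#define PCI_ACS_EC 0x20  /* P2P Egress Control */
#define PCI_ACS_DT 0x40  /* Direct Translated P2P */

/* Given the device's ACS control register value, are all required
 * flags enabled, after dropping the flags that don't apply to
 * multifunction devices? */
static bool acs_satisfied(uint16_t ctrl, uint16_t acs_flags, bool multifunction)
{
    if (multifunction)
        acs_flags &= (PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_EC | PCI_ACS_DT);
    return (ctrl & acs_flags) == acs_flags;
}
```

So a multifunction endpoint that only enables RR and CR still passes a request for SV|RR|CR|UF, while a downstream port with the same control value does not.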


+EXPORT_SYMBOL_GPL(pci_acs_enabled);
+
+/**
+ * pci_acs_path_enabled - test ACS flags from start to end in a hierarchy
+ * @start: starting downstream device
+ * @end: ending upstream device or NULL to search to the root bus
+ * @acs_flags: required flags
+ *
+ * Walk up a device tree from start to end testing PCI ACS support.  If
+ * any step along the way does not support the required flags, return false.
+ */
+bool pci_acs_path_enabled(struct pci_dev *start,
+ struct pci_dev *end, u16 acs_flags)
+{
+   struct pci_dev *pdev, *parent = start;
+
+   do {
+   pdev = parent;
+
+   if (!pci_acs_enabled(pdev, acs_flags))
+   return false;
+
+   if (pci_is_root_bus(pdev->bus))
+   return (end == NULL);

doesn't this mean that a caller can't pass the pdev of the root port?
I would think that is a valid c
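The question above can be checked with a stand-alone toy model of the walk. The toy_dev type and the loop tail below (which the quote truncates) are my reconstruction; parent == NULL stands in for pci_is_root_bus(), acs_ok for pci_acs_enabled():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy tree node standing in for struct pci_dev */
struct toy_dev {
    struct toy_dev *parent;  /* NULL means the device is on the root bus */
    bool acs_ok;             /* stand-in for pci_acs_enabled() */
};

/* Mirrors the do/while in pci_acs_path_enabled() */
static bool toy_acs_path_enabled(struct toy_dev *start, struct toy_dev *end)
{
    struct toy_dev *pdev, *parent = start;

    do {
        pdev = parent;
        if (!pdev->acs_ok)
            return false;
        if (!pdev->parent)        /* root bus: 'end' is never compared */
            return end == NULL;
        parent = pdev->parent;
    } while (pdev != end);

    return true;
}
```

With a tree root <- port <- ep, toy_acs_path_enabled(&ep, &root) returns false even when every hop has ACS enabled, because the root-bus test fires before the end comparison, which is exactly the concern with passing a root-bus device as 'end'.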

Re: [PATCH v2 09/13] vfio: x86 IOMMU implementation

2012-05-24 Thread Don Dutile

On 05/22/2012 01:05 AM, Alex Williamson wrote:

x86 is probably the wrong name for this VFIO IOMMU driver, but x86
is the primary target for it.  This driver supports a very simple
usage model using the existing IOMMU API.  The IOMMU is expected to
support the full host address space with no special IOVA windows,
number of mappings restrictions, or unique processor target options.

Signed-off-by: Alex Williamson
---

  Documentation/ioctl/ioctl-number.txt |2
  drivers/vfio/Kconfig |6
  drivers/vfio/Makefile|2
  drivers/vfio/vfio.c  |7
  drivers/vfio/vfio_iommu_x86.c|  743 ++
  include/linux/vfio.h |   52 ++
  6 files changed, 811 insertions(+), 1 deletions(-)
  create mode 100644 drivers/vfio/vfio_iommu_x86.c

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 111e30a..9d1694e 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,7 +88,7 @@ Code  Seq#(hex)   Include File  Comments
and kernel/power/user.c
  '8'   all SNP8023 advanced NIC card

-';'64-6F   linux/vfio.h
+';'64-72   linux/vfio.h
  '@'   00-0F   linux/radeonfb.hconflict!
  '@'   00-0F   drivers/video/aty/aty128fb.cconflict!
  'A'   00-1F   linux/apm_bios.hconflict!
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9acb1e7..bd88a30 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -1,6 +1,12 @@
+config VFIO_IOMMU_X86
+   tristate
   depends on VFIO && X86
+   default n
+
  menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
+   select VFIO_IOMMU_X86 if X86
help
  VFIO provides a framework for secure userspace device drivers.
  See Documentation/vfio.txt for more details.


So a future refactoring that uses some chunk of this support
on a non-x86 machine could be a lot of useless renaming.

Why not rename vfio_iommu_x86 to something like vfio_iommu_no_iova
and just make it conditionally compiled on X86 (as you've done above in 
Kconfig's)?
Then if another arch can use it, or refactors the file to use
some of it, and split x86 vs  into separate per-arch files,
or per-iova schemes, it's more descriptive and less disruptive?


diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7500a67..1f1abee 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1 +1,3 @@
  obj-$(CONFIG_VFIO) += vfio.o
+obj-$(CONFIG_VFIO_IOMMU_X86) += vfio_iommu_x86.o
+obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6558eef..89899a8 100644



Re: [PATCH v2 12/13] pci: Misc pci_reg additions

2012-05-24 Thread Don Dutile

On 05/22/2012 01:05 AM, Alex Williamson wrote:

Fill in many missing definitions and add sizeof fields for many
sections allowing for more extensive config parsing.

Signed-off-by: Alex Williamson
---


overall, i'm very glad to see defines instead of hardcoded numbers in the code, but...


  include/linux/pci_regs.h |  112 +-
  1 files changed, 100 insertions(+), 12 deletions(-)

diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 4b608f5..379be84 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -26,6 +26,7 @@
   * Under PCI, each device has 256 bytes of configuration address space,
   * of which the first 64 bytes are standardized as follows:
   */
+#define PCI_STD_HEADER_SIZEOF  64
  #define PCI_VENDOR_ID 0x00/* 16 bits */
  #define PCI_DEVICE_ID 0x02/* 16 bits */
  #define PCI_COMMAND   0x04/* 16 bits */
@@ -209,9 +210,12 @@
  #define  PCI_CAP_ID_SHPC  0x0C/* PCI Standard Hot-Plug Controller */
  #define  PCI_CAP_ID_SSVID 0x0D/* Bridge subsystem vendor/device ID */
  #define  PCI_CAP_ID_AGP3  0x0E/* AGP Target PCI-PCI bridge */
+#define  PCI_CAP_ID_SECDEV 0x0F/* Secure Device */
  #define  PCI_CAP_ID_EXP   0x10/* PCI Express */
  #define  PCI_CAP_ID_MSIX  0x11/* MSI-X */
+#define  PCI_CAP_ID_SATA   0x12/* SATA Data/Index Conf. */
  #define  PCI_CAP_ID_AF0x13/* PCI Advanced Features */
+#define  PCI_CAP_ID_MAX   PCI_CAP_ID_AF
  #define PCI_CAP_LIST_NEXT 1   /* Next capability in the list */
  #define PCI_CAP_FLAGS 2   /* Capability defined flags (16 bits) */
  #define PCI_CAP_SIZEOF4
@@ -276,6 +280,7 @@
  #define  PCI_VPD_ADDR_MASK0x7fff  /* Address mask */
  #define  PCI_VPD_ADDR_F   0x8000  /* Write 0, 1 indicates 
completion */
  #define PCI_VPD_DATA  4   /* 32-bits of data returned here */
+#define PCI_CAP_VPD_SIZEOF 8

  /* Slot Identification */

@@ -297,8 +302,10 @@
  #define PCI_MSI_ADDRESS_HI8   /* Upper 32 bits (if 
PCI_MSI_FLAGS_64BIT set) */
  #define PCI_MSI_DATA_32   8   /* 16 bits of data for 32-bit 
devices */
  #define PCI_MSI_MASK_32   12  /* Mask bits register for 
32-bit devices */
+#define PCI_MSI_PENDING_32 16  /* Pending intrs for 32-bit devices */
  #define PCI_MSI_DATA_64   12  /* 16 bits of data for 64-bit 
devices */
  #define PCI_MSI_MASK_64   16  /* Mask bits register for 
64-bit devices */
+#define PCI_MSI_PENDING_64 20  /* Pending intrs for 64-bit devices */

  /* MSI-X registers */
  #define PCI_MSIX_FLAGS2
@@ -308,6 +315,7 @@
  #define PCI_MSIX_TABLE4
  #define PCI_MSIX_PBA  8
  #define  PCI_MSIX_FLAGS_BIRMASK   (7 << 0)
+#define PCI_CAP_MSIX_SIZEOF 12  /* size of MSIX registers */

  /* MSI-X entry's format */
  #define PCI_MSIX_ENTRY_SIZE   16
@@ -338,6 +346,7 @@
  #define  PCI_AF_CTRL_FLR  0x01
  #define PCI_AF_STATUS 5
  #define  PCI_AF_STATUS_TP 0x01
+#define PCI_CAP_AF_SIZEOF  6   /* size of AF registers */

  /* PCI-X registers */

@@ -374,6 +383,9 @@
  #define  PCI_X_STATUS_SPL_ERR 0x2000  /* Rcvd Split Completion Error 
Msg */
  #define  PCI_X_STATUS_266MHZ  0x4000  /* 266 MHz capable */
  #define  PCI_X_STATUS_533MHZ  0x8000  /* 533 MHz capable */
+#define PCI_X_ECC_CSR  8   /* ECC control and status */
+#define PCI_CAP_PCIX_SIZEOF_V0 8   /* size of registers for Version 0 */
+#define PCI_CAP_PCIX_SIZEOF_V12 24  /* size for Version 1 & 2 */

ew!
unlikely that version 12 will ever exist, but why not:
#define PCI_CAP_PCIX_SIZEOF_V1  24
#define PCI_CAP_PCIX_SIZEOF_V2  PCI_CAP_PCIX_SIZEOF_V1




  /* PCI Bridge Subsystem ID registers */

@@ -462,6 +474,7 @@
  #define  PCI_EXP_LNKSTA_DLLLA 0x2000  /* Data Link Layer Link Active */
  #define  PCI_EXP_LNKSTA_LBMS  0x4000  /* Link Bandwidth Management Status */
  #define  PCI_EXP_LNKSTA_LABS  0x8000  /* Link Autonomous Bandwidth Status */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1 20  /* v1 endpoints end here */
  #define PCI_EXP_SLTCAP20  /* Slot Capabilities */
  #define  PCI_EXP_SLTCAP_ABP   0x0001 /* Attention Button Present */
  #define  PCI_EXP_SLTCAP_PCP   0x0002 /* Power Controller Present */
@@ -521,6 +534,7 @@
  #define  PCI_EXP_OBFF_MSGA_EN 0x2000  /* OBFF enable with Message type A */
  #define  PCI_EXP_OBFF_MSGB_EN 0x4000  /* OBFF enable with Message type B */
  #define  PCI_EXP_OBFF_WAKE_EN 0x6000  /* OBFF using WAKE# signaling */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 44  /* v2 endpoints end here */
  #define PCI_EXP_LNKCTL2   48  /* Link Control 2 */
  #define PCI_EXP_SLTCTL2   56  /* Slot Control 2 */

@@ -529,23 +543,43 @@
  #define PCI_EXT_C

Re: [PATCH v2 00/13] IOMMU Groups + VFIO

2012-05-24 Thread Don Dutile

On 05/22/2012 01:04 AM, Alex Williamson wrote:

Version 2 incorporating acks and feedback from v1.  The PCI DMA quirk
and ACS check are reworked, sysfs iommu groups ABI Documentation
added as well as numerous other fixes, including patches from Alexey
Kardashevskiy towards supporting POWER usage of VFIO and IOMMU groups.

This series can be found here on top of 3.4:

git://github.com/awilliam/linux-vfio.git iommu-group-vfio-20120521

The Qemu tree has also been updated to Qemu 1.1 and can be found here:

git://github.com/awilliam/qemu-vfio.git iommu-group-vfio

I'd really like to make a push to get this in for 3.5, so let's talk
about how to do that across iommu, pci, and new driver.  Joerg, are
you sufficiently happy with the IOMMU group concept and code?  We'll
also need David Woodhouse buyin on the intel-iommu changes in patches
3 & 6.  Who needs to approve VFIO as a new driver, GregKH?  Bjorn,
I'd be happy to send the PCI changes as a series for you, but I
wonder if it makes sense to collect acks for them if you approve and
bundle them in with the associated code that needs them so you're
not left with unused code.  Let me know which you prefer.  If there
are better ways to do it, please let me know.  Thanks,

Alex

---

ack to 1, 2, 4, 6, 8, 10 & 11.
provided some minor feedback on 3, 9 & 12.
still have to do a final review of the big stuff, 7 & 13.


Alex Williamson (13):
   vfio: Add PCI device driver
   pci: Misc pci_reg additions
   pci: Create common pcibios_err_to_errno
   pci: export pci_user functions for use by other drivers
   vfio: x86 IOMMU implementation
   vfio: Add documentation
   vfio: VFIO core
   iommu: Make use of DMA quirking and ACS enabled check for groups
   pci: Add ACS validation utility
   pci: Add PCI DMA source ID quirk
   iommu: IOMMU groups for VT-d and AMD-Vi
   iommu: IOMMU Groups
   driver core: Add iommu_group tracking to struct device


  .../ABI/testing/sysfs-kernel-iommu_groups  |   14
  Documentation/ioctl/ioctl-number.txt   |1
  Documentation/vfio.txt |  315 
  MAINTAINERS|8
  drivers/Kconfig|2
  drivers/Makefile   |1
  drivers/iommu/amd_iommu.c  |   67 +
  drivers/iommu/intel-iommu.c|   87 +
  drivers/iommu/iommu.c  |  578 +++-
  drivers/pci/access.c   |6
  drivers/pci/pci.c  |   76 +
  drivers/pci/pci.h  |7
  drivers/pci/quirks.c   |   69 +
  drivers/vfio/Kconfig   |   16
  drivers/vfio/Makefile  |3
  drivers/vfio/pci/Kconfig   |8
  drivers/vfio/pci/Makefile  |4
  drivers/vfio/pci/vfio_pci.c|  557 +++
  drivers/vfio/pci/vfio_pci_config.c | 1522 
  drivers/vfio/pci/vfio_pci_intrs.c  |  724 ++
  drivers/vfio/pci/vfio_pci_private.h|   91 +
  drivers/vfio/pci/vfio_pci_rdwr.c   |  269 
  drivers/vfio/vfio.c| 1413 +++
  drivers/vfio/vfio_iommu_x86.c  |  743 ++
  drivers/xen/xen-pciback/conf_space.c   |6
  include/linux/device.h |2
  include/linux/iommu.h  |  104 +
  include/linux/pci.h|   49 +
  include/linux/pci_regs.h   |  112 +
  include/linux/vfio.h   |  444 ++
  30 files changed, 7182 insertions(+), 116 deletions(-)
  create mode 100644 Documentation/ABI/testing/sysfs-kernel-iommu_groups
  create mode 100644 Documentation/vfio.txt
  create mode 100644 drivers/vfio/Kconfig
  create mode 100644 drivers/vfio/Makefile
  create mode 100644 drivers/vfio/pci/Kconfig
  create mode 100644 drivers/vfio/pci/Makefile
  create mode 100644 drivers/vfio/pci/vfio_pci.c
  create mode 100644 drivers/vfio/pci/vfio_pci_config.c
  create mode 100644 drivers/vfio/pci/vfio_pci_intrs.c
  create mode 100644 drivers/vfio/pci/vfio_pci_private.h
  create mode 100644 drivers/vfio/pci/vfio_pci_rdwr.c
  create mode 100644 drivers/vfio/vfio.c
  create mode 100644 drivers/vfio/vfio_iommu_x86.c
  create mode 100644 include/linux/vfio.h




Re: [PATCH v2 09/13] vfio: x86 IOMMU implementation

2012-05-25 Thread Don Dutile

On 05/24/2012 06:46 PM, Alex Williamson wrote:

On Thu, 2012-05-24 at 17:38 -0400, Don Dutile wrote:

On 05/22/2012 01:05 AM, Alex Williamson wrote:

x86 is probably the wrong name for this VFIO IOMMU driver, but x86
is the primary target for it.  This driver supports a very simple
usage model using the existing IOMMU API.  The IOMMU is expected to
support the full host address space with no special IOVA windows,
number of mappings restrictions, or unique processor target options.

Signed-off-by: Alex Williamson
---

   Documentation/ioctl/ioctl-number.txt |2
   drivers/vfio/Kconfig |6
   drivers/vfio/Makefile|2
   drivers/vfio/vfio.c  |7
   drivers/vfio/vfio_iommu_x86.c|  743 
++
   include/linux/vfio.h |   52 ++
   6 files changed, 811 insertions(+), 1 deletions(-)
   create mode 100644 drivers/vfio/vfio_iommu_x86.c

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 111e30a..9d1694e 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,7 +88,7 @@ Code  Seq#(hex)   Include File  Comments
and kernel/power/user.c
   '8'  all SNP8023 advanced NIC card
-';'64-6F   linux/vfio.h
+';'64-72   linux/vfio.h
   '@'  00-0F   linux/radeonfb.hconflict!
   '@'  00-0F   drivers/video/aty/aty128fb.cconflict!
   'A'  00-1F   linux/apm_bios.hconflict!
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 9acb1e7..bd88a30 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -1,6 +1,12 @@
+config VFIO_IOMMU_X86
+   tristate
   depends on VFIO && X86
+   default n
+
   menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
+   select VFIO_IOMMU_X86 if X86
help
  VFIO provides a framework for secure userspace device drivers.
  See Documentation/vfio.txt for more details.


So a future refactoring that uses some chunk of this support
on a non-x86 machine could be a lot of useless renaming.

Why not rename vfio_iommu_x86 to something like vfio_iommu_no_iova
and just make it conditionally compiled on X86 (as you've done above in 
Kconfig's)?
Then if another arch can use it, or refactors the file to use
some of it, and split x86 vs  into separate per-arch files,
or per-iova schemes, it's more descriptive and less disruptive?


Yep, the problem is how to concisely describe what we expect to support
here.  This file supports IOMMU API based usage of an IOMMU with
effectively no DMA window or mapping constraints, optimized for static
mapping of an address space.  What's a good name for that?  Maybe I
should follow the example of others and just call it a Type 1 IOMMU
implementation so the marketing material looks better!  ;-P  That may
honestly be better than calling it x86.  Thoughts?  Thanks,

Alex


I'll vote for 'type1' over 'x86'.
Add a comment in the file describing what a 'type1 IOMMU' is.
Then others can dupe the format for typeX.






Re: [PATCH v2 6/7] intel-iommu: Make use of DMA quirks and ACS checks in IOMMU groups

2012-05-31 Thread Don Dutile

On 05/30/2012 04:19 PM, Alex Williamson wrote:

Work around broken devices and adhere to ACS support when determining
IOMMU grouping.

Signed-off-by: Alex Williamson
---

  drivers/iommu/intel-iommu.c |   25 +
  1 file changed, 25 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 4a43452..ebf2b31 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4090,6 +4090,14 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
return 0;
  }

+static void swap_pci_ref(struct pci_dev **from, struct pci_dev *to)
+{
+   pci_dev_put(*from);
+   *from = to;
+}
+
+#define REQ_ACS_FLAGS  (PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF)
+
  static int intel_iommu_add_device(struct device *dev)
  {
struct pci_dev *pdev = to_pci_dev(dev);
@@ -4112,6 +4120,23 @@ static int intel_iommu_add_device(struct device *dev)
} else
dma_pdev = pci_dev_get(pdev);

+   swap_pci_ref(&dma_pdev, pci_get_dma_source(dma_pdev));
+
+   if (dma_pdev->multifunction &&
+   !pci_acs_enabled(dma_pdev, REQ_ACS_FLAGS))
+   swap_pci_ref(&dma_pdev,
+pci_get_slot(dma_pdev->bus,
+ PCI_DEVFN(PCI_SLOT(dma_pdev->devfn),
+ 0)));
+
+   while (!pci_is_root_bus(dma_pdev->bus)) {
+   if (pci_acs_path_enabled(dma_pdev->bus->self,
+NULL, REQ_ACS_FLAGS))
+   break;
+
+   swap_pci_ref(&dma_pdev, pci_dev_get(dma_pdev->bus->self));
+   }
+

I'm having deja-vu on this patch...
why not just make the above two chunks into two functions in
drivers/iommu/iommu.c (one exported for these two modules, and maybe
others someday, e.g., iommu_pdev_put()), which [intel-,amd-]iommu.c call?
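For reference, the swap_pci_ref() pattern the two drivers would share is just "drop the reference you hold, keep the one the callee handed back". A stand-alone toy (toy_ref and the helpers are stand-ins for struct pci_dev and pci_dev_put(), not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Toy refcounted object standing in for struct pci_dev */
struct toy_ref {
    int count;
};

static void toy_put(struct toy_ref *r)
{
    if (r)
        r->count--;
}

/* Mirrors swap_pci_ref(): release the reference held in *from and
 * take over 'to', whose reference the caller already owns (as with
 * the results of pci_get_slot()/pci_dev_get()). */
static void toy_swap_ref(struct toy_ref **from, struct toy_ref *to)
{
    toy_put(*from);
    *from = to;
}
```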

 

group = iommu_group_get(&dma_pdev->dev);
pci_dev_put(dma_pdev);
if (!group) {





Re: [PATCH 1/2] iommu: dmar: replace printks with appropriate pr_*()

2012-06-04 Thread Don Dutile

On 06/04/2012 06:15 PM, Joe Perches wrote:

On Mon, 2012-06-04 at 17:29 -0400, Donald Dutile wrote:

Replace printk(KERN_*  with pr_*() functions.


Please add
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
before any include and remove the embedded PREFIX
from each printk


diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c


[]


break;
}
pdev = pci_get_slot(bus, PCI_DEVFN(path->dev, path->fn));
if (!pdev) {
-   printk(KERN_WARNING PREFIX
-   "Device scope device [%04x:%02x:%02x.%02x] not found\n",
+   pr_warn(PREFIX "Device scope device"
+   "[%04x:%02x:%02x.%02x] not found\n",
segment, bus->number, path->dev, path->fn);


Please don't split any format string.  You removed
a space between the scope device and an open bracket.
It's OK for format strings to exceed 80 chars.
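The dropped space is easy to see in plain C, since adjacent string literals concatenate with nothing in between (the format strings here are shortened stand-ins for the ones in the patch):

```c
#include <assert.h>
#include <string.h>

/* The split version from the patch loses the space before '[' */
static const char *split_fmt =
    "Device scope device"
    "[%04x:%02x:%02x.%02x] not found\n";

/* What the single-line (over-80-column) string actually says */
static const char *whole_fmt =
    "Device scope device [%04x:%02x:%02x.%02x] not found\n";
```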



Joe,
Thanks for the feedback.  I'll incorporate the changes once
others have had a chance to review & give feedback as well.

- Don



Re: [PATCH 2/2] iommu: dmar -- reserve mmio space used by IOMMU

2012-06-04 Thread Don Dutile

On 06/04/2012 06:37 PM, David Woodhouse wrote:

On Mon, 2012-06-04 at 17:29 -0400, Donald Dutile wrote:

Intel-iommu initialization doesn't currently reserve the memory used
for the IOMMU registers. This can allow the pci resource allocator
to assign a device BAR to the same address as the IOMMU registers.
This can cause some not so nice side affects when the driver
ioremap's that region.


s/affect/effect/


ok.


And surely this can happen even when IOMMU support is compiled out of
the kernel. Shouldn't the BIOS be *telling* us that this region is
unavailable for PCI resource allocation (or anything else, for that
matter)?


good point


If the BIOS *doesn't* do that, then I believe this should be
WARN_TAINT_ONCE(…TAINT_FIRMWARE_WORKAROUND…) like other BIOS problems
that we have discovered.


well, one could argue it may be easier to claim the space as reserved in
the OS than to punch yet another hole in the available IO address space
in the ACPI tables.  Although I think the workaround more systems
implement is to stick the IOMMUs into an existing hole to avoid this problem.


And we should probably do it based on the actual chipset registers, not
the DMAR tables (which the BIOS has also been known to lie about).


but the DMAR tables are the source of all information wise and ... er, 
um... ;-)
yes, I've been on the receiving side of more bz's wrt bad DMAR tables than I
can count,
but...

How does the kernel probe for chipsets, then register with the chipsets
to find the programmed IOMMU BAR values?
-- I missed that class; I only have the Intel Virt Tech for Directed I/O
Architecture spec., and there the beginning of IOMMU setup is based on DMAR tables...
If you have more info/guidance, I'd appreciate it.

Seems like the patch would be easier to support, although it doesn't
solve the problem you mentioned above, unless the reservation code isn't
compiled out by INTEL-IOMMU (but something more general like !(x86 && PCI)).
the firmware taint message would be informative as to the quality of
the firmware, but my experience is nothing changes unless it's critical
to a system shipping.

IMO, if we can avoid a BIOS problem, we should.
The empirical data I've gathered so far in this space
(IOMMU, use by SRIOV VF devices), shows the BIOS has numerous
weaknesses, and this is yet another one.  The BIOS's are getting better,
but I've seen turtles run faster... ;-) .



Re: [PATCH 2/2] iommu: dmar -- reserve mmio space used by IOMMU

2012-06-04 Thread Don Dutile

On 06/04/2012 07:23 PM, David Woodhouse wrote:

On Mon, 2012-06-04 at 19:09 -0400, Don Dutile wrote:

If the BIOS *doesn't* do that, then I believe this should be
WARN_TAINT_ONCE(…TAINT_FIRMWARE_WORKAROUND…) like other BIOS problems
that we have discovered.


well, one could argue it may be easier to claim the space as reserved in
the OS than to punch yet another hole in the available IO address space
in the ACPI tables.


But how? It's got to work with operating systems that predate the IOMMU.
The registers *have* to be in a marked hole. If *not*, then we should
give a clear "YOUR BIOS IS BROKEN" output like all the similar
breakages, and do our best to work around it.

Working around it is fine; I'm not suggesting that we should WARN()
*instead* of working around it.


ok.


How does the kernel probe for chipsets, then register with the chipsets
to find the programmed IOMMU BAR values?
-- I missed that class; I only have the Intel Virt Tech for Directed I/O
Architecture spec., and there the beginning of IOMMU setup is based on DMAR tables...
If you have more info/guidance, I'd appreciate it.


Hm, I thought we'd already started doing some of that in order to
sanity-check the DMAR tables. The VTBAR registers are in PCI config
space. The quirk_ioat_snb_local_iommu() check is already looking at
them...


except that quirk is conditionally compiled in intel-iommu.c;
to do the check indep of INTEL-IOMMU CONFIG tag, it'd have to move into
pci/quirks.c. ... and how does it get triggered? ... a dmar table check?
(typical quirks kicked based on vid/did...)


I'm not quite sure which document they are documented in. Doing it based
on the DMAR table, as you have, is certainly a good start. But do it
with a bigger shouty WARN(TAINT_FIRMWARE_WORKAROUND), and do it when the
IOMMU code isn't compiled in.


Doing it for intel-iommu systems only while not being CONFIG dependent
is a bit challenging given how the code is compiled, and the expected/normal
code flow starting from a DMAR table.
of course, if the IOMMU just exposed itself as the first device on a PCI
bus, this would be trivial!
-- I really hate BIOS dependencies to get things right!


Seems like the patch would be easier to support, although it doesn't
solve the problem you mentioned above, unless the reservation code isn't
compiled out by INTEL-IOMMU (but something more general like !(x86 && PCI)).
the firmware taint message would be informative as to the quality of
the firmware, but my experience is nothing changes unless it's critical
to a system shipping.



   The BIOS's are getting better, but I've seen turtles run faster... ;-) .


Thankfully, there are now some modern Intel systems on which you can run
Coreboot. This should be a huge benefit — you should be able to build an
up-to-date Tianocore and deploy it as your Coreboot payload, rather than
having to put up with the crap that's on the system when you receive it.



except, most system (hw, os, applic) certifications are based on the vendor's
shipped BIOS, so Coreboot isn't a guarantee either.  Additionally, telling
a customer to replace their paid-for-BIOS for a build-your-own-coreboot bios
is a tough way to close a bz. ;-)

Fwd: [RFC] DMA mapping error check analysis

2012-09-13 Thread Don Dutile

For those on the iommu list that are not on
the devel-drivers or lkml list

Since failures/bugs in drivers using the DMA mapping API
result in IOMMU-detected failures (faults) or IOMMU resource leakage...



 Original Message 
Subject: [RFC] DMA mapping error check analysis
Date: Fri, 07 Sep 2012 09:53:20 -0600
From: Shuah Khan 
Reply-To: shuah.k...@hp.com
Organization: ISS-Linux
To: fujita.tomon...@lab.ntt.co.jp, a...@linux-foundation.org,
paul.gortma...@windriver.com, bhelg...@google.com, amw...@redhat.com,
joerg.roe...@amd.com, paul.gortma...@windriver.com, kubak...@wp.pl,
st...@rowland.harvard.edu, dan.carpen...@oracle.com,Konrad Rzeszutek Wilk 

CC: de...@linuxdriverproject.org, LKML ,
shuahk...@gmail.com

I analyzed all calls to dma_map_single() and dma_map_page() in the
kernel, to see if callers check for mapping errors, before using the
returned address.

The goal of this analysis is to find drivers that currently do not
check dma mapping errors, and fix them.

I documented the results of this analysis:

http://linuxdriverproject.org/mediawiki/index.php/DMA_Mapping_Error_Analysis

Please review and give me feedback on the analysis and the proposed
next steps.
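The pattern the analysis is looking for is small. A stand-alone toy model (the toy_* names are stand-ins for dma_map_single()/dma_mapping_error(), and the failure path is simulated rather than driven by a real IOMMU):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_dma_addr_t;
#define TOY_DMA_ERROR ((toy_dma_addr_t)-1)

/* Stand-in for dma_map_single(): hands back a bus address, or the
 * error sentinel when the mapping fails (failure simulated here). */
static toy_dma_addr_t toy_dma_map(void *cpu_addr, bool fail)
{
    if (fail)
        return TOY_DMA_ERROR;
    return (toy_dma_addr_t)(uintptr_t)cpu_addr;
}

/* Stand-in for dma_mapping_error(): the check many drivers skip.
 * Using the returned address without this check is what leads to
 * IOMMU faults or leaked IOMMU resources. */
static bool toy_dma_mapping_error(toy_dma_addr_t addr)
{
    return addr == TOY_DMA_ERROR;
}
```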

Thanks,
-- Shuah

___
devel mailing list
de...@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/devel


Re: [PATCH] intel-iommu: Default to non-coherent for domains unattached to iommus

2012-09-18 Thread Don Dutile

On 09/18/2012 07:59 AM, Joerg Roedel wrote:

On Wed, Sep 12, 2012 at 03:55:05PM -0400, Donald Dutile wrote:

This patch was posted back in Nov 2011:
   http://lists.linuxfoundation.org/pipermail/iommu/2011-November/003086.html

and due to discussion about the patch, it was never pulled in.
Although the thread discussed an alternate patch to
default to non-coherent if any IOMMU didn't support coherency,
this alternate method was never implemented, and this bug persists.

This patch has been in RHEL6 for quite some time,
and it wasn't noticed that it didn't get into linux upstream,
until a RH partner reported this error when running upstream kernels,
and noticed how it doesn't occur on RHEL6 kernels.
Applying this patch to an upstream kernel resolved this issue.


  domain_update_iommu_coherency() currently defaults to setting
  domains as coherent when the domain is not attached to any iommus.
  This allows for a window in domain_context_mapping_one() where such a
  domain can update context entries non-coherently, and only after
  update the domain capability to clear iommu_coherency.
  This can be seen using KVM device assignment on VT-d systems that
  do not support coherency in the ecap register.  When a device is
  added to a guest, a domain is created (iommu_coherency = 0), the
  device is attached, and ranges are mapped.  If we then hot unplug
  the device, the coherency is updated and set to the default (1)
  since no iommus are attached to the domain.  A subsequent attach
  of a device makes use of the same dmar domain (now marked coherent)
  updates context entries with coherency enabled, and only disables
  coherency as the last step in the process.
  To fix this, switch domain_update_iommu_coherency() to use the
  safer, non-coherent default for domains not attached to iommus.
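The fix boils down to the choice of default in the update function. A stand-alone toy model (toy_update_coherency() is my compression of domain_update_iommu_coherency() to its essence, not the kernel function itself):

```c
#include <assert.h>
#include <stdbool.h>

/* A domain is coherent only if every attached IOMMU is coherent;
 * with no IOMMUs attached (n == 0), default to non-coherent --
 * the safe choice this patch switches to. */
static bool toy_update_coherency(const bool *iommu_coherent, int n)
{
    if (n == 0)
        return false;    /* was effectively 'true' before the fix */

    for (int i = 0; i < n; i++)
        if (!iommu_coherent[i])
            return false;

    return true;
}
```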

Signed-off-by: Donald Dutile
cc: Alex Williamson


Hmm, who is the author? The patch looks the same as what Alex submitted
last year. I applied Alex' patch because it includes also the Acked-bys
and he seems to be the author anyway. Oh, and I added a stable-tag.


Joerg



Yes, Alex was the original author, thus the reason I cc'd him on the update.

And you can't apply Alex's original patch as-is, since the iommu_bmp
structure member changed from a ptr to an array.
A snippet of Alex's patch looked like:
+i = find_first_bit(&domain->iommu_bmp, g_num_of_iommus);

and the correct patch, with the change to iommu_bmp, is:
+i = find_first_bit(domain->iommu_bmp, g_num_of_iommus);

which is why I re-posted it instead of forwarding the original patch
and asking for inclusion.
Additionally, the above patch is what the customer tested and verified.

So, if you made the above adjustment to Alex's patch,
then the patch is ok.  If not, the above adjustment must be made.
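The reason the '&' must be dropped can be shown with a stand-alone sketch; old_domain, new_domain, and find_first_bit_stub are illustrative stand-ins for the real dmar_domain and the kernel's find_first_bit(), not the actual kernel types:

```c
#include <assert.h>

#define BITS_PER_ULONG (8 * (int)sizeof(unsigned long))

/* Old layout: the bitmap was a single scalar member, so callers had
 * to pass its address.  New layout: an array member, which already
 * decays to 'unsigned long *', so the '&' must be dropped. */
struct old_domain { unsigned long iommu_bmp; };     /* single word */
struct new_domain { unsigned long iommu_bmp[2]; };  /* array */

/* Minimal stand-in for the kernel's find_first_bit(): returns the
 * index of the first set bit, or nbits if none is set. */
static int find_first_bit_stub(const unsigned long *bmp, int nbits)
{
    int i;

    for (i = 0; i < nbits; i++)
        if (bmp[i / BITS_PER_ULONG] & (1UL << (i % BITS_PER_ULONG)))
            return i;
    return nbits;
}
```

With the array layout, passing `&domain->iommu_bmp` would hand the function a pointer of the wrong type (`unsigned long (*)[2]`), which is exactly the mismatch the re-posted patch avoids.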

- Don
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] Intel IOMMU patch to reprocess RMRR info

2012-09-18 Thread Don Dutile

On 09/18/2012 12:49 PM, Tom Mingarelli wrote:

When a 32bit PCI device is removed from the SI Domain, the RMRR information
for this device becomes invalid and needs to be reprocessed to avoid DMA
Read errors. These errors are evidenced by the Present bit being cleared in
the device's context entry. This patch fixes the problem by reprocessing
the RMRR info when the device is assigned to another domain. The Bus Master bit
is cleared during the move to another domain and during the reprocessing of
the RMRR info, so no DMA can take place at this time.

PATCH v1: https://lkml.org/lkml/2012/6/15/204

drivers/iommu/intel-iommu.c |   47 --
  1 files changed, 44 insertions(+), 3 deletions(-)

Signed-off-by: Thomas Mingarelli

diff -up ./drivers/iommu/intel-iommu.c.ORIG ./drivers/iommu/intel-iommu.c
--- ./drivers/iommu/intel-iommu.c.ORIG  2012-09-18 09:58:25.147976889 -0500
+++ ./drivers/iommu/intel-iommu.c   2012-09-18 10:39:43.286672765 -0500
@@ -2706,11 +2706,39 @@ static int iommu_dummy(struct pci_dev *p
return pdev->dev.archdata.iommu == DUMMY_DEVICE_DOMAIN_INFO;
  }

+static int reprocess_rmrr(struct device *dev)
+{
+   struct dmar_rmrr_unit *rmrr;
+   struct pci_dev *pdev;
+   int i, ret;
+
+   pdev = to_pci_dev(dev);
+
+	for_each_rmrr_units(rmrr) {
+		for (i = 0; i < rmrr->devices_cnt; i++) {
+			/*
+			 * Here we are just concerned with
+			 * finding the one device that was
+			 * removed from the si_domain and
+			 * re-evaluating its RMRR info.
+			 */
+			if (rmrr->devices[i] != pdev)
+				continue;
+			pr_info("IOMMU: Reprocess RMRR information for device %s.\n",
+				pci_name(pdev));
+			ret = iommu_prepare_rmrr_dev(rmrr, pdev);
+			if (ret)
+				pr_err("IOMMU: Reprocessing RMRR reserved region for device failed");
+		}
+	}
+	return 0;
+}
+
  /* Check if the pdev needs to go through non-identity map and unmap process.*/
  static int iommu_no_mapping(struct device *dev)
  {
struct pci_dev *pdev;
-   int found;
+   int found, current_bus_master;

	if (unlikely(dev->bus != &pci_bus_type))
return 1;
@@ -2731,9 +2759,22 @@ static int iommu_no_mapping(struct devic
 * 32 bit DMA is removed from si_domain and fall back
 * to non-identity mapping.
 */
-   domain_remove_one_dev_info(si_domain, pdev);
printk(KERN_INFO "32bit %s uses non-identity mapping\n",
-  pci_name(pdev));
+   pci_name(pdev));
+   /*
+* If a device gets this far we need to clear the Bus
+* Master bit before we start moving devices from domain
+* to domain. We will also reset the Bus Master bit
+* after reprocessing the RMRR info. However, we only
+* do both the clearing and setting if needed.
+*/
+   current_bus_master = pdev->is_busmaster;
+   if (current_bus_master)
+   pci_clear_master(pdev);
+   domain_remove_one_dev_info(si_domain, pdev);
+   reprocess_rmrr(dev);
+   if (current_bus_master)
+   pci_set_master(pdev);
return 0;
}
} else {


Appears to have the recommended changes from v1, so looks good wrt handling
devices w/RMRRs.


Re: [BUG 3.7-rc5] NULL pointer deref when using a pcie-pci bridged pci device and intel-iommu

2012-11-12 Thread Don Dutile

On 11/12/2012 04:26 AM, Doug Goldstein wrote:

On Sun, Nov 11, 2012 at 5:19 PM, Matthew Thode
  wrote:

System boots with vt-d disabled in bios. Otherwise I get the errors in
the attached log.  I can do whatever testing you need as this system is
not in production yet.  gonna paste the important part here.  Let me
know if you want anything else.

Please CC me directly as I am not subscribed to the LKML.


Trying to unpack rootfs image as initramfs...
Freeing initrd memory: 5124k freed
IOMMU 0 0xfbffe000: using Queued invalidation
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device :00:1d.0 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1d.1 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1d.2 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1d.7 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1a.0 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1a.1 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1a.2 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1a.7 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1d.0 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1d.1 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1d.2 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1d.7 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1a.0 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1a.1 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1a.2 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1a.7 [0xec000 - 0xe]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device :00:1f.0 [0x0 - 0xff]
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
BUG: unable to handle kernel NULL pointer dereference at 003c
IP: [] pci_get_dma_source+0xf/0x41
PGD 0
Oops:  [#1] SMP
Modules linked in:
CPU 7
Pid: 1, comm: swapper/0 Not tainted 3.7.0-rc5 #1 Penguin Computing
Relion 1751/X8DTU
RIP: 0010:[]  []
pci_get_dma_source+0xf/0x41
RSP: :8806264d1d88  EFLAGS: 00010282
RAX: 813bd3a8 RBX: 8806261d1000 RCX: e8221180
RDX: 818624f0 RSI: 88062635b0c0 RDI: 
RBP: 8806264d1d88 R08: 8806263d6000 R09: 
R10: 8806264d1ca8 R11: 0005 R12: 
R13: 8806261d1098 R14:  R15: 
FS:  () GS:88063f2e() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 003c CR3: 01c0b000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process swapper/0 (pid: 1, threadinfo 8806264d, task
8806264cf910)
Stack:
  8806264d1dc8 815d02c9  8806
  8806264d1dd8 81c64b00 8806261d1098 8806264d1df8
  8806264d1de8 815cd5a4 81c64b00 815cd56a
Call Trace:
  [] intel_iommu_add_device+0x95/0x167
  [] add_iommu_group+0x3a/0x41
  [] ? bus_set_iommu+0x44/0x44
  [] bus_for_each_dev+0x54/0x81
  [] bus_set_iommu+0x3d/0x44
  [] intel_iommu_init+0xae5/0xb5e
  [] ? free_initrd+0x9e/0x9e
  [] ? memblock_find_dma_reserve+0x13f/0x13f
  [] pci_iommu_init+0x16/0x41
  [] ? pci_proc_init+0x6b/0x6b
  [] do_one_initcall+0x7a/0x129
  [] kernel_init+0x139/0x2a2
  [] ? loglevel+0x31/0x31
  [] ? rest_init+0x6f/0x6f
  [] ret_from_fork+0x7c/0xb0
  [] ? rest_init+0x6f/0x6f
Code: ff c1 75 04 ff d0 eb 12 48 83 c2 10 48 8b 42 08 48 85 c0 75 d3 b8
e7 ff ff ff c9 c3 55 48 c7 c2 f0 24 86 81 48 89 e5 eb 24 8b 0a<66>  3b
4f 3c 74 05 66 ff c1 75 13 66 8b 4a 02 66 3b 4f 3e 74 05
RIP  [] pci_get_dma_source+0xf/0x41
  RSP
CR2: 003c
---[ end trace 5c5a2ceca067e0ec ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009

[ cut here ]
WARNING: at arch/x86/kernel/smp.c:123 native_smp_send_reschedule+0x25/0x51()
Hardware name: Relion 1751
Modules linked in:
Pid: 1, comm: swapper/0 Tainted: G  D  3.7.0-rc5 #1
Call Trace:
 [] warn_slowpath_common+0x80/0x98
  [] warn_slowpath_null+0x15/0x17
  [] native_smp_send_reschedule+0x25/0x51
  [] trigger_load_balance+0x1e8/0x214
  [] scheduler_tick+0xd8/0xe1
  [] update_process_times+0x62/0x73
  [] tick_sched_timer+0x7c/0x9b
  [] __run_hrtimer.clone.24+0x4e/0xc1
  [] hrtimer_interrupt+0xc7/0x1ac
  [] smp_apic_timer_interrupt+0x81/0x94
  [] apic_timer_interrupt+0x6a/0x70
 [] ? console_unlock+0x2c2/0x2ed
  [] ? panic+0x189/0x1c5
  [] ? panic+0xee/0x1c5
  [] do_exit+0x357/0x7b2
  [] oops_end+0xb2/0xba
  [] no_context+0x266/0x275
  [] __bad_area_nosemaphore+0x1bb/0x1db
  [] ? sysfs_addrm_finish+0x2f/0xa6
  [] bad_area_nosemapho

Re: [BUG 3.7-rc5] NULL pointer deref when using a pcie-pci bridged pci device and intel-iommu

2012-11-13 Thread Don Dutile

On 11/13/2012 10:38 AM, Alex Williamson wrote:

On Mon, 2012-11-12 at 15:05 -0600, Matthew Thode wrote:

On 11/12/2012 01:57 PM, Don Dutile wrote:

On 11/12/2012 04:26 AM, Doug Goldstein wrote:

On Sun, Nov 11, 2012 at 5:19 PM, Matthew Thode
   wrote:

System boots with vt-d disabled in bios. Otherwise I get the errors in
the attached log.  I can do whatever testing you need as this system is
not in production yet.  gonna paste the important part here.  Let me
know if you want anything else.

Please CC me directly as I am not subscribed to the LKML.


Trying to unpack rootfs image as initramfs...
Freeing initrd memory: 5124k freed
IOMMU 0 0xfbffe000: using Queued invalidation
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device :00:1d.0 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1d.1 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1d.2 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1d.7 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1a.0 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1a.1 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1a.2 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1a.7 [0xbf7ec000 -
0xbf7f]
IOMMU: Setting identity map for device :00:1d.0 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1d.1 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1d.2 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1d.7 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1a.0 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1a.1 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1a.2 [0xec000 - 0xe]
IOMMU: Setting identity map for device :00:1a.7 [0xec000 - 0xe]
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device :00:1f.0 [0x0 - 0xff]
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
BUG: unable to handle kernel NULL pointer dereference at
003c
IP: [] pci_get_dma_source+0xf/0x41
PGD 0
Oops:  [#1] SMP
Modules linked in:
CPU 7
Pid: 1, comm: swapper/0 Not tainted 3.7.0-rc5 #1 Penguin Computing
Relion 1751/X8DTU
RIP: 0010:[]  []
pci_get_dma_source+0xf/0x41
RSP: :8806264d1d88  EFLAGS: 00010282
RAX: 813bd3a8 RBX: 8806261d1000 RCX: e8221180
RDX: 818624f0 RSI: 88062635b0c0 RDI: 
RBP: 8806264d1d88 R08: 8806263d6000 R09: 
R10: 8806264d1ca8 R11: 0005 R12: 
R13: 8806261d1098 R14:  R15: 
FS:  () GS:88063f2e()
knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 003c CR3: 01c0b000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process swapper/0 (pid: 1, threadinfo 8806264d, task
8806264cf910)
Stack:
   8806264d1dc8 815d02c9  8806
   8806264d1dd8 81c64b00 8806261d1098 8806264d1df8
   8806264d1de8 815cd5a4 81c64b00 815cd56a
Call Trace:
   [] intel_iommu_add_device+0x95/0x167
   [] add_iommu_group+0x3a/0x41
   [] ? bus_set_iommu+0x44/0x44
   [] bus_for_each_dev+0x54/0x81
   [] bus_set_iommu+0x3d/0x44
   [] intel_iommu_init+0xae5/0xb5e
   [] ? free_initrd+0x9e/0x9e
   [] ? memblock_find_dma_reserve+0x13f/0x13f
   [] pci_iommu_init+0x16/0x41
   [] ? pci_proc_init+0x6b/0x6b
   [] do_one_initcall+0x7a/0x129
   [] kernel_init+0x139/0x2a2
   [] ? loglevel+0x31/0x31
   [] ? rest_init+0x6f/0x6f
   [] ret_from_fork+0x7c/0xb0
   [] ? rest_init+0x6f/0x6f
Code: ff c1 75 04 ff d0 eb 12 48 83 c2 10 48 8b 42 08 48 85 c0 75 d3 b8
e7 ff ff ff c9 c3 55 48 c7 c2 f0 24 86 81 48 89 e5 eb 24 8b 0a<66>   3b
4f 3c 74 05 66 ff c1 75 13 66 8b 4a 02 66 3b 4f 3e 74 05
RIP  [] pci_get_dma_source+0xf/0x41
   RSP
CR2: 003c
---[ end trace 5c5a2ceca067e0ec ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009

[ cut here ]
WARNING: at arch/x86/kernel/smp.c:123
native_smp_send_reschedule+0x25/0x51()
Hardware name: Relion 1751
Modules linked in:
Pid: 1, comm: swapper/0 Tainted: G  D  3.7.0-rc5 #1
Call Trace:
   [] warn_slowpath_common+0x80/0x98
   [] warn_slowpath_null+0x15/0x17
   [] native_smp_send_reschedule+0x25/0x51
   [] trigger_load_balance+0x1e8/0x214
   [] scheduler_tick+0xd8/0xe1
   [] update_process_times+0x62/0x73
   [] tick_sched_timer+0x7c/0x9b
   [] __run_hrtimer.clone.24+0x4e/0xc1
   [] hrtimer_interrupt+0xc7/0x1ac
   [] smp_apic_timer_interrupt+0x81/0x94
   [] apic_timer_interrupt+0x6a/0x70
   [] ? console_unlock+0x2c2/0x2ed
   [] ? panic+0x189/0x1c5
   [] ?

Re: [PATCH v4] intel-iommu: Prevent devices with RMRRs from being placed into SI Domain

2012-11-20 Thread Don Dutile

On 11/20/2012 02:43 PM, Tom Mingarelli wrote:

This patch is to prevent non-USB devices that have RMRRs associated with them from
being placed into the SI Domain during init. This fixes the issue where the RMRR info
for devices being placed in and out of the SI Domain gets lost.

Signed-off-by: Thomas Mingarelli
Tested-by: Shuah Khan
---
PATCH v1: https://lkml.org/lkml/2012/6/15/204
PATCH v2: https://lkml.org/lkml/2012/9/18/354
PATCH v3: https://lkml.org/lkml/2012/10/16/375

  drivers/iommu/intel-iommu.c |   31 +++
  1 files changed, 31 insertions(+), 0 deletions(-)


Thanks for the efforts in getting this completed.
Looks good to me.

Reviewed-by: Donald Dutile 


diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index d4a4cd4..8c064df 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2320,8 +2320,39 @@ static int domain_add_dev_info(struct dmar_domain 
*domain,
return 0;
  }

+static bool device_has_rmrr(struct pci_dev *dev)
+{
+   struct dmar_rmrr_unit *rmrr;
+   int i;
+
+   for_each_rmrr_units(rmrr) {
+		for (i = 0; i < rmrr->devices_cnt; i++) {
+   /*
+* Return TRUE if this RMRR contains the device that
+* is passed in.
+*/
+   if (rmrr->devices[i] == dev)
+   return true;
+   }
+   }
+   return false;
+}
+
  static int iommu_should_identity_map(struct pci_dev *pdev, int startup)
  {
+
+   /*
+* We want to prevent any device associated with an RMRR from
+* getting placed into the SI Domain. This is done because
+* problems exist when devices are moved in and out of domains
+* and their respective RMRR info is lost. We exempt USB devices
+* from this process due to their usage of RMRRs that are known
+* to not be needed after BIOS hand-off to OS.
+*/
+	if (device_has_rmrr(pdev) &&
+	    (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
+		return 0;
+
	if ((iommu_identity_mapping & IDENTMAP_AZALIA) && IS_AZALIA(pdev))
return 1;





Re: [PATCH 1/1] iommu: add a dma remap fault reason.

2013-03-07 Thread Don Dutile

cc-ing the upstream iommu-list

On 03/05/2013 09:43 PM, Li, Zhen-Hua wrote:

The DMA fault reason codes in Intel's document run from 1 to 0xD, but in
dmar.c I cannot find fault reason 0xD.

In this document:
Intel Virtualization Technology for Directed I/O Architecture Specification
http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf

Chapter 4. Support For Device-IOTLBs

Table 6. Unsuccessful Translated Requests

There is a fault reason for 0xD that is not listed in the kernel:
 Present context-entry used to process translation request
 specifies blocking of Translation Requests (Translation Type (T)
 field value not equal to 01b).

So I think 0xD should be added.

Signed-off-by: Li, Zhen-Hua
---
  drivers/iommu/dmar.c |1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index dc7e478..e5cdaf8 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1083,6 +1083,7 @@ static const char *dma_remap_fault_reasons[] =
"non-zero reserved fields in RTP",
"non-zero reserved fields in CTP",
"non-zero reserved fields in PTE",
+   "PCE for translation request specifies blocking",
  };

  static const char *irq_remap_fault_reasons[] =




Re: [PATCH 1/1] iommu: add a dma remap fault reason.

2013-03-07 Thread Don Dutile

On 03/07/2013 01:31 PM, Don Dutile wrote:

cc-ing the upstream iommu-list

On 03/05/2013 09:43 PM, Li, Zhen-Hua wrote:

The number of dma fault reasons in intel's document are from 1 to 0xD, but in 
dmar.c I cannot find fault reason 0xD.

In this document:
Intel Virtualization Technology for Directed I/O Architecture Specification
http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf

Chapter 4. Support For Device-IOTLBs

Table 6. Unsuccessful Translated Requests

There is fault reason for 0xD not listed in kernel:
Present context-entry used to process translation request
specifies blocking of Translation Requests (Translation Type (T)
field value not equal to 01b).

So I think 0xD should be added.

Signed-off-by: Li, Zhen-Hua
---
drivers/iommu/dmar.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index dc7e478..e5cdaf8 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1083,6 +1083,7 @@ static const char *dma_remap_fault_reasons[] =
"non-zero reserved fields in RTP",
"non-zero reserved fields in CTP",
"non-zero reserved fields in PTE",
+ "PCE for translation request specifies blocking",
};

static const char *irq_remap_fault_reasons[] =




Yes, the multiple tables (some short, some long), duplicating error codes,
and in this case putting them out of order, certainly helped this case!

btw -- Suresh is not at Intel any longer (email bounces)

So, the patch looks good to me. Although I don't know of any code that
actually sets a translation to 'block translation', for completeness
the rest of the code does range checking and sizing such that it does
the right thing.
The only other thing I can surmise from the dmar.c file is that if one
of these faults had occurred, an 'unknown error' would have been printed.
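The lookup the patch extends can be modeled as a plain table indexed by the fault reason code with an "Unknown" fallback; this is a simplified sketch of what dmar.c does (the real table has more entries, with the new "PCE ... blocking" string at index 0xD, and the real fallback also handles the interrupt-remapping range):

```c
#include <assert.h>
#include <string.h>

/* Abbreviated model of dma_remap_fault_reasons[] from dmar.c. */
static const char *dma_remap_fault_reasons[] = {
    "Software",
    "Present bit in root entry is clear",
    "Present bit in context entry is clear",
    /* ... intermediate entries elided in this sketch ... */
    "PCE for translation request specifies blocking",
};

/* Any code past the end of the table is reported as unknown; before
 * the patch, fault reason 0xD fell into this fallback. */
static const char *fault_reason_str(unsigned int reason)
{
    unsigned int n = sizeof(dma_remap_fault_reasons) /
                     sizeof(dma_remap_fault_reasons[0]);

    return reason < n ? dma_remap_fault_reasons[reason] : "Unknown";
}
```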

cheers.. Don





Re: [PATCH] iommu: add a function to find an iommu group by id

2013-03-25 Thread Don Dutile

On 03/25/2013 10:28 AM, Alex Williamson wrote:

On Mon, 2013-03-25 at 10:23 +1100, Alexey Kardashevskiy wrote:

As IOMMU groups are exposed to the user space by their numbers,
the user space can use them in various kernel APIs so the kernel
might need an API to find a group by its ID.

As an example, QEMU VFIO on PPC64 platform needs it to associate
a logical bus number (LIOBN) with a specific IOMMU group in order
to support in-kernel handling of DMA map/unmap requests.

The patch adds the iommu_group_get_by_id(id) function which performs
such search.


Subject: [PATCH v3]

v2 was the last one, where's the changelog for v3?

v2: fixed reference counting.


and changed function name...


Signed-off-by: Alexey Kardashevskiy
---


Acked-by: Alex Williamson


  drivers/iommu/iommu.c |   29 +
  include/linux/iommu.h |1 +
  2 files changed, 30 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1065a1a..0de83eb 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -204,6 +204,35 @@ again:
  }
  EXPORT_SYMBOL_GPL(iommu_group_alloc);

+struct iommu_group *iommu_group_get_by_id(int id)
+{
+   struct kobject *group_kobj;
+   struct iommu_group *group;
+   const char *name;
+
+   if (!iommu_group_kset)
+   return NULL;
+
+   name = kasprintf(GFP_KERNEL, "%d", id);
+   if (!name)
+   return NULL;
+
+   group_kobj = kset_find_obj(iommu_group_kset, name);
+   kfree(name);
+
+   if (!group_kobj)
+   return NULL;
+
+   group = container_of(group_kobj, struct iommu_group, kobj);
+   BUG_ON(group->id != id);
+
+   kobject_get(group->devices_kobj);
+   kobject_put(&group->kobj);
+
+   return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get_by_id);
+
  /**
   * iommu_group_get_iommudata - retrieve iommu_data registered for a group
   * @group: the group
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f3b99e1..00e5d7d 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -113,6 +113,7 @@ struct iommu_ops {
  extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
  extern bool iommu_present(struct bus_type *bus);
  extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
+extern struct iommu_group *iommu_group_get_by_id(int id);
  extern void iommu_domain_free(struct iommu_domain *domain);
  extern int iommu_attach_device(struct iommu_domain *domain,
   struct device *dev);








Re: [PATCH 1/5 v10] iommu/fsl: Make iova u64 in the iommu_iova_to_phys API.

2013-03-25 Thread Don Dutile

I agree that unsigned long was an improper choice for iovas,
but why aren't they dma_addr_t? ... an iova is a dma-addr, just
a 'virtual' one wrt the phys-addr.
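The representational concern is concrete: a 64G window needs 36-bit iovas, which a 32-bit unsigned long silently truncates. A small sketch, using uint32_t as a stand-in for a 32-bit build's unsigned long (an assumption for illustration, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* The last page of a 64G PAMU window: an iova just below 2^36. */
static uint64_t example_iova(void)
{
    return (64ULL << 30) - 4096;
}

/* What a 32-bit 'unsigned long' would keep of such an iova: the
 * high bits above bit 31 are silently discarded. */
static uint32_t truncate_to_32bit_ulong(uint64_t iova)
{
    return (uint32_t)iova;
}
```

A u64 (or a dma_addr_t with CONFIG_ARCH_DMA_ADDR_T_64BIT) holds the value intact, which is the point of the API change.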

On 03/22/2013 06:34 PM, Varun Sethi wrote:

This is required in case of PAMU, as it can support a window size of up
to 64G (even on 32bit).

Signed-off-by: Varun Sethi
---
- no change in v10.

  drivers/iommu/amd_iommu.c  |2 +-
  drivers/iommu/exynos-iommu.c   |2 +-
  drivers/iommu/intel-iommu.c|2 +-
  drivers/iommu/iommu.c  |3 +--
  drivers/iommu/msm_iommu.c  |2 +-
  drivers/iommu/omap-iommu.c |2 +-
  drivers/iommu/shmobile-iommu.c |2 +-
  drivers/iommu/tegra-gart.c |2 +-
  drivers/iommu/tegra-smmu.c |2 +-
  include/linux/iommu.h  |9 +++--
  10 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 98f555d..42f6a71 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3412,7 +3412,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
  }

  static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
- unsigned long iova)
+ u64 iova)
  {
struct protection_domain *domain = dom->priv;
unsigned long offset_mask;
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index 238a3ca..541e81b 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -1027,7 +1027,7 @@ done:
  }

  static phys_addr_t exynos_iommu_iova_to_phys(struct iommu_domain *domain,
- unsigned long iova)
+ u64 iova)
  {
struct exynos_iommu_domain *priv = domain->priv;
unsigned long *entry;
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 0099667..c9663ac 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4111,7 +4111,7 @@ static size_t intel_iommu_unmap(struct iommu_domain 
*domain,
  }

  static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
-   unsigned long iova)
+   u64 iova)
  {
struct dmar_domain *dmar_domain = domain->priv;
struct dma_pte *pte;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index b972d43..39106ec 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -706,8 +706,7 @@ void iommu_detach_group(struct iommu_domain *domain, struct 
iommu_group *group)
  }
  EXPORT_SYMBOL_GPL(iommu_detach_group);

-phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
-  unsigned long iova)
+phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, u64 iova)
  {
if (unlikely(domain->ops->iova_to_phys == NULL))
return 0;
diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
index 6a8870a..fcd14a3 100644
--- a/drivers/iommu/msm_iommu.c
+++ b/drivers/iommu/msm_iommu.c
@@ -554,7 +554,7 @@ fail:
  }

  static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
- unsigned long va)
+ u64 va)
  {
struct msm_priv *priv;
struct msm_iommu_drvdata *iommu_drvdata;
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index 6ac02fa..102ae56 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1219,7 +1219,7 @@ static void omap_iommu_domain_destroy(struct iommu_domain 
*domain)
  }

  static phys_addr_t omap_iommu_iova_to_phys(struct iommu_domain *domain,
- unsigned long da)
+ u64 da)
  {
struct omap_iommu_domain *omap_domain = domain->priv;
struct omap_iommu *oiommu = omap_domain->iommu_dev;
diff --git a/drivers/iommu/shmobile-iommu.c b/drivers/iommu/shmobile-iommu.c
index b6e8b57..9216802 100644
--- a/drivers/iommu/shmobile-iommu.c
+++ b/drivers/iommu/shmobile-iommu.c
@@ -296,7 +296,7 @@ done:
  }

  static phys_addr_t shmobile_iommu_iova_to_phys(struct iommu_domain *domain,
-  unsigned long iova)
+  u64 iova)
  {
struct shmobile_iommu_domain *sh_domain = domain->priv;
uint32_t l1entry = 0, l2entry = 0;
diff --git a/drivers/iommu/tegra-gart.c b/drivers/iommu/tegra-gart.c
index 8643757..17179c0 100644
--- a/drivers/iommu/tegra-gart.c
+++ b/drivers/iommu/tegra-gart.c
@@ -279,7 +279,7 @@ static size_t gart_iommu_unmap(struct iommu_domain *domain, 
unsigned long iova,
  }

  static phys_addr_t gart_iommu_iova_to_phys(struct iommu_domain *domain,
-  unsigned long iova)
+  u64 iova)
  {
   

Re: [PATCH 1/2 V2] iommu/amd: Add workaround for ERBT1312

2013-04-23 Thread Don Dutile

On 04/18/2013 12:28 PM, Joerg Roedel wrote:

On Thu, Apr 18, 2013 at 11:13:19AM -0500, Suravee Suthikulanit wrote:

This workaround is required for both event log and ppr log.  Your
patch is only taking care of the event log.


Right, thanks for the notice. Here is the updated patch.

 From cebe04596989c4b9001e2c1571c4fb219ea37b99 Mon Sep 17 00:00:00 2001
From: Joerg Roedel
Date: Thu, 18 Apr 2013 17:55:04 +0200
Subject: [PATCH] iommu/amd: Workaround for ERBT1312

Work around an IOMMU hardware bug where clearing the
EVT_INT or PPR_INT bit in the status register may race with
the hardware trying to set it again. If not handled, the
bit might not be cleared and we lose all future event or PPR
interrupts.

Reported-by: Suravee Suthikulpanit
Cc: sta...@vger.kernel.org
Signed-off-by: Joerg Roedel
---
  drivers/iommu/amd_iommu.c |   34 ++
  1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index f42793d..27792f8 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -700,14 +700,23 @@ retry:

  static void iommu_poll_events(struct amd_iommu *iommu)
  {
-   u32 head, tail;
+   u32 head, tail, status;
unsigned long flags;

-   /* enable event interrupts again */
-   writel(MMIO_STATUS_EVT_INT_MASK, iommu->mmio_base + MMIO_STATUS_OFFSET);
-
spin_lock_irqsave(&iommu->lock, flags);

+   /* enable event interrupts again */
+   do {
+   /*
+* Workaround for Erratum ERBT1312
+* Clearing the EVT_INT bit may race in the hardware, so read
+* it again and make sure it was really cleared
+*/
+   status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET);
+   writel(MMIO_STATUS_EVT_INT_MASK,
+  iommu->mmio_base + MMIO_STATUS_OFFSET);
+	} while (status & MMIO_STATUS_EVT_INT_MASK);
+
head = readl(iommu->mmio_base + MMIO_EVT_HEAD_OFFSET);
tail = readl(iommu->mmio_base + MMIO_EVT_TAIL_OFFSET);

@@ -744,16 +753,25 @@ static void iommu_handle_ppr_entry(struct amd_iommu 
*iommu, u64 *raw)
  static void iommu_poll_ppr_log(struct amd_iommu *iommu)
  {
unsigned long flags;
-   u32 head, tail;
+   u32 head, tail, status;

if (iommu->ppr_log == NULL)
return;

-   /* enable ppr interrupts again */
-   writel(MMIO_STATUS_PPR_INT_MASK, iommu->mmio_base + MMIO_STATUS_OFFSET);
-
spin_lock_irqsave(&iommu->lock, flags);

+   /* enable ppr interrupts again */
+   do {
+   /*
+* Workaround for Erratum ERBT1312
+* Clearing the PPR_INT bit may race in the hardware, so read
+* it again and make sure it was really cleared
+*/
+   status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET);
+   writel(MMIO_STATUS_PPR_INT_MASK,
+  iommu->mmio_base + MMIO_STATUS_OFFSET);
+	} while (status & MMIO_STATUS_PPR_INT_MASK);
+
head = readl(iommu->mmio_base + MMIO_PPR_HEAD_OFFSET);
tail = readl(iommu->mmio_base + MMIO_PPR_TAIL_OFFSET);


Given other threads on this mailing list (and I've seen crashes with the same problem)
where this type of logging during a flood of IOMMU errors will lock up the machine,
is there something that can be done to break the do-while loop after n iterations
have been executed, so the kernel can progress during a crash?
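One possible shape for such a bounded loop, sketched here against a fake status register; MAX_LOOPS, the race model, and the return convention are assumptions for illustration, not part of the posted patch:

```c
#include <assert.h>

#define EVT_INT_MASK 0x2u
#define MAX_LOOPS    16   /* assumed cap, not from the patch */

/* Fake MMIO status register: the "hardware" races and re-sets
 * EVT_INT a few times after each clear before it stays clear. */
static unsigned int status_reg = EVT_INT_MASK;
static int races_left = 3;

static unsigned int read_status(void) { return status_reg; }

static void clear_evt_int(void)
{
    status_reg &= ~EVT_INT_MASK;
    if (races_left-- > 0)
        status_reg |= EVT_INT_MASK;   /* hardware sets it again */
}

/* Bounded version of the ERBT1312 workaround loop: returns the
 * number of iterations used, or -1 if the cap was hit, so a fault
 * flood cannot wedge the CPU forever. */
static int ack_evt_int_bounded(void)
{
    unsigned int status;
    int i = 0;

    do {
        if (++i > MAX_LOOPS)
            return -1;
        status = read_status();
        clear_evt_int();
    } while (status & EVT_INT_MASK);
    return i;
}
```

As Joerg notes below, the real loop is already bounded in practice by the log overflowing, so a cap like this would only be belt-and-braces.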




Re: [PATCH 1/2 V2] iommu/amd: Add workaround for ERBT1312

2013-04-24 Thread Don Dutile

On 04/24/2013 06:46 AM, Joerg Roedel wrote:

On Tue, Apr 23, 2013 at 09:22:45AM -0400, Don Dutile wrote:

Given other threads on this mail list (and I've seen crashes with same problem)
where this type of logging during a flood of IOMMU errors will lock up the 
machine,
is there something that can be done to break the do-while loop after n 
iterations
have been exec'd, so the kernel can progress during a crash ?


In the case of an IOMMU error flood this loop will only run until the
event-log/ppr-log overflows. So it should not turn into an endless loop.


Joerg



Thanks for verification.



Re: RFC: vfio / iommu driver for hardware with no iommu

2013-04-24 Thread Don Dutile

On 04/23/2013 03:47 PM, Alex Williamson wrote:

On Tue, 2013-04-23 at 19:16 +, Yoder Stuart-B08248 wrote:



-Original Message-
From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Tuesday, April 23, 2013 11:56 AM
To: Yoder Stuart-B08248
Cc: Joerg Roedel; iommu@lists.linux-foundation.org
Subject: Re: RFC: vfio / iommu driver for hardware with no iommu

On Tue, 2013-04-23 at 16:13 +, Yoder Stuart-B08248 wrote:

Joerg/Alex,

We have embedded systems where we use QEMU/KVM and have
the requirement to do device assignment, but have no
iommu.  So we would like to get vfio-pci working on
systems like this.

We're aware of the obvious limitations-- no protection,
DMA'able memory must be physically contiguous and will
have no iova->phy translation.  But there are use cases
where all OSes involved are trusted and customers can
live with those limitations.   Virtualization is used
here not to sandbox untrusted code, but to consolidate
multiple OSes.

We would like to get your feedback on the rough idea.  There
are two parts-- iommu driver and vfio-pci.

1.  iommu driver

First, we still need device groups created because vfio
is based on that, so we envision a 'dummy' iommu
driver that implements only  the add/remove device
ops.  Something like:

 static struct iommu_ops fsl_none_ops = {
 .add_device = fsl_none_add_device,
 .remove_device  = fsl_none_remove_device,
 };

 int fsl_iommu_none_init()
 {
 int ret = 0;

 ret = iommu_init_mempool();
 if (ret)
 return ret;

 bus_set_iommu(&platform_bus_type, &fsl_none_ops);
 bus_set_iommu(&pci_bus_type, &fsl_none_ops);

 return ret;
 }

2.  vfio-pci

For vfio-pci, we would ideally like to keep user space mostly
unchanged.  User space will have to follow the semantics
of mapping only physically contiguous chunks...and iova
will equal phys.

So, we propose to implement a new vfio iommu type,
called VFIO_TYPE_NONE_IOMMU.  This implements
any needed vfio interfaces, but there are no calls
to the iommu layer...e.g. map_dma() is a noop.

Would like your feedback.
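A minimal model of what the proposed VFIO_TYPE_NONE_IOMMU map path would amount to; the function name and error convention are hypothetical, not from any posted patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical no-iommu backend: there is no translation hardware,
 * so "mapping" only makes sense when the caller already uses
 * iova == phys, and no state needs to be programmed or recorded. */
static int noiommu_map_dma(uint64_t iova, uint64_t phys, uint64_t size)
{
    (void)size;             /* nothing to program */
    if (iova != phys)
        return -EINVAL;     /* no iova->phys translation exists */
    return 0;               /* no-op "success" */
}
```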


My first thought is that this really detracts from vfio and iommu groups
being a secure interface, so somehow this needs to be clearly an
insecure mode that requires an opt-in and maybe taints the kernel.  Any
notion of unprivileged use needs to be blocked and it should test
CAP_COMPROMISE_KERNEL (or whatever it's called now) at critical access
points.  We might even have interfaces exported that would allow this to
be an out-of-tree driver (worth a check).

I would guess that you would probably want to do all the iommu group
setup from the vfio fake-iommu driver.  In other words, that driver both
creates the fake groups and provides the dummy iommu backend for vfio.
That would be a nice way to compartmentalize this as a
vfio-noiommu-special.


So you mean don't implement any of the iommu driver
ops at all and keep everything in the vfio layer?

Would you still have real iommu groups?...i.e.
$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
../../../../kernel/iommu_groups/26

...and that is created by vfio-noiommu-special?


I'm suggesting (but haven't checked whether it's possible) implementing the
iommu driver ops as part of the vfio iommu backend driver.  The primary
motivation for this would be to a) keep a fake iommu groups interface
out of the iommu proper (possibly containing it in an external driver)
and b) modularize it so we don't have fake iommu groups being created
by default.  It would have to populate the iommu groups sysfs interfaces
to be compatible with vfio.


Right now when the PCI and platform buses are probed,
the iommu driver add-device callback gets called and
that is where the per-device group gets created.  Are
you envisioning registering a callback for the PCI
bus to do this in vfio-noiommu-special?


Yes.  It's just as easy to walk all the devices rather than doing
callbacks, iirc the group code does this when you register.  In fact,
this noiommu interface may not want to add all devices, we may want to
be very selective and only add some.


Right.
Sounds like a no-iommu driver is needed to leave vfio unaffected,
and still leverage/use vfio for qemu's device assignment.
Just not sure how to 'taint' it as 'not secure' if a no-iommu driver is put in place.

btw -- qemu has the inherent assumption that pci cfg cycles are trapped,
   so assigned devices are 'remapped' from the system's B:D.F to the virt-machine's
   (virtualized) B:D.F of the assigned device.
   Are pci-cfg cycles trapped in the freescale qemu model?
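For reference, the B:D.F value carried in config cycles packs into 16 bits (bus in the high byte, 5-bit device and 3-bit function below it); the 'remapping' above is a rewrite of this value. A small sketch of the packing:

```c
#include <stdint.h>

/* PCI addresses a function by bus:device.function (B:D.F).  Config
 * cycles carry a 16-bit bus<<8 | dev<<3 | fn value; remapping a host
 * B:D.F to the guest's amounts to rewriting this value. */
static uint16_t bdf_pack(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

static void bdf_unpack(uint16_t bdf, uint8_t *bus, uint8_t *dev, uint8_t *fn)
{
    *bus = bdf >> 8;            /* high byte: bus number */
    *dev = (bdf >> 3) & 0x1f;   /* 5-bit device number */
    *fn  = bdf & 0x7;           /* 3-bit function number */
}
```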


Would map/unmap really be no-ops?  Seems like you still want to do page
pinning.


You're right, that was a bad example...most would be no ops though.


Also, you're using fsl in the example above, but would such a
driver have any platform dependency?


This wouldn't have to be fsl specific if we thought it was
potential

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-24 Thread Don Dutile

On 04/24/2013 12:58 AM, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
"pci=pcie_reset_devices" is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to the second kernel (kdump kernel). The devices are still working
and their DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke a hot reset on an endpoint, its upstream link needs to be reset.
reset_pcie_devices() is called from fs_initcall_sync; it finds each root
port/downstream port whose child is a PCIe endpoint and then resets the link
between them. If the endpoint is a VGA device, it is skipped because the
monitor blacks out if the VGA controller is reset.


Couple questions wrt VGA device:
(1) Many graphics devices are multi-function, one function being VGA;
    is the VGA always function 0, so this scan sees it first & doesn't
    do a reset on that PCIe link?  If the VGA is not function 0, won't
    this logic break (it will reset b/c function 0 is non-VGA graphics)?
(2) I'm hearing VGA will soon not be a required console; this logic
    assumes it is, and that's why it isn't blanked.
Q: Should the filter be based on a device having a device-class of display?
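For context, the filter in question looks at the 24-bit PCI class code (base class in bits 23:16, subclass in bits 15:8). A sketch of the distinction between "display class" and "VGA" using the standard class-code layout:

```c
#include <stdint.h>

/* The 24-bit PCI class code: base class 0x03 is "display controller";
 * subclass 0x00 within it is VGA-compatible.  Filtering on the base
 * class (as the patch does) is broader than filtering on VGA alone. */
#define PCI_BASE_CLASS_DISPLAY 0x03
#define PCI_CLASS_DISPLAY_VGA  0x0300   /* base class + subclass */

static int is_display(uint32_t class_code)
{
    return (class_code >> 16) == PCI_BASE_CLASS_DISPLAY;
}

static int is_vga(uint32_t class_code)
{
    return (class_code >> 8) == PCI_CLASS_DISPLAY_VGA;
}
```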


Actually this is v8 patch but quite different from v7 and it's been so
long since previous post, so I start over again.

Thanks for this re-start.  I need to continue reviewing the rest.

Q: Why not force the IOMMU off when re-booting into a kexec kernel to perform a crash
   dump?  After the crash dump, the system reboots to the previous (iommu=on) setting.
   That logic, along w/your previous patch to disable the IOMMU if iommu=off
   is set, would remove this (relatively slow) PCI init sequencing?


Previous post:
[PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
https://lkml.org/lkml/2012/11/26/814

Signed-off-by: Takao Indoh
---
  Documentation/kernel-parameters.txt |2 +
  drivers/pci/pci.c   |  103 +++
  2 files changed, 105 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4609e81..2a31ade 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
any pair of devices, possibly at the cost of
reduced performance.  This also guarantees
that hot-added devices will work.
+   pcie_reset_devices  Reset PCIe endpoint on boot by hot
+   reset
cbiosize=nn[KMG]The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b099e00..42385c9 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
  }
  EXPORT_SYMBOL(pci_fixup_cardbus);

+/*
+ * Return true if dev is PCIe root port or downstream port whose child is PCIe
+ * endpoint except VGA device.
+ */
+static int __init need_reset(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!pci_is_pcie(dev) || !dev->subordinate ||
+   list_empty(&dev->subordinate->devices) ||
+   ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
+(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+   return 0;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child, &subordinate->devices, bus_list) {
+   if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
+   (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
+   ((child->class >> 16) == PCI_BASE_CLASS_DISPLAY))
+   /* Don't reset switch, bridge, VGA device */
+   return 0;
+   }
+
+   return 1;
+}
+
+static void __init save_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child, &subordinate->devices, bus_list) {
+   

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-25 Thread Don Dutile
On 04/25/2013 01:11 AM, Takao Indoh wrote:
> (2013/04/25 4:59), Don Dutile wrote:
>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>
>>> Problem:
>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>> occur in second kernel and it causes problems like driver error or PCI
>>> SERR, at last kdump fails. This problem is caused as follows.
>>> 1) Devices are working on first kernel.
>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>  and its DMA continues during this switch.
>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>  dma-remapping errors.
>>>
>>> Solution:
>>> All DMA transactions have to be stopped before iommu is initialized. By
>>> this patch devices are reset and in-flight DMA is stopped before
>>> pci_iommu_init.
>>>
>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>> between them. If the endpoint is VGA device, it is skipped because the
>>> monitor blacks out if VGA controller is reset.
>>>
>> Couple questions wrt VGA device:
>> (1) Many graphics devices are multi-function, one function being VGA;
>>   is the VGA always function 0, so this scan sees it first & doesn't
>>   do a reset on that PCIe link?  if the VGA is not function 0, won't
>>   this logic break (will reset b/c function 0 is non-VGA graphics) ?
> 
> VGA is not reset irrespective of its function number. The logic of this
> patch is:
> 
> for_each_pci_dev(dev) {
>  if (dev is not PCIe)
> continue;
>  if (dev is not root port/downstream port) ---(1)
> continue;
>  list_for_each_entry(child,&dev->subordinate->devices, bus_list) {
>  if (child is upstream port or bridge or VGA) ---(2)
>  continue;
>  }
>  do_reset_its_child(dev);
> }
> 
> Therefore VGA itself is skipped by (1), and upstream device(root port or
> downstream port) of VGA is also skipped by (2).
> 
> 
>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>   assumes it is, and why it isn't blanked.
>>   Q: Should the filter be based on a device having a device-class of 
>> display ?
> 
> I want to avoid the situation that user's monitor blacks out and user
> cannot know what's going on. That's reason why I introduced the logic to
> skip VGA. As far as I tested the logic based on device-class works well,
sorry, I read your description, which said VGA, but you are filtering on display class,
which includes non-VGA as well. So, all set ... but large (x16) non-VGA display devices
are probably among the most aggressive DMA engines on a system, and will grow as
asymmetric processing using GPUs gets architected in a device-agnostic manner.
So, this may work well for servers, which are the primary consumer/user of this feature,
and they typically have built-in graphics that are generally used in simple VGA mode,
so this may be sufficient for now.
 

> but I would appreciate it if there are better ways.
> 
You probably don't want to hear it but
a) only turn off the cmd-reg master enable bit, or
b) only do a reset based on a list of devices known not to
   obey their cmd-reg master enable bit, and only reset those devices.
But, given the testing you've done so far, and since this is an optional
(command-line enabled) feature, let's start here.
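Option (a) amounts to clearing the Bus Master Enable bit (bit 2) of the PCI command register, which tells a well-behaved device to stop issuing memory read/write (DMA) transactions. A minimal sketch of that bit operation:

```c
#include <stdint.h>

/* Bus Master Enable is bit 2 of the PCI command register (the same
 * value as Linux's PCI_COMMAND_MASTER).  Clearing it asks the device
 * to stop mastering the bus, i.e. stop DMA. */
#define PCI_COMMAND_MASTER 0x4

static uint16_t clear_bus_master(uint16_t cmd)
{
    return cmd & ~PCI_COMMAND_MASTER;   /* leave all other bits intact */
}
```

The thread's caveat is exactly that some devices do not honor this bit, which is why option (b) proposes a reset list for those.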

>>
>>> Actually this is v8 patch but quite different from v7 and it's been so
>>> long since previous post, so I start over again.
>> Thanks for this re-start.  I need to continue reviewing the rest.
> 
> Thank you for your review!
> 
>>
>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
>>  dump?  After the crash dump, the system is rebooting to previous 
>> (iommu=on) setting.
>>  That logic, along w/your previous patch to disable the IOMMU if 
>> iommu=off
>>  is set, would remove this (relatively slow) PCI init sequencing ?
> 
> To force iommu off, all ongoing DMA have to be stopped before that since
> they are accessing t

Re: RFC: vfio / iommu driver for hardware with no iommu

2013-04-25 Thread Don Dutile

On 04/24/2013 10:49 PM, Sethi Varun-B16395 wrote:




-Original Message-
From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
boun...@lists.linux-foundation.org] On Behalf Of Don Dutile
Sent: Thursday, April 25, 2013 1:11 AM
To: Alex Williamson
Cc: Yoder Stuart-B08248; iommu@lists.linux-foundation.org
Subject: Re: RFC: vfio / iommu driver for hardware with no iommu

On 04/23/2013 03:47 PM, Alex Williamson wrote:

On Tue, 2013-04-23 at 19:16 +, Yoder Stuart-B08248 wrote:



-Original Message-
From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Tuesday, April 23, 2013 11:56 AM
To: Yoder Stuart-B08248
Cc: Joerg Roedel; iommu@lists.linux-foundation.org
Subject: Re: RFC: vfio / iommu driver for hardware with no iommu

On Tue, 2013-04-23 at 16:13 +, Yoder Stuart-B08248 wrote:

Joerg/Alex,

We have embedded systems where we use QEMU/KVM and have the
requirement to do device assignment, but have no iommu.  So we
would like to get vfio-pci working on systems like this.

We're aware of the obvious limitations-- no protection, DMA'able
memory must be physically contiguous and will have no iova->phy
translation.  But there are use cases where all OSes involved are
trusted and customers can
live with those limitations.   Virtualization is used
here not to sandbox untrusted code, but to consolidate multiple
OSes.

We would like to get your feedback on the rough idea.  There are
two parts-- iommu driver and vfio-pci.

1.  iommu driver

First, we still need device groups created because vfio is based on
that, so we envision a 'dummy' iommu driver that implements only
the add/remove device ops.  Something like:

  static struct iommu_ops fsl_none_ops = {
  .add_device = fsl_none_add_device,
  .remove_device  = fsl_none_remove_device,
  };

  int fsl_iommu_none_init()
  {
  int ret = 0;

  ret = iommu_init_mempool();
  if (ret)
  return ret;

  bus_set_iommu(&platform_bus_type, &fsl_none_ops);
  bus_set_iommu(&pci_bus_type, &fsl_none_ops);

  return ret;
  }

2.  vfio-pci

For vfio-pci, we would ideally like to keep user space mostly
unchanged.  User space will have to follow the semantics of mapping
only physically contiguous chunks...and iova will equal phys.

So, we propose to implement a new vfio iommu type, called
VFIO_TYPE_NONE_IOMMU.  This implements any needed vfio interfaces,
but there are no calls to the iommu layer...e.g. map_dma() is a
noop.

Would like your feedback.


My first thought is that this really detracts from vfio and iommu
groups being a secure interface, so somehow this needs to be clearly
an insecure mode that requires an opt-in and maybe taints the
kernel.  Any notion of unprivileged use needs to be blocked and it
should test CAP_COMPROMISE_KERNEL (or whatever it's called now) at
critical access points.  We might even have interfaces exported that
would allow this to be an out-of-tree driver (worth a check).

I would guess that you would probably want to do all the iommu group
setup from the vfio fake-iommu driver.  In other words, that driver
both creates the fake groups and provides the dummy iommu backend for

vfio.

That would be a nice way to compartmentalize this as a
vfio-noiommu-special.


So you mean don't implement any of the iommu driver ops at all and
keep everything in the vfio layer?

Would you still have real iommu groups?...i.e.
$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
../../../../kernel/iommu_groups/26

...and that is created by vfio-noiommu-special?


I'm suggesting (but haven't checked if it's possible) to implement
the iommu driver ops as part of the vfio iommu backend driver.  The
primary motivation for this would be to a) keep a fake iommu groups
interface out of the iommu proper (possibly containing it in an
external driver) and b) modularizing it so we don't have fake iommu
groups being created by default.  It would have to populate the iommu
groups sysfs interfaces to be compatible with vfio.


Right now when the PCI and platform buses are probed, the iommu
driver add-device callback gets called and that is where the
per-device group gets created.  Are you envisioning registering a
callback for the PCI bus to do this in vfio-noiommu-special?


Yes.  It's just as easy to walk all the devices rather than doing
callbacks, iirc the group code does this when you register.  In fact,
this noiommu interface may not want to add all devices, we may want to
be very selective and only add some.


Right.
Sounds like a no-iommu driver is needed to leave vfio unaffected, and
still leverage/use vfio for qemu's device assignment.
Just not sure how to 'taint' it as 'not secure' if no-iommu driver put in
place.

btw -- qemu has the inherent assumption that pci cfg cycles are trapped,
 s

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-26 Thread Don Dutile
On 04/25/2013 11:10 PM, Takao Indoh wrote:
> (2013/04/26 3:01), Don Dutile wrote:
>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>> (2013/04/25 4:59), Don Dutile wrote:
>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>
>>>>> Problem:
>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>> 1) Devices are working on first kernel.
>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>and its DMA continues during this switch.
>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>>>dma-remapping errors.
>>>>>
>>>>> Solution:
>>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>>> this patch devices are reset and in-flight DMA is stopped before
>>>>> pci_iommu_init.
>>>>>
>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>>> between them. If the endpoint is VGA device, it is skipped because the
>>>>> monitor blacks out if VGA controller is reset.
>>>>>
>>>> Couple questions wrt VGA device:
>>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>> is the VGA always function 0, so this scan sees it first&   doesn't
>>>> do a reset on that PCIe link?  if the VGA is not function 0, won't
>>>> this logic break (will reset b/c function 0 is non-VGA graphics) ?
>>>
>>> VGA is not reset irrespective of its function number. The logic of this
>>> patch is:
>>>
>>> for_each_pci_dev(dev) {
>>>if (dev is not PCIe)
>>>   continue;
>>>if (dev is not root port/downstream port) ---(1)
>>>   continue;
>>>list_for_each_entry(child,&dev->subordinate->devices, bus_list) {
>>>if (child is upstream port or bridge or VGA) ---(2)
>>>continue;
>>>}
>>>do_reset_its_child(dev);
>>> }
>>>
>>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>>> downstream port) of VGA is also skipped by (2).
>>>
>>>
>>>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>>> assumes it is, and why it isn't blanked.
>>>> Q: Should the filter be based on a device having a device-class of 
>>>> display ?
>>>
>>> I want to avoid the situation that user's monitor blacks out and user
>>> cannot know what's going on. That's reason why I introduced the logic to
>>> skip VGA. As far as I tested the logic based on device-class works well,
>> sorry, I read your description, which said VGA, but you are filtering on
>> display class,
>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA 
>> display devices
>> are probably one of the most aggressive DMA engines on a system and will 
>> grow as
>> asymmetric processing using GPUs gets architected into a device-agnostic 
>> manner.
>> So, this may work well for servers, which is the primary consumer/user of 
>> this feature,
>> and they typically have built-in graphics that are generally used in simple 
>> VGA mode,
>> so this may be sufficient for now.
> 
> Ok, understood.
> 
> 
>>
>>> but I would appreciate it if there are better ways.
>>>
>> You probably don't want to hear it but
>> a) only turn off cmd-reg master enable bit
>> b) only do reset based on a list of devices known not to
>>  obey their cmd-reg master enable bit, and only do reset to those 
>> devices.
>> But, given the testing you'v

Re: RFC: IOMMU/AMD: Error Handling

2013-04-29 Thread Don Dutile

On 04/29/2013 03:45 PM, Suravee Suthikulanit wrote:

Joerg,

We are in the process of implementing AMD IOMMU error handling, and I would 
like some comments from you and the community.

Currently, the AMD IOMMU driver only reports events from the event log in the 
dmesg, and does not try to handle them in case of errors. AMD IOMMU errors can 
be categorized as device-specific errors and IOMMU errors.

1. For IOMMU errors such as:
- DEV_TAB_HARDWARE_ERROR
- PAGE_TAB_ERROR
- COMMAND_HARDWARE_ERROR
If the error is detected during IOMMU initialization, we could disable the IOMMU
and proceed. If the error occurs after the IOMMU is initialized, we won't be able
to recover from it, and might need to panic.

2. For device-specific errors such as:
- ILLEGAL_DEV_TABLE_ENTRY
- IO_PAGE_FAULT
- INVALID_DEVICE_REQUEST
We think the AMD IOMMU driver should try to isolate the device. This involves
blocking device transactions at the IOMMU DTE and trying to disable the device (e.g.
calling the remove(struct pci_dev *pdev) interface generally provided by device
drivers). This could prevent the device from continuing to fail and reduce the risk of
system instability.


disabling the device is not an option.
We've seen mis-configured ACPI tables generate storms
of invalid DTE messages after iommu setup but before they are cleared up when
the OS driver is started & resets the device. The original storm is from bios-use
of the IOMMU with a device.
I'd recommend creating a filter that prevents further logging from a device
for 5 mins at a time if a storm of DTE-related errors is seen.
By definition, the DMA is blocked from corrupting/changing memory, so isolation
has been established;
keeping the failure log from consuming the system is the needed fix.
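A minimal sketch of such a per-device suppression filter (a hypothetical userspace model, not driver code; time is passed in explicitly so the policy is easy to test):

```c
#include <stdint.h>

/* Hypothetical per-device fault-log filter: once a fault from a device
 * is logged, suppress further messages from that device for a fixed
 * window (5 minutes, per the suggestion above). */
#define SUPPRESS_WINDOW_SECS (5 * 60)

struct fault_filter {
    uint64_t last_logged;   /* seconds; 0 means nothing logged yet */
};

/* Returns 1 if this fault should be logged, 0 if suppressed. */
static int fault_should_log(struct fault_filter *f, uint64_t now)
{
    if (f->last_logged && now - f->last_logged < SUPPRESS_WINDOW_SECS)
        return 0;           /* inside the window: drop the message */
    f->last_logged = now;   /* log it and restart the window */
    return 1;
}
```

A real implementation would keep one such state per offending requester ID and would still let the hardware block the DMA itself; only the logging is throttled.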


3. In case of a posted memory write transaction, the device driver might not be aware
that the transaction has failed and been blocked at the IOMMU. If there is no HW IOMMU,
I believe this is handled by the PCI error handling code. If the IOMMU hardware
reports such a case, could this potentially leverage the Linux IOMMU fault
handling interface, iommu_set_fault_handler() and report_iommu_fault(), to
communicate with the device driver or PCI driver?


Wondering if you could use an AER-like callback mechanism so a driver can be
invoked when an IOMMU error occurs,
so the device driver can quiesce or reset the device if it deems the error transient.



Any feedback or comments are appreciated.

Thank you,
Suravee




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu




Re: RFC: IOMMU/AMD: Error Handling

2013-04-29 Thread Don Dutile

On 04/29/2013 04:34 PM, Duran, Leo wrote:

I'm wondering if resetting the IOMMU at init-time (once) would clear any BIOS 
induced noise.
Leo


Well, depends what you mean by 'reset'
(a) setting it up for OS use is effectively a reset, but doesn't quiesce a device
    doing dma reads of a (bios-setup) queue.  then the noisy messages begin
(b) disable the iommu, and then the dma just occurs... and is bad for writes, potentially.

Similar issue is being reported & worked for kdump, where devices are still
doing DMA while the system is trying to 'reset' into the kexec'd kernel and
take a crash dump.

Solution: stop devices from doing dma... but some you _want_ enabled throughout...
  like keyboard & mouse via the usb controller, so you get to pick the os from
  grub...  not so for kexec...

so, again, for isolation faults let the hw do its job -- isolate
and throttle/silence the fault messages on a per-device, time-duration heuristic
so the system can get through boot-up, where enough OS is init'd (drivers started)
to stop the temporary noise.


-Original Message-
From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
boun...@lists.linux-foundation.org] On Behalf Of Don Dutile
Sent: Monday, April 29, 2013 3:10 PM
To: Suthikulpanit, Suravee
Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org
Subject: Re: RFC: IOMMU/AMD: Error Handling

On 04/29/2013 03:45 PM, Suravee Suthikulanit wrote:

Joerg,

We are in the process of implementing AMD IOMMU error handling, and I

would like some comments from you and the community.


Currently, the AMD IOMMU driver only reports events from the event log

in the dmesg, and does not try to handle them in case of errors. AMD
IOMMU errors can be categorized as device-specific errors and IOMMU
errors.


1. For IOMMU errors such as:
- DEV_TAB_HARDWARE_ERROR
- PAGE_TAB_ERROR
- COMMAND_HARDWARE_ERROR
If the error is detected during IOMMU initialization, we could disable

IOMMU and proceed. If the error occurs after IOMMU is initialized, we won't
be able to recover from this, and might need to result in panic.


2. For device-specific errors such as:
- ILLEGAL_DEV_TABLE_ENTRY
- IO_PAGE_FAULT
- INVALID_DEVICE_REQUEST
We think the AMD IOMMU driver should try to isolate the device. This

involves blocking device transactions at IOMMU DTE and tries to disable the
device (e.g. calling the remove(struct pci_dev *pdev) interface generally
provides by device drivers). This could prevents the device from continuing
to fail and to risk of system instability.



disabling the device is not an option.
We've seen mis-configured ACPI tables generate storms of invalid DTE
messages after iommu setup but before they are cleared up when the OS
driver is started & resets the device. The original storm is from bios-use of
the IOMMU with a device.
I'd recommend creating a filter that prevents further logging from a device
for 5 mins at a time if a storm of DTE-related errors are seen.
by definition, the DMA is blocked from corrupting/changing memory, so
isolation has been established; keeping the failure log from consuming the
system is the needed fix.


3. In case of posted memory write transaction, device driver might not be

aware that the transaction has failed and blocked at IOMMU. If there is no
HW IOMMU, I believe this is handled by PCI error handling code. If the
IOMMU hardware reports such a case, could this potentially leverage the
Linux IOMMU fault handling interface, iommu_set_fault_handler() and
report_iommu_fault(), to communicate to device driver or PCI driver?



Wondering if you could use AER-like callback mechanism so a driver can be
invoked when IOMMU error occurs, so the device driver can quiesce or reset
the device if it deems it transient.



Any feedback or comments are appreciated.

Thank you,
Suravee




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu









Re: RFC: IOMMU/AMD: Error Handling

2013-04-30 Thread Don Dutile

On 04/30/2013 10:49 AM, Suravee Suthikulanit wrote:

On 4/29/2013 3:10 PM, Don Dutile wrote:

On 04/29/2013 03:45 PM, Suravee Suthikulanit wrote:

Joerg,

We are in the process of implementing AMD IOMMU error handling, and I would 
like some comments from you and the community.

Currently, the AMD IOMMU driver only reports events from the event log in the 
dmesg, and does not try to handle them in case of errors. AMD IOMMU errors can 
be categorized as device-specific errors and IOMMU errors.

1. For IOMMU errors such as:
- DEV_TAB_HARDWARE_ERROR
- PAGE_TAB_ERROR
- COMMAND_HARDWARE_ERROR
If the error is detected during IOMMU initialization, we could disable IOMMU 
and proceed. If the error occurs after IOMMU is initialized, we won't be able 
to recover from this, and might need to result in panic.

2. For device-specific errors such as:
- ILLEGAL_DEV_TABLE_ENTRY
- IO_PAGE_FAULT
- INVALID_DEVICE_REQUEST
We think the AMD IOMMU driver should try to isolate the device. This involves 
blocking device transactions at IOMMU DTE and tries to disable the device (e.g. 
calling the remove(struct pci_dev *pdev) interface generally provides by device 
drivers). This could prevents the device from continuing to fail and to risk of 
system instability.


disabling the device is not an option.
We've seen mis-configured ACPI tables generate storms
of invalid DTE messages after iommu setup but before they are cleared up when
the OS driver is started & resets the device. The original storm is from bios-use
of the IOMMU with a device.

Would some sort of threshold to help determine the badness of errors be
sufficient? For instance, if the device has generated N errors, it is then
removed (where N is tunable through sysfs or kernel boot options).


No!  Removing a device is _not_ acceptable.
Again, the most common case I've seen is the *boot* device
not having the proper IVMD (AMD) or RMRR (Intel) structures in the ACPI tables,
or having them temporarily invalidated during reboot (esp. during kexec'd kdump
kernels).
Second most common -- the usb controller that the user may need to control the
system on power-up.  It'll be more fun when IPMI + IOMMU are put together in
the ARM space.
Filter faults from a device; 'nuf said.



I'd recommend creating a filter that prevents further logging from a device
for 5 mins at a time if a storm of DTE-related errors are seen.
by definition, the DMA is blocked from corrupting/changing memory, so isolation 
has been established;
keeping the failure log from consuming the system is the needed fix.


I believe the IOMMU hardware can be configured to suppress logging of
subsequent I/O page fault errors until the device table cache is cleared.
This should help avoid the storm of interrupts you are seeing.


If the tables are correct... if not then hung system.




3. In case of a posted memory write transaction, the device driver might not be aware
that the transaction has failed and been blocked at the IOMMU. If there is no HW IOMMU,
I believe this is handled by the PCI error handling code. If the IOMMU hardware
reports such a case, could this potentially leverage the Linux IOMMU fault
handling interface, iommu_set_fault_handler() and report_iommu_fault(), to
communicate with the device driver or PCI driver?


Wondering if you could use AER-like callback mechanism so a driver can be 
invoked when IOMMU error occurs,
so the device driver can quiesce or reset the device if it deems it transient.

That might also be possible. I might need to look into it more.

Suravee


In summary: when BIOSes are made perfect, then you could implement your perfect
disabling algorithm;
unfortunately, esp. with IOMMU & intr-remap acpi tables, the
BIOSes are notoriously buggy.




Any feedback or comments are appreciated.

Thank you,
Suravee




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu










Re: RFC: IOMMU/AMD: Error Handling

2013-04-30 Thread Don Dutile

On 04/30/2013 10:56 AM, Suravee Suthikulanit wrote:

On 4/29/2013 4:42 PM, Don Dutile wrote:

On 04/29/2013 04:34 PM, Duran, Leo wrote:

I'm wondering if resetting the IOMMU at init-time (once) would clear any BIOS 
induced noise.
Leo


Well, depends what you mean by 'reset'
(a) setting it up for OS use is effectively a reset, but doesn't quiesce a 
device
doing dma reads of a (bios-setup) queue. then the noisy messages begin
(b) disable the iommu, and then the dma just occurs... and bad for writes, 
potentially.

Similar issue is being reported & worked for kdump, where devices are still
doing DMA while the system is trying to 'reset' to the kexec'd kernel, and
take a crash dump.

Solution: stop devices from doing dma... but some you _want_ enabled 
throughout...
like keyboard & mouse via usb controller, so you get to pick os from
grub... not so for kexec...

So, again, for isolation faults let the hw do its job -- isolate --
and throttle/silence the fault messages with a per-device, time-duration heuristic,
so the system can get through boot-up to the point where enough of the OS is
init'd (drivers started)
to stop the temporary noise.

This sounds more like an issue with the order in which things are initialized in the 
system.
If so, could we separate out the code that enables IOMMU error 
logging/handling and
delay it until we are certain the system is stable?


So, you are proposing we not enable fault events when the IOMMU is initially 
configured;
use the IOMMU through boot/driver-config, hoping all is well (and if not, 
continuing blindly),
and then enable IOMMU faults post/late-init?


Suravee





Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-04-30 Thread Don Dutile

On 04/30/2013 10:54 AM, Sumner, William wrote:

I have installed your original patch set (from last November) and tested with 
three platforms, each with a different IO configuration.  On the first platform 
crashdumps were consistently successful.  On the second and third platforms, 
the reset of one specific PCI device on each platform (a different device type 
on each platform) immediately hung the crash kernel. This was consistent across 
multiple tries.  Temporarily modifying the patch to skip resetting the one 
problem-device on each platform allowed each platform to create a dump 
successfully.

I am now working with our IO team to determine the exact nature of the hang and 
to see if the hang is unique to the specific cards or the specific platforms.

Also I am starting to look at an alternate possibility:

The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level Reset) describes reset 
use-cases for "a partitioned environment where hardware is migrated from one partition to 
another" and for "system software is taking down the software stack for a Function and 
then rebuilding that stack".


Same section in PCIe spec v3.0.
The first paragraph after the 3 (square) bullets in 6.6.2 states: 
"Implementation of FLR is optional (not required), but is strongly recommended."
This optional feature complicates the reset code, and may be the
reason you are seeing hangs.  Do an lspci -xxx dump to see if the
devices you need to skip on reset have FLR capability (find the Device 
Capabilities register, offset 0x4 in the PCIe Capability structure, bit 28).

The only thing the device-endpoint reset doesn't do is monitor/check the
Transactions Pending bit in the device's Device Status register (offset 0xA in 
the PCIe Capability structure).
The spec says that bit could take seconds to clear... typically milliseconds.
In the typical case, the msleep() in the reset path covers that range.  Scanning 
the endpoint-pdev list
to see if any Transactions Pending bits are set could be added, with an 
additional
1-2 second delay if still set... potentially giving up after 2 secs.


These use-cases seem very similar to transitioning into the crash kernel.  The "Implementation 
Note" at the end of that section describes an algorithm for "Avoiding Data Corruption 
From Stale Completions" which looks like it might be applicable to stopping ongoing DMA.  I am 
hopeful that adding some of the steps from this algorithm to one of the already proposed patches 
would avoid the hang that I saw on two platforms.

Bill Sumner

-Original Message-
From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don Dutile
Sent: Friday, April 26, 2013 8:43 AM
To: Takao Indoh
Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
linux-ker...@vger.kernel.org; tin...@gmail.com; 
iommu@lists.linux-foundation.org; bhelg...@google.com
Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA

On 04/25/2013 11:10 PM, Takao Indoh wrote:

(2013/04/26 3:01), Don Dutile wrote:

On 04/25/2013 01:11 AM, Takao Indoh wrote:

(2013/04/25 4:59), Don Dutile wrote:

On 04/24/2013 12:58 AM, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
"pci=pcie_reset_devices" is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_devices() is called from fs_initcall_sync, and it finds root
port/downstream port whose child is PCIe endpoint, and then reset link
between them. If the endpoint is VGA device, it is skipped because the
monitor blacks out if VGA controller is reset.


Couple questions wrt VGA device:
(1) Many graphics devices are multi-function, one function being VGA;
 is the VGA always function 0, so this scan sees it first&doesn't
 do a reset on that PCIe link?  if the VGA is not function 0, won't
 this logic break (will reset b/c function 0 is non-VGA graphics) ?


VGA is not reset irrespective of its function number. The logic of this
patch is:

for_each_pci_dev(dev) {
if (dev is not PCIe)
   continue;
if (dev is not root port/downstream port) ---(1)
   continue;
  

Re: RFC: vfio / iommu driver for hardware with no iommu

2013-04-30 Thread Don Dutile

On 04/30/2013 01:28 PM, Konrad Rzeszutek Wilk wrote:

On Sat, Apr 27, 2013 at 12:22:28PM +0800, Andrew Cooks wrote:

On Fri, Apr 26, 2013 at 6:23 AM, Don Dutile  wrote:

On 04/24/2013 10:49 PM, Sethi Varun-B16395 wrote:





-Original Message-
From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
boun...@lists.linux-foundation.org] On Behalf Of Don Dutile
Sent: Thursday, April 25, 2013 1:11 AM
To: Alex Williamson
Cc: Yoder Stuart-B08248; iommu@lists.linux-foundation.org
Subject: Re: RFC: vfio / iommu driver for hardware with no iommu

On 04/23/2013 03:47 PM, Alex Williamson wrote:


On Tue, 2013-04-23 at 19:16 +, Yoder Stuart-B08248 wrote:




-Original Message-
From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Tuesday, April 23, 2013 11:56 AM
To: Yoder Stuart-B08248
Cc: Joerg Roedel; iommu@lists.linux-foundation.org
Subject: Re: RFC: vfio / iommu driver for hardware with no iommu

On Tue, 2013-04-23 at 16:13 +, Yoder Stuart-B08248 wrote:


Joerg/Alex,

We have embedded systems where we use QEMU/KVM and have the
requirement to do device assignment, but have no iommu.  So we
would like to get vfio-pci working on systems like this.

We're aware of the obvious limitations-- no protection, DMA'able
memory must be physically contiguous and will have no iova->phy
translation.  But there are use cases where all OSes involved are
trusted and customers can
live with those limitations.   Virtualization is used
here not to sandbox untrusted code, but to consolidate multiple
OSes.

We would like to get your feedback on the rough idea.  There are
two parts-- iommu driver and vfio-pci.

1.  iommu driver

First, we still need device groups created because vfio is based on
that, so we envision a 'dummy' iommu driver that implements only
the add/remove device ops.  Something like:

   static struct iommu_ops fsl_none_ops = {
   .add_device = fsl_none_add_device,
   .remove_device  = fsl_none_remove_device,
   };

   int fsl_iommu_none_init()
   {
   int ret = 0;

   ret = iommu_init_mempool();
   if (ret)
   return ret;

   bus_set_iommu(&platform_bus_type,&fsl_none_ops);
   bus_set_iommu(&pci_bus_type,&fsl_none_ops);

   return ret;
   }

2.  vfio-pci

For vfio-pci, we would ideally like to keep user space mostly
unchanged.  User space will have to follow the semantics of mapping
only physically contiguous chunks...and iova will equal phys.

So, we propose to implement a new vfio iommu type, called
VFIO_TYPE_NONE_IOMMU.  This implements any needed vfio interfaces,
but there are no calls to the iommu layer...e.g. map_dma() is a
noop.

Would like your feedback.



My first thought is that this really detracts from vfio and iommu
groups being a secure interface, so somehow this needs to be clearly
an insecure mode that requires an opt-in and maybe taints the
kernel.  Any notion of unprivileged use needs to be blocked and it
should test CAP_COMPROMISE_KERNEL (or whatever it's called now) at
critical access points.  We might even have interfaces exported that
would allow this to be an out-of-tree driver (worth a check).

I would guess that you would probably want to do all the iommu group
setup from the vfio fake-iommu driver.  In other words, that driver
both creates the fake groups and provides the dummy iommu backend for


vfio.


That would be a nice way to compartmentalize this as a
vfio-noiommu-special.



So you mean don't implement any of the iommu driver ops at all and
keep everything in the vfio layer?

Would you still have real iommu groups?...i.e.
$ readlink /sys/bus/pci/devices/:06:0d.0/iommu_group
../../../../kernel/iommu_groups/26

...and that is created by vfio-noiommu-special?



I'm suggesting (but haven't checked if it's possible), to implement
the iommu driver ops as part of the vfio iommu backend driver.  The
primary motivation for this would be to a) keep a fake iommu groups
interface out of the iommu proper (possibly containing it in an
external driver) and b) modularizing it so we don't have fake iommu
groups being created by default.  It would have to populate the iommu
groups sysfs interfaces to be compatible with vfio.


Right now when the PCI and platform buses are probed, the iommu
driver add-device callback gets called and that is where the
per-device group gets created.  Are you envisioning registering a
callback for the PCI bus to do this in vfio-noiommu-special?



Yes.  It's just as easy to walk all the devices rather than doing
callbacks, iirc the group code does this when you register.  In fact,
this noiommu interface may not want to add all devices, we may want to
be very selective and only add some.


Right.
Sounds like a no-iommu driver is needed to leave vfio unaffected, and
still leverage/use vfio for qemu's device ass

Re: RFC: vfio / iommu driver for hardware with no iommu

2013-04-30 Thread Don Dutile

On 04/30/2013 03:11 PM, Konrad Rzeszutek Wilk wrote:

Does vfio work with swiotlb and if not, can/should swiotlb be
extended? Or does the time and space overhead make it a moot point?


It does not work with SWIOTLB as it uses the DMA API, not the IOMMU API.


I think you got it reversed.  vfio uses iommu api, not dma api.


Right.  That is what I was saying :-) SWIOTLB uses the DMA API, not
the IOMMU API. Hence it won't work with VFIO. Unless SWIOTLB implements
the IOMMU API.






if vfio used dma api, swiotlb is configured as the default dma-ops interface
and it could work (with more interfaces... domain-alloc, etc.).






It could be extended to use it. I was toying with this b/c for Xen to
use VFIO I would have to implement a Xen IOMMU driver that would basically
piggyback on the SWIOTLB (as Xen itself does the IOMMU parts and takes
care of all the hard work of securing each guest).

But your requirement would be the same, so it might as well be a generic
driver called the SWIOTLB-IOMMU driver.

If you are up for writing it, I am up for reviewing/Ack-ing/etc.

The complexity would be to figure out the VFIO group thing and how to assign
PCI B:D:F devices to the SWIOTLB-IOMMU driver. Perhaps the same way as
xen-pciback does (or pcistub). That is by writing the BDF in the "bind"
attribute in sysfs (or via a kernel parameter).



Did uio provide this un-secure support, and just needs some attention upstream?


I don't recall how UIO did it. Not sure if it even had the group
support.

no group support. probably doesn't have an iommu-like api either...

sounds like a no-iommu iommu interface is needed! :-p
(Alex: that slipped out! sorry!)


Re: RFC: vfio / iommu driver for hardware with no iommu

2013-04-30 Thread Don Dutile

On 04/30/2013 05:15 PM, Alex Williamson wrote:

On Tue, 2013-04-30 at 16:48 -0400, Don Dutile wrote:

On 04/30/2013 03:11 PM, Konrad Rzeszutek Wilk wrote:

Does vfio work with swiotlb and if not, can/should swiotlb be
extended? Or does the time and space overhead make it a moot point?


It does not work with SWIOTLB as it uses the DMA API, not the IOMMU API.


I think you got it reversed.  vfio uses iommu api, not dma api.


Right.  That is what I was saying :-) SWIOTLB uses the DMA API, not
the IOMMU API. Hence it won't work with VFIO. Unless SWIOTLB implements
the IOMMU API.






if vfio used dma api, swiotlb is configured as the default dma-ops interface
and it could work (with more interfaces... domain-alloc, etc.).






It could be extended to use it. I was toying with this b/c for Xen to
use VFIO I would have to implement a Xen IOMMU driver that would basically
piggyback on the SWIOTLB (as Xen itself does the IOMMU parts and takes
care of all the hard work of securing each guest).

But your requirement would be the same, so it might as well be a generic
driver called the SWIOTLB-IOMMU driver.

If you are up for writing it, I am up for reviewing/Ack-ing/etc.

The complexity would be to figure out the VFIO group thing and how to assign
PCI B:D:F devices to the SWIOTLB-IOMMU driver. Perhaps the same way as
xen-pciback does (or pcistub). That is by writing the BDF in the "bind"
attribute in sysfs (or via a kernel parameter).



Did uio provide this un-secure support, and just needs some attention upstream?


I don't recall how UIO did it. Not sure if it even had the group
support.

no group support. probably doesn't have an iommu-like api either...


It doesn't, in fact uio-pci doesn't even allow enabling bus master
because there's zero isolation.


sounds like a no-iommu iommu interface is needed! :-p
(Alex: that slipped out! sorry!)


I wouldn't say "needed", I'm really not sure how or why this is even
practical.  What would we do with a userspace driver interface that's
backed by a software IOMMU that provides neither translation nor
isolation?  This is exactly why I suggested to the freescale guys that
it should be some kind of vfio-fake-iommu backend with very, very strict
capability checking and no default loading.  Thanks,

Alex


that's what I would expect as well.  but it's still a wonky fake-iommu...
writing code to do almost nothing sounds like pci-stub! :)



Re: decent performance drop for SCSI LLD / SAN initiator when iommu is turned on

2013-05-03 Thread Don Dutile

On 05/02/2013 10:13 AM, Yan Burman wrote:




-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Thursday, May 02, 2013 04:56
To: Or Gerlitz
Cc: Roland Dreier; iommu@lists.linux-foundation.org; Yan Burman; linux-
r...@vger.kernel.org
Subject: Re: decent performance drop for SCSI LLD / SAN initiator when
iommu is turned on

On Thu, May 02, 2013 at 02:11:15AM +0300, Or Gerlitz wrote:

Hi Roland, IOMMU folks,

So we've noted that when configuring the kernel && booting with the Intel
IOMMU set to on, on a physical node (non-VM, and without SR-IOV enabled
by the HW device driver), raw performance of the iSER (iSCSI RDMA) SAN
initiator is reduced notably. E.g., in the testbed we looked at today we
had ~260K 1KB random IOPS and 5.5GB/s BW for 128KB IOs with the IOMMU
turned off for a single LUN, and ~150K IOPS and 4GB/s BW with the IOMMU
turned on. No change on the target node between runs.


That's why we have iommu=pt.
See definition of iommu_pass_through in arch/x86/kernel/pci-dma.c.


I tried passing "intel_iommu=on iommu=pt" to 3.8.11 kernel and I still get 
performance degradation.
I get the same numbers with iommu=pt as without it.

I wanted to send perf output, but currently I seem to have some problem with 
its output.
Will try to get perf differences next week.

Yan



dmesg dump? -- interested to see if x2apic is on, and if MSI is used (or not)



Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Don Dutile
On 05/07/2013 03:09 AM, Takao Indoh wrote:
> Sorry for the delayed response.
> 
> (2013/04/30 23:54), Sumner, William wrote:
>> I have installed your original patch set (from last November) and tested 
>> with three platforms, each with a different IO configuration.  On the first 
>> platform crashdumps were consistently successful.  On the second and third 
>> platforms, the reset of one specific PCI device on each platform (a 
>> different device type on each platform) immediately hung the crash kernel. 
>> This was consistent across multiple tries.  Temporarily modifying the patch 
>> to skip resetting the one problem-device on each platform allowed each 
>> platform to create a dump successfully.
>>
>> I am now working with our IO team to determine the exact nature of the hang 
>> and to see if the hang is unique to the specific cards or the specific 
>> platforms.
>>
>> Also I am starting to look at an alternate possibility:
>>
>> The PCIe spec (my copy is version 2.0) in section 6.6.2 (Function-Level 
>> Reset) describes reset use-cases for "a partitioned environment where 
>> hardware is migrated from one partition to another" and for "system software 
>> is taking down the software stack for a Function and then rebuilding that 
>> stack".
>>
>> These use-cases seem very similar to transitioning into the crash kernel.  
>> The "Implementation Note" at the end of that section describes an algorithm 
>> for "Avoiding Data Corruption From Stale Completions" which looks like it 
>> might be applicable to stopping ongoing DMA.  I am hopeful that adding some 
>> of the steps from this algorithm to one of the already proposed patches 
>> would avoid the hang that I saw on two platforms.
> 
> It seems that the algorithm you mentioned requires four steps.
> 
> 1. Clear Command register
> 2. Wait a few milliseconds (Or polling Transactions Pending bit)
> 3. Do FLR
> 4. Wait 100 ms
> 
> My patch does not do step 1 and 2. So, I would appreciate it if you
> could add the following into save_config() in my latest patch and
> confirm if kernel still hangs up or not.
> 
>  subordinate = dev->subordinate;
>  list_for_each_entry(child,&subordinate->devices, bus_list) {
>  dev_info(&child->dev, "save state\n");
>  pci_save_state(child);
> +   pci_write_config_word(child, PCI_COMMAND, 0);
> +   msleep(1000);
As Linus pointed out in an earlier patch, msleep() after each device
is s-l-o-w and unnecessary; should do a single msleep(1000) after
*all* the command registers are written.  That way, the added delay is 1sec,
not 1sec*ndevs.
q: is this turning off command register for only PCI(e) endpoints?
-- shouldn't have to turn off command register in bridges/switches.

>  }
> 
> Thanks,
> Takao Indoh
> 
> 
>>
>> Bill Sumner
>>
>> -Original Message-
>> From: kexec [mailto:kexec-boun...@lists.infradead.org] On Behalf Of Don 
>> Dutile
>> Sent: Friday, April 26, 2013 8:43 AM
>> To: Takao Indoh
>> Cc: linux-...@vger.kernel.org; ke...@lists.infradead.org; 
>> linux-ker...@vger.kernel.org; tin...@gmail.com; 
>> iommu@lists.linux-foundation.org; bhelg...@google.com
>> Subject: Re: [PATCH] Reset PCIe devices to stop ongoing DMA
>>
>> On 04/25/2013 11:10 PM, Takao Indoh wrote:
>>> (2013/04/26 3:01), Don Dutile wrote:
>>>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>>>> (2013/04/25 4:59), Don Dutile wrote:
>>>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>>>
>>>>>>> Problem:
>>>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>>>> 1) Devices are working on first kernel.
>>>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>>>>  and its DMA continues during this switch.
>>>>>>> 3) iommu is initialized

Re: [PATCH] Reset PCIe devices to stop ongoing DMA

2013-05-07 Thread Don Dutile

On 05/07/2013 12:39 PM, Alex Williamson wrote:

On Wed, 2013-04-24 at 13:58 +0900, Takao Indoh wrote:

This patch resets PCIe devices on boot to stop ongoing DMA. When
"pci=pcie_reset_devices" is specified, a hot reset is triggered on each
PCIe root port and downstream port to reset its downstream endpoint.

Problem:
This patch solves the problem that kdump can fail when intel_iommu=on is
specified. When intel_iommu=on is specified, many dma-remapping errors
occur in second kernel and it causes problems like driver error or PCI
SERR, at last kdump fails. This problem is caused as follows.
1) Devices are working on first kernel.
2) Switch to second kernel(kdump kernel). The devices are still working
and its DMA continues during this switch.
3) iommu is initialized during second kernel boot and ongoing DMA causes
dma-remapping errors.

Solution:
All DMA transactions have to be stopped before iommu is initialized. By
this patch devices are reset and in-flight DMA is stopped before
pci_iommu_init.

To invoke hot reset on an endpoint, its upstream link need to be reset.
reset_pcie_devices() is called from fs_initcall_sync, and it finds root
port/downstream port whose child is PCIe endpoint, and then reset link
between them. If the endpoint is VGA device, it is skipped because the
monitor blacks out if VGA controller is reset.

Actually this is v8 patch but quite different from v7 and it's been so
long since previous post, so I start over again.
Previous post:
[PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
https://lkml.org/lkml/2012/11/26/814

Signed-off-by: Takao Indoh
---
  Documentation/kernel-parameters.txt |2 +
  drivers/pci/pci.c   |  103 +++
  2 files changed, 105 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 4609e81..2a31ade 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
any pair of devices, possibly at the cost of
reduced performance.  This also guarantees
that hot-added devices will work.
+   pcie_reset_devices  Reset PCIe endpoint on boot by hot
+   reset
cbiosize=nn[KMG]The fixed amount of bus space which is
reserved for the CardBus bridge's IO window.
The default value is 256 bytes.
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b099e00..42385c9 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
  }
  EXPORT_SYMBOL(pci_fixup_cardbus);

+/*
+ * Return true if dev is PCIe root port or downstream port whose child is PCIe
+ * endpoint except VGA device.
+ */
+static int __init need_reset(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!pci_is_pcie(dev) || !dev->subordinate ||
+   list_empty(&dev->subordinate->devices) ||
+   ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
+(pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
+   return 0;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child,&subordinate->devices, bus_list) {
+   if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
+   (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
+   ((child->class>>  16) == PCI_BASE_CLASS_DISPLAY))
+   /* Don't reset switch, bridge, VGA device */
+   return 0;
+   }
+
+   return 1;
+}
+
+static void __init save_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child,&subordinate->devices, bus_list) {
+   dev_info(&child->dev, "save state\n");
+   pci_save_state(child);
+   }
+}
+
+static void __init restore_config(struct pci_dev *dev)
+{
+   struct pci_bus *subordinate;
+   struct pci_dev *child;
+
+   if (!need_reset(dev))
+   return;
+
+   subordinate = dev->subordinate;
+   list_for_each_entry(child,&subordinate->devices, bus_list) {
+   dev_info(&child->dev, "restore state\n");
+   pci_restore_state(child);
+   }
+}
+
+static void __init do_device_reset(struct pci_dev *dev)
+{
+   u16 ctrl;
+
+   if (!need_reset(dev))
+   return;
+
+   dev_info(&dev->dev, "Reset Secondary bus\n");
+
+   /* Assert Secondary Bus Reset */
+   pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl);
+   ctrl |= PCI_BRIDGE_CTL_B

Re: [RFC PATCH v2, part 2 16/18] PCI, iommu: use hotplug-safe iterators to walk PCI buses

2013-05-14 Thread Don Dutile

On 05/14/2013 12:52 PM, Jiang Liu wrote:

Enhance iommu drivers to use hotplug-safe iterators to walk
PCI buses.

Signed-off-by: Jiang Liu
Cc: Joerg Roedel
Cc: Ingo Molnar
Cc: Donald Dutile
Cc: Hannes Reinecke
Cc: "Li, Zhen-Hua"
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
---
  drivers/iommu/amd_iommu.c | 4 +++-
  drivers/iommu/dmar.c  | 6 --
  2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index e046d7a..d6fdf94 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -357,6 +357,7 @@ static int init_iommu_group(struct device *dev)
struct iommu_dev_data *dev_data;
struct iommu_group *group;
struct pci_dev *dma_pdev;
+   struct pci_bus *b = NULL;
int ret;

group = iommu_group_get(dev);
@@ -393,7 +394,7 @@ static int init_iommu_group(struct device *dev)
 * the alias.  Be careful to also test the parent device if
 * we think the alias is the root of the group.
 */
-   bus = pci_find_bus(0, alias>>  8);
+   b = bus = pci_find_bus(0, alias>>  8);
if (!bus)
goto use_group;

@@ -413,6 +414,7 @@ static int init_iommu_group(struct device *dev)
dma_pdev = get_isolation_root(pci_dev_get(to_pci_dev(dev)));
  use_pdev:
ret = use_pdev_iommu_group(dma_pdev, dev);
+   pci_bus_put(b);

pci_find_bus() does a pci_bus_put() after its pci_get_bus();
is this pci_bus_put() needed here, or did you mean to use pci_get_bus() in the patch above?


pci_dev_put(dma_pdev);
return ret;
  use_group:
diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index e5cdaf8..5bb3cdc 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -67,12 +67,12 @@ static void __init dmar_register_drhd_unit(struct 
dmar_drhd_unit *drhd)
  static int __init dmar_parse_one_dev_scope(struct acpi_dmar_device_scope 
*scope,
   struct pci_dev **dev, u16 segment)
  {
-   struct pci_bus *bus;
+   struct pci_bus *b, *bus;
struct pci_dev *pdev = NULL;
struct acpi_dmar_pci_path *path;
int count;

-   bus = pci_find_bus(segment, scope->bus);
+   b = bus = pci_find_bus(segment, scope->bus);
path = (struct acpi_dmar_pci_path *)(scope + 1);
count = (scope->length - sizeof(struct acpi_dmar_device_scope))
/ sizeof(struct acpi_dmar_pci_path);
@@ -97,6 +97,8 @@ static int __init dmar_parse_one_dev_scope(struct 
acpi_dmar_device_scope *scope,
count --;
bus = pdev->subordinate;
}
+   pci_bus_put(b);

ditto.


+
if (!pdev) {
pr_warn("Device scope device [%04x:%02x:%02x.%02x] not found\n",
segment, scope->bus, path->dev, path->fn);


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [Qemu-devel] SR-IOV PF reset and QEMU VFs VFIO passthrough

2013-06-03 Thread Don Dutile

On 06/01/2013 08:13 AM, Benoît Canet wrote:


Hello,

I may soon have to write the PF driver of an SR-IOV card and make it work with
QEMU/KVM, so I have the following questions.

In an AMD64 setup where QEMU uses VFIO to pass through the VFs of an SR-IOV card
to a guest, will the consequences of a PF FLR be handled correctly by QEMU and the
guest?


the reset occurs long before the device is passed to the guest.


I read that pci_reset_function would call pci_restore_state restoring the SR-IOV
configuration after the reset of the PF.


correct.


The way the hardware works means that the VFs would disappear and reappear in a
short lapse of time.


Not sure of your definition of 'disappear'.  If you mean: if I had another thread
poking at the device, the device would appear to be removed, then come back (if the os
poking hasn't crashed from the device's lack of response).
If you mean the VF gets entirely removed from the PCI tree, then no.
A pci reset != hot unplug/plug.  The device remains in the device tree.


Will these events be handled by the kernel pci hotplug code ?


'these events' ??? -- which events?
FLR is currently done by libvirt & qemu/vfio to ensure assigned devices are 
quiesced
as they are switched from host->guest domain, and guest->(back-to-)host domain.


Given that the PF driver restores the PF config space after the reset, will /sys

The PF driver doesn't do the config space restore -- it's done in PCI core code.


files used by QEMU disappear and reappear, messing up the QEMU VFIO passthrough, or

As stated above, the devices don't disappear from the device tree, so they don't
get removed/added to the /sys(/bus/pci/...) files.


will it go smoothly?


it goes smoothly today :-/


Best regards

Benoît Canet





Re: VFIO and scheduled SR-IOV cards

2013-06-03 Thread Don Dutile

On 06/03/2013 02:02 PM, Alex Williamson wrote:

On Mon, 2013-06-03 at 18:33 +0200, Benoît Canet wrote:

Hello,

I plan to write a PF driver for an SR-IOV card and make the VFs work with QEMU's
VFIO passthrough so I am asking the following design question before trying to
write and push code.

After SR-IOV is enabled on this hardware, only one VF can be active
at a given time.


Is this actually an SR-IOV device or are you trying to write a driver
that emulates SR-IOV for a PF?


The PF host kernel driver is acting as a scheduler.
It switches every few milliseconds which VF is the currently active function while
disabling the other VFs.


that's time-sharing of hw, which sw doesn't see ... so, ok.


One consequence of how the hardware works is that the MMR regions of the
switched-off VFs must be unmapped and their I/O access should block until the VF
is switched on again.



This violates the spec., and does impact sw -- how can one assign such a VF to a guest
-- it does not work indep. of other VFs.


MMR = Memory Mapped Register?

This seems contradictory to the SR-IOV spec, which states:

 Each VF contains a non-shared set of physical resources required
 to deliver Function-specific
 services, e.g., resources such as work queues, data buffers,
 etc. These resources can be directly
 accessed by an SI without requiring VI or SR-PCIM intervention.

Furthermore, each VF should have a separate requester ID.  What's being
suggested here seems like maybe that's not the case.  If true, it would

I didn't read it that way above.  I read it as the PCIe end is timeshared
btwn VFs (& PFs?), with some VFs disappearing (from a driver perspective)
as if the device was hot-unplugged w/o notification.  That will probably cause
read timeouts & SMEs, bringing down most enterprise-level systems.


make iommu groups challenging.  Is there any VF save/restore around the
scheduling?


Each IOMMU map/unmap should be done in less than 100ns.


I think that may be a lot to ask if we need to unmap the regions in the
guest and in the iommu.  If the "VFs" used different requester IDs,
iommu unmapping wouldn't be necessary.  I experimented with switching
between trapped (read/write) access to memory regions and mmap'd (direct
mapping) for handling legacy interrupts.  There was a noticeable
performance penalty switching per interrupt.


As the kernel iommu module is being called by the VFIO driver the PF driver
cannot interface with it.

Currently the only interface of the VFIO code is for the userland QEMU process
and I fear that notifying QEMU that it should do the unmap/block would take more
than 100ns.

Also blocking the IO access in QEMU under the BQL would freeze QEMU.

Do you have an idea how to write this required map-and-block/unmap feature?


It seems like there are several options, but I'm doubtful that any of
them will meet 100ns.  If this is completely fake SR-IOV and there's not
a different requester ID per VF, I'd start with seeing if you can even
do the iommu_unmap/iommu_map of the MMIO BARs in under 100ns.  If that's
close to your limit, then your only real option for QEMU is to freeze
it, which still involves getting multiple (maybe many) vCPUs out of VM
mode.  That's not free either.  If by some miracle you have time to
spare, you could remap the regions to trapped mode and let the vCPUs run
while vfio blocks on read/write.

Maybe there's even a question whether mmap'd mode is worthwhile for this
device.  Trapping every read/write is orders of magnitude slower, but
allows you to handle the "wait for VF" on the kernel side.

If you can provide more info on the device design/constraints, maybe we
can come up with better options.  Thanks,

Alex


Re: [Qemu-devel] SR-IOV PF reset and QEMU VFs VFIO passthrough

2013-06-03 Thread Don Dutile

On 06/03/2013 03:29 PM, Benoît Canet wrote:

to a guest will the consequences of a PF FLR be handled fine by QEMU and the
guest ?


the reset occurs long before the device is passed to the guest.


I was asking this because the PF driver should reset the PF while the VFs are
used by VFIO/QEMU, when the PF doesn't respond anymore.


What your VF does while your PF is being reset is PF (& VF) dependent.
A 'good design' would not impact the VF operation, other than to stall it until
the PF completed reset.  My experience, though, is that the PF has to be brought
up to some level of functionality to share the physical resources with the VFs.


The PF driver doesn't do the config space restore -- it's done in PCI core code.

files used by QEMU disappear and reappear, messing up the QEMU VFIO passthrough, or

As stated above, the devices don't disappear from the device tree, so they don't
get removed/added to the /sys(/bus/pci/...) files.


will it go smoothly?


it goes smoothly today :-/


Happy to read that thanks for the answer.

Best regards

Benoît Canet
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: [Qemu-devel] SR-IOV PF reset and QEMU VFs VFIO passthrough

2013-06-03 Thread Don Dutile

On 06/03/2013 05:27 PM, Benoît Canet wrote:

I was asking this because the PF driver should reset the PF while the VFs are
used by VFIO/QEMU, when the PF doesn't respond anymore.


What your VF does while your PF is being reset is PF (& VF) dependent.
A 'good design' would not impact the VF operation, other than to stall it until
the PF completed reset.  My experience, though, is that the PF has to be brought
up to some level of functionality to share the physical resources with the VFs.


When the PF does an FLR, the hardware goes back to its default state, the SR-IOV
configuration is gone, and the VFs disappear from the bus.
Then the restore-state function of the kernel reset code would bring the SR-IOV
PF configuration back.


Ok, now you're a bit misled here.
The configuration header for SRIOV is _not_ put back.
Only the std, PCI config header section is put back in place, along with
msi(x), pm-caps.
If the hw wipes out all VF state setup (which it should, IMO), all VF 
configuration
will be lost in the hw...
*but*, the PCI core will still think the VFs exist (not hot-unplugged, no more 
than PF was);
trying to setup the VFs again, will fail (or worse).


The hardware also has a privately owned SR-IOV-related configuration in the PF
configuration space. This configuration is used to configure the VFs' resources
(memory).


Per the SRIOV spec, yes, but that's in PCIe ext cfg space.
That area of the PCI configuration is not saved or restored by dev-reset.


Best regards

Benoît Canet




Re: [Qemu-devel] SR-IOV PF reset and QEMU VFs VFIO passthrough

2013-06-03 Thread Don Dutile

On 06/03/2013 05:58 PM, Benoît Canet wrote:

When the PF does an FLR, the hardware goes back to its default state, the SR-IOV
configuration is gone, and the VFs disappear from the bus.
Then the restore-state function of the kernel reset code would bring the SR-IOV
PF configuration back.


Ok, now you're a bit misled here.
The configuration header for SRIOV is _not_ put back.
Only the std, PCI config header section is put back in place, along with
msi(x), pm-caps.
If the hw wipes out all VF state setup (which it should, IMO), all VF 
configuration
will be lost in the hw...
*but*, the PCI core will still think the VFs exist (not hot-unplugged, no more 
than PF was);
trying to setup the VFs again, will fail (or worse).


I read the following code on a not so old kernel.

---
int pci_reset_function(struct pci_dev *dev)
{
	int rc;

	rc = pci_dev_reset(dev, 1);
	if (rc)
		return rc;

	pci_save_state(dev);

	/*
	 * both INTx and MSI are disabled after the Interrupt Disable bit
	 * is set and the Bus Master bit is cleared.
	 */
	pci_write_config_word(dev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);

	rc = pci_dev_reset(dev, 0);

	pci_restore_state(dev);

	return rc;
}
EXPORT_SYMBOL_GPL(pci_reset_function);
---

and

---
/**
 * pci_restore_state - Restore the saved state of a PCI device
 * @dev: - PCI device that we're dealing with
 */
void pci_restore_state(struct pci_dev *dev)
{
	if (!dev->state_saved)
		return;

	/* PCI Express register must be restored first */
	pci_restore_pcie_state(dev);
	pci_restore_ats_state(dev);

	pci_restore_config_space(dev);

	pci_restore_pcix_state(dev);
	pci_restore_msi_state(dev);
	pci_restore_iov_state(dev);

	dev->state_saved = false;
}
---

with pci_restore_iov_state calling sriov_restore_state:

---
static void sriov_restore_state(struct pci_dev *dev)
{
	int i;
	u16 ctrl;
	struct pci_sriov *iov = dev->sriov;

	pci_read_config_word(dev, iov->pos + PCI_SRIOV_CTRL, &ctrl);
	if (ctrl & PCI_SRIOV_CTRL_VFE)
		return;

	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++)
		pci_update_resource(dev, i);

	pci_write_config_dword(dev, iov->pos + PCI_SRIOV_SYS_PGSIZE, iov->pgsz);
	pci_write_config_word(dev, iov->pos + PCI_SRIOV_NUM_VF, iov->num_VFs);
	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
	if (iov->ctrl & PCI_SRIOV_CTRL_VFE)
		msleep(100);
}
---


The sriov_restore_state function looks like it does the right thing, but maybe I
misread the code.


/my bad; I forgot about the save|restore_iov_state calls doh!

Now it gets down to how well your hw (& driver) works after the reset is done...




The hardware also have a privately owned SR-IOV related configuration in the PF
configuration space. This configuration is used to configure the VFs resources.
(memory)


Per the SRIOV spec, yes, but that's in PCIe ext cfg space.
That area of the PCI configuration is not saved or restored by dev-reset.


Can a callback be added so the PF driver can restore this state?


As you pointed out, no need to, unless it's a device-specific
PCIe cap structure.  The SRIOV caps are re-instated, as you showed above...


Best regards

Benoît Canet




Re: [RFC PATCH v2 1/2] pci: Create PCIe requester ID interface

2013-07-25 Thread Don Dutile

On 07/24/2013 05:22 PM, Alex Williamson wrote:

On Wed, 2013-07-24 at 16:42 -0400, Don Dutile wrote:

On 07/23/2013 06:35 PM, Bjorn Helgaas wrote:

On Thu, Jul 11, 2013 at 03:03:27PM -0600, Alex Williamson wrote:

This provides interfaces for drivers to discover the visible PCIe
requester ID for a device, for things like IOMMU setup, and iterate


IDs (plural)


a single device does not have multiple requester id's;
it can have multiple tag-id's (that are ignored in this situation, but
can be used by switches for ordering purposes), but there's only 1/fcn
(except for those quirky pdevs!).


over the device chain from requestee to requester, including DMA
quirks at each step.


"requestee" doesn't make sense to me.  The "-ee" suffix added to a verb
normally makes a noun that refers to the object of the action.  So
"requestee" sounds like it means something like "target" or "responder,"
but that's not what you mean here.


sorry, I didn't follow the requester/requestee terminology either...


Suggested-by: Bjorn Helgaas
Signed-off-by: Alex Williamson
---
   drivers/pci/search.c |  198 
++
   include/linux/pci.h  |7 ++
   2 files changed, 205 insertions(+)

diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index d0627fa..4759c02 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -18,6 +18,204 @@ DECLARE_RWSEM(pci_bus_sem);
   EXPORT_SYMBOL_GPL(pci_bus_sem);

   /*
+ * pci_has_pcie_requester_id - Does @dev have a PCIe requester ID
+ * @dev: device to test
+ */
+static bool pci_has_pcie_requester_id(struct pci_dev *dev)
+{
+   /*
+* XXX There's no indicator of the bus type, conventional PCI vs
+* PCI-X vs PCI-e, but we assume that a caller looking for a PCIe
+* requester ID is a native PCIe based system (such as VT-d or
+* AMD-Vi).  It's common that PCIe root complex devices do not

could the above comment be x86-iommu-neutral?
by definition of PCIe, all devices have a requester id (min. to accept cfg 
cycles);
req'd if source of read/write requests, read completions.


I agree completely, the question is whether we have a PCIe root complex
or a conventional PCI host bus.  I don't think we have any way to tell,
so I'm assuming pci_is_root_bus() indicates we're on a PCIe root complex
and therefore have requester IDs.  If there's some way to determine this
let me know and we can avoid any kind of assumption.


+* include a PCIe capability, but we can assume they are PCIe
+* devices based on their topology.
+*/
+   if (pci_is_pcie(dev) || pci_is_root_bus(dev->bus))
+   return true;
+
+   /*
+* PCI-X devices have a requester ID, but the bridge may still take
+* ownership of transactions and create a requester ID.  We therefore
+* assume that the PCI-X requester ID is not the same one used on PCIe.
+*/
+
+#ifdef CONFIG_PCI_QUIRKS
+   /*
+* Quirk for PCIe-to-PCI bridges which do not expose a PCIe capability.
+* If the device is a bridge, look to the next device upstream of it.
+* If that device is PCIe and not a PCIe-to-PCI bridge, then by
+* deduction, the device must be PCIe and therefore has a requester ID.
+*/
+   if (dev->subordinate) {
+   struct pci_dev *parent = dev->bus->self;
+
+   if (pci_is_pcie(parent) &&
+   pci_pcie_type(parent) != PCI_EXP_TYPE_PCI_BRIDGE)
+   return true;
+   }
+#endif
+
+   return false;
+}
+
+/*
+ * pci_has_visible_pcie_requester_id - Can @bridge see @dev's requester ID?
+ * @dev: requester device
+ * @bridge: upstream bridge (or NULL for root bus)
+ */
+static bool pci_has_visible_pcie_requester_id(struct pci_dev *dev,
+ struct pci_dev *bridge)
+{
+   /*
+* The entire path must be tested, if any step does not have a
+* requester ID, the chain is broken.  This allows us to support
+* topologies with PCIe requester ID gaps, ex: PCIe-PCI-PCIe
+*/
+   while (dev != bridge) {
+   if (!pci_has_pcie_requester_id(dev))
+   return false;
+
+   if (pci_is_root_bus(dev->bus))
+   return !bridge; /* false if we don't hit @bridge */
+
+   dev = dev->bus->self;
+   }
+
+   return true;
+}
+
+/*
+ * Legacy PCI bridges within a root complex (ex. Intel 82801) report
+ * a different requester ID than a standard PCIe-to-PCI bridge.  Instead

First, I'm assuming you mean that devices behind a Legacy PCI bridge within a 
root complex
get assigned IDs different than std PCIe-to-PCI bridges (as quoted below).


Yes


+ * of using (subordinate<<   8 | 0) the use (bus<<   8 |

Re: [PATCH 1/7] VFIO_IOMMU_TYPE1 workaround to build for platform devices

2013-10-28 Thread Don Dutile

On 10/02/2013 08:14 AM, Alexander Graf wrote:


On 01.10.2013, at 21:21, Yoder Stuart-B08248 wrote:


static int __init vfio_iommu_type1_init(void)
{
-   if (!iommu_present(&pci_bus_type))
+#ifdef CONFIG_PCI
+   if (iommu_present(&pci_bus_type)) {
+   iommu_bus_type = &pci_bus_type;
+   /* For PCI targets, IOMMU_CAP_INTR_REMAP is required */
+   require_cap_intr_remap = true;
+   }
+#endif
+   if (!iommu_bus_type && iommu_present(&platform_bus_type))
+   iommu_bus_type = &platform_bus_type;
+
+   if (!iommu_bus_type)
return -ENODEV;

return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);


Is it possible to have a system with both PCI and platform devices?  How
would you support that?  Thanks,


It most certainly is a requirement to support both.  This is how
all of our (FSL) SoCs will expect to work.


I thought the PCI bus emits a cookie that the system wide IOMMU can then use to 
differentiate the origin of the transaction? So the same IOMMU can be used for 
PCI as well as platform routing.


*can* be the same IOMMU, yes;
have to, no, so there can be multiple IOMMUs of different types.



Alex



Re: [PATCH 2/7] Initial skeleton of VFIO support for Device Tree based devices

2013-10-28 Thread Don Dutile

On 09/30/2013 11:37 AM, Bhushan Bharat-R65777 wrote:




-Original Message-
From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
boun...@lists.linux-foundation.org] On Behalf Of Antonios Motakis
Sent: Monday, September 30, 2013 8:59 PM
To: kvm...@lists.cs.columbia.edu; alex.william...@redhat.com
Cc: linux-samsung-...@vger.kernel.org; k...@vger.kernel.org; ag...@suse.de; 
Yoder
Stuart-B08248; iommu@lists.linux-foundation.org; Antonios Motakis;
t...@virtualopensystems.com
Subject: [PATCH 2/7] Initial skeleton of VFIO support for Device Tree based
devices

Platform devices in the Linux kernel are usually managed by the DT interface.
This patch forms the base to support these kind of devices with VFIO.

Signed-off-by: Antonios Motakis
---
  drivers/vfio/Kconfig |  11 +++
  drivers/vfio/Makefile|   1 +
  drivers/vfio/vfio_platform.c | 187 +++
  include/uapi/linux/vfio.h|   1 +
  4 files changed, 200 insertions(+)
  create mode 100644 drivers/vfio/vfio_platform.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 1f84eda..35254b7
100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -13,4 +13,15 @@ menuconfig VFIO

  If you don't know what to do here, say N.

+config VFIO_PLATFORM
+   tristate "VFIO support for device tree based platform devices"
+   depends on VFIO && EVENTFD && OF
+   help
+ Support for platform devices with VFIO. This is required to make
+ use of platform devices present on device tree nodes using the VFIO
+ framework. Devices that are not described in the device tree cannot
+ be used by this driver.
+
+ If you don't know what to do here, say N.
+
  source "drivers/vfio/pci/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
2398d4a..575c8dd 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
  obj-$(CONFIG_VFIO) += vfio.o
  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
  obj-$(CONFIG_VFIO_PCI) += pci/
+obj-$(CONFIG_VFIO_PLATFORM) += vfio_platform.o
diff --git a/drivers/vfio/vfio_platform.c b/drivers/vfio/vfio_platform.c new


We can make this parallel to PCI, something like 
drivers/vfio/platform/platform.c


pls, no.  'platform' is too generic, and it really means 'arm-dt' ... so it can
move to the arch/arm space, and have its kconfig conditional on ARM && VFIO.
if kept under drivers/vfio, then use a better directory name that ties it to arm-dt.
thanks.
- Don



-Bharat


file mode 100644 index 000..b9686b0
--- /dev/null
+++ b/drivers/vfio/vfio_platform.c
@@ -0,0 +1,187 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Antonios Motakis"
+#define DRIVER_DESC "VFIO Device Tree devices - User Level meta-driver"
+
+struct vfio_platform_device {
+   struct platform_device  *pdev;
+};
+
+static void vfio_platform_release(void *device_data) {
+   module_put(THIS_MODULE);
+}
+
+static int vfio_platform_open(void *device_data) {
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   return 0;
+}
+
+static long vfio_platform_ioctl(void *device_data,
+  unsigned int cmd, unsigned long arg) {
+   struct vfio_platform_device *vdev = device_data;
+   unsigned long minsz;
+
+   if (cmd == VFIO_DEVICE_GET_INFO) {
+   struct vfio_device_info info;
+
+   minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+   if (copy_from_user(&info, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (info.argsz<  minsz)
+   return -EINVAL;
+
+   info.flags = VFIO_DEVICE_FLAGS_PLATFORM;
+   info.num_regions = 0;
+   info.num_irqs = 0;
+
+   return copy_to_user((void __user *)arg,&info, minsz);
+
+   } else if (cmd == VFIO_DEVICE_GET_REGION_INFO)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_GET_IRQ_INFO)
+   return -EINVAL;
+
+   else if (cmd == VFIO_DEVICE_SET_IRQS)
+

Re: [PATCH 2/7] Initial skeleton of VFIO support for Device Tree based devices

2013-10-29 Thread Don Dutile

On 10/29/2013 07:47 AM, Alex Williamson wrote:

On Mon, 2013-10-28 at 21:29 -0400, Don Dutile wrote:

On 09/30/2013 11:37 AM, Bhushan Bharat-R65777 wrote:




-Original Message-
From: iommu-boun...@lists.linux-foundation.org [mailto:iommu-
boun...@lists.linux-foundation.org] On Behalf Of Antonios Motakis
Sent: Monday, September 30, 2013 8:59 PM
To: kvm...@lists.cs.columbia.edu; alex.william...@redhat.com
Cc: linux-samsung-...@vger.kernel.org; k...@vger.kernel.org; ag...@suse.de; 
Yoder
Stuart-B08248; iommu@lists.linux-foundation.org; Antonios Motakis;
t...@virtualopensystems.com
Subject: [PATCH 2/7] Initial skeleton of VFIO support for Device Tree based
devices

Platform devices in the Linux kernel are usually managed by the DT interface.
This patch forms the base to support these kind of devices with VFIO.

Signed-off-by: Antonios Motakis
---
   drivers/vfio/Kconfig |  11 +++
   drivers/vfio/Makefile|   1 +
   drivers/vfio/vfio_platform.c | 187 
+++
   include/uapi/linux/vfio.h|   1 +
   4 files changed, 200 insertions(+)
   create mode 100644 drivers/vfio/vfio_platform.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index 1f84eda..35254b7
100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -13,4 +13,15 @@ menuconfig VFIO

  If you don't know what to do here, say N.

+config VFIO_PLATFORM
+   tristate "VFIO support for device tree based platform devices"
+   depends on VFIO && EVENTFD && OF
+   help
+ Support for platform devices with VFIO. This is required to make
+ use of platform devices present on device tree nodes using the VFIO
+ framework. Devices that are not described in the device tree cannot
+ be used by this driver.
+
+ If you don't know what to do here, say N.
+
   source "drivers/vfio/pci/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
2398d4a..575c8dd 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,3 +1,4 @@
   obj-$(CONFIG_VFIO) += vfio.o
   obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
   obj-$(CONFIG_VFIO_PCI) += pci/
+obj-$(CONFIG_VFIO_PLATFORM) += vfio_platform.o
diff --git a/drivers/vfio/vfio_platform.c b/drivers/vfio/vfio_platform.c new


We can make this parallel to PCI, something like 
drivers/vfio/platform/platform.c


pls, no.  'platform' is too generic, and it really means 'arm-dt' ... so it can
move to the arch/arm space, and have its kconfig conditional on ARM && VFIO.
if kept under drivers/vfio, then use a better directory name that ties it to arm-dt.
thanks.


The intention is that vfio platform device support is not arm-dt
specific.  This is to be used by both arm and embedded ppc.  The devices
we intend to support with them are known as platform drivers in the
kernel, thus the name.  I suppose the question remains whether the
interface here is really generic for any "platform" device or whether
we're making an interface specifically for device
tree platform devices, or if those are one and the same.  In any case,
arm-dt is certainly not the answer.

Alex


I thought that was the intention until I saw this use in platform.c:
static const struct of_device_id vfio_platform_match[] = {

So, of_device_id hit me as DT-specific, and thus, the file should have
a name that implies it's not as platform-generic as one may want/expect.
I agree that the 'arm' part can be dropped, but it's not of-agnostic atm.



file mode 100644 index 000..b9686b0
--- /dev/null
+++ b/drivers/vfio/vfio_platform.c
@@ -0,0 +1,187 @@
+/*
+ * Copyright (C) 2013 - Virtual Open Systems
+ * Author: Antonios Motakis
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Antonios Motakis"
+#define DRIVER_DESC "VFIO Device Tree devices - User Level meta-driver"
+
+struct vfio_platform_device {
+   struct platform_device  *pdev;
+};
+
+static void vfio_platform_release(void *device_data) {
+   module_put(THIS_MODULE);
+}
+
+static int vfio_pl

Re: [PATCH v2] Enhance dmar to support device hotplug

2013-12-10 Thread Don Dutile

On 11/21/2013 03:21 AM, Yijing Wang wrote:

This is the v2 patch, the v1 link: 
http://marc.info/?l=linux-pci&m=138364004628824&w=2

v1->v2: keep the (pci_dev *) pointer array in dmar_drhd_unit; only use the pci device id
to update the pci_dev * pointer info during device hotplug in the intel iommu
driver notifier.

Currently, the DMAR driver saves target pci device pointers for drhd/rmrr/atsr
in a (pci_dev *) array, but never updates this info after initialization.
That's not safe, because pci devices may be hot added or removed while the
system is running, and they will get new pci_dev * pointers. So if there are
two or more IOMMUs in the system, these devices may find the wrong drhd during
DMA mapping, and DMAR faults will occur. This patch saves the pci device id
as well as the (pci_dev *) to fix this issue. The pci device id will be used to update
the pci_dev * pointer during device hotplug in the intel iommu driver notifier.
Also, using a list to manage the target devices per IOMMU lets us easily use
the list helpers.

Yijing Wang (1):
   IOMMU: enhance dmar to support device hotplug

  drivers/iommu/dmar.c|   82 +++---
  drivers/iommu/intel-iommu.c |  161 +-
  include/linux/dmar.h|   24 --
  3 files changed, 167 insertions(+), 100 deletions(-)





Can this bug & fix be demonstrated by configuring & de-configuring VFs
on an SRIOV device, since that effectively looks like a hot-add & hot-remove?




Re: [RFC PATCH] vfio/iommu_type1: Multi-IOMMU domain support

2014-01-27 Thread Don Dutile

On 01/20/2014 11:21 AM, Alex Williamson wrote:

On Mon, 2014-01-20 at 14:45 +, Varun Sethi wrote:



-Original Message-
From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Saturday, January 18, 2014 2:06 AM
To: Sethi Varun-B16395
Cc: iommu@lists.linux-foundation.org; linux-ker...@vger.kernel.org
Subject: [RFC PATCH] vfio/iommu_type1: Multi-IOMMU domain support

RFC: This is not complete but I want to share with Varun the dirrection
I'm thinking about.  In particular, I'm really not sure if we want to
introduce a "v2" interface version with slightly different unmap
semantics.  QEMU doesn't care about the difference, but other users
might.  Be warned, I'm not even sure if this code works at the moment.
Thanks,

Alex


We currently have a problem that we cannot support advanced features of
an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee that
those features will be supported by all of the hardware units involved
with the domain over its lifetime.  For instance, the Intel VT-d
architecture does not require that all DRHDs support snoop control.  If
we create a domain based on a device behind a DRHD that does support
snoop control and enable SNP support via the IOMMU_CACHE mapping option,
we cannot then add a device behind a DRHD which does not support snoop
control or we'll get reserved bit faults from the SNP bit in the
pagetables.  To add to the complexity, we can't know the properties of a
domain until a device is attached.

[Sethi Varun-B16395] Effectively, it's the same iommu and iommu_ops
are common across all bus types. The hardware feature differences are
abstracted by the driver.


That's a simplifying assumption that is not made anywhere else in the
code.  The IOMMU API allows entirely independent IOMMU drivers to
register per bus_type.  There is no guarantee that all devices are
backed by the same IOMMU hardware unit or make use of the same
iommu_ops.


We could pass this problem off to userspace and require that a separate
vfio container be used, but we don't know how to handle page accounting
in that case.  How do we know that a page pinned in one container is the
same page as one pinned in a different container, so that we can avoid
double-billing the user for the page?
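The accounting problem can be illustrated with a toy model (hypothetical, not vfio code): within one container a pfn pinned twice can be refcounted and charged against the user's locked-memory limit once, but two independent containers share no state, so the same pfn gets charged twice.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PFNS 16

struct container {
    uint64_t pfn[MAX_PFNS];
    int      refs[MAX_PFNS];
    int      nr;
    long     locked;   /* pages charged against the memlock limit */
};

/* Pin a pfn into one container: repeats within the same container
 * are refcounted and billed only once. */
static void pin(struct container *c, uint64_t pfn)
{
    for (int i = 0; i < c->nr; i++)
        if (c->pfn[i] == pfn) {
            c->refs[i]++;
            return;
        }
    c->pfn[c->nr] = pfn;
    c->refs[c->nr] = 1;
    c->nr++;
    c->locked++;
}
```

Two separate containers each run this logic independently, which is exactly the double-billing scenario described above.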

The solution is therefore to support multiple IOMMU domains per
container.  In the majority of cases, only one domain will be required
since hardware is typically consistent within a system.  However, this
provides us the ability to validate compatibility of domains and support
mixed environments where page table flags can be different between
domains.

To do this, our DMA tracking needs to change.  We currently try to
coalesce user mappings into as few tracking entries as possible.  The
problem then becomes that we lose granularity of user mappings.  We've
never guaranteed that a user is able to unmap at a finer granularity than
the original mapping, but we must honor the granularity of the original
mapping.  This coalescing code is therefore removed, allowing only unmaps
covering complete maps.  The change in accounting is fairly small here, a
typical QEMU VM will start out with roughly a dozen entries, so it's
arguable if this coalescing was ever needed.
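The "unmaps must cover complete maps" rule can be sketched as a toy list of mappings (illustrative only; the real code uses more elaborate tracking structures):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct dma_map {
    uint64_t iova, size;
    struct dma_map *next;
};

static void map_add(struct dma_map **head, uint64_t iova, uint64_t size)
{
    struct dma_map *m = malloc(sizeof(*m));
    m->iova = iova;
    m->size = size;
    m->next = *head;
    *head = m;
}

/* With coalescing removed, an unmap succeeds only if it exactly
 * covers one original mapping -- no partial unmaps. */
static int map_del(struct dma_map **head, uint64_t iova, uint64_t size)
{
    for (struct dma_map **p = head; *p; p = &(*p)->next)
        if ((*p)->iova == iova && (*p)->size == size) {
            struct dma_map *dead = *p;
            *p = dead->next;
            free(dead);
            return 0;
        }
    return -22;  /* -EINVAL */
}
```

Since mappings are never merged, each entry retains the granularity of the original user request, which is exactly what the exact-match test relies on.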

We also move IOMMU domain creation to the point where a group is attached
to the container.  An interesting side-effect of this is that we now have
access to the device at the time of domain creation and can probe the
devices within the group to determine the bus_type.
This finally makes vfio_iommu_type1 completely device/bus agnostic.
In fact, each IOMMU domain can host devices on different buses managed by
different physical IOMMUs, and present a single DMA mapping interface to
the user.  When a new domain is created, mappings are replayed to bring
the IOMMU pagetables up to the state of the current container.  And of
course, DMA mapping and unmapping automatically traverse all of the
configured IOMMU domains.
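The replay-on-attach and map-into-all-domains behavior described above can be modeled in a few lines of C (a sketch with invented names, not the vfio_iommu_type1 implementation):

```c
#include <assert.h>
#include <stdint.h>

#define MAX 8

struct mapping { uint64_t iova, pfn; };

struct domain { struct mapping pt[MAX]; int nr; };

struct cont {
    struct domain *dom[2];
    int ndom;
    struct mapping maps[MAX];
    int nmaps;
};

static void dom_map(struct domain *d, uint64_t iova, uint64_t pfn)
{
    d->pt[d->nr++] = (struct mapping){ iova, pfn };
}

/* DMA_MAP traverses every configured domain in the container. */
static void cont_map(struct cont *c, uint64_t iova, uint64_t pfn)
{
    c->maps[c->nmaps++] = (struct mapping){ iova, pfn };
    for (int i = 0; i < c->ndom; i++)
        dom_map(c->dom[i], iova, pfn);
}

/* Attaching a group that needs a new domain replays the container's
 * existing mappings so the new page tables catch up. */
static void cont_attach(struct cont *c, struct domain *d)
{
    for (int i = 0; i < c->nmaps; i++)
        dom_map(d, c->maps[i].iova, c->maps[i].pfn);
    c->dom[c->ndom++] = d;
}
```

The invariant is that after any sequence of operations, every domain's page tables describe the same IOVA space the user sees.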


[Sethi Varun-B16395] This code still checks to see that devices being
attached to the domain are connected to the same bus type. If we
intend to merge devices from different bus types but attached to
compatible domains in to a single domain, why can't we avoid the bus
check? Why can't we remove the bus dependency from domain allocation?


So if I were to test iommu_ops instead of bus_type (ie. assume that if an
IOMMU driver manages iommu_ops across bus_types then it can accept the
devices), would that satisfy your concern?

It may be possible to remove the bus_type dependency from domain
allocation, but the IOMMU API currently makes the assumption that
there's one IOMMU driver per bus_type.  Your fix to remove the bus_type
dependency from iommu_domain_alloc() adds an assumption that there is
only one IOMMU driver for all bus_types.  That may work on your
platform, but I don't think it's a valid assumption in the general case.
If you'd like to propose alternative ways to remove the bus_type
dependency, please do.  Thanks,

Alex


Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU

2014-04-07 Thread Don Dutile

On 01/10/2014 05:07 PM, Bill Sumner wrote:

v2->v3:
1. Commented-out "#define DEBUG 1" to eliminate debug messages
2. Updated the comments about changes in each version in all patches in the set.
3. Fixed: one line added to the "Copy-Translations" patch to initialize the iovad
   struct as recommended by Baoquan He [b...@redhat.com]:
   init_iova_domain(&domain->iovad, DMA_32BIT_PFN);

v1->v2:
The following series implements a fix for:
A kdump problem about DMA that has been discussed for a long time. That is,
when a kernel panics and boots into the kdump kernel, DMA started by the
panicked kernel is not stopped before the kdump kernel is booted and the
kdump kernel disables the IOMMU while this DMA continues.  This causes the
IOMMU to stop translating the DMA addresses as IOVAs and begin to treat them
as physical memory addresses -- which causes the DMA to either:
(1) generate DMAR errors or (2) generate PCI SERR errors or (3) transfer
data to or from incorrect areas of memory. Often this causes the dump to fail.

This patch set modifies the behavior of the iommu in the (new) crashdump kernel:
1. to accept the iommu hardware in an active state,
2. to leave the current translations in-place so that legacy DMA will continue
using its current buffers until the device drivers in the crashdump kernel
initialize and initialize their devices,
3. to use different portions of the iova address ranges for the device drivers
in the crashdump kernel than the iova ranges that were in-use at the time
of the panic.
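Point (3) above -- keeping the crashdump kernel's IOVA allocations disjoint from the panicked kernel's -- can be sketched as a trivial bump allocator (purely illustrative; the real series partitions the iova domains differently):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 0x1000ULL

/* Toy allocator for the crashdump kernel: it hands out IOVAs strictly
 * above the highest address the panicked kernel had in use, so
 * in-flight legacy DMA and new driver DMA can never collide. */
static uint64_t next_iova;

static void kdump_iova_init(uint64_t old_kernel_high)
{
    next_iova = old_kernel_high + PAGE_SZ;  /* first free page above */
}

static uint64_t kdump_iova_alloc(uint64_t npages)
{
    uint64_t iova = next_iova;
    next_iova += npages * PAGE_SZ;
    return iova;
}
```

Anything the old kernel mapped below `old_kernel_high` keeps working untouched until its driver re-initializes the device.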

Advantages of this approach:
1. All manipulation of the IO-device is done by the Linux device-driver
for that device.
2. This approach behaves in a manner very similar to operation without an
active iommu.


Sorry to be late to the game -- finally getting out of a deep hole &
asked to look at this proposal...

Along this concept -- similar to operation without an active iommu --
have you considered the following:
a) if (this is crash kernel), turn *off* DMAR faults;
b) if (this is crash kernel), isolate all device DMA in the IOMMU;
c) as the second kernel configures each device, have each device use IOMMU
   hw-passthrough, i.e., the equivalent of having no IOMMU for the second,
   kexec'd kernel, but with the benefit of keeping all the other
   (potentially bad) devices sequestered / isolated until they are
   initialized & re-configured in the second kernel, *if at all* -- note:
   kexec'd kernels may not enable/configure all devices that existed in the
   first kernel (Bill: I'm sure you know this, but others may not).

RMRRs that were previously set up could continue to work if they are skipped
in step (b), unless the device has gone mad/bad.  In that case, re-parsing
the RMRR may or may not clear up the issue.

Additionally, a tidbit of information like "some servers force NMIs on DMAR
faults, and cause a system reset, thereby preventing a kdump from occurring"
should have been included as one reason to stop DMAR faults from occurring
on kexec-boot, in addition to the fact that a flood of them can lock up a
system.

Again, just turning off DMAR fault reporting for the 'if (this is crash
kernel)' case is a short-term workaround that sounds a whole lot less
expensive to implement, as does 'if (this is crash kernel), force
hw-passthrough'.

If the IO devices are borked to the point that they won't complete DMA properly,
with or without IOMMU, the system is dead anyhow, game over.

Finally, copying the previous IOMMU state to the second kernel, and hoping
that the cause of the kernel crash wasn't an errant DMA (e.g., due to a
device going bad, or its DMA state being corrupted & causing an improper IO),
omits an important failure case/space.
Keeping the first-kernel DMA isolated (IOMMU on, all translations off, all
DMAR faults off), and then allowing each device (driver) configuration to
sanely reset the device & start a new (hw-passthrough) domain, seems simpler
and cleaner for this dump-and-run kernel effort.

- Don


3. Any activity between the IO-device and its RMRR areas is handled by the
device-driver in the same manner as during a non-kdump boot.
4. If an IO-device has no driver in the kdump kernel, it is simply left alone.
This supports the practice of creating a special kdump kernel without
drivers for any devices that are not required for taking a crashdump.

Changes since the RFC version of this patch:
1. Consolidated all of the operational code into the "copy..." functions.
The "process..." functions were primarily used for diagnostics and
exploration; however, there was a small amount of operational code that
used the "process..." functions.
This operational code has been moved into the "copy..." functions.

2. Removed the "Process ..." functions and the diagnostic code that ran
on that function set.  This removed about 1/4 of the code -- which this
operational patch set no longer needs.  These portions of the RFC patch
c

Re: [PATCHv3 0/6] Crashdump Accepting Active IOMMU

2014-04-08 Thread Don Dutile

On 04/08/2014 12:14 PM, David Woodhouse wrote:

On Mon, 2014-04-07 at 16:43 -0400, Don Dutile wrote:


Additionally, a tidbit of information like "some servers force NMI's
on DMAR faults,
and cause a system reset, thereby, preventing a kdump to occur"
should have been included as one reason to stop DMAR faults from
occurring on kexec-boot,
in addition to the fact that a flood of them can lock up a system.


How about allocating a physical scratch page, and setting up a mapping
for each device such that *every* virtual address (apart from those
listed in RMRRs, perhaps) is mapped to that same scratch page?

That way you avoid the faults, but you also avoid stray DMA to parts of
the system that you don't want to get corrupted.
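The scratch-page scheme can be sketched with a toy single-level page table (real VT-d tables are multi-level; the PTE permission bits here are simplified placeholders):

```c
#include <assert.h>
#include <stdint.h>

#define PTES 512

static uint64_t scratch[512];   /* the one physical scratch page */
static uint64_t pt[PTES];       /* toy single-level IOVA table */

/* Map every IOVA slot to the same scratch page with read/write
 * permission: stray DMA completes harmlessly there instead of
 * faulting or corrupting memory the crash kernel cares about. */
static void map_all_to_scratch(void)
{
    uint64_t pte = (uint64_t)(uintptr_t)scratch | 0x3; /* toy R|W bits */
    for (int i = 0; i < PTES; i++)
        pt[i] = pte;
}
```

Only one physical page is consumed no matter how large the IOVA space is, since every entry aliases the same frame.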


+1... more isolation as second kernel booting sounds good.



Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-11 Thread Don Dutile

On 11/11/2016 06:19 AM, Joerg Roedel wrote:

On Thu, Nov 10, 2016 at 10:46:01AM -0700, Alex Williamson wrote:

In the case of x86, we know that DMA mappings overlapping the MSI
doorbells won't be translated correctly, it's not a valid mapping for
that range, and therefore the iommu driver backing the IOMMU API
should describe that reserved range and reject mappings to it.


The drivers actually allow mappings to the MSI region via the IOMMU-API,
and I think it should stay this way also for other reserved ranges.
Address space management is done by the IOMMU-API user already (and has
to be done there nowadays), be it a DMA-API implementation which just
reserves these regions in its address space allocator or be it VFIO with
QEMU, which don't map RAM there anyway. So there is no point of checking
this again in the IOMMU drivers and we can keep that out of the
mapping/unmapping fast-path.


For PCI devices userspace can examine the topology of the iommu group
and exclude MMIO ranges of peer devices based on the BARs, which are
exposed in various places, pci-sysfs as well as /proc/iomem.  For
non-PCI or MSI controllers... ???


Right, the hardware resources can be examined. But maybe this can be
extended to also cover RMRR ranges? Then we would be able to assign
devices with RMRR mappings to guests.


eh gads no!

Assigning devices w/RMRRs is a security issue waiting to happen, if it
doesn't crash the system before the guest even gets the device -- reset the
device before assignment; part of the device is gathering system
environmental data; if the BIOS/SMM support doesn't get the env. data
update, it NMIs the system for fear that it may overheat ...





Joerg





Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-11 Thread Don Dutile

On 11/11/2016 10:50 AM, Alex Williamson wrote:

On Fri, 11 Nov 2016 12:19:44 +0100
Joerg Roedel  wrote:


On Thu, Nov 10, 2016 at 10:46:01AM -0700, Alex Williamson wrote:

In the case of x86, we know that DMA mappings overlapping the MSI
doorbells won't be translated correctly, it's not a valid mapping for
that range, and therefore the iommu driver backing the IOMMU API
should describe that reserved range and reject mappings to it.


The drivers actually allow mappings to the MSI region via the IOMMU-API,
and I think it should stay this way also for other reserved ranges.
Address space management is done by the IOMMU-API user already (and has
to be done there nowadays), be it a DMA-API implementation which just
reserves these regions in its address space allocator or be it VFIO with
QEMU, which don't map RAM there anyway. So there is no point of checking
this again in the IOMMU drivers and we can keep that out of the
mapping/unmapping fast-path.


It's really just a happenstance that we don't map RAM over the x86 MSI
range though.  That property really can't be guaranteed once we mix
architectures, such as running an aarch64 VM on x86 host via TCG.
AIUI, the MSI range is actually handled differently than other DMA
ranges, so an iommu_map() overlapping a range that the iommu cannot map
should fail just like an attempt to map beyond the address width of the
iommu.
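The check Alex describes -- reject a map that hits a reserved range exactly as you would reject one beyond the IOMMU's address width -- is a simple interval test. A sketch (invented helper name, inclusive bounds):

```c
#include <assert.h>
#include <stdint.h>

struct resv { uint64_t start, end; };   /* inclusive bounds */

/* Reject a map that exceeds the IOMMU address width or intersects
 * any reserved (e.g. MSI doorbell) range. */
static int iommu_map_check(const struct resv *r, int n, uint64_t max_addr,
                           uint64_t iova, uint64_t size)
{
    uint64_t last = iova + size - 1;

    if (last > max_addr || last < iova)
        return -1;                       /* beyond address width */
    for (int i = 0; i < n; i++)
        if (iova <= r[i].end && last >= r[i].start)
            return -1;                   /* hits a reserved range */
    return 0;
}
```

With the x86 MSI window as the reserved range, a map at 0xfee00000 fails while an ordinary low mapping succeeds.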


+1. As was stated at Plumbers, x86 MSI is in a fixed, hw location, so:
1) that memory space is never a valid page for the system to use for IOVA,
   therefore, nothing to micro-manage in the iommu mapping (fast) path.
2) migration btwn different systems isn't an issue b/c all x86 systems have
   this mapping.
3) ACS resolves DMA writes to mem going to a device(-mmio space).

For aarch64, without such a 'fixed' MSI location, whatever
hole/used-space-struct concept is contrived for MSI (DMA) writes on aarch64
can't guarantee migration across mixed aarch64 systems (migrate guest-G from
sys-vendor-A to sys-vendor-B; sys-vendor-A has MSI at addr-A; sys-vendor-B
has MSI at addr-B).  Without agreement, migration is only possible across
the same systems (and can even be broken btwn two systems from the same
vendor).  ACS in the PCIe path handles the iova->dev-mmio collision
problem. q.e.d.

ergo, my proposal to put MSI space at the upper-most space of any system,
..FFFE0. ..., where hw drops the upper 1's/F's and uses that for MSI.
This allows it to vary on each system based on max memory: pseudo-fixed,
but not right smack in the middle of mem-space.
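The arithmetic behind this "relative-fixed" proposal can be illustrated as follows. This is purely a model of the idea above, not any shipping scheme: place the doorbell at the top of the implemented address space, at an offset matching x86's 0xfee00000, so hardware can decode it by ignoring the leading all-ones bits (assumes addr_bits < 64):

```c
#include <assert.h>
#include <stdint.h>

/* Compute a doorbell base in the top of an addr_bits-wide address
 * space: all-ones upper bits, x86-style 0xfee00000 in the low bits. */
static uint64_t doorbell_base(unsigned addr_bits)
{
    uint64_t top = (1ULL << addr_bits) - 1;        /* highest address */
    return (top & ~0xffffffffULL) | 0xfee00000ULL;
}
```

The base scales with the implemented address width instead of sitting at a single absolute address, which is the "pseudo-fixed" property being argued for.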

There is an inverse scenario for host phys addrs as well:
wiring the upper-most bit of the HPA to be 1==mmio, 0==mem simplifies a lot
of design issues in the cores & chipsets as well.  Alpha EV6, case in point
(an 18+ yr old design decision).  Another q.e.d.

I hate to admit it, but jcm has it right wrt a 'fixed sys addr map', at
least in this IO area.



For PCI devices userspace can examine the topology of the iommu group
and exclude MMIO ranges of peer devices based on the BARs, which are
exposed in various places, pci-sysfs as well as /proc/iomem.  For
non-PCI or MSI controllers... ???


Right, the hardware resources can be examined. But maybe this can be
extended to also cover RMRR ranges? Then we would be able to assign
devices with RMRR mappings to guests.


RMRRs are special in a different way, the VT-d spec requires that the
OS honor RMRRs, the user has no responsibility (and currently no
visibility) to make that same arrangement.  In order to potentially
protect the physical host platform, the iommu drivers should prevent a
user from remapping RMRRS.  Maybe there needs to be a different
interface used by untrusted users vs in-kernel drivers, but I think the
kernel really needs to be defensive in the case of user mappings, which
is where the IOMMU API is rooted.  Thanks,

Alex





Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-23 Thread Don Dutile

On 11/21/2016 12:13 AM, Jon Masters wrote:

On 11/07/2016 07:45 PM, Will Deacon wrote:


I figured this was a reasonable post to piggy-back on for the LPC minutes
relating to guest MSIs on arm64.


Thanks for this Will. I'm still digging out post-LPC and SC16, but the
summary was much appreciated, and I'm glad the conversation is helping.


   1. The physical memory map is not standardised (Jon pointed out that
  this is something that was realised late on)


Just to note, we discussed this one about 3-4 years ago. I recall making
a vigorous slideshow at a committee meeting in defense of having a
single memory map for ARMv8 servers and requiring everyone to follow it.
I was weak. I listened to the comments that this was "unreasonable".
Instead, I consider it was unreasonable of me to not get with the other
OS vendors and force things to be done one way. The lack of a "map at
zero" RAM location on ARMv8 has been annoying enough for 32-bit DMA only
devices on 64-bit (behind an SMMU but in passthrough mode it doesn't
help) and other issues beyond fixing the MSI doorbell regions. If I ever
have a time machine, I tried harder.


Jon pointed out that most people are pretty conservative about hardware
choices when migrating between them -- that is, they may only migrate
between different revisions of the same SoC, or they know ahead of time
all of the memory maps they want to support and this could be communicated
by way of configuration to libvirt.


I think it's certainly reasonable to assume this in an initial
implementation and fix it later. Currently, we're very conservative
about host CPU passthrough anyway and can't migrate from one microarch
to another revision of the same microarch even. And on x86, nobody
really supports e.g. Intel to AMD and back again. I've always been of

That's primarily due to different virt implementations ... vmx vs svm.
1) that's not the case here; one can argue cross-arch variations .. gicv2 vs
   gicv3 vs gicv4 ... but cross-vendor should work if the arch and a common
   feature map (like x86 libvirt can resolve) is provided
2) second chance to do it better; look 'n' learn!
   The common thread I am trying to drive home here ... learn from past
   mistakes *and* good choices.



the mind that we should ensure the architecture can handle this, but
then cautiously approach this with a default to not doing it.


Alex asked if there was a security
issue with DMA bypassing the SMMU, but there aren't currently any systems
where that is known to happen. Such a system would surely not be safe for
passthrough.


There are other potential security issues that came up but don't need to
be noted here (yet). I have wanted to clarify the SBSA for a long time
when it comes to how IOMMUs should be implemented. It's past time that
we went back and had a few conversations about that. I've poked.


Ben mused that a way to handle conflicts dynamically might be to hotplug
on the entire host bridge in the guest, passing firmware tables describing
the new reserved regions as a property of the host bridge. Whilst this
may well solve the issue, it was largely considered future work due to
its invasive nature and dependency on firmware tables (and guest support)
that do not currently exist.


Indeed. It's an elegant solution (thanks Ben) that I gather POWER
already does (good for them). We've obviously got a few things to clean
up after we get the basics in place. Again, I think we can consider it
reasonable that the MSI doorbell regions are predetermined on system A
well ahead of any potential migration (that may or may not then work)
for the moment. Vendors will want to loosen this later, and they can
drive the work to do that, for example by hotplugging a host bridge.

Jon.





Re: [RFC v3 00/10] KVM PCIe/MSI passthrough on ARM/ARM64 and IOVA reserved regions

2016-12-10 Thread Don Dutile

On 12/08/2016 04:36 AM, Auger Eric wrote:

Hi,

On 15/11/2016 14:09, Eric Auger wrote:

Following LPC discussions, we now report reserved regions through
iommu-group sysfs reserved_regions attribute file.



While I am respinning this series into v4, here is a tentative summary
of technical topics for which no consensus was reached at this point.

1) Shall we report the usable IOVA range instead of reserved IOVA
ranges. Not discussed at/after LPC.
x I currently report reserved regions. Alex expressed the need to
  report the full usable IOVA range instead (x86 min-max range
  minus MSI APIC window). I think this is meaningful for ARM
  too where arm-smmu might not support the full 64b range.
x Any objection we report the usable IOVA regions instead?

2) Shall the kernel check collision with MSI window* when userspace
calls VFIO_IOMMU_MAP_DMA?
Joerg/Will No; Alex yes
*for IOVA regions consumed downstream to the IOMMU: everyone says NO

3) RMRR reporting in the iommu group sysfs? Joerg: yes; Don: no

Um, I'm missing context, but the only thing I recall saying no to wrt RMRRs
is that _any_ device that has an RMRR cannot be assigned to a guest.
Or are you saying RMRRs should be exposed in the guest OS?  If so, then
you have my 'no' there.


My current series does not expose them in iommu group sysfs.
I understand we can expose the RMRR regions in the iomm group sysfs
without necessarily supporting RMRR requiring device assignment.

This sentence doesn't make sense to me.
Can you try re-wording it?
I can't tell what RMRR has to do w/device assignment, other than what I
said above.
Exposing RMRRs in sysfs is not an issue in general.

We can also add this support later.

Thanks

Eric




Reserved regions are populated through the IOMMU get_resv_region callback
(former get_dm_regions), now implemented by amd-iommu, intel-iommu and
arm-smmu.

The intel-iommu reports the [FEE0_0000h - FEF0_0000h] MSI window as an
IOMMU_RESV_NOMAP reserved region.

arm-smmu reports the MSI window (arbitrarily located at 0x8000000 and
1MB large) and the PCI host bridge windows.

The series integrates a not officially posted patch from Robin:
"iommu/dma: Allow MSI-only cookies".

This series currently does not address IRQ safety assessment.

Best Regards

Eric

Git: complete series available at
https://github.com/eauger/linux/tree/v4.9-rc5-reserved-rfc-v3

History:
RFC v2 -> v3:
- switch to an iommu-group sysfs API
- use new dummy allocator provided by Robin
- dummy allocator initialized by vfio-iommu-type1 after enumerating
   the reserved regions
- at the moment ARM MSI base address/size is left unchanged compared
   to v2
- we currently report reserved regions and not usable IOVA regions as
   requested by Alex

RFC v1 -> v2:
- fix intel_add_reserved_regions
- add mutex lock/unlock in vfio_iommu_type1


Eric Auger (10):
   iommu/dma: Allow MSI-only cookies
   iommu: Rename iommu_dm_regions into iommu_resv_regions
   iommu: Add new reserved IOMMU attributes
   iommu: iommu_alloc_resv_region
   iommu: Do not map reserved regions
   iommu: iommu_get_group_resv_regions
   iommu: Implement reserved_regions iommu-group sysfs file
   iommu/vt-d: Implement reserved region get/put callbacks
   iommu/arm-smmu: Implement reserved region get/put callbacks
   vfio/type1: Get MSI cookie

  drivers/iommu/amd_iommu.c   |  20 +++---
  drivers/iommu/arm-smmu.c|  52 +++
  drivers/iommu/dma-iommu.c   | 116 ++---
  drivers/iommu/intel-iommu.c |  50 ++
  drivers/iommu/iommu.c   | 141 
  drivers/vfio/vfio_iommu_type1.c |  26 
  include/linux/dma-iommu.h   |   7 ++
  include/linux/iommu.h   |  49 ++
  8 files changed, 391 insertions(+), 70 deletions(-)





Re: [PATCH 0/3] Add support for APM X-Gene SoC AHBC IOMMU driver.

2014-12-15 Thread Don Dutile

On 12/15/2014 11:55 AM, Suman Tripathi wrote:

This patch adds the support for the APM X-Gene SoC AHBC IOMMU driver.

Signed-off-by: Suman Tripathi 
---

Suman Tripathi (3):
   xgene-ahbc-iommu: Add support for APM X-Gene SoC AHBC IOMMU driver.
   arm64: dts: Add the APM X-Gene AHBC IOMMU DTS node.
   Documentation: dt-bindings: Add the binding info for APM X-Gene AHBC
 IOMMU driver.

  .../devicetree/bindings/iommu/xgene,ahbc-iommu.txt |  17 ++
  arch/arm64/boot/dts/apm-storm.dtsi |   5 +
  drivers/iommu/Kconfig  |  10 +
  drivers/iommu/Makefile |   1 +
  drivers/iommu/xgene-ahbc-iommu.c   | 336 +
  5 files changed, 369 insertions(+)
  create mode 100644 Documentation/devicetree/bindings/iommu/xgene,ahbc-iommu.txt
  create mode 100644 drivers/iommu/xgene-ahbc-iommu.c

--
1.8.2.1



Have you tested this with a kernel that uses 64K page size?



Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-08 Thread Don Dutile

On 11/07/2016 09:45 PM, Will Deacon wrote:

Hi all,

I figured this was a reasonable post to piggy-back on for the LPC minutes
relating to guest MSIs on arm64.

On Thu, Nov 03, 2016 at 10:02:05PM -0600, Alex Williamson wrote:

We can always have QEMU reject hot-adding the device if the reserved
region overlaps existing guest RAM, but I don't even really see how we
advise users to give them a reasonable chance of avoiding that
possibility.  Apparently there are also ARM platforms where MSI pages
cannot be remapped to support the previous programmable user/VM
address, is it even worthwhile to support those platforms?  Does that
decision influence whether user programmable MSI reserved regions are
really a second class citizen to fixed reserved regions?  I expect
we'll be talking about this tomorrow morning, but I certainly haven't
come up with any viable solutions to this.  Thanks,


At LPC last week, we discussed guest MSIs on arm64 as part of the PCI
microconference. I presented some slides to illustrate some of the issues
we're trying to solve:

   http://www.willdeacon.ukfsn.org/bitbucket/lpc-16/msi-in-guest-arm64.pdf

Punit took some notes (thanks!) on the etherpad here:

   https://etherpad.openstack.org/p/LPC2016_PCI

although the discussion was pretty lively and jumped about, so I've had
to go from memory where the notes didn't capture everything that was
said.

To summarise, arm64 platforms differ in their handling of MSIs when compared
to x86:

   1. The physical memory map is not standardised (Jon pointed out that
  this is something that was realised late on)
   2. MSIs are usually treated the same as DMA writes, in that they must be
  mapped by the SMMU page tables so that they target a physical MSI
  doorbell
   3. On some platforms, MSIs bypass the SMMU entirely (e.g. due to an MSI
  doorbell built into the PCI RC)

Chalk this case up to 'the learning curve'.
The Q35 chipset (the one being used for the x86-PCIe qemu model) had no
intr-remap hw, only DMA addrs destined for real memory.  Assigned-device
intrs had to be caught by kvm & injected into guests, and yes, a DoS was
possible... and thus the intr-remap support was done after the initial
iommu support.



   4. Platforms typically have some set of addresses that abort before
  reaching the SMMU (e.g. because the PCI identifies them as P2P).

ARM platforms that don't implement the equivalent of ACS (in PCI bridges
within a PCIe switch) are either not device-assignment capable, or the IOMMU
domain expands across the entire peer-to-peer (sub-)tree.
ACS(-like) functionality is a fundamental component of the security model,
as is the IOMMU itself.  Without it, it's equivalent to not having an IOMMU.

Dare I ask?: Can these systems, or parts of these systems, just be deemed
"incomplete" or "not 100% secure" wrt device assignment, while other systems
can or will be ???
Not much different than the first x86 systems that tried to get it right
the first few times... :-/
I'm hearing of (parts of) systems that are just not properly designed
for the device-assignment use-case, probably b/c this (system) architecture
hasn't been pulled together from the various component architectures
(CPU, SMMU, IOT, etc.).



All of this means that userspace (QEMU) needs to identify the memory
regions corresponding to points (3) and (4) and ensure that they are
not allocated in the guest physical (IPA) space. For platforms that can
remap the MSI doorbell as in (2), then some space also needs to be
allocated for that.


Again, proper ACS features/control eliminates this need.
A (multi-function) device should never be able to perform
IO to itself via its PCIe interface.  Bridge-ACS pushes everything
up to the SMMU for destination resolution.
Without ACS, I don't see how a guest is migratable from one system to
another, unless the system-migration-group consists of systems that
are exactly the same (wrt IO) [less/more system memory &/or cpu does
not affect the VM system map].

Again, the initial Linux implementation did not have ACS, but was 'resolved'
by the default/common system mapping putting the PCI devices
into an area that was blocked from memory use (generally the 3G->4G space).
ARM may not have that single, simple implementation, but a method to
indicate reserved regions, and then a check for matching/non-matching
reserved regions for guest migration, is the only way I see to resolve this
issue until ACS is sufficiently supported in the hw subsystems to be used
for device-assignment.


Rather than treat these as separate problems, a better interface is to
tell userspace about a set of reserved regions, and have this include
the MSI doorbell, irrespective of whether or not it can be remapped.
Don suggested that we statically pick an address for the doorbell in a
similar way to x86, and have the kernel map it there. We could even pick
0xfee00000.

I suggest picking a 'relative-fixed' address: the last n pages of system
memory address space, i.e.,
   0xfff[]fee0.

Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-08 Thread Don Dutile

On 11/08/2016 12:54 PM, Will Deacon wrote:

On Tue, Nov 08, 2016 at 03:27:23PM +0100, Auger Eric wrote:

On 08/11/2016 03:45, Will Deacon wrote:

Rather than treat these as separate problems, a better interface is to
tell userspace about a set of reserved regions, and have this include
the MSI doorbell, irrespective of whether or not it can be remapped.
Don suggested that we statically pick an address for the doorbell in a
similar way to x86, and have the kernel map it there. We could even pick
0xfee00000. If it conflicts with a reserved region on the platform (due
to (4)), then we'd obviously have to (deterministically?) allocate it
somewhere else, but probably within the bottom 4G.

This is tentatively achieved now with
[1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II
(http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1264506.html)

Yup, I saw that fly by. Hopefully some of the internals can be reused
with the current thinking on user ABI.


The next question is how to tell userspace about all of the reserved
regions. Initially, the idea was to extend VFIO, however Alex pointed
out a horrible scenario:

   1. QEMU spawns a VM on system 0
   2. VM is migrated to system 1
   3. QEMU attempts to passthrough a device using PCI hotplug

In this scenario, the guest memory map is chosen at step (1), yet there
is no VFIO fd available to determine the reserved regions. Furthermore,
the reserved regions may vary between system 0 and system 1. This pretty
much rules out using VFIO to determine the reserved regions. Alex suggested
that the SMMU driver can advertise the regions via /sys/class/iommu/. This
would solve part of the problem, but migration between systems with
different memory maps can still cause problems if the reserved regions
of the new system conflict with the guest memory map chosen by QEMU.


OK so I understand we do not want anymore the VFIO chain capability API
(patch 5 of above series) but we prefer a sysfs approach instead.

Right.


I understand the sysfs approach, which allows userspace to get the
info earlier and independently of VFIO. Keep in mind that current QEMU
virt - which is not the only userspace - will not do much with this info
until we bring upheavals to virt address space management. So if I am
not wrong, at the moment the main action to be undertaken is the
rejection of PCI hotplug when we detect a collision.

I don't think so; it should be up to userspace to reject the hotplug.
If userspace doesn't have support for the regions, then that's fine --
you just end up in a situation where the CPU page table maps memory
somewhere that the device can't see. In other words, you'll end up with
spurious DMA failures, but that's exactly what happens with current systems
if you passthrough an overlapping region (Robin demonstrated this on Juno).

Additionally, you can imagine some future support where you can tell the
guest not to use certain regions of its memory for DMA. In this case, you
wouldn't want to refuse the hotplug in the case of overlapping regions.

Really, I think the kernel side just needs to enumerate the fixed reserved
regions, place the doorbell at a fixed address and then advertise these
via sysfs.


I can respin [1]
- studying and taking into account Robin's comments about dm_regions
similarities
- removing the VFIO capability chain and replacing this by a sysfs API

Ideally, this would be reusable between different SMMU drivers so the sysfs
entries have the same format etc.


Would that be OK?

Sounds good to me. Are you in a position to prototype something on the qemu
side once we've got kernel-side agreement?


What about Alex comments who wanted to report the usable memory ranges
instead of unusable memory ranges?

Also did you have a chance to discuss the following items:
1) the VFIO irq safety assessment

The discussion really focussed on system topology, as opposed to properties
of the doorbell. Regardless of how the device talks to the doorbell, if
the doorbell can't protect against things like MSI spoofing, then it's
unsafe. My opinion is that we shouldn't allow passthrough by default on
systems with unsafe doorbells (we could piggyback on allow_unsafe_interrupts
cmdline option to VFIO).

A first step would be making all this opt-in, and only supporting GICv3
ITS for now.

You're trying to support a config that is < GICv3 and has no ITS? That would
be the equivalent of x86 pre-intr-remap, and that's why the
allow_unsafe_interrupts hook was created ... to enable devel/kick-the-tires.

2) the MSI reserved size computation (is an arbitrary size OK?)

If we fix the base address, we could fix a size too. However, we'd still
need to enumerate the doorbells to check that they fit in the region we
have. If not, then we can warn during boot and treat it the same way as
a resource conflict (that is, reallocate the region in some deterministic
way).

Will


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-08 Thread Don Dutile

On 11/08/2016 06:35 PM, Alex Williamson wrote:

On Tue, 8 Nov 2016 21:29:22 +0100
Christoffer Dall  wrote:


Hi Will,

On Tue, Nov 08, 2016 at 02:45:59AM +, Will Deacon wrote:

Hi all,

I figured this was a reasonable post to piggy-back on for the LPC minutes
relating to guest MSIs on arm64.

On Thu, Nov 03, 2016 at 10:02:05PM -0600, Alex Williamson wrote:

We can always have QEMU reject hot-adding the device if the reserved
region overlaps existing guest RAM, but I don't even really see how we
advise users to give them a reasonable chance of avoiding that
possibility.  Apparently there are also ARM platforms where MSI pages
cannot be remapped to support the previous programmable user/VM
address, is it even worthwhile to support those platforms?  Does that
decision influence whether user programmable MSI reserved regions are
really a second class citizen to fixed reserved regions?  I expect
we'll be talking about this tomorrow morning, but I certainly haven't
come up with any viable solutions to this.  Thanks,


At LPC last week, we discussed guest MSIs on arm64 as part of the PCI
microconference. I presented some slides to illustrate some of the issues
we're trying to solve:

   http://www.willdeacon.ukfsn.org/bitbucket/lpc-16/msi-in-guest-arm64.pdf

Punit took some notes (thanks!) on the etherpad here:

   https://etherpad.openstack.org/p/LPC2016_PCI

although the discussion was pretty lively and jumped about, so I've had
to go from memory where the notes didn't capture everything that was
said.

To summarise, arm64 platforms differ in their handling of MSIs when compared
to x86:

   1. The physical memory map is not standardised (Jon pointed out that
  this is something that was realised late on)
   2. MSIs are usually treated the same as DMA writes, in that they must be
  mapped by the SMMU page tables so that they target a physical MSI
  doorbell
   3. On some platforms, MSIs bypass the SMMU entirely (e.g. due to an MSI
  doorbell built into the PCI RC)
   4. Platforms typically have some set of addresses that abort before
  reaching the SMMU (e.g. because the PCI identifies them as P2P).

All of this means that userspace (QEMU) needs to identify the memory
regions corresponding to points (3) and (4) and ensure that they are
not allocated in the guest physical (IPA) space. For platforms that can
remap the MSI doorbell as in (2), then some space also needs to be
allocated for that.

Rather than treat these as separate problems, a better interface is to
tell userspace about a set of reserved regions, and have this include
the MSI doorbell, irrespective of whether or not it can be remapped.


Is my understanding correct, that you need to tell userspace about the
location of the doorbell (in the IOVA space) in case (2), because even
though the configuration of the device is handled by the (host) kernel
through trapping of the BARs, we have to avoid the VFIO user programming
the device to create other DMA transactions to this particular address,
since that will obviously conflict and either not produce the desired
DMA transactions or result in unintended weird interrupts?


Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
it's potentially a DMA target and we'll get bogus data on DMA read from
the device, and lose data and potentially trigger spurious interrupts on
DMA write from the device.  Thanks,

Alex


That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
they address-match before the SMMU checks are done. If all DMA addrs had
to go through the SMMU first, then the DMA access could be ignored/rejected.
For bare metal, memory can't be put in the same place as MSI addrs, or
DMA could never reach it. So it's only a virt issue, unless the VM's memory
address range mimics the host layout.

- Don



Re: Summary of LPC guest MSI discussion in Santa Fe

2016-11-09 Thread Don Dutile

On 11/09/2016 12:03 PM, Will Deacon wrote:

On Tue, Nov 08, 2016 at 09:52:33PM -0500, Don Dutile wrote:

On 11/08/2016 06:35 PM, Alex Williamson wrote:

On Tue, 8 Nov 2016 21:29:22 +0100
Christoffer Dall  wrote:

Is my understanding correct, that you need to tell userspace about the
location of the doorbell (in the IOVA space) in case (2), because even
though the configuration of the device is handled by the (host) kernel
through trapping of the BARs, we have to avoid the VFIO user programming
the device to create other DMA transactions to this particular address,
since that will obviously conflict and either not produce the desired
DMA transactions or result in unintended weird interrupts?


Yes, that's the crux of the issue.


Correct, if the MSI doorbell IOVA range overlaps RAM in the VM, then
it's potentially a DMA target and we'll get bogus data on DMA read from
the device, and lose data and potentially trigger spurious interrupts on
DMA write from the device.  Thanks,


That's b/c the MSI doorbells are not positioned *above* the SMMU, i.e.,
they address-match before the SMMU checks are done. If all DMA addrs had
to go through the SMMU first, then the DMA access could be ignored/rejected.


That's actually not true :( The SMMU can't generally distinguish between MSI
writes and DMA writes, so it would just see a write transaction to the
doorbell address, regardless of how it was generated by the endpoint.

Will


So, we have real systems where MSI doorbells are placed at the same IOVA
that could hold memory for a guest, but not at the same IOVA as memory on
real hw?
How are memory holes passed to the SMMU so it doesn't have this issue on
bare metal (assigning an IOVA that overlaps an MSI doorbell address)?






Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-04-07 Thread Don Dutile

On 04/06/2015 11:46 PM, Dave Young wrote:

On 04/05/15 at 09:54am, Baoquan He wrote:

On 04/03/15 at 05:21pm, Dave Young wrote:

On 04/03/15 at 05:01pm, Li, ZhenHua wrote:

Hi Dave,

There may be some possibilities that the old iommu data is corrupted by
some other modules. Currently we do not have a better solution for the
dmar faults.

But I think when this happens, we need to fix the module that corrupted
the old iommu data. I once met a similar problem in normal kernel, the
queue used by the qi_* functions was written again by another module.
The fix was in that module, not in iommu module.


It is too late, there will be no chance to save vmcore then.

Also if it is possible to continue corrupt other area of oldmem because
of using old iommu tables then it will cause more problems.

So I think the tables at least need some verification before being used.



Yes, it's good thinking about this, and verification is also an
interesting idea. kexec/kdump does a sha256 calculation on the loaded kernel
and then verifies it again when panic happens in purgatory. This checks
whether any code stomps into the region reserved for the kexec kernel and
corrupts the loaded kernel.

If it is decided to do this, it should be an enhancement to the current
patchset, not an approach change. Since this patchset is very close to
what the maintainers expected, maybe it can be merged first, and then we
can think about enhancements. After all, without this patchset vt-d often
raises error messages and hangs.


It does not convince me; we should do it right at the beginning instead of
introducing something wrong.

I wonder why the old DMA cannot be remapped to a specific page in the kdump
kernel so that it will not corrupt more memory. But I may have missed
something; I will look for the old threads and catch up.

Thanks
Dave


The (only) issue is not corruption: once the iommu is re-configured, the
old, not-stopped-yet DMA engines will use iova's that will generate dmar
faults, which will be enabled when the iommu is re-configured (even to a
single/simple paging scheme) in the kexec kernel.



Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-04 Thread Don Dutile

On 05/04/2015 07:05 AM, Joerg Roedel wrote:

On Fri, Apr 03, 2015 at 04:40:31PM +0800, Dave Young wrote:

Have not read all the patches, but I have a question, not sure this
has been answered before. Old memory is not reliable, what if the old
memory get corrupted before panic? Is it safe to continue using it in
2nd kernel, I worry that it will cause problems.


Yes, the old memory could be corrupted, and there are more failure cases
left which we have no way of handling yet (if iommu data structures are
in kdump backup areas).

The question is what to do if we find some of the old data structures
corrupted, hand how far should the tests go. Should we also check the
page-tables, for example? I think if some of the data structures for a
device are corrupted it probably already failed in the old kernel and
things won't get worse in the new one.

So checking is not strictly necessary in the first version of these
patches (unless we find a valid failure scenario). Once we have some
good plan on what to do if we find corruption, we can add checking of
course.


Regards,

Joerg



Agreed.  This is a significant improvement over what we (don't) have.

Corruption related to IOMMU must occur within the host, and it must be
a software corruption, b/c the IOMMU inherently protects itself by protecting
all of memory from errant DMAs.  Therefore, if the only IOMMU corruptor is
in the host, it's likely the entire host kernel crash dump will either be
useless, or corrupted by the security breach, at which point,
this is just another scenario of a failed crash dump that will never be taken.

The kernel can't protect the mapping tables, which are the most likely area
to be corrupted, b/c it'd (minimally) have to be per-device (to avoid locking
& coherency issues), and would require significant overhead to keep/update a
checksum-like scheme on (potentially) 4 levels of page tables.




Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-07 Thread Don Dutile

On 05/07/2015 10:00 AM, Dave Young wrote:

On 04/07/15 at 10:12am, Don Dutile wrote:

On 04/06/2015 11:46 PM, Dave Young wrote:

On 04/05/15 at 09:54am, Baoquan He wrote:

On 04/03/15 at 05:21pm, Dave Young wrote:

On 04/03/15 at 05:01pm, Li, ZhenHua wrote:

Hi Dave,

There may be some possibilities that the old iommu data is corrupted by
some other modules. Currently we do not have a better solution for the
dmar faults.

But I think when this happens, we need to fix the module that corrupted
the old iommu data. I once met a similar problem in normal kernel, the
queue used by the qi_* functions was written again by another module.
The fix was in that module, not in iommu module.


It is too late, there will be no chance to save vmcore then.

Also if it is possible to continue corrupt other area of oldmem because
of using old iommu tables then it will cause more problems.

So I think the tables at least need some verifycation before being used.



Yes, it's a good thinking anout this and verification is also an
interesting idea. kexec/kdump do a sha256 calculation on loaded kernel
and then verify this again when panic happens in purgatory. This checks
whether any code stomps into region reserved for kexec/kernel and corrupt
the loaded kernel.

If this is decided to do it should be an enhancement to current
patchset but not a approach change. Since this patchset is going very
close to point as maintainers expected maybe this can be merged firstly,
then think about enhancement. After all without this patchset vt-d often
raised error message, hung.


It does not convince me, we should do it right at the beginning instead of
introduce something wrong.

I wonder why the old dma can not be remap to a specific page in kdump kernel
so that it will not corrupt more memory. But I may missed something, I will
looking for old threads and catch up.

Thanks
Dave


The (only) issue is not corruption, but once the iommu is re-configured, the 
old,
not-stopped-yet, dma engines will use iova's that will generate dmar faults, 
which
will be enabled when the iommu is re-configured (even to a single/simple paging 
scheme)
in the kexec kernel.



Don, so if iommu is not reconfigured then these faults will not happen?


Well, if the iommu is not reconfigured, then if the crash isn't caused by
an IOMMU fault (some systems have firmware-first catch the IOMMU fault &
convert it into an NMI_IOCK), the DMA's will continue into the old kernel
memory space.


Baoquan and I were confused today about the following, regarding iommu=off/intel_iommu=off:

intel_iommu_init()
{
...

dmar_table_init();

disable active iommu translations;

if (no_iommu || dmar_disabled)
 goto out_free_dmar;

...
}

Any reason not to move the no_iommu check to the beginning of the intel_iommu_init function?


What does that do/help?


Thanks
Dave





Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-10 Thread Don Dutile

On 05/07/2015 10:00 AM, Dave Young wrote:

On 04/07/15 at 10:12am, Don Dutile wrote:

On 04/06/2015 11:46 PM, Dave Young wrote:

On 04/05/15 at 09:54am, Baoquan He wrote:

On 04/03/15 at 05:21pm, Dave Young wrote:

On 04/03/15 at 05:01pm, Li, ZhenHua wrote:

Hi Dave,

There may be some possibilities that the old iommu data is corrupted by
some other modules. Currently we do not have a better solution for the
dmar faults.

But I think when this happens, we need to fix the module that corrupted
the old iommu data. I once met a similar problem in normal kernel, the
queue used by the qi_* functions was written again by another module.
The fix was in that module, not in iommu module.


It is too late, there will be no chance to save vmcore then.

Also if it is possible to continue corrupt other area of oldmem because
of using old iommu tables then it will cause more problems.

So I think the tables at least need some verifycation before being used.



Yes, it's a good thinking anout this and verification is also an
interesting idea. kexec/kdump do a sha256 calculation on loaded kernel
and then verify this again when panic happens in purgatory. This checks
whether any code stomps into region reserved for kexec/kernel and corrupt
the loaded kernel.

If this is decided to do it should be an enhancement to current
patchset but not a approach change. Since this patchset is going very
close to point as maintainers expected maybe this can be merged firstly,
then think about enhancement. After all without this patchset vt-d often
raised error message, hung.


It does not convince me, we should do it right at the beginning instead of
introduce something wrong.

I wonder why the old dma can not be remap to a specific page in kdump kernel
so that it will not corrupt more memory. But I may missed something, I will
looking for old threads and catch up.

Thanks
Dave


The (only) issue is not corruption, but once the iommu is re-configured, the 
old,
not-stopped-yet, dma engines will use iova's that will generate dmar faults, 
which
will be enabled when the iommu is re-configured (even to a single/simple paging 
scheme)
in the kexec kernel.



Don, so if iommu is not reconfigured then these faults will not happen?

Baoquan and me has a confusion below today about iommu=off/intel_iommu=off:

intel_iommu_init()
{
...

dmar_table_init();

disable active iommu translations;

if (no_iommu || dmar_disabled)
 goto out_free_dmar;

...
}

Any reason not move no_iommu check to the begining of intel_iommu_init function?

Thanks
Dave

Looks like you could.


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH v9 0/10] iommu/vt-d: Fix intel vt-d faults in kdump kernel

2015-05-10 Thread Don Dutile

On 05/07/2015 09:21 PM, Dave Young wrote:

On 05/07/15 at 10:25am, Don Dutile wrote:

On 05/07/2015 10:00 AM, Dave Young wrote:

On 04/07/15 at 10:12am, Don Dutile wrote:

On 04/06/2015 11:46 PM, Dave Young wrote:

On 04/05/15 at 09:54am, Baoquan He wrote:

On 04/03/15 at 05:21pm, Dave Young wrote:

On 04/03/15 at 05:01pm, Li, ZhenHua wrote:

Hi Dave,

There may be some possibilities that the old iommu data is corrupted by
some other modules. Currently we do not have a better solution for the
dmar faults.

But I think when this happens, we need to fix the module that corrupted
the old iommu data. I once met a similar problem in normal kernel, the
queue used by the qi_* functions was written again by another module.
The fix was in that module, not in iommu module.


It is too late, there will be no chance to save vmcore then.

Also if it is possible to continue corrupt other area of oldmem because
of using old iommu tables then it will cause more problems.

So I think the tables at least need some verifycation before being used.



Yes, it's a good thinking anout this and verification is also an
interesting idea. kexec/kdump do a sha256 calculation on loaded kernel
and then verify this again when panic happens in purgatory. This checks
whether any code stomps into region reserved for kexec/kernel and corrupt
the loaded kernel.

If this is decided to do it should be an enhancement to current
patchset but not a approach change. Since this patchset is going very
close to point as maintainers expected maybe this can be merged firstly,
then think about enhancement. After all without this patchset vt-d often
raised error message, hung.


It does not convince me, we should do it right at the beginning instead of
introduce something wrong.

I wonder why the old dma can not be remap to a specific page in kdump kernel
so that it will not corrupt more memory. But I may missed something, I will
looking for old threads and catch up.

Thanks
Dave


The (only) issue is not corruption, but once the iommu is re-configured, the 
old,
not-stopped-yet, dma engines will use iova's that will generate dmar faults, 
which
will be enabled when the iommu is re-configured (even to a single/simple paging 
scheme)
in the kexec kernel.



Don, so if iommu is not reconfigured then these faults will not happen?


Well, if iommu is not reconfigured, then if the crash isn't caused by
an IOMMU fault (some systems have firmware-first catch the IOMMU fault & convert
them into NMI_IOCK), then the DMA's will continue into the old kernel memory 
space.


So NMI_IOCK is one reason for a kernel hang. I think I'm still not clear
about what re-configured means, though. DMAR faults would happen originally
(this is the old behavior), but we are removing the faults by allowing DMA
to continue into the old memory space.


A flood of faults occurs when the 2nd kernel (re-)configures the IOMMU,
because the second kernel effectively clears/disables all DMA except RMRRs,
so any DMA from the 1st kernel floods the system with faults. It's the flood
of dmar faults that eventually wedges &/or crashes the system while trying
to take a kdump.




Baoquan and me has a confusion below today about iommu=off/intel_iommu=off:

intel_iommu_init()
{
...

dmar_table_init();

disable active iommu translations;

if (no_iommu || dmar_disabled)
 goto out_free_dmar;

...
}

Any reason not move no_iommu check to the begining of intel_iommu_init function?


What does that do/help?


I just do not know why the previous handling is necessary with iommu=off;
shouldn't we do nothing and return earlier?

Also, a guess: dmar faults appear after iommu_init, so I am not sure whether
the code before the dmar_disabled check has some effect on enabling the
fault messages.

Thanks
Dave





Re: [PATCH 0/6] IOMMU/DMA map_resource support for peer-to-peer

2015-05-11 Thread Don Dutile

On 05/07/2015 02:11 PM, Jerome Glisse wrote:

On Thu, May 07, 2015 at 12:16:30PM -0500, Bjorn Helgaas wrote:

On Thu, May 7, 2015 at 11:23 AM, William Davis  wrote:

From: Bjorn Helgaas [mailto:bhelg...@google.com]
Sent: Thursday, May 7, 2015 8:13 AM
To: Yijing Wang
Cc: William Davis; Joerg Roedel; open list:INTEL IOMMU (VT-d); linux-
p...@vger.kernel.org; Terence Ripperda; John Hubbard; Jerome Glisse; Dave
Jiang; David S. Miller; Alex Williamson
Subject: Re: [PATCH 0/6] IOMMU/DMA map_resource support for peer-to-peer

On Wed, May 6, 2015 at 8:48 PM, Yijing Wang  wrote:

On 2015/5/7 6:18, Bjorn Helgaas wrote:

[+cc Yijing, Dave J, Dave M, Alex]

On Fri, May 01, 2015 at 01:32:12PM -0500, wda...@nvidia.com wrote:

From: Will Davis 

Hi,

This patch series adds DMA APIs to map and unmap a struct resource
to and from a PCI device's IOVA domain, and implements the AMD,
Intel, and nommu versions of these interfaces.

This solves a long-standing problem with the existing DMA-remapping
interfaces, which require that a struct page be given for the region
to be mapped into a device's IOVA domain. This requirement cannot
support peer device BAR ranges, for which no struct pages exist.
...



I think we currently assume there's no peer-to-peer traffic.

I don't know whether changing that will break anything, but I'm
concerned about these:

   - PCIe MPS configuration (see pcie_bus_configure_settings()).


I think it should be ok for PCIe MPS configuration, PCIE_BUS_PEER2PEER
force every device's MPS to 128B, what its concern is the TLP payload
size. In this series, it seems to only map a iova for device bar region.


MPS configuration makes assumptions about whether there will be any peer-
to-peer traffic.  If there will be none, MPS can be configured more
aggressively.

I don't think Linux has any way to detect whether a driver is doing peer-
to-peer, and there's no way to prevent a driver from doing it.
We're stuck with requiring the user to specify boot options
("pci=pcie_bus_safe", "pci=pcie_bus_perf", "pci=pcie_bus_peer2peer",
etc.) that tell the PCI core what the user expects to happen.

This is a terrible user experience.  The user has no way to tell what
drivers are going to do.  If he specifies the wrong thing, e.g., "assume no
peer-to-peer traffic," and then loads a driver that does peer-to-peer, the
kernel will configure MPS aggressively and when the device does a peer-to-
peer transfer, it may cause a Malformed TLP error.



I agree that this isn't a great user experience, but just want to clarify
that this problem is orthogonal to this patch series, correct?

Prior to this series, the MPS mismatch is still possible with p2p traffic,
but when an IOMMU is enabled p2p traffic will result in DMAR faults. The
aim of the series is to allow drivers to fix the latter, not the former.


Prior to this series, there wasn't any infrastructure for drivers to
do p2p, so it was mostly reasonable to assume that there *was* no p2p
traffic.

I think we currently default to doing nothing to MPS.  Prior to this
series, it might have been reasonable to optimize based on a "no-p2p"
assumption, e.g., default to pcie_bus_safe or pcie_bus_perf.  After
this series, I'm not sure what we could do, because p2p will be much
more likely.

It's just an issue; I don't know what the resolution is.


Can't we just have each device update its MPS at runtime. So if device A
decide to map something from device B then device A update MPS for A and
B to lowest common supported value.

Of course you need to keep track of that per device so that if a device C
comes around and want to exchange with device B and both C and B support
higher payload than A then if C reprogram B it will trigger issue for A.

I know we update other PCIE configuration parameter at runtime for GPU,
dunno if it is widely tested for other devices.


I believe all these cases are btwn endpts and the upstream ports of the
PCIe port/host-bridge/PCIe switch they are connected to, i.e., true wire
peers -- not across a PCIe domain, which is the context of the p2p that the
MPS has to span.



Cheers,
Jérôme





Re: [PATCH v2 10/11] PCI: Stop caching ATS Invalidate Queue Depth

2015-07-27 Thread Don Dutile

On 07/20/2015 08:15 PM, Bjorn Helgaas wrote:

Stop caching the Invalidate Queue Depth in struct pci_dev.
pci_ats_queue_depth() is typically called only once per device, and it
returns a fixed value per-device, so callers who need the value frequently
can cache it themselves.

Signed-off-by: Bjorn Helgaas 
---
  drivers/pci/ats.c   |9 -
  include/linux/pci.h |1 -
  2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index 67524a7..bdb1383 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -20,16 +20,12 @@
  void pci_ats_init(struct pci_dev *dev)
  {
int pos;
-   u16 cap;

pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
if (!pos)
return;

dev->ats_cap = pos;
-   pci_read_config_word(dev, dev->ats_cap + PCI_ATS_CAP, &cap);
-   dev->ats_qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
-   PCI_ATS_MAX_QDEP;
  }

  /**
@@ -131,13 +127,16 @@ EXPORT_SYMBOL_GPL(pci_restore_ats_state);
   */
  int pci_ats_queue_depth(struct pci_dev *dev)
  {
+   u16 cap;
+
if (!dev->ats_cap)
return -EINVAL;

if (dev->is_virtfn)
return 0;


Hmmm, isn't one of the fixes in this patch set to change the caching of the
ATS queue depth? Won't the above make it 0 for all VFs? Is that the correct,
or desired (virtual), value?


-   return dev->ats_qdep;
+   pci_read_config_word(dev, dev->ats_cap + PCI_ATS_CAP, &cap);
+   return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) : PCI_ATS_MAX_QDEP;
  }
  EXPORT_SYMBOL_GPL(pci_ats_queue_depth);

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 307f96a..4b484fd 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -378,7 +378,6 @@ struct pci_dev {
};
u16 ats_cap;/* ATS Capability offset */
u8  ats_stu;/* ATS Smallest Translation Unit */
-   u8  ats_qdep;   /* ATS Invalidate Queue Depth */
atomic_tats_ref_cnt;/* number of VFs with ATS enabled */
  #endif
phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */






Re: [PATCH v2 10/11] PCI: Stop caching ATS Invalidate Queue Depth

2015-07-27 Thread Don Dutile

On 07/27/2015 06:27 PM, Bjorn Helgaas wrote:

Hi Don,

On Mon, Jul 27, 2015 at 10:00:53AM -0400, Don Dutile wrote:

On 07/20/2015 08:15 PM, Bjorn Helgaas wrote:

Stop caching the Invalidate Queue Depth in struct pci_dev.
pci_ats_queue_depth() is typically called only once per device, and it
returns a fixed value per-device, so callers who need the value frequently
can cache it themselves.

Signed-off-by: Bjorn Helgaas 
---
  drivers/pci/ats.c   |9 -
  include/linux/pci.h |1 -
  2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index 67524a7..bdb1383 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -20,16 +20,12 @@
  void pci_ats_init(struct pci_dev *dev)
  {
int pos;
-   u16 cap;

pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ATS);
if (!pos)
return;

dev->ats_cap = pos;
-   pci_read_config_word(dev, dev->ats_cap + PCI_ATS_CAP, &cap);
-   dev->ats_qdep = PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) :
-   PCI_ATS_MAX_QDEP;
  }

  /**
@@ -131,13 +127,16 @@ EXPORT_SYMBOL_GPL(pci_restore_ats_state);
   */
  int pci_ats_queue_depth(struct pci_dev *dev)
  {
+   u16 cap;
+
if (!dev->ats_cap)
return -EINVAL;

if (dev->is_virtfn)
return 0;


hmmm, isn't one of the fixes in this patch set to
change the caching of ats queue depth ?  won't above
make it 0 for all VF's ?  is that correct, or desired (virtual) value?


Per spec (SR-IOV r1.1., sec 3.7.4), the ATS queue depth register on a VF
always contains zero.

Here's the v4.1 code:

 int pci_ats_queue_depth(struct pci_dev *dev)
 {
 if (dev->is_virtfn)
 return 0;

 if (dev->ats)
 return dev->ats->qdep;

...


In v4.1, pci_ats_queue_depth() always returned 0 for a VF.  For VFs, we
didn't look at the dev->ats->qdep cache.  So I don't think this changes
anything for the caller.

A previous patch changed this path so we return -EINVAL instead of 0 for
VFs that don't support ATS.  I think the previous behavior there was wrong,
but I doubt anybody noticed.

Bjorn


ok. thanks for the sanity check.


-   return dev->ats_qdep;
+   pci_read_config_word(dev, dev->ats_cap + PCI_ATS_CAP, &cap);
+   return PCI_ATS_CAP_QDEP(cap) ? PCI_ATS_CAP_QDEP(cap) : PCI_ATS_MAX_QDEP;
  }
  EXPORT_SYMBOL_GPL(pci_ats_queue_depth);

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 307f96a..4b484fd 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -378,7 +378,6 @@ struct pci_dev {
};
u16 ats_cap;/* ATS Capability offset */
u8  ats_stu;/* ATS Smallest Translation Unit */
-   u8  ats_qdep;   /* ATS Invalidate Queue Depth */
atomic_tats_ref_cnt;/* number of VFs with ATS enabled */
  #endif
phys_addr_t rom; /* Physical address of ROM if it's not from the BAR */








Re: [PATCH 00/16] Add new DMA mapping operation for P2PDMA

2021-05-11 Thread Don Dutile

On 4/8/21 1:01 PM, Logan Gunthorpe wrote:

Hi,

This patchset continues my work to add P2PDMA support to the common
dma map operations. This allows for creating SGLs that have both P2PDMA
and regular pages, which is a necessary step toward allowing P2PDMA pages in
userspace.

The earlier RFC[1] generated a lot of great feedback and I heard no
show-stopping objections. Thus, I've incorporated all the feedback and have
decided to post this as a proper patch series with hopes of eventually
getting it in mainline.

I'm happy to do a few more passes if anyone has any further feedback
or better ideas.

This series is based on v5.12-rc6 and a git branch can be found here:

   https://github.com/sbates130272/linux-p2pmem/  p2pdma_map_ops_v1

Thanks,

Logan

[1] 
https://lore.kernel.org/linux-block/20210311233142.7900-1-log...@deltatee.com/


Changes since the RFC:
  * Added comment and fixed up the pci_get_slot patch. (per Bjorn)
  * Fixed glaring sg_phys() double offset bug. (per Robin)
  * Created a new map operation (dma_map_sg_p2pdma()) with a new calling
convention instead of modifying the calling convention of
dma_map_sg(). (per Robin)
  * Integrated the two similar pci_p2pdma_dma_map_type() and
pci_p2pdma_map_type() functions into one (per Ira)
  * Reworked some of the logic in the map_sg() implementations into
helpers in the p2pdma code. (per Christoph)
  * Dropped a bunch of unnecessary symbol exports (per Christoph)
  * Expanded the code in dma_pci_p2pdma_supported() for clarity. (per
Ira and Christoph)
  * Finished off using the new dma_map_sg_p2pdma() call in rdma_rw
and removed the old pci_p2pdma_[un]map_sg(). (per Jason)

--

Logan Gunthorpe (16):
   PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
   PCI/P2PDMA: Avoid pci_get_slot() which sleeps
   PCI/P2PDMA: Attempt to set map_type if it has not been set
   PCI/P2PDMA: Refactor pci_p2pdma_map_type() to take pagmap and device
   dma-mapping: Introduce dma_map_sg_p2pdma()
   lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
   PCI/P2PDMA: Make pci_p2pdma_map_type() non-static
   PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
   dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
   dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
   iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
   nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
   nvme-pci: Convert to using dma_map_sg_p2pdma for p2pdma pages
   nvme-rdma: Ensure dma support when using p2pdma
   RDMA/rw: use dma_map_sg_p2pdma()
   PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()

  drivers/infiniband/core/rw.c |  50 +++---
  drivers/iommu/dma-iommu.c|  66 ++--
  drivers/nvme/host/core.c |   3 +-
  drivers/nvme/host/nvme.h |   2 +-
  drivers/nvme/host/pci.c  |  39 
  drivers/nvme/target/rdma.c   |   3 +-
  drivers/pci/Kconfig  |   2 +-
  drivers/pci/p2pdma.c | 188 +++
  include/linux/dma-map-ops.h  |   3 +
  include/linux/dma-mapping.h  |  20 
  include/linux/pci-p2pdma.h   |  53 ++
  include/linux/scatterlist.h  |  49 -
  include/rdma/ib_verbs.h  |  32 ++
  kernel/dma/direct.c  |  25 -
  kernel/dma/mapping.c |  70 +++--
  15 files changed, 416 insertions(+), 189 deletions(-)


base-commit: e49d033bddf5b565044e2abe4241353959bc9120
--
2.20.1


Apologies for the delay in providing feedback; climbing out of several deep
trenches at the mother ship :-/

Replying to some directly, and indirectly (mostly through JohnH's replies).

General comments:
1) nits in 1,2,3,5;
   4: I agree w/JohnH & JasonG -- it seems like it needs a device layer that gets to a
bus layer, but I'm wearing my 'broader than PCI' hat in this review; I see a (classic)
ChristophH refactoring and cleanup in this area, and wonder if we ought to clean it
up now, since CH has done so much to make the dma-mapping system
easier to add to, modify, and review thanks to the broad arch (& bus) cleanup that has
been done.  If that delays things too much, then add a TODO to do so.
2) 6: yes! let's not worry about, or even bother, supporting 32-bit anything wrt p2pdma.
3) 7:nit
4) 8: ok;
5) 9: ditto to JohnH's feedback on added / clearer comment & code flow 
(if-else).
6) 10: nits; q: should p2pdma mapping go through dma-ops so it is generalized 
for future interconnects (CXL, GenZ)?
7) 11: It says it is supporting p2pdma in dma-iommu's map_sg, but it seems like 
it is just leveraging shared code and short-circuiting IOMMU use.
8) 12-14: didn't review; letting the block/nvme/direct-io folks cover this space
9) 15: Looking to JasonG to sanitize
10) 16: cleanup; a-ok.

- DonD


Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()

2021-05-11 Thread Don Dutile

On 5/1/21 11:58 PM, John Hubbard wrote:

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:

In order to call upstream_bridge_distance_warn() from a dma_map function,
it must not sleep. The only reason it does sleep is to allocate the seqbuf
to print which devices are within the ACS path.

Switch the kmalloc call to use a passed in gfp_mask and don't print that
message if the buffer fails to be allocated.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
  drivers/pci/p2pdma.c | 21 +++--
  1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 196382630363..bd89437faf06 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -267,7 +267,7 @@ static int pci_bridge_has_acs_redir(struct pci_dev *pdev)
    static void seq_buf_print_bus_devfn(struct seq_buf *buf, struct pci_dev 
*pdev)
  {
-    if (!buf)
+    if (!buf || !buf->buffer)


This is not great, sort of from an overall design point of view, even though
it makes the rest of the patch work. See below for other ideas, that will
avoid the need for this sort of odd point fix.


+1.
In fact, I didn't see how the kmalloc was changed... you refactored the code to
pass in the
GFP_KERNEL that was originally hard-coded into upstream_bridge_distance_warn();
I don't see how that avoided the kmalloc() call.
In fact, I also see you lost a failed-kmalloc() check, so it seems to have
taken a step back.


  return;
    seq_buf_printf(buf, "%s;", pci_name(pdev));
@@ -495,25 +495,26 @@ upstream_bridge_distance(struct pci_dev *provider, struct 
pci_dev *client,
    static enum pci_p2pdma_map_type
  upstream_bridge_distance_warn(struct pci_dev *provider, struct pci_dev 
*client,
-  int *dist)
+  int *dist, gfp_t gfp_mask)
  {
  struct seq_buf acs_list;
  bool acs_redirects;
  int ret;
  -    seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, GFP_KERNEL), PAGE_SIZE);
-    if (!acs_list.buffer)
-    return -ENOMEM;


Another odd thing: this used to check for memory failure and just give
up, and now it doesn't. Yes, I realize that it all still works at the
moment, but this is quirky and we shouldn't stop here.

Instead, a cleaner approach would be to push the memory allocation
slightly higher up the call stack, out to the
pci_p2pdma_distance_many(). So pci_p2pdma_distance_many() should make
the kmalloc() call, and fail out if it can't get a page for the seq_buf
buffer. Then you don't have to do all this odd stuff.

Furthermore, the call sites can then decide for themselves which GFP
flags, GFP_ATOMIC or GFP_KERNEL or whatever they want for kmalloc().


agree, good proposal to avoid a sleep due to kmalloc().


A related thing: this whole exercise would go better if there were a
preparatory patch or two that changed the return codes in this file to
something less crazy. There are too many functions that can fail, but
are treated as if they sort-of-mostly-would-never-fail, in the hopes of
using the return value directly for counting and such. This is badly
mistaken, and it leads developers to try to avoid returning -ENOMEM
(which is what we need here).

Really, these functions should all be doing "0 for success, -ERRNO for
failure, and pass other values, including results, in the arg list".


WFM!




+    seq_buf_init(&acs_list, kmalloc(PAGE_SIZE, gfp_mask), PAGE_SIZE);
    ret = upstream_bridge_distance(provider, client, dist, &acs_redirects,
 &acs_list);
  if (acs_redirects) {
  pci_warn(client, "ACS redirect is set between the client and provider 
(%s)\n",
   pci_name(provider));
-    /* Drop final semicolon */
-    acs_list.buffer[acs_list.len-1] = 0;
-    pci_warn(client, "to disable ACS redirect for this path, add the kernel 
parameter: pci=disable_acs_redir=%s\n",
- acs_list.buffer);
+
+    if (acs_list.buffer) {
+    /* Drop final semicolon */
+    acs_list.buffer[acs_list.len - 1] = 0;
+    pci_warn(client, "to disable ACS redirect for this path, add the kernel 
parameter: pci=disable_acs_redir=%s\n",
+ acs_list.buffer);
+    }
  }
    if (ret == PCI_P2PDMA_MAP_NOT_SUPPORTED) {
@@ -566,7 +567,7 @@ int pci_p2pdma_distance_many(struct pci_dev *provider, 
struct device **clients,
    if (verbose)
  ret = upstream_bridge_distance_warn(provider,
-    pci_client, &distance);
+    pci_client, &distance, GFP_KERNEL);
  else
  ret = upstream_bridge_distance(provider, pci_client,
 &distance, NULL, NULL);



thanks,



Re: [PATCH 02/16] PCI/P2PDMA: Avoid pci_get_slot() which sleeps

2021-05-11 Thread Don Dutile

On 4/8/21 1:01 PM, Logan Gunthorpe wrote:

In order to use upstream_bridge_distance_warn() from a dma_map function,
it must not sleep. However, pci_get_slot() takes the pci_bus_sem so it
might sleep.

In order to avoid this, try to get the host bridge's device from
bus->self, and if that is not set, just get the first element in the
device list. It should be impossible for the host bridge's device to
go away while references are held on child devices, so the first element
should not be able to change and, thus, this should be safe.

Bjorn:
Why wouldn't (shouldn't?) the bus->self field be set for a host bridge device?
Should this situation be repaired in the host-bridge config/setup code elsewhere
in the kernel?
... and here, a check-and-fail with info about what doesn't have it set up (another new
pci function to do the check & pr_info), so it can point to the offending
host bridge, and thus, the code that needs to be updated?



Signed-off-by: Logan Gunthorpe 
---
  drivers/pci/p2pdma.c | 14 --
  1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index bd89437faf06..473a08940fbc 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -311,16 +311,26 @@ static const struct pci_p2pdma_whitelist_entry {
  static bool __host_bridge_whitelist(struct pci_host_bridge *host,
bool same_host_bridge)
  {
-   struct pci_dev *root = pci_get_slot(host->bus, PCI_DEVFN(0, 0));
const struct pci_p2pdma_whitelist_entry *entry;
+   struct pci_dev *root = host->bus->self;
unsigned short vendor, device;
  
+	/*

+* This makes the assumption that the first device on the bus is the
+* bridge itself and it has the devfn of 00.0. This assumption should
+* hold for the devices in the white list above, and if there are cases
+* where this isn't true they will have to be dealt with when such a
+* case is added to the whitelist.
+*/
if (!root)
+   root = list_first_entry_or_null(&host->bus->devices,
+   struct pci_dev, bus_list);
+
+   if (!root || root->devfn)
return false;
  
  	vendor = root->vendor;

device = root->device;
-   pci_dev_put(root);
  
  	for (entry = pci_p2pdma_whitelist; entry->vendor; entry++) {

if (vendor != entry->vendor || device != entry->device)




Re: [PATCH 02/16] PCI/P2PDMA: Avoid pci_get_slot() which sleeps

2021-05-11 Thread Don Dutile

On 5/2/21 1:35 AM, John Hubbard wrote:

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:

In order to use upstream_bridge_distance_warn() from a dma_map function,
it must not sleep. However, pci_get_slot() takes the pci_bus_sem so it
might sleep.

In order to avoid this, try to get the host bridge's device from
bus->self, and if that is not set, just get the first element in the
device list. It should be impossible for the host bridge's device to
go away while references are held on child devices, so the first element
should not be able to change and, thus, this should be safe.

Signed-off-by: Logan Gunthorpe 
---
  drivers/pci/p2pdma.c | 14 --
  1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index bd89437faf06..473a08940fbc 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -311,16 +311,26 @@ static const struct pci_p2pdma_whitelist_entry {
  static bool __host_bridge_whitelist(struct pci_host_bridge *host,
  bool same_host_bridge)
  {
-    struct pci_dev *root = pci_get_slot(host->bus, PCI_DEVFN(0, 0));
  const struct pci_p2pdma_whitelist_entry *entry;
+    struct pci_dev *root = host->bus->self;
  unsigned short vendor, device;
  +    /*
+ * This makes the assumption that the first device on the bus is the
+ * bridge itself and it has the devfn of 00.0. This assumption should
+ * hold for the devices in the white list above, and if there are cases
+ * where this isn't true they will have to be dealt with when such a
+ * case is added to the whitelist.


Actually, it makes the assumption that the first device *in the list*
(the host->bus->devices list) is 00.0.  The previous code made the
assumption that you wrote.

By the way, pre-existing code comment: pci_p2pdma_whitelist[] seems
really short. From a naive point of view, I'd expect that there must be
a lot more CPUs/chipsets that can do pci p2p, what do you think? I
wonder if we have to be so super strict, anyway. It just seems extremely
limited, and I suspect there will be some additions to the list as soon
as we start to use this.



+ */
  if (!root)
+    root = list_first_entry_or_null(&host->bus->devices,
+    struct pci_dev, bus_list);


OK, yes this avoids taking the pci_bus_sem, but it's kind of cheating.
Why is it OK to avoid taking any locks in order to retrieve the
first entry from the list, but in order to retrieve any other entry, you
have to acquire the pci_bus_sem, and get a reference as well? Something
is inconsistent there.

The new version here also no longer takes a reference on the device,
which is also cheating. But I'm guessing that the unstated assumption
here is that there is always at least one entry in the list. But if
that's true, then it's better to show clearly that assumption, instead
of hiding it in an implicit call that skips both locking and reference
counting.

You could add a new function, which is a cut-down version of pci_get_slot(),
like this, and call this from __host_bridge_whitelist():

/*
 * A special purpose variant of pci_get_slot() that doesn't take the pci_bus_sem
 * lock, and only looks for the 00.0 bus-device-function. Once the PCI bus is
 * up, it is safe to call this, because there will always be a top-level PCI
 * root device.
 *
 * Other assumptions: the root device is the first device in the list, and the
 * root device is numbered 00.0.
 */
struct pci_dev *pci_get_root_slot(struct pci_bus *bus)
{
struct pci_dev *root;
unsigned devfn = PCI_DEVFN(0, 0);

root = list_first_entry_or_null(&bus->devices, struct pci_dev,
    bus_list);
if (root && root->devfn == devfn)
    goto out;


... add a flag (set for p2pdma use) to the function to print out what the
root->devfn is, and what
the device is, so the needed quirk &/or modification can be added to handle when
this assumption fails;
or make it a pr_debug that can be flipped on for this failing situation, again,
to add the needed change to accommodate it.


root = NULL;
 out:
pci_dev_get(root);
return root;
}
EXPORT_SYMBOL(pci_get_root_slot);

...I think that's a lot clearer to the reader, about what's going on here.

Note that I'm not really sure if it *is* safe, I would need to ask other
PCIe subsystem developers with more experience. But I don't think anyone
is trying to make p2pdma calls so early that PCIe buses are uninitialized.



+
+    if (!root || root->devfn)
  return false;
    vendor = root->vendor;
  device = root->device;
-    pci_dev_put(root);

and the reason to remove the dev_put is b/c it can sleep as well?
is that ok, given the dev_get that John put into the new pci_get_root_slot()?
... seems like a lock-free version with no get/puts is needed, or, fix the
host-bridge setups so there are no NULL self pointers.



    for (entry = pci_p2pdma_whitelist; entry->vendor; entry++) {
  if (vendor != entry->vendor || device != entry->device)

Re: [PATCH 03/16] PCI/P2PDMA: Attempt to set map_type if it has not been set

2021-05-11 Thread Don Dutile

On 5/2/21 3:58 PM, John Hubbard wrote:

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:

Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.

Previously, the mapping type was expected to be calculated ahead of
time, but if pages are to come from userspace then there's no
way to ensure the path was checked ahead of time.

Signed-off-by: Logan Gunthorpe 
---
  drivers/pci/p2pdma.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 473a08940fbc..2574a062a255 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -825,11 +825,18 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
  static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct pci_dev *provider,
  struct pci_dev *client)
  {
+    enum pci_p2pdma_map_type ret;
+
  if (!provider->p2pdma)
  return PCI_P2PDMA_MAP_NOT_SUPPORTED;
  -    return xa_to_value(xa_load(&provider->p2pdma->map_types,
-   map_types_idx(client)));
+    ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
+  map_types_idx(client)));
+    if (ret != PCI_P2PDMA_MAP_UNKNOWN)
+    return ret;
+
+    return upstream_bridge_distance_warn(provider, client, NULL,
+ GFP_ATOMIC);


Returning a "bridge distance" from a "get map type" routine is jarring,
and I think it is because of a pre-existing problem: the above function
is severely misnamed. Let's try renaming it (and the other one) to
approximately:

    upstream_bridge_map_type_warn()
    upstream_bridge_map_type()

...and that should fix that. Well, that, plus tweaking the kernel doc
comments, which are also confused. I think someone started off thinking
about distances through PCIe, but in the end, the routine boils down to
just a few situations that are not distances at all.


+1. I didn't like the 'distance' check for a 'connection check' in the
beginning, and it looks like this is the time to clean it out.
:)


Also, the above will read a little better if it is written like this:

ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
  map_types_idx(client)));

if (ret == PCI_P2PDMA_MAP_UNKNOWN)
    ret = upstream_bridge_map_type_warn(provider, client, NULL,
    GFP_ATOMIC);

return ret;



  }
    static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
@@ -877,7 +884,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
  case PCI_P2PDMA_MAP_BUS_ADDR:
  return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
  default:
-    WARN_ON_ONCE(1);


Why? Or at least, why, in this patch? It looks like an accidental
leftover from something, seeing as how it is not directly related to the
patch, and is not mentioned at all.


thanks,



Re: [PATCH 05/16] dma-mapping: Introduce dma_map_sg_p2pdma()

2021-05-11 Thread Don Dutile

On 4/8/21 1:01 PM, Logan Gunthorpe wrote:

dma_map_sg() either returns a positive number indicating the number
of entries mapped or zero indicating that resources were not available
to create the mapping. When zero is returned, it is always safe to retry
the mapping later once resources have been freed.

Once P2PDMA pages are mixed into the SGL there may be pages that may
never be successfully mapped with a given device because that device may
not actually be able to access those pages. Thus, multiple error
conditions will need to be distinguished to determine weather a retry

s/weather/whether/

is safe.

Introduce dma_map_sg_p2pdma[_attrs]() with a different calling
convention from dma_map_sg(). The function will return a positive
integer on success or a negative errno on failure.

ENOMEM will be used to indicate a resource failure and EREMOTEIO to
indicate that a P2PDMA page is not mappable.

The __DMA_ATTR_PCI_P2PDMA attribute is introduced to inform the lower
level implementations that P2PDMA pages are allowed and to warn if a
caller introduces them into the regular dma_map_sg() interface.

Signed-off-by: Logan Gunthorpe 

John caught any other comments I had (and more).
-dd


---
  include/linux/dma-mapping.h | 15 +++
  kernel/dma/mapping.c| 52 -
  2 files changed, 61 insertions(+), 6 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2a984cb4d1e0..50b8f586cf59 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -60,6 +60,12 @@
   * at least read-only at lesser-privileged levels).
   */
  #define DMA_ATTR_PRIVILEGED   (1UL << 9)
+/*
+ * __DMA_ATTR_PCI_P2PDMA: This should not be used directly, use
+ * dma_map_sg_p2pdma() instead. Used internally to indicate that the
+ * caller is using the dma_map_sg_p2pdma() interface.
+ */
+#define __DMA_ATTR_PCI_P2PDMA  (1UL << 10)
  
  /*

   * A dma_addr_t can hold any valid DMA or bus address for the platform.  It 
can
@@ -107,6 +113,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t 
addr, size_t size,
enum dma_data_direction dir, unsigned long attrs);
  int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir, unsigned long attrs);
+int dma_map_sg_p2pdma_attrs(struct device *dev, struct scatterlist *sg,
+   int nents, enum dma_data_direction dir, unsigned long attrs);
  void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
  int nents, enum dma_data_direction dir,
  unsigned long attrs);
@@ -160,6 +168,12 @@ static inline int dma_map_sg_attrs(struct device *dev, 
struct scatterlist *sg,
  {
return 0;
  }
+static inline int dma_map_sg_p2pdma_attrs(struct device *dev,
+   struct scatterlist *sg, int nents, enum dma_data_direction dir,
+   unsigned long attrs)
+{
+   return 0;
+}
  static inline void dma_unmap_sg_attrs(struct device *dev,
struct scatterlist *sg, int nents, enum dma_data_direction dir,
unsigned long attrs)
@@ -392,6 +406,7 @@ static inline void dma_sync_sgtable_for_device(struct 
device *dev,
  #define dma_map_single(d, a, s, r) dma_map_single_attrs(d, a, s, r, 0)
  #define dma_unmap_single(d, a, s, r) dma_unmap_single_attrs(d, a, s, r, 0)
  #define dma_map_sg(d, s, n, r) dma_map_sg_attrs(d, s, n, r, 0)
+#define dma_map_sg_p2pdma(d, s, n, r) dma_map_sg_p2pdma_attrs(d, s, n, r, 0)
  #define dma_unmap_sg(d, s, n, r) dma_unmap_sg_attrs(d, s, n, r, 0)
  #define dma_map_page(d, p, o, s, r) dma_map_page_attrs(d, p, o, s, r, 0)
  #define dma_unmap_page(d, a, s, r) dma_unmap_page_attrs(d, a, s, r, 0)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6a633679933..923089c4267b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -177,12 +177,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t 
addr, size_t size,
  }
  EXPORT_SYMBOL(dma_unmap_page_attrs);
  
-/*

- * dma_maps_sg_attrs returns 0 on error and > 0 on success.
- * It should never return a value < 0.
- */
-int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
-   enum dma_data_direction dir, unsigned long attrs)
+static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
+   int nents, enum dma_data_direction dir, unsigned long attrs)
  {
const struct dma_map_ops *ops = get_dma_ops(dev);
int ents;
@@ -197,6 +193,20 @@ int dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg, int nents,
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
+   return ents;
+}
+
+/*
+ * dma_maps_sg_attrs returns 0 on error and > 0 on success.
+ * It should never return a value < 0.
+ */
int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
	enum dma_data_direction dir, unsigned long attrs)

Re: [PATCH 10/16] dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support

2021-05-11 Thread Don Dutile

On 4/8/21 1:01 PM, Logan Gunthorpe wrote:

Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.

Also, add a helper to check if a device supports PCI P2PDMA.

Signed-off-by: Logan Gunthorpe 
---
  include/linux/dma-map-ops.h |  3 +++
  include/linux/dma-mapping.h |  5 +
  kernel/dma/mapping.c| 18 ++
  3 files changed, 26 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 51872e736e7b..481892822104 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -12,6 +12,9 @@
  struct cma;
  
  struct dma_map_ops {

+   unsigned int flags;
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+

I'm not a fan of in-line defines; if we're going to add a flags field to the
dma-ops
-- and logically it'd be good to have p2pdma go through the dma-ops struct --
then let's move this up in front of the dma-ops description.

And now that the dma-ops struct is being 'opened' for p2pdma, should p2pdma ops 
be added
to this struct, so all this work can be mimic'd/reflected/leveraged/refactored 
for CXL, GenZ, etc. p2pdma in (the near?) future?


void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 50b8f586cf59..c31980ecca62 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -146,6 +146,7 @@ int dma_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
unsigned long attrs);
  bool dma_can_mmap(struct device *dev);
  int dma_supported(struct device *dev, u64 mask);
+bool dma_pci_p2pdma_supported(struct device *dev);
  int dma_set_mask(struct device *dev, u64 mask);
  int dma_set_coherent_mask(struct device *dev, u64 mask);
  u64 dma_get_required_mask(struct device *dev);
@@ -247,6 +248,10 @@ static inline int dma_supported(struct device *dev, u64 
mask)
  {
return 0;
  }
+static inline bool dma_pci_p2pdma_supported(struct device *dev)
+{
+   return 0;
+}
  static inline int dma_set_mask(struct device *dev, u64 mask)
  {
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 923089c4267b..ce44a0fcc4e5 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -573,6 +573,24 @@ int dma_supported(struct device *dev, u64 mask)
  }
  EXPORT_SYMBOL(dma_supported);
  
+bool dma_pci_p2pdma_supported(struct device *dev)

+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   /* if ops is not set, dma direct will be used which supports P2PDMA */
+   if (!ops)
+   return true;

So, this means one cannot have p2pdma with IOMMUs? ...
-- or is this 'for now' and it may change?  if it may change, add a note here.


+
+   /*
+* Note: dma_ops_bypass is not checked here because P2PDMA should
+* not be used with dma mapping ops that do not have support even
+* if the specific device is bypassing them.
+*/
+
+   return ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;

that's a bool?


+}
+EXPORT_SYMBOL_GPL(dma_pci_p2pdma_supported);
+
  #ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
  void arch_dma_set_mask(struct device *dev, u64 mask);
  #else



Re: [PATCH 11/16] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

2021-05-11 Thread Don Dutile

On 4/8/21 1:01 PM, Logan Gunthorpe wrote:

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.

Similar to dma-direct, the sg_mark_pci_p2pdma() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is
set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for
P2PDMA pages.

Signed-off-by: Logan Gunthorpe 

So, this code prevents use of p2pdma through an IOMMU, which wasn't checked and
was short-circuited by other checks into using dma-direct?

So my overall comment on this code & related comments is that it should be
sprinkled
with notes like "doesn't support IOMMU" and/or "TODO" when/if the IOMMU is to be
supported.
Or, if IOMMU-based p2pdma isn't supported in these routines directly, note where/how
it will be supported.


---
  drivers/iommu/dma-iommu.c | 66 ++-
  1 file changed, 58 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..ef49635f9819 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -864,6 +865,16 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
  
+		if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {

+   if (i > 0)
+   cur = sg_next(cur);
+
+   pci_p2pdma_map_bus_segment(s, cur);
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -961,10 +972,12 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
+   struct dev_pagemap *pgmap = NULL;
+   enum pci_p2pdma_map_type map_type;
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
-   int i;
+   int i, ret = 0;
  
  	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&

iommu_deferred_attach(dev, domain))
@@ -993,6 +1006,31 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;
  
+		if (is_pci_p2pdma_page(sg_page(s))) {

+   if (sg_page(s)->pgmap != pgmap) {
+   pgmap = sg_page(s)->pgmap;
+   map_type = pci_p2pdma_map_type(pgmap, dev,
+  attrs);
+   }
+
+   switch (map_type) {
+   case PCI_P2PDMA_MAP_BUS_ADDR:
+   /*
+* A zero length will be ignored by
+* iommu_map_sg() and then can be detected
+* in __finalise_sg() to actually map the
+* bus address.
+*/
+   s->length = 0;
+   continue;
+   case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+   break;

So, this 'short-circuits' the use of the IOMMU, silently?
That seems ripe for users to enable the IOMMU for secure-computing reasons, and
also enable p2pdma, without realizing the result isn't as secure as it appears.
If my understanding is wrong, please point me to the documentation or code that
corrects this misunderstanding; I could have missed a warning, in a past patch
set, for the case where both are enabled.
Thanks.
--dd

+   default:
+   ret = -EREMOTEIO;
+   goto out_restore_sg;
+   }
+   }
+
/*
 * Due to the alignment of our single IOVA allocation, we can
 * depend on these assumptions about the segment boundary mask:
@@ -1015,6 +1053,9 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
prev = s;
}
  
+	if (!iova_len)
+   return __finalise_sg(dev, sg, nents, 0);
+
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), 

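The zero-length placeholder scheme in the hunk above (mark bus-address P2P segments in iommu_dma_map_sg(), then fill them in later in __finalise_sg()) can be illustrated with a simplified userspace analogue. The types and helpers below are hypothetical stand-ins, not the kernel's struct scatterlist:

```c
#include <stddef.h>

/* Hypothetical stand-in for a scatterlist segment. */
enum seg_kind { SEG_NORMAL, SEG_P2P_BUS_ADDR };

struct seg {
	enum seg_kind kind;
	size_t length;   /* input length; doubles as the marker */
	size_t dma_len;  /* output length, filled in by finalise_segments() */
};

/* Pass 1 (cf. iommu_dma_map_sg): zero the length of bus-address P2P
 * segments so the IOVA allocation / IOMMU mapping step skips them. */
static void prepare_segments(struct seg *segs, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (segs[i].kind == SEG_P2P_BUS_ADDR)
			segs[i].length = 0;
}

/* Pass 2 (cf. __finalise_sg): a zero length on a P2P segment is the
 * signal to fill in the PCI bus address directly; every other segment
 * got a regular IOMMU mapping in between the two passes. */
static void finalise_segments(struct seg *segs, int n, size_t bus_len)
{
	int i;

	for (i = 0; i < n; i++) {
		if (segs[i].kind == SEG_P2P_BUS_ADDR && segs[i].length == 0)
			segs[i].dma_len = bus_len;        /* bus-address path */
		else
			segs[i].dma_len = segs[i].length; /* IOMMU path */
	}
}
```

The key design point is that pass 1 never consults the IOMMU for bus-address segments, so the placeholder must survive the intermediate mapping step untouched, which a zero length guarantees.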
Re: [PATCH 07/16] PCI/P2PDMA: Make pci_p2pdma_map_type() non-static

2021-05-11 Thread Don Dutile

On 4/8/21 1:01 PM, Logan Gunthorpe wrote:

pci_p2pdma_map_type() will be needed by the dma-iommu map_sg
implementation because it will need to determine the mapping type
ahead of actually doing the mapping to create the actual iommu mapping.

Signed-off-by: Logan Gunthorpe 
---
  drivers/pci/p2pdma.c   | 34 +++---
  include/linux/pci-p2pdma.h | 15 +++
  2 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index bcb1a6d6119d..38c93f57a941 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -20,13 +20,6 @@
  #include 
  #include 
  
-enum pci_p2pdma_map_type {
-   PCI_P2PDMA_MAP_UNKNOWN = 0,
-   PCI_P2PDMA_MAP_NOT_SUPPORTED,
-   PCI_P2PDMA_MAP_BUS_ADDR,
-   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
  struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
@@ -822,13 +815,30 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
  }
  EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
  
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-   struct device *dev)
+/**
+ * pci_p2pdma_map_type - return the type of mapping that should be used for
+ * a given device and pgmap
+ * @pgmap: the pagemap of a page to determine the mapping type for
+ * @dev: device that is mapping the page
+ * @dma_attrs: the attributes passed to the dma_map operation --
+ * this is so they can be checked to ensure P2PDMA pages were not
+ * introduced into an incorrect interface (like dma_map_sg).
+ *
+ * Returns one of:
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED - The mapping should not be done
+ * PCI_P2PDMA_MAP_BUS_ADDR - The mapping should use the PCI bus address
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE - The mapping should be done directly
+ */

I'd recommend putting these descriptions on the enum values themselves, in
pci-p2pdma.h. Also, can you use a better description for THRU_HOST_BRIDGE?
It leaves the reader wondering what 'done directly' means.

Thanks.
-dd
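For illustration, a sketch of that suggestion (hypothetical comment wording, not the actual patch): move the per-value descriptions onto the enum where it is defined in include/linux/pci-p2pdma.h.

```c
/* Sketch only -- the descriptive wording here is a reviewer's proposal,
 * not text from the patch series. */
enum pci_p2pdma_map_type {
	/* Mapping type not yet determined for this pgmap/device pair. */
	PCI_P2PDMA_MAP_UNKNOWN = 0,
	/* P2P transactions cannot reach the provider; fail the mapping. */
	PCI_P2PDMA_MAP_NOT_SUPPORTED,
	/* TLPs are routed peer-to-peer without reaching the host bridge;
	 * map using the PCI bus address of the provider's BAR. */
	PCI_P2PDMA_MAP_BUS_ADDR,
	/* TLPs traverse the host bridge, so the mapping goes through the
	 * normal DMA API path (dma-direct or the IOMMU). */
	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
};
```

Keeping the descriptions at the definition site means callers switching on the return value see the routing semantics where the constants live, instead of only in one function's kernel-doc.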


+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+   struct device *dev, unsigned long dma_attrs)
  {
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
enum pci_p2pdma_map_type ret;
struct pci_dev *client;
  
+	WARN_ONCE(!(dma_attrs & __DMA_ATTR_PCI_P2PDMA),
+ "PCI P2PDMA pages were mapped with dma_map_sg!");
+
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
  
@@ -879,7 +889,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
struct pci_p2pdma_pagemap *p2p_pgmap =
to_p2p_pgmap(sg_page(sg)->pgmap);
  
-	switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev)) {
+   switch (pci_p2pdma_map_type(sg_page(sg)->pgmap, dev,
+   __DMA_ATTR_PCI_P2PDMA)) {
case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
return dma_map_sg_attrs(dev, sg, nents, dir, attrs);
case PCI_P2PDMA_MAP_BUS_ADDR:
@@ -904,7 +915,8 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
  {
enum pci_p2pdma_map_type map_type;
  
-	map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev);
+   map_type = pci_p2pdma_map_type(sg_page(sg)->pgmap, dev,
+  __DMA_ATTR_PCI_P2PDMA);
  
	if (map_type == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
		dma_unmap_sg_attrs(dev, sg, nents, dir, attrs);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..a06072ac3a52 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,13 @@
  struct block_device;
  struct scatterlist;
  
+enum pci_p2pdma_map_type {
+   PCI_P2PDMA_MAP_UNKNOWN = 0,
+   PCI_P2PDMA_MAP_NOT_SUPPORTED,
+   PCI_P2PDMA_MAP_BUS_ADDR,
+   PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
  #ifdef CONFIG_PCI_P2PDMA
  int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
@@ -30,6 +37,8 @@ struct scatterlist *pci_p2pmem_alloc_sgl(struct pci_dev *pdev,
 unsigned int *nents, u32 length);
  void pci_p2pmem_free_sgl(struct pci_dev *pdev, struct scatterlist *sgl);
  void pci_p2pmem_publish(struct pci_dev *pdev, bool publish);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
+   struct device *dev, unsigned long dma_attrs);
  int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
  void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
@@ -83,6 +92,12 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
  static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
  {
  }
+static inline enum pci_p2pdma_map_type pci_p2pdma_map_type

Re: [PATCH 01/16] PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()

2021-05-11 Thread Don Dutile

On 5/11/21 12:12 PM, Logan Gunthorpe wrote:


On 2021-05-11 10:05 a.m., Don Dutile wrote:

On 5/1/21 11:58 PM, John Hubbard wrote:

On 4/8/21 10:01 AM, Logan Gunthorpe wrote:

In order to call upstream_bridge_distance_warn() from a dma_map function,
it must not sleep. The only reason it does sleep is to allocate the seqbuf
to print which devices are within the ACS path.

Switch the kmalloc call to use a passed in gfp_mask and don't print that
message if the buffer fails to be allocated.

Signed-off-by: Logan Gunthorpe 
Acked-by: Bjorn Helgaas 
---
   drivers/pci/p2pdma.c | 21 +++--
   1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 196382630363..bd89437faf06 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -267,7 +267,7 @@ static int pci_bridge_has_acs_redir(struct pci_dev *pdev)
  static void seq_buf_print_bus_devfn(struct seq_buf *buf, struct pci_dev *pdev)
   {
-    if (!buf)
+    if (!buf || !buf->buffer)

This is not great, sort of from an overall design point of view, even though
it makes the rest of the patch work. See below for other ideas, that will
avoid the need for this sort of odd point fix.


+1.
In fact, I didn't see how the kmalloc was changed... you refactored the code to
pass in the GFP_KERNEL that was originally hard-coded into
upstream_bridge_distance_warn(), but I don't see how that avoided the kmalloc()
call. In fact, I also see you lost a failed-kmalloc() check, so it seems to
have taken a step back.

I've changed this in v2 to just use some memory allocated on the stack.
Avoids this argument all together.

Logan


Looking forward to the v2; again, my apologies for the delay, and for the
redundancy it adds to your review feedback & changes.
-Don
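For reference, the stack-backed approach Logan describes might look like the following userspace analogue. This is a sketch under assumptions: the strbuf type below stands in for the kernel's seq_buf, and none of this is the actual v2 patch. The point is that backing the scratch buffer with stack memory removes the allocation entirely, so nothing can sleep in atomic context:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the kernel's seq_buf. */
struct strbuf {
	char *buf;
	size_t size;
	size_t len;
};

/* Attach caller-provided storage; no heap allocation anywhere. */
static void strbuf_init(struct strbuf *s, char *buf, size_t size)
{
	s->buf = buf;
	s->size = size;
	s->len = 0;
	if (size)
		buf[0] = '\0';
}

/* Append one formatted string, like seq_buf_printf(); silently
 * truncates when the buffer is full, which is acceptable for a
 * best-effort warning message. */
static void strbuf_printf(struct strbuf *s, const char *fmt, const char *arg)
{
	size_t room = s->size - s->len;
	int n;

	if (!room)
		return;
	n = snprintf(s->buf + s->len, room, fmt, arg);
	if (n < 0)
		return;
	if ((size_t)n >= room)
		s->len = s->size - 1;  /* truncated; NUL terminator kept */
	else
		s->len += (size_t)n;
}
```

The trade-off versus kmalloc() is a fixed-size message: device lists longer than the stack buffer get truncated instead of triggering an allocation failure path.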

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu