Re: [PATCH] iommu/arm-smmu-v3: Set GBPA to abort all transactions
Hi Marc,

On 4/12/2018 7:56 AM, Marc Zyngier wrote:
On 12/04/18 11:17, Robin Murphy wrote:
On 11/04/18 17:54, Marc Zyngier wrote:
Hi Sameer,
On 11/04/18 16:58, Goel, Sameer wrote:
On 3/28/2018 9:00 AM, Marc Zyngier wrote:
On 2018-03-28 15:39, Timur Tabi wrote:

From: Sameer Goel

Set SMMU_GBPA to abort all incoming translations during the SMMU reset when SMMUEN==0. This prevents a race condition where a stray DMA from the crashed primary kernel can try to access an IOVA address as an invalid PA when the SMMU is disabled during reset in the crash kernel.

Signed-off-by: Sameer Goel
---
 drivers/iommu/arm-smmu-v3.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 3f2f1fc68b52..c04a89310c59 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2458,6 +2458,18 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	if (reg & CR0_SMMUEN)
 		dev_warn(smmu->dev, "SMMU currently enabled! Resetting...\n");
 
+	/*
+	 * Abort all incoming translations. This can happen in a kdump case
+	 * where SMMU is initialized when a prior DMA is pending. Just
+	 * disabling the SMMU in this case might result in writes to invalid
+	 * PAs.
+	 */
+	ret = arm_smmu_update_gbpa(smmu, 1, GBPA_ABORT);
+	if (ret) {
+		dev_err(smmu->dev, "GBPA not responding to update\n");
+		return ret;
+	}
+
 	ret = arm_smmu_device_disable(smmu);
 	if (ret)
 		return ret;

A tangential question: can we reliably detect that the SMMU already has valid mappings, which would indicate that we're in pretty bad shape already by the time we set that bit? For all we know, memory could have been corrupted long before we hit this point, and this patch barely narrows the window of opportunity. :)

Yes, that is correct. This only covers the kdump scenario, trying to get some reliability when booting up the crash kernel. The system is already in a bad state. I don't think that this will happen in a normal scenario.
But please point me to the GICv3 change and I'll have a look.

See this:
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/tree/drivers/irqchip/irq-gic-v3-its.c?h=irq/irqchip-4.17&id=6eb486b66a3094cdcd68dc39c9df3a29d6a51dd5#n3407

The nearest equivalent to that is probably the top-level SMMUEN check that we already have (see the diff context above). To go beyond that, you'd have to chase the old stream table pointer and scan the whole thing looking for valid contexts, then potentially walk the page tables within those contexts to check for live translations, if you really wanted to be sure. That would be a hell of a lot of work to do in the boot path.

Yeah, feels a bit too involved for sanity. I'd simply suggest you taint the kernel if you find the SMMU enabled, as you're already on shaky ground.

Finding the SMMU already enabled does not necessarily indicate that anything catastrophic has occurred. For instance, to support OSes without an SMMUv3 driver, boot FW may have enabled the SMMU and installed 1-to-1 mappings for DDR and the MSI target address(es) to compensate for an MSI-capable master whose default DMA attributes needed tweaking (e.g. non-coherent -> coherent). If such a configuration warrants tainting the kernel, then we should similarly check GBPA for attribute overrides and taint the kernel if any are found there.

Thanks,
M.
--
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2] iommu/arm-smmu-v3: limit reporting of MSI allocation failures
Currently, the arm-smmu-v3 driver expects to allocate MSIs for all SMMUs with FEAT_MSI set. This results in unwarranted "failed to allocate MSIs" warnings being printed on systems where FW was either deliberately configured to force the use of SMMU wired interrupts -or- is altogether incapable of describing SMMU MSI topology (ACPI IORT prior to rev. C).

Remedy this by checking msi_domain before attempting to allocate SMMU MSIs.

Signed-off-by: Nate Watterson
Signed-off-by: Sinan Kaya
---
 drivers/iommu/arm-smmu-v3.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 744592d..00de028 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2328,10 +2328,15 @@ static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
 	if (!(smmu->features & ARM_SMMU_FEAT_MSI))
 		return;
 
+	if (!dev->msi_domain) {
+		dev_info(smmu->dev, "msi_domain absent - falling back to wired irqs\n");
+		return;
+	}
+
 	/* Allocate MSIs for evtq, gerror and priq. Ignore cmdq */
 	ret = platform_msi_domain_alloc_irqs(dev, nvec, arm_smmu_write_msi_msg);
 	if (ret) {
-		dev_warn(dev, "failed to allocate MSIs\n");
+		dev_warn(dev, "failed to allocate MSIs - falling back to wired irqs\n");
 		return;
 	}
[PATCH] iommu/arm-smmu-v3: suppress MSI allocation failure message
From: Sinan Kaya

Even though QDF2400 supports MSI interrupts with SMMUv3, they are not enabled in the ACPI FW in order to preserve compatibility with older kernel versions. The code emits a warning message during boot. This causes some test tools to record a false warning and is creating support issues. Better to reduce the message level.

Signed-off-by: Sinan Kaya
Signed-off-by: Nate Watterson
---
 drivers/iommu/arm-smmu-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 744592d..2118fda 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2331,7 +2331,7 @@ static void arm_smmu_setup_msis(struct arm_smmu_device *smmu)
 	/* Allocate MSIs for evtq, gerror and priq. Ignore cmdq */
 	ret = platform_msi_domain_alloc_irqs(dev, nvec, arm_smmu_write_msi_msg);
 	if (ret) {
-		dev_warn(dev, "failed to allocate MSIs\n");
+		dev_info(dev, "failed to allocate MSIs\n");
 		return;
 	}
Re: [PATCH 2/4] iommu/io-pgtable-arm: Support 52-bit physical address
Hi Robin, On 11/29/2017 6:29 AM, Robin Murphy wrote: Hi Nate, On 29/11/17 07:07, Nate Watterson wrote: Hi Robin, On 11/28/2017 12:27 PM, Robin Murphy wrote: Bring io-pgtable-arm in line with the ARMv8.2-LPA feature allowing 52-bit physical addresses when using the 64KB translation granule. This will be supported by SMMUv3.1. Signed-off-by: Robin Murphy --- drivers/iommu/io-pgtable-arm.c | 65 ++ 1 file changed, 47 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c index 51e5c43caed1..4d46017b3262 100644 --- a/drivers/iommu/io-pgtable-arm.c +++ b/drivers/iommu/io-pgtable-arm.c @@ -21,6 +21,7 @@ #define pr_fmt(fmt) "arm-lpae io-pgtable: " fmt #include +#include #include #include #include @@ -32,7 +33,7 @@ #include "io-pgtable.h" -#define ARM_LPAE_MAX_ADDR_BITS 48 +#define ARM_LPAE_MAX_ADDR_BITS 52 #define ARM_LPAE_S2_MAX_CONCAT_PAGES 16 #define ARM_LPAE_MAX_LEVELS 4 @@ -86,6 +87,8 @@ #define ARM_LPAE_PTE_TYPE_TABLE 3 #define ARM_LPAE_PTE_TYPE_PAGE 3 +#define ARM_LPAE_PTE_ADDR_MASK GENMASK_ULL(47,12) + #define ARM_LPAE_PTE_NSTABLE (((arm_lpae_iopte)1) << 63) #define ARM_LPAE_PTE_XN (((arm_lpae_iopte)3) << 53) #define ARM_LPAE_PTE_AF (((arm_lpae_iopte)1) << 10) @@ -159,6 +162,7 @@ #define ARM_LPAE_TCR_PS_42_BIT 0x3ULL #define ARM_LPAE_TCR_PS_44_BIT 0x4ULL #define ARM_LPAE_TCR_PS_48_BIT 0x5ULL +#define ARM_LPAE_TCR_PS_52_BIT 0x6ULL #define ARM_LPAE_MAIR_ATTR_SHIFT(n) ((n) << 3) #define ARM_LPAE_MAIR_ATTR_MASK 0xff @@ -170,9 +174,7 @@ #define ARM_LPAE_MAIR_ATTR_IDX_DEV 2 /* IOPTE accessors */ -#define iopte_deref(pte,d) \ - (__va((pte) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1) \ - & ~(ARM_LPAE_GRANULE(d) - 1ULL))) +#define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d)) #define iopte_type(pte,l) \ (((pte) >> ARM_LPAE_PTE_TYPE_SHIFT) & ARM_LPAE_PTE_TYPE_MASK) @@ -184,12 +186,6 @@ (iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_PAGE) : \ (iopte_type(pte,l) == ARM_LPAE_PTE_TYPE_BLOCK)) -#define iopte_to_pfn(pte,d) \ - (((pte) & 
((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)) >> (d)->pg_shift) - -#define pfn_to_iopte(pfn,d) \ - (((pfn) << (d)->pg_shift) & ((1ULL << ARM_LPAE_MAX_ADDR_BITS) - 1)) - struct arm_lpae_io_pgtable { struct io_pgtable iop; @@ -203,6 +199,25 @@ struct arm_lpae_io_pgtable { typedef u64 arm_lpae_iopte; +static arm_lpae_iopte paddr_to_iopte(phys_addr_t paddr, + struct arm_lpae_io_pgtable *data) +{ + arm_lpae_iopte pte = paddr; + + /* Of the bits which overlap, either 51:48 or 15:12 are always RES0 */ + return (pte | (pte >> 36)) & ARM_LPAE_PTE_ADDR_MASK; +} + +static phys_addr_t iopte_to_paddr(arm_lpae_iopte pte, + struct arm_lpae_io_pgtable *data) +{ + phys_addr_t paddr = pte & ARM_LPAE_PTE_ADDR_MASK; + phys_addr_t paddr_hi = paddr & (ARM_LPAE_GRANULE(data) - 1); + + /* paddr_hi spans nothing for 4K granule, and only RES0 bits for 16K */ + return (paddr ^ paddr_hi) | (paddr_hi << 36); +} + static bool selftest_running = false; static dma_addr_t __arm_lpae_dma_addr(void *pages) @@ -287,7 +302,7 @@ static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data, pte |= ARM_LPAE_PTE_TYPE_BLOCK; pte |= ARM_LPAE_PTE_AF | ARM_LPAE_PTE_SH_IS; - pte |= pfn_to_iopte(paddr >> data->pg_shift, data); + pte |= paddr_to_iopte(paddr, data); __arm_lpae_set_pte(ptep, pte, &data->iop.cfg); } @@ -528,7 +543,7 @@ static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data, if (size == split_sz) unmap_idx = ARM_LPAE_LVL_IDX(iova, lvl, data); - blk_paddr = iopte_to_pfn(blk_pte, data) << data->pg_shift; + blk_paddr = iopte_to_paddr(blk_pte, data); pte = iopte_prot(blk_pte); for (i = 0; i < tablesz / sizeof(pte); i++, blk_paddr += split_sz) { @@ -652,12 +667,13 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops, found_translation: iova &= (ARM_LPAE_BLOCK_SIZE(lvl, data) - 1); - return ((phys_addr_t)iopte_to_pfn(pte,data) << data->pg_shift) | iova; + return iopte_to_paddr(pte, data) | iova; } static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg) { - unsigned 
long granule; + unsigned long granule, page_sizes; + unsigned int max_addr_bits = 48; /* * We need to restrict the supported page sizes to match the @@ -677,17 +693,24 @@ static void arm_lpae_restrict_pgsizes(struct io_pgtable_cfg *cfg) switch (granule) { case SZ_4K: - cfg-&g
Re: [PATCH v2 0/3] arm-smmu: performance optimization
Hi Leizhen,

On 9/12/2017 9:00 AM, Zhen Lei wrote:
v1 -> v2: based on (add02cfdc9bc2 "iommu: Introduce Interface for IOMMU TLB Flushing")

Zhen Lei (3):
  iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
  iommu/arm-smmu-v3: add support for unmap an iova range with only one tlb sync

I tested these (2) patches on QDF2400 hardware and saw performance improvements in line with those I reported when testing the original series. I don't have any hardware close at hand to test the 3rd patch in the series, so that will have to come from someone else.

Tested-by: Nate Watterson

Thanks,
Nate

  iommu/arm-smmu: add support for unmap a memory range with only one tlb sync

 drivers/iommu/arm-smmu-v3.c        | 52 ++
 drivers/iommu/arm-smmu.c           | 10
 drivers/iommu/io-pgtable-arm-v7s.c | 32 +++
 drivers/iommu/io-pgtable-arm.c     | 30 ++
 drivers/iommu/io-pgtable.h         |  1 +
 5 files changed, 99 insertions(+), 26 deletions(-)
Re: [PATCH 1/1] iommu/iova: Make rcache flush optional on IOVA allocation failure
Hi Tomasz,

On 9/18/2017 12:02 PM, Robin Murphy wrote:
Hi Tomasz,

On 18/09/17 11:56, Tomasz Nowicki wrote:
Since IOVA allocation failure is not an unusual case, we need to flush the CPUs' rcache in the hope that we will succeed in the next round. However, it is useful to decide whether we need the rcache flush step, for two reasons:

- Scalability. On a large system with ~100 CPUs, iterating over and flushing the rcache for each CPU becomes a serious bottleneck, so we may want to deffer it.

s/deffer/defer

- free_cpu_cached_iovas() does not care about the max PFN we are interested in. Thus we may flush our rcaches and still get no new IOVA, as in the commonly used scenario:

if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
	iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift);

if (!iova)
	iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift);

1. The first alloc_iova_fast() call is limited to DMA_BIT_MASK(32) to get PCI devices a SAC address
2. alloc_iova() fails due to a full 32-bit space
3. The rcaches contain PFNs outside the 32-bit space, so free_cpu_cached_iovas() throws entries away for nothing and alloc_iova() fails again
4. The next alloc_iova_fast() call cannot take advantage of the rcache, since we have just defeated the caches. In this case we pick the slowest option to proceed.

This patch reworks the flushed_rcache local flag to be an additional function argument instead, controlling the rcache flush step. It also updates all users to do the flush as a last resort.

Looks like you've run into the same thing Nate found[1] - I came up with almost the exact same patch, only with separate alloc_iova_fast() and alloc_iova_fast_noretry() wrapper functions, but on reflection, just exposing the bool to callers is probably simpler. One nit: can you document it in the kerneldoc comment too? With that:

Reviewed-by: Robin Murphy

Thanks,
Robin.

[1]: https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg19758.html

This patch completely resolves the issue I reported in [1]!!
Tested-by: Nate Watterson

Thanks,
Nate

Signed-off-by: Tomasz Nowicki
---
 drivers/iommu/amd_iommu.c   | 5 +++--
 drivers/iommu/dma-iommu.c   | 6 --
 drivers/iommu/intel-iommu.c | 5 +++--
 drivers/iommu/iova.c        | 7 +++
 include/linux/iova.h        | 5 +++--
 5 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8d2ec60..ce68986 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1604,10 +1604,11 @@ static unsigned long dma_ops_alloc_iova(struct device *dev,
 	if (dma_mask > DMA_BIT_MASK(32))
 		pfn = alloc_iova_fast(&dma_dom->iovad, pages,
-				      IOVA_PFN(DMA_BIT_MASK(32)));
+				      IOVA_PFN(DMA_BIT_MASK(32)), false);
 
 	if (!pfn)
-		pfn = alloc_iova_fast(&dma_dom->iovad, pages, IOVA_PFN(dma_mask));
+		pfn = alloc_iova_fast(&dma_dom->iovad, pages,
+				      IOVA_PFN(dma_mask), true);
 
 	return (pfn << PAGE_SHIFT);
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 191be9c..25914d3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -370,10 +370,12 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 	/* Try to get PCI devices a SAC address */
 	if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
-		iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift);
+		iova = alloc_iova_fast(iovad, iova_len,
+				       DMA_BIT_MASK(32) >> shift, false);
 
 	if (!iova)
-		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift);
+		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
+				       true);
 
 	return (dma_addr_t)iova << shift;
 }
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 05c0c3a..75c8320 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3460,11 +3460,12 @@ static unsigned long intel_alloc_iova(struct device *dev,
 		 * from higher range
 		 */
 		iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
-					   IOVA_PFN(DMA_BIT_MASK(32)));
+					   IOVA_PFN(DMA_BIT_MASK(32)), false);
 		if (iova_pfn)
 			return iova_pfn;
 	}
-	iova_pfn = alloc_iova_fast(&domain->iovad, nrpages, IOVA_PFN(dma_mask));
+	iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
+				   IOVA_PFN(dma_mask), true);
 	if (unlikely(!iova_pfn)) {
 		pr_err(
Re: [PATCH v3 0/4] Optimise 64-bit IOVA allocations
Hi Robin,

On 8/22/2017 11:17 AM, Robin Murphy wrote:
Hi all,

Just a quick repost of v2[1] with a small fix for the bug reported by Nate.

I tested the series and can confirm that the crash I reported on v2 no longer occurs with this version.

To recap, whilst this mostly only improves worst-case performance, those worst cases have a tendency to be pathologically bad: Ard reports general desktop performance with Chromium on AMD Seattle going from ~1-2 FPS to perfectly usable. Leizhen reports gigabit ethernet throughput going from ~6.5 Mbit/s to line speed. I also inadvertently found that the HiSilicon hns_dsaf driver was taking ~35s to probe simply because of the number of DMA buffers it maps on startup (perf shows around 76% of that was spent under the lock in alloc_iova()). With this series applied it takes a mere ~1s, mostly of unrelated mdelay()s, with alloc_iova() entirely lost in the noise.

Are any of these cases PCI devices attached to domains that have run out of 32-bit IOVAs and have to retry the allocation using dma_limit?

iommu_dma_alloc_iova() {
	[...]
	if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))	[<- TRUE]
		iova = alloc_iova_fast(DMA_BIT_MASK(32));	[<- NULL]
	if (!iova)
		iova = alloc_iova_fast(dma_limit);		[<- OK  ]
	[...]
}

I am asking because, when using 64k pages, the Mellanox CX4 adapter exhausts the supply of 32-bit IOVAs simply by allocating per-cpu IOVA space during 'ifconfig up', so the code path outlined above is taken for nearly all subsequent allocations. Although I do see a notable (~2x) performance improvement with this series, I would still characterize it as "pathologically bad" at < 10% of iommu passthrough performance.

This was a bit surprising to me, as I thought the iova_rcache would have eliminated the need to walk the rbtree for runtime allocations. Unfortunately, it looks like the failed attempt to allocate a 32-bit IOVA actually drops the cached IOVAs that we could have used when subsequently performing the allocation at dma_limit.

alloc_iova_fast() {
	[...]
	iova_pfn = iova_rcache_get(...);	[<- Fail, no 32-bit IOVAs]
	if (iova_pfn)
		return iova_pfn;
retry:
	new_iova = alloc_iova(...);		[<- Fail, no 32-bit IOVAs]
	if (!new_iova) {
		unsigned int cpu;
		if (flushed_rcache)
			return 0;
		/* Try replenishing IOVAs by flushing rcache. */
		flushed_rcache = true;
		for_each_online_cpu(cpu)
			free_cpu_cached_iovas(cpu, iovad);	[<- :( ]
		goto retry;
	}
}

As an experiment, I added code to skip the rcache flushing/retry for the 32-bit allocations. In this configuration, 100% of passthrough-mode performance was achieved. I made the same change in the baseline and measured performance at ~95% of passthrough mode. I also got similar results by altogether removing the 32-bit allocation from iommu_dma_alloc_iova(), which makes me wonder why we even bother. What (PCIe) workloads have been shown to actually benefit from it?

Tested-by: Nate Watterson

-Nate

Robin.

[1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg19139.html

Robin Murphy (1):
  iommu/iova: Extend rbtree node caching

Zhen Lei (3):
  iommu/iova: Optimise rbtree searching
  iommu/iova: Optimise the padding calculation
  iommu/iova: Make dma_32bit_pfn implicit

 drivers/gpu/drm/tegra/drm.c      |   3 +-
 drivers/gpu/host1x/dev.c         |   3 +-
 drivers/iommu/amd_iommu.c        |   7 +--
 drivers/iommu/dma-iommu.c        |  18 +--
 drivers/iommu/intel-iommu.c      |  11 ++--
 drivers/iommu/iova.c             | 114 +--
 drivers/misc/mic/scif/scif_rma.c |   3 +-
 include/linux/iova.h             |   8 +--
 8 files changed, 62 insertions(+), 105 deletions(-)
Re: [PATCH 0/3] SMMUv3 CMD_SYNC optimisation
Hi Robin,

On 8/18/2017 1:33 PM, Robin Murphy wrote:
Hi all,

Waiting for the command queue to drain for CMD_SYNC completion is likely a contention hotspot on high-core-count systems. If the SMMU is coherent and supports MSIs, though, we can use this cool feature (as suggested by the architecture, no less) to make syncs effectively non-blocking for anyone other than the caller. I don't have any hardware that supports MSIs, but this has at least passed muster on the Fast Model with cache modelling enabled - I'm hoping the Qualcomm machines have the appropriate configuration to actually test how well it works in reality.

If it is worthwhile, I do have most of a plan for how we can do something similar in the non-MSI polling case (it's mostly a problem of handling the queue-wrapping edge cases correctly).

I tested this on QDF2400 hardware, which supports MSI as a CMD_SYNC completion signal. As with Thunder's "performance optimization" series, I evaluated the patches using FIO with 4 NVMe drives connected to a single SMMU. Here's how they compared:

FIO - 512k blocksize / io-depth 32 / 1 thread per drive

  Baseline 4.13-rc1 w/SMMU enabled:  25% of SMMU bypass performance
  Baseline + Thunder Patch 1:        28%
  Baseline + CMD_SYNC Optimization:  36%
  Baseline + Thunder Patches 2-5:    86%
  Baseline + Thunder Patches 1-5:   100% [!!]

Seems like it would probably be worthwhile to implement this for the non-MSI case also. Let me know if there are other workloads you're particularly interested in, and I'll try to get those tested too.

-Nate

Robin.

Robin Murphy (3):
  iommu/arm-smmu-v3: Specialise CMD_SYNC handling
  iommu/arm-smmu-v3: Forget about cmdq-sync interrupt
  iommu/arm-smmu-v3: Utilise CMD_SYNC MSI feature

 drivers/iommu/arm-smmu-v3.c | 117 +---
 1 file changed, 77 insertions(+), 40 deletions(-)
Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
Hi Robin, On 7/31/2017 7:42 AM, Robin Murphy wrote: Hi Nate, On 29/07/17 04:57, Nate Watterson wrote: Hi Robin, I am seeing a crash when performing very basic testing on this series with a Mellanox CX4 NIC. I dug into the crash a bit, and think this patch is the culprit, but this rcache business is still mostly witchcraft to me. # ifconfig eth5 up # ifconfig eth5 down Unable to handle kernel NULL pointer dereference at virtual address 0020 user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00 [0020] *pgd=0006efab0003, *pud=0006efab0003, *pmd=0007d8720003, *pte= Internal error: Oops: 9607 [#1] SMP Modules linked in: CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3 task: 8007da1e5780 task.stack: 8007ddcb8000 PC is at __cached_rbnode_delete_update+0x2c/0x58 LR is at private_free_iova+0x2c/0x60 pc : [] lr : [] pstate: 204001c5 sp : 8007ddcbba00 x29: 8007ddcbba00 x28: 8007c8350210 x27: 8007d1a8 x26: 8007dcc20800 x25: 0140 x24: 8007c98f0008 x23: fe4e x22: 0140 x21: 8007c98f0008 x20: 8007c9adb240 x19: 8007c98f0018 x18: 0010 x17: x16: x15: 4000 x14: x13: x12: 0001 x11: dead0200 x10: x9 : x8 : 8007c9adb1c0 x7 : 40002000 x6 : 00210d00 x5 : x4 : c57e x3 : ffcf x2 : ffcf x1 : 8007c9adb240 x0 : [...] [] __cached_rbnode_delete_update+0x2c/0x58 [] private_free_iova+0x2c/0x60 [] iova_magazine_free_pfns+0x4c/0xa0 [] free_iova_fast+0x1b0/0x230 [] iommu_dma_free_iova+0x5c/0x80 [] __iommu_dma_unmap+0x5c/0x98 [] iommu_dma_unmap_resource+0x24/0x30 [] iommu_dma_unmap_page+0xc/0x18 [] __iommu_unmap_page+0x40/0x60 [] mlx5e_page_release+0xbc/0x128 [] mlx5e_dealloc_rx_wqe+0x30/0x40 [] mlx5e_close_channel+0x70/0x1f8 [] mlx5e_close_channels+0x2c/0x50 [] mlx5e_close_locked+0x54/0x68 [] mlx5e_close+0x30/0x58 [...] 
** Disassembly for __cached_rbnode_delete_update() near the fault ** 92|if (free->pfn_hi < iovad->dma_32bit_pfn) 0852C6C4|ldr x3,[x1,#0x18]; x3,[free,#24] 0852C6C8|ldr x2,[x0,#0x30]; x2,[iovad,#48] 0852C6CC|cmp x3,x2 0852C6D0|b.cs0x0852C708 |curr = &iovad->cached32_node; 94|if (!curr) 0852C6D4|addsx19,x0,#0x18 ; x19,iovad,#24 0852C6D8|b.eq0x0852C708 | |cached_iova = rb_entry(*curr, struct iova, node); | 99|if (free->pfn_lo >= cached_iova->pfn_lo) 0852C6DC|ldr x0,[x19] ; xiovad,[curr] 0852C6E0|ldr x2,[x1,#0x20]; x2,[free,#32] 0852C6E4|ldr x0,[x0,#0x20]; x0,[x0,#32] Apparently cached_iova was NULL so the pfn_lo access faulted. 0852C6E8|cmp x2,x0 0852C6EC|b.cc0x0852C6FC 0852C6F0|mov x0,x1; x0,free 100|*curr = rb_next(&free->node); After instrumenting the code a bit, this seems to be the culprit. In the previous call, free->pfn_lo was 0x_ which is actually the dma_limit for the domain so rb_next() returns NULL. Let me know if you have any questions or would like additional tests run. I also applied your "DMA domain debug info" patches and dumped the contents of the domain at each of the steps above in case that would be useful. If nothing else, they reinforce how thirsty the CX4 NIC is especially when using 64k pages and many CPUs. Thanks for the report - I somehow managed to reason myself out of keeping the "no cached node" check in __cached_rbnode_delete_update() on the assumption that it must always be set by a previous allocation. However, there is indeed just one case case for which that fails: when you free any IOVA immediately after freeing the very topmost one. Which is something that freeing an entire magazine's worth of IOVAs back to the tree all at once has a very real chance of doing... The obvious straightforward fix is inline below, but I'm now starting to understand the appeal of reserving a sentinel node to ensure the tree can never be empty, so I might have a quick go at that to see if it results in nicer code overall. 
After applying your fix, the crash no longer occurs, but the performance of this use case is only marginally less awful
Re: [PATCH v2 3/4] iommu/iova: Extend rbtree node caching
Hi Robin, I am seeing a crash when performing very basic testing on this series with a Mellanox CX4 NIC. I dug into the crash a bit, and think this patch is the culprit, but this rcache business is still mostly witchcraft to me. # ifconfig eth5 up # ifconfig eth5 down Unable to handle kernel NULL pointer dereference at virtual address 0020 user pgtable: 64k pages, 48-bit VAs, pgd = 8007dbf47c00 [0020] *pgd=0006efab0003, *pud=0006efab0003, *pmd=0007d8720003, *pte= Internal error: Oops: 9607 [#1] SMP Modules linked in: CPU: 47 PID: 5082 Comm: ifconfig Not tainted 4.13.0-rtp-enablement+ #3 task: 8007da1e5780 task.stack: 8007ddcb8000 PC is at __cached_rbnode_delete_update+0x2c/0x58 LR is at private_free_iova+0x2c/0x60 pc : [] lr : [] pstate: 204001c5 sp : 8007ddcbba00 x29: 8007ddcbba00 x28: 8007c8350210 x27: 8007d1a8 x26: 8007dcc20800 x25: 0140 x24: 8007c98f0008 x23: fe4e x22: 0140 x21: 8007c98f0008 x20: 8007c9adb240 x19: 8007c98f0018 x18: 0010 x17: x16: x15: 4000 x14: x13: x12: 0001 x11: dead0200 x10: x9 : x8 : 8007c9adb1c0 x7 : 40002000 x6 : 00210d00 x5 : x4 : c57e x3 : ffcf x2 : ffcf x1 : 8007c9adb240 x0 : [...] [] __cached_rbnode_delete_update+0x2c/0x58 [] private_free_iova+0x2c/0x60 [] iova_magazine_free_pfns+0x4c/0xa0 [] free_iova_fast+0x1b0/0x230 [] iommu_dma_free_iova+0x5c/0x80 [] __iommu_dma_unmap+0x5c/0x98 [] iommu_dma_unmap_resource+0x24/0x30 [] iommu_dma_unmap_page+0xc/0x18 [] __iommu_unmap_page+0x40/0x60 [] mlx5e_page_release+0xbc/0x128 [] mlx5e_dealloc_rx_wqe+0x30/0x40 [] mlx5e_close_channel+0x70/0x1f8 [] mlx5e_close_channels+0x2c/0x50 [] mlx5e_close_locked+0x54/0x68 [] mlx5e_close+0x30/0x58 [...] 
** Disassembly for __cached_rbnode_delete_update() near the fault **

 92|		if (free->pfn_hi < iovad->dma_32bit_pfn)
0852C6C4|	ldr  x3,[x1,#0x18]	; x3,[free,#24]
0852C6C8|	ldr  x2,[x0,#0x30]	; x2,[iovad,#48]
0852C6CC|	cmp  x3,x2
0852C6D0|	b.cs 0x0852C708
   |			curr = &iovad->cached32_node;
 94|		if (!curr)
0852C6D4|	adds x19,x0,#0x18	; x19,iovad,#24
0852C6D8|	b.eq 0x0852C708
   |
   |		cached_iova = rb_entry(*curr, struct iova, node);
   |
 99|		if (free->pfn_lo >= cached_iova->pfn_lo)
0852C6DC|	ldr  x0,[x19]		; x0,[curr]
0852C6E0|	ldr  x2,[x1,#0x20]	; x2,[free,#32]
0852C6E4|	ldr  x0,[x0,#0x20]	; x0,[x0,#32]
Apparently cached_iova was NULL, so the pfn_lo access faulted.
0852C6E8|	cmp  x2,x0
0852C6EC|	b.cc 0x0852C6FC
0852C6F0|	mov  x0,x1		; x0,free
100|			*curr = rb_next(&free->node);

After instrumenting the code a bit, this seems to be the culprit. In the previous call, free->pfn_lo was 0x_ which is actually the dma_limit for the domain, so rb_next() returns NULL.

Let me know if you have any questions or would like additional tests run. I also applied your "DMA domain debug info" patches and dumped the contents of the domain at each of the steps above in case that would be useful. If nothing else, they reinforce how thirsty the CX4 NIC is, especially when using 64k pages and many CPUs.

-Nate

On 7/21/2017 7:42 AM, Robin Murphy wrote:

The cached node mechanism provides a significant performance benefit for allocations using a 32-bit DMA mask, but in the case of non-PCI devices or where the 32-bit space is full, the loss of this benefit can be significant - on large systems there can be many thousands of entries in the tree, such that traversing to the end then walking all the way down to find free space every time becomes increasingly awful. Maintain a similar cached node for the whole IOVA space as a superset of the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei.
Tested-by: Ard Biesheuvel Tested-by: Zhen Lei Signed-off-by: Robin Murphy --- v2: No change drivers/iommu/iova.c | 59 +--- include/linux/iova.h | 3 ++- 2 files changed, 30 insertions(+), 32 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index d094d1ca8f23..f5809a2ee6c2 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -46,6
Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
Hi Jonathan, [...] Hi All,

I'm a bit of a late entry to this discussion. I've just been running some more detailed tests on our d05 boards and wanted to bring some more numbers to the discussion. All tests are against 4.12 with the following additions:

* Robin's series removing the io-pgtable spinlock (and a few recent fixes)
* Cherry-picked updates to the sas driver, merged prior to 4.13-rc1
* An additional HNS (network card) bug fix that will be upstreamed shortly.

I've broken the results down into this patch, and this patch + the remainder of the set. As leizhen mentioned, we got a nice little performance bump from Robin's series, so that was applied first (as it's in mainline now).

SAS tests were fio with the noop scheduler, 4k block size, various io depths, and 1 process per disk. Note this is probably a different setup to leizhen's original numbers. Percentages are relative to the performance seen with the smmu disabled.

SAS, 4.12 - none of this series, SMMU disabled:
  read io-depth 32   - 384K IOPS (100%)
  read io-depth 2048 - 950K IOPS (100%)
  rw   io-depth 32   - 166K IOPS (100%)
  rw   io-depth 2048 - 340K IOPS (100%)

SMMU enabled:
  read io-depth 32   - 201K IOPS (52%)
  read io-depth 2048 - 306K IOPS (32%)
  rw   io-depth 32   -  99K IOPS (60%)
  rw   io-depth 2048 - 150K IOPS (44%)

Robin's recent series with fixes as seen on list (now merged), SMMU enabled:
  read io-depth 32   - 208K IOPS (54%)
  read io-depth 2048 - 335K IOPS (35%)
  rw   io-depth 32   - 105K IOPS (63%)
  rw   io-depth 2048 - 165K IOPS (49%)

4.12 + Robin's series + just this patch, SMMU enabled
(iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction)
  read io-depth 32   - 225K IOPS (59%)
  read io-depth 2048 - 365K IOPS (38%)
  rw   io-depth 32   - 110K IOPS (66%)
  rw   io-depth 2048 - 179K IOPS (53%)

4.12 + Robin's series + second part of this series
(iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction)
(iommu: add a new member unmap_tlb_sync into struct iommu_ops)
(iommu/arm-smmu-v3: add support for unmap an iova range with only one tlb sync)
(iommu/arm-smmu: add support for unmap of a memory range with only one tlb sync)
  read io-depth 32   - 225K IOPS (59%)
  read io-depth 2048 - 833K IOPS (88%)
  rw   io-depth 32   - 112K IOPS (67%)
  rw   io-depth 2048 - 220K IOPS (65%)

Robin's series gave us small gains across the board (3-5% recovered) relative to the no-smmu performance (which we are taking as the ideal case). This first patch gets us back another 2-5% of the no-smmu performance. The next few patches gain us very little at the small io-depths but make a large difference at the larger io-depths - in particular the read IOPS, which is over twice as fast as without the series.

For HNS it seems that we are less dependent on the SMMU performance and can reach the non-SMMU speed.

Tests with iperf -t 30 -i 10 -c IPADDRESS -P 3, with the last 10 seconds taken to avoid any initial variability. The server end of the link was always running with smmu v3 disabled so as to act as a fast sink of the data. Some variation was seen across repeat runs.
Mainline v4.12 + network card fix:
  NO SMMU: 9.42 GBits/sec
  SMMU:    4.36 GBits/sec (46%)

Robin's io-pgtable spinlock series:
  6.68 to 7.34 GBits/sec (71% - 78%, variation across runs)

Just this patch, SMMU enabled
(iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction)
  7.96 - 8.8 GBits/sec (85% - 94%, some variation across runs)

Full series
(iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction)
(iommu: add a new member unmap_tlb_sync into struct iommu_ops)
(iommu/arm-smmu-v3: add support for unmap an iova range with only one tlb sync)
(iommu/arm-smmu: add support for unmap of a memory range with only one tlb sync)
  9.42 GBits/sec (100%)

So the HNS test shows a greater boost from Robin's series and this first patch. This is most likely because the HNS test is not putting as high a load on the SMMU and associated code as the SAS test. In both cases, however, this shows that both parts of this patch series are beneficial.

So on to the questions ;) Will, you mentioned that along with Robin and Nate you were working on a somewhat related strategy to improve the performance. Any ETA on that?

The strategy I was working on is basically equivalent to the second part of the series. I will test your patches out sometime this week, and I'll also try to have our performance team run it through their whole suite.

Thanks, that's excellent. Look forward to hearing how it goes.

I tested the patches with 4 NVME drives connected to a single SMMU and the results seem to be in line with those you've reported.

FIO - 512k blocksize / io-depth 32 / 1 thread per drive
  Baseline 4.13-rc1 w/SMMU enabled: 25% of SMMU bypass performance
  Baseline + Patch 1              : 28%
  Baseline + Patches 2-5          : 86%
  Baseline + Complete series      : 100% [!!]

I saw performance improvements across all of the other FIO profiles I tested, although not always as substantial as was seen in th
Re: [PATCH 1/5] iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock confliction
Hi Jonathan, On 7/17/2017 10:23 AM, Jonathan Cameron wrote: On Mon, 17 Jul 2017 14:06:42 +0100 John Garry wrote: + On 29/06/2017 03:08, Leizhen (ThunderTown) wrote: On 2017/6/28 17:32, Will Deacon wrote: Hi Zhen Lei, Nate (CC'd), Robin and I have been working on something very similar to this series, but this patch is different to what we had planned. More below. On Mon, Jun 26, 2017 at 09:38:46PM +0800, Zhen Lei wrote: Because all TLBI commands should be followed by a SYNC command, to make sure that it has been completely finished. So we can just add the TLBI commands into the queue, and put off the execution until meet SYNC or other commands. To prevent the followed SYNC command waiting for a long time because of too many commands have been delayed, restrict the max delayed number. According to my test, I got the same performance data as I replaced writel with writel_relaxed in queue_inc_prod. Signed-off-by: Zhen Lei --- drivers/iommu/arm-smmu-v3.c | 42 +- 1 file changed, 37 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 291da5f..4481123 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -337,6 +337,7 @@ /* Command queue */ #define CMDQ_ENT_DWORDS 2 #define CMDQ_MAX_SZ_SHIFT 8 +#define CMDQ_MAX_DELAYED 32 #define CMDQ_ERR_SHIFT24 #define CMDQ_ERR_MASK 0x7f @@ -472,6 +473,7 @@ struct arm_smmu_cmdq_ent { }; } cfgi; + #define CMDQ_OP_TLBI_NH_ALL 0x10 #define CMDQ_OP_TLBI_NH_ASID0x11 #define CMDQ_OP_TLBI_NH_VA 0x12 #define CMDQ_OP_TLBI_EL2_ALL0x20 @@ -499,6 +501,7 @@ struct arm_smmu_cmdq_ent { struct arm_smmu_queue { int irq; /* Wired interrupt */ + u32 nr_delay; __le64 *base; dma_addr_t base_dma; @@ -722,11 +725,16 @@ static int queue_sync_prod(struct arm_smmu_queue *q) return ret; } -static void queue_inc_prod(struct arm_smmu_queue *q) +static void queue_inc_swprod(struct arm_smmu_queue *q) { - u32 prod = (Q_WRP(q, q->prod) | Q_IDX(q, q->prod)) + 1; + u32 prod = q->prod + 1; 
q->prod = Q_OVF(q, q->prod) | Q_WRP(q, prod) | Q_IDX(q, prod); +} + +static void queue_inc_prod(struct arm_smmu_queue *q) +{ + queue_inc_swprod(q); writel(q->prod, q->prod_reg); } @@ -761,13 +769,24 @@ static void queue_write(__le64 *dst, u64 *src, size_t n_dwords) *dst++ = cpu_to_le64(*src++); } -static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent) +static int queue_insert_raw(struct arm_smmu_queue *q, u64 *ent, int optimize) { if (queue_full(q)) return -ENOSPC; queue_write(Q_ENT(q, q->prod), ent, q->ent_dwords); - queue_inc_prod(q); + + /* +* We don't want too many commands to be delayed, this may lead the +* followed sync command to wait for a long time. +*/ + if (optimize && (++q->nr_delay < CMDQ_MAX_DELAYED)) { + queue_inc_swprod(q); + } else { + queue_inc_prod(q); + q->nr_delay = 0; + } + So here, you're effectively putting invalidation commands into the command queue without updating PROD. Do you actually see a performance advantage from doing so? Another side of the argument would be that we should be Yes, my sas ssd performance test showed that it can improve about 100-150K/s(the same to I directly replace writel with writel_relaxed). And the average execution time of iommu_unmap(which called by iommu_dma_unmap_sg) dropped from 10us to 5us. moving PROD as soon as we can, so that the SMMU can process invalidation commands in the background and reduce the cost of the final SYNC operation when the high-level unmap operation is complete. There maybe that __iowmb() is more expensive than wait for tlbi complete. Except the time of __iowmb() itself, it also protected by spinlock, lock confliction will rise rapidly in the stress scene. __iowmb() average cost 300-500ns(Sorry, I forget the exact value). In addition, after applied this patcheset and Robin's v2, and my earlier dma64 iova optimization patchset. Our net performance test got the same data to global bypass. But sas ssd still have more than 20% dropped. 
Maybe we should still focus on map/unmap, because the average execution time of iova alloc/free is only about 400ns. By the way, patches 2-5 are more effective than this one; they can improve more than 350K/s. And with them, we got about a 100-150K/s improvement from Robin's v2. Otherwise, I saw no effect from Robin's v2. Sorry, I have not tested how this patch performs without patches 2-5. Furthermore, I got the sam
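To make the batching strategy in this patch concrete, here is a toy user-space model of it (illustrative only, with hypothetical field names; not the driver code). TLBI commands are written into the queue, but the PROD doorbell visible to the hardware is only published when a non-TLBI command (such as a SYNC) arrives or CMDQ_MAX_DELAYED commands have accumulated:

```c
#include <assert.h>
#include <stdbool.h>

#define CMDQ_MAX_DELAYED 32

struct cmdq {
	unsigned prod;      /* software producer index (queue_inc_swprod) */
	unsigned hw_prod;   /* last value published to the "hardware" */
	unsigned nr_delay;  /* commands queued but not yet published */
	unsigned publishes; /* count of (expensive) doorbell writes */
};

/* Stands in for writel(q->prod, q->prod_reg) plus the barrier it implies. */
static void publish(struct cmdq *q)
{
	q->hw_prod = q->prod;
	q->publishes++;
	q->nr_delay = 0;
}

static void insert_cmd(struct cmdq *q, bool is_tlbi)
{
	q->prod++;	/* queue_write() + queue_inc_swprod() */

	/*
	 * Delay the doorbell for TLBI commands, but cap the backlog so a
	 * following SYNC never has to wait for too many delayed commands.
	 */
	if (is_tlbi && ++q->nr_delay < CMDQ_MAX_DELAYED)
		return;
	publish(q);
}
```

Batching a run of TLBIs behind one doorbell write is exactly why the expensive __iowmb() (and the spinlock held around it) is paid once per batch rather than once per invalidation.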
Re: [PATCH v2] iommu/arm-smmu-v3: Implement shutdown method
I should have removed the '-v3' since this revision of the patch adds shutdown to arm-smmu.c as well. I'll fix that in a subsequent version after waiting to see if there are additional changes that need to be made. On 6/29/2017 6:18 PM, Nate Watterson wrote: The shutdown method disables the SMMU to avoid corrupting a new kernel started with kexec. Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 7 +++ drivers/iommu/arm-smmu.c| 6 ++ 2 files changed, 13 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 380969a..3d8ac29 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2765,9 +2765,15 @@ static int arm_smmu_device_remove(struct platform_device *pdev) struct arm_smmu_device *smmu = platform_get_drvdata(pdev); arm_smmu_device_disable(smmu); + return 0; } +static void arm_smmu_device_shutdown(struct platform_device *pdev) +{ + arm_smmu_device_remove(pdev); +} + static struct of_device_id arm_smmu_of_match[] = { { .compatible = "arm,smmu-v3", }, { }, @@ -2781,6 +2787,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev) }, .probe = arm_smmu_device_probe, .remove = arm_smmu_device_remove, + .shutdown = arm_smmu_device_shutdown, }; module_platform_driver(arm_smmu_driver); diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index 7ec30b0..af50bab 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ -2319,6 +2319,11 @@ static int arm_smmu_device_remove(struct platform_device *pdev) return 0; } +static void arm_smmu_device_shutdown(struct platform_device *pdev) +{ + arm_smmu_device_remove(pdev); +} + static struct platform_driver arm_smmu_driver = { .driver = { .name = "arm-smmu", @@ -2326,6 +2331,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev) }, .probe = arm_smmu_device_probe, .remove = arm_smmu_device_remove, + .shutdown = arm_smmu_device_shutdown, }; module_platform_driver(arm_smmu_driver); -- Qualcomm Datacenter 
Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v2] iommu/arm-smmu-v3: Implement shutdown method
The shutdown method disables the SMMU to avoid corrupting a new kernel started with kexec. Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 7 +++ drivers/iommu/arm-smmu.c| 6 ++ 2 files changed, 13 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 380969a..3d8ac29 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2765,9 +2765,15 @@ static int arm_smmu_device_remove(struct platform_device *pdev) struct arm_smmu_device *smmu = platform_get_drvdata(pdev); arm_smmu_device_disable(smmu); + return 0; } +static void arm_smmu_device_shutdown(struct platform_device *pdev) +{ + arm_smmu_device_remove(pdev); +} + static struct of_device_id arm_smmu_of_match[] = { { .compatible = "arm,smmu-v3", }, { }, @@ -2781,6 +2787,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev) }, .probe = arm_smmu_device_probe, .remove = arm_smmu_device_remove, + .shutdown = arm_smmu_device_shutdown, }; module_platform_driver(arm_smmu_driver); diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index 7ec30b0..af50bab 100644 --- a/drivers/iommu/arm-smmu.c +++ b/drivers/iommu/arm-smmu.c @@ -2319,6 +2319,11 @@ static int arm_smmu_device_remove(struct platform_device *pdev) return 0; } +static void arm_smmu_device_shutdown(struct platform_device *pdev) +{ + arm_smmu_device_remove(pdev); +} + static struct platform_driver arm_smmu_driver = { .driver = { .name = "arm-smmu", @@ -2326,6 +2331,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev) }, .probe = arm_smmu_device_probe, .remove = arm_smmu_device_remove, + .shutdown = arm_smmu_device_shutdown, }; module_platform_driver(arm_smmu_driver); -- Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. 
Re: [PATCH] iommu/arm-smmu-v3: Implement shutdown method
On 6/29/2017 2:34 PM, Will Deacon wrote: On Thu, Jun 29, 2017 at 01:40:15PM -0400, Nate Watterson wrote: The shutdown method disables the SMMU and its interrupts to avoid potentially corrupting a new kernel started with kexec. Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 11 +++ 1 file changed, 11 insertions(+) We should update arm-smmu.c as well. diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 380969a..907d576 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2765,9 +2765,19 @@ static int arm_smmu_device_remove(struct platform_device *pdev) struct arm_smmu_device *smmu = platform_get_drvdata(pdev); arm_smmu_device_disable(smmu); + + /* Disable IRQs */ + arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL, + ARM_SMMU_IRQ_CTRLACK); + Can you justify the need for this? If we actually need to disable interrupts, then I'd like to understand why so that we can make sure we get the ordering right with respect to disabling the device. Also, do we need to clear the MSI registers too? I have no justification. Based on the number of drivers that take care to prevent their HW from generating an interrupt, I thought it would be required, but I can't find any such requirement explicitly laid out in the documentation. When you mention the MSI registers do you mean, for instance, SMMU_GERROR_IRQ_CFG0? It looks like those are always cleared while initializing the SMMU so the case where an SMMU transitions from using MSIs to using wired interrupts between kernels will be handled properly. My understanding is that kexec will mask irqs at the GIC, so there's not actually an issue here. Will -- Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu/arm-smmu-v3: Implement shutdown method
The shutdown method disables the SMMU and its interrupts to avoid potentially corrupting a new kernel started with kexec.

Signed-off-by: Nate Watterson
---
 drivers/iommu/arm-smmu-v3.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 380969a..907d576 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2765,9 +2765,19 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
 	struct arm_smmu_device *smmu = platform_get_drvdata(pdev);
 
 	arm_smmu_device_disable(smmu);
+
+	/* Disable IRQs */
+	arm_smmu_write_reg_sync(smmu, 0, ARM_SMMU_IRQ_CTRL,
+				ARM_SMMU_IRQ_CTRLACK);
+
 	return 0;
 }
 
+static void arm_smmu_device_shutdown(struct platform_device *pdev)
+{
+	arm_smmu_device_remove(pdev);
+}
+
 static struct of_device_id arm_smmu_of_match[] = {
 	{ .compatible = "arm,smmu-v3", },
 	{ },
@@ -2781,6 +2791,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
 	},
 	.probe	= arm_smmu_device_probe,
 	.remove	= arm_smmu_device_remove,
+	.shutdown = arm_smmu_device_shutdown,
 };
 module_platform_driver(arm_smmu_driver);
--
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH 0/8] io-pgtable lock removal
Hi Robin, On 6/8/2017 7:51 AM, Robin Murphy wrote: Hi all, Here's the cleaned up nominally-final version of the patches everybody's keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix, #2-#4 do some preparatory work (and bid farewell to everyone's least favourite bit of code, hooray!), and #5-#8 do the dirty deed itself. The branch I've previously shared has been updated too: git://linux-arm.org/linux-rm iommu/pgtable All feedback welcome, as I'd really like to land this for 4.13. I tested the series on a QDF2400 development platform and see notable performance improvements particularly in workloads that make concurrent accesses to a single iommu_domain. Robin. Robin Murphy (8): iommu/io-pgtable-arm-v7s: Check table PTEs more precisely iommu/io-pgtable-arm: Improve split_blk_unmap iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap iommu/io-pgtable: Introduce explicit coherency iommu/io-pgtable-arm: Support lockless operation iommu/io-pgtable-arm-v7s: Support lockless operation iommu/arm-smmu: Remove io-pgtable spinlock iommu/arm-smmu-v3: Remove io-pgtable spinlock drivers/iommu/arm-smmu-v3.c| 36 ++- drivers/iommu/arm-smmu.c | 48 -- drivers/iommu/io-pgtable-arm-v7s.c | 173 + drivers/iommu/io-pgtable-arm.c | 190 - drivers/iommu/io-pgtable.h | 6 ++ 5 files changed, 268 insertions(+), 185 deletions(-) -- Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 0/7] Add PCI ATS support to SMMUv3
Hi Jean-Philippe, On 5/24/2017 2:01 PM, Jean-Philippe Brucker wrote: PCIe devices can implement their own TLB, named Address Translation Cache (ATC). In order to support Address Translation Service (ATS), the following changes are needed in software: * Enable ATS on endpoints when the system supports it. Both PCI root complex and associated SMMU must implement the ATS protocol. * When unmapping an IOVA, send an ATC invalidate request to the endpoint in addition to the usual SMMU IOTLB invalidations. I previously sent this as part of a lengthy RFC [1] adding SVM (ATS + PASID + PRI) support to SMMUv3. The next PASID/PRI version is almost ready, but isn't likely to get merged because it needs hardware testing, so I will send it later. PRI depends on ATS, but ATS should be useful on its own. Without PASID and PRI, ATS is used for accelerating transactions. Instead of having all memory accesses go through SMMU translation, the endpoint can translate IOVA->PA once, store the result in its ATC, then issue subsequent transactions using the PA, partially bypassing the SMMU. So in theory it should be faster while keeping the advantages of an IOMMU, namely scatter-gather and access control. The ATS patches can now be tested on some hardware, even though the lack of compatible PCI endpoints makes it difficult to assess what performance optimizations we need. That's why the ATS implementation is a bit rough at the moment, and we will work on optimizing things like invalidation ranges later. Sinan and I have tested this series on a QDF2400 development platform using a PCIe exerciser card as the ATS capable endpoint. We were able to verify that ATS requests complete with a valid translated address and that DMA transactions using the pre-translated address "bypass" the SMMU. Testing ATC invalidations was a bit more difficult as we could not figure out how to get the exerciser card to automatically send the completion message. 
We ended up having to write a debugger script that would monitor the CMDQ and tell the exerciser to send the completion when a hanging CMD_SYNC following a CMD_ATC_INV was detected. Hopefully we'll get some real ATS capable endpoints to test with soon. Since the RFC [1]: * added DT and ACPI patches, * added invalidate-all on domain detach, * removed smmu_group again, * removed invalidation print from the fast path, * disabled tagged pointers for good, * some style changes. These patches are based on Linux v4.12-rc2 [1] https://www.spinics.net/lists/linux-pci/msg58650.html Jean-Philippe Brucker (7): PCI: Move ATS declarations outside of CONFIG_PCI dt-bindings: PCI: Describe ATS property for root complex nodes iommu/of: Check ATS capability in root complex nodes ACPI/IORT: Check ATS capability in root complex nodes iommu/arm-smmu-v3: Link domains and devices iommu/arm-smmu-v3: Add support for PCI ATS iommu/arm-smmu-v3: Disable tagged pointers .../devicetree/bindings/pci/pci-iommu.txt | 8 + drivers/acpi/arm64/iort.c | 10 + drivers/iommu/arm-smmu-v3.c| 258 - drivers/iommu/of_iommu.c | 8 + include/linux/iommu.h | 4 + include/linux/pci.h| 26 +-- 6 files changed, 293 insertions(+), 21 deletions(-) -- Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
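As a purely conceptual illustration of why an ATC helps (a toy model with made-up slot counts and a fake linear mapping, not the SMMU driver's code): the endpoint caches an IOVA->PA translation on the first access, reuses it on subsequent accesses without involving the SMMU, and must drop it when the driver sends a CMD_ATC_INV before reusing the IOVA.

```c
#include <assert.h>
#include <stdint.h>

#define ATC_SLOTS  8
#define PAGE_SHIFT 12

/* Direct-mapped toy Address Translation Cache, one entry per slot. */
struct atc_entry { uint64_t iova_pfn; uint64_t pa_pfn; int valid; };
static struct atc_entry atc[ATC_SLOTS];
static unsigned smmu_walks; /* how many times we had to ask the SMMU */

/* Stand-in for a full SMMU table walk (arbitrary linear mapping). */
static uint64_t smmu_translate(uint64_t iova_pfn)
{
	smmu_walks++;
	return iova_pfn + 0x100000;
}

/* Endpoint-side translation: hit the ATC if possible, walk otherwise. */
static uint64_t dma_translate(uint64_t iova)
{
	uint64_t pfn = iova >> PAGE_SHIFT;
	struct atc_entry *e = &atc[pfn % ATC_SLOTS];

	if (!(e->valid && e->iova_pfn == pfn)) {	/* ATC miss */
		e->iova_pfn = pfn;
		e->pa_pfn = smmu_translate(pfn);
		e->valid = 1;
	}
	return (e->pa_pfn << PAGE_SHIFT) | (iova & ((1u << PAGE_SHIFT) - 1));
}

/* What a CMD_ATC_INV must achieve before an IOVA may be reused. */
static void atc_invalidate(uint64_t iova_pfn)
{
	struct atc_entry *e = &atc[iova_pfn % ATC_SLOTS];
	if (e->valid && e->iova_pfn == iova_pfn)
		e->valid = 0;
}
```

This is also why the unmap path needs the extra ATC invalidate on top of the usual IOTLB invalidations: a stale ATC entry would let the endpoint keep issuing pre-translated PAs for a freed IOVA.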
Re: [PATCH V11 08/11] drivers: acpi: Handle IOMMU lookup failure with deferred probing or error
Hi Lorenzo, On 5/23/2017 5:26 AM, Lorenzo Pieralisi wrote: On Tue, May 23, 2017 at 02:31:17PM +0530, Sricharan R wrote: Hi Lorenzo, On 5/23/2017 2:22 PM, Lorenzo Pieralisi wrote: On Tue, May 23, 2017 at 02:26:10AM -0400, Nate Watterson wrote: Hi Sricharan, On 4/10/2017 7:21 AM, Sricharan R wrote: This is an equivalent to the DT's handling of the iommu master's probe with deferred probing when the corrsponding iommu is not probed yet. The lack of a registered IOMMU can be caused by the lack of a driver for the IOMMU, the IOMMU device probe not having been performed yet, having been deferred, or having failed. The first case occurs when the firmware describes the bus master and IOMMU topology correctly but no device driver exists for the IOMMU yet or the device driver has not been compiled in. Return NULL, the caller will configure the device without an IOMMU. The second and third cases are handled by deferring the probe of the bus master device which will eventually get reprobed after the IOMMU. The last case is currently handled by deferring the probe of the bus master device as well. A mechanism to either configure the bus master device without an IOMMU or to fail the bus master device probe depending on whether the IOMMU is optional or mandatory would be a good enhancement. 
Tested-by: Hanjun Guo Reviewed-by: Robin Murphy [Lorenzo: Added fixes for dma_coherent_mask overflow, acpi_dma_configure called multiple times for same device] Signed-off-by: Lorenzo Pieralisi Signed-off-by: Sricharan R --- drivers/acpi/arm64/iort.c | 33 - drivers/acpi/scan.c| 11 --- drivers/base/dma-mapping.c | 2 +- include/acpi/acpi_bus.h| 2 +- include/linux/acpi.h | 7 +-- 5 files changed, 47 insertions(+), 8 deletions(-) diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index 3dd9ec3..e323ece 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -543,6 +543,14 @@ static const struct iommu_ops *iort_iommu_xlate(struct device *dev, const struct iommu_ops *ops = NULL; int ret = -ENODEV; struct fwnode_handle *iort_fwnode; + struct iommu_fwspec *fwspec = dev->iommu_fwspec; + + /* +* If we already translated the fwspec there +* is nothing left to do, return the iommu_ops. +*/ + if (fwspec && fwspec->ops) + return fwspec->ops; Is this logic strictly required? It breaks masters with multiple SIDs as only the first SID is actually added to the master's fwspec. My bad, that's indeed a silly bug I introduced. Please let me know if the patch below fixes it, we will send it upstream shortly. oops, i think emails crossed. Please let me know if you are ok to add this to the other fixes. No worries, yes I am ok thanks but please give Nate some time to report back to make sure the diff I sent actually fixes the problem. The patch you sent fixes the problem. Thanks for the quick turnaround. Apologies for the breakage. Lorenzo Regards, Sricharan -- Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V11 08/11] drivers: acpi: Handle IOMMU lookup failure with deferred probing or error
Hi Sricharan, On 4/10/2017 7:21 AM, Sricharan R wrote: This is an equivalent to the DT's handling of the iommu master's probe with deferred probing when the corrsponding iommu is not probed yet. The lack of a registered IOMMU can be caused by the lack of a driver for the IOMMU, the IOMMU device probe not having been performed yet, having been deferred, or having failed. The first case occurs when the firmware describes the bus master and IOMMU topology correctly but no device driver exists for the IOMMU yet or the device driver has not been compiled in. Return NULL, the caller will configure the device without an IOMMU. The second and third cases are handled by deferring the probe of the bus master device which will eventually get reprobed after the IOMMU. The last case is currently handled by deferring the probe of the bus master device as well. A mechanism to either configure the bus master device without an IOMMU or to fail the bus master device probe depending on whether the IOMMU is optional or mandatory would be a good enhancement. Tested-by: Hanjun Guo Reviewed-by: Robin Murphy [Lorenzo: Added fixes for dma_coherent_mask overflow, acpi_dma_configure called multiple times for same device] Signed-off-by: Lorenzo Pieralisi Signed-off-by: Sricharan R --- drivers/acpi/arm64/iort.c | 33 - drivers/acpi/scan.c| 11 --- drivers/base/dma-mapping.c | 2 +- include/acpi/acpi_bus.h| 2 +- include/linux/acpi.h | 7 +-- 5 files changed, 47 insertions(+), 8 deletions(-) diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index 3dd9ec3..e323ece 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -543,6 +543,14 @@ static const struct iommu_ops *iort_iommu_xlate(struct device *dev, const struct iommu_ops *ops = NULL; int ret = -ENODEV; struct fwnode_handle *iort_fwnode; + struct iommu_fwspec *fwspec = dev->iommu_fwspec; + + /* +* If we already translated the fwspec there +* is nothing left to do, return the iommu_ops. 
+*/ + if (fwspec && fwspec->ops) + return fwspec->ops; Is this logic strictly required? It breaks masters with multiple SIDs as only the first SID is actually added to the master's fwspec. if (node) { iort_fwnode = iort_get_fwnode(node); @@ -550,8 +558,17 @@ static const struct iommu_ops *iort_iommu_xlate(struct device *dev, return NULL; ops = iommu_ops_from_fwnode(iort_fwnode); + /* +* If the ops look-up fails, this means that either +* the SMMU drivers have not been probed yet or that +* the SMMU drivers are not built in the kernel; +* Depending on whether the SMMU drivers are built-in +* in the kernel or not, defer the IOMMU configuration +* or just abort it. +*/ if (!ops) - return NULL; + return iort_iommu_driver_enabled(node->type) ? + ERR_PTR(-EPROBE_DEFER) : NULL; ret = arm_smmu_iort_xlate(dev, streamid, iort_fwnode, ops); } @@ -625,12 +642,26 @@ const struct iommu_ops *iort_iommu_configure(struct device *dev) while (parent) { ops = iort_iommu_xlate(dev, parent, streamid); + if (IS_ERR_OR_NULL(ops)) + return ops; parent = iort_node_get_id(node, &streamid, IORT_IOMMU_TYPE, i++); } } + /* +* If we have reason to believe the IOMMU driver missed the initial +* add_device callback for dev, replay it to get things in order. 
+*/ + if (!IS_ERR_OR_NULL(ops) && ops->add_device && + dev->bus && !dev->iommu_group) { + int err = ops->add_device(dev); + + if (err) + ops = ERR_PTR(err); + } + return ops; } diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c index 1926918..2a513cc 100644 --- a/drivers/acpi/scan.c +++ b/drivers/acpi/scan.c @@ -1373,20 +1373,25 @@ enum dev_dma_attr acpi_get_dma_attr(struct acpi_device *adev) * @dev: The pointer to the device * @attr: device dma attributes */ -void acpi_dma_configure(struct device *dev, enum dev_dma_attr attr) +int acpi_dma_configure(struct device *dev, enum dev_dma_attr attr) { const struct iommu_ops *iommu; + u64 size; iort_set_dma_mask(dev); iommu = iort_iommu_configure(dev); + if (IS_ERR(iommu)) + return PTR_ERR(iommu); + size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1); /* * Assume dma valid range starts at 0 and covers the whole * coherent_dma_mask. */ - arch_setup_dma_
Re: [PATCH] iommu/dma: Setup iova_domain granule for IOMMU_DMA_MSI cookies
On 5/16/2017 3:55 PM, Auger Eric wrote: Hi, On 13/04/2017 21:38, Nate Watterson wrote: Hi Robin, On 4/13/2017 7:21 AM, Robin Murphy wrote: Hi Nate, On 13/04/17 09:55, Nate Watterson wrote: Currently, the __iommu_dma_{map/free} functions call iova_{offset/align} making them unsuitable for use with iommu_domains having an IOMMU_DMA_MSI cookie since the cookie's iova_domain member, iovad, is uninitialized. Now that iommu_dma_get_msi_page() calls __iommu_dma_map() regardless of cookie type, failures are being seen when mapping MSI target addresses for devices attached to UNMANAGED domains. To work around this issue, the iova_domain granule for IOMMU_DMA_MSI cookies is initialized to the value returned by cookie_msi_granule(). Oh bum. Thanks for the report. However, I really don't like bodging around it with deliberate undefined behaviour. Fixing things properly doesn't seem too hard: I was not especially pleased with my solution, but I wanted to avoid potentially missing any other spots in the code where granule was used uninitialized. The compile time check made me feel a little less dirty about inappropriately using the iova_domain with MSI cookies. 
->8- diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 8348f366ddd1..62618e77bedc 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -396,13 +396,13 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie, dma_addr_t iova, size_t size) { struct iova_domain *iovad = &cookie->iovad; - unsigned long shift = iova_shift(iovad); /* The MSI case is only ever cleaning up its most recent allocation */ if (cookie->type == IOMMU_DMA_MSI_COOKIE) cookie->msi_iova -= size; else - free_iova_fast(iovad, iova >> shift, size >> shift); + free_iova_fast(iovad, iova_pfn(iovad, iova), + size >> iova_shift(iovad)); } static void __iommu_dma_unmap(struct iommu_domain *domain, dma_addr_t dma_addr, @@ -617,11 +617,14 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys, { struct iommu_domain *domain = iommu_get_domain_for_dev(dev); struct iommu_dma_cookie *cookie = domain->iova_cookie; - struct iova_domain *iovad = &cookie->iovad; - size_t iova_off = iova_offset(iovad, phys); + size_t iova_off = 0; dma_addr_t iova; - size = iova_align(iovad, size + iova_off); + if (cookie->type == IOMMU_DMA_IOVA_COOKIE) { + iova_off = iova_offset(&cookie->iovad, phys); + size = iova_align(&cookie->iovad, size + iova_off); + } + iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev); if (!iova) return DMA_ERROR_CODE; -8<- Untested, and you'll probably want to double-check it anyway given that the original oversight was mine in the first place ;) This looks good to me. As Shanker has already mentioned, it does fix the faults we were previously seeing with direct device assignment. I also verified that there aren't any other obvious cases of a granule == 0 being used in the dma_iommu code by adding BUG_ON(!iovad->granule) to iova_{mask/align/offset/...} and running a variety of tests without issue. Are you going to post the patch? I also noticed PCIe passthrough/ARM is broken for me with 4.12-r1. 
I tested Robin's patch as well, on Cavium ThunderX, and this fixes the faults I have seen. Has anyone sent a formal patch? iommu/dma: Don't touch invalid iova_domain members Thanks Eric Robin. Fixes: a44e6657585b ("iommu/dma: Clean up MSI IOVA allocation") Signed-off-by: Nate Watterson --- drivers/iommu/dma-iommu.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 8348f366..d7b0816 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -127,6 +127,16 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base) cookie->msi_iova = base; domain->iova_cookie = cookie; + +/* + * Setup granule for compatibility with __iommu_dma_{alloc/free} and + * add a compile time check to ensure that writing granule won't + * clobber msi_iova. + */ +cookie->iovad.granule = cookie_msi_granule(cookie); +BUILD_BUG_ON(offsetof(struct iova_domain, granule) < +sizeof(cookie->msi_iova)); + return 0; } EXPORT_SYMBOL(iommu_get_msi_cookie); -- Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu/dma: Setup iova_domain granule for IOMMU_DMA_MSI cookies
Currently, the __iommu_dma_{map/free} functions call iova_{offset/align} making them unsuitable for use with iommu_domains having an IOMMU_DMA_MSI cookie since the cookie's iova_domain member, iovad, is uninitialized. Now that iommu_dma_get_msi_page() calls __iommu_dma_map() regardless of cookie type, failures are being seen when mapping MSI target addresses for devices attached to UNMANAGED domains. To work around this issue, the iova_domain granule for IOMMU_DMA_MSI cookies is initialized to the value returned by cookie_msi_granule(). Fixes: a44e6657585b ("iommu/dma: Clean up MSI IOVA allocation") Signed-off-by: Nate Watterson --- drivers/iommu/dma-iommu.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 8348f366..d7b0816 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -127,6 +127,16 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base) cookie->msi_iova = base; domain->iova_cookie = cookie; + + /* +* Setup granule for compatibility with __iommu_dma_{alloc/free} and +* add a compile time check to ensure that writing granule won't +* clobber msi_iova. +*/ + cookie->iovad.granule = cookie_msi_granule(cookie); + BUILD_BUG_ON(offsetof(struct iova_domain, granule) < + sizeof(cookie->msi_iova)); + return 0; } EXPORT_SYMBOL(iommu_get_msi_cookie); -- Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RESEND,3/3] iommu/dma: Plumb in the per-CPU IOVA caches
Hi Robin, On 4/6/2017 2:56 PM, Robin Murphy wrote: On 06/04/17 19:15, Manoj Iyer wrote: On Fri, 31 Mar 2017, Robin Murphy wrote: With IOVA allocation suitably tidied up, we are finally free to opt in to the per-CPU caching mechanism. The caching alone can provide a modest improvement over walking the rbtree for weedier systems (iperf3 shows ~10% more ethernet throughput on an ARM Juno r1 constrained to a single 650MHz Cortex-A53), but the real gain will be in sidestepping the rbtree lock contention which larger ARM-based systems with lots of parallel I/O are starting to feel the pain of. [...] This patch series helps to resolve the Ubuntu bug, where we see the Ubuntu Zesty (4.10 based) kernel reporting multi cpu soft lockups on QDF2400 SDP. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1680549 Wow, how many separate buffers does that driver have mapped at once to spend 22 seconds walking the rbtree for a single allocation!? I'd almost expect that to indicate a deadlock. I'm guessing you wouldn't have seen this on older kernels, since I assume that particular platform is booting via ACPI, so wouldn't have had the SMMU enabled without the IORT support which landed in 4.10. Although this series does improve performance, the soft lockups seen in the Ubuntu bug Manoj mentioned were actually because while the mlx5 interface was being brought up, a huge number of concurrent calls to alloc_iova() were being made with limit_pfn != dma_32bit_pfn so the optimized iova lookup was being bypassed. Internally we worked around the issue by adding a set_dma_mask handler that would call iommu_dma_init_domain() to adjust dma_32bit_pfn to match the input mask. https://source.codeaurora.org/quic/server/kernel/commit/arch/arm64/mm/dma-mapping.c?h=qserver-4.10&id=503b36fd3866cab216fc51a5a4015aaa99daf173 This worked well, however it clearly would not have played nice with your dma-iommu PCI optimizations that force dma_limit to 32-bits so it was never sent out. 
The application of the "PCI allocation optimisations" patch is what actually remedied the cpu soft lockups seen by Manoj. Back to your question of how many buffers does the mlx5 driver have mapped at once. It seems to scale linearly with core count. For example, with 24-cores, after doing 'ifconfig eth up', ~38k calls to alloc_iova() have been made and the min iova is ~0xF600_. With 48-cores, those numbers jump to ~66k calls with min iova ~0xEF00_. Things get really scary when you start using 64k pages. The number of calls to alloc_iova() stays about the same which, when combined with the reserved PCI windows, ends up consuming all of our 32-bit iovas forcing us to once again call alloc_iova() but this time with a limit_pfn != dma_32bit_pfn. This is actually how I stumbled upon the alloc_iova() underflow bug that I posted about a bit earlier. This patch series along with the following cherry-picks from Linus's tree 632b072f iommu/dma: Implement PCI allocation optimisation de84f5f049d9 iommu/dma: Stop getting dma_32bit_pfn wrong were applied to Ubuntu Zesty 4.10 kernel (Ubuntu-4.10.0-18.20) and tested on a QDF2400 SDP. Tested-by: Manoj Iyer Thanks, Robin. -- Manoj Iyer Ubuntu/Canonical ARM Servers - Cloud ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu -- Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu/iova: fix underflow bug in __alloc_and_insert_iova_range
Normally, calling alloc_iova() using an iova_domain with insufficient pfns remaining between start_pfn and dma_limit will fail and return a NULL pointer. Unexpectedly, if such a "full" iova_domain contains an iova with pfn_lo == 0, the alloc_iova() call will instead succeed and return an iova containing invalid pfns. This is caused by an underflow bug in __alloc_and_insert_iova_range() that occurs after walking the "full" iova tree when the search ends at the iova with pfn_lo == 0 and limit_pfn is then adjusted to be just below that (-1). This (now huge) limit_pfn gives the impression that a vast amount of space is available between it and start_pfn and thus a new iova is allocated with the invalid pfn_hi value, 0xFFF . To remedy this, a check is introduced to ensure that adjustments to limit_pfn will not underflow. This issue has been observed in the wild, and is easily reproduced with the following sample code. struct iova_domain *iovad = kzalloc(sizeof(*iovad), GFP_KERNEL); struct iova *rsvd_iova, *good_iova, *bad_iova; unsigned long limit_pfn = 3; unsigned long start_pfn = 1; unsigned long va_size = 2; init_iova_domain(iovad, SZ_4K, start_pfn, limit_pfn); rsvd_iova = reserve_iova(iovad, 0, 0); good_iova = alloc_iova(iovad, va_size, limit_pfn, true); bad_iova = alloc_iova(iovad, va_size, limit_pfn, true); Prior to the patch, this yielded: *rsvd_iova == {0, 0} /* Expected */ *good_iova == {2, 3} /* Expected */ *bad_iova == {-2, -1} /* Oh no... */ After the patch, bad_iova is NULL as expected since inadequate space remains between limit_pfn and start_pfn after allocating good_iova. 
Signed-off-by: Nate Watterson --- drivers/iommu/iova.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b7268a1..f6533e0 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -138,7 +138,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad, break; /* found a free slot */ } adjust_limit_pfn: - limit_pfn = curr_iova->pfn_lo - 1; + limit_pfn = curr_iova->pfn_lo ? (curr_iova->pfn_lo - 1) : 0; move_left: prev = curr; curr = rb_prev(curr); -- Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 0/3] IOVA allocation improvements for iommu-dma
On 2017-03-15 09:33, Robin Murphy wrote: Hi all, Hi Robin, Here's the first bit of lock contention removal to chew on - feedback welcome! Note that for the current users of the io-pgtable framework, this is most likely to simply push more contention onto the io-pgtable lock, so may not show a great improvement alone. Will and I both have rough proof-of-concept implementations of lock-free io-pgtable code which we need to sit down and agree on at some point, hopefully fairly soon. I've taken the opportunity to do a bit of cleanup and refactoring within the series to make the final state of the code nicer, but the diffstat still turns out surprisingly reasonable in the end - it would actually be negative but for the new comments! Magnus, Shimoda-san, the first two patches should be of interest as they constitute the allocation rework I mentioned a while back[1] - if you still need to implement that scary workaround, this should make it simple to hook IPMMU-specific calls into the alloc and free paths, and let the driver take care of the details internally. I've tested your patches on a QDF2400 platform and generally see modest improvements in iperf/fio performance. As you suspected would happen, contention has indeed moved to the io-pgtable lock. I am looking forward to testing with the lock-free io-pgtable implementation, however I suspect that there will still be contention issues acquiring the (SMMUv3) cmdq lock on the unmap path. Reviewed/Tested-by: Nate Watterson Robin. [1]:https://lists.linuxfoundation.org/pipermail/iommu/2017-January/020189.html Robin Murphy (3): iommu/dma: Convert to address-based allocation iommu/dma: Clean up MSI IOVA allocation iommu/dma: Plumb in the per-CPU IOVA caches drivers/iommu/dma-iommu.c | 176 -- 1 file changed, 90 insertions(+), 86 deletions(-) -- Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. 
is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH v2 3/5] iommu/arm-smmu-v3: Make arm_smmu_install_ste_for_dev return void
Hi Will, On 2017-03-10 15:49, Will Deacon wrote: arm_smmu_install_ste_for_dev cannot fail and always returns 0, however the fact that it returns int means that callers end up implementing redundant error handling code which complicates STE tracking and is never executed. This patch changes the return type of arm_smmu_install_ste_for_dev to avoid, to make it explicit that it cannot fail. Did you mean "a void" or just "void" instead of "avoid"? Signed-off-by: Will Deacon --- drivers/iommu/arm-smmu-v3.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 3d38e682071a..e18dbcd26f66 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1579,7 +1579,7 @@ static __le64 *arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid) return step; } -static int arm_smmu_install_ste_for_dev(struct iommu_fwspec *fwspec) +static void arm_smmu_install_ste_for_dev(struct iommu_fwspec *fwspec) { int i; struct arm_smmu_master_data *master = fwspec->iommu_priv; @@ -1591,8 +1591,6 @@ static int arm_smmu_install_ste_for_dev(struct iommu_fwspec *fwspec) arm_smmu_write_strtab_ent(smmu, sid, step, &master->ste); } - - return 0; } static void arm_smmu_detach_dev(struct device *dev) @@ -1600,8 +1598,7 @@ static void arm_smmu_detach_dev(struct device *dev) struct arm_smmu_master_data *master = dev->iommu_fwspec->iommu_priv; master->ste.bypass = true; - if (arm_smmu_install_ste_for_dev(dev->iommu_fwspec) < 0) - dev_warn(dev, "failed to install bypass STE\n"); + arm_smmu_install_ste_for_dev(dev->iommu_fwspec); } static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) @@ -1653,10 +1650,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev) ste->s2_cfg = &smmu_domain->s2_cfg; } - ret = arm_smmu_install_ste_for_dev(dev->iommu_fwspec); - if (ret < 0) - ste->valid = false; - + arm_smmu_install_ste_for_dev(dev->iommu_fwspec); out_unlock: 
mutex_unlock(&smmu_domain->init_mutex); return ret; -- Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 4/5] iommu/arm-smmu-v3: Install bypass STEs for IOMMU_DOMAIN_IDENTITY domains
Hi Will, On 2017-03-10 15:49, Will Deacon wrote: In preparation for allowing the default domain type to be overridden, this patch adds support for IOMMU_DOMAIN_IDENTITY domains to the ARM SMMUv3 driver. An identity domain is created by placing the corresponding stream table entries into "bypass" mode, which allows transactions to flow through the SMMU without any translation. What about masters that require SMMU intervention to override their native memory attributes to make them consistent with the CCA (acpi) or dma-coherent (dt) values specified in FW? To make sure those cases are handled, you could store away the master's coherency setting in its strtab_ent at attach time and then setup STE[MemAttr/ALLOCCFG/SHCFG] so the attributes are forced to the correct values while still bypassing translation. Signed-off-by: Will Deacon --- drivers/iommu/arm-smmu-v3.c | 58 + 1 file changed, 37 insertions(+), 21 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index e18dbcd26f66..75fa4809f49e 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -554,9 +554,14 @@ struct arm_smmu_s2_cfg { }; struct arm_smmu_strtab_ent { - boolvalid; - - boolbypass; /* Overrides s1/s2 config */ + /* +* An STE is "assigned" if the master emitting the corresponding SID +* is attached to a domain. The behaviour of an unassigned STE is +* determined by the disable_bypass parameter, whereas an assigned +* STE behaves according to s1_cfg/s2_cfg, which themselves are +* configured according to the domain type. 
+*/ + boolassigned; struct arm_smmu_s1_cfg *s1_cfg; struct arm_smmu_s2_cfg *s2_cfg; }; @@ -632,6 +637,7 @@ enum arm_smmu_domain_stage { ARM_SMMU_DOMAIN_S1 = 0, ARM_SMMU_DOMAIN_S2, ARM_SMMU_DOMAIN_NESTED, + ARM_SMMU_DOMAIN_BYPASS, }; struct arm_smmu_domain { @@ -1005,9 +1011,9 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, * This is hideously complicated, but we only really care about * three cases at the moment: * -* 1. Invalid (all zero) -> bypass (init) -* 2. Bypass -> translation (attach) -* 3. Translation -> bypass (detach) +* 1. Invalid (all zero) -> bypass/fault (init) +* 2. Bypass/fault -> translation/bypass (attach) +* 3. Translation/bypass -> bypass/fault (detach) * * Given that we can't update the STE atomically and the SMMU * doesn't read the thing in a defined order, that leaves us @@ -1046,11 +1052,15 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, } /* Nuke the existing STE_0 value, as we're going to rewrite it */ - val = ste->valid ? STRTAB_STE_0_V : 0; + val = STRTAB_STE_0_V; + + /* Bypass/fault */ + if (!ste->assigned || !(ste->s1_cfg || ste->s2_cfg)) { + if (!ste->assigned && disable_bypass) + val |= STRTAB_STE_0_CFG_ABORT; + else + val |= STRTAB_STE_0_CFG_BYPASS; - if (ste->bypass) { - val |= disable_bypass ? 
STRTAB_STE_0_CFG_ABORT - : STRTAB_STE_0_CFG_BYPASS; dst[0] = cpu_to_le64(val); dst[1] = cpu_to_le64(STRTAB_STE_1_SHCFG_INCOMING << STRTAB_STE_1_SHCFG_SHIFT); @@ -,10 +1121,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, static void arm_smmu_init_bypass_stes(u64 *strtab, unsigned int nent) { unsigned int i; - struct arm_smmu_strtab_ent ste = { - .valid = true, - .bypass = true, - }; + struct arm_smmu_strtab_ent ste = { .assigned = false }; for (i = 0; i < nent; ++i) { arm_smmu_write_strtab_ent(NULL, -1, strtab, &ste); @@ -1378,7 +1385,9 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type) { struct arm_smmu_domain *smmu_domain; - if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA) + if (type != IOMMU_DOMAIN_UNMANAGED && + type != IOMMU_DOMAIN_DMA && + type != IOMMU_DOMAIN_IDENTITY) return NULL; /* @@ -1509,6 +1518,11 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain) struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); struct arm_smmu_device *smmu = smmu_domain->smmu; + if (domain->type == IOMMU_DOMAIN_IDENTITY) { + smmu_domain->stage = ARM_SMMU_DOMAIN_BYPASS; + return 0; + } + /* Restrict the stage to what we can actually support */ if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
Re: [PATCH V7 08/11] drivers: acpi: Handle IOMMU lookup failure with deferred probing or error
On 2017-02-01 13:52, Lorenzo Pieralisi wrote: I debugged the issue and Nate's fix is correct, the fact that you can't hit it with mainline is just a matter of timing because it has to do with the CTX pointer value (we OR it with the existing value), so it may work or not depending on how the cdptr memory allocation pattern turns out to be (which explains why Nate and I can hit it with simple PCI device remove/add execution too). So it is neither an ACPI nor an IOMMU probe deferral issue per-se, fix is already queued, so it is all good. What about USB stalls ? Our fault. The USB controller was getting 48-bit IOVAs, but the bus it sits on only supports 44-bits so the controller's DMA transactions ended up targeting addresses that were never mapped. It started working once I applied the iort/dma-mapping patches I sent out earlier this week that use the iort memory_address_limit field to check if a dma_mask is sane. Sorry for the false alarm. Nate Thanks ! Lorenzo -- Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH V7 08/11] drivers: acpi: Handle IOMMU lookup failure with deferred probing or error
On 2017-01-30 09:38, Will Deacon wrote: On Mon, Jan 30, 2017 at 09:33:50AM -0500, Sinan Kaya wrote: On 1/30/2017 9:23 AM, Nate Watterson wrote: > On 2017-01-30 08:59, Sinan Kaya wrote: >> On 1/30/2017 7:22 AM, Robin Murphy wrote: >>> On 29/01/17 17:53, Sinan Kaya wrote: >>>> On 1/24/2017 7:37 AM, Lorenzo Pieralisi wrote: >>>>> [+hanjun, tomasz, sinan] >>>>> >>>>> It is quite a key patchset, I would be glad if they can test on their >>>>> respective platforms with IORT. >>>>> >>>> >>>> Tested on top of 4.10-rc5. >>>> >>>> 1.Platform Hidma device passed dmatest >>>> 2.Seeing some USB stalls on a platform USB device. >>>> 3.PCIe NVME drive probed and worked fine with MSI interrupts after boot. >>>> 4. NVMe driver didn't probe following a hotplug insertion and received an >>>> SMMU error event during the insertion. >>> >>> What was the SMMU error - a translation/permission fault (implying the >>> wrong DMA ops) or a bad STE fault (implying we totally failed to tell >>> the SMMU about the device at all)? >>> >> >> root@ubuntu:/sys/bus/pci/slots/4# echo 0 > power >> >> [__204.698522]_iommu:_Removing_device_0003:01:00.0_from_group_0 >> [ 204.708704] pciehp 0003:00:00.0:pcie004: Slot(4): Link Down >> [ 204.708723] pciehp 0003:00:00.0:pcie004: Slot(4): Link Down event >> ignored; already powering off >> >> root@ubuntu:/sys/bus/pci/slots/4# >> >> [__254.820440]_iommu:_Adding_device_0003:01:00.0_to_group_8 >> [ 254.820599] nvme nvme0: pci function 0003:01:00.0 >> [ 254.820621] nvme 0003:01:00.0: enabling device ( -> 0002) >> [ 261.948558] arm-smmu-v3 arm-smmu-v3.0.auto: event 0x0a received: >> [ 261.948561] arm-smmu-v3 arm-smmu-v3.0.auto: 0x010a >> [ 261.948563] arm-smmu-v3 arm-smmu-v3.0.auto: 0x >> [ 261.948564] arm-smmu-v3 arm-smmu-v3.0.auto: 0x >> [ 261.948566] arm-smmu-v3 arm-smmu-v3.0.auto: 0x > Looks like C_BAD_CD. Can you please try with: > iommu/arm-smmu-v3: Clear prior settings when updating STEs This resolved the issue. 
Can we pull Nate's patch to 4.10 so that I don't see this issue again. I already sent the pull request to Joerg for 4.11. Do you see this problem without Sricharan's patches (i.e. vanilla mainline)? If so, we'll need to send the patch to stable after -rc1. Using vanilla mainline, I see it most commonly when directly assigning a device to a guest machine. I think I've also seen it after removing then re-adding a PCI device. Basically anytime an STE's CTX pointer is changed from a non-NULL value and STE[CFG] indicates translation will be performed. Nate Will -- Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V7 08/11] drivers: acpi: Handle IOMMU lookup failure with deferred probing or error
On 2017-01-30 08:59, Sinan Kaya wrote: On 1/30/2017 7:22 AM, Robin Murphy wrote: On 29/01/17 17:53, Sinan Kaya wrote: On 1/24/2017 7:37 AM, Lorenzo Pieralisi wrote: [+hanjun, tomasz, sinan] It is quite a key patchset, I would be glad if they can test on their respective platforms with IORT. Tested on top of 4.10-rc5. 1. Platform Hidma device passed dmatest 2. Seeing some USB stalls on a platform USB device. 3. PCIe NVME drive probed and worked fine with MSI interrupts after boot. 4. NVMe driver didn't probe following a hotplug insertion and received an SMMU error event during the insertion. What was the SMMU error - a translation/permission fault (implying the wrong DMA ops) or a bad STE fault (implying we totally failed to tell the SMMU about the device at all)? root@ubuntu:/sys/bus/pci/slots/4# echo 0 > power [ 204.698522] iommu: Removing device 0003:01:00.0 from group 0 [ 204.708704] pciehp 0003:00:00.0:pcie004: Slot(4): Link Down [ 204.708723] pciehp 0003:00:00.0:pcie004: Slot(4): Link Down event ignored; already powering off root@ubuntu:/sys/bus/pci/slots/4# [ 254.820440] iommu: Adding device 0003:01:00.0 to group 8 [ 254.820599] nvme nvme0: pci function 0003:01:00.0 [ 254.820621] nvme 0003:01:00.0: enabling device ( -> 0002) [ 261.948558] arm-smmu-v3 arm-smmu-v3.0.auto: event 0x0a received: [ 261.948561] arm-smmu-v3 arm-smmu-v3.0.auto: 0x010a [ 261.948563] arm-smmu-v3 arm-smmu-v3.0.auto: 0x [ 261.948564] arm-smmu-v3 arm-smmu-v3.0.auto: 0x [ 261.948566] arm-smmu-v3 arm-smmu-v3.0.auto: 0x Looks like C_BAD_CD. Can you please try with: iommu/arm-smmu-v3: Clear prior settings when updating STEs root@ubuntu:/sys/bus/pci/slots/4# root@ubuntu:/sys/bus/pci/slots/4#ls /dev/nvme* /dev/nvme0 I should have seen /dev/nvme0n1 partition here. Robin. 
/sys/bus/pci/slots/4 # /sys/bus/pci/slots/4 # dmesg | grep nvme [ 14.041357] nvme nvme0: pci function 0003:01:00.0 [ 198.399521] nvme nvme0: pci function 0003:01:00.0 [ 198.416232] nvme 0003:01:00.0: enabling device ( -> 0002) [ 264.402216] nvme nvme0: I/O 228 QID 0 timeout, disable controller [ 264.402313] nvme nvme0: Identify Controller failed (-4) [ 264.421270] nvme nvme0: Removing after probe failure status: -5 /sys/bus/pci/slots/4 # -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH] iommu/arm-smmu-v3: limit use of 2-level stream tables
In the current arm-smmu-v3 driver, all smmus that support 2-level stream tables are being forced to use them. This is suboptimal for smmus that support fewer stream id bits than would fill a single second level table. This patch limits the use of 2-level tables to smmus that both support the feature and whose first level table can possibly contain more than a single entry. Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 4d6ec44..7d1a7e5 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1983,17 +1983,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu) u32 size, l1size; struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg; - /* -* If we can resolve everything with a single L2 table, then we -* just need a single L1 descriptor. Otherwise, calculate the L1 -* size, capped to the SIDSIZE. -*/ - if (smmu->sid_bits < STRTAB_SPLIT) { - size = 0; - } else { - size = STRTAB_L1_SZ_SHIFT - (ilog2(STRTAB_L1_DESC_DWORDS) + 3); - size = min(size, smmu->sid_bits - STRTAB_SPLIT); - } + /* Calculate the L1 size, capped to the SIDSIZE. */ + size = STRTAB_L1_SZ_SHIFT - (ilog2(STRTAB_L1_DESC_DWORDS) + 3); + size = min(size, smmu->sid_bits - STRTAB_SPLIT); cfg->num_l1_ents = 1 << size; size += STRTAB_SPLIT; @@ -2504,6 +2496,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu) smmu->ssid_bits = reg >> IDR1_SSID_SHIFT & IDR1_SSID_MASK; smmu->sid_bits = reg >> IDR1_SID_SHIFT & IDR1_SID_MASK; + /* +* If the SMMU supports fewer bits than would fill a single L2 stream +* table, use a linear table instead. +*/ + if (smmu->sid_bits <= STRTAB_SPLIT) + smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB; + /* IDR5 */ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
[PATCH] iommu/arm-smmu-v3: Clear prior settings when updating STEs
To prevent corruption of the stage-1 context pointer field when updating STEs, rebuild the entire containing dword instead of clearing individual fields. Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 4d6ec44..94f305d 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1042,13 +1042,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, } } - /* Nuke the existing Config, as we're going to rewrite it */ - val &= ~(STRTAB_STE_0_CFG_MASK << STRTAB_STE_0_CFG_SHIFT); - - if (ste->valid) - val |= STRTAB_STE_0_V; - else - val &= ~STRTAB_STE_0_V; + /* Nuke the existing STE_0 value, as we're going to rewrite it */ + val = ste->valid ? STRTAB_STE_0_V : 0; if (ste->bypass) { val |= disable_bypass ? STRTAB_STE_0_CFG_ABORT @@ -1083,7 +1078,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK << STRTAB_STE_0_S1CTXPTR_SHIFT) | STRTAB_STE_0_CFG_S1_TRANS; - } if (ste->s2_cfg) {
[PATCH] iommu/arm-smmu-v3: prevent corruption of ste stage-1 context ptr
To ensure that the stage-1 context ptr for an ste points to the intended context descriptor, this patch adds code to clear away the stale context ptr value before OR-ing in the new one. Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 4d6ec44..093f9f1 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1080,6 +1080,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid, if (smmu->features & ARM_SMMU_FEAT_STALLS) dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD); + val &= ~(STRTAB_STE_0_S1CTXPTR_MASK << +STRTAB_STE_0_S1CTXPTR_SHIFT); val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK << STRTAB_STE_0_S1CTXPTR_SHIFT) | STRTAB_STE_0_CFG_S1_TRANS;
[PATCH] iommu/arm-smmu-v3: avoid over allocating for l2 stream tables
Currently, all l2 stream tables are being allocated with space for (1 << STRTAB_SPLIT) stream table entries, even on smmus whose sid_bits are fewer than STRTAB_SPLIT. This patch avoids the over allocation by limiting the span of an l2 table to min(sid_bits, STRTAB_SPLIT). Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 4d6ec44..5dca671 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -1129,6 +1129,7 @@ static void arm_smmu_init_bypass_stes(u64 *strtab, unsigned int nent) static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid) { + u8 span; size_t size; void *strtab; struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg; @@ -1137,10 +1138,11 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid) if (desc->l2ptr) return 0; - size = 1 << (STRTAB_SPLIT + ilog2(STRTAB_STE_DWORDS) + 3); + span = (smmu->sid_bits < STRTAB_SPLIT) ? smmu->sid_bits : STRTAB_SPLIT; + size = 1 << (span + ilog2(STRTAB_STE_DWORDS) + 3); strtab = &cfg->strtab[(sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS]; - desc->span = STRTAB_SPLIT + 1; + desc->span = span + 1; desc->l2ptr = dmam_alloc_coherent(smmu->dev, size, &desc->l2ptr_dma, GFP_KERNEL | __GFP_ZERO); if (!desc->l2ptr) { @@ -1150,7 +1152,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid) return -ENOMEM; } - arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT); + arm_smmu_init_bypass_stes(desc->l2ptr, 1 << span); arm_smmu_write_strtab_l1_desc(strtab, desc); return 0; } @@ -2001,6 +2003,8 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu) dev_warn(smmu->dev, "2-level strtab only covers %u/%u bits of SID\n", size, smmu->sid_bits); + else if (smmu->sid_bits < size) + size = smmu->sid_bits; l1size = cfg->num_l1_ents * (STRTAB_L1_DESC_DWORDS << 3); strtab = dmam_alloc_coherent(smmu->dev, l1size, &cfg->strtab_dma,
Re: [PATCH v5 14/14] drivers: acpi: iort: introduce iort_iommu_configure
On 2016-09-09 10:23, Lorenzo Pieralisi wrote: DT based systems have a generic kernel API to configure IOMMUs for devices (ie of_iommu_configure()). On ARM based ACPI systems, the of_iommu_configure() equivalent can be implemented atop ACPI IORT kernel API, with the corresponding functions to map device identifiers to IOMMUs and retrieve the corresponding IOMMU operations necessary for DMA operations set-up. By relying on the iommu_fwspec generic kernel infrastructure, implement the IORT based IOMMU configuration for ARM ACPI systems and hook it up in the ACPI kernel layer that implements DMA configuration for a device. Signed-off-by: Lorenzo Pieralisi Cc: Hanjun Guo Cc: Tomasz Nowicki Cc: "Rafael J. Wysocki" --- drivers/acpi/arm64/iort.c | 96 +++ drivers/acpi/scan.c | 7 +++- include/linux/acpi_iort.h | 6 +++ 3 files changed, 108 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c index 7c68eb4..55a4ae9 100644 --- a/drivers/acpi/arm64/iort.c +++ b/drivers/acpi/arm64/iort.c @@ -19,6 +19,7 @@ #define pr_fmt(fmt) "ACPI: IORT: " fmt #include +#include #include #include #include @@ -27,6 +28,8 @@ #define IORT_TYPE_MASK(type) (1 << (type)) #define IORT_MSI_TYPE (1 << ACPI_IORT_NODE_ITS_GROUP) +#define IORT_IOMMU_TYPE ((1 << ACPI_IORT_NODE_SMMU) | \ + (1 << ACPI_IORT_NODE_SMMU_V3)) struct iort_its_msi_chip { struct list_head list; @@ -467,6 +470,99 @@ struct irq_domain *iort_get_device_domain(struct device *dev, u32 req_id) return irq_find_matching_fwnode(handle, DOMAIN_BUS_PCI_MSI); } +static int __get_pci_rid(struct pci_dev *pdev, u16 alias, void *data) +{ + u32 *rid = data; + + *rid = alias; + return 0; +} + +static int arm_smmu_iort_xlate(struct device *dev, u32 streamid, + struct fwnode_handle *fwnode) +{ + int ret = iommu_fwspec_init(dev, fwnode); + + if (!ret) + ret = iommu_fwspec_add_ids(dev, &streamid, 1); + + return ret; +} + +static const struct iommu_ops *iort_iommu_xlate(struct device *dev, + struct acpi_iort_node
*node, + u32 streamid) +{ + struct fwnode_handle *iort_fwnode = NULL; + int ret; + + if (node) { + iort_fwnode = iort_get_fwnode(node); + if (!iort_fwnode) + return NULL; + + ret = arm_smmu_iort_xlate(dev, streamid, + iort_fwnode); + if (!ret) + return fwspec_iommu_get_ops(iort_fwnode); + } + + return NULL; +} + +/** + * iort_iommu_configure - Set-up IOMMU configuration for a device. + * + * @dev: device to configure + * + * Returns: iommu_ops pointer on configuration success + * NULL on configuration failure + */ +const struct iommu_ops *iort_iommu_configure(struct device *dev) +{ + struct acpi_iort_node *node, *parent; + const struct iommu_ops *ops = NULL; + u32 streamid = 0; + + if (dev_is_pci(dev)) { + struct pci_bus *bus = to_pci_dev(dev)->bus; + u32 rid; + + pci_for_each_dma_alias(to_pci_dev(dev), __get_pci_rid, + &rid); + + node = iort_scan_node(ACPI_IORT_NODE_PCI_ROOT_COMPLEX, + iort_match_node_callback, &bus->dev); + if (!node) + return NULL; + + parent = iort_node_map_rid(node, rid, &streamid, + IORT_IOMMU_TYPE); + + ops = iort_iommu_xlate(dev, parent, streamid); + + } else { + int i = 0; + + node = iort_scan_node(ACPI_IORT_NODE_NAMED_COMPONENT, + iort_match_node_callback, dev); + if (!node) + return NULL; + Nothing wrong with your code here, but wanted to warn you that there appears to be a bug in iort_match_node_callback() for NAMED_COMPONENTS. iort_match_node_callback() { acpi_status status = AE_NOT_FOUND; ... case ACPI_IORT_NODE_NAMED_COMPONENT: { ... status = acpi_get_name(adev->handle, ACPI_FULL_PATHNAME, &buf); if (ACPI_FAILURE(status)) { dev_warn(dev, "Can't get device full path name\n"); break; } ncomp = (struct acpi_iort_named_component *)node->node_data; if (!strcmp(ncomp->device_name, buf.pointer)) status = AE_OK; acpi_os_free(buf.pointer); break; } ... return status; } Notice how if strcmp() fails, status remains set to the AE_OK value returned by the successful acpi_get_name() call, so the scan stops and reports a match on the first named component it visits even when the path names differ.
[PATCH] iommu/iova: validate iova_domain input to put_iova_domain
Passing a NULL or uninitialized iova_domain into put_iova_domain will currently crash the kernel when the unconfigured iova_domain data members are accessed. To prevent this, this patch adds a check that the domain is non-NULL and that the domain granule is non-zero. The granule can be used to check whether the domain was properly initialized, because calling init_iova_domain with a granule of zero would already have triggered a BUG statement crashing the kernel. Signed-off-by: Nate Watterson --- drivers/iommu/iova.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index e23001b..3511a1c 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -459,6 +459,10 @@ void put_iova_domain(struct iova_domain *iovad) struct rb_node *node; unsigned long flags; + /* Only teardown properly initialized domains */ + if (!iovad || !iovad->granule) + return; + free_iova_rcaches(iovad); spin_lock_irqsave(&iovad->iova_rbtree_lock, flags); node = rb_first(&iovad->rbroot);
[PATCH] iommu/dma-iommu: respect established iova region limits
In the current dma-iommu implementation, the upper limit used when allocating iovas is based on the calling device's dma_mask, without considering the potentially more restrictive iova limits established in iommu_dma_init_domain. To ensure that iovas are allocated within the expected iova region, this patch adds logic in __alloc_iova to clip input dma_limit values that are out of bounds. Signed-off-by: Nate Watterson --- drivers/iommu/dma-iommu.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index ea5a9eb..2066066 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -157,11 +157,14 @@ static struct iova *__alloc_iova(struct iova_domain *iovad, size_t size, unsigned long shift = iova_shift(iovad); unsigned long length = iova_align(iovad, size) >> shift; + /* Respect the upper limit established in iommu_dma_init_domain */ + dma_limit = min_t(dma_addr_t, dma_limit >> shift, iovad->dma_32bit_pfn); + /* * Enforce size-alignment to be safe - there could perhaps be an * attribute to control this per-device, or at least per-domain... */ - return alloc_iova(iovad, length, dma_limit >> shift, true); + return alloc_iova(iovad, length, dma_limit, true); } /* The IOVA allocator knows what we mapped, so just unmap whatever that was */
[PATCH v2] iommu/arm-smmu-v3: limit use of 2-level stream tables
In the current arm-smmu-v3 driver, all smmus that support 2-level stream tables are being forced to use them. This is suboptimal for smmus that support fewer stream id bits than would fill a single second level table. This patch limits the use of 2-level tables to smmus that both support the feature and whose first level table can possibly contain more than a single entry. Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 5f6b3bc..f27b8dc 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2033,17 +2033,9 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu) u32 size, l1size; struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg; - /* -* If we can resolve everything with a single L2 table, then we -* just need a single L1 descriptor. Otherwise, calculate the L1 -* size, capped to the SIDSIZE. -*/ - if (smmu->sid_bits < STRTAB_SPLIT) { - size = 0; - } else { - size = STRTAB_L1_SZ_SHIFT - (ilog2(STRTAB_L1_DESC_DWORDS) + 3); - size = min(size, smmu->sid_bits - STRTAB_SPLIT); - } + /* Calculate the L1 size, capped to the SIDSIZE. */ + size = STRTAB_L1_SZ_SHIFT - (ilog2(STRTAB_L1_DESC_DWORDS) + 3); + size = min(size, smmu->sid_bits - STRTAB_SPLIT); cfg->num_l1_ents = 1 << size; size += STRTAB_SPLIT; @@ -2531,6 +2523,13 @@ static int arm_smmu_device_probe(struct arm_smmu_device *smmu) smmu->ssid_bits = reg >> IDR1_SSID_SHIFT & IDR1_SSID_MASK; smmu->sid_bits = reg >> IDR1_SID_SHIFT & IDR1_SID_MASK; + /* +* If the SMMU supports fewer bits than would fill a single L2 stream +* table, use a linear table instead. +*/ + if (smmu->sid_bits <= STRTAB_SPLIT) + smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB; + /* IDR5 */ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);
[PATCH] iommu/arm-smmu-v3: limit use of 2-level stream tables
In the current arm-smmu-v3 driver, all smmus that support 2-level stream tables are being forced to use them. This is suboptimal for smmus that support fewer stream id bits than would fill a single second level table. This patch limits the use of 2-level tables to smmus that both support the feature and whose first level table can possibly contain more than a single entry. Signed-off-by: Nate Watterson --- drivers/iommu/arm-smmu-v3.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c index 5f6b3bc..742254c 100644 --- a/drivers/iommu/arm-smmu-v3.c +++ b/drivers/iommu/arm-smmu-v3.c @@ -2531,6 +2531,17 @@ static int arm_smmu_device_probe(struct arm_smmu_device *smmu) smmu->ssid_bits = reg >> IDR1_SSID_SHIFT & IDR1_SSID_MASK; smmu->sid_bits = reg >> IDR1_SID_SHIFT & IDR1_SID_MASK; + /* +* If the SMMU supports fewer bits than would fill a single L2 stream +* table, use a linear table instead. +*/ + if (smmu->sid_bits <= STRTAB_SPLIT && + smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) { + smmu->features &= ~ARM_SMMU_FEAT_2_LVL_STRTAB; + dev_info(smmu->dev, "SIDSIZE (%d) <= STRTAB_SPLIT (%d) : disabling 2-level stream tables\n", + smmu->sid_bits, STRTAB_SPLIT); + } + /* IDR5 */ reg = readl_relaxed(smmu->base + ARM_SMMU_IDR5);