[PATCH v6 7/7] iommu/amd: Use only natural aligned flushes in a VM

2021-07-23 Thread Nadav Amit
From: Nadav Amit 

When running on an AMD vIOMMU, it is better to avoid TLB flushes
of unmodified PTEs. vIOMMUs require the hypervisor to synchronize the
virtualized IOMMU's PTEs with the physical ones. This process induces
overheads.

The AMD IOMMU allows us to flush any naturally aligned power-of-2
range. So when running on top of a vIOMMU, break the range into
sub-ranges that are naturally aligned, and flush each one separately.
This approach is better when running with a vIOMMU, but on physical
IOMMUs the penalty of IOTLB misses due to unnecessarily flushed entries
is likely to be low.

Repurpose domain_flush_pages() (i.e., keep the name, change the logic)
so that it chooses whether to perform a single flush of the whole range
or multiple flushes of naturally aligned sub-ranges, avoiding flushes of
unnecessary ranges. Use NpCache, as usual, to infer whether the IOMMU is
physical or virtual.
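
For illustration, the decomposition that the new domain_flush_pages()
performs can be modeled with a small stand-alone C sketch (illustration
only, not part of the patch; the helper names are made up). It splits an
arbitrary range into naturally aligned power-of-2 chunks, mirroring what
the loop below does with __ffs()/__fls():

#include <stdio.h>
#include <stdint.h>

static int lowest_set_bit(uint64_t x)  { return __builtin_ctzll(x); }
static int highest_set_bit(uint64_t x) { return 63 - __builtin_clzll(x); }

static void flush_naturally_aligned(uint64_t address, uint64_t size)
{
	while (size != 0) {
		/* largest power-of-2 chunk that still fits in the range */
		int min_alignment = highest_set_bit(size);
		uint64_t flush_size;

		/* address 0 is aligned to everything; otherwise also limit
		 * the chunk by the natural alignment of the address */
		if (address != 0 && lowest_set_bit(address) < min_alignment)
			min_alignment = lowest_set_bit(address);

		flush_size = 1ull << min_alignment;
		printf("flush [0x%llx, 0x%llx)\n",
		       (unsigned long long)address,
		       (unsigned long long)(address + flush_size));
		address += flush_size;
		size -= flush_size;
	}
}

int main(void)
{
	/* 0x3000..0x8000 becomes two flushes: 4KiB at 0x3000, 16KiB at 0x4000 */
	flush_naturally_aligned(0x3000, 0x5000);
	return 0;
}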

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Suggested-by: Robin Murphy 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 47 ++-
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 7846fcb1e92b..3f6428aa68cd 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1261,15 +1261,52 @@ static void __domain_flush_pages(struct protection_domain *domain,
 }
 
 static void domain_flush_pages(struct protection_domain *domain,
-  u64 address, size_t size)
+  u64 address, size_t size, int pde)
 {
-   __domain_flush_pages(domain, address, size, 0);
+   if (likely(!amd_iommu_np_cache)) {
+   __domain_flush_pages(domain, address, size, pde);
+   return;
+   }
+
+   /*
+* When NpCache is on, we infer that we run in a VM and use a vIOMMU.
+* In such setups it is best to avoid flushes of ranges which are not
+* naturally aligned, since it would lead to flushes of unmodified
+* PTEs. Such flushes would require the hypervisor to do more work than
+* necessary. Therefore, perform repeated flushes of aligned ranges
+* until you cover the range. Each iteration flushes the smaller
+* between the natural alignment of the address that we flush and the
+* greatest naturally aligned region that fits in the range.
+*/
+   while (size != 0) {
+   int addr_alignment = __ffs(address);
+   int size_alignment = __fls(size);
+   int min_alignment;
+   size_t flush_size;
+
+   /*
+* size is always non-zero, but address might be zero, causing
+* addr_alignment to be negative. As the casting of the
+* argument in __ffs(address) to long might trim the high bits
+* of the address on x86-32, cast to long when doing the check.
+*/
+   if (likely((unsigned long)address != 0))
+   min_alignment = min(addr_alignment, size_alignment);
+   else
+   min_alignment = size_alignment;
+
+   flush_size = 1ul << min_alignment;
+
+   __domain_flush_pages(domain, address, flush_size, pde);
+   address += flush_size;
+   size -= flush_size;
+   }
 }
 
 /* Flush the whole IO/TLB for a given protection domain - including PDE */
 void amd_iommu_domain_flush_tlb_pde(struct protection_domain *domain)
 {
-   __domain_flush_pages(domain, 0, CMD_INV_IOMMU_ALL_PAGES_ADDRESS, 1);
+   domain_flush_pages(domain, 0, CMD_INV_IOMMU_ALL_PAGES_ADDRESS, 1);
 }
 
 void amd_iommu_domain_flush_complete(struct protection_domain *domain)
@@ -1296,7 +1333,7 @@ static void domain_flush_np_cache(struct protection_domain *domain,
unsigned long flags;
 
spin_lock_irqsave(&domain->lock, flags);
-   domain_flush_pages(domain, iova, size);
+   domain_flush_pages(domain, iova, size, 1);
amd_iommu_domain_flush_complete(domain);
spin_unlock_irqrestore(&domain->lock, flags);
}
@@ -2200,7 +2237,7 @@ static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
unsigned long flags;
 
spin_lock_irqsave(&dom->lock, flags);
-   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+   domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
amd_iommu_domain_flush_complete(dom);
spin_unlock_irqrestore(&dom->lock, flags);
 }
-- 
2.25.1



[PATCH v6 6/7] iommu/amd: Sync once for scatter-gather operations

2021-07-23 Thread Nadav Amit
From: Nadav Amit 

On virtual machines, software must flush the IOTLB after each page table
entry update.

The iommu_map_sg() code iterates through the given scatter-gather list
and invokes iommu_map() for each element in the scatter-gather list,
which calls into the vendor IOMMU driver through the iommu_ops callback.
As a result, a single sg mapping may lead to multiple IOTLB flushes.

Fix this by adding an amd_iotlb_sync_map() callback and flushing at this
point, after all sg mappings are set.

This commit follows and is inspired by commit 933fcd01e97e2
("iommu/vt-d: Add iotlb_sync_map callback").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index f5d5f2124391..7846fcb1e92b 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2022,6 +2022,16 @@ static int amd_iommu_attach_device(struct iommu_domain *dom,
return ret;
 }
 
+static void amd_iommu_iotlb_sync_map(struct iommu_domain *dom,
+unsigned long iova, size_t size)
+{
+   struct protection_domain *domain = to_pdomain(dom);
+   struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+
+   if (ops->map)
+   domain_flush_np_cache(domain, iova, size);
+}
+
 static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
 phys_addr_t paddr, size_t page_size, int iommu_prot,
 gfp_t gfp)
@@ -2040,10 +2050,8 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
if (iommu_prot & IOMMU_WRITE)
prot |= IOMMU_PROT_IW;
 
-   if (ops->map) {
+   if (ops->map)
ret = ops->map(ops, iova, paddr, page_size, prot, gfp);
-   domain_flush_np_cache(domain, iova, page_size);
-   }
 
return ret;
 }
@@ -2223,6 +2231,7 @@ const struct iommu_ops amd_iommu_ops = {
.attach_dev = amd_iommu_attach_device,
.detach_dev = amd_iommu_detach_device,
.map = amd_iommu_map,
+   .iotlb_sync_map = amd_iommu_iotlb_sync_map,
.unmap = amd_iommu_unmap,
.iova_to_phys = amd_iommu_iova_to_phys,
.probe_device = amd_iommu_probe_device,
-- 
2.25.1



[PATCH v6 5/7] iommu/amd: Tailored gather logic for AMD

2021-07-23 Thread Nadav Amit
From: Nadav Amit 

AMD's IOMMU can flush any range efficiently (i.e., in a single flush).
This is in contrast, for instance, to Intel IOMMUs that have a limit on
the number of pages that can be flushed in a single flush. In addition,
AMD's IOMMU does not care about the page-size, so changes of the page
size do not need to trigger a TLB flush.

So in most cases, a TLB flush due to a disjoint range is not needed for
AMD. Yet, vIOMMUs require the hypervisor to synchronize the virtualized
IOMMU's PTEs with the physical ones. This process induces overheads, so
it is better not to cause unnecessary flushes, i.e., flushes of PTEs
that were not modified.

Implement amd_iommu_iotlb_gather_add_page() and use it instead
of the generic iommu_iotlb_gather_add_page(). Ignore disjoint regions
unless the "non-present cache" feature is reported by the IOMMU
capabilities, as this is an indication that we are running on a physical
IOMMU. A similar indication is used by VT-d (see "caching mode"). The
new logic retains the same flushing behavior that we had before the
introduction of page-selective IOTLB flushes for AMD.

In virtualized environments, check whether the newly flushed region and
the gathered one are disjoint, and flush if they are.
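
The policy can be pictured with a small user-space model (simplified and
hypothetical; the real gather structure lives in include/linux/iommu.h):
on bare metal ranges are merged unconditionally, while with np_cache set
a disjoint range forces a sync before gathering continues.

#include <stdbool.h>
#include <stdio.h>

struct gather { unsigned long start, end; };	/* end == 0: empty */

static bool np_cache = true;			/* pretend we run in a VM */

static bool is_disjoint(const struct gather *g, unsigned long iova, size_t size)
{
	unsigned long end = iova + size - 1;

	return g->end != 0 && (end + 1 < g->start || iova > g->end + 1);
}

static void sync_gather(struct gather *g)
{
	printf("flush [0x%lx, 0x%lx]\n", g->start, g->end);
	g->start = ~0UL;
	g->end = 0;
}

static void gather_add(struct gather *g, unsigned long iova, size_t size)
{
	if (np_cache && is_disjoint(g, iova, size))
		sync_gather(g);		/* flush only what was modified */

	if (g->start > iova)
		g->start = iova;
	if (g->end < iova + size - 1)
		g->end = iova + size - 1;
}

int main(void)
{
	struct gather g = { .start = ~0UL, .end = 0 };

	gather_add(&g, 0x1000, 0x1000);
	gather_add(&g, 0x2000, 0x1000);	/* adjacent: merged */
	gather_add(&g, 0x9000, 0x1000);	/* disjoint: flushes [0x1000, 0x2fff] */
	sync_gather(&g);		/* flushes [0x9000, 0x9fff] */
	return 0;
}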

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Reviewed-by: Robin Murphy 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index bfae3928b98f..f5d5f2124391 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2048,6 +2048,27 @@ static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
return ret;
 }
 
+static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+   struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   /*
+* AMD's IOMMU can flush as many pages as necessary in a single flush.
+* Unless we run in a virtual machine, which can be inferred according
+* to whether "non-present cache" is on, it is probably best to prefer
+* (potentially) too extensive TLB flushing (i.e., more misses) over
+* multiple TLB flushes (i.e., more flushes). For virtual machines the
+* hypervisor needs to synchronize the host IOMMU PTEs with those of
+* the guest, and the trade-off is different: unnecessary TLB flushes
+* should be avoided.
+*/
+   if (amd_iommu_np_cache &&
+   iommu_iotlb_gather_is_disjoint(gather, iova, size))
+   iommu_iotlb_sync(domain, gather);
+
+   iommu_iotlb_gather_add_range(gather, iova, size);
+}
+
 static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
  size_t page_size,
  struct iommu_iotlb_gather *gather)
@@ -2062,7 +2083,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 
r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
 
-   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+   amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
 
return r;
 }
-- 
2.25.1



[PATCH v6 4/7] iommu: Factor iommu_iotlb_gather_is_disjoint() out

2021-07-23 Thread Nadav Amit
From: Nadav Amit 

Refactor iommu_iotlb_gather_add_page() and factor out the logic that
detects whether IOTLB gather range and a new range are disjoint. To be
used by the next patch that implements different gathering logic for
AMD.

Note that updating gather->pgsize unconditionally does not affect
correctness, as the function had (and has) an invariant in which
gather->pgsize always represents the flushing granularity of its range.
Arguably, "size" should never be zero, but let's assume for the sake of
discussion that it might.

If "size" equals to "gather->pgsize", then the assignment in question
has no impact.

Otherwise, if "size" is non-zero, then iommu_iotlb_sync() would
initialize the size and range (see iommu_iotlb_gather_init()), and the
invariant is kept.

Otherwise, "size" is zero, and "gather" already holds a range, so
gather->pgsize is non-zero and (gather->pgsize && gather->pgsize !=
size) is true. Therefore, again, iommu_iotlb_sync() would be called and
initialize the size.
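
The three cases can be checked mechanically with a tiny user-space model
of the refactored logic (illustrative only; the assertions mirror the
argument above):

#include <assert.h>
#include <stdio.h>

struct gather { unsigned long start, end; size_t pgsize; };

static void sync_gather(struct gather *g)
{
	printf("sync [0x%lx, 0x%lx] pgsize=%zu\n", g->start, g->end, g->pgsize);
	g->start = ~0UL;	/* as iommu_iotlb_gather_init() would */
	g->end = 0;
	g->pgsize = 0;
}

static void add_page(struct gather *g, unsigned long iova, size_t size)
{
	unsigned long end = iova + size - 1;
	int disjoint = g->end != 0 && (end + 1 < g->start || iova > g->end + 1);

	if ((g->pgsize && g->pgsize != size) || disjoint)
		sync_gather(g);

	g->pgsize = size;		/* the unconditional update in question */
	if (g->start > iova)
		g->start = iova;
	if (g->end < end)
		g->end = end;
}

int main(void)
{
	struct gather g = { .start = ~0UL, .end = 0, .pgsize = 0 };

	add_page(&g, 0x1000, 0x1000);
	add_page(&g, 0x2000, 0x1000);	/* size == pgsize: assignment is a no-op */
	assert(g.pgsize == 0x1000);
	add_page(&g, 0x4000, 0x2000);	/* new size: synced first, then updated */
	assert(g.pgsize == 0x2000);
	sync_gather(&g);
	return 0;
}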

Cc: Joerg Roedel 
Cc: Jiajun Cao 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Reviewed-by: Robin Murphy 
Acked-by: Will Deacon 
Signed-off-by: Nadav Amit 
---
 include/linux/iommu.h | 34 ++
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e554871db46f..979a5ceeea55 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,28 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_is_disjoint - Checks whether a new range is disjoint
+ *
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to check whether a new range and the gathered range
+ * are disjoint. For many IOMMUs, flushing the IOMMU in this case is better
+ * than merging the two, which might lead to unnecessary invalidations.
+ */
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long start = iova, end = start + size - 1;
+
+   return gather->end != 0 &&
+   (end + 1 < gather->start || start > gather->end + 1);
+}
+
+
 /**
  * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
  * @gather: TLB gather data
@@ -533,20 +555,16 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size)
 {
-   unsigned long start = iova, end = start + size - 1;
-
/*
 * If the new page is disjoint from the current range or is mapped at
 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
-   end + 1 < gather->start || start > gather->end + 1) {
-   if (gather->pgsize)
-   iommu_iotlb_sync(domain, gather);
-   gather->pgsize = size;
-   }
+   if ((gather->pgsize && gather->pgsize != size) ||
+   iommu_iotlb_gather_is_disjoint(gather, iova, size))
+   iommu_iotlb_sync(domain, gather);
 
+   gather->pgsize = size;
iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
-- 
2.25.1


[PATCH v6 1/7] iommu/amd: Selective flush on unmap

2021-07-23 Thread Nadav Amit
From: Nadav Amit 

A recent patch attempted to enable selective page flushes on the AMD
IOMMU but neglected to adapt amd_iommu_iotlb_sync() to use the selective
flushes.

Adapt amd_iommu_iotlb_sync() to use selective flushes and change
amd_iommu_unmap() to collect the flushes. As a defensive measure, to
avoid potential issues such as those that the Intel IOMMU driver
encountered recently, flush the page-walk caches by always setting the
"pde" parameter. This can be removed later.
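
A rough before/after sketch of the resulting flow (user-space
illustration with made-up helpers, not kernel code): unmap only records
the range, and the deferred sync issues a single page-selective flush
with PDE set rather than a full-domain flush.

#include <stdio.h>

struct gather { unsigned long start, end; };

static void record_unmap(struct gather *g, unsigned long iova, unsigned long size)
{
	/* tear down the PTEs ... then remember what will need flushing */
	unsigned long end = iova + size - 1;

	if (g->start > iova)
		g->start = iova;
	if (g->end < end)
		g->end = end;
}

static void iotlb_sync(const struct gather *g)
{
	/* previously: flush the whole IO/TLB of the domain */
	printf("selective flush [0x%lx, 0x%lx], pde=1\n", g->start, g->end);
}

int main(void)
{
	struct gather g = { .start = ~0UL, .end = 0 };

	record_unmap(&g, 0x1000, 0x2000);
	record_unmap(&g, 0x3000, 0x1000);
	iotlb_sync(&g);		/* one flush covering [0x1000, 0x3fff] */
	return 0;
}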

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 811a49a95d04..bfae3928b98f 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2054,12 +2054,17 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 {
struct protection_domain *domain = to_pdomain(dom);
struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+   size_t r;
 
if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
(domain->iop.mode == PAGE_MODE_NONE))
return 0;
 
-   return (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+   r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+
+   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
 }
 
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
@@ -2162,7 +2167,13 @@ static void amd_iommu_flush_iotlb_all(struct iommu_domain *domain)
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
-   amd_iommu_flush_iotlb_all(domain);
+   struct protection_domain *dom = to_pdomain(domain);
+   unsigned long flags;
+
+   spin_lock_irqsave(&dom->lock, flags);
+   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+   amd_iommu_domain_flush_complete(dom);
+   spin_unlock_irqrestore(&dom->lock, flags);
 }
 
 static int amd_iommu_def_domain_type(struct device *dev)
-- 
2.25.1



[PATCH v6 3/7] iommu: Improve iommu_iotlb_gather helpers

2021-07-23 Thread Nadav Amit
From: Robin Murphy 

The Mediatek driver is not the only one which might want a basic
address-based gathering behaviour, so although it's arguably simple
enough to open-code, let's factor it out for the sake of cleanliness.
Let's also take this opportunity to document the intent of these
helpers for clarity.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Robin Murphy 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/mtk_iommu.c |  6 +-
 include/linux/iommu.h | 38 +-
 2 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 6f7c69688ce2..d9939e4af35c 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -520,12 +520,8 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
  struct iommu_iotlb_gather *gather)
 {
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
-   unsigned long end = iova + size - 1;
 
-   if (gather->start > iova)
-   gather->start = iova;
-   if (gather->end < end)
-   gather->end = end;
+   iommu_iotlb_gather_add_range(gather, iova, size);
return dom->iop->unmap(dom->iop, iova, size, gather);
 }
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..e554871db46f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,38 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build arbitrarily-sized invalidation commands
+ * where only the address range matters, and simply minimising intermediate
+ * syncs is preferred.
+ */
+static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long end = iova + size - 1;
+
+   if (gather->start > iova)
+   gather->start = iova;
+   if (gather->end < end)
+   gather->end = end;
+}
+
+/**
+ * iommu_iotlb_gather_add_page - Gather for page-based TLB invalidation
+ * @domain: IOMMU domain to be invalidated
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build invalidation commands based on individual
+ * pages, or with page size/table level hints which cannot be gathered if they
+ * differ.
+ */
 static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
   struct iommu_iotlb_gather *gather,
   unsigned long iova, size_t size)
@@ -515,11 +547,7 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
gather->pgsize = size;
}
 
-   if (gather->end < end)
-   gather->end = end;
-
-   if (gather->start > start)
-   gather->start = start;
+   iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
 /* PCI device grouping function */
-- 
2.25.1



[PATCH v6 2/7] iommu/amd: Do not use flush-queue when NpCache is on

2021-07-23 Thread Nadav Amit
From: Nadav Amit 

Do not use the flush-queue in virtualized environments, where the
NpCache capability of the IOMMU is set. This is required to reduce
virtualization overheads.

This change follows a similar change to Intel's VT-d, and a detailed
explanation of the rationale is given in commit 29b32839725f
("iommu/vt-d: Do not use flush-queue when caching-mode is on").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/init.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 46280e6e1535..1c7ae7d3c55d 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1850,8 +1850,13 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
if (ret)
return ret;
 
-   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
+   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE)) {
+   if (!amd_iommu_unmap_flush)
+   pr_info("IOMMU batching is disabled due to 
virtualization\n");
+
amd_iommu_np_cache = true;
+   amd_iommu_unmap_flush = true;
+   }
 
init_iommu_perf_ctr(iommu);
 
-- 
2.25.1



[PATCH v6 0/7] iommu/amd: Enable page-selective flushes

2021-07-23 Thread Nadav Amit
From: Nadav Amit 

The previous patch, commit 268aa4548277 ("iommu/amd: Page-specific
invalidations for more than one page") was supposed to enable
page-selective IOTLB flushes on AMD.

Besides the bug that was already fixed by commit a017c567915f
("iommu/amd: Fix wrong parentheses on page-specific invalidations")
there are several remaining matters to enable and benefit from
page-selective IOTLB flushes on AMD:

1. Enable selective flushes on unmap (patch 1)
2. Avoid using flush-queue on vIOMMUs (patch 2)
3. Relaxed flushes when gathering, excluding vIOMMUs (patches 3-5)
4. Syncing once on scatter-gather map operations (patch 6)
5. Breaking flushes to naturally aligned ranges on vIOMMU (patch 7)

The main difference in this version is that the logic that flushes
vIOMMU was improved based on Robin's feedback. Batching decisions are
not based on alignment anymore, but instead the flushing range is broken
into naturally aligned regions on sync. Doing so allows us to flush only
the entries that we modified with the minimal number of flushes.

Robin, others: your feedback would be highly appreciated to get these
patches merged.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org

---

v5->v6:
* Remove redundant check on amd_iommu_iotlb_gather_add_page() [Robin]
* Add Robin's reviewed-by tags

v4->v5:
* Rebase on v5.14-rc1
* Change pr_warn() to pr_info() [John Garry]

v3->v4:
* Breaking flushes to naturally aligned ranges on vIOMMU [Robin]
* Removing unnecessary stubs; fixing comment [Robin]
* Removing unused variable [Yong]
* Changing pr_warn_once() to pr_warn() [Robin]
* Improving commit log [Will]

v2->v3:
* Rebase on v5.13-rc5
* Refactoring (patches 4-5) [Robin]
* Rework flush logic (patch 5): more relaxed on native
* Syncing once on scatter-gather operations (patch 6)

v1->v2:
* Rebase on v5.13-rc3

Nadav Amit (6):
  iommu/amd: Selective flush on unmap
  iommu/amd: Do not use flush-queue when NpCache is on
  iommu: Factor iommu_iotlb_gather_is_disjoint() out
  iommu/amd: Tailored gather logic for AMD
  iommu/amd: Sync once for scatter-gather operations
  iommu/amd: Use only natural aligned flushes in a VM

Robin Murphy (1):
  iommu: Improve iommu_iotlb_gather helpers

 drivers/iommu/amd/init.c  |  7 ++-
 drivers/iommu/amd/iommu.c | 96 +++
 drivers/iommu/mtk_iommu.c |  6 +--
 include/linux/iommu.h | 72 +++--
 4 files changed, 153 insertions(+), 28 deletions(-)

-- 
2.25.1



Re: [PATCH v5 5/7] iommu/amd: Tailored gather logic for AMD

2021-07-13 Thread Nadav Amit



> On Jul 13, 2021, at 11:40 AM, Robin Murphy  wrote:
> 
> On 2021-07-13 10:41, Nadav Amit wrote:
>> From: Nadav Amit 
>> AMD's IOMMU can flush efficiently (i.e., in a single flush) any range.
>> This is in contrast, for instnace, to Intel IOMMUs that have a limit on
>> the number of pages that can be flushed in a single flush.  In addition,
>> AMD's IOMMU do not care about the page-size, so changes of the page size
>> do not need to trigger a TLB flush.
>> So in most cases, a TLB flush due to disjoint range is not needed for
>> AMD. Yet, vIOMMUs require the hypervisor to synchronize the virtualized
>> IOMMU's PTEs with the physical ones. This process induce overheads, so
>> it is better not to cause unnecessary flushes, i.e., flushes of PTEs
>> that were not modified.
>> Implement and use amd_iommu_iotlb_gather_add_page() and use it instead
>> of the generic iommu_iotlb_gather_add_page(). Ignore disjoint regions
>> unless "non-present cache" feature is reported by the IOMMU
>> capabilities, as this is an indication we are running on a physical
>> IOMMU. A similar indication is used by VT-d (see "caching mode"). The
>> new logic retains the same flushing behavior that we had before the
>> introduction of page-selective IOTLB flushes for AMD.
>> On virtualized environments, check if the newly flushed region and the
>> gathered one are disjoint and flush if it is.
>> Cc: Joerg Roedel 
>> Cc: Will Deacon 
>> Cc: Jiajun Cao 
>> Cc: Lu Baolu 
>> Cc: iommu@lists.linux-foundation.org
>> Cc: linux-ker...@vger.kernel.org>
>> Cc: Robin Murphy 
>> Signed-off-by: Nadav Amit 
>> ---
>>  drivers/iommu/amd/iommu.c | 23 ++-
>>  1 file changed, 22 insertions(+), 1 deletion(-)
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index bfae3928b98f..cc55c4c6a355 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -2048,6 +2048,27 @@ static int amd_iommu_map(struct iommu_domain *dom, 
>> unsigned long iova,
>>  return ret;
>>  }
>>  +static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
>> +struct iommu_iotlb_gather *gather,
>> +unsigned long iova, size_t size)
>> +{
>> +/*
>> + * AMD's IOMMU can flush as many pages as necessary in a single flush.
>> + * Unless we run in a virtual machine, which can be inferred according
>> + * to whether "non-present cache" is on, it is probably best to prefer
>> + * (potentially) too extensive TLB flushing (i.e., more misses) over
>> + * mutliple TLB flushes (i.e., more flushes). For virtual machines the
>> + * hypervisor needs to synchronize the host IOMMU PTEs with those of
>> + * the guest, and the trade-off is different: unnecessary TLB flushes
>> + * should be avoided.
>> + */
>> +if (amd_iommu_np_cache && gather->end != 0 &&
> 
> iommu_iotlb_gather_is_disjoint() is also checking "gather->end != 0", so I 
> don't think we need both. Strictly it's only necessary here since the other 
> call from iommu_iotlb_gather_add_page() equivalently asserts that the gather 
> is already non-empty via its gather->pgsize check, but one could argue it 
> either way and I don't have a hugely strong preference.

You are correct (even if the compiler would have eliminated the redundancy).

I will remove the redundant check.

> 
> Otherwise, I love how neat this has all ended up, thanks for persevering!

Thank you for the thorough review!

Regards,
Nadav


[PATCH v5 7/7] iommu/amd: Use only natural aligned flushes in a VM

2021-07-13 Thread Nadav Amit
From: Nadav Amit 

When running on an AMD vIOMMU, it is better to avoid TLB flushes
of unmodified PTEs. vIOMMUs require the hypervisor to synchronize the
virtualized IOMMU's PTEs with the physical ones. This process induces
overheads.

The AMD IOMMU allows us to flush any naturally aligned power-of-2
range. So when running on top of a vIOMMU, break the range into
sub-ranges that are naturally aligned, and flush each one separately.
This approach is better when running with a vIOMMU, but on physical
IOMMUs the penalty of IOTLB misses due to unnecessarily flushed entries
is likely to be low.

Repurpose domain_flush_pages() (i.e., keep the name, change the logic)
so that it chooses whether to perform a single flush of the whole range
or multiple flushes of naturally aligned sub-ranges, avoiding flushes of
unnecessary ranges. Use NpCache, as usual, to infer whether the IOMMU is
physical or virtual.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Suggested-by: Robin Murphy 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 47 ++-
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index c1fcd01b3c40..cb5ea94055e1 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1261,15 +1261,52 @@ static void __domain_flush_pages(struct 
protection_domain *domain,
 }
 
 static void domain_flush_pages(struct protection_domain *domain,
-  u64 address, size_t size)
+  u64 address, size_t size, int pde)
 {
-   __domain_flush_pages(domain, address, size, 0);
+   if (likely(!amd_iommu_np_cache)) {
+   __domain_flush_pages(domain, address, size, pde);
+   return;
+   }
+
+   /*
+* When NpCache is on, we infer that we run in a VM and use a vIOMMU.
+* In such setups it is best to avoid flushes of ranges which are not
+* naturally aligned, since it would lead to flushes of unmodified
+* PTEs. Such flushes would require the hypervisor to do more work than
+* necessary. Therefore, perform repeated flushes of aligned ranges
+* until you cover the range. Each iteration flushes the smaller
+* between the natural alignment of the address that we flush and the
+* greatest naturally aligned region that fits in the range.
+*/
+   while (size != 0) {
+   int addr_alignment = __ffs(address);
+   int size_alignment = __fls(size);
+   int min_alignment;
+   size_t flush_size;
+
+   /*
+* size is always non-zero, but address might be zero, causing
+* addr_alignment to be negative. As the casting of the
+* argument in __ffs(address) to long might trim the high bits
+* of the address on x86-32, cast to long when doing the check.
+*/
+   if (likely((unsigned long)address != 0))
+   min_alignment = min(addr_alignment, size_alignment);
+   else
+   min_alignment = size_alignment;
+
+   flush_size = 1ul << min_alignment;
+
+   __domain_flush_pages(domain, address, flush_size, pde);
+   address += flush_size;
+   size -= flush_size;
+   }
 }
 
 /* Flush the whole IO/TLB for a given protection domain - including PDE */
 void amd_iommu_domain_flush_tlb_pde(struct protection_domain *domain)
 {
-   __domain_flush_pages(domain, 0, CMD_INV_IOMMU_ALL_PAGES_ADDRESS, 1);
+   domain_flush_pages(domain, 0, CMD_INV_IOMMU_ALL_PAGES_ADDRESS, 1);
 }
 
 void amd_iommu_domain_flush_complete(struct protection_domain *domain)
@@ -1296,7 +1333,7 @@ static void domain_flush_np_cache(struct 
protection_domain *domain,
unsigned long flags;
 
spin_lock_irqsave(>lock, flags);
-   domain_flush_pages(domain, iova, size);
+   domain_flush_pages(domain, iova, size, 1);
amd_iommu_domain_flush_complete(domain);
spin_unlock_irqrestore(>lock, flags);
}
@@ -2200,7 +2237,7 @@ static void amd_iommu_iotlb_sync(struct iommu_domain 
*domain,
unsigned long flags;
 
spin_lock_irqsave(>lock, flags);
-   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 
1);
+   domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
amd_iommu_domain_flush_complete(dom);
spin_unlock_irqrestore(>lock, flags);
 }
-- 
2.25.1



[PATCH v5 6/7] iommu/amd: Sync once for scatter-gather operations

2021-07-13 Thread Nadav Amit
From: Nadav Amit 

On virtual machines, software must flush the IOTLB after each page table
entry update.

The iommu_map_sg() code iterates through the given scatter-gather list
and invokes iommu_map() for each element in the scatter-gather list,
which calls into the vendor IOMMU driver through the iommu_ops callback.
As a result, a single sg mapping may lead to multiple IOTLB flushes.

Fix this by adding an amd_iotlb_sync_map() callback and flushing at this
point, after all sg mappings are set.

This commit follows and is inspired by commit 933fcd01e97e2
("iommu/vt-d: Add iotlb_sync_map callback").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index cc55c4c6a355..c1fcd01b3c40 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2022,6 +2022,16 @@ static int amd_iommu_attach_device(struct iommu_domain 
*dom,
return ret;
 }
 
+static void amd_iommu_iotlb_sync_map(struct iommu_domain *dom,
+unsigned long iova, size_t size)
+{
+   struct protection_domain *domain = to_pdomain(dom);
+   struct io_pgtable_ops *ops = >iop.iop.ops;
+
+   if (ops->map)
+   domain_flush_np_cache(domain, iova, size);
+}
+
 static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
 phys_addr_t paddr, size_t page_size, int iommu_prot,
 gfp_t gfp)
@@ -2040,10 +2050,8 @@ static int amd_iommu_map(struct iommu_domain *dom, 
unsigned long iova,
if (iommu_prot & IOMMU_WRITE)
prot |= IOMMU_PROT_IW;
 
-   if (ops->map) {
+   if (ops->map)
ret = ops->map(ops, iova, paddr, page_size, prot, gfp);
-   domain_flush_np_cache(domain, iova, page_size);
-   }
 
return ret;
 }
@@ -2223,6 +2231,7 @@ const struct iommu_ops amd_iommu_ops = {
.attach_dev = amd_iommu_attach_device,
.detach_dev = amd_iommu_detach_device,
.map = amd_iommu_map,
+   .iotlb_sync_map = amd_iommu_iotlb_sync_map,
.unmap = amd_iommu_unmap,
.iova_to_phys = amd_iommu_iova_to_phys,
.probe_device = amd_iommu_probe_device,
-- 
2.25.1



[PATCH v5 5/7] iommu/amd: Tailored gather logic for AMD

2021-07-13 Thread Nadav Amit
From: Nadav Amit 

AMD's IOMMU can flush any range efficiently (i.e., in a single flush).
This is in contrast, for instance, to Intel IOMMUs that have a limit on
the number of pages that can be flushed in a single flush. In addition,
AMD's IOMMU does not care about the page-size, so changes of the page
size do not need to trigger a TLB flush.

So in most cases, a TLB flush due to a disjoint range is not needed for
AMD. Yet, vIOMMUs require the hypervisor to synchronize the virtualized
IOMMU's PTEs with the physical ones. This process induces overheads, so
it is better not to cause unnecessary flushes, i.e., flushes of PTEs
that were not modified.

Implement amd_iommu_iotlb_gather_add_page() and use it instead
of the generic iommu_iotlb_gather_add_page(). Ignore disjoint regions
unless the "non-present cache" feature is reported by the IOMMU
capabilities, as this is an indication that we are running on a physical
IOMMU. A similar indication is used by VT-d (see "caching mode"). The
new logic retains the same flushing behavior that we had before the
introduction of page-selective IOTLB flushes for AMD.

In virtualized environments, check whether the newly flushed region and
the gathered one are disjoint, and flush if they are.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Cc: Robin Murphy 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index bfae3928b98f..cc55c4c6a355 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2048,6 +2048,27 @@ static int amd_iommu_map(struct iommu_domain *dom, 
unsigned long iova,
return ret;
 }
 
+static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+   struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   /*
+* AMD's IOMMU can flush as many pages as necessary in a single flush.
+* Unless we run in a virtual machine, which can be inferred according
+* to whether "non-present cache" is on, it is probably best to prefer
+* (potentially) too extensive TLB flushing (i.e., more misses) over
+* multiple TLB flushes (i.e., more flushes). For virtual machines the
+* hypervisor needs to synchronize the host IOMMU PTEs with those of
+* the guest, and the trade-off is different: unnecessary TLB flushes
+* should be avoided.
+*/
+   if (amd_iommu_np_cache && gather->end != 0 &&
+   iommu_iotlb_gather_is_disjoint(gather, iova, size))
+   iommu_iotlb_sync(domain, gather);
+
+   iommu_iotlb_gather_add_range(gather, iova, size);
+}
+
 static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
  size_t page_size,
  struct iommu_iotlb_gather *gather)
@@ -2062,7 +2083,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
 
r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
 
-   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+   amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
 
return r;
 }
-- 
2.25.1



[PATCH v5 4/7] iommu: Factor iommu_iotlb_gather_is_disjoint() out

2021-07-13 Thread Nadav Amit
From: Nadav Amit 

Refactor iommu_iotlb_gather_add_page() and factor out the logic that
detects whether IOTLB gather range and a new range are disjoint. To be
used by the next patch that implements different gathering logic for
AMD.

Note that updating gather->pgsize unconditionally does not affect
correctness as the function had (and has) an invariant, in which
gather->pgsize always represents the flushing granularity of its range.
Arguably, "size" should never be zero, but let's assume for the sake of
discussion that it might.

If "size" equals "gather->pgsize", then the assignment in question
has no impact.

Otherwise, if "size" is non-zero, then iommu_iotlb_sync() would
initialize the size and range (see iommu_iotlb_gather_init()), and the
invariant is kept.

Otherwise, "size" is zero, and "gather" already holds a range, so
gather->pgsize is non-zero and (gather->pgsize && gather->pgsize !=
size) is true. Therefore, again, iommu_iotlb_sync() would be called and
initialize the size.

Cc: Joerg Roedel 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Acked-by: Will Deacon 
Signed-off-by: Nadav Amit 
---
 include/linux/iommu.h | 34 ++
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e554871db46f..979a5ceeea55 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,28 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
*domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_is_disjoint - Checks whether a new range is disjoint
+ *
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to check whether a new range and the gathered range
+ * are disjoint. For many IOMMUs, flushing the IOMMU in this case is better
+ * than merging the two, which might lead to unnecessary invalidations.
+ */
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long start = iova, end = start + size - 1;
+
+   return gather->end != 0 &&
+   (end + 1 < gather->start || start > gather->end + 1);
+}
+
+
 /**
  * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
  * @gather: TLB gather data
@@ -533,20 +555,16 @@ static inline void iommu_iotlb_gather_add_page(struct 
iommu_domain *domain,
   struct iommu_iotlb_gather 
*gather,
   unsigned long iova, size_t size)
 {
-   unsigned long start = iova, end = start + size - 1;
-
/*
 * If the new page is disjoint from the current range or is mapped at
 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
-   end + 1 < gather->start || start > gather->end + 1) {
-   if (gather->pgsize)
-   iommu_iotlb_sync(domain, gather);
-   gather->pgsize = size;
-   }
+   if ((gather->pgsize && gather->pgsize != size) ||
+   iommu_iotlb_gather_is_disjoint(gather, iova, size))
+   iommu_iotlb_sync(domain, gather);
 
+   gather->pgsize = size;
iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
-- 
2.25.1


[PATCH v5 3/7] iommu: Improve iommu_iotlb_gather helpers

2021-07-13 Thread Nadav Amit
From: Robin Murphy 

The Mediatek driver is not the only one which might want a basic
address-based gathering behaviour, so although it's arguably simple
enough to open-code, let's factor it out for the sake of cleanliness.
Let's also take this opportunity to document the intent of these
helpers for clarity.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Robin Murphy 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/mtk_iommu.c |  6 +-
 include/linux/iommu.h | 38 +-
 2 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 6f7c69688ce2..d9939e4af35c 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -520,12 +520,8 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
  struct iommu_iotlb_gather *gather)
 {
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
-   unsigned long end = iova + size - 1;
 
-   if (gather->start > iova)
-   gather->start = iova;
-   if (gather->end < end)
-   gather->end = end;
+   iommu_iotlb_gather_add_range(gather, iova, size);
return dom->iop->unmap(dom->iop, iova, size, gather);
 }
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..e554871db46f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,38 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
*domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build arbitrarily-sized invalidation commands
+ * where only the address range matters, and simply minimising intermediate
+ * syncs is preferred.
+ */
+static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather 
*gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long end = iova + size - 1;
+
+   if (gather->start > iova)
+   gather->start = iova;
+   if (gather->end < end)
+   gather->end = end;
+}
+
+/**
+ * iommu_iotlb_gather_add_page - Gather for page-based TLB invalidation
+ * @domain: IOMMU domain to be invalidated
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build invalidation commands based on individual
+ * pages, or with page size/table level hints which cannot be gathered if they
+ * differ.
+ */
 static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
   struct iommu_iotlb_gather 
*gather,
   unsigned long iova, size_t size)
@@ -515,11 +547,7 @@ static inline void iommu_iotlb_gather_add_page(struct 
iommu_domain *domain,
gather->pgsize = size;
}
 
-   if (gather->end < end)
-   gather->end = end;
-
-   if (gather->start > start)
-   gather->start = start;
+   iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
 /* PCI device grouping function */
-- 
2.25.1



[PATCH v5 2/7] iommu/amd: Do not use flush-queue when NpCache is on

2021-07-13 Thread Nadav Amit
From: Nadav Amit 

Do not use the flush-queue in virtualized environments, where the
NpCache capability of the IOMMU is set. This is required to reduce
virtualization overheads.

This change follows a similar change to Intel's VT-d, and a detailed
explanation of the rationale is given in commit 29b32839725f
("iommu/vt-d: Do not use flush-queue when caching-mode is on").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/init.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 46280e6e1535..1c7ae7d3c55d 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1850,8 +1850,13 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
if (ret)
return ret;
 
-   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
+   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE)) {
+   if (!amd_iommu_unmap_flush)
+   pr_info("IOMMU batching is disabled due to 
virtualization\n");
+
amd_iommu_np_cache = true;
+   amd_iommu_unmap_flush = true;
+   }
 
init_iommu_perf_ctr(iommu);
 
-- 
2.25.1



[PATCH v5 1/7] iommu/amd: Selective flush on unmap

2021-07-13 Thread Nadav Amit
From: Nadav Amit 

A recent patch attempted to enable selective page flushes on the AMD
IOMMU but neglected to adapt amd_iommu_iotlb_sync() to use the selective
flushes.

Adapt amd_iommu_iotlb_sync() to use selective flushes and change
amd_iommu_unmap() to collect the flushes. As a defensive measure, to
avoid potential issues such as those that the Intel IOMMU driver
encountered recently, flush the page-walk caches by always setting the
"pde" parameter. This can be removed later.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 811a49a95d04..bfae3928b98f 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2054,12 +2054,17 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
 {
struct protection_domain *domain = to_pdomain(dom);
struct io_pgtable_ops *ops = >iop.iop.ops;
+   size_t r;
 
if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
(domain->iop.mode == PAGE_MODE_NONE))
return 0;
 
-   return (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+   r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+
+   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
 }
 
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
@@ -2162,7 +2167,13 @@ static void amd_iommu_flush_iotlb_all(struct 
iommu_domain *domain)
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
-   amd_iommu_flush_iotlb_all(domain);
+   struct protection_domain *dom = to_pdomain(domain);
+   unsigned long flags;
+
+   spin_lock_irqsave(>lock, flags);
+   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 
1);
+   amd_iommu_domain_flush_complete(dom);
+   spin_unlock_irqrestore(>lock, flags);
 }
 
 static int amd_iommu_def_domain_type(struct device *dev)
-- 
2.25.1



[PATCH v5 0/7] iommu/amd: Enable page-selective flushes

2021-07-13 Thread Nadav Amit
From: Nadav Amit 

The previous patch, commit 268aa4548277 ("iommu/amd: Page-specific
invalidations for more than one page") was supposed to enable
page-selective IOTLB flushes on AMD.

Besides the bug that was already fixed by commit a017c567915f
("iommu/amd: Fix wrong parentheses on page-specific invalidations")
there are several remaining matters to enable and benefit from
page-selective IOTLB flushes on AMD:

1. Enable selective flushes on unmap (patch 1)
2. Avoid using flush-queue on vIOMMUs (patch 2)
3. Relaxed flushes when gathering, excluding vIOMMUs (patches 3-5)
4. Syncing once on scatter-gather map operations (patch 6)
5. Breaking flushes to naturally aligned ranges on vIOMMU (patch 7)

The main difference in this version is that the logic that flushes
vIOMMU was improved based on Robin's feedback. Batching decisions are
not based on alignment anymore, but instead the flushing range is broken
into naturally aligned regions on sync. Doing so allows us to flush only
the entries that we modified with the minimal number of flushes.

Robin, others: your feedback would be highly appreciated to get these
patches merged.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org

---

v4->v5:
* Rebase on v5.14-rc1
* Change pr_warn() to pr_info() [John Garry]

v3->v4:
* Breaking flushes to naturally aligned ranges on vIOMMU [Robin]
* Removing unnecessary stubs; fixing comment [Robin]
* Removing unused variable [Yong]
* Changing pr_warn_once() to pr_warn() [Robin]
* Improving commit log [Will]

v2->v3:
* Rebase on v5.13-rc5
* Refactoring (patches 4-5) [Robin]
* Rework flush logic (patch 5): more relaxed on native
* Syncing once on scatter-gather operations (patch 6)

v1->v2:
* Rebase on v5.13-rc3


Nadav Amit (6):
  iommu/amd: Selective flush on unmap
  iommu/amd: Do not use flush-queue when NpCache is on
  iommu: Factor iommu_iotlb_gather_is_disjoint() out
  iommu/amd: Tailored gather logic for AMD
  iommu/amd: Sync once for scatter-gather operations
  iommu/amd: Use only natural aligned flushes in a VM

Robin Murphy (1):
  iommu: Improve iommu_iotlb_gather helpers

 drivers/iommu/amd/init.c  |  7 ++-
 drivers/iommu/amd/iommu.c | 96 +++
 drivers/iommu/mtk_iommu.c |  6 +--
 include/linux/iommu.h | 72 +++--
 4 files changed, 153 insertions(+), 28 deletions(-)

-- 
2.25.1



[PATCH v4 7/7] iommu/amd: Use only natural aligned flushes in a VM

2021-06-16 Thread Nadav Amit
From: Nadav Amit 

When running on an AMD vIOMMU, it is better to avoid TLB flushes
of unmodified PTEs. vIOMMUs require the hypervisor to synchronize the
virtualized IOMMU's PTEs with the physical ones. This process induces
overheads.

The AMD IOMMU allows us to flush any naturally aligned power-of-2
range. So when running on top of a vIOMMU, break the range into
sub-ranges that are naturally aligned, and flush each one separately.
This approach is better when running with a vIOMMU, but on physical
IOMMUs the penalty of IOTLB misses due to unnecessarily flushed entries
is likely to be low.

Repurpose domain_flush_pages() (i.e., keep the name, change the logic)
so that it chooses whether to perform a single flush of the whole range
or multiple flushes of naturally aligned sub-ranges, avoiding flushes of
unnecessary ranges. Use NpCache, as usual, to infer whether the IOMMU is
physical or virtual.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Suggested-by: Robin Murphy 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 47 ++-
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index ce8e970aac9a..ec0b6ad27e48 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1262,15 +1262,52 @@ static void __domain_flush_pages(struct 
protection_domain *domain,
 }
 
 static void domain_flush_pages(struct protection_domain *domain,
-  u64 address, size_t size)
+  u64 address, size_t size, int pde)
 {
-   __domain_flush_pages(domain, address, size, 0);
+   if (likely(!amd_iommu_np_cache)) {
+   __domain_flush_pages(domain, address, size, pde);
+   return;
+   }
+
+   /*
+* When NpCache is on, we infer that we run in a VM and use a vIOMMU.
+* In such setups it is best to avoid flushes of ranges which are not
+* naturally aligned, since it would lead to flushes of unmodified
+* PTEs. Such flushes would require the hypervisor to do more work than
+* necessary. Therefore, perform repeated flushes of aligned ranges
+* until you cover the range. Each iteration flush the smaller between
+* the natural alignment of the address that we flush and the highest
+* bit that is set in the remaining size.
+*/
+   while (size != 0) {
+   int addr_alignment = __ffs(address);
+   int size_alignment = __fls(size);
+   int min_alignment;
+   size_t flush_size;
+
+   /*
+* size is always non-zero, but address might be zero, causing
+* addr_alignment to be negative. As the casting of the
+* argument in __ffs(address) to long might trim the high bits
+* of the address on x86-32, cast to long when doing the check.
+*/
+   if (likely((unsigned long)address != 0))
+   min_alignment = min(addr_alignment, size_alignment);
+   else
+   min_alignment = size_alignment;
+
+   flush_size = 1ul << min_alignment;
+
+   __domain_flush_pages(domain, address, flush_size, pde);
+   address += flush_size;
+   size -= flush_size;
+   }
 }
 
 /* Flush the whole IO/TLB for a given protection domain - including PDE */
 void amd_iommu_domain_flush_tlb_pde(struct protection_domain *domain)
 {
-   __domain_flush_pages(domain, 0, CMD_INV_IOMMU_ALL_PAGES_ADDRESS, 1);
+   domain_flush_pages(domain, 0, CMD_INV_IOMMU_ALL_PAGES_ADDRESS, 1);
 }
 
 void amd_iommu_domain_flush_complete(struct protection_domain *domain)
@@ -1297,7 +1334,7 @@ static void domain_flush_np_cache(struct 
protection_domain *domain,
unsigned long flags;
 
spin_lock_irqsave(>lock, flags);
-   domain_flush_pages(domain, iova, size);
+   domain_flush_pages(domain, iova, size, 1);
amd_iommu_domain_flush_complete(domain);
spin_unlock_irqrestore(>lock, flags);
}
@@ -2205,7 +2242,7 @@ static void amd_iommu_iotlb_sync(struct iommu_domain 
*domain,
unsigned long flags;
 
spin_lock_irqsave(>lock, flags);
-   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 
1);
+   domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
amd_iommu_domain_flush_complete(dom);
spin_unlock_irqrestore(>lock, flags);
 }
-- 
2.25.1



[PATCH v4 6/7] iommu/amd: Sync once for scatter-gather operations

2021-06-16 Thread Nadav Amit
From: Nadav Amit 

On virtual machines, software must flush the IOTLB after each page table
entry update.

The iommu_map_sg() code iterates through the given scatter-gather list
and invokes iommu_map() for each element in the scatter-gather list,
which calls into the vendor IOMMU driver through the iommu_ops callback.
As a result, a single sg mapping may lead to multiple IOTLB flushes.

Fix this by adding an amd_iotlb_sync_map() callback and flushing at this
point, after all sg mappings are set.

This commit follows and is inspired by commit 933fcd01e97e2
("iommu/vt-d: Add iotlb_sync_map callback").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 63048aabaf5d..ce8e970aac9a 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2027,6 +2027,16 @@ static int amd_iommu_attach_device(struct iommu_domain 
*dom,
return ret;
 }
 
+static void amd_iommu_iotlb_sync_map(struct iommu_domain *dom,
+unsigned long iova, size_t size)
+{
+   struct protection_domain *domain = to_pdomain(dom);
+   struct io_pgtable_ops *ops = >iop.iop.ops;
+
+   if (ops->map)
+   domain_flush_np_cache(domain, iova, size);
+}
+
 static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
 phys_addr_t paddr, size_t page_size, int iommu_prot,
 gfp_t gfp)
@@ -2045,10 +2055,8 @@ static int amd_iommu_map(struct iommu_domain *dom, 
unsigned long iova,
if (iommu_prot & IOMMU_WRITE)
prot |= IOMMU_PROT_IW;
 
-   if (ops->map) {
+   if (ops->map)
ret = ops->map(ops, iova, paddr, page_size, prot, gfp);
-   domain_flush_np_cache(domain, iova, page_size);
-   }
 
return ret;
 }
@@ -2228,6 +2236,7 @@ const struct iommu_ops amd_iommu_ops = {
.attach_dev = amd_iommu_attach_device,
.detach_dev = amd_iommu_detach_device,
.map = amd_iommu_map,
+   .iotlb_sync_map = amd_iommu_iotlb_sync_map,
.unmap = amd_iommu_unmap,
.iova_to_phys = amd_iommu_iova_to_phys,
.probe_device = amd_iommu_probe_device,
-- 
2.25.1



[PATCH v4 4/7] iommu: Factor iommu_iotlb_gather_is_disjoint() out

2021-06-16 Thread Nadav Amit
From: Nadav Amit 

Refactor iommu_iotlb_gather_add_page() and factor out the logic that
detects whether IOTLB gather range and a new range are disjoint. To be
used by the next patch that implements different gathering logic for
AMD.

Note that updating gather->pgsize unconditionally does not affect
correctness as the function had (and has) an invariant, in which
gather->pgsize always represents the flushing granularity of its range.
Arguably, "size" should never be zero, but let us assume for the sake of
discussion that it might.

If "size" equals "gather->pgsize", then the assignment in question has
no impact.

Otherwise, if "size" is non-zero, then iommu_iotlb_sync() would
initialize the size and range (see iommu_iotlb_gather_init()), and the
invariant is kept.

Otherwise, "size" is zero, and "gather" already holds a range, so
gather->pgsize is non-zero and (gather->pgsize && gather->pgsize !=
size) is true. Therefore, again, iommu_iotlb_sync() would be called and
initialize the size.
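
As a usage illustration (a minimal sketch, not taken from this series),
a driver that prefers flushing over widening the gathered range could
call the new helper as follows:

static void example_gather_add(struct iommu_domain *domain,
			       struct iommu_iotlb_gather *gather,
			       unsigned long iova, size_t size)
{
	/* Flush what has been gathered so far instead of merging with it. */
	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
		iommu_iotlb_sync(domain, gather);

	iommu_iotlb_gather_add_range(gather, iova, size);
}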

Cc: Joerg Roedel 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org>
Acked-by: Will Deacon 
Signed-off-by: Nadav Amit 
---
 include/linux/iommu.h | 34 ++
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e554871db46f..979a5ceeea55 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,28 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
*domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_is_disjoint - Checks whether a new range is disjoint
+ *
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to check whether a new range and the gathered range
+ * are disjoint. For many IOMMUs, flushing the IOMMU in this case is better
+ * than merging the two, which might lead to unnecessary invalidations.
+ */
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long start = iova, end = start + size - 1;
+
+   return gather->end != 0 &&
+   (end + 1 < gather->start || start > gather->end + 1);
+}
+
+
 /**
  * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
  * @gather: TLB gather data
@@ -533,20 +555,16 @@ static inline void iommu_iotlb_gather_add_page(struct 
iommu_domain *domain,
   struct iommu_iotlb_gather 
*gather,
   unsigned long iova, size_t size)
 {
-   unsigned long start = iova, end = start + size - 1;
-
/*
 * If the new page is disjoint from the current range or is mapped at
 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
-   end + 1 < gather->start || start > gather->end + 1) {
-   if (gather->pgsize)
-   iommu_iotlb_sync(domain, gather);
-   gather->pgsize = size;
-   }
+   if ((gather->pgsize && gather->pgsize != size) ||
+   iommu_iotlb_gather_is_disjoint(gather, iova, size))
+   iommu_iotlb_sync(domain, gather);
 
+   gather->pgsize = size;
iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
-- 
2.25.1


[PATCH v4 5/7] iommu/amd: Tailored gather logic for AMD

2021-06-16 Thread Nadav Amit
From: Nadav Amit 

AMD's IOMMU can efficiently flush any range (i.e., in a single flush).
This is in contrast, for instance, to Intel IOMMUs, which have a limit
on the number of pages that can be flushed in a single flush. In
addition, AMD's IOMMU does not care about the page size, so changes of
the page size do not need to trigger a TLB flush.

So in most cases, a TLB flush due to a disjoint range is not needed for
AMD. Yet, vIOMMUs require the hypervisor to synchronize the virtualized
IOMMU's PTEs with the physical ones. This process induces overheads, so
it is better not to cause unnecessary flushes, i.e., flushes of PTEs
that were not modified.

Implement amd_iommu_iotlb_gather_add_page() and use it instead of the
generic iommu_iotlb_gather_add_page(). Ignore disjoint regions unless
the "non-present cache" feature is reported by the IOMMU capabilities,
as the absence of this feature is an indication that we are running on
a physical IOMMU. A similar indication is used by VT-d (see "caching
mode"). The new logic retains the same flushing behavior that we had
before the introduction of page-selective IOTLB flushes for AMD.

In virtualized environments, check whether the newly flushed region and
the gathered one are disjoint, and flush if they are.
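
As an illustration (a hedged fragment: the IOVAs are hypothetical,
SZ_4K comes from linux/sizes.h, and dom/gather come from the surrounding
unmap path), consider unmapping two disjoint 4KiB pages through the new
helper:

	/*
	 * Sketch only: on bare metal (amd_iommu_np_cache is false) both pages
	 * are gathered and later flushed by one command covering
	 * [0x10000, 0x20fff]. On a vIOMMU (NpCache set) the second call first
	 * syncs the gathered [0x10000, 0x10fff] range, so the two pages are
	 * flushed separately.
	 */
	amd_iommu_iotlb_gather_add_page(dom, gather, 0x10000, SZ_4K);
	amd_iommu_iotlb_gather_add_page(dom, gather, 0x20000, SZ_4K);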

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org>
Cc: Robin Murphy 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3e40f6610b6a..63048aabaf5d 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2053,6 +2053,27 @@ static int amd_iommu_map(struct iommu_domain *dom, 
unsigned long iova,
return ret;
 }
 
+static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+   struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   /*
+* AMD's IOMMU can flush as many pages as necessary in a single flush.
+* Unless we run in a virtual machine, which can be inferred according
+* to whether "non-present cache" is on, it is probably best to prefer
+* (potentially) too extensive TLB flushing (i.e., more misses) over
+* multiple TLB flushes (i.e., more flushes). For virtual machines the
+* hypervisor needs to synchronize the host IOMMU PTEs with those of
+* the guest, and the trade-off is different: unnecessary TLB flushes
+* should be avoided.
+*/
+   if (amd_iommu_np_cache && gather->end != 0 &&
+   iommu_iotlb_gather_is_disjoint(gather, iova, size))
+   iommu_iotlb_sync(domain, gather);
+
+   iommu_iotlb_gather_add_range(gather, iova, size);
+}
+
 static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
  size_t page_size,
  struct iommu_iotlb_gather *gather)
@@ -2067,7 +2088,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
 
r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
 
-   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+   amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
 
return r;
 }
-- 
2.25.1



[PATCH v4 3/7] iommu: Improve iommu_iotlb_gather helpers

2021-06-16 Thread Nadav Amit
From: Robin Murphy 

The Mediatek driver is not the only one which might want a basic
address-based gathering behaviour, so although it's arguably simple
enough to open-code, let's factor it out for the sake of cleanliness.
Let's also take this opportunity to document the intent of these
helpers for clarity.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Robin Murphy 
Signed-off-by: Nadav Amit 

---

Changes from Robin's version:
* Added an iommu_iotlb_gather_add_range() stub for !CONFIG_IOMMU_API
* Use iommu_iotlb_gather_add_range() in iommu_iotlb_gather_add_page()
---
 drivers/iommu/mtk_iommu.c |  6 +-
 include/linux/iommu.h | 38 +-
 2 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e06b8a0e2b56..cd457487ce81 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -521,12 +521,8 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
  struct iommu_iotlb_gather *gather)
 {
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
-   unsigned long end = iova + size - 1;
 
-   if (gather->start > iova)
-   gather->start = iova;
-   if (gather->end < end)
-   gather->end = end;
+   iommu_iotlb_gather_add_range(gather, iova, size);
return dom->iop->unmap(dom->iop, iova, size, gather);
 }
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..e554871db46f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,38 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
*domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build arbitrarily-sized invalidation commands
+ * where only the address range matters, and simply minimising intermediate
+ * syncs is preferred.
+ */
+static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather 
*gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long end = iova + size - 1;
+
+   if (gather->start > iova)
+   gather->start = iova;
+   if (gather->end < end)
+   gather->end = end;
+}
+
+/**
+ * iommu_iotlb_gather_add_page - Gather for page-based TLB invalidation
+ * @domain: IOMMU domain to be invalidated
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build invalidation commands based on individual
+ * pages, or with page size/table level hints which cannot be gathered if they
+ * differ.
+ */
 static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
   struct iommu_iotlb_gather 
*gather,
   unsigned long iova, size_t size)
@@ -515,11 +547,7 @@ static inline void iommu_iotlb_gather_add_page(struct 
iommu_domain *domain,
gather->pgsize = size;
}
 
-   if (gather->end < end)
-   gather->end = end;
-
-   if (gather->start > start)
-   gather->start = start;
+   iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
 /* PCI device grouping function */
-- 
2.25.1



[PATCH v4 2/7] iommu/amd: Do not use flush-queue when NpCache is on

2021-06-16 Thread Nadav Amit
From: Nadav Amit 

Do not use the flush-queue in virtualized environments, where the
NpCache capability of the IOMMU is set. This is required to reduce
virtualization overheads.

This change follows a similar change to Intel's VT-d; a detailed
explanation of the rationale is given in commit 29b32839725f
("iommu/vt-d: Do not use flush-queue when caching-mode is on").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/init.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index d006724f4dc2..4a52d22d0d6f 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1850,8 +1850,13 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
if (ret)
return ret;
 
-   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
+   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE)) {
+   if (!amd_iommu_unmap_flush)
+   pr_warn("IOMMU batching is disabled due to 
virtualization");
+
amd_iommu_np_cache = true;
+   amd_iommu_unmap_flush = true;
+   }
 
init_iommu_perf_ctr(iommu);
 
-- 
2.25.1



[PATCH v4 1/7] iommu/amd: Selective flush on unmap

2021-06-16 Thread Nadav Amit
From: Nadav Amit 

A recent patch attempted to enable selective page flushes on AMD IOMMU
but neglected to adapt amd_iommu_iotlb_sync() to use the selective
flushes.

Adapt amd_iommu_iotlb_sync() to use selective flushes and change
amd_iommu_unmap() to collect the flushes. As a defensive measure, to
avoid potential issues such as those that the Intel IOMMU driver
encountered recently, flush the page-walk caches by always setting the
"pde" parameter. This can be removed later.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3ac42bbdefc6..3e40f6610b6a 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2059,12 +2059,17 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
 {
struct protection_domain *domain = to_pdomain(dom);
struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+   size_t r;
 
if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
(domain->iop.mode == PAGE_MODE_NONE))
return 0;
 
-   return (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+   r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+
+   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
 }
 
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
@@ -2167,7 +2172,13 @@ static void amd_iommu_flush_iotlb_all(struct 
iommu_domain *domain)
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
-   amd_iommu_flush_iotlb_all(domain);
+   struct protection_domain *dom = to_pdomain(domain);
+   unsigned long flags;
+
+   spin_lock_irqsave(&dom->lock, flags);
+   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+   amd_iommu_domain_flush_complete(dom);
+   spin_unlock_irqrestore(&dom->lock, flags);
 }
 
 static int amd_iommu_def_domain_type(struct device *dev)
-- 
2.25.1



[PATCH v4 0/7] iommu/amd: Enable page-selective flushes

2021-06-16 Thread Nadav Amit
From: Nadav Amit 

The previous patch, commit 268aa4548277 ("iommu/amd: Page-specific
invalidations for more than one page") was supposed to enable
page-selective IOTLB flushes on AMD.

Besides the bug that was already fixed by commit a017c567915f
("iommu/amd: Fix wrong parentheses on page-specific invalidations")
there are several remaining matters to enable and benefit from
page-selective IOTLB flushes on AMD:

1. Enable selective flushes on unmap (patch 1)
2. Avoid using flush-queue on vIOMMUs (patch 2)
3. Relaxed flushes when gathering, excluding vIOMMUs (patches 3-5)
4. Syncing once on scatter-gather map operations (patch 6)
5. Breaking flushes to naturally aligned ranges on vIOMMU (patch 7)

The main difference in this version is that the vIOMMU flushing logic
was improved based on Robin's feedback. Batching decisions are no longer
based on alignment; instead, the flushing range is broken into naturally
aligned regions on sync. Doing so allows us to flush only the entries
that we modified, with the minimal number of flushes.
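
As a toy user-space sketch of that natural-alignment split (illustrative
only; it assumes a 64-bit unsigned long and that __fls()/__ffs() behave
like the GCC builtins shown, and it is not the kernel code), the range
[0x3000, 0x10000) ends up as three flushes:

#include <stdio.h>

/* Break [address, address + size) into naturally aligned power-of-2 chunks. */
static void split_flush(unsigned long address, unsigned long size)
{
	while (size) {
		int align = 63 - __builtin_clzl(size);		  /* like __fls(size) */

		if (address) {
			int addr_align = __builtin_ctzl(address); /* like __ffs(address) */

			if (addr_align < align)
				align = addr_align;
		}

		printf("flush 0x%lx, length 0x%lx\n", address, 1UL << align);
		address += 1UL << align;
		size -= 1UL << align;
	}
}

int main(void)
{
	/* Flushes 0x1000 at 0x3000, 0x4000 at 0x4000, 0x8000 at 0x8000. */
	split_flush(0x3000, 0xd000);
	return 0;
}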

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org

---

v3->v4:
* Breaking flushes to naturally aligned ranges on vIOMMU [Robin]
* Removing unnecessary stubs; fixing comment [Robin]
* Removing unused variable [Yong]
* Changing pr_warn_once() to pr_warn() [Robin]
* Improving commit log [Will]

v2->v3:
* Rebase on v5.13-rc5
* Refactoring (patches 4-5) [Robin]
* Rework flush logic (patch 5): more relaxed on native
* Syncing once on scatter-gather operations (patch 6)

v1->v2:
* Rebase on v5.13-rc3

Nadav Amit (6):
  iommu/amd: Selective flush on unmap
  iommu/amd: Do not use flush-queue when NpCache is on
  iommu: Factor iommu_iotlb_gather_is_disjoint() out
  iommu/amd: Tailored gather logic for AMD
  iommu/amd: Sync once for scatter-gather operations
  iommu/amd: Use only natural aligned flushes in a VM

Robin Murphy (1):
  iommu: Improve iommu_iotlb_gather helpers

 drivers/iommu/amd/init.c  |  7 ++-
 drivers/iommu/amd/iommu.c | 96 +++
 drivers/iommu/mtk_iommu.c |  6 +--
 include/linux/iommu.h | 72 +++--
 4 files changed, 153 insertions(+), 28 deletions(-)

-- 
2.25.1



Re: [PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD

2021-06-15 Thread Nadav Amit


> On Jun 15, 2021, at 12:20 PM, Robin Murphy  wrote:
> 
> On 2021-06-15 19:14, Nadav Amit wrote:
>>> On Jun 15, 2021, at 5:55 AM, Robin Murphy  wrote:
>>> 
>>> On 2021-06-07 19:25, Nadav Amit wrote:
>>>> From: Nadav Amit 
>>>> AMD's IOMMU can flush efficiently (i.e., in a single flush) any range.
>>>> This is in contrast, for instnace, to Intel IOMMUs that have a limit on
>>>> the number of pages that can be flushed in a single flush.  In addition,
>>>> AMD's IOMMU do not care about the page-size, so changes of the page size
>>>> do not need to trigger a TLB flush.
>>>> So in most cases, a TLB flush due to disjoint range or page-size changes
>>>> are not needed for AMD. Yet, vIOMMUs require the hypervisor to
>>>> synchronize the virtualized IOMMU's PTEs with the physical ones. This
>>>> process induce overheads, so it is better not to cause unnecessary
>>>> flushes, i.e., flushes of PTEs that were not modified.
>>>> Implement and use amd_iommu_iotlb_gather_add_page() and use it instead
>>>> of the generic iommu_iotlb_gather_add_page(). Ignore page-size changes
>>>> and disjoint regions unless "non-present cache" feature is reported by
>>>> the IOMMU capabilities, as this is an indication we are running on a
>>>> physical IOMMU. A similar indication is used by VT-d (see "caching
>>>> mode"). The new logic retains the same flushing behavior that we had
>>>> before the introduction of page-selective IOTLB flushes for AMD.
>>>> On virtualized environments, check if the newly flushed region and the
>>>> gathered one are disjoint and flush if it is. Also check whether the new
>>>> region would cause IOTLB invalidation of large region that would include
>>>> unmodified PTE. The latter check is done according to the "order" of the
>>>> IOTLB flush.
>>> 
>>> If it helps,
>>> 
>>> Reviewed-by: Robin Murphy 
>> Thanks!
>>> I wonder if it might be more effective to defer the alignment-based 
>>> splitting part to amd_iommu_iotlb_sync() itself, but that could be 
>>> investigated as another follow-up.
>> Note that the alignment-based splitting is only used for virtualized AMD 
>> IOMMUs, so it has no impact for most users.
>> Right now, the performance is kind of bad on VMs since AMD’s IOMMU driver 
>> does a full IOTLB flush whenever it unmaps more than a single page. So, 
>> although your idea makes sense, I do not know exactly how to implement it 
>> right now, and regardless it is likely to provide much lower performance 
>> improvements than those that avoiding full IOTLB flushes would.
>> Having said that, if I figure out a way to implement it, I would give it a 
>> try (although I am admittedly afraid of a complicated logic that might cause 
>> subtle, mostly undetectable bugs).
> 
> I was mainly thinking that when you observe a change in "order" and sync to 
> avoid over-invalidating adjacent pages, those pages may still be part of the 
> current unmap and you've just not seen them added yet. Hence simply gathering 
> contiguous pages regardless of alignment, then breaking the total range down 
> into appropriately-aligned commands in the sync once you know you've seen 
> everything, seems like it might allow issuing fewer commands overall. But I 
> haven't quite grasped the alignment rules either, so possibly this is moot 
> anyway.

Thanks for explaining. I think that what you propose makes sense. We might 
already flush more than needed in certain cases (e.g., patterns in which pages 
are added before and after the gathered range). I doubt these cases actually 
happen, and the tradeoff between being precise in what you flush (more flushes) 
and not causing the hypervisor to check unchanged mappings (synchronization 
cost) is less obvious.

I will see if I can change __domain_flush_pages() to your liking, and see 
whether it can be part of this series.




Re: [PATCH v3 3/6] iommu: Improve iommu_iotlb_gather helpers

2021-06-15 Thread Nadav Amit



> On Jun 15, 2021, at 12:05 PM, Nadav Amit  wrote:
> 
> 
> 
>> On Jun 15, 2021, at 3:42 AM, Robin Murphy  wrote:
>> 
>> On 2021-06-07 19:25, Nadav Amit wrote:
>>> From: Robin Murphy 
>>> The Mediatek driver is not the only one which might want a basic
>>> address-based gathering behaviour, so although it's arguably simple
>>> enough to open-code, let's factor it out for the sake of cleanliness.
>>> Let's also take this opportunity to document the intent of these
>>> helpers for clarity.
>>> Cc: Joerg Roedel 
>>> Cc: Will Deacon 
>>> Cc: Jiajun Cao 
>>> Cc: Robin Murphy 
>>> Cc: Lu Baolu 
>>> Cc: iommu@lists.linux-foundation.org
>>> Cc: linux-ker...@vger.kernel.org
>>> Signed-off-by: Robin Murphy 
>> 
>> Nit: missing your signoff.
>> 
>>> ---
>>> Changes from Robin's version:
>>> * Added iommu_iotlb_gather_add_range() stub !CONFIG_IOMMU_API
>> 
>> Out of curiosity, is there any config in which a stub is actually needed? 
>> Unlike iommu_iotlb_gather_init(), I would have thought that these helpers 
>> should only ever be called by driver code which already depends on IOMMU_API.
> 
> Indeed, this was only done as a defensive step.
> 
> I will remove it. I see no reason for it. Sorry for ruining your patch.

And remove the stub for iommu_iotlb_gather_is_disjoint() as well.


Re: [PATCH v3 3/6] iommu: Improve iommu_iotlb_gather helpers

2021-06-15 Thread Nadav Amit



> On Jun 15, 2021, at 3:42 AM, Robin Murphy  wrote:
> 
> On 2021-06-07 19:25, Nadav Amit wrote:
>> From: Robin Murphy 
>> The Mediatek driver is not the only one which might want a basic
>> address-based gathering behaviour, so although it's arguably simple
>> enough to open-code, let's factor it out for the sake of cleanliness.
>> Let's also take this opportunity to document the intent of these
>> helpers for clarity.
>> Cc: Joerg Roedel 
>> Cc: Will Deacon 
>> Cc: Jiajun Cao 
>> Cc: Robin Murphy 
>> Cc: Lu Baolu 
>> Cc: iommu@lists.linux-foundation.org
>> Cc: linux-ker...@vger.kernel.org
>> Signed-off-by: Robin Murphy 
> 
> Nit: missing your signoff.
> 
>> ---
>> Changes from Robin's version:
>> * Added iommu_iotlb_gather_add_range() stub !CONFIG_IOMMU_API
> 
> Out of curiosity, is there any config in which a stub is actually needed? 
> Unlike iommu_iotlb_gather_init(), I would have thought that these helpers 
> should only ever be called by driver code which already depends on IOMMU_API.

Indeed, this was only done as a defensive step.

I will remove it. I see no reason for it. Sorry for ruining your patch.



Re: [PATCH v3 4/6] iommu: Factor iommu_iotlb_gather_is_disjoint() out

2021-06-15 Thread Nadav Amit


> On Jun 15, 2021, at 3:29 AM, Will Deacon  wrote:
> 
> On Fri, Jun 11, 2021 at 09:50:31AM -0700, Nadav Amit wrote:
>> 
>> 
>>> On Jun 11, 2021, at 6:57 AM, Will Deacon  wrote:
>>> 
>>> On Mon, Jun 07, 2021 at 11:25:39AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit 
>>>> 
>>>> Refactor iommu_iotlb_gather_add_page() and factor out the logic that
>>>> detects whether IOTLB gather range and a new range are disjoint. To be
>>>> used by the next patch that implements different gathering logic for
>>>> AMD.
>>>> 
>>>> Cc: Joerg Roedel 
>>>> Cc: Will Deacon 
>>>> Cc: Jiajun Cao 
>>>> Cc: Robin Murphy 
>>>> Cc: Lu Baolu 
>>>> Cc: iommu@lists.linux-foundation.org
>>>> Cc: linux-ker...@vger.kernel.org>
>>>> Signed-off-by: Nadav Amit 
>>>> ---
>>>> include/linux/iommu.h | 41 +
>>>> 1 file changed, 33 insertions(+), 8 deletions(-)
>>> 
>>> [...]
>>> 
>>>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>>>> index f254c62f3720..b5a2bfc68fb0 100644
>>>> --- a/include/linux/iommu.h
>>>> +++ b/include/linux/iommu.h
>>>> @@ -497,6 +497,28 @@ static inline void iommu_iotlb_sync(struct 
>>>> iommu_domain *domain,
>>>>iommu_iotlb_gather_init(iotlb_gather);
>>>> }
>>>> 
>>>> +/**
>>>> + * iommu_iotlb_gather_is_disjoint - Checks whether a new range is disjoint
>>>> + *
>>>> + * @gather: TLB gather data
>>>> + * @iova: start of page to invalidate
>>>> + * @size: size of page to invalidate
>>>> + *
>>>> + * Helper for IOMMU drivers to check whether a new range is and the 
>>>> gathered
>>>> + * range are disjoint.
>>> 
>>> I can't quite parse this. Delete the "is"?
>> 
>> Indeed. Will do (I mean I will do ;-) )
>> 
>>> 
>>>>   For many IOMMUs, flushing the IOMMU in this case is
>>>> + * better than merging the two, which might lead to unnecessary 
>>>> invalidations.
>>>> + */
>>>> +static inline
>>>> +bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
>>>> +  unsigned long iova, size_t size)
>>>> +{
>>>> +  unsigned long start = iova, end = start + size - 1;
>>>> +
>>>> +  return gather->end != 0 &&
>>>> +  (end + 1 < gather->start || start > gather->end + 1);
>>>> +}
>>>> +
>>>> +
>>>> /**
>>>> * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
>>>> * @gather: TLB gather data
>>>> @@ -533,20 +555,16 @@ static inline void 
>>>> iommu_iotlb_gather_add_page(struct iommu_domain *domain,
>>>>   struct iommu_iotlb_gather 
>>>> *gather,
>>>>   unsigned long iova, size_t size)
>>>> {
>>>> -  unsigned long start = iova, end = start + size - 1;
>>>> -
>>>>/*
>>>> * If the new page is disjoint from the current range or is mapped at
>>>> * a different granularity, then sync the TLB so that the gather
>>>> * structure can be rewritten.
>>>> */
>>>> -  if (gather->pgsize != size ||
>>>> -  end + 1 < gather->start || start > gather->end + 1) {
>>>> -  if (gather->pgsize)
>>>> -  iommu_iotlb_sync(domain, gather);
>>>> -  gather->pgsize = size;
>>>> -  }
>>>> +  if ((gather->pgsize && gather->pgsize != size) ||
>>>> +  iommu_iotlb_gather_is_disjoint(gather, iova, size))
>>>> +  iommu_iotlb_sync(domain, gather);
>>>> 
>>>> +  gather->pgsize = size;
>>> 
>>> Why have you made this unconditional? I think it's ok, but just not sure
>>> if it's necessary or not.
>> 
>> In regard to gather->pgsize, this function had (and has) an
>> invariant, in which gather->pgsize always represents the flushing
>> granularity of its range. Arguably, “size" should never be
>> zero, but lets assume for the matter of discussion that it might.
>> 
>> If “size” equals to “gather->pgsize”, then the assignment in
>> question has no impact.
>> 
>> Otherwise, if “size” is non-zero, then iommu_iotlb_sync() would
>> initialize the size and range (see iommu_iotlb_gather_init()),
>> and the invariant is kept.
>> 
>> Otherwise, “size” is zero, and “gather” already holds a range,
>> so gather->pgsize is non-zero and
>> (gather->pgsize && gather->pgsize != size) is true. Therefore,
>> again, iommu_iotlb_sync() would be called and initialize the
>> size.
>> 
>> I think that this change makes the code much simpler to read.
>> It probably has no performance impact as “gather” is probably
>> cached and anyhow accessed shortly after.
> 
> Thanks. I was just interested in whether it had a functional impact (I don't
> think it does) or whether it was just cleanup.
> 
> With the updated comment:
> 
> Acked-by: Will Deacon 

Thanks. I will add the explanation to the commit log, but not to the code in 
order not to inflate it too much.




Re: [PATCH v3 6/6] iommu/amd: Sync once for scatter-gather operations

2021-06-15 Thread Nadav Amit



> On Jun 15, 2021, at 4:25 AM, Robin Murphy  wrote:
> 
> On 2021-06-07 19:25, Nadav Amit wrote:
>> From: Nadav Amit 
>> On virtual machines, software must flush the IOTLB after each page table
>> entry update.
>> The iommu_map_sg() code iterates through the given scatter-gather list
>> and invokes iommu_map() for each element in the scatter-gather list,
>> which calls into the vendor IOMMU driver through iommu_ops callback. As
>> the result, a single sg mapping may lead to multiple IOTLB flushes.
>> Fix this by adding amd_iotlb_sync_map() callback and flushing at this
>> point after all sg mappings we set.
>> This commit is followed and inspired by commit 933fcd01e97e2
>> ("iommu/vt-d: Add iotlb_sync_map callback").
>> Cc: Joerg Roedel 
>> Cc: Will Deacon 
>> Cc: Jiajun Cao 
>> Cc: Robin Murphy 
>> Cc: Lu Baolu 
>> Cc: iommu@lists.linux-foundation.org
>> Cc: linux-ker...@vger.kernel.org
>> Signed-off-by: Nadav Amit 
>> ---
>>  drivers/iommu/amd/iommu.c | 15 ---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
>> index 128f2e889ced..dd23566f1db8 100644
>> --- a/drivers/iommu/amd/iommu.c
>> +++ b/drivers/iommu/amd/iommu.c
>> @@ -2027,6 +2027,16 @@ static int amd_iommu_attach_device(struct 
>> iommu_domain *dom,
>>  return ret;
>>  }
>>  +static void amd_iommu_iotlb_sync_map(struct iommu_domain *dom,
>> + unsigned long iova, size_t size)
>> +{
>> +struct protection_domain *domain = to_pdomain(dom);
>> +struct io_pgtable_ops *ops = >iop.iop.ops;
>> +
>> +if (ops->map)
> 
> Not too critical since you're only moving existing code around, but is 
> ops->map ever not set? Either way the check ends up looking rather 
> out-of-place here :/
> 
> It's not very clear what the original intent was - I do wonder whether it's 
> supposed to be related to PAGE_MODE_NONE, but given that amd_iommu_map() has 
> an explicit check and errors out early in that case, we'd never get here 
> anyway. Possibly something to come back and clean up later?

[ +Suravee ]

According to what I see in the git log, the checks for ops->map (as well as 
ops->unmap) were relatively recently introduced by Suravee [1] in preparation 
for AMD IOMMU v2 page tables [2]. Since I do not know what he plans, I prefer 
not to touch this code.

[1] 
https://lore.kernel.org/linux-iommu/20200923101442.73157-13-suravee.suthikulpa...@amd.com/
[2] 
https://lore.kernel.org/linux-iommu/20200923101442.73157-1-suravee.suthikulpa...@amd.com/


Re: [PATCH v3 2/6] iommu/amd: Do not use flush-queue when NpCache is on

2021-06-15 Thread Nadav Amit


> On Jun 15, 2021, at 6:08 AM, Robin Murphy  wrote:
> 
> On 2021-06-07 19:25, Nadav Amit wrote:
>> From: Nadav Amit 
>> Do not use flush-queue on virtualized environments, where the NpCache
>> capability of the IOMMU is set. This is required to reduce
>> virtualization overheads.
>> This change follows a similar change to Intel's VT-d and a detailed
>> explanation as for the rationale is described in commit 29b32839725f
>> ("iommu/vt-d: Do not use flush-queue when caching-mode is on").
>> Cc: Joerg Roedel 
>> Cc: Will Deacon 
>> Cc: Jiajun Cao 
>> Cc: Robin Murphy 
>> Cc: Lu Baolu 
>> Cc: iommu@lists.linux-foundation.org
>> Cc: linux-ker...@vger.kernel.org
>> Signed-off-by: Nadav Amit 
>> ---
>>  drivers/iommu/amd/init.c | 7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>> diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
>> index d006724f4dc2..ba3b76ed776d 100644
>> --- a/drivers/iommu/amd/init.c
>> +++ b/drivers/iommu/amd/init.c
>> @@ -1850,8 +1850,13 @@ static int __init iommu_init_pci(struct amd_iommu 
>> *iommu)
>>  if (ret)
>>  return ret;
>>  -   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
>> +if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE)) {
>> +if (!amd_iommu_unmap_flush)
>> +pr_warn_once("IOMMU batching is disabled due to 
>> virtualization");
> 
> Nit: you can just use pr_warn() (or arguably pr_info()) since the explicit 
> conditions already only match once.

Yes, my bad. I will fix it in the next version.

> Speaking of which, it might be better to use amd_iommu_np_cache instead, 
> since other patches are planning to clean up the last remnants of 
> amd_iommu_unmap_flush.

I prefer that the other patches (the ones that remove amd_iommu_unmap_flush) 
address this code as well. I certainly do not want to embed amd_iommu_np_cache 
deep into the flushing logic. IOW: I do not know exactly what you have in mind, 
but I prefer that the code remain clear.

This code follows (copies?) the same pattern and logic as commit 5f3116ea8b5 
("iommu/vt-d: Do not use flush-queue when caching-mode is on"). I see that 
the code was changed again in commit 53255e545807c ("iommu: remove 
DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE"), but intel_iommu_strict was not removed, 
so please allow me to use amd_iommu_unmap_flush.

To remind you/me/whoever: disabling batching due to caching-mode/NP-cache is 
not inherently needed. It was not needed for quite some time on Intel, but 
somehow along the way the consolidated flushing code broke it, and now it is 
needed (without intrusive code changes).


Re: [PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD

2021-06-15 Thread Nadav Amit


> On Jun 15, 2021, at 5:55 AM, Robin Murphy  wrote:
> 
> On 2021-06-07 19:25, Nadav Amit wrote:
>> From: Nadav Amit 
>> AMD's IOMMU can flush efficiently (i.e., in a single flush) any range.
>> This is in contrast, for instnace, to Intel IOMMUs that have a limit on
>> the number of pages that can be flushed in a single flush.  In addition,
>> AMD's IOMMU do not care about the page-size, so changes of the page size
>> do not need to trigger a TLB flush.
>> So in most cases, a TLB flush due to disjoint range or page-size changes
>> are not needed for AMD. Yet, vIOMMUs require the hypervisor to
>> synchronize the virtualized IOMMU's PTEs with the physical ones. This
>> process induce overheads, so it is better not to cause unnecessary
>> flushes, i.e., flushes of PTEs that were not modified.
>> Implement and use amd_iommu_iotlb_gather_add_page() and use it instead
>> of the generic iommu_iotlb_gather_add_page(). Ignore page-size changes
>> and disjoint regions unless "non-present cache" feature is reported by
>> the IOMMU capabilities, as this is an indication we are running on a
>> physical IOMMU. A similar indication is used by VT-d (see "caching
>> mode"). The new logic retains the same flushing behavior that we had
>> before the introduction of page-selective IOTLB flushes for AMD.
>> On virtualized environments, check if the newly flushed region and the
>> gathered one are disjoint and flush if it is. Also check whether the new
>> region would cause IOTLB invalidation of large region that would include
>> unmodified PTE. The latter check is done according to the "order" of the
>> IOTLB flush.
> 
> If it helps,
> 
> Reviewed-by: Robin Murphy 

Thanks!


> I wonder if it might be more effective to defer the alignment-based splitting 
> part to amd_iommu_iotlb_sync() itself, but that could be investigated as 
> another follow-up.

Note that the alignment-based splitting is only used for virtualized AMD 
IOMMUs, so it has no impact for most users.

Right now, the performance is kind of bad on VMs since AMD’s IOMMU driver does 
a full IOTLB flush whenever it unmaps more than a single page. So, although 
your idea makes sense, I do not know exactly how to implement it right now, and 
regardless it is likely to provide a much smaller performance improvement than 
avoiding full IOTLB flushes does.

Having said that, if I figure out a way to implement it, I would give it a try 
(although I am admittedly afraid of a complicated logic that might cause 
subtle, mostly undetectable bugs).

Re: [PATCH v3 4/6] iommu: Factor iommu_iotlb_gather_is_disjoint() out

2021-06-11 Thread Nadav Amit


> On Jun 11, 2021, at 6:57 AM, Will Deacon  wrote:
> 
> On Mon, Jun 07, 2021 at 11:25:39AM -0700, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> Refactor iommu_iotlb_gather_add_page() and factor out the logic that
>> detects whether IOTLB gather range and a new range are disjoint. To be
>> used by the next patch that implements different gathering logic for
>> AMD.
>> 
>> Cc: Joerg Roedel 
>> Cc: Will Deacon 
>> Cc: Jiajun Cao 
>> Cc: Robin Murphy 
>> Cc: Lu Baolu 
>> Cc: iommu@lists.linux-foundation.org
>> Cc: linux-ker...@vger.kernel.org>
>> Signed-off-by: Nadav Amit 
>> ---
>> include/linux/iommu.h | 41 +
>> 1 file changed, 33 insertions(+), 8 deletions(-)
> 
> [...]
> 
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index f254c62f3720..b5a2bfc68fb0 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -497,6 +497,28 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
>> *domain,
>>  iommu_iotlb_gather_init(iotlb_gather);
>> }
>> 
>> +/**
>> + * iommu_iotlb_gather_is_disjoint - Checks whether a new range is disjoint
>> + *
>> + * @gather: TLB gather data
>> + * @iova: start of page to invalidate
>> + * @size: size of page to invalidate
>> + *
>> + * Helper for IOMMU drivers to check whether a new range is and the gathered
>> + * range are disjoint.
> 
> I can't quite parse this. Delete the "is"?

Indeed. Will do (I mean I will do ;-) )

> 
>>For many IOMMUs, flushing the IOMMU in this case is
>> + * better than merging the two, which might lead to unnecessary 
>> invalidations.
>> + */
>> +static inline
>> +bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
>> +unsigned long iova, size_t size)
>> +{
>> +unsigned long start = iova, end = start + size - 1;
>> +
>> +return gather->end != 0 &&
>> +(end + 1 < gather->start || start > gather->end + 1);
>> +}
>> +
>> +
>> /**
>>  * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
>>  * @gather: TLB gather data
>> @@ -533,20 +555,16 @@ static inline void iommu_iotlb_gather_add_page(struct 
>> iommu_domain *domain,
>> struct iommu_iotlb_gather 
>> *gather,
>> unsigned long iova, size_t size)
>> {
>> -unsigned long start = iova, end = start + size - 1;
>> -
>>  /*
>>   * If the new page is disjoint from the current range or is mapped at
>>   * a different granularity, then sync the TLB so that the gather
>>   * structure can be rewritten.
>>   */
>> -if (gather->pgsize != size ||
>> -end + 1 < gather->start || start > gather->end + 1) {
>> -if (gather->pgsize)
>> -iommu_iotlb_sync(domain, gather);
>> -gather->pgsize = size;
>> -}
>> +if ((gather->pgsize && gather->pgsize != size) ||
>> +iommu_iotlb_gather_is_disjoint(gather, iova, size))
>> +iommu_iotlb_sync(domain, gather);
>> 
>> +gather->pgsize = size;
> 
> Why have you made this unconditional? I think it's ok, but just not sure
> if it's necessary or not.

In regard to gather->pgsize, this function had (and has) an
invariant, in which gather->pgsize always represents the flushing
granularity of its range. Arguably, “size” should never be
zero, but let us assume for the sake of discussion that it might.

If “size” equals “gather->pgsize”, then the assignment in
question has no impact.

Otherwise, if “size” is non-zero, then iommu_iotlb_sync() would
initialize the size and range (see iommu_iotlb_gather_init()),
and the invariant is kept.

Otherwise, “size” is zero, and “gather” already holds a range,
so gather->pgsize is non-zero and
(gather->pgsize && gather->pgsize != size) is true. Therefore,
again, iommu_iotlb_sync() would be called and initialize the
size.

I think that this change makes the code much simpler to read.
It probably has no performance impact as “gather” is probably
cached and anyhow accessed shortly after.

If anything, I can add a VM_BUG_ON() to check “size” is
non-zero, although this code seems correct regardless of that.
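
For reference, that hypothetical check would be a single line at the top
of iommu_iotlb_gather_add_page() (sketch only, not part of the series):

	/* Defensive check discussed above: "size" must never be zero. */
	VM_BUG_ON(!size);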




[PATCH v3 5/6] iommu/amd: Tailored gather logic for AMD

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

AMD's IOMMU can efficiently flush any range (i.e., in a single flush).
This is in contrast, for instance, to Intel IOMMUs, which have a limit
on the number of pages that can be flushed in a single flush. In
addition, AMD's IOMMU does not care about the page size, so changes of
the page size do not need to trigger a TLB flush.

So in most cases, a TLB flush due to a disjoint range or a page-size
change is not needed for AMD. Yet, vIOMMUs require the hypervisor to
synchronize the virtualized IOMMU's PTEs with the physical ones. This
process induces overheads, so it is better not to cause unnecessary
flushes, i.e., flushes of PTEs that were not modified.

Implement amd_iommu_iotlb_gather_add_page() and use it instead of the
generic iommu_iotlb_gather_add_page(). Ignore page-size changes and
disjoint regions unless the "non-present cache" feature is reported by
the IOMMU capabilities, as the absence of this feature is an indication
that we are running on a physical IOMMU. A similar indication is used by
VT-d (see "caching mode"). The new logic retains the same flushing
behavior that we had before the introduction of page-selective IOTLB
flushes for AMD.

In virtualized environments, check whether the newly flushed region and
the gathered one are disjoint, and flush if they are. Also check whether
the new region would cause an IOTLB invalidation of a large region that
would include unmodified PTEs. The latter check is done according to the
"order" of the IOTLB flush.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org>
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 44 ++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3e40f6610b6a..128f2e889ced 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2053,6 +2053,48 @@ static int amd_iommu_map(struct iommu_domain *dom, 
unsigned long iova,
return ret;
 }
 
+static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+   struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   /*
+* AMD's IOMMU can flush as many pages as necessary in a single flush.
+* Unless we run in a virtual machine, which can be inferred according
+* to whether "non-present cache" is on, it is probably best to prefer
+* (potentially) too extensive TLB flushing (i.e., more misses) over
+* multiple TLB flushes (i.e., more flushes). For virtual machines the
+* hypervisor needs to synchronize the host IOMMU PTEs with those of
+* the guest, and the trade-off is different: unnecessary TLB flushes
+* should be avoided.
+*/
+   if (amd_iommu_np_cache && gather->end != 0) {
+   unsigned long start = iova, end = start + size - 1;
+
+   if (iommu_iotlb_gather_is_disjoint(gather, iova, size)) {
+   /*
+* If the new page is disjoint from the current range,
+* flush.
+*/
+   iommu_iotlb_sync(domain, gather);
+   } else {
+   /*
+* If the order of TLB flushes increases by more than
+* 1, it means that we would have to flush PTEs that
+* were not modified. In this case, flush.
+*/
+   unsigned long new_start = min(gather->start, start);
+   unsigned long new_end = max(gather->end, end);
+   int msb_diff = fls64(gather->end ^ gather->start);
+   int new_msb_diff = fls64(new_end ^ new_start);
+
+   if (new_msb_diff > msb_diff + 1)
+   iommu_iotlb_sync(domain, gather);
+   }
+   }
+
+   iommu_iotlb_gather_add_range(gather, iova, size);
+}
+
 static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
  size_t page_size,
  struct iommu_iotlb_gather *gather)
@@ -2067,7 +2109,7 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
 
r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
 
-   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+   amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
 
return r;
 }
-- 
2.25.1



[PATCH v3 6/6] iommu/amd: Sync once for scatter-gather operations

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

On virtual machines, software must flush the IOTLB after each page table
entry update.

The iommu_map_sg() code iterates through the given scatter-gather list
and invokes iommu_map() for each element, which calls into the vendor
IOMMU driver through an iommu_ops callback. As a result, a single sg
mapping may lead to multiple IOTLB flushes.

Fix this by adding an amd_iommu_iotlb_sync_map() callback and flushing
at that point, once all of the sg mappings have been set.

This commit follows and is inspired by commit 933fcd01e97e2
("iommu/vt-d: Add iotlb_sync_map callback").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 128f2e889ced..dd23566f1db8 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2027,6 +2027,16 @@ static int amd_iommu_attach_device(struct iommu_domain 
*dom,
return ret;
 }
 
+static void amd_iommu_iotlb_sync_map(struct iommu_domain *dom,
+unsigned long iova, size_t size)
+{
+   struct protection_domain *domain = to_pdomain(dom);
+   struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+
+   if (ops->map)
+   domain_flush_np_cache(domain, iova, size);
+}
+
 static int amd_iommu_map(struct iommu_domain *dom, unsigned long iova,
 phys_addr_t paddr, size_t page_size, int iommu_prot,
 gfp_t gfp)
@@ -2045,10 +2055,8 @@ static int amd_iommu_map(struct iommu_domain *dom, 
unsigned long iova,
if (iommu_prot & IOMMU_WRITE)
prot |= IOMMU_PROT_IW;
 
-   if (ops->map) {
+   if (ops->map)
ret = ops->map(ops, iova, paddr, page_size, prot, gfp);
-   domain_flush_np_cache(domain, iova, page_size);
-   }
 
return ret;
 }
@@ -2249,6 +2257,7 @@ const struct iommu_ops amd_iommu_ops = {
.attach_dev = amd_iommu_attach_device,
.detach_dev = amd_iommu_detach_device,
.map = amd_iommu_map,
+   .iotlb_sync_map = amd_iommu_iotlb_sync_map,
.unmap = amd_iommu_unmap,
.iova_to_phys = amd_iommu_iova_to_phys,
.probe_device = amd_iommu_probe_device,
-- 
2.25.1



[PATCH v3 3/6] iommu: Improve iommu_iotlb_gather helpers

2021-06-07 Thread Nadav Amit
From: Robin Murphy 

The Mediatek driver is not the only one which might want a basic
address-based gathering behaviour, so although it's arguably simple
enough to open-code, let's factor it out for the sake of cleanliness.
Let's also take this opportunity to document the intent of these
helpers for clarity.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Robin Murphy 

---

Changes from Robin's version:
* Added an iommu_iotlb_gather_add_range() stub for !CONFIG_IOMMU_API
* Use iommu_iotlb_gather_add_range() in iommu_iotlb_gather_add_page()
---
 drivers/iommu/mtk_iommu.c |  5 +
 include/linux/iommu.h | 43 ++-
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e06b8a0e2b56..0af4a91ac7da 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -523,10 +523,7 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
unsigned long end = iova + size - 1;
 
-   if (gather->start > iova)
-   gather->start = iova;
-   if (gather->end < end)
-   gather->end = end;
+   iommu_iotlb_gather_add_range(gather, iova, size);
return dom->iop->unmap(dom->iop, iova, size, gather);
 }
 
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..f254c62f3720 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,38 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
*domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build arbitrarily-sized invalidation commands
+ * where only the address range matters, and simply minimising intermediate
+ * syncs is preferred.
+ */
+static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather 
*gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long end = iova + size - 1;
+
+   if (gather->start > iova)
+   gather->start = iova;
+   if (gather->end < end)
+   gather->end = end;
+}
+
+/**
+ * iommu_iotlb_gather_add_page - Gather for page-based TLB invalidation
+ * @domain: IOMMU domain to be invalidated
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to build invalidation commands based on individual
+ * pages, or with page size/table level hints which cannot be gathered if they
+ * differ.
+ */
 static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
   struct iommu_iotlb_gather 
*gather,
   unsigned long iova, size_t size)
@@ -515,11 +547,7 @@ static inline void iommu_iotlb_gather_add_page(struct 
iommu_domain *domain,
gather->pgsize = size;
}
 
-   if (gather->end < end)
-   gather->end = end;
-
-   if (gather->start > start)
-   gather->start = start;
+   iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
 /* PCI device grouping function */
@@ -702,6 +730,11 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
*domain,
 {
 }
 
+static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather 
*gather,
+   unsigned long iova, size_t size)
+{
+}
+
 static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, 
dma_addr_t iova)
 {
return 0;
-- 
2.25.1



[PATCH v3 4/6] iommu: Factor iommu_iotlb_gather_is_disjoint() out

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

Refactor iommu_iotlb_gather_add_page() and factor out the logic that
detects whether IOTLB gather range and a new range are disjoint. To be
used by the next patch that implements different gathering logic for
AMD.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org>
Signed-off-by: Nadav Amit 
---
 include/linux/iommu.h | 41 +
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index f254c62f3720..b5a2bfc68fb0 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -497,6 +497,28 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
*domain,
iommu_iotlb_gather_init(iotlb_gather);
 }
 
+/**
+ * iommu_iotlb_gather_is_disjoint - Checks whether a new range is disjoint
+ *
+ * @gather: TLB gather data
+ * @iova: start of page to invalidate
+ * @size: size of page to invalidate
+ *
+ * Helper for IOMMU drivers to check whether a new range is and the gathered
+ * range are disjoint. For many IOMMUs, flushing the IOMMU in this case is
+ * better than merging the two, which might lead to unnecessary invalidations.
+ */
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   unsigned long start = iova, end = start + size - 1;
+
+   return gather->end != 0 &&
+   (end + 1 < gather->start || start > gather->end + 1);
+}
+
+
 /**
  * iommu_iotlb_gather_add_range - Gather for address-based TLB invalidation
  * @gather: TLB gather data
@@ -533,20 +555,16 @@ static inline void iommu_iotlb_gather_add_page(struct 
iommu_domain *domain,
   struct iommu_iotlb_gather 
*gather,
   unsigned long iova, size_t size)
 {
-   unsigned long start = iova, end = start + size - 1;
-
/*
 * If the new page is disjoint from the current range or is mapped at
 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
-   end + 1 < gather->start || start > gather->end + 1) {
-   if (gather->pgsize)
-   iommu_iotlb_sync(domain, gather);
-   gather->pgsize = size;
-   }
+   if ((gather->pgsize && gather->pgsize != size) ||
+   iommu_iotlb_gather_is_disjoint(gather, iova, size))
+   iommu_iotlb_sync(domain, gather);
 
+   gather->pgsize = size;
iommu_iotlb_gather_add_range(gather, iova, size);
 }
 
@@ -730,6 +748,13 @@ static inline void iommu_iotlb_sync(struct iommu_domain 
*domain,
 {
 }
 
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   return false;
+}
+
 static inline void iommu_iotlb_gather_add_range(struct iommu_iotlb_gather 
*gather,
unsigned long iova, size_t size)
 {
-- 
2.25.1



[PATCH v3 2/6] iommu/amd: Do not use flush-queue when NpCache is on

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

Do not use the flush-queue in virtualized environments, where the
NpCache capability of the IOMMU is set. This is required to reduce
virtualization overheads.

This change follows a similar change to Intel's VT-d; a detailed
explanation of the rationale is given in commit 29b32839725f
("iommu/vt-d: Do not use flush-queue when caching-mode is on").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/init.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index d006724f4dc2..ba3b76ed776d 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1850,8 +1850,13 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
if (ret)
return ret;
 
-   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
+   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE)) {
+   if (!amd_iommu_unmap_flush)
+   pr_warn_once("IOMMU batching is disabled due to 
virtualization");
+
amd_iommu_np_cache = true;
+   amd_iommu_unmap_flush = true;
+   }
 
init_iommu_perf_ctr(iommu);
 
-- 
2.25.1



[PATCH v3 1/6] iommu/amd: Selective flush on unmap

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

A recent patch attempted to enable selective page flushes on AMD IOMMU
but neglected to adapt amd_iommu_iotlb_sync() to use the selective
flushes.

Adapt amd_iommu_iotlb_sync() to use selective flushes and change
amd_iommu_unmap() to collect the flushes. As a defensive measure, to
avoid potential issues such as those that the Intel IOMMU driver
encountered recently, flush the page-walk caches by always setting the
"pde" parameter. This can be removed later.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 3ac42bbdefc6..3e40f6610b6a 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2059,12 +2059,17 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, 
unsigned long iova,
 {
struct protection_domain *domain = to_pdomain(dom);
struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+   size_t r;
 
if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
(domain->iop.mode == PAGE_MODE_NONE))
return 0;
 
-   return (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+   r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+
+   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
 }
 
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
@@ -2167,7 +2172,13 @@ static void amd_iommu_flush_iotlb_all(struct 
iommu_domain *domain)
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
-   amd_iommu_flush_iotlb_all(domain);
+   struct protection_domain *dom = to_pdomain(domain);
+   unsigned long flags;
+
+   spin_lock_irqsave(&dom->lock, flags);
+   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+   amd_iommu_domain_flush_complete(dom);
+   spin_unlock_irqrestore(&dom->lock, flags);
 }
 
 static int amd_iommu_def_domain_type(struct device *dev)
-- 
2.25.1



[PATCH v3 0/6] iommu/amd: Enable page-selective flushes

2021-06-07 Thread Nadav Amit
From: Nadav Amit 

The previous patch, commit 268aa4548277 ("iommu/amd: Page-specific
invalidations for more than one page") was supposed to enable
page-selective IOTLB flushes on AMD.

Besides the bug that was already fixed by commit a017c567915f
("iommu/amd: Fix wrong parentheses on page-specific invalidations")
there are several remaining matters to enable and benefit from
page-selective IOTLB flushes on AMD:

1. Enable selective flushes on unmap (patch 1)
2. Avoid using flush-queue on vIOMMUs (patch 2)
3. Relaxed flushes when gathering, excluding vIOMMUs (patches 3-5)
4. Syncing once on scatter-gather map operations (patch 6)

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: Robin Murphy 
Cc: Lu Baolu 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org

--- 

v2->v3:
* Rebase on v5.13-rc5
* Refactoring (patches 4-5) [Robin]
* Rework flush logic (patch 5): more relaxed on native
* Syncing once on scatter-gather operations (patch 6)

v1->v2:
* Rebase on v5.13-rc3

Nadav Amit (5):
  iommu/amd: Selective flush on unmap
  iommu/amd: Do not use flush-queue when NpCache is on
  iommu: Factor iommu_iotlb_gather_is_disjoint() out
  iommu/amd: Tailored gather logic for AMD
  iommu/amd: Sync once for scatter-gather operations

Robin Murphy (1):
  iommu: Improve iommu_iotlb_gather helpers

 drivers/iommu/amd/init.c  |  7 +++-
 drivers/iommu/amd/iommu.c | 72 ++---
 drivers/iommu/mtk_iommu.c |  5 +--
 include/linux/iommu.h | 84 +--
 4 files changed, 145 insertions(+), 23 deletions(-)

-- 
2.25.1



Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes

2021-06-04 Thread Nadav Amit


> On Jun 4, 2021, at 11:53 AM, Robin Murphy  wrote:
> 
> On 2021-06-04 18:10, Nadav Amit wrote:
>>> On Jun 4, 2021, at 8:38 AM, Joerg Roedel  wrote:
>>> 
>>> Hi Nadav,
>>> 
>>> [Adding Robin]
>>> 
>>> On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote:
>>>> Nadav Amit (4):
>>>>  iommu/amd: Fix wrong parentheses on page-specific invalidations
>>> 
>>> This patch is already upstream in v5.13-rc4. Please rebase to that
>>> version.
>> I guess it would be rc5 by the time I send it.
>>> 
>>>>  iommu/amd: Selective flush on unmap
>>>>  iommu/amd: Do not sync on page size changes
>>>>  iommu/amd: Do not use flush-queue when NpCache is on
>>> 
>>> And I think there have been objections from Robin Murphy on Patch 3,
>>> have those been worked out?
>> I am still waiting for Robin’s feedback on my proposed changes. If he does 
>> not respond soon, I will drop this patch for now.
> 
> Apologies, it feels like I've spent most of this week fighting fires,
> and a great deal of email got skimmed and mentally filed under "nothing
> so wrong that I need to respond immediately"...
> 
> FWIW I would have written the simpler patch below, but beyond that I
> think it might start descending into bikeshedding - if you still prefer
> your more comprehensive refactoring, or something in between, then don't
> let my personal preference in style/complexity trade-offs stand in the
> way of getting a useful functional change into the AMD driver. Whichever
> way, though, I *am* now sold on the idea of having some kerneldoc to
> clarify these things.

Thanks, I appreciate your feedback.

I will add kerneldoc as you indicated.

I see you took some parts of the patch I did for MediaTek, but I think this is 
not good enough for AMD, since AMD's behavior should be different from 
MediaTek's - they have different needs:

MediaTek wants as few IOTLB flushes as possible, even if it results in flushing 
of many irrelevant (unmodified) entries between start and end. That’s the 
reason it can just use iommu_iotlb_gather_update_range().

In contrast, for AMD we do not want to flush too many irrelevant entries, 
specifically if the IOMMU is virtualized. When an IOTLB flush is initiated by 
the VM, the hypervisor needs to scan the IOMMU page-tables for changes and 
synchronize them with the physical IOMMU. You don't want this range to be too 
big, and that is the reason I needed iommu_iotlb_gather_is_disjoint().
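
To make it concrete, here is a rough and untested sketch of the gather logic I 
have in mind for AMD, on top of iommu_iotlb_gather_is_disjoint() and 
iommu_iotlb_gather_update_range() from the draft earlier in this thread (the 
exact helper names may still change):

static void amd_iommu_iotlb_gather_add_page(struct iommu_domain *domain,
					    struct iommu_iotlb_gather *gather,
					    unsigned long iova, size_t size)
{
	/*
	 * On bare metal the IOMMU can flush as many pages as needed in a
	 * single command, so just keep growing the gathered range. On a
	 * vIOMMU (NpCache set), flushing a huge range of mostly unmodified
	 * entries is expensive for the hypervisor, so sync first whenever
	 * the new range is disjoint from what was gathered so far.
	 */
	if (amd_iommu_np_cache &&
	    iommu_iotlb_gather_is_disjoint(gather, iova, size))
		iommu_iotlb_sync(domain, gather);

	iommu_iotlb_gather_update_range(gather, iova, size);
}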

I will add documentation, since clearly this information was not conveyed well 
enough.

Thanks again,
Nadav



Re: [PATCH v2 0/4] iommu/amd: Enable page-selective flushes

2021-06-04 Thread Nadav Amit


> On Jun 4, 2021, at 8:38 AM, Joerg Roedel  wrote:
> 
> Hi Nadav,
> 
> [Adding Robin]
> 
> On Mon, May 24, 2021 at 03:41:55PM -0700, Nadav Amit wrote:
>> Nadav Amit (4):
>>  iommu/amd: Fix wrong parentheses on page-specific invalidations
> 
> This patch is already upstream in v5.13-rc4. Please rebase to that
> version.

I guess it would be rc5 by the time I send it.

> 
>>  iommu/amd: Selective flush on unmap
>>  iommu/amd: Do not sync on page size changes
>>  iommu/amd: Do not use flush-queue when NpCache is on
> 
> And I think there have been objections from Robin Murphy on Patch 3,
> have those been worked out?

I am still waiting for Robin’s feedback on my proposed changes. If he does not 
respond soon, I will drop this patch for now.

Thanks,
Nadav

Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes

2021-06-01 Thread Nadav Amit


> On Jun 1, 2021, at 10:27 AM, Robin Murphy  wrote:
> 
> On 2021-06-01 17:39, Nadav Amit wrote:
>>> On Jun 1, 2021, at 8:59 AM, Robin Murphy  wrote:
>>> 
>>> On 2021-05-02 07:59, Nadav Amit wrote:
>>>> From: Nadav Amit 
>>>> Some IOMMU architectures perform invalidations regardless of the page
>>>> size. In such architectures there is no need to sync when the page size
>>>> changes or to regard pgsize when making interim flush in
>>>> iommu_iotlb_gather_add_page().
>>>> Add a "ignore_gather_pgsize" property for each IOMMU-ops to decide
>>>> whether gather's pgsize is tracked and triggers a flush.
>>> 
>>> I've objected before[1], and I'll readily object again ;)
>>> 
>>> I still think it's very silly to add a bunch of indirection all over the 
>>> place to make a helper function not do the main thing it's intended to help 
>>> with. If you only need trivial address gathering then it's far simpler to 
>>> just implement trivial address gathering. I suppose if you really want to 
>>> you could factor out another helper to share the 5 lines of code which 
>>> ended up in mtk-iommu (see commit f21ae3b10084).
>> Thanks, Robin.
>> I read your comments but I cannot fully understand the alternative that you 
>> propose, although I do understand your objection to the indirection 
>> “ignore_gather_pgsize”. Would it be ok if “ignore_gather_pgsize" was 
>> provided as an argument for iommu_iotlb_gather_add_page()?
> 
> No, I mean if iommu_iotlb_gather_add_page() doesn't have the behaviour your 
> driver wants, just don't call it. Write or factor out a suitable helper that 
> *does* do what you want and call that, or just implement the logic directly 
> inline. Indirect argument or not, it just doesn't make much sense to have a 
> helper function call which says "do this except don't do most of it".
> 
>> In general, I can live without this patch. It probably would have negligent 
>> impact on performance anyhow.
> 
> As I implied, it sounds like your needs are the same as the Mediatek driver 
> had, so you could readily factor out a new page-size-agnostic gather helper 
> from that. I fully support making the functional change to amd-iommu 
> *somehow* - nobody likes unnecessary syncs - just not with this particular 
> implementation :)

Hm… to avoid code duplication I need to extract some common code into another 
function.

Does the following resemble what you had in mind (untested)?

-- >8 --

Subject: [PATCH] iommu: add iommu_iotlb_gather_add_page_ignore_pgsize()

---
 drivers/iommu/mtk_iommu.c |  7 ++---
 include/linux/iommu.h | 55 ++-
 2 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index e168a682806a..5890e745bed3 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -520,12 +520,9 @@ static size_t mtk_iommu_unmap(struct iommu_domain *domain,
  struct iommu_iotlb_gather *gather)
 {
struct mtk_iommu_domain *dom = to_mtk_domain(domain);
-   unsigned long end = iova + size - 1;

-   if (gather->start > iova)
-   gather->start = iova;
-   if (gather->end < end)
-   gather->end = end;
+   iommu_iotlb_gather_update_range(gather, iova, size);
+
return dom->iop->unmap(dom->iop, iova, size, gather);
 }

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 9ca6e6b8084d..037434b6eb4c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -535,29 +535,58 @@ static inline void iommu_iotlb_sync(struct iommu_domain *domain,
iommu_iotlb_gather_init(iotlb_gather);
 }

-static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
+static inline
+void iommu_iotlb_gather_update_range(struct iommu_iotlb_gather *gather,
+unsigned long iova, size_t size)
+{
+   unsigned long start = iova, end = start + size - 1;
+
+   if (gather->end < end)
+   gather->end = end;
+
+   if (gather->start > start)
+   gather->start = start;
+
+   gather->pgsize = size;
+}
+
+static inline
+bool iommu_iotlb_gather_is_disjoint(struct iommu_iotlb_gather *gather,
+   unsigned long iova, size_t size)
+{
+   return iova + size < gather->start || iova > gather->end + 1;
+}
+
+static inline
+void iommu_iotlb_gather_add_page_ignore_pgsize(struct iommu_domain *domain,
					       struct iommu_iotlb_gather *gather,
   unsigned long iova, siz

Re: [PATCH 3/4] iommu/amd: Do not sync on page size changes

2021-06-01 Thread Nadav Amit


> On Jun 1, 2021, at 8:59 AM, Robin Murphy  wrote:
> 
> On 2021-05-02 07:59, Nadav Amit wrote:
>> From: Nadav Amit 
>> Some IOMMU architectures perform invalidations regardless of the page
>> size. In such architectures there is no need to sync when the page size
>> changes or to regard pgsize when making interim flush in
>> iommu_iotlb_gather_add_page().
>> Add a "ignore_gather_pgsize" property for each IOMMU-ops to decide
>> whether gather's pgsize is tracked and triggers a flush.
> 
> I've objected before[1], and I'll readily object again ;)
> 
> I still think it's very silly to add a bunch of indirection all over the 
> place to make a helper function not do the main thing it's intended to help 
> with. If you only need trivial address gathering then it's far simpler to 
> just implement trivial address gathering. I suppose if you really want to you 
> could factor out another helper to share the 5 lines of code which ended up 
> in mtk-iommu (see commit f21ae3b10084).

Thanks, Robin.

I read your comments but I cannot fully understand the alternative that you 
propose, although I do understand your objection to the indirection 
“ignore_gather_pgsize”. Would it be ok if “ignore_gather_pgsize" was provided 
as an argument for iommu_iotlb_gather_add_page()?

In general, I can live without this patch. It probably would have a negligible 
impact on performance anyhow.

Regards,
Nadav



Re: [PATCH 1/4] iommu/amd: Fix wrong parentheses on page-specific invalidations

2021-05-31 Thread Nadav Amit


> On May 18, 2021, at 2:23 AM, Joerg Roedel  wrote:
> 
> On Sat, May 01, 2021 at 11:59:56PM -0700, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> The logic to determine the mask of page-specific invalidations was
>> tested in userspace. As the code was copied into the kernel, the
>> parentheses were mistakenly set in the wrong place, resulting in the
>> wrong mask.
>> 
>> Fix it.
>> 
>> Cc: Joerg Roedel 
>> Cc: Will Deacon 
>> Cc: Jiajun Cao 
>> Cc: iommu@lists.linux-foundation.org
>> Cc: linux-ker...@vger.kernel.org
>> Fixes: 268aa4548277 ("iommu/amd: Page-specific invalidations for more than 
>> one page")
>> Signed-off-by: Nadav Amit 
>> ---
>> drivers/iommu/amd/iommu.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Applied this one for v5.13, thanks Nadav.
> 
> Somehow the rest of the patch-set got screwed up during sending or so,
> at least I see some patches twice in my inbox and with different
> subjects.
> 
> Can you please re-send patches 2-4 when -rc3 it out?

Joerg,

Thanks for your understanding. I sent a version based on -rc3 a week
ago.

I noticed that there was some confusion regarding rc numbers. Do you
need a new version based on rc4 or can you apply the version I sent?

Thanks,
Nadav




Re: [git pull] IOMMU Fixes for Linux v5.13-rc3

2021-05-27 Thread Nadav Amit


> On May 27, 2021, at 10:57 AM, Joerg Roedel  wrote:
> 
> Signed PGP part
> Hi Linus,
> 
> The following changes since commit d07f6ca923ea0927a1024dfccafc5b53b61cfecc:
> 
>  Linux 5.13-rc2 (2021-05-16 15:27:44 -0700)

For 5.13-rc3? Not -rc4?



[PATCH v2 4/4] iommu/amd: Do not use flush-queue when NpCache is on

2021-05-25 Thread Nadav Amit
From: Nadav Amit 

Do not use flush-queue on virtualized environments, where the NpCache
capability of the IOMMU is set. This is required to reduce
virtualization overheads.

This change follows a similar change to Intel's VT-d; a detailed
explanation of the rationale is given in commit 29b32839725f
("iommu/vt-d: Do not use flush-queue when caching-mode is on").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/init.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index d006724f4dc2..ba3b76ed776d 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1850,8 +1850,13 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
if (ret)
return ret;
 
-   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
+   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE)) {
+   if (!amd_iommu_unmap_flush)
+   pr_warn_once("IOMMU batching is disabled due to virtualization");
+
amd_iommu_np_cache = true;
+   amd_iommu_unmap_flush = true;
+   }
 
init_iommu_perf_ctr(iommu);
 
-- 
2.25.1



[PATCH v2 2/4] iommu/amd: Selective flush on unmap

2021-05-25 Thread Nadav Amit
From: Nadav Amit 

A recent patch attempted to enable selective page flushes on the AMD IOMMU
but neglected to adapt amd_iommu_iotlb_sync() to use the selective flushes.

Adapt amd_iommu_iotlb_sync() to use selective flushes and change
amd_iommu_unmap() to collect the flushes. As a defensive measure, to
avoid potential issues like those that the Intel IOMMU driver encountered
recently, flush the page-walk caches by always setting the "pde"
parameter. This can be removed later.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 6723cbcf4030..b8cabbbeed71 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2057,12 +2057,17 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 {
struct protection_domain *domain = to_pdomain(dom);
	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+   size_t r;
 
if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
(domain->iop.mode == PAGE_MODE_NONE))
return 0;
 
-   return (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+   r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+
+   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
 }
 
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
@@ -2165,7 +2170,13 @@ static void amd_iommu_flush_iotlb_all(struct iommu_domain *domain)
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
-   amd_iommu_flush_iotlb_all(domain);
+   struct protection_domain *dom = to_pdomain(domain);
+   unsigned long flags;
+
+   spin_lock_irqsave(&dom->lock, flags);
+   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+   amd_iommu_domain_flush_complete(dom);
+   spin_unlock_irqrestore(&dom->lock, flags);
 }
 
 static int amd_iommu_def_domain_type(struct device *dev)
-- 
2.25.1



[PATCH v2 3/4] iommu/amd: Do not sync on page size changes

2021-05-25 Thread Nadav Amit
From: Nadav Amit 

Some IOMMU architectures perform invalidations regardless of the page
size. In such architectures there is no need to sync when the page size
changes or to regard pgsize when making an interim flush in
iommu_iotlb_gather_add_page().

Add a "ignore_gather_pgsize" property for each IOMMU-ops to decide
whether gather's pgsize is tracked and triggers a flush.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 1 +
 include/linux/iommu.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b8cabbbeed71..1849b53f2149 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2215,6 +2215,7 @@ const struct iommu_ops amd_iommu_ops = {
.put_resv_regions = generic_iommu_put_resv_regions,
.is_attach_deferred = amd_iommu_is_attach_deferred,
.pgsize_bitmap  = AMD_IOMMU_PGSIZES,
+   .ignore_gather_pgsize = true,
.flush_iotlb_all = amd_iommu_flush_iotlb_all,
.iotlb_sync = amd_iommu_iotlb_sync,
.def_domain_type = amd_iommu_def_domain_type,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..1fb2695418e9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -284,6 +284,7 @@ struct iommu_ops {
int (*def_domain_type)(struct device *dev);
 
unsigned long pgsize_bitmap;
+   bool ignore_gather_pgsize;
struct module *owner;
 };
 
@@ -508,7 +509,7 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
+   if ((gather->pgsize != size && !domain->ops->ignore_gather_pgsize) ||
end + 1 < gather->start || start > gather->end + 1) {
if (gather->pgsize)
iommu_iotlb_sync(domain, gather);
-- 
2.25.1



[PATCH v2 1/4] iommu/amd: Fix wrong parentheses on page-specific invalidations

2021-05-25 Thread Nadav Amit
From: Nadav Amit 

The logic to determine the mask of page-specific invalidations was
tested in userspace. As the code was copied into the kernel, the
parentheses were mistakenly set in the wrong place, resulting in the
wrong mask.

Fix it.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Fixes: 268aa4548277 ("iommu/amd: Page-specific invalidations for more than one 
page")
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 80e8e1916dd1..6723cbcf4030 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -884,7 +884,7 @@ static inline u64 build_inv_address(u64 address, size_t size)
 * The msb-bit must be clear on the address. Just set all the
 * lower bits.
 */
-   address |= 1ull << (msb_diff - 1);
+   address |= (1ull << msb_diff) - 1;
}
 
/* Clear bits 11:0 */
-- 
2.25.1



[PATCH v2 0/4] iommu/amd: Enable page-selective flushes

2021-05-25 Thread Nadav Amit
From: Nadav Amit 

The previous patch, commit 268aa4548277 ("iommu/amd: Page-specific
invalidations for more than one page") was supposed to enable
page-selective IOTLB flushes on AMD.

The patch had an embarrassing bug, and I apologize for it.

Analysis of why this bug did not result in failures raised
additional issues that caused at least most of the IOTLB flushes not to
be page-selective ones.

The first patch corrects the bug from the previous patch. The next
patches enable page-selective invalidations, which were not enabled
despite the previous patch.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org

---

v1->v2:
* Rebase on v5.13-rc3

Nadav Amit (4):
  iommu/amd: Fix wrong parentheses on page-specific invalidations
  iommu/amd: Selective flush on unmap
  iommu/amd: Do not sync on page size changes
  iommu/amd: Do not use flush-queue when NpCache is on

 drivers/iommu/amd/init.c  |  7 ++-
 drivers/iommu/amd/iommu.c | 18 +++---
 include/linux/iommu.h |  3 ++-
 3 files changed, 23 insertions(+), 5 deletions(-)

-- 
2.25.1



[PATCH 4/4] iommu/amd: Do not use flush-queue when NpCache is on

2021-05-02 Thread Nadav Amit
From: Nadav Amit 

Do not use flush-queue on virtualized environments, where the NpCache
capability of the IOMMU is set. This is required to reduce
virtualization overheads.

This change follows a similar change to Intel's VT-d; a detailed
explanation of the rationale is given in commit 29b32839725f
("iommu/vt-d: Do not use flush-queue when caching-mode is on").

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/init.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index d006724f4dc2..ba3b76ed776d 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1850,8 +1850,13 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
if (ret)
return ret;
 
-   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE))
+   if (iommu->cap & (1UL << IOMMU_CAP_NPCACHE)) {
+   if (!amd_iommu_unmap_flush)
+   pr_warn_once("IOMMU batching is disabled due to virtualization");
+
amd_iommu_np_cache = true;
+   amd_iommu_unmap_flush = true;
+   }
 
init_iommu_perf_ctr(iommu);
 
-- 
2.25.1



[PATCH 3/4] iommu/amd: Selective flush on unmap

2021-05-02 Thread Nadav Amit
From: Nadav Amit 

A recent patch attempted to enable selective page flushes on the AMD IOMMU
but neglected to adapt amd_iommu_iotlb_sync() to use the selective flushes.

Adapt amd_iommu_iotlb_sync() to use selective flushes and change
amd_iommu_unmap() to collect the flushes. As a defensive measure, to
avoid potential issues like those that the Intel IOMMU driver encountered
recently, flush the page-walk caches by always setting the "pde"
parameter. This can be removed later.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 8a71ad477c34..1849b53f2149 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2057,12 +2057,17 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 {
struct protection_domain *domain = to_pdomain(dom);
	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+   size_t r;
 
if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
(domain->iop.mode == PAGE_MODE_NONE))
return 0;
 
-   return (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+   r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+
+   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
 }
 
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
@@ -2165,7 +2170,13 @@ static void amd_iommu_flush_iotlb_all(struct iommu_domain *domain)
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
-   amd_iommu_flush_iotlb_all(domain);
+   struct protection_domain *dom = to_pdomain(domain);
+   unsigned long flags;
+
+   spin_lock_irqsave(&dom->lock, flags);
+   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+   amd_iommu_domain_flush_complete(dom);
+   spin_unlock_irqrestore(&dom->lock, flags);
 }
 
 static int amd_iommu_def_domain_type(struct device *dev)
-- 
2.25.1



[PATCH 3/4] iommu/amd: Do not sync on page size changes

2021-05-02 Thread Nadav Amit
From: Nadav Amit 

Some IOMMU architectures perform invalidations regardless of the page
size. In such architectures there is no need to sync when the page size
changes or to regard pgsize when making an interim flush in
iommu_iotlb_gather_add_page().

Add a "ignore_gather_pgsize" property for each IOMMU-ops to decide
whether gather's pgsize is tracked and triggers a flush.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 1 +
 include/linux/iommu.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index b8cabbbeed71..1849b53f2149 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2215,6 +2215,7 @@ const struct iommu_ops amd_iommu_ops = {
.put_resv_regions = generic_iommu_put_resv_regions,
.is_attach_deferred = amd_iommu_is_attach_deferred,
.pgsize_bitmap  = AMD_IOMMU_PGSIZES,
+   .ignore_gather_pgsize = true,
.flush_iotlb_all = amd_iommu_flush_iotlb_all,
.iotlb_sync = amd_iommu_iotlb_sync,
.def_domain_type = amd_iommu_def_domain_type,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..1fb2695418e9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -284,6 +284,7 @@ struct iommu_ops {
int (*def_domain_type)(struct device *dev);
 
unsigned long pgsize_bitmap;
+   bool ignore_gather_pgsize;
struct module *owner;
 };
 
@@ -508,7 +509,7 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
+   if ((gather->pgsize != size && !domain->ops->ignore_gather_pgsize) ||
end + 1 < gather->start || start > gather->end + 1) {
if (gather->pgsize)
iommu_iotlb_sync(domain, gather);
-- 
2.25.1



[PATCH 2/4] iommu/amd: Selective flush on unmap

2021-05-02 Thread Nadav Amit
From: Nadav Amit 

A recent patch attempted to enable selective page flushes on the AMD IOMMU
but neglected to adapt amd_iommu_iotlb_sync() to use the selective flushes.

Adapt amd_iommu_iotlb_sync() to use selective flushes and change
amd_iommu_unmap() to collect the flushes. As a defensive measure, to
avoid potential issues like those that the Intel IOMMU driver encountered
recently, flush the page-walk caches by always setting the "pde"
parameter. This can be removed later.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 6723cbcf4030..b8cabbbeed71 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2057,12 +2057,17 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova,
 {
struct protection_domain *domain = to_pdomain(dom);
	struct io_pgtable_ops *ops = &domain->iop.iop.ops;
+   size_t r;
 
if ((amd_iommu_pgtable == AMD_IOMMU_V1) &&
(domain->iop.mode == PAGE_MODE_NONE))
return 0;
 
-   return (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+   r = (ops->unmap) ? ops->unmap(ops, iova, page_size, gather) : 0;
+
+   iommu_iotlb_gather_add_page(dom, gather, iova, page_size);
+
+   return r;
 }
 
 static phys_addr_t amd_iommu_iova_to_phys(struct iommu_domain *dom,
@@ -2165,7 +2170,13 @@ static void amd_iommu_flush_iotlb_all(struct iommu_domain *domain)
 static void amd_iommu_iotlb_sync(struct iommu_domain *domain,
 struct iommu_iotlb_gather *gather)
 {
-   amd_iommu_flush_iotlb_all(domain);
+   struct protection_domain *dom = to_pdomain(domain);
+   unsigned long flags;
+
+   spin_lock_irqsave(&dom->lock, flags);
+   __domain_flush_pages(dom, gather->start, gather->end - gather->start, 1);
+   amd_iommu_domain_flush_complete(dom);
+   spin_unlock_irqrestore(&dom->lock, flags);
 }
 
 static int amd_iommu_def_domain_type(struct device *dev)
-- 
2.25.1



[PATCH 2/4] iommu/amd: Do not sync on page size changes

2021-05-02 Thread Nadav Amit
From: Nadav Amit 

Some IOMMU architectures perform invalidations regardless of the page
size. In such architectures there is no need to sync when the page size
changes, nor to regard pgsize when making an interim flush in
iommu_iotlb_gather_add_page().

Add a "no_sync_on_pgsize_change" property for each IOMMU ops to decide
whether gather's pgsize is tracked and triggers a flush.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 1 +
 include/linux/iommu.h | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 6723cbcf4030..8a71ad477c34 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2204,6 +2204,7 @@ const struct iommu_ops amd_iommu_ops = {
.put_resv_regions = generic_iommu_put_resv_regions,
.is_attach_deferred = amd_iommu_is_attach_deferred,
.pgsize_bitmap  = AMD_IOMMU_PGSIZES,
+   .ignore_gather_pgsize = true,
.flush_iotlb_all = amd_iommu_flush_iotlb_all,
.iotlb_sync = amd_iommu_iotlb_sync,
.def_domain_type = amd_iommu_def_domain_type,
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..1fb2695418e9 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -284,6 +284,7 @@ struct iommu_ops {
int (*def_domain_type)(struct device *dev);
 
unsigned long pgsize_bitmap;
+   bool ignore_gather_pgsize;
struct module *owner;
 };
 
@@ -508,7 +509,7 @@ static inline void iommu_iotlb_gather_add_page(struct iommu_domain *domain,
 * a different granularity, then sync the TLB so that the gather
 * structure can be rewritten.
 */
-   if (gather->pgsize != size ||
+   if ((gather->pgsize != size && !domain->ops->ignore_gather_pgsize) ||
end + 1 < gather->start || start > gather->end + 1) {
if (gather->pgsize)
iommu_iotlb_sync(domain, gather);
-- 
2.25.1



[PATCH 1/4] iommu/amd: Fix wrong parentheses on page-specific invalidations

2021-05-02 Thread Nadav Amit
From: Nadav Amit 

The logic to determine the mask of page-specific invalidations was
tested in userspace. As the code was copied into the kernel, the
parentheses were mistakenly set in the wrong place, resulting in the
wrong mask.

Fix it.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Fixes: 268aa4548277 ("iommu/amd: Page-specific invalidations for more than one 
page")
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 80e8e1916dd1..6723cbcf4030 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -884,7 +884,7 @@ static inline u64 build_inv_address(u64 address, size_t size)
 * The msb-bit must be clear on the address. Just set all the
 * lower bits.
 */
-   address |= 1ull << (msb_diff - 1);
+   address |= (1ull << msb_diff) - 1;
}
 
/* Clear bits 11:0 */
-- 
2.25.1



[PATCH 0/4] iommu/amd: Enable page-selective flushes

2021-05-02 Thread Nadav Amit
From: Nadav Amit 

The previous patch, commit 268aa4548277 ("iommu/amd: Page-specific
invalidations for more than one page") was supposed to enable
page-selective IOTLB flushes on AMD.

The patch had an embarrassing bug, and I apologize for it.

Analysis of why this bug did not result in failures raised
additional issues that caused at least most of the IOTLB flushes not to
be page-selective ones.

The first patch corrects the bug from the previous patch. The next
patches enable page-selective invalidations, which were not enabled
despite the previous patch.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org

Nadav Amit (4):
  iommu/amd: Fix wrong parentheses on page-specific invalidations
  iommu/amd: Selective flush on unmap
  iommu/amd: Do not sync on page size changes
  iommu/amd: Do not use flush-queue when NpCache is on

 drivers/iommu/amd/init.c  |  7 ++-
 drivers/iommu/amd/iommu.c | 18 +++---
 include/linux/iommu.h |  3 ++-
 3 files changed, 23 insertions(+), 5 deletions(-)

-- 
2.25.1



Re: [PATCH v2] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-30 Thread Nadav Amit


> On Apr 15, 2021, at 7:13 AM, Joerg Roedel  wrote:
> 
> On Thu, Apr 15, 2021 at 08:46:28AM +0800, Longpeng(Mike) wrote:
>> Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating 
>> superpage")
>> Cc:  # v3.0+
>> Link: 
>> https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/
>> Suggested-by: Lu Baolu 
>> Signed-off-by: Longpeng(Mike) 
>> ---
>> v1 -> v2:
>>  - add Joerg
>>  - reconstruct the solution base on the Baolu's suggestion
>> ---
>> drivers/iommu/intel/iommu.c | 52 
>> +
>> 1 file changed, 38 insertions(+), 14 deletions(-)
> 
> Applied, thanks.
> 

Err.. There is a bug in my patch, and some other problem. I will
investigate and get back to you.




Re: [PATCH] iommu/amd: page-specific invalidations for more than one page

2021-04-08 Thread Nadav Amit

> On Apr 8, 2021, at 12:18 AM, Joerg Roedel  wrote:
> 
> Hi Nadav,
> 
> On Wed, Apr 07, 2021 at 05:57:31PM +0000, Nadav Amit wrote:
>> I tested it on real bare-metal hardware. I ran some basic I/O workloads
>> with the IOMMU enabled, checkers enabled/disabled, and so on.
>> 
>> However, I only tested the IOMMU-flushes and I did not test that the
>> device-IOTLB flush work, since I did not have the hardware for that.
>> 
>> If you can refer me to the old patches, I will have a look and see
>> whether I can see a difference in the logic or test them. If you want
>> me to run different tests - let me know. If you want me to remove
>> the device-IOTLB invalidations logic - that is also fine with me.
> 
> Here is the patch-set, it is from 2010 and against a very old version of
> the AMD IOMMU driver:

Thanks. I looked at your code and I see a difference between the
implementations.

As far as I understand, pages are always assumed to be aligned to their
own sizes. I therefore assume that flushes should regard the lower bits
as a "mask" and not just as an encoding of the size.

In the version that you referred me to, iommu_update_domain_tlb() only
regards the size of the region to be flushed and disregards the
alignment:

+   order   = get_order(domain->flush.end - domain->flush.start);
+   mask= (0x1000ULL << order) - 1;
+   address = ((domain->flush.start & ~mask) | (mask >> 1)) & ~0xfffULL;


If you need to flush for instance the region between 0x1000-0x5000, this
version would use the address|mask of 0x1000 (16KB page). The version I
sent regards the alignment, and since the range is not aligned would use
address|mask of 0x3000 (32KB page).
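
To make the example concrete, here is a quick standalone (userspace) sketch of 
the two computations - not the kernel code, just simplified stand-ins for 
get_order()/fls64() - applied to the 0x1000-0x5000 range above:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_MASK	(~0xfffULL)

static int get_order(uint64_t size)	/* simplified kernel get_order() */
{
	if (size <= (1ULL << PAGE_SHIFT))
		return 0;
	return 64 - __builtin_clzll((size - 1) >> PAGE_SHIFT);
}

/* 2010-style encoding: only the size of the range is considered. */
static uint64_t encode_size_only(uint64_t start, uint64_t end)
{
	int order = get_order(end - start);
	uint64_t mask = (0x1000ULL << order) - 1;

	return ((start & ~mask) | (mask >> 1)) & PAGE_MASK;
}

/* Alignment-aware encoding for the range [address, address + size). */
static uint64_t encode_alignment_aware(uint64_t address, uint64_t size)
{
	uint64_t end = address + size - 1;
	int msb_diff = 63 - __builtin_clzll(end ^ address);	/* fls64() - 1 */

	return (address | ((1ULL << msb_diff) - 1)) & PAGE_MASK;
}

int main(void)
{
	/* Prints 0x1000: a 16KB invalidation at base 0 that misses 0x4000-0x4fff. */
	printf("size only:       0x%llx\n",
	       (unsigned long long)encode_size_only(0x1000, 0x5000));
	/* Prints 0x3000: a 32KB invalidation at base 0 that covers the whole range. */
	printf("alignment aware: 0x%llx\n",
	       (unsigned long long)encode_alignment_aware(0x1000, 0x4000));
	return 0;
}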

IIUC, IOVA allocations today are aligned in such a way, but at least in
the past (looking at 3.19 for that matter), it was not always like that,
which can explain the problems.

Thoughts?

Re: [PATCH] iommu/amd: page-specific invalidations for more than one page

2021-04-07 Thread Nadav Amit



> On Apr 7, 2021, at 3:01 AM, Joerg Roedel  wrote:
> 
> On Tue, Mar 23, 2021 at 02:06:19PM -0700, Nadav Amit wrote:
>> From: Nadav Amit 
>> 
>> Currently, IOMMU invalidations and device-IOTLB invalidations using
>> AMD IOMMU fall back to full address-space invalidation if more than a
>> single page need to be flushed.
>> 
>> Full flushes are especially inefficient when the IOMMU is virtualized by
>> a hypervisor, since it requires the hypervisor to synchronize the entire
>> address-space.
>> 
>> AMD IOMMUs allow to provide a mask to perform page-specific
>> invalidations for multiple pages that match the address. The mask is
>> encoded as part of the address, and the first zero bit in the address
>> (in bits [51:12]) indicates the mask size.
>> 
>> Use this hardware feature to perform selective IOMMU and IOTLB flushes.
>> Combine the logic between both for better code reuse.
>> 
>> The IOMMU invalidations passed a smoke-test. The device IOTLB
>> invalidations are untested.
> 
> Have you thoroughly tested this on real hardware? I had a patch-set
> doing the same many years ago and it lead to data corruption under load.
> Back then it could have been a bug in my code of course, but it made me
> cautious about using targeted invalidations.

I tested it on real bare-metal hardware. I ran some basic I/O workloads
with the IOMMU enabled, checkers enabled/disabled, and so on.

However, I only tested the IOMMU-flushes and I did not test that the
device-IOTLB flush work, since I did not have the hardware for that.

If you can refer me to the old patches, I will have a look and see
whether I can see a difference in the logic or test them. If you want
me to run different tests - let me know. If you want me to remove
the device-IOTLB invalidations logic - that is also fine with me.



Re: A problem of Intel IOMMU hardware ?

2021-03-26 Thread Nadav Amit


> On Mar 26, 2021, at 7:31 PM, Lu Baolu  wrote:
> 
> Hi Nadav,
> 
> On 3/19/21 12:46 AM, Nadav Amit wrote:
>> So here is my guess:
>> Intel probably used as a basis for the IOTLB an implementation of
>> some other (regular) TLB design.
>> Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):
>> "Software wishing to prevent this uncertainty should not write to
>> a paging-structure entry in a way that would change, for any linear
>> address, both the page size and either the page frame, access rights,
>> or other attributes.”
>> Now the aforementioned uncertainty is a bit different (multiple
>> *valid*  translations of a single address). Yet, perhaps this is
>> yet another thing that might happen.
>> From a brief look on the handling of MMU (not IOMMU) hugepages
>> in Linux, indeed the PMD is first cleared and flushed before a
>> new valid PMD is set. This is possible for MMUs since they
>> allow the software to handle spurious page-faults gracefully.
>> This is not the case for the IOMMU though (without PRI).
>> Not sure this explains everything though. If that is the problem,
>> then during a mapping that changes page-sizes, a TLB flush is
>> needed, similarly to the one Longpeng did manually.
> 
> I have been working with Longpeng on this issue these days. It turned
> out that your guess is right. The PMD is first cleared but not flushed
> before a new valid one is set. The previous entry might be cached in the
> paging structure caches hence leads to disaster.
> 
> In __domain_mapping():
> 
> 2352 /*
> 2353  * Ensure that old small page tables are
> 2354  * removed to make room for superpage(s).
> 2355  * We're adding new large pages, so make 
> sure
> 2356  * we don't remove their parent tables.
> 2357  */
> 2358 dma_pte_free_pagetable(domain, iov_pfn, 
> end_pfn,
> 2359 largepage_lvl + 1);
> 
> I guess adding a cache flush operation after PMD switching should solve
> the problem.
> 
> I am still not clear about this comment:
> 
> "
> This is possible for MMUs since they allow the software to handle
> spurious page-faults gracefully. This is not the case for the IOMMU
> though (without PRI).
> "
> 
> Can you please shed more light on this?

I was looking at the code in more detail, and apparently my concern
is incorrect.

I was under the assumption that the IOMMU map/unmap can merge/split
(specifically split) huge-pages. For instance, if you map 2MB and
then unmap 4KB out of the 2MB, then you would split the hugepage
and keep the rest of the mappings alive. This is the way the MMU is
usually managed. In my defense, I also saw such partial unmappings
in Longpeng’s first scenario.

If this was possible, then you would have a case in which out of 2MB
(for instance), 4KB were unmapped, and you need to split the 2MB
hugepage into 4KB pages. If you try to clear the PMD, flush, and then
set the PMD to point to a table with live 4KB PTEs, you can have
an interim state in which the PMD is not present.
at this stage might fault, and without PRI (and device support)
you do not have a way of restarting the DMA after the hugepage split
is completed.

Anyhow, this concern is apparently not relevant. I guess I was too
naive to assume the IOMMU management is similar to the MMU. I now
see that there is a comment in intel_iommu_unmap() saying:

/* Cope with horrid API which requires us to unmap more than the
   size argument if it happens to be a large-page mapping. */

Regards,
Nadav



[PATCH] iommu/amd: page-specific invalidations for more than one page

2021-03-23 Thread Nadav Amit
From: Nadav Amit 

Currently, IOMMU invalidations and device-IOTLB invalidations using
AMD IOMMU fall back to full address-space invalidation if more than a
single page needs to be flushed.

Full flushes are especially inefficient when the IOMMU is virtualized by
a hypervisor, since it requires the hypervisor to synchronize the entire
address-space.

AMD IOMMUs allow the driver to provide a mask to perform page-specific
invalidations for multiple pages that match the address. The mask is
encoded as part of the address, and the first zero bit in the address
(in bits [51:12]) indicates the mask size.

Use this hardware feature to perform selective IOMMU and IOTLB flushes.
Combine the logic between both for better code reuse.

The IOMMU invalidations passed a smoke-test. The device IOTLB
invalidations are untested.

Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: Jiajun Cao 
Cc: iommu@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/iommu/amd/iommu.c | 76 +--
 1 file changed, 42 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 9256f84f5ebf..5f2dc3d7f2dc 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -927,33 +927,58 @@ static void build_inv_dte(struct iommu_cmd *cmd, u16 devid)
CMD_SET_TYPE(cmd, CMD_INV_DEV_ENTRY);
 }
 
-static void build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
- size_t size, u16 domid, int pde)
+/*
+ * Builds an invalidation address which is suitable for one page or multiple
+ * pages. Sets the size bit (S) as needed if more than one page is flushed.
+ */
+static inline u64 build_inv_address(u64 address, size_t size)
 {
-   u64 pages;
-   bool s;
+   u64 pages, end, msb_diff;
 
pages = iommu_num_pages(address, size, PAGE_SIZE);
-   s = false;
 
-   if (pages > 1) {
+   if (pages == 1)
+   return address & PAGE_MASK;
+
+   end = address + size - 1;
+
+   /*
+* msb_diff would hold the index of the most significant bit that
+* flipped between the start and end.
+*/
+   msb_diff = fls64(end ^ address) - 1;
+
+   /*
+* Bits 63:52 are sign extended. If for some reason bit 51 is different
+* between the start and the end, invalidate everything.
+*/
+   if (unlikely(msb_diff > 51)) {
+   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
+   } else {
/*
-* If we have to flush more than one page, flush all
-* TLB entries for this domain
+* The msb-bit must be clear on the address. Just set all the
+* lower bits.
 */
-   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
-   s = true;
+   address |= 1ull << (msb_diff - 1);
}
 
+   /* Clear bits 11:0 */
address &= PAGE_MASK;
 
+   /* Set the size bit - we flush more than one 4kb page */
+   return address | CMD_INV_IOMMU_PAGES_SIZE_MASK;
+}
+
+static void build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
+ size_t size, u16 domid, int pde)
+{
+   u64 inv_address = build_inv_address(address, size);
+
memset(cmd, 0, sizeof(*cmd));
cmd->data[1] |= domid;
-   cmd->data[2]  = lower_32_bits(address);
-   cmd->data[3]  = upper_32_bits(address);
+   cmd->data[2]  = lower_32_bits(inv_address);
+   cmd->data[3]  = upper_32_bits(inv_address);
CMD_SET_TYPE(cmd, CMD_INV_IOMMU_PAGES);
-   if (s) /* size bit - we flush more than one 4kb page */
-   cmd->data[2] |= CMD_INV_IOMMU_PAGES_SIZE_MASK;
if (pde) /* PDE bit - we want to flush everything, not only the PTEs */
cmd->data[2] |= CMD_INV_IOMMU_PAGES_PDE_MASK;
 }
@@ -961,32 +986,15 @@ static void build_inv_iommu_pages(struct iommu_cmd *cmd, u64 address,
 static void build_inv_iotlb_pages(struct iommu_cmd *cmd, u16 devid, int qdep,
  u64 address, size_t size)
 {
-   u64 pages;
-   bool s;
-
-   pages = iommu_num_pages(address, size, PAGE_SIZE);
-   s = false;
-
-   if (pages > 1) {
-   /*
-* If we have to flush more than one page, flush all
-* TLB entries for this domain
-*/
-   address = CMD_INV_IOMMU_ALL_PAGES_ADDRESS;
-   s = true;
-   }
-
-   address &= PAGE_MASK;
+   u64 inv_address = build_inv_address(address, size);
 
memset(cmd, 0, sizeof(*cmd));
cmd->data[0]  = devid;
cmd->data[0] |= (qdep & 0xff) << 24;
cmd->data[1]  = devid;
-   cmd->data[2]  = lower_32_bits(address);
-   cmd->data[3]  = upper_32_bits(address);
+   cmd->data[2]  = l

Re: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Nadav Amit


> On Mar 18, 2021, at 2:25 AM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 
> 
> 
>> -Original Message-
>> From: Tian, Kevin [mailto:kevin.t...@intel.com]
>> Sent: Thursday, March 18, 2021 4:56 PM
>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>> ; Nadav Amit 
>> Cc: chenjiashang ; David Woodhouse
>> ; iommu@lists.linux-foundation.org; LKML
>> ; alex.william...@redhat.com; Gonglei (Arei)
>> ; w...@kernel.org
>> Subject: RE: A problem of Intel IOMMU hardware ?
>> 
>>> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>> 
>>> 
>>>> -Original Message-
>>>> From: Tian, Kevin [mailto:kevin.t...@intel.com]
>>>> Sent: Thursday, March 18, 2021 4:27 PM
>>>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>>> ; Nadav Amit 
>>>> Cc: chenjiashang ; David Woodhouse
>>>> ; iommu@lists.linux-foundation.org; LKML
>>>> ; alex.william...@redhat.com; Gonglei
>>> (Arei)
>>>> ; w...@kernel.org
>>>> Subject: RE: A problem of Intel IOMMU hardware ?
>>>> 
>>>>> From: iommu  On Behalf
>>>>> Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>>>> 
>>>>>> 2. Consider ensuring that the problem is not somehow related to
>>>>>> queued invalidations. Try to use __iommu_flush_iotlb() instead
>>>>>> of
>>>> qi_flush_iotlb().
>>>>>> 
>>>>> 
>>>>> I tried to force to use __iommu_flush_iotlb(), but maybe something
>>>>> wrong, the system crashed, so I prefer to lower the priority of
>>>>> this
>>> operation.
>>>>> 
>>>> 
>>>> The VT-d spec clearly says that register-based invalidation can be
>>>> used only
>>> when
>>>> queued-invalidations are not enabled. Intel-IOMMU driver doesn't
>>>> provide
>>> an
>>>> option to disable queued-invalidation though, when the hardware is
>>> capable. If you
>>>> really want to try, tweak the code in intel_iommu_init_qi.
>>>> 
>>> 
>>> Hi Kevin,
>>> 
>>> Thanks to point out this. Do you have any ideas about this problem ? I
>>> tried to descript the problem much clear in my reply to Alex, hope you
>>> could have a look if you're interested.
>>> 
>> 
>> btw I saw you used 4.18 kernel in this test. What about latest kernel?
>> 
> 
> Not test yet. It's hard to upgrade kernel in our environment.
> 
>> Also one way to separate sw/hw bug is to trace the low level interface (e.g.,
>> qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU
>> hardware. Check the window between b) and c) and see whether the software 
>> does
>> the right thing as expected there.
>> 
> 
> We add some log in iommu driver these days, the software seems fine. But we
> didn't look inside the qi_submit_sync yet, I'll try it tonight.

So here is my guess:

Intel probably used as a basis for the IOTLB an implementation of
some other (regular) TLB design.

Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):

"Software wishing to prevent this uncertainty should not write to
a paging-structure entry in a way that would change, for any linear
address, both the page size and either the page frame, access rights,
or other attributes.”


Now the aforementioned uncertainty is a bit different (multiple
*valid* translations of a single address). Yet, perhaps this is
yet another thing that might happen.

From a brief look at the handling of MMU (not IOMMU) hugepages
in Linux, indeed the PMD is first cleared and flushed before a
new valid PMD is set.
allow the software to handle spurious page-faults gracefully.
This is not the case for the IOMMU though (without PRI).

Not sure this explains everything though. If that is the problem,
then during a mapping that changes page-sizes, a TLB flush is
needed, similarly to the one Longpeng did manually.





Re: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Nadav Amit

> On Mar 17, 2021, at 9:46 PM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 

[Snip]

> 
> NOTE, the magical thing happen...(*Operation-4*) we write the PTE
> of Operation-1 from 0 to 0x3 which means can Read/Write, and then
> we trigger DMA read again, it success and return the data of HPA 0 !!
> 
> Why we modify the older page table would make sense ? As we
> have discussed previously, the cache flush part of the driver is correct,
> it call flush_iotlb after (b) and no need to flush after (c). But the result
> of the experiment shows the older page table or older caches is effective
> actually.
> 
> Any ideas ?

Interesting. Sounds as if there is some page-walk cache that was not
invalidated properly.




Re: A problem of Intel IOMMU hardware ?

2021-03-17 Thread Nadav Amit


> On Mar 17, 2021, at 2:35 AM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 
> Hi Nadav,
> 
>> -Original Message-
>> From: Nadav Amit [mailto:nadav.a...@gmail.com]
>>>  reproduce the problem with high probability (~50%).
>> 
>> I saw Lu replied, and he is much more knowledgable than I am (I was just 
>> intrigued
>> by your email).
>> 
>> However, if I were you I would try also to remove some “optimizations” to 
>> look for
>> the root-cause (e.g., use domain specific invalidations instead of 
>> page-specific).
>> 
> 
> Good suggestion! But we did it these days, we tried to use global 
> invalidations as follow:
>   iommu->flush.flush_iotlb(iommu, did, 0, 0,
>   DMA_TLB_DSI_FLUSH);
> But can not resolve the problem.
> 
>> The first thing that comes to my mind is the invalidation hint (ih) in
>> iommu_flush_iotlb_psi(). I would remove it to see whether you get the failure
>> without it.
> 
> We also notice the IH, but the IH is always ZERO in our case, as the spec 
> says:
> '''
> Paging-structure-cache entries caching second-level mappings associated with 
> the specified
> domain-id and the second-level-input-address range are invalidated, if the 
> Invalidation Hint
> (IH) field is Clear.
> '''
> 
> It seems the software is everything fine, so we've no choice but to suspect 
> the hardware.

Ok, I am pretty much out of ideas. I have two more suggestions, but
they are much less likely to help. Yet, they can further help to rule
out software bugs:

1. dma_clear_pte() seems to be wrong IMHO. It should have used WRITE_ONCE()
to prevent a split write, which might potentially cause an "invalid" (partially
cleared) PTE to be stored in the TLB (see the sketch below). Having said that,
the subsequent IOTLB flush should have prevented the problem.

2. Consider ensuring that the problem is not somehow related to queued
invalidations. Try to use __iommu_flush_iotlb() instead of
qi_flush_iotlb().
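
To be concrete, for (1) I mean something as simple as the following (untested; 
based on my reading, dma_clear_pte() currently does a plain pte->val = 0):

static inline void dma_clear_pte(struct dma_pte *pte)
{
	/* Avoid a potential split store that leaves a half-cleared PTE visible */
	WRITE_ONCE(pte->val, 0);
}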

Regards,
Nadav



Re: A problem of Intel IOMMU hardware ?

2021-03-16 Thread Nadav Amit


> On Mar 16, 2021, at 8:16 PM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 
> Hi guys,
> 
> We find the Intel iommu cache (i.e. iotlb) maybe works wrong in a special
> situation, it would cause DMA fails or get wrong data.
> 
> The reproducer (based on Alex's vfio testsuite[1]) is in attachment, it can
> reproduce the problem with high probability (~50%).

I saw Lu replied, and he is much more knowledgeable than I am (I was just
intrigued by your email).

However, if I were you I would also try to remove some "optimizations" to
look for the root cause (e.g., use domain-specific invalidations instead
of page-specific ones).

The first thing that comes to my mind is the invalidation hint (ih) in
iommu_flush_iotlb_psi(). I would remove it to see whether you get the
failure without it.




Re: [PATCH v2] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-27 Thread Nadav Amit
> On Jan 27, 2021, at 3:25 AM, Lu Baolu  wrote:
> 
> On 2021/1/27 14:17, Nadav Amit wrote:
>> From: Nadav Amit 
>> When an Intel IOMMU is virtualized, and a physical device is
>> passed-through to the VM, changes of the virtual IOMMU need to be
>> propagated to the physical IOMMU. The hypervisor therefore needs to
>> monitor PTE mappings in the IOMMU page-tables. Intel specifications
>> provide "caching-mode" capability that a virtual IOMMU uses to report
>> that the IOMMU is virtualized and a TLB flush is needed after mapping to
>> allow the hypervisor to propagate virtual IOMMU mappings to the physical
>> IOMMU. To the best of my knowledge no real physical IOMMU reports
>> "caching-mode" as turned on.
>> Synchronizing the virtual and the physical IOMMU tables is expensive if
>> the hypervisor is unaware which PTEs have changed, as the hypervisor is
>> required to walk all the virtualized tables and look for changes.
>> Consequently, domain flushes are much more expensive than page-specific
>> flushes on virtualized IOMMUs with passthrough devices. The kernel
>> therefore exploited the "caching-mode" indication to avoid domain
>> flushing and use page-specific flushing in virtualized environments. See
>> commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching
>> mode.")
>> This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use
>> of iova deferred flushing"). Now, when batched TLB flushing is used (the
>> default), full TLB domain flushes are performed frequently, requiring
>> the hypervisor to perform expensive synchronization between the virtual
>> TLB and the physical one.
>> Getting batched TLB flushes to use in such circumstances page-specific
>> invalidations again is not easy, since the TLB invalidation scheme
>> assumes that "full" domain TLB flushes are performed for scalability.
>> Disable batched TLB flushes when caching-mode is on, as the performance
>> benefit from using batched TLB invalidations is likely to be much
>> smaller than the overhead of the virtual-to-physical IOMMU page-tables
>> synchronization.
>> Fixes: 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.")
> 
> Isn't it
> 
> Fixes: 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing")
> 
> ?

Of course it is - bad copy-paste. I will send v3.

Thanks again,
Nadav


[PATCH v3] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-27 Thread Nadav Amit
From: Nadav Amit 

When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "caching-mode" capability that a virtual IOMMU uses to report
that the IOMMU is virtualized and a TLB flush is needed after mapping to
allow the hypervisor to propagate virtual IOMMU mappings to the physical
IOMMU. To the best of my knowledge no real physical IOMMU reports
"caching-mode" as turned on.

Synchronizing the virtual and the physical IOMMU tables is expensive if
the hypervisor is unaware which PTEs have changed, as the hypervisor is
required to walk all the virtualized tables and look for changes.
Consequently, domain flushes are much more expensive than page-specific
flushes on virtualized IOMMUs with passthrough devices. The kernel
therefore exploited the "caching-mode" indication to avoid domain
flushing and use page-specific flushing in virtualized environments. See
commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching
mode.")

This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use
of iova deferred flushing"). Now, when batched TLB flushing is used (the
default), full TLB domain flushes are performed frequently, requiring
the hypervisor to perform expensive synchronization between the virtual
TLB and the physical one.

Getting batched TLB flushes to use page-specific invalidations again in
such circumstances is not easy, since the TLB invalidation scheme
assumes that "full" domain TLB flushes are performed for scalability.

Disable batched TLB flushes when caching-mode is on, as the performance
benefit from using batched TLB invalidations is likely to be much
smaller than the overhead of the virtual-to-physical IOMMU page-tables
synchronization.

Fixes: 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing")
Signed-off-by: Nadav Amit 
Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: sta...@vger.kernel.org

---
v2->v3:

* Fix the fixes tag in the commit-log (Lu).
* Minor English rephrasing of the commit-log.

v1->v2:

* disable flush queue for all domains if caching-mode is on for any
  IOMMU (Lu).
---
 drivers/iommu/intel/iommu.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c5b021..de3dd617cf60 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5373,6 +5373,36 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
return ret;
 }
 
+static bool domain_use_flush_queue(void)
+{
+   struct dmar_drhd_unit *drhd;
+   struct intel_iommu *iommu;
+   bool r = true;
+
+   if (intel_iommu_strict)
+   return false;
+
+   /*
+* The flush queue implementation does not perform page-selective
+* invalidations that are required for efficient TLB flushes in virtual
+* environments. The benefit of batching is likely to be much lower than
+* the overhead of synchronizing the virtual and physical IOMMU
+* page-tables.
+*/
+   rcu_read_lock();
+   for_each_active_iommu(iommu, drhd) {
+   if (!cap_caching_mode(iommu->cap))
+   continue;
+
+   pr_warn_once("IOMMU batching is disabled due to virtualization");
+   r = false;
+   break;
+   }
+   rcu_read_unlock();
+
+   return r;
+}
+
 static int
 intel_iommu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
@@ -5383,7 +5413,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain,
case IOMMU_DOMAIN_DMA:
switch (attr) {
case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = !intel_iommu_strict;
+   *(int *)data = domain_use_flush_queue();
return 0;
default:
return -ENODEV;
-- 
2.25.1
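
For context, a rough sketch of the consumer side (not part of the patch): the
DMA layer of that era only set up the IOVA flush queue when this attribute read
back true, so returning false here effectively forces strict, unbatched unmaps.
The helper name and callback parameters below are illustrative assumptions.

#include <linux/iommu.h>
#include <linux/iova.h>

/* Illustrative helper; mirrors how dma-iommu checked the attribute before
 * enabling deferred IOVA freeing. */
static int maybe_init_flush_queue(struct iommu_domain *domain,
				  struct iova_domain *iovad,
				  iova_flush_cb flush_cb,
				  iova_entry_dtor entry_dtor)
{
	int attr;

	/* With this patch, attr reads back 0 whenever any IOMMU reports
	 * caching mode, so virtualized setups stay in strict mode. */
	if (iommu_domain_get_attr(domain, DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
				  &attr) || !attr)
		return 0;

	return init_iova_flush_queue(iovad, flush_cb, entry_dtor);
}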



[PATCH v2] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-26 Thread Nadav Amit
From: Nadav Amit 

When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "caching-mode" capability that a virtual IOMMU uses to report
that the IOMMU is virtualized and a TLB flush is needed after mapping to
allow the hypervisor to propagate virtual IOMMU mappings to the physical
IOMMU. To the best of my knowledge no real physical IOMMU reports
"caching-mode" as turned on.

Synchronizing the virtual and the physical IOMMU tables is expensive if
the hypervisor is unaware which PTEs have changed, as the hypervisor is
required to walk all the virtualized tables and look for changes.
Consequently, domain flushes are much more expensive than page-specific
flushes on virtualized IOMMUs with passthrough devices. The kernel
therefore exploited the "caching-mode" indication to avoid domain
flushing and use page-specific flushing in virtualized environments. See
commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching
mode.")

This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use
of iova deferred flushing"). Now, when batched TLB flushing is used (the
default), full TLB domain flushes are performed frequently, requiring
the hypervisor to perform expensive synchronization between the virtual
TLB and the physical one.

Getting batched TLB flushes to use in such circumstances page-specific
invalidations again is not easy, since the TLB invalidation scheme
assumes that "full" domain TLB flushes are performed for scalability.

Disable batched TLB flushes when caching-mode is on, as the performance
benefit from using batched TLB invalidations is likely to be much
smaller than the overhead of the virtual-to-physical IOMMU page-tables
synchronization.

Fixes: 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.")
Signed-off-by: Nadav Amit 
Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: sta...@vger.kernel.org

---
v1->v2:

* disable flush queue for all domains if caching-mode is on for any
  IOMMU (Lu).
---
 drivers/iommu/intel/iommu.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c5b021..de3dd617cf60 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5373,6 +5373,36 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
return ret;
 }
 
+static bool domain_use_flush_queue(void)
+{
+   struct dmar_drhd_unit *drhd;
+   struct intel_iommu *iommu;
+   bool r = true;
+
+   if (intel_iommu_strict)
+   return false;
+
+   /*
+* The flush queue implementation does not perform page-selective
+* invalidations that are required for efficient TLB flushes in virtual
+* environments. The benefit of batching is likely to be much lower than
+* the overhead of synchronizing the virtual and physical IOMMU
+* page-tables.
+*/
+   rcu_read_lock();
+   for_each_active_iommu(iommu, drhd) {
+   if (!cap_caching_mode(iommu->cap))
+   continue;
+
+   pr_warn_once("IOMMU batching is disabled due to virtualization");
+   r = false;
+   break;
+   }
+   rcu_read_unlock();
+
+   return r;
+}
+
 static int
 intel_iommu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
@@ -5383,7 +5413,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain,
case IOMMU_DOMAIN_DMA:
switch (attr) {
case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = !intel_iommu_strict;
+   *(int *)data = domain_use_flush_queue();
return 0;
default:
return -ENODEV;
-- 
2.25.1



Re: [PATCH] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-26 Thread Nadav Amit
> On Jan 26, 2021, at 4:26 PM, Lu Baolu  wrote:
> 
> Hi Nadav,
> 
> On 1/27/21 4:38 AM, Nadav Amit wrote:
>> From: Nadav Amit 
>> When an Intel IOMMU is virtualized, and a physical device is
>> passed-through to the VM, changes of the virtual IOMMU need to be
>> propagated to the physical IOMMU. The hypervisor therefore needs to
>> monitor PTE mappings in the IOMMU page-tables. Intel specifications
>> provide "caching-mode" capability that a virtual IOMMU uses to report
>> that the IOMMU is virtualized and a TLB flush is needed after mapping to
>> allow the hypervisor to propagate virtual IOMMU mappings to the physical
>> IOMMU. To the best of my knowledge no real physical IOMMU reports
>> "caching-mode" as turned on.
>> Synchronizing the virtual and the physical TLBs is expensive if the
>> hypervisor is unaware which PTEs have changed, as the hypervisor is
>> required to walk all the virtualized tables and look for changes.
>> Consequently, domain flushes are much more expensive than page-specific
>> flushes on virtualized IOMMUs with passthrough devices. The kernel
>> therefore exploited the "caching-mode" indication to avoid domain
>> flushing and use page-specific flushing in virtualized environments. See
>> commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching
>> mode.")
>> This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use
>> of iova deferred flushing"). Now, when batched TLB flushing is used (the
>> default), full TLB domain flushes are performed frequently, requiring
>> the hypervisor to perform expensive synchronization between the virtual
>> TLB and the physical one.
> 
> Good catch. Thank you!
> 
>> Getting batched TLB flushes to use in such circumstances page-specific
>> invalidations again is not easy, since the TLB invalidation scheme
>> assumes that "full" domain TLB flushes are performed for scalability.
>> Disable batched TLB flushes when caching-mode is on, as the performance
>> benefit from using batched TLB invalidations is likely to be much
>> smaller than the overhead of the virtual-to-physical IOMMU page-tables
>> synchronization.
>> Fixes: 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.")
>> Signed-off-by: Nadav Amit 
>> Cc: David Woodhouse 
>> Cc: Lu Baolu 
>> Cc: Joerg Roedel 
>> Cc: Will Deacon 
>> Cc: sta...@vger.kernel.org
>> ---
>>  drivers/iommu/intel/iommu.c | 26 +-
>>  1 file changed, 25 insertions(+), 1 deletion(-)
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 788119c5b021..4e08f5e17175 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -5373,6 +5373,30 @@ intel_iommu_domain_set_attr(struct iommu_domain 
>> *domain,
>>  return ret;
>>  }
>>  +static int
>> +intel_iommu_domain_get_attr_use_flush_queue(struct iommu_domain *domain)
> 
> Just nit:
> 
> Can we just use this
> 
> static bool domain_use_flush_queue(struct iommu_domain *domain)
> 
> ?

Sure.

> 
>> +{
>> +struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +struct intel_iommu *iommu = domain_get_iommu(dmar_domain);
>> +
>> +if (intel_iommu_strict)
>> +return 0;
>> +
>> +/*
>> + * The flush queue implementation does not perform page-selective
>> + * invalidations that are required for efficient TLB flushes in virtual
>> + * environments. The benefit of batching is likely to be much lower than
>> + * the overhead of synchronizing the virtual and physical IOMMU
>> + * page-tables.
>> + */
>> +if (iommu && cap_caching_mode(iommu->cap)) {
>> +pr_warn_once("IOMMU batching is partially disabled due to virtualization");
>> +return 0;
>> +}
> 
> domain_get_iommu() only returns the first iommu, and could return NULL
> when this is called before domain attaching to any device. A better
> choice could be check caching mode globally and return false if caching
> mode is supported on any iommu.
> 
>   struct dmar_drhd_unit *drhd;
>   struct intel_iommu *iommu;
> 
>   rcu_read_lock();
>   for_each_active_iommu(iommu, drhd) {
>   if (cap_caching_mode(iommu->cap))
>   return false;
>   }
>   rcu_read_unlock();

Err.. You are correct (thanks!).

To be frank, I do not like it. I once (10 years ago) implemented a VT-d DMAR

[PATCH] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-26 Thread Nadav Amit
From: Nadav Amit 

When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "caching-mode" capability that a virtual IOMMU uses to report
that the IOMMU is virtualized and a TLB flush is needed after mapping to
allow the hypervisor to propagate virtual IOMMU mappings to the physical
IOMMU. To the best of my knowledge no real physical IOMMU reports
"caching-mode" as turned on.

Synchronizing the virtual and the physical TLBs is expensive if the
hypervisor is unaware which PTEs have changed, as the hypervisor is
required to walk all the virtualized tables and look for changes.
Consequently, domain flushes are much more expensive than page-specific
flushes on virtualized IOMMUs with passthrough devices. The kernel
therefore exploited the "caching-mode" indication to avoid domain
flushing and use page-specific flushing in virtualized environments. See
commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching
mode.")

This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use
of iova deferred flushing"). Now, when batched TLB flushing is used (the
default), full TLB domain flushes are performed frequently, requiring
the hypervisor to perform expensive synchronization between the virtual
TLB and the physical one.

Getting batched TLB flushes to use in such circumstances page-specific
invalidations again is not easy, since the TLB invalidation scheme
assumes that "full" domain TLB flushes are performed for scalability.

Disable batched TLB flushes when caching-mode is on, as the performance
benefit from using batched TLB invalidations is likely to be much
smaller than the overhead of the virtual-to-physical IOMMU page-tables
synchronization.

Fixes: 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.")
Signed-off-by: Nadav Amit 
Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: sta...@vger.kernel.org
---
 drivers/iommu/intel/iommu.c | 26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c5b021..4e08f5e17175 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5373,6 +5373,30 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
return ret;
 }
 
+static int
+intel_iommu_domain_get_attr_use_flush_queue(struct iommu_domain *domain)
+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   struct intel_iommu *iommu = domain_get_iommu(dmar_domain);
+
+   if (intel_iommu_strict)
+   return 0;
+
+   /*
+* The flush queue implementation does not perform page-selective
+* invalidations that are required for efficient TLB flushes in virtual
+* environments. The benefit of batching is likely to be much lower than
+* the overhead of synchronizing the virtual and physical IOMMU
+* page-tables.
+*/
+   if (iommu && cap_caching_mode(iommu->cap)) {
+   pr_warn_once("IOMMU batching is partially disabled due to virtualization");
+   return 0;
+   }
+
+   return 1;
+}
+
 static int
 intel_iommu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
@@ -5383,7 +5407,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain,
case IOMMU_DOMAIN_DMA:
switch (attr) {
case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = !intel_iommu_strict;
+   *(int *)data = !intel_iommu_domain_get_attr_use_flush_queue(domain);
return 0;
default:
return -ENODEV;
-- 
2.25.1



[PATCH] iommu/vt-d: Fix wrong analysis whether devices share the same bus

2019-08-20 Thread Nadav Amit via iommu
set_msi_sid_cb() is used to determine whether device aliases share the
same bus, but it can provide false indications that aliases use the same
bus when in fact they do not. The reason is that set_msi_sid_cb()
assumes that pdev is fixed, while actually pci_for_each_dma_alias() can
call fn() when pdev is set to a subordinate device.

As a result, running a VM on ESX with VT-d emulation enabled can
result in log warnings such as:

  DMAR: [INTR-REMAP] Request device [00:11.0] fault index 3b [fault reason 38] Blocked an interrupt request due to source-id verification failure

This seems to cause additional ata errors such as:
  ata3.00: qc timeout (cmd 0xa1)
  ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)

These timeouts also cause boot to be much longer and other errors.

Fix it by comparing the alias with the previous one instead.

Fixes: 3f0c625c6ae71 ("iommu/vt-d: Allow interrupts from the entire bus for aliased devices")
Cc: sta...@vger.kernel.org
Cc: Logan Gunthorpe 
Cc: David Woodhouse 
Cc: Joerg Roedel 
Cc: Jacob Pan 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/intel_irq_remapping.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 4786ca061e31..81e43c1df7ec 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -376,13 +376,13 @@ static int set_msi_sid_cb(struct pci_dev *pdev, u16 alias, void *opaque)
 {
struct set_msi_sid_data *data = opaque;
 
+   if (data->count == 0 || PCI_BUS_NUM(alias) == PCI_BUS_NUM(data->alias))
+   data->busmatch_count++;
+
data->pdev = pdev;
data->alias = alias;
data->count++;
 
-   if (PCI_BUS_NUM(alias) == pdev->bus->number)
-   data->busmatch_count++;
-
return 0;
 }
 
-- 
2.17.1
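
For context, a hedged sketch of the caller side (simplified; the wrapper name
is an assumption and the struct lives in intel_irq_remapping.c next to the
callback): pci_for_each_dma_alias() invokes set_msi_sid_cb() once per alias and
may pass a subordinate bridge as pdev on later invocations, which is why the
fix compares each alias against the previously recorded one rather than
against pdev's own bus.

static bool aliases_share_bus(struct pci_dev *dev)
{
	struct set_msi_sid_data data;

	data.count = 0;
	data.busmatch_count = 0;

	/* set_msi_sid_cb() (above) increments busmatch_count whenever the
	 * current alias is on the same bus as the previous one. */
	pci_for_each_dma_alias(dev, set_msi_sid_cb, &data);

	return data.count >= 2 && data.busmatch_count == data.count;
}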



Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-17 Thread Nadav Amit
> On Apr 17, 2019, at 10:26 AM, Ingo Molnar  wrote:
> 
> 
> * Nadav Amit  wrote:
> 
>>> On Apr 17, 2019, at 10:09 AM, Ingo Molnar  wrote:
>>> 
>>> 
>>> * Khalid Aziz  wrote:
>>> 
>>>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>>>> of direct kernel mappings. Is this motivation still present if those 
>>>>> mappings are non-executable?
>>>>> 
>>>>> (Sorry if this has been asked and answered in previous discussions.)
>>>> 
>>>> Hi Ingo,
>>>> 
>>>> That is a good question. Because of the cost of XPFO, we have to be very
>>>> sure we need this protection. The paper from Vasileios, Michalis and
>>>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>>>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>>>> and 6.2.
>>> 
>>> So it would be nice if you could generally summarize external arguments 
>>> when defending a patchset, instead of me having to dig through a PDF 
>>> which not only causes me to spend time that you probably already spent 
>>> reading that PDF, but I might also interpret it incorrectly. ;-)
>>> 
>>> The PDF you cited says this:
>>> 
>>> "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>>>  in many platforms, including x86-64.  In our example, the content of 
>>>  user address 0xBEEF000 is also accessible through kernel address 
>>>  0x87FF9F08 as plain, executable code."
>>> 
>>> Is this actually true of modern x86-64 kernels? We've locked down W^X 
>>> protections in general.
>> 
>> As I was curious, I looked at the paper. Here is a quote from it:
>> 
>> "In x86-64, however, the permissions of physmap are not in sane state.
>> Kernels up to v3.8.13 violate the W^X property by mapping the entire region
>> as “readable, writeable, and executable” (RWX)—only very recent kernels
>> (≥v3.9) use the more conservative RW mapping.”
> 
> But v3.8.13 is a 5+ years old kernel, it doesn't count as a "modern" 
> kernel in any sense of the word. For any proposed patchset with 
> significant complexity and non-trivial costs the benchmark version 
> threshold is the "current upstream kernel".
> 
> So does that quote address my followup questions:
> 
>> Is this actually true of modern x86-64 kernels? We've locked down W^X
>> protections in general.
>> 
>> I.e. this conclusion:
>> 
>>  "Therefore, by simply overwriting kfptr with 0x87FF9F08 and
>>   triggering the kernel to dereference it, an attacker can directly
>>   execute shell code with kernel privileges."
>> 
>> ... appears to be predicated on imperfect W^X protections on the x86-64
>> kernel.
>> 
>> Do such holes exist on the latest x86-64 kernel? If yes, is there a
>> reason to believe that these W^X holes cannot be fixed, or that any fix
>> would be more expensive than XPFO?
> 
> ?
> 
> What you are proposing here is a XPFO patch-set against recent kernels 
> with significant runtime overhead, so my questions about the W^X holes 
> are warranted.
> 

Just to clarify - I am an innocent bystander and have no part in this work.
I was just looking (again) at the paper, as I was curious due to the recent
patches that I sent that improve W^X protection.


Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-17 Thread Nadav Amit
> On Apr 17, 2019, at 10:09 AM, Ingo Molnar  wrote:
> 
> 
> * Khalid Aziz  wrote:
> 
>>> I.e. the original motivation of the XPFO patches was to prevent execution 
>>> of direct kernel mappings. Is this motivation still present if those 
>>> mappings are non-executable?
>>> 
>>> (Sorry if this has been asked and answered in previous discussions.)
>> 
>> Hi Ingo,
>> 
>> That is a good question. Because of the cost of XPFO, we have to be very
>> sure we need this protection. The paper from Vasileios, Michalis and
>> Angelos - <http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf>,
>> does go into how ret2dir attacks can bypass SMAP/SMEP in sections 6.1
>> and 6.2.
> 
> So it would be nice if you could generally summarize external arguments 
> when defending a patchset, instead of me having to dig through a PDF 
> which not only causes me to spend time that you probably already spent 
> reading that PDF, but I might also interpret it incorrectly. ;-)
> 
> The PDF you cited says this:
> 
>  "Unfortunately, as shown in Table 1, the W^X prop-erty is not enforced 
>   in many platforms, including x86-64.  In our example, the content of 
>   user address 0xBEEF000 is also accessible through kernel address 
>   0x87FF9F08 as plain, executable code."
> 
> Is this actually true of modern x86-64 kernels? We've locked down W^X 
> protections in general.

As I was curious, I looked at the paper. Here is a quote from it:

"In x86-64, however, the permissions of physmap are not in sane state.
Kernels up to v3.8.13 violate the W^X property by mapping the entire region
as “readable, writeable, and executable” (RWX)—only very recent kernels
(≥v3.9) use the more conservative RW mapping.”


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Nadav Amit
> On Dec 6, 2018, at 9:43 AM, Jesper Dangaard Brouer  wrote:
> 
> On Thu,  6 Dec 2018 07:37:19 -0800
> Christoph Hellwig  wrote:
> 
>> Hi all,
>> 
>> a while ago Jesper reported major performance regressions due to the
>> spectre v2 mitigations in his XDP forwarding workloads.  A large part
>> of that is due to the DMA mapping API indirect calls.
>> 
>> It turns out that the most common implementation of the DMA API is the
>> direct mapping case, and now that we have merged almost all duplicate
>> implementations of that into a single generic one, it is easily feasible to
>> use direct calls for this fast path.
>> 
>> This patch adds a check if we are using dma_direct_ops in each fast path
>> DMA operation, and just uses a direct call instead.  For the XDP workload
>> this increases the number of packets per second from 7,438,283 to
>> 9,610,088, so it provides a very significant speedup.
> 
> Full test report avail here:
> https://github.com/xdp-project/xdp-project/blob/master/areas/dma/dma01_test_hellwig_direct_dma.org
> 
> 
>> Note that the patch depends on a lot of work either queued up in the
>> DMA mapping tree, or still out on the list from review, so to actually
>> try the patch you probably want this git tree:
>> 
>> 
>>git://git.infradead.org/users/hch/misc.git dma-direct-calls
>> 
>> Gitweb:
>> 
>>
>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-direct-calls

Did you happen to see my RFC for "automatic" indirect call promotion? 

https://lkml.org/lkml/2018/10/18/175

I hope to get v1 based on Josh responses next week.
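
The pattern Christoph describes above - test for the common direct-mapping
case and branch to a direct call, keeping the indirect call only as a
fallback - looks roughly like the sketch below. The helper names
(dma_is_direct(), dma_direct_map_page()) reflect what I believe the merged
code ended up using; treat the details as assumptions rather than the exact
patch.

#include <linux/dma-mapping.h>
#include <linux/dma-direct.h>

static inline dma_addr_t sketch_dma_map_page(struct device *dev,
					     struct page *page,
					     unsigned long offset, size_t size,
					     enum dma_data_direction dir,
					     unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	if (dma_is_direct(ops))
		/* Common case: a direct call, no retpoline penalty. */
		return dma_direct_map_page(dev, page, offset, size, dir, attrs);

	/* Uncommon case: fall back to the indirect call through ops. */
	return ops->map_page(dev, page, offset, size, dir, attrs);
}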



Re: [RFC/RFT] Add noats flag to boot parameters

2018-05-03 Thread Nadav Amit
Sinan Kaya  wrote:

> +Bjorn,
> 
> On 5/3/2018 9:59 AM, Joerg Roedel wrote:
>> On Thu, May 03, 2018 at 09:46:34AM -0400, Sinan Kaya wrote:
>>> I also like the idea in general.
>>> Minor nit..
>>> 
>>> Shouldn't this be an iommu parameter rather than a PCI kernel command line 
>>> parameter?
>>> We now have an iommu.passthrough argument that prevents page translation.
>>> 
>>> Doesn't this fit into the same category especially when it is the IOMMU 
>>> drivers that
>>> call ATS functions for enablement not the PCI drivers.
>> 
>> ATS is a bit of a grey area between PCI and IOMMU, but since ATS is
>> PCI-specific and the code to enable/disable it is in PCI as well, I
>> think the parameter makes sense for PCI too.
> 
> OK. Bjorn was interested in having command-line-driven feature enables in the
> driver/pci directory, with bitmasks for each optional PCI spec capability
> rather than per-feature noXYZ parameters.
> 
> This would allow us to troubleshoot code breakage, as well as turn off all
> optional features during platform bring-up.
> 
> Sounds like this would be a good match for that work.

I think that since this feature (ATS) has security implications, it should
be controllable through the kernel boot parameters. Otherwise, it can be
potentially too late to turn it off.

Regards,
Nadav


Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Nadav Amit
Jerome Glisse  wrote:

> On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> 
>> I'd like some more explanation about the inner working of "that new
>> user" as per comment above.
>> 
>> It would be enough to drop mmu_notifier_invalidate_range from above
>> without adding it to the filebacked case. The above gives higher prio
>> to the hypothetical and uncertain future case, than to the current
>> real filebacked case that doesn't need ->invalidate_range inside the
>> PT lock, or do you see something that might already need such
>> ->invalidate_range?
> 
> No i don't see any new user today that might need such invalidate but
> i was trying to be extra cautious as i have a tendency to assume that
> someone might do a patch that use try_to_unmap() without going through
> all the comments in the function and thus possibly using it in a an
> unexpected way from mmu_notifier callback point of view. I am fine
> with putting the burden on new user to get it right and adding an
> extra warning in the function description to try to warn people in a
> sensible way.

I must be missing something. After the PTE is changed, but before the
secondary TLB notification/invalidation - What prevents another thread from
changing the mappings (e.g., using munmap/mmap), and setting a new page
at that PTE?

Wouldn’t it end with the page being mapped without a secondary TLB flush in
between?

Nadav


Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

2017-08-31 Thread Nadav Amit
Andrea Arcangeli <aarca...@redhat.com> wrote:

> On Wed, Aug 30, 2017 at 08:47:19PM -0400, Jerome Glisse wrote:
>> On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
>>> For both CoW and KSM, the correctness is maintained by calling
>>> ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
>>> (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
>>> will cause memory corruption, and page-lock would not be enough.
>> 
>> Just to add up, the IOMMU have its own CPU page table walker and it can
>> walk the page table at any time (not the page table current to current
>> CPU, IOMMU have an array that match a PASID with a page table and device
>> request translation for a given virtual address against a PASID).
>> 
>> So this means the following can happen with ptep_clear_flush() only:
>> 
>>  CPU  | IOMMU
>>   | - walk page table populate tlb at addr A
>>  - clear pte at addr A|
>>  - set new pte|
>  mmu_notifier_invalidate_range_end | -flush IOMMU/device tlb
> 
>> Device is using old page and CPU new page :(
> 
> That condition won't be persistent.
> 
>> But with ptep_clear_flush_notify()
>> 
>>  CPU  | IOMMU
>>   | - walk page table populate tlb at addr A
>>  - clear pte at addr A|
>>  - notify -> invalidate_range | > flush IOMMU/device tlb
>>  - set new pte|
>> 
>> So now either the IOMMU see the empty pte and trigger a page fault (this is
>> if there is a racing IOMMU ATS right after the IOMMU/device tlb flush but
>> before setting the new pte) or it see the new pte. Either way both IOMMU
>> and CPU have a coherent view of what a virtual address points to.
> 
> Sure, the _notify version is obviously safe.
> 
>> Andrea explained to me the historical reasons set_pte_at_notify call the
>> change_pte callback and it was intended so that KVM could update the
>> secondary page table directly without having to fault. It is now a pointless
>> optimization as the call to range_start() happening in all the places before
>> any set_pte_at_notify() invalidate the secondary page table and thus will
>> lead to page fault for the vm. I have talk with Andrea on way to bring back
>> this optimization.
> 
> Yes, we known for a long time, the optimization got basically disabled
> when range_start/end expanded. It'd be nice to optimize change_pte
> again but this is for later.
> 
>> Yes we need the following sequence for IOMMU:
>> - clear pte
>> - invalidate IOMMU/device TLB
>> - set new pte
>> 
>> Otherwise the IOMMU page table walker can populate IOMMU/device tlb with
>> stall entry.
>> 
>> Note that this is not necessary for all the case. For try_to_unmap it
>> is fine for instance to move the IOMMU tlb shoot down after changing the
>> CPU page table as we are not pointing the pte to a different page. Either
>> we clear the pte or we set a swap entry and as long as the page that use
>> to be pointed by the pte is not free before the IOMMU tlb flush then we
>> are fine.
>> 
>> In fact i think the only case where we need the above sequence (clear,
>> flush secondary tlb, set new pte) is for COW. I think all other cases
>> we can get rid of invalidate_range() from inside the page table lock
>> and rely on invalidate_range_end() to call unconditionaly.
> 
> Even with CoW, it's not big issue if the IOMMU keeps reading from the
> old page for a little while longer (in between PT lock release and
> mmu_notifier_invalidate_range_end).
> 
> How can you tell you read the old data a bit longer, because it
> noticed the new page only when mmu_notifier_invalidate_range_end run,
> and not because the CPU was faster at writing than the IOMMU was fast
> at loading the new pagetable?
> 
> I figure it would be detectable only that the CPU could see pageA
> changing before pageB. The iommu-v2 could see pageB changing before
> pageA. If that's a concern that is the only good reason I can think of
> right now, for requiring ->invalidate_page inside the CoW PT lock to
> enforce the same ordering. However if the modifications happens
> simultaneously and it's a correct runtime, the order must not matter,
> but still it would be detectable which may not be desirable.

I don’t know what is the memory model that SVM provides, but what you
describe here potentially violates it. I don’t think user software should
be expected to deal with it.

> 
> Currently the _notify is absolutely needed to r
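
To make the ordering discussed above concrete, here is a minimal sketch of the
CoW-safe sequence (my own illustration with simplified core-mm helpers; it is
essentially what ptep_clear_flush_notify() followed by a later set_pte_at()
achieves, not actual mm/ code):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static void replace_page_safely(struct vm_area_struct *vma, unsigned long addr,
				pte_t *ptep, pte_t newpte)
{
	struct mm_struct *mm = vma->vm_mm;

	/* 1) Clear the PTE and flush the primary (CPU) TLB. */
	ptep_clear_flush(vma, addr, ptep);

	/* 2) Invalidate the secondary (IOMMU/device) TLB before the new page
	 *    becomes visible, so a device cannot keep a stale translation to
	 *    the old page while the CPU already uses the new one. */
	mmu_notifier_invalidate_range(mm, addr, addr + PAGE_SIZE);

	/* 3) Only now publish the new PTE. */
	set_pte_at(mm, addr, ptep, newpte);
}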

Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

2017-08-30 Thread Nadav Amit
[cc’ing IOMMU people, which for some reason are not cc’d]

Andrea Arcangeli <aarca...@redhat.com> wrote:

> On Wed, Aug 30, 2017 at 11:00:32AM -0700, Nadav Amit wrote:
>> It is not trivial to flush TLBs (primary or secondary) without holding the
>> page-table lock, and as we recently encountered this resulted in several
>> bugs [1]. The main problem is that even if you perform the TLB flush
>> immediately after the PT-lock is released, you cause a situation in which
>> other threads may make decisions based on the updated PTE value, without
>> being aware that a TLB flush is needed.
>> 
>> For example, we recently encountered a Linux bug when two threads run
>> MADV_DONTNEED concurrently on the same address range [2]. One of the threads
>> may update a PTE, setting it as non-present, and then deferring the TLB
>> flush (for batching). As a result, however, it would cause the second
>> thread, which also changes the PTEs to assume that the PTE is already
>> non-present and TLB flush is not necessary. As a result the second core may
>> still hold stale PTEs in its TLB following MADV_DONTNEED.
> 
> The source of those complex races that requires taking into account
> nested tlb gather to solve it, originates from skipping primary MMU
> tlb flushes depending on the value of the pagetables (as an
> optimization).
> 
> For mmu_notifier_invalidate_range_end we always ignore the value of
> the pagetables and mmu_notifier_invalidate_range_end always runs
> unconditionally invalidating the secondary MMU for the whole range
> under consideration. There are no optimizations that attempts to skip
> mmu_notifier_invalidate_range_end depending on the pagetable value and
> there's no TLB gather for secondary MMUs either. That is to keep it
> simple of course.
> 
> As long as those mmu_notifier_invalidate_range_end stay unconditional,
> I don't see how those races you linked, can be possibly relevant in
> evaluating if ->invalidate_range (again only for iommuv2 and
> intel-svm) has to run inside the PT lock or not.

Thanks for the clarifications. It now makes much more sense.

> 
>> There is a swarm of such problems, and some are not too trivial. Deferring
>> TLB flushes needs to be done in a very careful manner.
> 
> I agree it's not trivial, but I don't think any complexity comes from
> above.
> 
> The only complexity is about, what if the page is copied to some other
> page and replaced, because the copy is the only case where coherency
> could be retained by the primary MMU. What if the primary MMU starts
> working on the new page in between PT lock release and
> mmu_notifier_invalidate_range_end, while the secondary MMU is stuck on
> the old page? That is the only problem we deal with here, the copy to
> other page and replace. Any other case that doesn't involve the copy
> seems non coherent by definition, and you couldn't measure it.
> 
> I can't think of a scenario that requires the explicit
> mmu_notifier_invalidate_range call before releasing the PT lock, at
> least for try_to_unmap_one.
> 
> Could you think of a scenario where calling ->invalidate_range inside
> mmu_notifier_invalidate_range_end post PT lock breaks iommuv2 or
> intel-svm? Those two are the only ones requiring
> ->invalidate_range calls, all other mmu notifier users are safe
> without running mmu_notifier_invalidate_range_end under PT lock thanks
> to mmu_notifier_invalidate_range_start blocking the secondary MMU.
> 
> Could you post a tangible scenario that invalidates my theory that
> those mmu_notifier_invalidate_range calls inside PT lock would be
> superfluous?
> 
> Some of the scenarios under consideration:
> 
> 1) migration entry -> migration_entry_wait -> page lock, plus
>   migrate_pages taking the lock so it can't race with try_to_unmap
>   from other places
> 2) swap entry -> lookup_swap_cache -> page lock (page not really replaced)
> 3) CoW -> do_wp_page -> page lock on old page
> 4) KSM -> replace_page -> page lock on old page
> 5) if the pte is zapped as in MADV_DONTNEED, no coherency possible so
>   it's not measurable that we let the guest run a bit longer on the
>   old page past PT lock release

For both CoW and KSM, the correctness is maintained by calling
ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
(i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
will cause memory corruption, and page-lock would not be enough.

BTW: I see some calls to ptep_clear_flush_notify() which are followed
immediately after by set_pte_at_notify(). I do not understand why it makes
sense, as both notifications end up doing the same thing - invalidating the
secondary MMU. The set_pte_at_notify() 

Re: [PATCH] iommu/vt-d: Remove unnecessary qi clflushes

2016-07-05 Thread Nadav Amit
Paolo Bonzini <pbonz...@redhat.com> wrote:

> 
> 
> On 05/07/2016 18:27, Nadav Amit wrote:
>>> Although such hardware is old, there are some hypervisors that do not set
>>> the ecap.coherency of emulated IOMMUs. Yes, it is unwise, but there is no
>>> reason to further punish these hypervisors.
> 
> QEMU will need the kernel to respect ecap.coherency in order to support
> nested VFIO, for example.

To clarify - the kernel respects the coherency, but performs more clflushes
than necessary. It has no functional impact, but induces performance
degradation (which I did not measure, but is likely to be several hundred
cycles per flush).

Regards,
Nadav


Re: [PATCH] iommu/vt-d: Remove unnecessary qi clflushes

2016-07-05 Thread Nadav Amit
Nadav Amit <nadav.a...@gmail.com> wrote:

> Joerg Roedel <j...@8bytes.org> wrote:
> 
>> On Fri, Jun 24, 2016 at 06:13:14AM -0700, Nadav Amit wrote:
>>> According to the manual: "Hardware access to ...  invalidation queue ...
>>> are always coherent."
>>> 
>>> Remove unnecessary clflushes accordingly.
>> 
>> It is one thing what the spec says and another how hardware really
>> behaves. Have you tested this on (potentially really old) VT-d machines
>> to make sure the spec is _always_ right here?
> 
> No, I didn’t, as the commit message says. I would be happy for someone’s
> Tested-by.
> 
> Having said that - FreeBSD does not do these (unnecessary)
> invalidations [1], and their code comment clearly says it. Since this code
> is not new, I would assume FreeBSD would crash by now if the code was
> buggy.
> 
> Although such hardware is old, there are some hypervisors that do not set
> the ecap.coherency of emulated IOMMUs. Yes, it is unwise, but there is no
> reason to further punish these hypervisors.

Ping? To clarify - this behavior applies to guests of VMware and of KVM
(which uses QEMU for IOMMU emulation).

Regardless, please respond to the other patch I sent as well.

Thanks,
Nadav


Re: [PATCH] iommu/vt-d: Remove unnecessary qi clflushes

2016-06-27 Thread Nadav Amit
Joerg Roedel <j...@8bytes.org> wrote:

> On Fri, Jun 24, 2016 at 06:13:14AM -0700, Nadav Amit wrote:
>> According to the manual: "Hardware access to ...  invalidation queue ...
>> are always coherent."
>> 
>> Remove unnecessary clflushes accordingly.
> 
> It is one thing what the spec says and another how hardware really
> behaves. Have you tested this on (potentially really old) VT-d machines
> to make sure the spec is _always_ right here?

No, I didn’t, as the commit message says. I would be happy for someone’s
Tested-by.

Having said that - FreeBSD does not do these (unnecessary)
invalidations [1], and their code comment clearly says it. Since this code
is not new, I would assume FreeBSD would crash by now if the code was
buggy.

Although such hardware is old, there are some hypervisors that do not set
the ecap.coherency of emulated IOMMUs. Yes, it is unwise, but there is no
reason to further punish these hypervisors.

Thanks,
Nadav

[1] http://web.mit.edu/freebsd/head/sys/x86/iommu/intel_qi.c

[PATCH] iommu/vt-d: Remove unnecessary qi clflushes

2016-06-24 Thread Nadav Amit
According to the manual: "Hardware access to ...  invalidation queue ...
are always coherent."

Remove unnecessary clflushes accordingly.

Signed-off-by: Nadav Amit <na...@vmware.com>

---

Build-tested since I do not have an IOMMU that does not support
coherency.
---
 drivers/iommu/dmar.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index 6a86b5d..ca56ec3 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -1155,8 +1155,6 @@ static int qi_check_fault(struct intel_iommu *iommu, int index)
(unsigned long long)qi->desc[index].high);
memcpy(&qi->desc[index], &qi->desc[wait_index],
sizeof(struct qi_desc));
-   __iommu_flush_cache(iommu, &qi->desc[index],
-   sizeof(struct qi_desc));
writel(DMA_FSTS_IQE, iommu->reg + DMAR_FSTS_REG);
return -EINVAL;
}
@@ -1231,9 +1229,6 @@ restart:
 
hw[wait_index] = wait_desc;
 
-   __iommu_flush_cache(iommu, &hw[index], sizeof(struct qi_desc));
-   __iommu_flush_cache(iommu, &hw[wait_index], sizeof(struct qi_desc));
-
qi->free_head = (qi->free_head + 2) % QI_LENGTH;
qi->free_cnt -= 2;
 
-- 
2.7.4
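
For reference, the calls removed above go through a helper that is only a
conditional clflush, roughly as below (quoted from memory, so treat it as an
approximation of the intel-iommu code); it is a no-op on coherent IOMMUs and
only matters when ecap.C is clear - typically on the emulated IOMMUs mentioned
in the discussion.

static inline void __iommu_flush_cache(struct intel_iommu *iommu,
				       void *addr, int size)
{
	/* Only non-coherent IOMMUs (ecap.C = 0) need explicit clflushes. */
	if (!ecap_coherent(iommu->ecap))
		clflush_cache_range(addr, size);
}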



[PATCH v3] iommu/vt-d: Avoid write-tearing on PTE clear

2016-06-15 Thread Nadav Amit
When a PTE is cleared, the write may be torn or performed by multiple
writes. In addition, in a 32-bit kernel, the clear is currently done with
a single 64-bit write, which does not guarantee the order of the two
underlying 32-bit stores.

The generated code right now does not seem to cause a problem, but it may
still occur in the future.

Avoid this scenario by using WRITE_ONCE, and order the writes on
32-bit kernels.

Signed-off-by: Nadav Amit <na...@vmware.com>

---
V3: Move split_dma_pte struct to dma_clear_pte (Joerg)
Add comments (Joerg)
V2: Use two WRITE_ONCE on 32-bit to avoid reordering
---
 drivers/iommu/intel-iommu.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e1852e8..5df87a3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -328,7 +328,28 @@ struct dma_pte {
 
 static inline void dma_clear_pte(struct dma_pte *pte)
 {
-   pte->val = 0;
+   /*
+* We want to prevent the compiler from doing store-tearing or multiple
+* writes when it clears the PTE. Otherwise, a DMA address may be
+* translated using a partially updated PTE.
+*/
+#ifdef CONFIG_64BIT
+   WRITE_ONCE(pte->val, 0);
+#else
+   /*
+* On 32-bit platform the PTE must be updated in two chunks. We first
+* update the lower part that holds the present bit. The two writes are
+* ordered in the byte-code by WRITE_ONCE, and in the execution by x86
+* TSO-like memory model. This allows us to avoid using dma_wmb().
+*/
+   struct split_dma_pte {
+   u32 val_low;
+   u32 val_high;
+   } __packed *sdma_pte = (struct split_dma_pte *)pte;
+
+   WRITE_ONCE(sdma_pte->val_low, 0);
+   WRITE_ONCE(sdma_pte->val_high, 0);
+#endif
 }
 
 static inline u64 dma_pte_addr(struct dma_pte *pte)
-- 
2.7.4



[PATCH v2] iommu/vt-d: Avoid write-tearing on PTE clear

2016-06-15 Thread Nadav Amit
When a PTE is cleared, the write may be torn or performed by multiple
writes. In addition, in a 32-bit kernel, the clear is currently done with
a single 64-bit write, which does not guarantee the order of the two
underlying 32-bit stores.

The generated code right now does not seem to cause a problem, but it may
still occur in theory.

Avoid this scenario by using WRITE_ONCE, and order the writes on
32-bit kernels.

Signed-off-by: Nadav Amit <na...@vmware.com>
---

V2: Use two WRITE_ONCE on 32-bit to avoid reordering

---
 drivers/iommu/intel-iommu.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e1852e8..6ffd3d3 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -326,9 +326,26 @@ struct dma_pte {
u64 val;
 };
 
+#ifndef CONFIG_64BIT
+union split_dma_pte {
+   struct {
+   u32 val_low;
+   u32 val_high;
+   };
+   u64 val;
+};
+#endif
+
 static inline void dma_clear_pte(struct dma_pte *pte)
 {
-   pte->val = 0;
+#ifdef CONFIG_64BIT
+   WRITE_ONCE(pte->val, 0);
+#else
+   union split_dma_pte *sdma_pte = (union split_dma_pte *)pte;
+
+   WRITE_ONCE(sdma_pte->val_low, 0);
+   WRITE_ONCE(sdma_pte->val_high, 0);
+#endif
 }
 
 static inline u64 dma_pte_addr(struct dma_pte *pte)
-- 
2.7.4



Re: BUG: using smp_processor_id() in preemptible [00000000] code]

2016-06-09 Thread Nadav Amit
Alan Stern  wrote:

> On Thu, 9 Jun 2016, M G Berberich wrote:
> 
>> Hello,
>> 
>> With 4.7-rc2, after detecting a USB Mass Storage device
>> 
>>  [   11.589843] usb-storage 4-2:1.0: USB Mass Storage device detected
>> 
>> a constant flow of kernel-BUGS is reported (several per second).
>> 
>> [   11.599215] BUG: using smp_processor_id() in preemptible [] code:
>> systemd-udevd/389
>> [   11.599218] caller is debug_smp_processor_id+0x17/0x20
>> [   11.599220] CPU: 4 PID: 389 Comm: systemd-udevd Not tainted 4.7.0-rc2 #6
>> [   11.599220] Hardware name: Gigabyte Technology Co., Ltd. H87-HD3/H87-HD3,
>> BIOS F10 08/18/2015
>> [   11.599223]   88080466b6c8 813fc42d
>> 0004
>> [   11.599224]  81cc1da3 88080466b6f8 8141a0f6
>> 
>> [   11.599226]  880809fe8d98 0001 000f
>> 88080466b708
>> [   11.599226] Call Trace:
>> [   11.599229]  [] dump_stack+0x4f/0x72
>> [   11.599231]  [] check_preemption_disabled+0xd6/0xe0
>> [   11.599233]  [] debug_smp_processor_id+0x17/0x20
>> [   11.599235]  [] alloc_iova_fast+0xb6/0x210
>> [   11.599238]  [] ? __wait_on_bit+0x6f/0x90
>> [   11.599240]  [] intel_alloc_iova+0x9d/0xd0
>> [   11.599241]  [] __intel_map_single+0x93/0x190
>> [   11.599242]  [] intel_map_page+0x34/0x40
>> 
>> Please see https://bugzilla.kernel.org/show_bug.cgi?id=119801 for a
>> more complete kernel-log
> 
> This looks like a bug in the memory management subsystem.  It should be 
> reported on the linux-mm mailing list (CC'ed).

This bug is IOMMU related (mailing list CC’ed) and IIUC already fixed.

Regards,
Nadav



Re: [PATCH] iommu/vt-d: Avoid write-tearing on PTE clear

2016-06-03 Thread Nadav Amit
Ping?

Nadav Amit <na...@vmware.com> wrote:

> When a PTE is cleared, the write may be torn or performed by multiple
> writes. In addition, in a 32-bit kernel, the clear is currently done with
> a single 64-bit write, which does not guarantee the order of the two
> underlying 32-bit stores.
> 
> The generated code right now does not seem to cause a problem, but it may
> still occur in theory.
> 
> Avoid this scenario by using WRITE_ONCE, and order the writes on
> 32-bit kernels.
> 
> Signed-off-by: Nadav Amit <na...@vmware.com>
> ---
> drivers/iommu/intel-iommu.c | 19 ++-
> 1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index e1852e8..4f488a5 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -326,9 +326,26 @@ struct dma_pte {
>   u64 val;
> };
> 
> +#ifndef CONFIG_64BIT
> +union split_dma_pte {
> + struct {
> + u32 val_low;
> + u32 val_high;
> + };
> + u64 val;
> +};
> +#endif
> +
> static inline void dma_clear_pte(struct dma_pte *pte)
> {
> - pte->val = 0;
> +#ifdef CONFIG_64BIT
> + WRITE_ONCE(pte->val, 0);
> +#else
> + union split_dma_pte *sdma_pte = (union split_dma_pte *)pte;
> +
> + WRITE_ONCE(sdma_pte->val_low, 0);
> + sdma_pte->val_high = 0;
> +#endif
> }
> 
> static inline u64 dma_pte_addr(struct dma_pte *pte)
> -- 
> 2.7.4




[PATCH] iommu/vt-d: Avoid write-tearing on PTE clear

2016-05-21 Thread Nadav Amit
When a PTE is cleared, the write may be torn or performed by multiple
writes. In addition, in a 32-bit kernel, the clear is currently done with
a single 64-bit write, which does not guarantee the order of the two
underlying 32-bit stores.

The generated code right now does not seem to cause a problem, but it may
still occur in theory.

Avoid this scenario by using WRITE_ONCE, and order the writes on
32-bit kernels.

Signed-off-by: Nadav Amit <na...@vmware.com>
---
 drivers/iommu/intel-iommu.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index e1852e8..4f488a5 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -326,9 +326,26 @@ struct dma_pte {
u64 val;
 };
 
+#ifndef CONFIG_64BIT
+union split_dma_pte {
+   struct {
+   u32 val_low;
+   u32 val_high;
+   };
+   u64 val;
+};
+#endif
+
 static inline void dma_clear_pte(struct dma_pte *pte)
 {
-   pte->val = 0;
+#ifdef CONFIG_64BIT
+   WRITE_ONCE(pte->val, 0);
+#else
+   union split_dma_pte *sdma_pte = (union split_dma_pte *)pte;
+
+   WRITE_ONCE(sdma_pte->val_low, 0);
+   sdma_pte->val_high = 0;
+#endif
 }
 
 static inline u64 dma_pte_addr(struct dma_pte *pte)
-- 
2.7.4



Re: [v3 13/26] KVM: Define a new interface kvm_find_dest_vcpu() for VT-d PI

2015-01-20 Thread Nadav Amit
Radim Krčmář rkrc...@redhat.com wrote:

 2015-01-14 01:27+, Wu, Feng:
 the new
 hardware even doesn't consider the TPR for lowest priority interrupts
 delivery.
 
 A bold move ... what hardware was the first to do so?
 
 I think it was starting with Nehalem.
 
 Thanks,  (Could be that QPI can't inform about TPR changes anymore ...)
 
 I played with Linux's TPR on Haswell and found that it has no effect.

Sorry for jumping into the discussion, but doesn’t it depend on
IA32_MISC_ENABLE[23]? This bit disables xTPR messages. On my machine it is
set (probably by the BIOS), but since IA32_MISC_ENABLE is not locked for
changes, the OS can control it.

Nadav
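
A hedged sketch of what "the OS can control it" means in practice - read
IA32_MISC_ENABLE and clear bit 23 (the xTPR message disable); the wrapper name
is mine and this is only an illustration, not something from the thread:

#include <asm/msr.h>
#include <asm/msr-index.h>

static void xtpr_messages_enable(void)
{
	u64 misc;

	rdmsrl(MSR_IA32_MISC_ENABLE, misc);
	if (misc & MSR_IA32_MISC_ENABLE_XTPR_DISABLE) {
		/* Bit 23 is set: xTPR messages are disabled; clear it. */
		misc &= ~MSR_IA32_MISC_ENABLE_XTPR_DISABLE;
		wrmsrl(MSR_IA32_MISC_ENABLE, misc);
	}
}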