Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface

2021-04-20 Thread Lu Baolu

On 4/20/21 3:32 PM, Keqian Zhu wrote:

Hi Baolu,

Thanks for your quick reply.

On 2021/4/20 10:09, Lu Baolu wrote:

Hi Keqian,

On 4/20/21 9:25 AM, Keqian Zhu wrote:

Hi Baolu,

On 2021/4/19 21:33, Lu Baolu wrote:

Hi Keqian,

On 2021/4/19 17:32, Keqian Zhu wrote:

+EXPORT_SYMBOL_GPL(iommu_split_block);

Do you really have any consumers of this interface other than the dirty
bit tracking? If not, I don't suggest making this a generic IOMMU
interface.

There is an implicit requirement for such interfaces. The
iommu_map/unmap(iova, size) shouldn't be called at the same time.
Currently there's no such sanity check in the iommu core. A poorly
written driver could mess up the kernel by misusing this interface.

Yes, I can't think of a scenario other than dirty tracking.

Indeed, we'd better not make them a generic interface.

Do you have any suggestion for how underlying iommu drivers can share this code
without making it a generic iommu interface?

I have a not-so-good idea: make the "split" interfaces static functions, and
pass the function pointer to start_dirty_log. But that looks weird and
inflexible.


I understand splitting/merging super pages is an optimization, but not a
functional requirement. So is it possible to let the vendor iommu driver
decide whether to split super pages when starting dirty bit tracking,
and do the opposite operation when stopping it? The requirement for

Right. If I understand you correctly, that is actually what this series does.


I mean no generic APIs; just do it within the iommu subsystem itself.
It's totally transparent to the upper level, just like what map() does.
The upper layer doesn't care whether a super page or a small page is
in use when doing a mapping, right?

If you want to consolidate some code, how about putting them in
start/stop_tracking()?


Yep, this reminds me. What we want to reuse is the "chunk by chunk" logic in split().
We can implement switch_dirty_log to be "chunk by chunk" too (just the same as sync/clear),
then the vendor iommu driver can invoke its own private implementation of split().
So we can completely remove split() from the IOMMU core layer.

Example code logic:

iommu.c:
switch_dirty_log(big range) {
	for_each_iommu_page(big range) {
		ops->switch_dirty_log(iommu_pgsize)
	}
}

vendor iommu driver:
switch_dirty_log(iommu_pgsize) {
	if (enable) {
		ops->split_block(iommu_pgsize)
		/* And other actions, such as enable hardware capability */
	} else {
		for_each_continuous_physical_address(iommu_pgsize)
			ops->merge_page()
	}
}

Besides, the vendor iommu driver can invoke split() in clear_dirty_log instead of
in switch_dirty_log.
The benefit is that we usually clear the dirty log gradually during dirty tracking,
so we can split large page mappings gradually, which speeds up start_dirty_log and
has less side effect on DMA performance.
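
A rough sketch of that lazy split (the helper names are hypothetical, just to
illustrate the idea):

vendor iommu driver:
clear_dirty_log(iommu_pgsize) {
	/* lazily split only the chunk whose dirty log is being cleared */
	if (mapping_is_block(iommu_pgsize))
		ops->split_block(iommu_pgsize)

	clear_dirty_bits(iommu_pgsize)
}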

Does it look good to you?


Yes. It's clearer now.



Thanks,
Keqian



Best regards,
baolu


Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface

2021-04-19 Thread Lu Baolu

Hi Keqian,

On 4/20/21 9:25 AM, Keqian Zhu wrote:

Hi Baolu,

On 2021/4/19 21:33, Lu Baolu wrote:

Hi Keqian,

On 2021/4/19 17:32, Keqian Zhu wrote:

+EXPORT_SYMBOL_GPL(iommu_split_block);

Do you really have any consumers of this interface other than the dirty
bit tracking? If not, I don't suggest making this a generic IOMMU
interface.

There is an implicit requirement for such interfaces. The
iommu_map/unmap(iova, size) shouldn't be called at the same time.
Currently there's no such sanity check in the iommu core. A poorly
written driver could mess up the kernel by misusing this interface.

Yes, I can't think of a scenario other than dirty tracking.

Indeed, we'd better not make them a generic interface.

Do you have any suggestion for how underlying iommu drivers can share this code
without making it a generic iommu interface?

I have a not-so-good idea: make the "split" interfaces static functions, and
pass the function pointer to start_dirty_log. But that looks weird and
inflexible.


I understand splitting/merging super pages is an optimization, but not a
functional requirement. So is it possible to let the vendor iommu driver
decide whether to split super pages when starting dirty bit tracking,
and do the opposite operation when stopping it? The requirement for

Right. If I understand you correctly, that is actually what this series does.


I mean no generic APIs; just do it within the iommu subsystem itself.
It's totally transparent to the upper level, just like what map() does.
The upper layer doesn't care whether a super page or a small page is
in use when doing a mapping, right?

If you want to consolidate some code, how about putting them in
start/stop_tracking()?

Best regards,
baolu


We implemented split/merge in the IOMMU core layer, but don't force the vendor
driver to use it.

The problem is that when we expose these interfaces to the vendor IOMMU driver,
we also expose them to upper-layer drivers.


upper layer is that starting/stopping dirty bit tracking and
mapping/unmapping are mutually exclusive.

OK, I will explicitly add the hints. Thanks.

Thanks,
Keqian




On the other hand, if a driver calls map/unmap and split/merge at the same
time, it's a bug in the driver; it should follow the rule.



Best regards,
baolu
.



Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface

2021-04-19 Thread Lu Baolu

Hi Keqian,

On 2021/4/19 17:32, Keqian Zhu wrote:

+EXPORT_SYMBOL_GPL(iommu_split_block);

Do you really have any consumers of this interface other than the dirty
bit tracking? If not, I don't suggest making this a generic IOMMU
interface.

There is an implicit requirement for such interfaces. The
iommu_map/unmap(iova, size) shouldn't be called at the same time.
Currently there's no such sanity check in the iommu core. A poorly
written driver could mess up the kernel by misusing this interface.

Yes, I can't think of a scenario other than dirty tracking.

Indeed, we'd better not make them a generic interface.

Do you have any suggestion for how underlying iommu drivers can share this code
without making it a generic iommu interface?

I have a not-so-good idea: make the "split" interfaces static functions, and
pass the function pointer to start_dirty_log. But that looks weird and
inflexible.


I understand splitting/merging super pages is an optimization, but not a
functional requirement. So is it possible to let the vendor iommu driver
decide whether to split super pages when starting dirty bit tracking,
and do the opposite operation when stopping it? The requirement for the
upper layer is that starting/stopping dirty bit tracking and
mapping/unmapping are mutually exclusive.



On the other hand, if a driver calls map/unmap and split/merge at the same
time, it's a bug in the driver; it should follow the rule.



Best regards,
baolu


Re: [PATCH] iommu: Use passthrough mode for the Intel IPUs

2021-04-19 Thread Lu Baolu

Hi Bingbu,

On 4/19/21 12:57 PM, Bingbu Cao wrote:

Intel IPU (Image Processing Unit) has its own (IO)MMU hardware.
The IPU driver allocates its own page table that is not mapped
via the DMA API, and thus the Intel IOMMU driver blocks access, giving
this error:

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:05.0] PASID 
   fault addr 76406000 [fault reason 06] PTE Read access is not set

As the IPU is not an external-facing device, it is not risky, so use
IOMMU passthrough mode for Intel IPUs.


As a quirk, does it need to be backported to stable kernels? If so, please
add a Fixes tag and Cc stable.



Signed-off-by: Bingbu Cao 
---
  drivers/iommu/intel/iommu.c   | 35 +++
  drivers/staging/media/ipu3/ipu3.c |  2 +-
  include/linux/pci_ids.h   |  5 +
  3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee0932307d64..59222d2fe73f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -55,6 +55,12 @@
  #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
  #define IS_USB_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_SERIAL_USB)
  #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
+#define IS_IPU_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_INTEL && \
+((pdev)->device == PCI_DEVICE_ID_INTEL_IPU3 ||  \
+(pdev)->device == PCI_DEVICE_ID_INTEL_IPU6 ||   \
+(pdev)->device == PCI_DEVICE_ID_INTEL_IPU6SE || \
+(pdev)->device == PCI_DEVICE_ID_INTEL_IPU6SE_P || \
+(pdev)->device == PCI_DEVICE_ID_INTEL_IPU6EP))
  #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
  
  #define IOAPIC_RANGE_START	(0xfee0)

@@ -360,6 +366,7 @@ int intel_iommu_enabled = 0;
  EXPORT_SYMBOL_GPL(intel_iommu_enabled);
  
  static int dmar_map_gfx = 1;

+static int dmar_map_ipu = 1;
  static int dmar_forcedac;
  static int intel_iommu_strict;
  static int intel_iommu_superpage = 1;
@@ -368,6 +375,7 @@ static int iommu_skip_te_disable;
  
  #define IDENTMAP_GFX		2

  #define IDENTMAP_AZALIA   4
+#define IDENTMAP_IPU   8
  
  int intel_iommu_gfx_mapped;

  EXPORT_SYMBOL_GPL(intel_iommu_gfx_mapped);
@@ -2839,6 +2847,9 @@ static int device_def_domain_type(struct device *dev)
  
  		if ((iommu_identity_mapping & IDENTMAP_GFX) && IS_GFX_DEVICE(pdev))

return IOMMU_DOMAIN_IDENTITY;
+
+   if ((iommu_identity_mapping & IDENTMAP_IPU) && IS_IPU_DEVICE(pdev))
+   return IOMMU_DOMAIN_IDENTITY;
}
  
  	return 0;

@@ -3278,6 +3289,9 @@ static int __init init_dmars(void)
if (!dmar_map_gfx)
iommu_identity_mapping |= IDENTMAP_GFX;
  
+	if (!dmar_map_ipu)

+   iommu_identity_mapping |= IDENTMAP_IPU;
+
check_tylersburg_isoch();
  
  	ret = si_domain_init(hw_pass_through);

@@ -5622,6 +5636,15 @@ static void quirk_iommu_igfx(struct pci_dev *dev)
dmar_map_gfx = 0;
  }
  
+static void quirk_iommu_ipu(struct pci_dev *dev)

+{
+   if (risky_device(dev))
+   return;
+
+   pci_info(dev, "Passthrough IOMMU for integrated Intel IPU\n");
+   dmar_map_ipu = 0;
+}
+
  /* G4x/GM45 integrated gfx dmar support is totally busted. */
  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2a40, quirk_iommu_igfx);
  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2e00, quirk_iommu_igfx);
@@ -5657,6 +5680,18 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x1632, 
quirk_iommu_igfx);
  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x163A, quirk_iommu_igfx);
  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x163D, quirk_iommu_igfx);
  
+/* disable IPU dmar support */

+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU3,
+quirk_iommu_ipu);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6EP,
+quirk_iommu_ipu);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6SE_P,
+quirk_iommu_ipu);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6,
+quirk_iommu_ipu);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6SE,
+quirk_iommu_ipu);


This duplicates the IS_IPU_DEVICE() list above. Please keep a single
device list.
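
One way to keep a single list (just a sketch, untested; it reuses the
PCI_DEVICE_ID_INTEL_IPU* IDs this patch adds):

static const struct pci_device_id ipu_ids[] = {
	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU3) },
	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6) },
	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6SE) },
	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6SE_P) },
	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IPU6EP) },
	{ }
};

/* would replace the open-coded IS_IPU_DEVICE() macro */
static bool is_ipu_device(struct pci_dev *pdev)
{
	return pci_match_id(ipu_ids, pdev) != NULL;
}

The DECLARE_PCI_FIXUP_HEADER() entries would still need one line per device
ID, since that table is built at compile time; the shared array only removes
the duplication on the IS_IPU_DEVICE() side.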


+
  static void quirk_iommu_rwbf(struct pci_dev *dev)
  {
if (risky_device(dev))
diff --git a/drivers/staging/media/ipu3/ipu3.c 
b/drivers/staging/media/ipu3/ipu3.c
index ee1bba6bdcac..aee1130ac042 100644
--- a/drivers/staging/media/ipu3/ipu3.c
+++ b/drivers/staging/media/ipu3/ipu3.c
@@ -16,7 +16,7 @@
  #include "ipu3-dmamap.h"
  #include "ipu3-mmu.h"
  
-#define IMGU_PCI_ID			0x1919


Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework

2021-04-18 Thread Lu Baolu

Hi Keqian,

On 4/16/21 5:07 PM, Keqian Zhu wrote:

I am worried about having two sets of APIs for a single purpose. From
the vendor iommu driver's point of view, this feature is per device. Hence,
it still needs to do the same thing.

Yes, we can unify the granularity of feature reporting and status management.

The basic granularity of dirty tracking is the iommu_domain, which I think is very
reasonable. We need an interface to report the feature per iommu_domain; then the
logic is much clearer.

Every time we add a device to or remove a device from the domain, we should update
the feature (e.g., maintain a counter of unsupported devices).
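
A minimal sketch of that bookkeeping (field and helper names are made up for
illustration only):

/* in struct iommu_domain */
	unsigned int	non_hwdbm_devs;	/* attached devices without HWDBM */

/* called from the attach/detach paths */
static void domain_update_hwdbm(struct iommu_domain *domain,
				struct device *dev, bool attach)
{
	if (device_supports_hwdbm(dev))	/* hypothetical per-device check */
		return;

	if (attach)
		domain->non_hwdbm_devs++;
	else
		domain->non_hwdbm_devs--;
}

/* dirty log can only be started while every attached device supports HWDBM */
static bool domain_supports_hwdbm(struct iommu_domain *domain)
{
	return domain->non_hwdbm_devs == 0;
}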


Yes. This looks cleaner.



What do you think about this idea?

Thanks,
Keqian


Best regards,
baolu


Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework

2021-04-15 Thread Lu Baolu

Hi,

On 2021/4/15 15:43, Keqian Zhu wrote:

design it as not switchable. I will modify the commit message of patch#12, 
thanks!

I am not sure that I fully get your point. But I can't see any gaps in
using iommu_dev_enable/disable_feature() to switch dirty log on and off.
Probably I missed something.

IOMMU_DEV_FEAT_HWDBM just tells the user whether the underlying IOMMU driver supports
dirty tracking; it is not used to manage the status of dirty log tracking.

The feature reporting is per device, but the status management is per iommu_domain.
Only when all devices in a domain support HWDBM can we start dirty log tracking for
the domain.


So why not

for_each_device_attached_domain()
iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM)

?


And I think we'd better not mix the feature reporting and status management. 
Thoughts?



I am worried about having two sets of APIs for a single purpose. From
the vendor iommu driver's point of view, this feature is per device. Hence,
it still needs to do the same thing.

Best regards,
baolu


Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework

2021-04-15 Thread Lu Baolu

On 4/15/21 2:18 PM, Keqian Zhu wrote:

Hi Baolu,

Thanks for the review!

On 2021/4/14 15:00, Lu Baolu wrote:

Hi Keqian,

On 4/13/21 4:54 PM, Keqian Zhu wrote:

Some types of IOMMU are capable of tracking DMA dirty log, such as
ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the
dirty log tracking framework in the IOMMU base layer.

Three new essential interfaces are added, and we maintain the status
of dirty log tracking in iommu_domain.
1. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking
2. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
3. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap

A new dev feature is added to indicate whether a specific type of
iommu hardware supports these and whether its driver implements them.

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
   drivers/iommu/iommu.c | 150 ++
   include/linux/iommu.h |  53 +++
   2 files changed, 203 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..667b2d6d2fc0 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1922,6 +1922,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct 
bus_type *bus,
   domain->type = type;
   /* Assume all sizes by default; the driver may override this later */
   domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
+mutex_init(&domain->switch_log_lock);
 return domain;
   }
@@ -2720,6 +2721,155 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
   }
   EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
   +int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,
+   unsigned long iova, size_t size, int prot)
+{
+const struct iommu_ops *ops = domain->ops;
+int ret;
+
+if (unlikely(!ops || !ops->switch_dirty_log))
+return -ENODEV;
+
+mutex_lock(&domain->switch_log_lock);
+if (enable && domain->dirty_log_tracking) {
+ret = -EBUSY;
+goto out;
+} else if (!enable && !domain->dirty_log_tracking) {
+ret = -EINVAL;
+goto out;
+}
+
+ret = ops->switch_dirty_log(domain, enable, iova, size, prot);
+if (ret)
+goto out;
+
+domain->dirty_log_tracking = enable;
+out:
+mutex_unlock(&domain->switch_log_lock);
+return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_switch_dirty_log);


Since you also added IOMMU_DEV_FEAT_HWDBM, I am wondering what's the
difference between

iommu_switch_dirty_log(on) vs. iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM)

iommu_switch_dirty_log(off) vs. iommu_dev_disable_feature(IOMMU_DEV_FEAT_HWDBM)

Indeed. As far as I can see, IOMMU_DEV_FEAT_AUX is not switchable, so enable/disable
are not applicable to it. IOMMU_DEV_FEAT_SVA is switchable, so we can use these
interfaces for it.

IOMMU_DEV_FEAT_HWDBM is used to indicate whether hardware supports HWDBM, so we 
should


Previously we had iommu_dev_has_feature(), and it was then cleaned up due to
lack of real users. If you have a real use case for it, why not bring it
back?


design it as not switchable. I will modify the commit message of patch#12, 
thanks!


I am not sure that I fully get your point. But I can't see any gaps in
using iommu_dev_enable/disable_feature() to switch dirty log on and off.
Probably I missed something.






+
+int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
+ size_t size, unsigned long *bitmap,
+ unsigned long base_iova, unsigned long bitmap_pgshift)
+{
+const struct iommu_ops *ops = domain->ops;
+unsigned int min_pagesz;
+size_t pgsize;
+int ret = 0;
+
+if (unlikely(!ops || !ops->sync_dirty_log))
+return -ENODEV;
+
+min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+if (!IS_ALIGNED(iova | size, min_pagesz)) {
+pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
+   iova, size, min_pagesz);
+return -EINVAL;
+}
+
+mutex_lock(&domain->switch_log_lock);
+if (!domain->dirty_log_tracking) {
+ret = -EINVAL;
+goto out;
+}
+
+while (size) {
+pgsize = iommu_pgsize(domain, iova, size);
+
+ret = ops->sync_dirty_log(domain, iova, pgsize,
+  bitmap, base_iova, bitmap_pgshift);


Any reason why you want to do this in a per-4K-page manner? This can
lead to a lot of indirect calls and bad performance.

How about a sync_dirty_pages()?

The function name of iommu_pgsize() is a bit puzzling. Actually it tries to
compute the max page size that fits into size, so pgsize can be a large page size
even if the underlying mapping is 4K. __iommu_unmap() also has similar logic.
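
Roughly, a simplified sketch of what the helper computes (not the exact
upstream code):

/*
 * Pick the largest page size from domain->pgsize_bitmap that both fits
 * in @size and matches the alignment of @iova.
 */
static size_t iommu_pgsize_sketch(struct iommu_domain *domain,
				  unsigned long iova, size_t size)
{
	unsigned int fit = __fls(size);		/* largest order that fits @size */
	unsigned long candidates;

	if (iova)				/* honour the IOVA alignment */
		fit = min_t(unsigned int, fit, __ffs(iova));

	candidates = domain->pgsize_bitmap & GENMASK(fit, 0);

	return 1UL << __fls(candidates);	/* biggest remaining candidate */
}

So a caller looping on iommu_pgsize() naturally walks the range in the largest
supported steps, not necessarily 4K.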


This series has some improvements to the iommu_pgsize() helper.

https://lore.kernel.org/linux-iommu/20210405191112.28192-1-isa...@codeaurora.org/

Best regards,
baolu


Re: [PATCH v2] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-14 Thread Lu Baolu

Hi Longpeng,

On 4/15/21 8:46 AM, Longpeng(Mike) wrote:

The translation caches may preserve obsolete data when the
mapping size is changed. Consider the following sequence, which
can reveal the problem with high probability.

1.mmap(4GB,MAP_HUGETLB)
2.
   while (1) {
(a)DMA MAP   0,0xa
(b)DMA UNMAP 0,0xa
(c)DMA MAP   0,0xc000
  * DMA read of IOVA 0 may fail here (Not present)
  * if the problem occurs.
(d)DMA UNMAP 0,0xc000
   }

The page table(only focus on IOVA 0) after (a) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x1a30a72003  entry:0x89b39cacb000
 PTE: 0x21d200803  entry:0x89b3b0a72000

The page table after (b) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x1a30a72003  entry:0x89b39cacb000
 PTE: 0x0  entry:0x89b3b0a72000

The page table after (c) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x21d200883   entry:0x89b39cacb000 (*)

Because the PDE entry after (b) is present, it won't be
flushed even if the iommu driver flushes the cache on unmap,
so the obsolete data may be preserved in the cache, which
would cause a wrong translation in the end.

However, we can see the PDE entry finally switches to
a 2M superpage mapping, but it does not change
to 0x21d200883 directly:

1. PDE: 0x1a30a72003
2. __domain_mapping
  dma_pte_free_pagetable
Set the PDE entry to ZERO
  Set the PDE entry to 0x21d200883

So we must flush the cache after the entry switches to ZERO
to avoid the obsolete info being preserved.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Nadav Amit 
Cc: Alex Williamson 
Cc: Joerg Roedel 
Cc: Kevin Tian 
Cc: Gonglei (Arei) 

Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating 
superpage")
Cc:  # v3.0+
Link: 
https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/
Suggested-by: Lu Baolu 
Signed-off-by: Longpeng(Mike) 
---
v1 -> v2:
   - add Joerg
   - reconstruct the solution base on the Baolu's suggestion
---
  drivers/iommu/intel/iommu.c | 52 +
  1 file changed, 38 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..881c9f2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2289,6 +2289,41 @@ static inline int hardware_largepage_caps(struct 
dmar_domain *domain,
return level;
  }
  
+/*

+ * Ensure that old small page tables are removed to make room for superpage(s).
+ * We're going to add new large pages, so make sure we don't remove their 
parent
+ * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared.
+ */
+static void switch_to_super_page(struct dmar_domain *domain,
+unsigned long start_pfn,
+unsigned long end_pfn, int level)
+{
+   unsigned long lvl_pages = lvl_to_nr_pages(level);
+   struct dma_pte *pte = NULL;
+   int i;
+
+   while (start_pfn <= end_pfn) {
+   if (!pte)
+   pte = pfn_to_dma_pte(domain, start_pfn, &level);
+
+   if (dma_pte_present(pte)) {
+   dma_pte_free_pagetable(domain, start_pfn,
+  start_pfn + lvl_pages - 1,
+  level + 1);
+
+   for_each_domain_iommu(i, domain)
+   iommu_flush_iotlb_psi(g_iommus[i], domain,
+ start_pfn, lvl_pages,
+ 0, 0);
+   }
+
+   pte++;
+   start_pfn += lvl_pages;
+   if (first_pte_in_page(pte))
+   pte = NULL;
+   }
+}
+
  static int
  __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 unsigned long phys_pfn, unsigned long nr_pages, int prot)
@@ -2329,22 +2364,11 @@ static inline int hardware_largepage_caps(struct 
dmar_domain *domain,
return -ENOMEM;
/* It is large page*/
if (largepage_lvl > 1) {
-   unsigned long nr_superpages, end_pfn;
+   unsigned long end_pfn;
  
  pteval |= DMA_PTE_LARGE_PAGE;

-   lvl_pages = lvl_to_nr_pages(largepage_lvl);
-
-   nr_superpages = nr_pages / lvl_pages;
-   end_pfn = iov_pfn + nr_superpages * lvl_pages - 
1;
-
-   /*
-* Ensure that old small page tables are
-* removed to make r

Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API

2021-04-14 Thread Lu Baolu

Hi Jason,

On 4/14/21 7:26 PM, Jason Gunthorpe wrote:

On Wed, Apr 14, 2021 at 02:22:09PM +0800, Lu Baolu wrote:


I still worry about supervisor pasid allocation.

If we use iommu_sva_alloc_pasid() to allocate a supervisor pasid, to which
mm should the pasid be set? I once thought about passing &init_mm to
iommu_sva_alloc_pasid(). But if you add "mm != current->mm", this seems
not to work. Or do you prefer a separate interface for supervisor pasid
allocation/free?


Without a mm_struct it is not SVA, so don't use SVA APIs for whatever
a 'supervisor pasid' is


The supervisor PASID has its mm_struct. The only difference is that the
device will set priv=1 in its DMA transactions with the PASID.

Best regards,
baolu


Re: [PATCH v3 02/12] iommu: Add iommu_split_block interface

2021-04-14 Thread Lu Baolu

On 4/13/21 4:54 PM, Keqian Zhu wrote:

Block (large page) mapping is not a proper granularity for dirty log tracking.
Take an extreme example: if DMA writes one byte, under a 1G mapping the
dirty amount reported is 1G, but under a 4K mapping the dirty amount is
just 4K.

This adds a new interface named iommu_split_block to the IOMMU base layer.
A specific IOMMU driver can invoke it when starting dirty log tracking. If so,
the driver also needs to implement the split_block iommu ops.

We flush all iotlbs after the whole procedure is completed to ease the
pressure on the IOMMU, as we will handle a huge range of mappings in general.

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/iommu.c | 41 +
  include/linux/iommu.h | 11 +++
  2 files changed, 52 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 667b2d6d2fc0..bb413a927870 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2721,6 +2721,47 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
  }
  EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
  
+int iommu_split_block(struct iommu_domain *domain, unsigned long iova,

+ size_t size)
+{
+   const struct iommu_ops *ops = domain->ops;
+   unsigned int min_pagesz;
+   size_t pgsize;
+   bool flush = false;
+   int ret = 0;
+
+   if (unlikely(!ops || !ops->split_block))
+   return -ENODEV;
+
+   min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+   if (!IS_ALIGNED(iova | size, min_pagesz)) {
+   pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
+  iova, size, min_pagesz);
+   return -EINVAL;
+   }
+
+   while (size) {
+   flush = true;
+
+   pgsize = iommu_pgsize(domain, iova, size);
+
+   ret = ops->split_block(domain, iova, pgsize);
+   if (ret)
+   break;
+
+   pr_debug("split handled: iova 0x%lx size 0x%zx\n", iova, 
pgsize);
+
+   iova += pgsize;
+   size -= pgsize;
+   }
+
+   if (flush)
+   iommu_flush_iotlb_all(domain);
+
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_split_block);


Do you really have any consumers of this interface other than the dirty
bit tracking? If not, I don't suggest making this a generic IOMMU
interface.

There is an implicit requirement for such interfaces. The
iommu_map/unmap(iova, size) shouldn't be called at the same time.
Currently there's no such sanity check in the iommu core. A poorly
written driver could mess up the kernel by misusing this interface.

This also applies to iommu_merge_page().

Best regards,
baolu


Re: [PATCH v3 01/12] iommu: Introduce dirty log tracking framework

2021-04-14 Thread Lu Baolu

Hi Keqian,

On 4/13/21 4:54 PM, Keqian Zhu wrote:

Some types of IOMMU are capable of tracking DMA dirty log, such as
ARM SMMU with HTTU or Intel IOMMU with SLADE. This introduces the
dirty log tracking framework in the IOMMU base layer.

Three new essential interfaces are added, and we maintain the status
of dirty log tracking in iommu_domain.
1. iommu_switch_dirty_log: Perform actions to start|stop dirty log tracking
2. iommu_sync_dirty_log: Sync dirty log from IOMMU into a dirty bitmap
3. iommu_clear_dirty_log: Clear dirty log of IOMMU by a mask bitmap

A new dev feature is added to indicate whether a specific type of
iommu hardware supports these and whether its driver implements them.

Signed-off-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/iommu.c | 150 ++
  include/linux/iommu.h |  53 +++
  2 files changed, 203 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..667b2d6d2fc0 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1922,6 +1922,7 @@ static struct iommu_domain *__iommu_domain_alloc(struct 
bus_type *bus,
domain->type = type;
/* Assume all sizes by default; the driver may override this later */
domain->pgsize_bitmap  = bus->iommu_ops->pgsize_bitmap;
+   mutex_init(&domain->switch_log_lock);
  
  	return domain;

  }
@@ -2720,6 +2721,155 @@ int iommu_domain_set_attr(struct iommu_domain *domain,
  }
  EXPORT_SYMBOL_GPL(iommu_domain_set_attr);
  
+int iommu_switch_dirty_log(struct iommu_domain *domain, bool enable,

+  unsigned long iova, size_t size, int prot)
+{
+   const struct iommu_ops *ops = domain->ops;
+   int ret;
+
+   if (unlikely(!ops || !ops->switch_dirty_log))
+   return -ENODEV;
+
+   mutex_lock(&domain->switch_log_lock);
+   if (enable && domain->dirty_log_tracking) {
+   ret = -EBUSY;
+   goto out;
+   } else if (!enable && !domain->dirty_log_tracking) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = ops->switch_dirty_log(domain, enable, iova, size, prot);
+   if (ret)
+   goto out;
+
+   domain->dirty_log_tracking = enable;
+out:
+   mutex_unlock(&domain->switch_log_lock);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_switch_dirty_log);


Since you also added IOMMU_DEV_FEAT_HWDBM, I am wondering what's the
difference between

iommu_switch_dirty_log(on) vs. 
iommu_dev_enable_feature(IOMMU_DEV_FEAT_HWDBM)


iommu_switch_dirty_log(off) vs. 
iommu_dev_disable_feature(IOMMU_DEV_FEAT_HWDBM)



+
+int iommu_sync_dirty_log(struct iommu_domain *domain, unsigned long iova,
+size_t size, unsigned long *bitmap,
+unsigned long base_iova, unsigned long bitmap_pgshift)
+{
+   const struct iommu_ops *ops = domain->ops;
+   unsigned int min_pagesz;
+   size_t pgsize;
+   int ret = 0;
+
+   if (unlikely(!ops || !ops->sync_dirty_log))
+   return -ENODEV;
+
+   min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+   if (!IS_ALIGNED(iova | size, min_pagesz)) {
+   pr_err("unaligned: iova 0x%lx size 0x%zx min_pagesz 0x%x\n",
+  iova, size, min_pagesz);
+   return -EINVAL;
+   }
+
+   mutex_lock(&domain->switch_log_lock);
+   if (!domain->dirty_log_tracking) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   while (size) {
+   pgsize = iommu_pgsize(domain, iova, size);
+
+   ret = ops->sync_dirty_log(domain, iova, pgsize,
+ bitmap, base_iova, bitmap_pgshift);


Any reason why you want to do this in a per-4K-page manner? This can
lead to a lot of indirect calls and bad performance.

How about a sync_dirty_pages()?

The same comment applies to other places in this patch series.
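
For reference, one possible shape of such a batched callback (a sketch only;
the exact signature is made up):

	/*
	 * One indirect call per (iova, size) range; the driver walks its
	 * own page table and fills the bitmap, whatever page sizes are
	 * actually in use.
	 */
	int (*sync_dirty_pages)(struct iommu_domain *domain,
				unsigned long iova, size_t size,
				unsigned long *bitmap,
				unsigned long base_iova,
				unsigned long bitmap_pgshift);

The core-side loop above would then collapse to a single call:

	ret = ops->sync_dirty_pages(domain, iova, size,
				    bitmap, base_iova, bitmap_pgshift);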


+   if (ret)
+   break;
+
+   pr_debug("dirty_log_sync handle: iova 0x%lx pagesz 0x%zx\n",
+iova, pgsize);
+
+   iova += pgsize;
+   size -= pgsize;
+   }
+out:
+   mutex_unlock(&domain->switch_log_lock);
+   return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_sync_dirty_log);
+
+static int __iommu_clear_dirty_log(struct iommu_domain *domain,
+  unsigned long iova, size_t size,
+  unsigned long *bitmap,
+  unsigned long base_iova,
+  unsigned long bitmap_pgshift)
+{
+   const struct iommu_ops *ops = domain->ops;
+   size_t pgsize;
+   int ret = 0;
+
+   if (unlikely(!ops || !ops->clear_dirty_log))
+   return -ENODEV;
+
+   while (size) {
+   pgsize = iommu_pgsize(domain, iova, size);
+
+   ret = ops->clear_dirty_log(domain, iova, pgsize, bitmap,
+   

Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API

2021-04-14 Thread Lu Baolu

Hi Jacob,

On 4/14/21 8:09 AM, Jacob Pan wrote:

Hi Jean,

On Fri, 9 Apr 2021 11:03:05 -0700, Jacob Pan
 wrote:


problems:

* We don't have a use-case for binding the mm of a remote process (and
   it's supposedly difficult for device drivers to do it securely). So
OK, we remove the mm argument from iommu_sva_bind_device() and use the
   current mm. But the IOMMU driver isn't going to do
get_task_mm(current) every time it needs the mm being bound, it will
take it from iommu_sva_bind_device(). Likewise iommu_sva_alloc_pasid()
shouldn't need to bother with get_task_mm().

* cgroup accounting for IOASIDs needs to be on the current task.
Removing the mm parameter from iommu_sva_alloc_pasid() doesn't help
with that. Sure it indicates that iommu_sva_alloc_pasid() needs a
specific task context but that's only for cgroup purpose, and I'd
rather pass the cgroup down from iommu_sva_bind_device() anyway (but am
fine with keeping it within ioasid_alloc() for now). Plus it's an
internal helper, easy for us to check that the callers are doing the
right thing.

With the above split, we really just have one allocation function:
ioasid_alloc(), so it can manage current cgroup accounting within. Would
this work?

After a few attempts, I don't think the split can work better. I will
restore the mm parameter and add a warning if mm != current->mm.


I still worry about supervisor pasid allocation.

If we use iommu_sva_alloc_pasid() to allocate a supervisor pasid, to which
mm should the pasid be set? I once thought about passing &init_mm to
iommu_sva_alloc_pasid(). But if you add "mm != current->mm", this seems
not to work. Or do you prefer a separate interface for supervisor pasid
allocation/free?

Best regards,
baolu



Thanks,

Jacob



Re: [PATCH][next] iommu/vt-d: Fix out-bounds-warning in intel_svm_page_response()

2021-04-13 Thread Lu Baolu

Hi Gustavo,

On 4/14/21 3:54 AM, Gustavo A. R. Silva wrote:

Replace call to memcpy() with just a couple of simple assignments in
order to fix the following out-of-bounds warning:

drivers/iommu/intel/svm.c:1198:4: warning: 'memcpy' offset [25, 32] from the 
object at 'desc' is out of the bounds of referenced subobject 'qw2' with type 
'long long unsigned int' at offset 16 [-Warray-bounds]

The problem is that the original code is trying to copy data into a
couple of struct members adjacent to each other in a single call to
memcpy(). This causes a legitimate compiler warning because memcpy()
overruns the length of &desc.qw2.

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

Link: https://github.com/KSPP/linux/issues/109
Signed-off-by: Gustavo A. R. Silva 
---
  drivers/iommu/intel/svm.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 5165cea90421..65909f504c50 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1194,9 +1194,10 @@ int intel_svm_page_response(struct device *dev,
desc.qw1 = QI_PGRP_IDX(prm->grpid) | QI_PGRP_LPIG(last_page);
desc.qw2 = 0;
desc.qw3 = 0;
-   if (private_present)
-   memcpy(&desc.qw2, prm->private_data,
-  sizeof(prm->private_data));


The same memcpy() is used in multiple places in this file. Did they
generate the same warnings? Or are there multiple patches to fix them
one by one?

Best regards,
baolu


+   if (private_present) {
+   desc.qw2 = prm->private_data[0];
+   desc.qw3 = prm->private_data[1];
+   }
  
		qi_submit_sync(iommu, &desc, 1, 0);

}



Re: [PATCH 5.4 v2 1/1] iommu/vt-d: Fix agaw for a supported 48 bit guest address width

2021-04-11 Thread Lu Baolu

I guess you need to ask Greg KH about this,
Cc-ing sta...@vger.kernel.org.

Best regards,
baolu

On 2021/4/12 3:36, Saeed Mirzamohammadi wrote:

Hi Lu,

Thanks for the review. May I know when we expect this to be applied
to 5.4?


Thanks,
Saeed

On Apr 7, 2021, at 5:25 PM, Lu Baolu <baolu...@linux.intel.com> wrote:


On 4/8/21 2:40 AM, Saeed Mirzamohammadi wrote:

The IOMMU driver calculates the guest addressability for a DMA request
based on the value of the mgaw reported from the IOMMU. However, this
is a fused value and as mentioned in the spec, the guest width
should be calculated based on the minimum of supported adjusted guest
address width (SAGAW) and MGAW.
This is from specification:
"Guest addressability for a given DMA request is limited to the
minimum of the value reported through this field and the adjusted
guest address width of the corresponding page-table structure.
(Adjusted guest address widths supported by hardware are reported
through the SAGAW field)."
This causes domain initialization to fail and the following
errors appear for the EHCI PCI driver:
[    2.486393] ehci-pci :01:00.4: EHCI Host Controller
[    2.486624] ehci-pci :01:00.4: new USB bus registered, 
assigned bus

number 1
[    2.489127] ehci-pci :01:00.4: DMAR: Allocating domain failed
[    2.489350] ehci-pci :01:00.4: DMAR: 32bit DMA uses non-identity
mapping
[    2.489359] ehci-pci :01:00.4: can't setup: -12
[    2.489531] ehci-pci :01:00.4: USB bus 1 deregistered
[    2.490023] ehci-pci :01:00.4: init :01:00.4 fail, -12
[    2.490358] ehci-pci: probe of :01:00.4 failed with error -12
This issue happens when the value of the sagaw corresponds to a
48-bit agaw. This fix updates the calculation of the agaw based on
the minimum of IOMMU's sagaw value and MGAW.
Signed-off-by: Saeed Mirzamohammadi <saeed.mirzamohamm...@oracle.com>

Tested-by: Camille Lu <camille...@hpe.com>
---
Change in v2:
- Added cap_width to calculate AGAW based on the minimum value of 
MGAW and AGAW.

---
 drivers/iommu/intel-iommu.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 953d86ca6d2b..a2a03df97704 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1853,7 +1853,7 @@ static inline int guestwidth_to_adjustwidth(int 
gaw)
 static int domain_init(struct dmar_domain *domain, struct 
intel_iommu *iommu,

  int guest_width)
 {
-int adjust_width, agaw;
+int adjust_width, agaw, cap_width;
unsigned long sagaw;
int err;
 @@ -1867,8 +1867,9 @@ static int domain_init(struct dmar_domain 
*domain, struct intel_iommu *iommu,

domain_reserve_special_ranges(domain);
/* calculate AGAW */
-if (guest_width > cap_mgaw(iommu->cap))
-guest_width = cap_mgaw(iommu->cap);
+cap_width = min_t(int, cap_mgaw(iommu->cap), 
agaw_to_width(iommu->agaw));

+if (guest_width > cap_width)
+guest_width = cap_width;
domain->gaw = guest_width;
adjust_width = guestwidth_to_adjustwidth(guest_width);
agaw = width_to_agaw(adjust_width);


Reviewed-by: Lu Baolu <baolu...@linux.intel.com>


Best regards,
baolu




Re: [PATCH] iommu/vt-d: Fix an error handling path in 'intel_prepare_irq_remapping()'

2021-04-11 Thread Lu Baolu

On 4/11/21 3:08 PM, Christophe JAILLET wrote:

If 'intel_cap_audit()' fails, we should return directly, as already done in
the surrounding error handling path.

Fixes: ad3d19029979 ("iommu/vt-d: Audit IOMMU Capabilities and add helper 
functions")
Signed-off-by: Christophe JAILLET 
---
This patch is completely speculative.
It is based on a spurious mix-up of direct return and goto.
---
  drivers/iommu/intel/irq_remapping.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c 
b/drivers/iommu/intel/irq_remapping.c
index 75429a5373ec..f912fe45bea2 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -736,7 +736,7 @@ static int __init intel_prepare_irq_remapping(void)
return -ENODEV;
  
  	if (intel_cap_audit(CAP_AUDIT_STATIC_IRQR, NULL))

-   goto error;
+   return -ENODEV;
  
  	if (!dmar_ir_support())

return -ENODEV;



Thanks!

Acked-by: Lu Baolu 

Best regards,
baolu


[PATCH 1/1] iommu/vt-d: Fix build error of pasid_enable_wpe() with !X86

2021-04-11 Thread Lu Baolu
Commit f68c7f539b6e9 ("iommu/vt-d: Enable write protect for supervisor
SVM") added pasid_enable_wpe(), which hits the compile error below with !X86.

../drivers/iommu/intel/pasid.c: In function 'pasid_enable_wpe':
../drivers/iommu/intel/pasid.c:554:22: error: implicit declaration of function 
'read_cr0' [-Werror=implicit-function-declaration]
  554 |  unsigned long cr0 = read_cr0();
  |  ^~~~
In file included from ../include/linux/build_bug.h:5,
 from ../include/linux/bits.h:22,
 from ../include/linux/bitops.h:6,
 from ../drivers/iommu/intel/pasid.c:12:
../drivers/iommu/intel/pasid.c:557:23: error: 'X86_CR0_WP' undeclared (first 
use in this function)
  557 |  if (unlikely(!(cr0 & X86_CR0_WP))) {
  |   ^~
../include/linux/compiler.h:78:42: note: in definition of macro 'unlikely'
   78 | # define unlikely(x) __builtin_expect(!!(x), 0)
  |  ^
../drivers/iommu/intel/pasid.c:557:23: note: each undeclared identifier is 
reported only once for each function it appears in
  557 |  if (unlikely(!(cr0 & X86_CR0_WP))) {
  |   ^~
../include/linux/compiler.h:78:42: note: in definition of macro 'unlikely'
   78 | # define unlikely(x) __builtin_expect(!!(x), 0)
  |

Add the missing dependency.

Cc: Sanjay Kumar 
Cc: Jacob Pan 
Cc: Randy Dunlap 
Reported-by: kernel test robot 
Reported-by: Randy Dunlap 
Fixes: f68c7f539b6e9 ("iommu/vt-d: Enable write protect for supervisor SVM")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/pasid.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 477b2e1d303c..72646bafc52f 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -551,6 +551,7 @@ static void pasid_flush_caches(struct intel_iommu *iommu,
 
 static inline int pasid_enable_wpe(struct pasid_entry *pte)
 {
+#ifdef CONFIG_X86
unsigned long cr0 = read_cr0();
 
/* CR0.WP is normally set but just to be sure */
@@ -558,6 +559,7 @@ static inline int pasid_enable_wpe(struct pasid_entry *pte)
pr_err_ratelimited("No CPU write protect!\n");
return -EINVAL;
}
+#endif
pasid_set_wpe(pte);
 
return 0;
-- 
2.25.1



Re: [PATCH 2/2] iommu/sva: Remove mm parameter from SVA bind API

2021-04-09 Thread Lu Baolu

Hi,

On 2021/4/9 1:08, Jacob Pan wrote:

  /**
   * iommu_sva_alloc_pasid - Allocate a PASID for the mm
- * @mm: the mm
   * @min: minimum PASID value (inclusive)
   * @max: maximum PASID value (inclusive)
   *
- * Try to allocate a PASID for this mm, or take a reference to the existing one
- * provided it fits within the [@min, @max] range. On success the PASID is
- * available in mm->pasid, and must be released with iommu_sva_free_pasid().
+ * Try to allocate a PASID for the current mm, or take a reference to the
+ * existing one provided it fits within the [@min, @max] range. On success
+ * the PASID is available in the current mm->pasid, and must be released with
+ * iommu_sva_free_pasid().
   * @min must be greater than 0, because 0 indicates an unused mm->pasid.
   *
   * Returns 0 on success and < 0 on error.
   */
-int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max)
+int iommu_sva_alloc_pasid(ioasid_t min, ioasid_t max)
  {
int ret = 0;
ioasid_t pasid;
+   struct mm_struct *mm;
  
  	if (min == INVALID_IOASID || max == INVALID_IOASID ||

min == 0 || max < min)
return -EINVAL;
  
	mutex_lock(&iommu_sva_lock);

+   mm = get_task_mm(current);


How could we allocate a supervisor PASID through iommu_sva_alloc_pasid()
if we always use current->mm here?


+   if (!mm) {
+   ret = -EINVAL;
+   goto out_unlock;
+   }
if (mm->pasid) {
if (mm->pasid >= min && mm->pasid <= max)
ioasid_get(mm->pasid);
@@ -45,22 +51,32 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t 
min, ioasid_t max)
else
mm->pasid = pasid;
}
+   mmput(mm);
+out_unlock:
	mutex_unlock(&iommu_sva_lock);
return ret;
  }
  EXPORT_SYMBOL_GPL(iommu_sva_alloc_pasid);


Best regards,
baolu


Re: [PATCH 1/2] iommu/sva: Tighten SVA bind API with explicit flags

2021-04-09 Thread Lu Baolu

Hi Jacob,

On 2021/4/9 1:08, Jacob Pan wrote:

The void* drvdata parameter isn't really used in the iommu_sva_bind_device()
API; the current IDXD code "borrows" the drvdata for a VT-d private flag
for supervisor SVA usage.

Supervisor/Privileged mode request is a generic feature. It should be
promoted from the VT-d vendor driver to the generic code.

This patch replaces void* drvdata with an unsigned int flags parameter
and adjusts callers accordingly.

Link: https://lore.kernel.org/linux-iommu/YFhiMLR35WWMW%2FHu@myrica/
Suggested-by: Jean-Philippe Brucker 
Signed-off-by: Jacob Pan 
---
  drivers/dma/idxd/cdev.c |  2 +-
  drivers/dma/idxd/init.c |  6 +++---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |  2 +-
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  4 ++--
  drivers/iommu/intel/Kconfig |  1 +
  drivers/iommu/intel/svm.c   | 18 ++
  drivers/iommu/iommu.c   |  9 ++---
  drivers/misc/uacce/uacce.c  |  2 +-
  include/linux/intel-iommu.h |  2 +-
  include/linux/intel-svm.h   | 17 ++---
  include/linux/iommu.h   | 19 ---
  11 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index 0db9b82..21ec82b 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -103,7 +103,7 @@ static int idxd_cdev_open(struct inode *inode, struct file 
*filp)
filp->private_data = ctx;
  
  	if (device_pasid_enabled(idxd)) {

-   sva = iommu_sva_bind_device(dev, current->mm, NULL);
+   sva = iommu_sva_bind_device(dev, current->mm, 0);
if (IS_ERR(sva)) {
rc = PTR_ERR(sva);
dev_err(dev, "pasid allocation failed: %d\n", rc);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 085a0c3..cdc85f1 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -300,13 +300,13 @@ static struct idxd_device *idxd_alloc(struct pci_dev 
*pdev)
  
  static int idxd_enable_system_pasid(struct idxd_device *idxd)

  {
-   int flags;
+   unsigned int flags;
unsigned int pasid;
struct iommu_sva *sva;
  
-	flags = SVM_FLAG_SUPERVISOR_MODE;

+   flags = IOMMU_SVA_BIND_SUPERVISOR;


With SVM_FLAG_SUPERVISOR_MODE removed, I guess

#include <linux/intel-svm.h>

in this file could be removed as well.

  
-	sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, &flags);

+   sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, flags);
if (IS_ERR(sva)) {
		dev_warn(&idxd->pdev->dev,
 "iommu sva bind failed: %ld\n", PTR_ERR(sva));
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index bb251ca..23e287e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -354,7 +354,7 @@ __arm_smmu_sva_bind(struct device *dev, struct mm_struct 
*mm)
  }
  
  struct iommu_sva *

-arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
+arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, unsigned int flags)
  {
struct iommu_sva *handle;
struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index f985817..b971d4d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -711,7 +711,7 @@ bool arm_smmu_master_sva_enabled(struct arm_smmu_master 
*master);
  int arm_smmu_master_enable_sva(struct arm_smmu_master *master);
  int arm_smmu_master_disable_sva(struct arm_smmu_master *master);
  struct iommu_sva *arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm,
-   void *drvdata);
+   unsigned int flags);
  void arm_smmu_sva_unbind(struct iommu_sva *handle);
  u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle);
  void arm_smmu_sva_notifier_synchronize(void);
@@ -742,7 +742,7 @@ static inline int arm_smmu_master_disable_sva(struct 
arm_smmu_master *master)
  }
  
  static inline struct iommu_sva *

-arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
+arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, unsigned int flags)
  {
return ERR_PTR(-ENODEV);
  }
diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 28a3d15..5415052 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -41,6 +41,7 @@ config INTEL_IOMMU_SVM
select PCI_PRI
select MMU_NOTIFIER
select IOASID
+   select IOMMU_SVA_LIB


Why?


help
  Shared Virtual Memory (SVM) provides a facility for devices
  to access DMA 

Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-08 Thread Lu Baolu

Hi Longpeng,

On 4/8/21 3:37 PM, Longpeng (Mike, Cloud Infrastructure Service Product 
Dept.) wrote:

Hi Baolu,


-Original Message-
From: Lu Baolu [mailto:baolu...@linux.intel.com]
Sent: Thursday, April 8, 2021 12:32 PM
To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
; io...@lists.linux-foundation.org;
linux-kernel@vger.kernel.org
Cc: baolu...@linux.intel.com; David Woodhouse ; Nadav
Amit ; Alex Williamson ;
Kevin Tian ; Gonglei (Arei) ;
sta...@vger.kernel.org
Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

Hi Longpeng,

On 4/7/21 2:35 PM, Longpeng (Mike, Cloud Infrastructure Service Product
Dept.) wrote:

Hi Baolu,


-Original Message-
From: Lu Baolu [mailto:baolu...@linux.intel.com]
Sent: Friday, April 2, 2021 12:44 PM
To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
; io...@lists.linux-foundation.org;
linux-kernel@vger.kernel.org
Cc: baolu...@linux.intel.com; David Woodhouse ;
Nadav Amit ; Alex Williamson
; Kevin Tian ;
Gonglei (Arei) ; sta...@vger.kernel.org
Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating
superpage

Hi Longpeng,

On 4/1/21 3:18 PM, Longpeng(Mike) wrote:

diff --git a/drivers/iommu/intel/iommu.c
b/drivers/iommu/intel/iommu.c index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int
hardware_largepage_caps(struct

dmar_domain *domain,

 * removed to make room for superpage(s).
 * We're adding new large pages, so make sure
 * we don't remove their parent tables.
+*
+*We also need to flush the iotlb before creating
+* superpage to ensure it does not preserve any
+* obsolete info.
 */
-   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-  largepage_lvl + 1);
+   if (dma_pte_present(pte)) {


The dma_pte_free_pagetable() clears a batch of PTEs. So checking
current PTE is insufficient. How about removing this check and always
performing cache invalidation?



Um... the PTE here may be present (e.g. 4K mapping --> superpage mapping)
or NOT present (e.g. creating a totally new superpage mapping), but we only
need to call free_pagetable and flush_iotlb in the former case, right?

But this code covers multiple PTEs and perhaps crosses the page boundary.

How about moving this code into a separate function and checking PTE presence
there? Sample code could look like below: [compiled but not tested!]

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index
d334f5b4e382..0e04d450c38a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2300,6 +2300,41 @@ static inline int hardware_largepage_caps(struct
dmar_domain *domain,
  return level;
   }

+/*
+ * Ensure that old small page tables are removed to make room for
superpage(s).
+ * We're going to add new large pages, so make sure we don't remove
their parent
+ * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared.
+ */
+static void switch_to_super_page(struct dmar_domain *domain,
+unsigned long start_pfn,
+unsigned long end_pfn, int level) {


Maybe "swith_to" will lead people to think "remove old and then setup new", so how about something 
like "remove_room_for_super_page" or "prepare_for_super_page" ?


I named it like this because we also want to have an opposite operation,
split_from_super_page(), which switches a PDE or PDPE from a super page
setup to small pages; this is needed to optimize dirty bit
tracking during VM live migration.




+   unsigned long lvl_pages = lvl_to_nr_pages(level);
+   struct dma_pte *pte = NULL;
+   int i;
+
+   while (start_pfn <= end_pfn) {


start_pfn < end_pfn ?


end_pfn is inclusive.




+   if (!pte)
+   pte = pfn_to_dma_pte(domain, start_pfn, &level);
+
+   if (dma_pte_present(pte)) {
+   dma_pte_free_pagetable(domain, start_pfn,
+  start_pfn + lvl_pages - 1,
+  level + 1);
+
+   for_each_domain_iommu(i, domain)
+   iommu_flush_iotlb_psi(g_iommus[i],
domain,
+ start_pfn,
lvl_pages,
+ 0, 0);
+   }
+
+   pte++;
+   start_pfn += lvl_pages;
+   if (first_pte_in_page(pte))
+   pte = NULL;
+   }
+}
+
   

Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-07 Thread Lu Baolu

Hi Longpeng,

On 4/7/21 2:35 PM, Longpeng (Mike, Cloud Infrastructure Service Product 
Dept.) wrote:

Hi Baolu,


-Original Message-
From: Lu Baolu [mailto:baolu...@linux.intel.com]
Sent: Friday, April 2, 2021 12:44 PM
To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
; io...@lists.linux-foundation.org;
linux-kernel@vger.kernel.org
Cc: baolu...@linux.intel.com; David Woodhouse ; Nadav
Amit ; Alex Williamson ;
Kevin Tian ; Gonglei (Arei) ;
sta...@vger.kernel.org
Subject: Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

Hi Longpeng,

On 4/1/21 3:18 PM, Longpeng(Mike) wrote:

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct

dmar_domain *domain,

 * removed to make room for superpage(s).
 * We're adding new large pages, so make sure
 * we don't remove their parent tables.
+*
+*We also need to flush the iotlb before creating
+* superpage to ensure it does not preserve any
+* obsolete info.
 */
-   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-  largepage_lvl + 1);
+   if (dma_pte_present(pte)) {


The dma_pte_free_pagetable() clears a batch of PTEs. So checking current PTE is
insufficient. How about removing this check and always performing cache
invalidation?



Um... the PTE here may be present (e.g. 4K mapping --> superpage mapping)
or NOT present (e.g. creating a totally new superpage mapping), but we only
need to call free_pagetable and flush_iotlb in the former case, right?


But this code covers multiple PTEs and perhaps crosses the page
boundary.

How about moving this code into a separate function and checking PTE
presence there? Sample code could look like below: [compiled but not
tested!]

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d334f5b4e382..0e04d450c38a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2300,6 +2300,41 @@ static inline int hardware_largepage_caps(struct 
dmar_domain *domain,

return level;
 }

+/*
+ * Ensure that old small page tables are removed to make room for 
superpage(s).
+ * We're going to add new large pages, so make sure we don't remove 
their parent

+ * tables. The IOTLB/devTLBs should be flushed if any PDE/PTEs are cleared.
+ */
+static void switch_to_super_page(struct dmar_domain *domain,
+unsigned long start_pfn,
+unsigned long end_pfn, int level)
+{
+   unsigned long lvl_pages = lvl_to_nr_pages(level);
+   struct dma_pte *pte = NULL;
+   int i;
+
+   while (start_pfn <= end_pfn) {
+   if (!pte)
+   pte = pfn_to_dma_pte(domain, start_pfn, &level);
+
+   if (dma_pte_present(pte)) {
+   dma_pte_free_pagetable(domain, start_pfn,
+  start_pfn + lvl_pages - 1,
+  level + 1);
+
+   for_each_domain_iommu(i, domain)
+   iommu_flush_iotlb_psi(g_iommus[i], domain,
+ start_pfn, lvl_pages,
+ 0, 0);
+   }
+
+   pte++;
+   start_pfn += lvl_pages;
+   if (first_pte_in_page(pte))
+   pte = NULL;
+   }
+}
+
 static int
 __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
 unsigned long phys_pfn, unsigned long nr_pages, int prot)
@@ -2341,22 +2376,11 @@ __domain_mapping(struct dmar_domain *domain, 
unsigned long iov_pfn,

return -ENOMEM;
/* It is large page*/
if (largepage_lvl > 1) {
-   unsigned long nr_superpages, end_pfn;
+   unsigned long end_pfn;

pteval |= DMA_PTE_LARGE_PAGE;
-   lvl_pages = lvl_to_nr_pages(largepage_lvl);
-
-   nr_superpages = nr_pages / lvl_pages;
-   end_pfn = iov_pfn + nr_superpages * 
lvl_pages - 1;

-
-   /*
-* Ensure that old small page tables are
-* removed to make room for superpage(s).
-* We're adding new large pages, so m

Re: [PATCH 5.4 v2 1/1] iommu/vt-d: Fix agaw for a supported 48 bit guest address width

2021-04-07 Thread Lu Baolu

On 4/8/21 2:40 AM, Saeed Mirzamohammadi wrote:

The IOMMU driver calculates the guest addressability for a DMA request
based on the value of the mgaw reported from the IOMMU. However, this
is a fused value and as mentioned in the spec, the guest width
should be calculated based on the minimum of supported adjusted guest
address width (SAGAW) and MGAW.

This is from specification:
"Guest addressability for a given DMA request is limited to the
minimum of the value reported through this field and the adjusted
guest address width of the corresponding page-table structure.
(Adjusted guest address widths supported by hardware are reported
through the SAGAW field)."

This causes domain initialization to fail and the following
errors appear for the EHCI PCI driver:

[2.486393] ehci-pci :01:00.4: EHCI Host Controller
[2.486624] ehci-pci :01:00.4: new USB bus registered, assigned bus
number 1
[2.489127] ehci-pci :01:00.4: DMAR: Allocating domain failed
[2.489350] ehci-pci :01:00.4: DMAR: 32bit DMA uses non-identity
mapping
[2.489359] ehci-pci :01:00.4: can't setup: -12
[2.489531] ehci-pci :01:00.4: USB bus 1 deregistered
[2.490023] ehci-pci :01:00.4: init :01:00.4 fail, -12
[2.490358] ehci-pci: probe of :01:00.4 failed with error -12

This issue happens when the value of the sagaw corresponds to a
48-bit agaw. This fix updates the calculation of the agaw based on
the minimum of IOMMU's sagaw value and MGAW.

Signed-off-by: Saeed Mirzamohammadi 
Tested-by: Camille Lu 
---

Change in v2:
- Added cap_width to calculate AGAW based on the minimum value of MGAW and AGAW.
---
  drivers/iommu/intel-iommu.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 953d86ca6d2b..a2a03df97704 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1853,7 +1853,7 @@ static inline int guestwidth_to_adjustwidth(int gaw)
  static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
   int guest_width)
  {
-   int adjust_width, agaw;
+   int adjust_width, agaw, cap_width;
unsigned long sagaw;
int err;
  
@@ -1867,8 +1867,9 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,

domain_reserve_special_ranges(domain);
  
  	/* calculate AGAW */

-   if (guest_width > cap_mgaw(iommu->cap))
-   guest_width = cap_mgaw(iommu->cap);
+   cap_width = min_t(int, cap_mgaw(iommu->cap), agaw_to_width(iommu->agaw));
+   if (guest_width > cap_width)
+   guest_width = cap_width;
domain->gaw = guest_width;
adjust_width = guestwidth_to_adjustwidth(guest_width);
agaw = width_to_agaw(adjust_width);



Reviewed-by: Lu Baolu 

Best regards,
baolu


Re: [PATCH v8 7/9] vfio/mdev: Add iommu related member in mdev_device

2021-04-06 Thread Lu Baolu

Hi Jason,

On 4/7/21 4:00 AM, Jason Gunthorpe wrote:

On Mon, Mar 25, 2019 at 09:30:34AM +0800, Lu Baolu wrote:

A parent device might create different types of mediated
devices. For example, a mediated device could be created
by the parent device with full isolation and protection
provided by the IOMMU. One usage case could be found on
Intel platforms where a mediated device is an assignable
subset of a PCI, the DMA requests on behalf of it are all
tagged with a PASID. Since IOMMU supports PASID-granular
translations (scalable mode in VT-d 3.0), this mediated
device could be individually protected and isolated by an
IOMMU.

This patch adds a new member in the struct mdev_device to
indicate that the mediated device represented by mdev could
be isolated and protected by attaching a domain to a device
represented by mdev->iommu_device. It also adds a helper to
add or set the iommu device.

* mdev_device->iommu_device
   - This, if set, indicates that the mediated device could
 be fully isolated and protected by IOMMU via attaching
 an iommu domain to this device. If empty, it indicates
 using vendor defined isolation, hence bypass IOMMU.

* mdev_set/get_iommu_device(dev, iommu_device)
   - Set or get the iommu device which represents this mdev
 in IOMMU's device scope. Drivers don't need to set the
 iommu device if it uses vendor defined isolation.

Cc: Ashok Raj 
Cc: Jacob Pan 
Cc: Kevin Tian 
Cc: Liu Yi L 
Suggested-by: Kevin Tian 
Suggested-by: Alex Williamson 
Signed-off-by: Lu Baolu 
Reviewed-by: Jean-Philippe Brucker 
---
  drivers/vfio/mdev/mdev_core.c| 18 ++
  drivers/vfio/mdev/mdev_private.h |  1 +
  include/linux/mdev.h | 14 ++
  3 files changed, 33 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index b96fedc77ee5..1b6435529166 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -390,6 +390,24 @@ int mdev_device_remove(struct device *dev, bool force_remove)
return 0;
  }
  
+int mdev_set_iommu_device(struct device *dev, struct device *iommu_device)

+{
+   struct mdev_device *mdev = to_mdev_device(dev);
+
+   mdev->iommu_device = iommu_device;
+
+   return 0;
+}
+EXPORT_SYMBOL(mdev_set_iommu_device);


I was looking at these functions when touching the mdev stuff and I
have some concerns.

1) Please don't merge dead code. It is a year later and there is still
no in-tree user for any of this. This is not our process. Even
worse it was exported so it looks like this dead code is supporting
out of tree modules.

2) Why is this like this? Every struct device already has a connection
to the iommu layer and every mdev has a struct device all its own.

Why did we need to add special 'if (mdev)' stuff all over the
place? This smells like the same abuse Thomas
and I pointed out for the interrupt domains.


I've ever tried to implement a bus iommu_ops for mdev devices.

https://lore.kernel.org/lkml/20201030045809.957927-1-baolu...@linux.intel.com/

Any comments?

Best regards,
baolu



After my next series the mdev drivers will have direct access to
the vfio_device. So an alternative to using the struct device, or
adding 'if mdev' is to add an API to the vfio_device world to
inject what iommu configuration is needed from that direction
instead of trying to discover it from a struct device.

3) The vfio_bus_is_mdev() and related symbol_get() nonsense in
drivers/vfio/vfio_iommu_type1.c has to go, for the same reasons
it was not acceptable to do this for the interrupt side either.

4) It seems pretty clear to me this will be heavily impacted by the
/dev/ioasid discussion. Please consider removing the dead code now.

Basically, please fix this before trying to get idxd mdev merged as
the first user.

Jason



Re: [PATCH v5.4 1/1] iommu/vt-d: Fix agaw for a supported 48 bit guest address width

2021-04-06 Thread Lu Baolu

Hi Saeed,

On 4/7/21 12:35 AM, Saeed Mirzamohammadi wrote:

The IOMMU driver calculates the guest addressability for a DMA request
based on the value of the mgaw reported from the IOMMU. However, this
is a fused value and as mentioned in the spec, the guest width
should be calculated based on the supported adjusted guest address width
(SAGAW).

This is from specification:
"Guest addressability for a given DMA request is limited to the
minimum of the value reported through this field and the adjusted
guest address width of the corresponding page-table structure.
(Adjusted guest address widths supported by hardware are reported
through the SAGAW field)."

This causes domain initialization to fail and following
errors appear for EHCI PCI driver:

[2.486393] ehci-pci :01:00.4: EHCI Host Controller
[2.486624] ehci-pci :01:00.4: new USB bus registered, assigned bus
number 1
[2.489127] ehci-pci :01:00.4: DMAR: Allocating domain failed
[2.489350] ehci-pci :01:00.4: DMAR: 32bit DMA uses non-identity
mapping
[2.489359] ehci-pci :01:00.4: can't setup: -12
[2.489531] ehci-pci :01:00.4: USB bus 1 deregistered
[2.490023] ehci-pci :01:00.4: init :01:00.4 fail, -12
[2.490358] ehci-pci: probe of :01:00.4 failed with error -12

This issue happens when the value of the sagaw corresponds to a
48-bit agaw. This fix updates the calculation of the agaw based on
the IOMMU's sagaw value.

Signed-off-by: Saeed Mirzamohammadi 
Tested-by: Camille Lu 
---
  drivers/iommu/intel-iommu.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 953d86ca6d2b..396e14fad54b 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -1867,8 +1867,8 @@ static int domain_init(struct dmar_domain *domain, struct intel_iommu *iommu,
domain_reserve_special_ranges(domain);
  
  	/* calculate AGAW */

-   if (guest_width > cap_mgaw(iommu->cap))
-   guest_width = cap_mgaw(iommu->cap);
+   if (guest_width > agaw_to_width(iommu->agaw))
+   guest_width = agaw_to_width(iommu->agaw);


The spec requires to use a minimum of MGAW and AGAW, so why not,

cap_width = min_t(int, cap_mgaw(iommu->cap), agaw_to_width(iommu->agaw));
if (guest_width > cap_width)
guest_width = cap_width;

Best regards,
baolu


domain->gaw = guest_width;
adjust_width = guestwidth_to_adjustwidth(guest_width);
agaw = width_to_agaw(adjust_width);



Re: [PATCH v2 4/5] iommu/vt-d: Use user privilege for RID2PASID translation

2021-04-05 Thread Lu Baolu

On 3/20/21 10:54 AM, Lu Baolu wrote:

When first-level page tables are used for IOVA translation, we use user
privilege by setting U/S bit in the page table entry. This is to make it
consistent with the second level translation, where the U/S enforcement
is not available. Clear the SRE (Supervisor Request Enable) field in the
pasid table entry of RID2PASID so that requests requesting the supervisor
privilege are blocked and treated as DMA remapping faults.

Suggested-by: Jacob Pan 
Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level")
Signed-off-by: Lu Baolu 


We found some devices still require SRE to be set during internal tests.
I will drop this patch from my queue for v5.13 for now.

Best regards,
baolu


---
  drivers/iommu/intel/iommu.c | 7 +--
  drivers/iommu/intel/pasid.c | 3 ++-
  2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 868f195f55ff..7354f9ce47d8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2494,9 +2494,9 @@ static int domain_setup_first_level(struct intel_iommu 
*iommu,
struct device *dev,
u32 pasid)
  {
-   int flags = PASID_FLAG_SUPERVISOR_MODE;
struct dma_pte *pgd = domain->pgd;
int agaw, level;
+   int flags = 0;
  
  	/*

 * Skip top levels of page tables for iommu which has
@@ -2512,7 +2512,10 @@ static int domain_setup_first_level(struct intel_iommu 
*iommu,
if (level != 4 && level != 5)
return -EINVAL;
  
-	flags |= (level == 5) ? PASID_FLAG_FL5LP : 0;

+   if (pasid != PASID_RID2PASID)
+   flags |= PASID_FLAG_SUPERVISOR_MODE;
+   if (level == 5)
+   flags |= PASID_FLAG_FL5LP;
  
  	return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid,

 domain->iommu_did[iommu->seq_id],
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 0bf7e0a76890..dd69df5a188a 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -673,7 +673,8 @@ int intel_pasid_setup_second_level(struct intel_iommu 
*iommu,
 * Since it is a second level only translation setup, we should
 * set SRE bit as well (addresses are expected to be GPAs).
 */
-   pasid_set_sre(pte);
+   if (pasid != PASID_RID2PASID)
+   pasid_set_sre(pte);
pasid_set_present(pte);
pasid_flush_caches(iommu, pte, pasid, did);
  



Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Lu Baolu
On 4/2/21 11:41 AM, Longpeng (Mike, Cloud Infrastructure Service Product 
Dept.) wrote:

Hi Baolu,

在 2021/4/2 11:06, Lu Baolu 写道:

Hi Longpeng,

On 4/1/21 3:18 PM, Longpeng(Mike) wrote:

The translation caches may preserve obsolete data when the
mapping size is changed, suppose the following sequence which
can reveal the problem with high probability.

1.mmap(4GB,MAP_HUGETLB)
2.
    while (1) {
     (a)    DMA MAP   0,0xa
     (b)    DMA UNMAP 0,0xa
     (c)    DMA MAP   0,0xc000
  * DMA read IOVA 0 may fail here (Not present)
   * if the problem occurs.
     (d)    DMA UNMAP 0,0xc000
    }

The page table(only focus on IOVA 0) after (a) is:
   PML4: 0x19db5c1003   entry:0x899bdcd2f000
    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
     PDE: 0x1a30a72003  entry:0x89b39cacb000
  PTE: 0x21d200803  entry:0x89b3b0a72000

The page table after (b) is:
   PML4: 0x19db5c1003   entry:0x899bdcd2f000
    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
     PDE: 0x1a30a72003  entry:0x89b39cacb000
  PTE: 0x0  entry:0x89b3b0a72000

The page table after (c) is:
   PML4: 0x19db5c1003   entry:0x899bdcd2f000
    PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
     PDE: 0x21d200883   entry:0x89b39cacb000 (*)

Because the PDE entry after (b) is present, it won't be
flushed even if the iommu driver flushes the cache when unmapping,
so the obsolete data may be preserved in the cache, which
would cause a wrong translation in the end.

However, we can see the PDE entry is finally switched to a
2M-superpage mapping, but it does not transform
to 0x21d200883 directly:

1. PDE: 0x1a30a72003
2. __domain_mapping
   dma_pte_free_pagetable
     Set the PDE entry to ZERO
   Set the PDE entry to 0x21d200883

So we must flush the cache after the entry switches to ZERO
to avoid the obsolete info being preserved.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Nadav Amit 
Cc: Alex Williamson 
Cc: Kevin Tian 
Cc: Gonglei (Arei) 

Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating
superpage")
Cc:  # v3.0+
Link:
https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/

Suggested-by: Lu Baolu 
Signed-off-by: Longpeng(Mike) 
---
   drivers/iommu/intel/iommu.c | 15 +--
   1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
    * removed to make room for superpage(s).
    * We're adding new large pages, so make sure
    * we don't remove their parent tables.
+ *
+ * We also need to flush the iotlb before creating
+ * superpage to ensure it does not preserve any
+ * obsolete info.
    */
-    dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-   largepage_lvl + 1);
+    if (dma_pte_present(pte)) {
+    int i;
+
+    dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
+   largepage_lvl + 1);
+    for_each_domain_iommu(i, domain)
+    iommu_flush_iotlb_psi(g_iommus[i], domain,
+  iov_pfn, nr_pages, 0, 0);


Thanks for patch!

How about making the flushed page size accurate? For example,

@@ -2365,8 +2365,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
             dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
                                    largepage_lvl + 1);
             for_each_domain_iommu(i, domain)
-                iommu_flush_iotlb_psi(g_iommus[i], domain,
-                                      iov_pfn, nr_pages, 0, 0);
+                iommu_flush_iotlb_psi(g_iommus[i], domain, iov_pfn,
+                                      ALIGN_DOWN(nr_pages, lvl_pages), 0, 0);


Yes, make sense.

Maybe another alternative is 'end_pfn - iova_pfn + 1', it's readable because we
free pagetable with (iova_pfn, end_pfn) above. Which one do you prefer?


Yours looks better.

By the way, if you are willing to prepare a v2, please make sure to add
Joerg (IOMMU subsystem maintainer) to the list.

Best regards,
baolu
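
As a side note on the flush-size discussion above, the difference between the three candidate sizes can be checked with a small self-contained C snippet; the pfn and page counts below are made-up example values, not taken from the driver.

#include <stdio.h>

#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

int main(void)
{
	unsigned long lvl_pages = 512;	/* 4K pages per 2M superpage (example) */
	unsigned long iov_pfn = 0;	/* start pfn of the mapping (example)  */
	unsigned long nr_pages = 1300;	/* pages still to be mapped (example)  */

	/* Range whose old small page tables were freed for superpage use. */
	unsigned long nr_superpages = nr_pages / lvl_pages;
	unsigned long end_pfn = iov_pfn + nr_superpages * lvl_pages - 1;

	printf("flush nr_pages              : %lu\n", nr_pages);
	printf("flush ALIGN_DOWN(nr_pages)  : %lu\n", ALIGN_DOWN(nr_pages, lvl_pages));
	printf("flush end_pfn - iov_pfn + 1 : %lu\n", end_pfn - iov_pfn + 1);
	return 0;
}

Both suggested forms cover exactly the range whose page tables were freed (1024 pages in this example), while flushing nr_pages (1300) would overshoot.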


Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Lu Baolu

Hi Longpeng,

On 4/1/21 3:18 PM, Longpeng(Mike) wrote:

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
 * removed to make room for superpage(s).
 * We're adding new large pages, so make sure
 * we don't remove their parent tables.
+*
+* We also need to flush the iotlb before creating
+* superpage to ensure it does not preserve any
+* obsolete info.
 */
-   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-  largepage_lvl + 1);
+   if (dma_pte_present(pte)) {


The dma_pte_free_pagetable() clears a batch of PTEs. So checking current
PTE is insufficient. How about removing this check and always performing
cache invalidation?


+               int i;
+
+               dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
+                                      largepage_lvl + 1);
+               for_each_domain_iommu(i, domain)
+                       iommu_flush_iotlb_psi(g_iommus[i], domain,
+                                             iov_pfn, nr_pages, 0, 0);
+


Best regards,
baolu


Re: [PATCH] iommu/vt-d: Force to flush iotlb before creating superpage

2021-04-01 Thread Lu Baolu

Hi Longpeng,

On 4/1/21 3:18 PM, Longpeng(Mike) wrote:

The translation caches may preserve obsolete data when the
mapping size is changed, suppose the following sequence which
can reveal the problem with high probability.

1.mmap(4GB,MAP_HUGETLB)
2.
   while (1) {
(a)DMA MAP   0,0xa
(b)DMA UNMAP 0,0xa
(c)DMA MAP   0,0xc000
  * DMA read IOVA 0 may fail here (Not present)
  * if the problem occurs.
(d)DMA UNMAP 0,0xc000
   }

The page table(only focus on IOVA 0) after (a) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x1a30a72003  entry:0x89b39cacb000
 PTE: 0x21d200803  entry:0x89b3b0a72000

The page table after (b) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x1a30a72003  entry:0x89b39cacb000
 PTE: 0x0  entry:0x89b3b0a72000

The page table after (c) is:
  PML4: 0x19db5c1003   entry:0x899bdcd2f000
   PDPE: 0x1a1cacb003  entry:0x89b35b5c1000
PDE: 0x21d200883   entry:0x89b39cacb000 (*)

Because the PDE entry after (b) is present, it won't be
flushed even if the iommu driver flushes the cache when unmapping,
so the obsolete data may be preserved in the cache, which
would cause a wrong translation in the end.

However, we can see the PDE entry is finally switched to a
2M-superpage mapping, but it does not transform
to 0x21d200883 directly:

1. PDE: 0x1a30a72003
2. __domain_mapping
  dma_pte_free_pagetable
Set the PDE entry to ZERO
  Set the PDE entry to 0x21d200883

So we must flush the cache after the entry switches to ZERO
to avoid the obsolete info being preserved.

Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Nadav Amit 
Cc: Alex Williamson 
Cc: Kevin Tian 
Cc: Gonglei (Arei) 

Fixes: 6491d4d02893 ("intel-iommu: Free old page tables before creating 
superpage")
Cc:  # v3.0+
Link: 
https://lore.kernel.org/linux-iommu/670baaf8-4ff8-4e84-4be3-030b95ab5...@huawei.com/
Suggested-by: Lu Baolu 
Signed-off-by: Longpeng(Mike) 
---
  drivers/iommu/intel/iommu.c | 15 +--
  1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee09323..cbcb434 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2342,9 +2342,20 @@ static inline int hardware_largepage_caps(struct dmar_domain *domain,
 * removed to make room for superpage(s).
 * We're adding new large pages, so make sure
 * we don't remove their parent tables.
+*
+* We also need to flush the iotlb before creating
+* superpage to ensure it does not preserve any
+* obsolete info.
 */
-   dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
-  largepage_lvl + 1);
+   if (dma_pte_present(pte)) {
+   int i;
+
+               dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
+                                      largepage_lvl + 1);
+               for_each_domain_iommu(i, domain)
+                       iommu_flush_iotlb_psi(g_iommus[i], domain,
+                                             iov_pfn, nr_pages, 0, 0);


Thanks for patch!

How about making the flushed page size accurate? For example,

@@ -2365,8 +2365,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
             dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
                                    largepage_lvl + 1);
             for_each_domain_iommu(i, domain)
-                iommu_flush_iotlb_psi(g_iommus[i], domain,
-                                      iov_pfn, nr_pages, 0, 0);
+                iommu_flush_iotlb_psi(g_iommus[i], domain, iov_pfn,
+                                      ALIGN_DOWN(nr_pages, lvl_pages), 0, 0);




+   }
} else {
pteval &= ~(uint64_t)DMA_PTE_LARGE_PAGE;
}



Best regards,
baolu


[PATCH 1/1] iommu/vt-d: Report right snoop capability when using FL for IOVA

2021-03-29 Thread Lu Baolu
The Intel VT-d driver checks the wrong register to report the snoop capability
when using the first level page table for GPA to HPA translation. This might
lead the IOMMU driver to say that it supports snooping control, but in
reality, it does not. Fix this by always setting PASID-table-entry.PGSNP
whenever a pasid entry is set up for GPA to HPA translation, so that
the IOMMU driver can report the snoop capability as long as it runs in the
scalable mode.

Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level")
Suggested-by: Rajesh Sankaran 
Suggested-by: Kevin Tian 
Suggested-by: Ashok Raj 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/pasid.h |  1 +
 drivers/iommu/intel/iommu.c | 11 ++-
 drivers/iommu/intel/pasid.c | 16 
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 079534fcf55d..5ff61c3d401f 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -48,6 +48,7 @@
  */
 #define PASID_FLAG_SUPERVISOR_MODE BIT(0)
 #define PASID_FLAG_NESTED  BIT(1)
+#define PASID_FLAG_PAGE_SNOOP  BIT(2)
 
 /*
  * The PASID_FLAG_FL5LP flag Indicates using 5-level paging for first-
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7354f9ce47d8..deaa87ad4e5f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -657,7 +657,14 @@ static int domain_update_iommu_snooping(struct intel_iommu *skip)
rcu_read_lock();
for_each_active_iommu(iommu, drhd) {
if (iommu != skip) {
-   if (!ecap_sc_support(iommu->ecap)) {
+   /*
+* If the hardware is operating in the scalable mode,
+* the snooping control is always supported since we
+* always set PASID-table-entry.PGSNP bit if the domain
+* is managed outside (UNMANAGED).
+*/
+   if (!sm_supported(iommu) &&
+   !ecap_sc_support(iommu->ecap)) {
ret = 0;
break;
}
@@ -2516,6 +2523,8 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
flags |= PASID_FLAG_SUPERVISOR_MODE;
if (level == 5)
flags |= PASID_FLAG_FL5LP;
+   if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
+   flags |= PASID_FLAG_PAGE_SNOOP;
 
return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid,
 domain->iommu_did[iommu->seq_id],
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index c896aef7db40..b901909da79e 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -425,6 +425,16 @@ static inline void pasid_set_page_snoop(struct pasid_entry *pe, bool value)
	pasid_set_bits(&pe->val[1], 1 << 23, value << 23);
 }
 
+/*
+ * Setup the Page Snoop (PGSNP) field (Bit 88) of a scalable mode
+ * PASID entry.
+ */
+static inline void
+pasid_set_pgsnp(struct pasid_entry *pe)
+{
+   pasid_set_bits(&pe->val[1], 1ULL << 24, 1ULL << 24);
+}
+
 /*
  * Setup the First Level Page table Pointer field (Bit 140~191)
  * of a scalable mode PASID entry.
@@ -599,6 +609,9 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
}
}
 
+   if (flags & PASID_FLAG_PAGE_SNOOP)
+   pasid_set_pgsnp(pte);
+
pasid_set_domain_id(pte, did);
pasid_set_address_width(pte, iommu->agaw);
pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
@@ -677,6 +690,9 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
pasid_set_fault_enable(pte);
pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
+   if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
+   pasid_set_pgsnp(pte);
+
/*
 * Since it is a second level only translation setup, we should
 * set SRE bit as well (addresses are expected to be GPAs).
-- 
2.25.1
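
The pasid_set_* helpers in this patch all follow the same masked read-modify-write pattern; a minimal self-contained illustration (plain C, ignoring WRITE_ONCE/READ_ONCE and the real 512-bit PASID entry layout) is:

#include <stdio.h>
#include <stdint.h>

/* Clear the masked bits of *dst and set them to the masked value. */
static inline void set_bits(uint64_t *dst, uint64_t mask, uint64_t bits)
{
	*dst = (*dst & ~mask) | (bits & mask);
}

int main(void)
{
	uint64_t val[8] = { 0 };	/* stand-in for a 512-bit pasid entry */

	/* Same spirit as pasid_set_pgsnp(): set entry bit 88, i.e. val[1] bit 24. */
	set_bits(&val[1], 1ULL << 24, 1ULL << 24);

	printf("val[1] = 0x%llx\n", (unsigned long long)val[1]);
	return 0;
}

Bit 88 of the entry lands in val[1] as bit 24 (88 - 64), which is why the PGSNP helper masks 1ULL << 24.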



Re: A problem of Intel IOMMU hardware ?

2021-03-26 Thread Lu Baolu

Hi Nadav,

On 3/27/21 12:36 PM, Nadav Amit wrote:




On Mar 26, 2021, at 7:31 PM, Lu Baolu  wrote:

Hi Nadav,

On 3/19/21 12:46 AM, Nadav Amit wrote:

So here is my guess:
Intel probably used as a basis for the IOTLB an implementation of
some other (regular) TLB design.
Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):
"Software wishing to prevent this uncertainty should not write to
a paging-structure entry in a way that would change, for any linear
address, both the page size and either the page frame, access rights,
or other attributes.”
Now the aforementioned uncertainty is a bit different (multiple
*valid*  translations of a single address). Yet, perhaps this is
yet another thing that might happen.
 From a brief look on the handling of MMU (not IOMMU) hugepages
in Linux, indeed the PMD is first cleared and flushed before a
new valid PMD is set. This is possible for MMUs since they
allow the software to handle spurious page-faults gracefully.
This is not the case for the IOMMU though (without PRI).
Not sure this explains everything though. If that is the problem,
then during a mapping that changes page-sizes, a TLB flush is
needed, similarly to the one Longpeng did manually.


I have been working with Longpeng on this issue these days. It turned
out that your guess is right. The PMD is first cleared but not flushed
before a new valid one is set. The previous entry might be cached in the
paging structure caches hence leads to disaster.

In __domain_mapping():

2352 /*
2353  * Ensure that old small page tables are
2354  * removed to make room for superpage(s).
2355  * We're adding new large pages, so make sure
2356  * we don't remove their parent tables.
2357  */
2358 dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
2359                        largepage_lvl + 1);

I guess adding a cache flush operation after PMD switching should solve
the problem.

I am still not clear about this comment:

"
This is possible for MMUs since they allow the software to handle
spurious page-faults gracefully. This is not the case for the IOMMU
though (without PRI).
"

Can you please shed more light on this?


I was looking at the code in more detail, and apparently my concern
is incorrect.

I was under the assumption that the IOMMU map/unmap can merge/split
(specifically split) huge-pages. For instance, if you map 2MB and
then unmap 4KB out of the 2MB, then you would split the hugepage
and keep the rest of the mappings alive. This is the way MMU is
usually managed. To my defense, I also saw such partial unmappings
in Longpeng’s first scenario.

If this was possible, then you would have a case in which out of 2MB
(for instance), 4KB were unmapped, and you need to split the 2MB
hugepage into 4KB pages. If you try to clear the PMD, flush, and then
set the PMD to point to table with live 4KB PTES, you can have
an interim state in which the PMD is not present. DMAs that arrive
at this stage might fault, and without PRI (and device support)
you do not have a way of restarting the DMA after the hugepage split
is completed.


Got you, and thanks a lot for sharing.

For current IOMMU usage, I can't see any case to split a huge page into
4KB pages, but in the near future, we do have a need of splitting huge
pages. For example, when we want to use the A/D bit to track the DMA
dirty pages during VM migration, it's an optimization if we could split
a huge page into 4K ones. So far, the solution I have considered is:

1) Prepare the split subtables in advance;
   [it's identical to the existing one only use 4k pages instead of huge
page.]
2) Switch the super (huge) page's leaf entry;
   [at this point, hardware could use both subtables. I am not sure
whether the hardware allows a dynamic switch of page table entry
from one valid entry to another valid one.]
3) Flush the cache.
   [hardware will use the new subtable]

As long as we can make sure that the old subtable won't be used by
hardware, we can safely release the old table.
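
A minimal, self-contained userspace sketch of the three-step sequence above; the structures and the atomic pointer are stand-ins for the real IOMMU page-table entry, and the printf stands in for the IOTLB/paging-structure cache flush:

#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>

#define PTES_PER_TABLE	512
#define PAGE_SIZE	4096UL

struct pte { unsigned long val; };

/* Stands in for the superpage leaf entry; NULL means "still a 2M leaf". */
static _Atomic(struct pte *) pde_subtable;

int main(void)
{
	unsigned long huge_phys = 0x40000000UL;	/* physical base of the 2M page */
	struct pte *sub;
	unsigned long i;

	/* 1) Prepare the 4K subtable in advance, mirroring the huge mapping. */
	sub = calloc(PTES_PER_TABLE, sizeof(*sub));
	if (!sub)
		return 1;
	for (i = 0; i < PTES_PER_TABLE; i++)
		sub[i].val = huge_phys + i * PAGE_SIZE;

	/* 2) Switch the leaf entry so it points at the subtable. */
	atomic_store_explicit(&pde_subtable, sub, memory_order_release);

	/*
	 * 3) Flush. In the real driver this is an IOTLB and paging-structure
	 *    cache invalidation; only after it completes can the old 2M
	 *    translation be assumed gone and the old table be released.
	 */
	printf("subtable installed; flush would be issued here\n");

	printf("page 3 now maps 0x%lx\n",
	       atomic_load_explicit(&pde_subtable, memory_order_acquire)[3].val);
	free(sub);
	return 0;
}

Whether the hardware tolerates the valid-to-valid switch in step 2 without an intermediate not-present state is exactly the open question here; the sketch only illustrates the ordering of the software steps.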



Anyhow, this concern is apparently not relevant. I guess I was too
naive to assume the IOMMU management is similar to the MMU. I now
see that there is a comment in intel_iommu_unmap() saying:

 /* Cope with horrid API which requires us to unmap more than the
size argument if it happens to be a large-page mapping. */

Regards,
Nadav



Best regards,
baolu


Re: A problem of Intel IOMMU hardware ?

2021-03-26 Thread Lu Baolu

Hi Nadav,

On 3/19/21 12:46 AM, Nadav Amit wrote:

So here is my guess:

Intel probably used as a basis for the IOTLB an implementation of
some other (regular) TLB design.

Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):

"Software wishing to prevent this uncertainty should not write to
a paging-structure entry in a way that would change, for any linear
address, both the page size and either the page frame, access rights,
or other attributes.”


Now the aforementioned uncertainty is a bit different (multiple
*valid*  translations of a single address). Yet, perhaps this is
yet another thing that might happen.

 From a brief look on the handling of MMU (not IOMMU) hugepages
in Linux, indeed the PMD is first cleared and flushed before a
new valid PMD is set. This is possible for MMUs since they
allow the software to handle spurious page-faults gracefully.
This is not the case for the IOMMU though (without PRI).

Not sure this explains everything though. If that is the problem,
then during a mapping that changes page-sizes, a TLB flush is
needed, similarly to the one Longpeng did manually.


I have been working with Longpeng on this issue these days. It turned
out that your guess is right. The PMD is first cleared but not flushed
before a new valid one is set. The previous entry might be cached in the
paging structure caches hence leads to disaster.

In __domain_mapping():

2352 /*
2353  * Ensure that old small page tables are
2354  * removed to make room for superpage(s).
2355  * We're adding new large pages, so make sure
2356  * we don't remove their parent tables.
2357  */
2358 dma_pte_free_pagetable(domain, iov_pfn, end_pfn,
2359                        largepage_lvl + 1);


I guess adding a cache flush operation after PMD switching should solve
the problem.

I am still not clear about this comment:

"
 This is possible for MMUs since they allow the software to handle
 spurious page-faults gracefully. This is not the case for the IOMMU
 though (without PRI).
"

Can you please shed more light on this?

Best regards,
baolu


Re: [PATCH 3/5] iommu/vt-d: Remove SVM_FLAG_PRIVATE_PASID

2021-03-24 Thread Lu Baolu

Hi Christoph,

On 3/24/21 1:33 AM, Christoph Hellwig wrote:

On Tue, Mar 23, 2021 at 09:05:58AM +0800, Lu Baolu wrote:

The SVM_FLAG_PRIVATE_PASID has never been referenced in the tree, and
there's no plan to have anything use it. So clean it up.

Signed-off-by: Lu Baolu 


Looks good,

Reviewed-by: Christoph Hellwig 


Thank you!



But can we take this a little further?  SVM_FLAG_GUEST_PASID is unused
as well.  SVM_FLAG_GUEST_MODE is only used in drivers/iommu/intel/svm.c,
and SVM_FLAG_SUPERVISOR_MODE is actually used as an API flag to
iommu_sva_bind_device.  So IMHO the latter should be elevated to an
IOMMU API level flag, and then include/linux/intel-svm.h can go away
entirely or at least be moved to drivers/iommu/intel/.



Sure. I will consider it and make it in separated patches.

Best regards,
baolu


Re: [PATCH 2/3] iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining

2021-03-22 Thread Lu Baolu

On 3/1/21 8:12 PM, John Garry wrote:

Now that the core code handles flushing per-IOVA domain CPU rcaches,
remove the handling here.

Signed-off-by: John Garry 
---
  drivers/iommu/intel/iommu.c | 31 ---
  include/linux/cpuhotplug.h  |  1 -
  2 files changed, 32 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee0932307d64..d1e66e1b07b8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4065,35 +4065,6 @@ static struct notifier_block intel_iommu_memory_nb = {
.priority = 0
  };
  
-static void free_all_cpu_cached_iovas(unsigned int cpu)

-{
-   int i;
-
-   for (i = 0; i < g_num_of_iommus; i++) {
-   struct intel_iommu *iommu = g_iommus[i];
-   struct dmar_domain *domain;
-   int did;
-
-   if (!iommu)
-   continue;
-
-   for (did = 0; did < cap_ndoms(iommu->cap); did++) {
-   domain = get_iommu_domain(iommu, (u16)did);
-
-   if (!domain || domain->domain.type != IOMMU_DOMAIN_DMA)
-   continue;
-
-   iommu_dma_free_cpu_cached_iovas(cpu, &domain->domain);
-   }
-   }
-}
-
-static int intel_iommu_cpu_dead(unsigned int cpu)
-{
-   free_all_cpu_cached_iovas(cpu);
-   return 0;
-}
-
  static void intel_disable_iommus(void)
  {
struct intel_iommu *iommu = NULL;
@@ -4388,8 +4359,6 @@ int __init intel_iommu_init(void)
	bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
if (si_domain && !hw_pass_through)
	register_memory_notifier(&intel_iommu_memory_nb);
-   cpuhp_setup_state(CPUHP_IOMMU_INTEL_DEAD, "iommu/intel:dead", NULL,
- intel_iommu_cpu_dead);
  
	down_read(&dmar_global_lock);

if (probe_acpi_namespace_devices())
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index cedac9986557..85996494bec1 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -57,7 +57,6 @@ enum cpuhp_state {
CPUHP_PAGE_ALLOC_DEAD,
CPUHP_NET_DEV_DEAD,
CPUHP_PCI_XGENE_DEAD,
-   CPUHP_IOMMU_INTEL_DEAD,
CPUHP_IOMMU_IOVA_DEAD,
CPUHP_LUSTRE_CFS_DEAD,
CPUHP_AP_ARM_CACHE_B15_RAC_DEAD,



Reviewed-by: Lu Baolu 

Best regards,
baolu


[PATCH 5/5] iommu/vt-d: Make unnecessarily global functions static

2021-03-22 Thread Lu Baolu
Make some functions static as they are only used inside pasid.c.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/pasid.c | 4 ++--
 drivers/iommu/intel/pasid.h | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index f2c747e62c6a..c896aef7db40 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -230,7 +230,7 @@ struct pasid_table *intel_pasid_get_table(struct device *dev)
return info->pasid_table;
 }
 
-int intel_pasid_get_dev_max_id(struct device *dev)
+static int intel_pasid_get_dev_max_id(struct device *dev)
 {
struct device_domain_info *info;
 
@@ -241,7 +241,7 @@ int intel_pasid_get_dev_max_id(struct device *dev)
return info->pasid_table->max_pasid;
 }
 
-struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
+static struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid)
 {
struct device_domain_info *info;
struct pasid_table *pasid_table;
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 90a3268d7a77..079534fcf55d 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -102,8 +102,6 @@ extern unsigned int intel_pasid_max_id;
 int intel_pasid_alloc_table(struct device *dev);
 void intel_pasid_free_table(struct device *dev);
 struct pasid_table *intel_pasid_get_table(struct device *dev);
-int intel_pasid_get_dev_max_id(struct device *dev);
-struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid);
 int intel_pasid_setup_first_level(struct intel_iommu *iommu,
  struct device *dev, pgd_t *pgd,
  u32 pasid, u16 did, int flags);
-- 
2.25.1



[PATCH 2/5] iommu/vt-d: Remove svm_dev_ops

2021-03-22 Thread Lu Baolu
The svm_dev_ops has never been referenced in the tree, and there's no
plan to have anything to use it. Remove it to make the code neat.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c   | 15 +--
 include/linux/intel-iommu.h |  3 ---
 include/linux/intel-svm.h   |  7 ---
 3 files changed, 1 insertion(+), 24 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 5d590d63ab52..875ee74d8c41 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -462,7 +462,6 @@ static void load_pasid(struct mm_struct *mm, u32 pasid)
 /* Caller must hold pasid_mutex, mm reference */
 static int
 intel_svm_bind_mm(struct device *dev, unsigned int flags,
- struct svm_dev_ops *ops,
  struct mm_struct *mm, struct intel_svm_dev **sd)
 {
struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
@@ -512,10 +511,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
 
/* Find the matching device in svm list */
for_each_svm_dev(sdev, svm, dev) {
-   if (sdev->ops != ops) {
-   ret = -EBUSY;
-   goto out;
-   }
sdev->users++;
goto success;
}
@@ -550,7 +545,6 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
 
/* Finish the setup now we know we're keeping it */
sdev->users = 1;
-   sdev->ops = ops;
	init_rcu_head(&sdev->rcu);
 
if (!svm) {
@@ -1006,13 +1000,6 @@ static irqreturn_t prq_event_thread(int irq, void *d)
mmap_read_unlock(svm->mm);
mmput(svm->mm);
 bad_req:
-   WARN_ON(!sdev);
-   if (sdev && sdev->ops && sdev->ops->fault_cb) {
-   int rwxp = (req->rd_req << 3) | (req->wr_req << 2) |
-   (req->exe_req << 1) | (req->pm_req);
-   sdev->ops->fault_cb(sdev->dev, req->pasid, req->addr,
-   req->priv_data, rwxp, result);
-   }
/* We get here in the error case where the PASID lookup failed,
   and these can be NULL. Do not use them below this point! */
sdev = NULL;
@@ -1087,7 +1074,7 @@ intel_svm_bind(struct device *dev, struct mm_struct *mm, void *drvdata)
if (drvdata)
flags = *(unsigned int *)drvdata;
mutex_lock(_mutex);
-   ret = intel_svm_bind_mm(dev, flags, NULL, mm, &sdev);
+   ret = intel_svm_bind_mm(dev, flags, mm, &sdev);
if (ret)
sva = ERR_PTR(ret);
else if (sdev)
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 76f974da8ca4..03faf20a6817 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -770,14 +770,11 @@ u32 intel_svm_get_pasid(struct iommu_sva *handle);
 int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt,
struct iommu_page_response *msg);
 
-struct svm_dev_ops;
-
 struct intel_svm_dev {
struct list_head list;
struct rcu_head rcu;
struct device *dev;
struct intel_iommu *iommu;
-   struct svm_dev_ops *ops;
struct iommu_sva sva;
u32 pasid;
int users;
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index 39d368a810b8..6c9d10c0fb1e 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -8,13 +8,6 @@
 #ifndef __INTEL_SVM_H__
 #define __INTEL_SVM_H__
 
-struct device;
-
-struct svm_dev_ops {
-   void (*fault_cb)(struct device *dev, u32 pasid, u64 address,
-void *private, int rwxp, int response);
-};
-
 /* Values for rxwp in fault_cb callback */
 #define SVM_REQ_READ   (1<<3)
 #define SVM_REQ_WRITE  (1<<2)
-- 
2.25.1



[PATCH 3/5] iommu/vt-d: Remove SVM_FLAG_PRIVATE_PASID

2021-03-22 Thread Lu Baolu
The SVM_FLAG_PRIVATE_PASID has never been referenced in the tree, and
there's no plan to have anything use it. So clean it up.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c | 40 ++-
 include/linux/intel-svm.h | 16 +++-
 2 files changed, 21 insertions(+), 35 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 875ee74d8c41..5165cea90421 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -465,9 +465,9 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
  struct mm_struct *mm, struct intel_svm_dev **sd)
 {
struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
+   struct intel_svm *svm = NULL, *t;
struct device_domain_info *info;
struct intel_svm_dev *sdev;
-   struct intel_svm *svm = NULL;
unsigned long iflags;
int pasid_max;
int ret;
@@ -493,30 +493,26 @@ intel_svm_bind_mm(struct device *dev, unsigned int flags,
}
}
 
-   if (!(flags & SVM_FLAG_PRIVATE_PASID)) {
-   struct intel_svm *t;
-
-   list_for_each_entry(t, &global_svm_list, list) {
-   if (t->mm != mm || (t->flags & SVM_FLAG_PRIVATE_PASID))
-   continue;
-
-   svm = t;
-   if (svm->pasid >= pasid_max) {
-   dev_warn(dev,
-"Limited PASID width. Cannot use 
existing PASID %d\n",
-svm->pasid);
-   ret = -ENOSPC;
-   goto out;
-   }
+   list_for_each_entry(t, &global_svm_list, list) {
+   if (t->mm != mm)
+   continue;
 
-   /* Find the matching device in svm list */
-   for_each_svm_dev(sdev, svm, dev) {
-   sdev->users++;
-   goto success;
-   }
+   svm = t;
+   if (svm->pasid >= pasid_max) {
+   dev_warn(dev,
+"Limited PASID width. Cannot use existing 
PASID %d\n",
+svm->pasid);
+   ret = -ENOSPC;
+   goto out;
+   }
 
-   break;
+   /* Find the matching device in svm list */
+   for_each_svm_dev(sdev, svm, dev) {
+   sdev->users++;
+   goto success;
}
+
+   break;
}
 
sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h
index 6c9d10c0fb1e..10fa80eef13a 100644
--- a/include/linux/intel-svm.h
+++ b/include/linux/intel-svm.h
@@ -14,16 +14,6 @@
 #define SVM_REQ_EXEC   (1<<1)
 #define SVM_REQ_PRIV   (1<<0)
 
-/*
- * The SVM_FLAG_PRIVATE_PASID flag requests a PASID which is *not* the "main"
- * PASID for the current process. Even if a PASID already exists, a new one
- * will be allocated. And the PASID allocated with SVM_FLAG_PRIVATE_PASID
- * will not be given to subsequent callers. This facility allows a driver to
- * disambiguate between multiple device contexts which access the same MM,
- * if there is no other way to do so. It should be used sparingly, if at all.
- */
-#define SVM_FLAG_PRIVATE_PASID (1<<0)
-
 /*
  * The SVM_FLAG_SUPERVISOR_MODE flag requests a PASID which can be used only
  * for access to kernel addresses. No IOTLB flushes are automatically done
@@ -35,18 +25,18 @@
  * It is unlikely that we will ever hook into flush_tlb_kernel_range() to
  * do such IOTLB flushes automatically.
  */
-#define SVM_FLAG_SUPERVISOR_MODE   (1<<1)
+#define SVM_FLAG_SUPERVISOR_MODE   BIT(0)
 /*
  * The SVM_FLAG_GUEST_MODE flag is used when a PASID bind is for guest
  * processes. Compared to the host bind, the primary differences are:
  * 1. mm life cycle management
  * 2. fault reporting
  */
-#define SVM_FLAG_GUEST_MODE(1<<2)
+#define SVM_FLAG_GUEST_MODEBIT(1)
 /*
  * The SVM_FLAG_GUEST_PASID flag is used when a guest has its own PASID space,
  * which requires guest and host PASID translation at both directions.
  */
-#define SVM_FLAG_GUEST_PASID   (1<<3)
+#define SVM_FLAG_GUEST_PASID   BIT(2)
 
 #endif /* __INTEL_SVM_H__ */
-- 
2.25.1



[PATCH 0/5] iommu/vt-d: Several misc cleanups

2021-03-22 Thread Lu Baolu
Hi Joerg et al,

This series includes several cleanups in the VT-d driver. Please help to
review.

Best regards,
baolu

Lu Baolu (5):
  iommu/vt-d: Remove unused dma map/unmap trace events
  iommu/vt-d: Remove svm_dev_ops
  iommu/vt-d: Remove SVM_FLAG_PRIVATE_PASID
  iommu/vt-d: Remove unused function declarations
  iommu/vt-d: Make unnecessarily global functions static

 drivers/iommu/intel/pasid.c|   4 +-
 drivers/iommu/intel/pasid.h|   5 --
 drivers/iommu/intel/svm.c  |  55 +
 include/linux/intel-iommu.h|   3 -
 include/linux/intel-svm.h  |  23 +-
 include/trace/events/intel_iommu.h | 120 -
 6 files changed, 24 insertions(+), 186 deletions(-)

-- 
2.25.1



[PATCH 1/5] iommu/vt-d: Remove unused dma map/unmap trace events

2021-03-22 Thread Lu Baolu
With commit c588072bba6b5 ("iommu/vt-d: Convert intel iommu driver to
the iommu ops"), the trace events for dma map/unmap have no users any
more. Cleanup them to make the code neat.

Signed-off-by: Lu Baolu 
---
 include/trace/events/intel_iommu.h | 120 -
 1 file changed, 120 deletions(-)

diff --git a/include/trace/events/intel_iommu.h b/include/trace/events/intel_iommu.h
index e801f4910522..d233f2916584 100644
--- a/include/trace/events/intel_iommu.h
+++ b/include/trace/events/intel_iommu.h
@@ -15,126 +15,6 @@
 #include 
 #include 
 
-DECLARE_EVENT_CLASS(dma_map,
-   TP_PROTO(struct device *dev, dma_addr_t dev_addr, phys_addr_t phys_addr,
-size_t size),
-
-   TP_ARGS(dev, dev_addr, phys_addr, size),
-
-   TP_STRUCT__entry(
-   __string(dev_name, dev_name(dev))
-   __field(dma_addr_t, dev_addr)
-   __field(phys_addr_t, phys_addr)
-   __field(size_t, size)
-   ),
-
-   TP_fast_assign(
-   __assign_str(dev_name, dev_name(dev));
-   __entry->dev_addr = dev_addr;
-   __entry->phys_addr = phys_addr;
-   __entry->size = size;
-   ),
-
-   TP_printk("dev=%s dev_addr=0x%llx phys_addr=0x%llx size=%zu",
- __get_str(dev_name),
- (unsigned long long)__entry->dev_addr,
- (unsigned long long)__entry->phys_addr,
- __entry->size)
-);
-
-DEFINE_EVENT(dma_map, map_single,
-   TP_PROTO(struct device *dev, dma_addr_t dev_addr, phys_addr_t phys_addr,
-size_t size),
-   TP_ARGS(dev, dev_addr, phys_addr, size)
-);
-
-DEFINE_EVENT(dma_map, bounce_map_single,
-   TP_PROTO(struct device *dev, dma_addr_t dev_addr, phys_addr_t phys_addr,
-size_t size),
-   TP_ARGS(dev, dev_addr, phys_addr, size)
-);
-
-DECLARE_EVENT_CLASS(dma_unmap,
-   TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size),
-
-   TP_ARGS(dev, dev_addr, size),
-
-   TP_STRUCT__entry(
-   __string(dev_name, dev_name(dev))
-   __field(dma_addr_t, dev_addr)
-   __field(size_t, size)
-   ),
-
-   TP_fast_assign(
-   __assign_str(dev_name, dev_name(dev));
-   __entry->dev_addr = dev_addr;
-   __entry->size = size;
-   ),
-
-   TP_printk("dev=%s dev_addr=0x%llx size=%zu",
- __get_str(dev_name),
- (unsigned long long)__entry->dev_addr,
- __entry->size)
-);
-
-DEFINE_EVENT(dma_unmap, unmap_single,
-   TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size),
-   TP_ARGS(dev, dev_addr, size)
-);
-
-DEFINE_EVENT(dma_unmap, unmap_sg,
-   TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size),
-   TP_ARGS(dev, dev_addr, size)
-);
-
-DEFINE_EVENT(dma_unmap, bounce_unmap_single,
-   TP_PROTO(struct device *dev, dma_addr_t dev_addr, size_t size),
-   TP_ARGS(dev, dev_addr, size)
-);
-
-DECLARE_EVENT_CLASS(dma_map_sg,
-   TP_PROTO(struct device *dev, int index, int total,
-struct scatterlist *sg),
-
-   TP_ARGS(dev, index, total, sg),
-
-   TP_STRUCT__entry(
-   __string(dev_name, dev_name(dev))
-   __field(dma_addr_t, dev_addr)
-   __field(phys_addr_t, phys_addr)
-   __field(size_t, size)
-   __field(int, index)
-   __field(int, total)
-   ),
-
-   TP_fast_assign(
-   __assign_str(dev_name, dev_name(dev));
-   __entry->dev_addr = sg->dma_address;
-   __entry->phys_addr = sg_phys(sg);
-   __entry->size = sg->dma_length;
-   __entry->index = index;
-   __entry->total = total;
-   ),
-
-   TP_printk("dev=%s [%d/%d] dev_addr=0x%llx phys_addr=0x%llx size=%zu",
- __get_str(dev_name), __entry->index, __entry->total,
- (unsigned long long)__entry->dev_addr,
- (unsigned long long)__entry->phys_addr,
- __entry->size)
-);
-
-DEFINE_EVENT(dma_map_sg, map_sg,
-   TP_PROTO(struct device *dev, int index, int total,
-struct scatterlist *sg),
-   TP_ARGS(dev, index, total, sg)
-);
-
-DEFINE_EVENT(dma_map_sg, bounce_map_sg,
-   TP_PROTO(struct device *dev, int index, int total,
-struct scatterlist *sg),
-   TP_ARGS(dev, index, total, sg)
-);
-
 TRACE_EVENT(qi_submit,
TP_PROTO(struct intel_iommu *iommu, u64 qw0, u64 qw1, u64 qw2, u64 qw3),
 
-- 
2.25.1



[PATCH 4/5] iommu/vt-d: Remove unused function declarations

2021-03-22 Thread Lu Baolu
Some functions have been deprecated. Remove the remaining function
declarations.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/pasid.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 444c0bec221a..90a3268d7a77 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -99,9 +99,6 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
 }
 
 extern unsigned int intel_pasid_max_id;
-int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
-void intel_pasid_free_id(u32 pasid);
-void *intel_pasid_lookup_id(u32 pasid);
 int intel_pasid_alloc_table(struct device *dev);
 void intel_pasid_free_table(struct device *dev);
 struct pasid_table *intel_pasid_get_table(struct device *dev);
-- 
2.25.1



[PATCH v2 5/5] iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown

2021-03-19 Thread Lu Baolu
When a present pasid entry is disassembled, all kinds of pasid related
caches need to be flushed. But when a pasid entry is not being used
(PRESENT bit not set), we don't need to do this. Check the PRESENT bit
in intel_pasid_tear_down_entry() and avoid flushing caches if it's not
set.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/pasid.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index dd69df5a188a..7a73385edcc0 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -502,6 +502,9 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, struct device *dev,
if (WARN_ON(!pte))
return;
 
+   if (!(pte->val[0] & PASID_PTE_PRESENT))
+   return;
+
did = pasid_get_domain_id(pte);
intel_pasid_clear_entry(dev, pasid, fault_ignore);
 
-- 
2.25.1



[PATCH v2 3/5] iommu/vt-d: Invalidate PASID cache when root/context entry changed

2021-03-19 Thread Lu Baolu
When the Intel IOMMU is operating in the scalable mode, some information
from the root and context table may be used to tag entries in the PASID
cache. Software should invalidate the PASID-cache when changing root or
context table entries.

Suggested-by: Ashok Raj 
Fixes: 7373a8cc38197 ("iommu/vt-d: Setup context and enable RID2PASID support")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 18 +-
 include/linux/intel-iommu.h |  1 +
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 132cbf9f214f..868f195f55ff 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1339,6 +1339,11 @@ static void iommu_set_root_entry(struct intel_iommu *iommu)
  readl, (sts & DMA_GSTS_RTPS), sts);
 
	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
+
+   iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
+   if (sm_supported(iommu))
+   qi_flush_pasid_cache(iommu, 0, QI_PC_GLOBAL, 0);
+   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
 }
 
 void iommu_flush_write_buffer(struct intel_iommu *iommu)
@@ -2422,6 +2427,10 @@ static void domain_context_clear_one(struct intel_iommu *iommu, u8 bus, u8 devfn
   (((u16)bus) << 8) | devfn,
   DMA_CCMD_MASK_NOBIT,
   DMA_CCMD_DEVICE_INVL);
+
+   if (sm_supported(iommu))
+   qi_flush_pasid_cache(iommu, did_old, QI_PC_ALL_PASIDS, 0);
+
iommu->flush.flush_iotlb(iommu,
 did_old,
 0,
@@ -3267,8 +3276,6 @@ static int __init init_dmars(void)
register_pasid_allocator(iommu);
 #endif
iommu_set_root_entry(iommu);
-   iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
}
 
 #ifdef CONFIG_INTEL_IOMMU_BROKEN_GFX_WA
@@ -3458,12 +3465,7 @@ static int init_iommu_hw(void)
}
 
iommu_flush_write_buffer(iommu);
-
iommu_set_root_entry(iommu);
-
-   iommu->flush.flush_context(iommu, 0, 0, 0,
-  DMA_CCMD_GLOBAL_INVL);
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
iommu_enable_translation(iommu);
iommu_disable_protect_mem_regions(iommu);
}
@@ -3846,8 +3848,6 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
goto disable_iommu;
 
iommu_set_root_entry(iommu);
-   iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
iommu_enable_translation(iommu);
 
iommu_disable_protect_mem_regions(iommu);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 1732298ce888..76f974da8ca4 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -378,6 +378,7 @@ enum {
 /* PASID cache invalidation granu */
 #define QI_PC_ALL_PASIDS   0
#define QI_PC_PASID_SEL    1
+#define QI_PC_GLOBAL   3
 
 #define QI_EIOTLB_ADDR(addr)   ((u64)(addr) & VTD_PAGE_MASK)
 #define QI_EIOTLB_IH(ih)   (((u64)ih) << 6)
-- 
2.25.1



[PATCH v2 4/5] iommu/vt-d: Use user privilege for RID2PASID translation

2021-03-19 Thread Lu Baolu
When first-level page tables are used for IOVA translation, we use user
privilege by setting U/S bit in the page table entry. This is to make it
consistent with the second level translation, where the U/S enforcement
is not available. Clear the SRE (Supervisor Request Enable) field in the
pasid table entry of RID2PASID so that requests requesting the supervisor
privilege are blocked and treated as DMA remapping faults.

Suggested-by: Jacob Pan 
Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 7 +--
 drivers/iommu/intel/pasid.c | 3 ++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 868f195f55ff..7354f9ce47d8 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2494,9 +2494,9 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
struct device *dev,
u32 pasid)
 {
-   int flags = PASID_FLAG_SUPERVISOR_MODE;
struct dma_pte *pgd = domain->pgd;
int agaw, level;
+   int flags = 0;
 
/*
 * Skip top levels of page tables for iommu which has
@@ -2512,7 +2512,10 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
if (level != 4 && level != 5)
return -EINVAL;
 
-   flags |= (level == 5) ? PASID_FLAG_FL5LP : 0;
+   if (pasid != PASID_RID2PASID)
+   flags |= PASID_FLAG_SUPERVISOR_MODE;
+   if (level == 5)
+   flags |= PASID_FLAG_FL5LP;
 
return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid,
 domain->iommu_did[iommu->seq_id],
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 0bf7e0a76890..dd69df5a188a 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -673,7 +673,8 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 * Since it is a second level only translation setup, we should
 * set SRE bit as well (addresses are expected to be GPAs).
 */
-   pasid_set_sre(pte);
+   if (pasid != PASID_RID2PASID)
+   pasid_set_sre(pte);
pasid_set_present(pte);
pasid_flush_caches(iommu, pte, pasid, did);
 
-- 
2.25.1



[PATCH v2 0/5] iommu/vt-d: Several misc fixes

2021-03-19 Thread Lu Baolu
Hi Joerg,

This series includes some misc fixes for the VT-d iommu driver. Please
help to review and merge.

Best regards,
baolu

Change log:
 v1->v2:
  - v1: 
https://lore.kernel.org/linux-iommu/20210225062654.2864322-1-baolu...@linux.intel.com/
  - [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries
 - Refine the commit message to make the intention clear.

Lu Baolu (5):
  iommu/vt-d: Report the right page fault address
  iommu/vt-d: Remove WO permissions on second-level paging entries
  iommu/vt-d: Invalidate PASID cache when root/context entry changed
  iommu/vt-d: Use user privilege for RID2PASID translation
  iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown

 drivers/iommu/intel/iommu.c | 28 
 drivers/iommu/intel/pasid.c |  6 +-
 drivers/iommu/intel/svm.c   |  2 +-
 include/linux/intel-iommu.h |  1 +
 4 files changed, 23 insertions(+), 14 deletions(-)

-- 
2.25.1



[PATCH v2 1/5] iommu/vt-d: Report the right page fault address

2021-03-19 Thread Lu Baolu
The Address field of the Page Request Descriptor only keeps bit [63:12]
of the offending address. Convert it to a full address before reporting
it to device drivers.

Fixes: eb8d93ea3c1d3 ("iommu/vt-d: Report page request faults for guest SVA")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 54db58945c2d..677d7f6b43bb 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -862,7 +862,7 @@ intel_svm_prq_report(struct device *dev, struct page_req_dsc *desc)
/* Fill in event data for device specific processing */
memset(, 0, sizeof(struct iommu_fault_event));
event.fault.type = IOMMU_FAULT_PAGE_REQ;
-   event.fault.prm.addr = desc->addr;
+   event.fault.prm.addr = (u64)desc->addr << VTD_PAGE_SHIFT;
event.fault.prm.pasid = desc->pasid;
event.fault.prm.grpid = desc->prg_index;
event.fault.prm.perm = prq_to_iommu_prot(desc);
-- 
2.25.1
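
For illustration only, the conversion in the patch is a plain shift; a self-contained check with a made-up descriptor value:

#include <stdio.h>
#include <stdint.h>

#define VTD_PAGE_SHIFT	12

int main(void)
{
	/* What the Page Request Descriptor carries: the address >> 12. */
	uint64_t desc_addr = 0x7f1234567ULL;

	uint64_t reported = (uint64_t)desc_addr << VTD_PAGE_SHIFT;

	printf("raw descriptor field : 0x%llx\n", (unsigned long long)desc_addr);
	printf("reported full address: 0x%llx\n", (unsigned long long)reported);
	return 0;
}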



[PATCH v2 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries

2021-03-19 Thread Lu Baolu
When the first level page table is used for IOVA translation, it only
supports Read-Only and Read-Write permissions. The Write-Only permission
is not supported as the PRESENT bit (implying Read permission) should
always be set. When using the second level, we still give separate permissions
that allow Write-Only, which seems inconsistent and awkward. We want to
have consistent behavior. After moving to 1st level, we don't want things
to work sometimes, and break if we use 2nd level for the same mappings.
Hence remove this configuration.

Suggested-by: Ashok Raj 
Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 167219ea8d70..132cbf9f214f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2304,8 +2304,9 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
return -EINVAL;
 
attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);
+   attr |= DMA_FL_PTE_PRESENT;
if (domain_use_first_level(domain)) {
-   attr |= DMA_FL_PTE_PRESENT | DMA_FL_PTE_XD | DMA_FL_PTE_US;
+   attr |= DMA_FL_PTE_XD | DMA_FL_PTE_US;
 
if (domain->domain.type == IOMMU_DOMAIN_DMA) {
attr |= DMA_FL_PTE_ACCESS;
-- 
2.25.1



[PATCH v2 1/1] iommu/vt-d: Don't set then clear private data in prq_event_thread()

2021-03-19 Thread Lu Baolu
The VT-d specification (section 7.6) requires that the value in the
Private Data field of a Page Group Response Descriptor must match
the value in the Private Data field of the respective Page Request
Descriptor.

The private data field of a page group response descriptor is set and then
immediately cleared in prq_event_thread(). This breaks the rule defined
by the VT-d specification. Fix it by moving the clearing code up.

Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode")
Cc: Jacob Pan 
Reviewed-by: Liu Yi L 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Log:
v1->v2:
  - v1: 
https://lore.kernel.org/linux-iommu/20210309004641.3809653-1-baolu...@linux.intel.com/
  - Refine the commit title to make the affected field clear.
  - Refine the commit message to declare why the change matters.

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 677d7f6b43bb..5d590d63ab52 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1034,12 +1034,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
QI_PGRP_RESP_TYPE;
resp.qw1 = QI_PGRP_IDX(req->prg_index) |
QI_PGRP_LPIG(req->lpig);
+   resp.qw2 = 0;
+   resp.qw3 = 0;
 
if (req->priv_data_present)
			memcpy(&resp.qw2, req->priv_data,
   sizeof(req->priv_data));
-   resp.qw2 = 0;
-   resp.qw3 = 0;
		qi_submit_sync(iommu, &resp, 1, 0);
}
 prq_advance:
-- 
2.25.1



[PATCH v2 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()

2021-03-19 Thread Lu Baolu
The pasid_lock is used to prevent different threads from modifying the
same pasid directory entry at the same time. It causes the below lockdep splat.

[   83.296538] 
[   83.296538] WARNING: possible irq lock inversion dependency detected
[   83.296539] 5.12.0-rc3+ #25 Tainted: GW
[   83.296539] 
[   83.296540] bash/780 just changed the state of lock:
[   83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at:
   iommu_flush_dev_iotlb.part.0+0x32/0x110
[   83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   83.296547]  (pasid_lock){+.+.}-{2:2}
[   83.296548]

   and interrupts could create inverse lock ordering between them.

[   83.296549] other info that might help us debug this:
[   83.296549] Chain exists of:
 device_domain_lock --> &iommu->lock --> pasid_lock
[   83.296551]  Possible interrupt unsafe locking scenario:

[   83.296551]CPU0CPU1
[   83.296552]
[   83.296552]   lock(pasid_lock);
[   83.296553]local_irq_disable();
[   83.296553]lock(device_domain_lock);
[   83.296554]lock(&iommu->lock);
[   83.296554]   <Interrupt>
[   83.296554] lock(device_domain_lock);
[   83.296555]
*** DEADLOCK ***

Fix it by replacing the pasid_lock with an atomic exchange operation.

Reported-and-tested-by: Dave Jiang 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/pasid.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

Log:
v1->v2:
  - v1: 
https://lore.kernel.org/linux-iommu/20210317005834.173503-1-baolu...@linux.intel.com/
  - Use retry to make code neat;
  - Add a comment about no clear case, hence no race.

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 7a73385edcc0..f2c747e62c6a 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -24,7 +24,6 @@
 /*
  * Intel IOMMU system wide PASID name space:
  */
-static DEFINE_SPINLOCK(pasid_lock);
 u32 intel_pasid_max_id = PASID_MAX;
 
 int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid)
@@ -259,19 +258,25 @@ struct pasid_entry *intel_pasid_get_entry(struct device 
*dev, u32 pasid)
dir_index = pasid >> PASID_PDE_SHIFT;
index = pasid & PASID_PTE_MASK;
 
-   spin_lock(&pasid_lock);
+retry:
	entries = get_pasid_table_from_pde(&dir[dir_index]);
if (!entries) {
entries = alloc_pgtable_page(info->iommu->node);
-   if (!entries) {
-   spin_unlock(&pasid_lock);
+   if (!entries)
return NULL;
-   }
 
-   WRITE_ONCE(dir[dir_index].val,
-  (u64)virt_to_phys(entries) | PASID_PTE_PRESENT);
+   /*
+* The pasid directory table entry won't be freed after
+* allocation. No worry about the race with free and
+* clear. However, this entry might be populated by others
+* while we are preparing it. Use theirs with a retry.
+*/
+   if (cmpxchg64(&dir[dir_index].val, 0ULL,
+ (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
+   free_pgtable_page(entries);
+   goto retry;
+   }
}
-   spin_unlock(&pasid_lock);
 
	return &entries[index];
 }
-- 
2.25.1



Re: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-03-19 Thread Lu Baolu

On 3/19/21 9:30 AM, Keqian Zhu wrote:

Hi Baolu,

On 2021/3/19 8:33, Lu Baolu wrote:

On 3/18/21 7:53 PM, Shenming Lu wrote:

On 2021/3/18 17:07, Tian, Kevin wrote:

From: Shenming Lu
Sent: Thursday, March 18, 2021 3:53 PM

On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many

devices allow I/O faulting only in selective contexts. However, there
is no standard way (e.g. PCISIG) for the device to report whether
arbitrary I/O fault is allowed. Then we may have to maintain device
specific knowledge in software, e.g. in an opt-in table to list devices
which allows arbitrary faults. For devices which only support selective
faulting, a mediator (either through vendor extensions on vfio-pci-core
or a mdev wrapper) might be necessary to help lock down non-faultable
mappings and then enable faulting on the rest mappings.

For devices which only support selective faulting, they could tell it to the
IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?

Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
selectively page-pinning. The matter is that 'they' imply some device
specific logic to decide which pages must be pinned and such knowledge
is outside of VFIO.

 From an enabling p.o.v. we could possibly do it in a phased approach. First
handle devices which tolerate arbitrary DMA faults, and then extend
to devices with selective-faulting. The former is simpler, but with one
main open question: whether we want to maintain such device IDs in a static
table in VFIO or rely on some hints from other components (e.g. PF
driver in VF assignment case). Let's see how Alex thinks about it.

Hi Kevin,

You mentioned selective-faulting some time ago. I still have some doubt
about it:
There is already a vfio_pin_pages() which is used for limiting the IOMMU
group dirty scope to pinned pages; could it also be used to indicate that
the faultable scope is limited to the pinned pages and that the rest of the
mappings are non-faultable and should be pinned and mapped immediately? But it
seems a little weird and not exactly what you meant... I will
be grateful if you can help to explain further.:-)


The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
pages that are not faultable (based on its specific knowledge) and then
the rest memory becomes faultable.

Ahh...
Thus, from the perspective of VFIO IOMMU, if IOPF is enabled for such a device,
only the page faults within the pinned range are valid in the registered
iommu fault handler...

Isn't it the opposite? The pinned pages will never generate any page faults.
I might be missing some context here.

It seems that vfio_pin_pages() just pins some pages and records the pinned scope
in the pfn_list of vfio_dma.
No mapping is established, so we still have page faults.


Makes sense. Thanks a lot for the explanation.



IIUC, vfio_pin_pages() is used to
1. pin pages for non-iommu backed devices.
2. mark dirty scope for non-iommu backed devices and iommu backed devices.
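
For the selective-faulting case, a rough sketch of how a vendor driver could
use it (sketch only, untested; the context structure and range fields below
are made up, and I'm assuming the current vfio_pin_pages() prototype):

/*
 * Hypothetical per-device context. Only the pages described by
 * user_pfns[]/npages must never fault; everything else is left to IOPF.
 */
struct my_dev_ctx {
	struct device *dev;
	unsigned long *user_pfns;
	unsigned long *phys_pfns;
	int npages;
};

static int my_pin_non_faultable(struct my_dev_ctx *ctx)
{
	/* Lock down the ranges the device cannot tolerate faults on. */
	int ret = vfio_pin_pages(ctx->dev, ctx->user_pfns, ctx->npages,
				 IOMMU_READ | IOMMU_WRITE, ctx->phys_pfns);

	return ret < 0 ? ret : 0;
}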


Best regards,
baolu


Re: [PATCH 1/1] iommu/vt-d: Don't set then immediately clear in prq_event_thread()

2021-03-18 Thread Lu Baolu

Hi Joerg,

On 3/18/21 6:10 PM, Joerg Roedel wrote:

Hi Baolu,

On Tue, Mar 09, 2021 at 08:46:41AM +0800, Lu Baolu wrote:

The private data field of a page group response descriptor is set then
immediately cleared in prq_event_thread(). Fix this by moving clearing
code up.

Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode")
Cc: Jacob Pan 
Reviewed-by: Liu Yi L 
Signed-off-by: Lu Baolu 


Does this fix an actual bug? If so, please state it in the commit


It will cause a real problem according to the VT-d spec. I haven't had a
chance to run this on real hardware yet. I'll add a commit message to
explain why this causes a problem.


message and also fix the subject line to state what is set/cleared.



Sure!


Thanks,

Joerg



Best regards,
baolu


Re: [PATCH 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()

2021-03-18 Thread Lu Baolu

Hi Joerg,

On 3/18/21 6:21 PM, Joerg Roedel wrote:

On Wed, Mar 17, 2021 at 08:58:34AM +0800, Lu Baolu wrote:

The pasid_lock is used to keep different threads from modifying the
same pasid directory entry at the same time. It causes the lockdep splat below.

[   83.296538] 
[   83.296538] WARNING: possible irq lock inversion dependency detected
[   83.296539] 5.12.0-rc3+ #25 Tainted: GW
[   83.296539] 
[   83.296540] bash/780 just changed the state of lock:
[   83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at:
iommu_flush_dev_iotlb.part.0+0x32/0x110
[   83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   83.296547]  (pasid_lock){+.+.}-{2:2}
[   83.296548]

and interrupts could create inverse lock ordering between them.

[   83.296549] other info that might help us debug this:
[   83.296549] Chain exists of:
  device_domain_lock --> &iommu->lock --> pasid_lock
[   83.296551]  Possible interrupt unsafe locking scenario:

[   83.296551]CPU0CPU1
[   83.296552]
[   83.296552]   lock(pasid_lock);
[   83.296553]local_irq_disable();
[   83.296553]lock(device_domain_lock);
[   83.296554]lock(&iommu->lock);
[   83.296554]   <Interrupt>
[   83.296554] lock(device_domain_lock);
[   83.296555]
 *** DEADLOCK ***

Fix it by replacing the pasid_lock with an atomic exchange operation.

Reported-and-tested-by: Dave Jiang 
Signed-off-by: Lu Baolu 
---
  drivers/iommu/intel/pasid.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 9fb3d3e80408..1ddcb8295f72 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -24,7 +24,6 @@
  /*
   * Intel IOMMU system wide PASID name space:
   */
-static DEFINE_SPINLOCK(pasid_lock);
  u32 intel_pasid_max_id = PASID_MAX;
  
  int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid)

@@ -259,19 +258,18 @@ struct pasid_entry *intel_pasid_get_entry(struct device 
*dev, u32 pasid)
dir_index = pasid >> PASID_PDE_SHIFT;
index = pasid & PASID_PTE_MASK;
  
-	spin_lock(&pasid_lock);

	entries = get_pasid_table_from_pde(&dir[dir_index]);
if (!entries) {
entries = alloc_pgtable_page(info->iommu->node);
-   if (!entries) {
-   spin_unlock(&pasid_lock);
+   if (!entries)
return NULL;
-   }
  
-		WRITE_ONCE(dir[dir_index].val,

-  (u64)virt_to_phys(entries) | PASID_PTE_PRESENT);
+   if (cmpxchg64(&dir[dir_index].val, 0ULL,
+ (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
+   free_pgtable_page(entries);
+   entries = get_pasid_table_from_pde(&dir[dir_index]);


This is racy, someone could have already cleared the pasid-entry again.


This code modifies the pasid directory entry. The pasid directory
entries are allocated on demand and will never be freed.


What you need to do here is to retry the whole path by adding a goto
to before  the first get_pasid_table_from_pde() call.


Yes. Retrying by adding a goto makes the code clearer.



Btw, what makes sure that the pasid_entry does not go away when it is
returned here?


As explained above, it handles the pasid directory table entry which
won't go away.



Regards,

Joerg



Best regards,
baolu


Re: [RFC PATCH v1 0/4] vfio: Add IOPF support for VFIO passthrough

2021-03-18 Thread Lu Baolu

On 3/18/21 7:53 PM, Shenming Lu wrote:

On 2021/3/18 17:07, Tian, Kevin wrote:

From: Shenming Lu 
Sent: Thursday, March 18, 2021 3:53 PM

On 2021/2/4 14:52, Tian, Kevin wrote:>>> In reality, many

devices allow I/O faulting only in selective contexts. However, there
is no standard way (e.g. PCISIG) for the device to report whether
arbitrary I/O fault is allowed. Then we may have to maintain device
specific knowledge in software, e.g. in an opt-in table to list devices
which allows arbitrary faults. For devices which only support selective
faulting, a mediator (either through vendor extensions on vfio-pci-core
or a mdev wrapper) might be necessary to help lock down non-faultable
mappings and then enable faulting on the rest mappings.


For devices which only support selective faulting, they could tell it to the
IOMMU driver and let it filter out non-faultable faults? Do I get it wrong?


Not exactly to IOMMU driver. There is already a vfio_pin_pages() for
selectively page-pinning. The matter is that 'they' imply some device
specific logic to decide which pages must be pinned and such knowledge
is outside of VFIO.

 From an enabling p.o.v. we could possibly do it in a phased approach. First
handle devices which tolerate arbitrary DMA faults, and then extend
to devices with selective-faulting. The former is simpler, but with one
main open question: whether we want to maintain such device IDs in a static
table in VFIO or rely on some hints from other components (e.g. PF
driver in VF assignment case). Let's see how Alex thinks about it.


Hi Kevin,

You mentioned selective-faulting some time ago. I still have some doubt
about it:
There is already a vfio_pin_pages() which is used for limiting the IOMMU
group dirty scope to pinned pages; could it also be used to indicate that
the faultable scope is limited to the pinned pages and that the rest of the
mappings are non-faultable and should be pinned and mapped immediately? But it
seems a little weird and not exactly what you meant... I will
be grateful if you can help to explain further. :-)



The opposite, i.e. the vendor driver uses vfio_pin_pages to lock down
pages that are not faultable (based on its specific knowledge) and then
the rest memory becomes faultable.


Ahh...
Thus, from the perspective of VFIO IOMMU, if IOPF is enabled for such a device,
only the page faults within the pinned range are valid in the registered
iommu fault handler...


Isn't it the opposite? The pinned pages will never generate any page faults.
I might be missing some context here.


I have another question here: for IOMMU-backed devices, everything is already
pinned and mapped when attaching, so is there a need to call vfio_pin_pages()
to lock down pages for them? Did I miss something?...


Best regards,
baolu


Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries

2021-03-18 Thread Lu Baolu

Hi Joerg,

On 3/18/21 5:12 PM, Joerg Roedel wrote:

Hi,

On Mon, Mar 08, 2021 at 11:47:46AM -0800, Raj, Ashok wrote:

That is the primary motivation, given that we have moved to 1st level for
general IOVA, first level doesn't have a WO mapping. I didn't know enough
about the history to determine if a WO without a READ is very useful. I
guess the ZLR was invented to support those cases without a READ in PCIe. I


Okay, please update the commit message and re-send. I guess these
patches are 5.13 stuff. In that case, Baolu can include them into his
pull request later this cycle.


Okay! It works for me.

Best regards,
baolu


Re: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Lu Baolu

On 3/18/21 4:56 PM, Tian, Kevin wrote:

From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)



-Original Message-
From: Tian, Kevin [mailto:kevin.t...@intel.com]
Sent: Thursday, March 18, 2021 4:27 PM
To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
; Nadav Amit 
Cc: chenjiashang ; David Woodhouse
; io...@lists.linux-foundation.org; LKML
; alex.william...@redhat.com; Gonglei

(Arei)

; w...@kernel.org
Subject: RE: A problem of Intel IOMMU hardware ?


From: iommu  On Behalf Of
Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


2. Consider ensuring that the problem is not somehow related to
queued invalidations. Try to use __iommu_flush_iotlb() instead of

qi_flush_iotlb().




I tried to force to use __iommu_flush_iotlb(), but maybe something
wrong, the system crashed, so I prefer to lower the priority of this

operation.




The VT-d spec clearly says that register-based invalidation can be used only

when

queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide

an

option to disable queued-invalidation though, when the hardware is

capable. If you

really want to try, tweak the code in intel_iommu_init_qi.



Hi Kevin,

Thanks for pointing this out. Do you have any ideas about this problem? I tried
to describe the problem more clearly in my reply to Alex; I hope you could have
a look if you're interested.



btw I saw you used 4.18 kernel in this test. What about latest kernel?

Also one way to separate sw/hw bug is to trace the low level interface (e.g.,
qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU
hardware. Check the window between b) and c) and see whether the
software does the right thing as expected there.


Yes. It's better if we can reproduce this with the latest kernel which
has debugfs files to expose page tables and the invalidation queues etc.

Best regards,
baolu


Re: A problem of Intel IOMMU hardware ?

2021-03-17 Thread Lu Baolu

Hi Nadav,

On 3/18/21 2:12 AM, Nadav Amit wrote:




On Mar 17, 2021, at 2:35 AM, Longpeng (Mike, Cloud Infrastructure Service Product 
Dept.)  wrote:

Hi Nadav,


-Original Message-
From: Nadav Amit [mailto:nadav.a...@gmail.com]

  reproduce the problem with high probability (~50%).


I saw Lu replied, and he is much more knowledgable than I am (I was just 
intrigued
by your email).

However, if I were you I would try also to remove some “optimizations” to look 
for
the root-cause (e.g., use domain specific invalidations instead of 
page-specific).



Good suggestion! But we did it these days, we tried to use global invalidations 
as follow:
iommu->flush.flush_iotlb(iommu, did, 0, 0,
DMA_TLB_DSI_FLUSH);
But can not resolve the problem.


The first thing that comes to my mind is the invalidation hint (ih) in
iommu_flush_iotlb_psi(). I would remove it to see whether you get the failure
without it.


We also notice the IH, but the IH is always ZERO in our case, as the spec says:
'''
Paging-structure-cache entries caching second-level mappings associated with 
the specified
domain-id and the second-level-input-address range are invalidated, if the 
Invalidation Hint
(IH) field is Clear.
'''

It seems the software side is fine, so we've no choice but to suspect the
hardware.


Ok, I am pretty much out of ideas. I have two more suggestions, but
they are much less likely to help. Yet, they can further help to rule
out software bugs:

1. dma_clear_pte() seems to be wrong IMHO. It should have used WRITE_ONCE()
to prevent split-write, which might potentially cause “invalid” (partially
cleared) PTE to be stored in the TLB. Having said that, the subsequent
IOTLB flush should have prevented the problem.


Agreed. The pte read/write should use READ/WRITE_ONCE() instead.
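
For example, a minimal sketch (not a patch) of what that could look like for
dma_clear_pte():

static inline void dma_clear_pte(struct dma_pte *pte)
{
	/* Single store, so a concurrent walker never sees a half-cleared PTE. */
	WRITE_ONCE(pte->val, 0);
}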



2. Consider ensuring that the problem is not somehow related to queued
invalidations. Try to use __iommu_flush_iotlb() instead of
qi_flush_iotlb().

Regards,
Nadav



Best regards,
baolu


Re: A problem of Intel IOMMU hardware ?

2021-03-17 Thread Lu Baolu

Hi Alex,

On 3/17/21 11:18 PM, Alex Williamson wrote:

  {MAP,   0x0, 0xc000}, - (b)
  use GDB to pause at here, and then DMA read IOVA=0,

IOVA 0 seems to be a special one. Have you verified with other addresses
than IOVA 0?

It is???  That would be a problem.



No problem from the hardware point of view as far as I can see. I just
thought software might handle it specially.

Best regards,
baolu


[PATCH 1/1] iommu/vt-d: Report more information about invalidation errors

2021-03-17 Thread Lu Baolu
When the invalidation queue errors are encountered, dump the information
logged by the VT-d hardware together with the pending queue invalidation
descriptors.

Signed-off-by: Ashok Raj 
Tested-by: Guo Kaijie 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/dmar.c  | 68 ++---
 include/linux/intel-iommu.h |  6 
 2 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index d5c51b5c20af..6971397805f3 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -1205,6 +1205,63 @@ static inline void reclaim_free_desc(struct q_inval *qi)
}
 }
 
+static const char *qi_type_string(u8 type)
+{
+   switch (type) {
+   case QI_CC_TYPE:
+   return "Context-cache Invalidation";
+   case QI_IOTLB_TYPE:
+   return "IOTLB Invalidation";
+   case QI_DIOTLB_TYPE:
+   return "Device-TLB Invalidation";
+   case QI_IEC_TYPE:
+   return "Interrupt Entry Cache Invalidation";
+   case QI_IWD_TYPE:
+   return "Invalidation Wait";
+   case QI_EIOTLB_TYPE:
+   return "PASID-based IOTLB Invalidation";
+   case QI_PC_TYPE:
+   return "PASID-cache Invalidation";
+   case QI_DEIOTLB_TYPE:
+   return "PASID-based Device-TLB Invalidation";
+   case QI_PGRP_RESP_TYPE:
+   return "Page Group Response";
+   default:
+   return "UNKNOWN";
+   }
+}
+
+static void qi_dump_fault(struct intel_iommu *iommu, u32 fault)
+{
+   unsigned int head = dmar_readl(iommu->reg + DMAR_IQH_REG);
+   u64 iqe_err = dmar_readq(iommu->reg + DMAR_IQER_REG);
+   struct qi_desc *desc = iommu->qi->desc + head;
+
+   if (fault & DMA_FSTS_IQE)
+   pr_err("VT-d detected Invalidation Queue Error: Reason %llx",
+  DMAR_IQER_REG_IQEI(iqe_err));
+   if (fault & DMA_FSTS_ITE)
+   pr_err("VT-d detected Invalidation Time-out Error: SID %llx",
+  DMAR_IQER_REG_ITESID(iqe_err));
+   if (fault & DMA_FSTS_ICE)
+   pr_err("VT-d detected Invalidation Completion Error: SID %llx",
+  DMAR_IQER_REG_ICESID(iqe_err));
+
+   pr_err("QI HEAD: %s qw0 = 0x%llx, qw1 = 0x%llx\n",
+  qi_type_string(desc->qw0 & 0xf),
+  (unsigned long long)desc->qw0,
+  (unsigned long long)desc->qw1);
+
+   head = ((head >> qi_shift(iommu)) + QI_LENGTH - 1) % QI_LENGTH;
+   head <<= qi_shift(iommu);
+   desc = iommu->qi->desc + head;
+
+   pr_err("QI PRIOR: %s qw0 = 0x%llx, qw1 = 0x%llx\n",
+  qi_type_string(desc->qw0 & 0xf),
+  (unsigned long long)desc->qw0,
+  (unsigned long long)desc->qw1);
+}
+
 static int qi_check_fault(struct intel_iommu *iommu, int index, int wait_index)
 {
u32 fault;
@@ -1216,6 +1273,8 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index, int wait_index)
return -EAGAIN;
 
fault = readl(iommu->reg + DMAR_FSTS_REG);
+   if (fault & (DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE))
+   qi_dump_fault(iommu, fault);
 
/*
 * If IQE happens, the head points to the descriptor associated
@@ -1232,12 +1291,10 @@ static int qi_check_fault(struct intel_iommu *iommu, 
int index, int wait_index)
 * used by software as private data. We won't print
 * out these two qw's for security consideration.
 */
-   pr_err("VT-d detected invalid descriptor: qw0 = %llx, 
qw1 = %llx\n",
-  (unsigned long long)desc->qw0,
-  (unsigned long long)desc->qw1);
memcpy(desc, qi->desc + (wait_index << shift),
   1 << shift);
writel(DMA_FSTS_IQE, iommu->reg + DMAR_FSTS_REG);
+   pr_info("Invalidation Queue Error (IQE) cleared\n");
return -EINVAL;
}
}
@@ -1254,6 +1311,7 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index, int wait_index)
tail = ((tail >> shift) - 1 + QI_LENGTH) % QI_LENGTH;
 
writel(DMA_FSTS_ITE, iommu->reg + DMAR_FSTS_REG);
+   pr_info("Invalidation Time-out Error (ITE) cleared\n");
 
do {
if (qi->desc_status[head] == QI_IN_USE)
@@ -1265,8 +1323,10 @@ static int qi_check_fault(struct intel_iommu *iommu, int 
index, int wait_index)
return -EAGA

Re: A problem of Intel IOMMU hardware ?

2021-03-16 Thread Lu Baolu

Hi Longpeng,

On 3/17/21 11:16 AM, Longpeng (Mike, Cloud Infrastructure Service 
Product Dept.) wrote:

Hi guys,

We find that the Intel iommu cache (i.e. iotlb) may work incorrectly in a special
situation; it can cause DMA failures or return wrong data.

The reproducer (based on Alex's vfio testsuite[1]) is attached; it can
reproduce the problem with high probability (~50%).

The machine we used is:
processor   : 47
vendor_id   : GenuineIntel
cpu family  : 6
model   : 85
model name  : Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz
stepping: 4
microcode   : 0x269

And the iommu capability reported is:
ver 1:0 cap 8d2078c106f0466 ecap f020df
(caching mode = 0 , page-selective invalidation = 1)

(The problem is also on 'Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz' and
'Intel(R) Xeon(R) Platinum 8378A CPU @ 3.00GHz')

We run the reproducer on Linux 4.18 and it works as follow:

Step 1. alloc 4G *2M-hugetlb* memory (N.B. no problem with 4K-page mapping)


I don't understand what 2M-hugetlb means here exactly. The IOMMU hardware
supports both 2M and 1G super pages. The mapped physical memory is 4G.
Why couldn't it use 1G super pages?


Step 2. DMA Map 4G memory
Step 3.
 while (1) {
 {UNMAP, 0x0, 0xa},  (a)
 {UNMAP, 0xc, 0xbff4},


Have these two ranges been mapped before? Does the IOMMU driver
complain when you try to unmap a range which has never been
mapped? The IOMMU driver implicitly assumes that mapping and
unmapping are paired.


 {MAP,   0x0, 0xc000}, - (b)
 use GDB to pause at here, and then DMA read IOVA=0,


IOVA 0 seems to be a special one. Have you verified with other addresses
than IOVA 0?


 sometimes DMA success (as expected),
 but sometimes DMA error (report not-present).
 {UNMAP, 0x0, 0xc000}, - (c)
 {MAP,   0x0, 0xa},
 {MAP,   0xc, 0xbff4},
 }

The DMA read operations should succeed between (b) and (c); it should NOT report
not-present at least!

After analyzing the problem, we think maybe it's caused by the Intel iommu iotlb.
It seems the DMA Remapping hardware still uses the IOTLB or other caches of (a).

When do DMA unmap at (a), the iotlb will be flush:
 intel_iommu_unmap
 domain_unmap
 iommu_flush_iotlb_psi

When do DMA map at (b), no need to flush the iotlb according to the capability
of this iommu:
 intel_iommu_map
 domain_pfn_mapping
 domain_mapping
 __mapping_notify_one
 if (cap_caching_mode(iommu->cap)) // FALSE
 iommu_flush_iotlb_psi


That's true. The iotlb flushing is not needed in the case of a PTE being
changed from non-present to present, unless caching mode is enabled.
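
To make that concrete, the map path only flushes the write buffer when
caching mode is off; paraphrased sketch of the code being discussed
(untested, trimmed):

static void __mapping_notify_one(struct intel_iommu *iommu,
				 struct dmar_domain *domain,
				 unsigned long pfn, unsigned int pages)
{
	/* Non-present to present change: IOTLB flush only in caching mode. */
	if (cap_caching_mode(iommu->cap))
		iommu_flush_iotlb_psi(iommu, domain, pfn, pages, 0, 1);
	else
		iommu_flush_write_buffer(iommu);
}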


But the problem will disappear if we FORCE flush here. So we suspect the iommu
hardware.

Do you have any suggestion ?


Best regards,
baolu


[PATCH 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()

2021-03-16 Thread Lu Baolu
The pasid_lock is used to keep different threads from modifying the
same pasid directory entry at the same time. It causes the lockdep splat below.

[   83.296538] 
[   83.296538] WARNING: possible irq lock inversion dependency detected
[   83.296539] 5.12.0-rc3+ #25 Tainted: GW
[   83.296539] 
[   83.296540] bash/780 just changed the state of lock:
[   83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at:
   iommu_flush_dev_iotlb.part.0+0x32/0x110
[   83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   83.296547]  (pasid_lock){+.+.}-{2:2}
[   83.296548]

   and interrupts could create inverse lock ordering between them.

[   83.296549] other info that might help us debug this:
[   83.296549] Chain exists of:
 device_domain_lock --> &iommu->lock --> pasid_lock
[   83.296551]  Possible interrupt unsafe locking scenario:

[   83.296551]CPU0CPU1
[   83.296552]
[   83.296552]   lock(pasid_lock);
[   83.296553]local_irq_disable();
[   83.296553]lock(device_domain_lock);
[   83.296554]lock(&iommu->lock);
[   83.296554]   <Interrupt>
[   83.296554] lock(device_domain_lock);
[   83.296555]
*** DEADLOCK ***

Fix it by replacing the pasid_lock with an atomic exchange operation.

Reported-and-tested-by: Dave Jiang 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/pasid.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 9fb3d3e80408..1ddcb8295f72 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -24,7 +24,6 @@
 /*
  * Intel IOMMU system wide PASID name space:
  */
-static DEFINE_SPINLOCK(pasid_lock);
 u32 intel_pasid_max_id = PASID_MAX;
 
 int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid)
@@ -259,19 +258,18 @@ struct pasid_entry *intel_pasid_get_entry(struct device 
*dev, u32 pasid)
dir_index = pasid >> PASID_PDE_SHIFT;
index = pasid & PASID_PTE_MASK;
 
-   spin_lock(&pasid_lock);
	entries = get_pasid_table_from_pde(&dir[dir_index]);
if (!entries) {
entries = alloc_pgtable_page(info->iommu->node);
-   if (!entries) {
-   spin_unlock(&pasid_lock);
+   if (!entries)
return NULL;
-   }
 
-   WRITE_ONCE(dir[dir_index].val,
-  (u64)virt_to_phys(entries) | PASID_PTE_PRESENT);
+   if (cmpxchg64(&dir[dir_index].val, 0ULL,
+ (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
+   free_pgtable_page(entries);
+   entries = get_pasid_table_from_pde(&dir[dir_index]);
+   }
}
-   spin_unlock(&pasid_lock);
 
	return &entries[index];
 }
-- 
2.25.1



Re: [PATCH] iommu/vt-d: Disable SVM when ATS/PRI/PASID are not enabled in the device

2021-03-14 Thread Lu Baolu

On 3/15/21 4:15 AM, Kyung Min Park wrote:

Currently, the Intel VT-d supports Shared Virtual Memory (SVM) only when
IO page fault is supported. Otherwise, shared memory pages can not be
swapped out and need to be pinned. The device needs the Address Translation
Service (ATS), Page Request Interface (PRI) and Process Address Space
Identifier (PASID) capabilities to be enabled to support IO page fault.

Disable SVM when ATS, PRI and PASID are not enabled in the device.

Signed-off-by: Kyung Min Park 
---
  drivers/iommu/intel/iommu.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee0932307d64..956a02eb40b4 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5380,6 +5380,9 @@ intel_iommu_dev_enable_feat(struct device *dev, enum 
iommu_dev_features feat)
if (!info)
return -EINVAL;
  
+		if (!info->pasid_enabled || !info->pri_enabled || !info->ats_enabled)

+   return -EINVAL;
+
if (info->iommu->flags & VTD_FLAG_SVM_CAPABLE)
return 0;
}



Thanks for the patch.

Acked-by: Lu Baolu 

Best regards,
baolu


Re: [RFC PATCH v2 1/6] iommu: Evolve to support more scenarios of using IOPF

2021-03-09 Thread Lu Baolu

Hi Shenming,

On 3/9/21 2:22 PM, Shenming Lu wrote:

This patch follows the discussion here:

https://lore.kernel.org/linux-acpi/YAaxjmJW+ZMvrhac@myrica/

In order to support more scenarios of using IOPF (mainly consider
the nested extension), besides keeping IOMMU_DEV_FEAT_IOPF as a
general capability for whether delivering faults through the IOMMU,
we extend iommu_register_fault_handler() with flags and introduce
IOPF_REPORT_FLAT and IOPF_REPORT_NESTED to describe the page fault
reporting capability under a specific configuration.
IOPF_REPORT_NESTED needs additional info to indicate which level/stage
is concerned since the fault client may be interested in only one
level.

Signed-off-by: Shenming Lu 
---
  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   |  3 +-
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 11 ++--
  drivers/iommu/io-pgfault.c|  4 --
  drivers/iommu/iommu.c | 56 ++-
  include/linux/iommu.h | 21 ++-
  include/uapi/linux/iommu.h|  3 +
  6 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index ee66d1f4cb81..5de9432349d4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -482,7 +482,8 @@ static int arm_smmu_master_sva_enable_iopf(struct 
arm_smmu_master *master)
if (ret)
return ret;
  
-	ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf, dev);

+   ret = iommu_register_device_fault_handler(dev, iommu_queue_iopf,
+ IOPF_REPORT_FLAT, dev);
if (ret) {
iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 363744df8d51..f40529d0075d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1447,10 +1447,6 @@ static int arm_smmu_handle_evt(struct arm_smmu_device 
*smmu, u64 *evt)
return -EOPNOTSUPP;
}
  
-	/* Stage-2 is always pinned at the moment */

-   if (evt[1] & EVTQ_1_S2)
-   return -EFAULT;
-
if (evt[1] & EVTQ_1_RnW)
perm |= IOMMU_FAULT_PERM_READ;
else
@@ -1468,13 +1464,18 @@ static int arm_smmu_handle_evt(struct arm_smmu_device 
*smmu, u64 *evt)
.flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
.grpid = FIELD_GET(EVTQ_1_STAG, evt[1]),
.perm = perm,
-   .addr = FIELD_GET(EVTQ_2_ADDR, evt[2]),
};
  
  		if (ssid_valid) {

flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
flt->prm.pasid = FIELD_GET(EVTQ_0_SSID, evt[0]);
}
+
+   if (evt[1] & EVTQ_1_S2) {
+   flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_L2;
+   flt->prm.addr = FIELD_GET(EVTQ_3_IPA, evt[3]);
+   } else
+   flt->prm.addr = FIELD_GET(EVTQ_2_ADDR, evt[2]);
} else {
flt->type = IOMMU_FAULT_DMA_UNRECOV;
flt->event = (struct iommu_fault_unrecoverable) {
diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c
index 1df8c1dcae77..abf16e06bcf5 100644
--- a/drivers/iommu/io-pgfault.c
+++ b/drivers/iommu/io-pgfault.c
@@ -195,10 +195,6 @@ int iommu_queue_iopf(struct iommu_fault *fault, void 
*cookie)
  
  	lockdep_assert_held(&param->lock);
  
-	if (fault->type != IOMMU_FAULT_PAGE_REQ)

-   /* Not a recoverable page fault */
-   return -EOPNOTSUPP;
-


Any reason why you want to remove this check?


/*
 * As long as we're holding param->lock, the queue can't be unlinked
 * from the device and therefore cannot disappear.
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..cb1d93b00f7d 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1056,6 +1056,40 @@ int iommu_group_unregister_notifier(struct iommu_group 
*group,
  }
  EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
  
+/*

+ * iommu_update_device_fault_handler - Update the device fault handler via 
flags
+ * @dev: the device
+ * @mask: bits(not set) to clear
+ * @set: bits to set
+ *
+ * Update the device fault handler installed by
+ * iommu_register_device_fault_handler().
+ *
+ * Return 0 on success, or an error.
+ */
+int iommu_update_device_fault_handler(struct device *dev, u32 mask, u32 set)
+{
+   struct dev_iommu *param = dev->iommu;
+   int ret = 0;
+
+   if (!param)
+   return -EINVAL;
+
+   mutex_lock(&param->lock);
+
+   if (param->fault_param) {
+   ret = -EINVAL;
+   goto 

[PATCH 1/1] iommu/vt-d: Don't set then immediately clear in prq_event_thread()

2021-03-08 Thread Lu Baolu
The private data field of a page group response descriptor is set then
immediately cleared in prq_event_thread(). Fix this by moving clearing
code up.

Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode")
Cc: Jacob Pan 
Reviewed-by: Liu Yi L 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index d76cc79f3496..f688c5f29b1a 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1021,12 +1021,12 @@ static irqreturn_t prq_event_thread(int irq, void *d)
QI_PGRP_RESP_TYPE;
resp.qw1 = QI_PGRP_IDX(req->prg_index) |
QI_PGRP_LPIG(req->lpig);
+   resp.qw2 = 0;
+   resp.qw3 = 0;
 
if (req->priv_data_present)
			memcpy(&resp.qw2, req->priv_data,
   sizeof(req->priv_data));
-   resp.qw2 = 0;
-   resp.qw3 = 0;
		qi_submit_sync(iommu, &resp, 1, 0);
}
 prq_advance:
-- 
2.25.1



Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries

2021-03-07 Thread Lu Baolu

Hi Joerg,

On 3/4/21 8:26 PM, Joerg Roedel wrote:

On Thu, Feb 25, 2021 at 02:26:51PM +0800, Lu Baolu wrote:

When the first level page table is used for IOVA translation, it only
supports Read-Only and Read-Write permissions. The Write-Only permission
is not supported as the PRESENT bit (implying Read permission) should
always be set. When using the second level, we still give separate permissions
that allow Write-Only, which seems inconsistent and awkward. There is no
use case we can think of, hence remove that configuration to make it
consistent.


No use-case for WriteOnly mappings? How about DMA_FROM_DEVICE mappings?



The statement of no use case is not correct. Sorry about it.

As we have moved to using the first level for IOVA translation, the first
level page table entry only provides RO and RW permissions. So if any
device driver specifies the DMA_FROM_DEVICE attribute, it will get RW
permission in the page table. This patch aims to make the permissions
of the second level and the first level consistent. There is no impact on
the use of the DMA_FROM_DEVICE attribute.

Best regards,
baolu


Re: [PATCH] iommu/dma: Resurrect the "forcedac" option

2021-03-07 Thread Lu Baolu
 devices\n");
-   dmar_forcedac = 1;
+   pr_warn("intel_iommu=forcedac deprecated; use iommu.forcedac 
instead\n");
+   iommu_dma_forcedac = true;
} else if (!strncmp(str, "strict", 6)) {
pr_info("Disable batched IOTLB flush\n");
intel_iommu_strict = 1;
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 706b68d1359b..13d1f4c14d7b 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -40,6 +40,8 @@ void iommu_dma_get_resv_regions(struct device *dev, struct 
list_head *list);
  void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
struct iommu_domain *domain);
  
+extern bool iommu_dma_forcedac;

+
  #else /* CONFIG_IOMMU_DMA */
  
  struct iommu_domain;




Thanks for catching this.

Acked-by: Lu Baolu 

Best regards,
baolu


Re: [PATCH v2 1/4] iommu/vt-d: Enable write protect for supervisor SVM

2021-03-03 Thread Lu Baolu

On 3/2/21 6:13 PM, Jacob Pan wrote:

Write protect bit, when set, inhibits supervisor writes to the read-only
pages. In supervisor shared virtual addressing (SVA), where page tables
are shared between CPU and DMA, IOMMU PASID entry WPE bit should match
CR0.WP bit in the CPU.
This patch sets WPE bit for supervisor PASIDs if CR0.WP is set.

Signed-off-by: Sanjay Kumar 
Signed-off-by: Jacob Pan 
---
  drivers/iommu/intel/pasid.c | 26 ++
  1 file changed, 26 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 0cceaabc3ce6..0b7e0e726ade 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -410,6 +410,15 @@ static inline void pasid_set_sre(struct pasid_entry *pe)
	pasid_set_bits(&pe->val[2], 1 << 0, 1);
  }
  
+/*

+ * Setup the WPE(Write Protect Enable) field (Bit 132) of a
+ * scalable mode PASID entry.
+ */
+static inline void pasid_set_wpe(struct pasid_entry *pe)
+{
+   pasid_set_bits(&pe->val[2], 1 << 4, 1 << 4);
+}
+
  /*
   * Setup the P(Present) field (Bit 0) of a scalable mode PASID
   * entry.
@@ -553,6 +562,20 @@ static void pasid_flush_caches(struct intel_iommu *iommu,
}
  }
  
+static inline int pasid_enable_wpe(struct pasid_entry *pte)

+{
+   unsigned long cr0 = read_cr0();
+
+   /* CR0.WP is normally set but just to be sure */
+   if (unlikely(!(cr0 & X86_CR0_WP))) {
+   pr_err_ratelimited("No CPU write protect!\n");
+   return -EINVAL;
+   }
+   pasid_set_wpe(pte);
+
+   return 0;
+};
+
  /*
   * Set up the scalable mode pasid table entry for first only
   * translation type.
@@ -584,6 +607,9 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
return -EINVAL;
}
pasid_set_sre(pte);
+   if (pasid_enable_wpe(pte))
+   return -EINVAL;
+
}
  
  	if (flags & PASID_FLAG_FL5LP) {




Acked-by: Lu Baolu 

Best regards,
baolu


Re: [PATCH v2 2/4] iommu/vt-d: Enable write protect propagation from guest

2021-03-03 Thread Lu Baolu

On 3/2/21 6:13 PM, Jacob Pan wrote:

Write protect bit, when set, inhibits supervisor writes to the read-only
pages. In guest supervisor shared virtual addressing (SVA), write-protect
should be honored upon guest bind supervisor PASID request.

This patch extends the VT-d portion of the IOMMU UAPI to include WP bit.
WPE bit of the  supervisor PASID entry will be set to match CPU CR0.WP bit.

Signed-off-by: Sanjay Kumar 
Signed-off-by: Jacob Pan 
---
  drivers/iommu/intel/pasid.c | 3 +++
  include/uapi/linux/iommu.h  | 3 ++-
  2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 0b7e0e726ade..b7e39239f539 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -763,6 +763,9 @@ intel_pasid_setup_bind_data(struct intel_iommu *iommu, 
struct pasid_entry *pte,
return -EINVAL;
}
pasid_set_sre(pte);
+   /* Enable write protect WP if guest requested */
+   if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_WPE)
+   pasid_set_wpe(pte);
}
  
  	if (pasid_data->flags & IOMMU_SVA_VTD_GPASID_EAFE) {

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 35d48843acd8..3a9164cc9937 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -288,7 +288,8 @@ struct iommu_gpasid_bind_data_vtd {
  #define IOMMU_SVA_VTD_GPASID_PWT  (1 << 3) /* page-level write through */
  #define IOMMU_SVA_VTD_GPASID_EMTE (1 << 4) /* extended mem type enable */
  #define IOMMU_SVA_VTD_GPASID_CD   (1 << 5) /* PASID-level cache 
disable */
-#define IOMMU_SVA_VTD_GPASID_LAST  (1 << 6)
+#define IOMMU_SVA_VTD_GPASID_WPE   (1 << 6) /* Write protect enable */
+#define IOMMU_SVA_VTD_GPASID_LAST  (1 << 7)
__u64 flags;
    __u32 pat;
__u32 emt;



Acked-by: Lu Baolu 

Best regards,
baolu


Re: [PATCH] iommu/vt-d: Fix status code for Allocate/Free PASID command

2021-02-28 Thread Lu Baolu

On 2/27/21 3:39 PM, Zenghui Yu wrote:

As per Intel vt-d spec, Rev 3.0 (section 10.4.45 "Virtual Command Response
Register"), the status code of "No PASID available" error in response to
the Allocate PASID command is 2, not 1. The same for "Invalid PASID" error
in response to the Free PASID command.

We will otherwise see confusing kernel log under the command failure from
guest side. Fix it.

Fixes: 24f27d32ab6b ("iommu/vt-d: Enlightened PASID allocation")
Signed-off-by: Zenghui Yu 
---
  drivers/iommu/intel/pasid.h | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index 97dfcffbf495..444c0bec221a 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -30,8 +30,8 @@
  #define VCMD_VRSP_IP  0x1
  #define VCMD_VRSP_SC(e)   (((e) >> 1) & 0x3)
  #define VCMD_VRSP_SC_SUCCESS  0
-#define VCMD_VRSP_SC_NO_PASID_AVAIL1
-#define VCMD_VRSP_SC_INVALID_PASID 1
+#define VCMD_VRSP_SC_NO_PASID_AVAIL2
+#define VCMD_VRSP_SC_INVALID_PASID 2
  #define VCMD_VRSP_RESULT_PASID(e) (((e) >> 8) & 0xf)
  #define VCMD_CMD_OPERAND(e)   ((e) << 8)
  /*



Thanks a lot for catching this.

Acked-by: Lu Baolu 

Best regards,
baolu


[PATCH 5/5] iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown

2021-02-24 Thread Lu Baolu
When a present pasid entry is disassembled, all kinds of pasid related
caches need to be flushed. But when a pasid entry is not being used
(PRESENT bit not set), we don't need to do this. Check the PRESENT bit
in intel_pasid_tear_down_entry() and avoid flushing caches if it's not
set.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/pasid.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 07531e5edfa2..9fb3d3e80408 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -493,6 +493,9 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu, 
struct device *dev,
if (WARN_ON(!pte))
return;
 
+   if (!(pte->val[0] & PASID_PTE_PRESENT))
+   return;
+
did = pasid_get_domain_id(pte);
intel_pasid_clear_entry(dev, pasid, fault_ignore);
 
-- 
2.25.1



[PATCH 4/5] iommu/vt-d: Use user privilege for RID2PASID translation

2021-02-24 Thread Lu Baolu
When first-level page tables are used for IOVA translation, we use user
privilege by setting U/S bit in the page table entry. This is to make it
consistent with the second level translation, where the U/S enforcement
is not available. Clear the SRE (Supervisor Request Enable) field in the
pasid table entry of RID2PASID so that requests requesting the supervisor
privilege are blocked and treated as DMA remapping faults.

Suggested-by: Jacob Pan 
Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 7 +--
 drivers/iommu/intel/pasid.c | 3 ++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index f41b184ce6eb..b14427d8121f 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2495,9 +2495,9 @@ static int domain_setup_first_level(struct intel_iommu 
*iommu,
struct device *dev,
u32 pasid)
 {
-   int flags = PASID_FLAG_SUPERVISOR_MODE;
struct dma_pte *pgd = domain->pgd;
int agaw, level;
+   int flags = 0;
 
/*
 * Skip top levels of page tables for iommu which has
@@ -2513,7 +2513,10 @@ static int domain_setup_first_level(struct intel_iommu 
*iommu,
if (level != 4 && level != 5)
return -EINVAL;
 
-   flags |= (level == 5) ? PASID_FLAG_FL5LP : 0;
+   if (pasid != PASID_RID2PASID)
+   flags |= PASID_FLAG_SUPERVISOR_MODE;
+   if (level == 5)
+   flags |= PASID_FLAG_FL5LP;
 
return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid,
 domain->iommu_did[iommu->seq_id],
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index f26cb6195b2c..07531e5edfa2 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -647,7 +647,8 @@ int intel_pasid_setup_second_level(struct intel_iommu 
*iommu,
 * Since it is a second level only translation setup, we should
 * set SRE bit as well (addresses are expected to be GPAs).
 */
-   pasid_set_sre(pte);
+   if (pasid != PASID_RID2PASID)
+   pasid_set_sre(pte);
pasid_set_present(pte);
pasid_flush_caches(iommu, pte, pasid, did);
 
-- 
2.25.1



[PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries

2021-02-24 Thread Lu Baolu
When the first level page table is used for IOVA translation, it only
supports Read-Only and Read-Write permissions. The Write-Only permission
is not supported as the PRESENT bit (implying Read permission) should
always be set. When using the second level, we still give separate permissions
that allow Write-Only, which seems inconsistent and awkward. There is no
use case we can think of, hence remove that configuration to make it
consistent.

Suggested-by: Ashok Raj 
Fixes: b802d070a52a1 ("iommu/vt-d: Use iova over first level")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee0932307d64..19b3fd0d035b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2305,8 +2305,9 @@ __domain_mapping(struct dmar_domain *domain, unsigned 
long iov_pfn,
return -EINVAL;
 
attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);
+   attr |= DMA_FL_PTE_PRESENT;
if (domain_use_first_level(domain)) {
-   attr |= DMA_FL_PTE_PRESENT | DMA_FL_PTE_XD | DMA_FL_PTE_US;
+   attr |= DMA_FL_PTE_XD | DMA_FL_PTE_US;
 
if (domain->domain.type == IOMMU_DOMAIN_DMA) {
attr |= DMA_FL_PTE_ACCESS;
-- 
2.25.1



[PATCH 1/5] iommu/vt-d: Report the right page fault address

2021-02-24 Thread Lu Baolu
The Address field of the Page Request Descriptor only keeps bits [63:12]
of the offending address. Convert it to a full address before reporting
it to device drivers.

Fixes: eb8d93ea3c1d3 ("iommu/vt-d: Report page request faults for guest SVA")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 574a7e657a9a..d76cc79f3496 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -862,7 +862,7 @@ intel_svm_prq_report(struct device *dev, struct 
page_req_dsc *desc)
/* Fill in event data for device specific processing */
memset(, 0, sizeof(struct iommu_fault_event));
event.fault.type = IOMMU_FAULT_PAGE_REQ;
-   event.fault.prm.addr = desc->addr;
+   event.fault.prm.addr = (u64)desc->addr << VTD_PAGE_SHIFT;
event.fault.prm.pasid = desc->pasid;
event.fault.prm.grpid = desc->prg_index;
event.fault.prm.perm = prq_to_iommu_prot(desc);
-- 
2.25.1



[PATCH 3/5] iommu/vt-d: Invalidate PASID cache when root/context entry changed

2021-02-24 Thread Lu Baolu
When the Intel IOMMU is operating in the scalable mode, some information
from the root and context table may be used to tag entries in the PASID
cache. Software should invalidate the PASID-cache when changing root or
context table entries.

Suggested-by: Ashok Raj 
Fixes: 7373a8cc38197 ("iommu/vt-d: Setup context and enable RID2PASID support")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 18 +-
 include/linux/intel-iommu.h |  1 +
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 19b3fd0d035b..f41b184ce6eb 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1340,6 +1340,11 @@ static void iommu_set_root_entry(struct intel_iommu 
*iommu)
  readl, (sts & DMA_GSTS_RTPS), sts);
 
	raw_spin_unlock_irqrestore(&iommu->register_lock, flag);
+
+   iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
+   if (sm_supported(iommu))
+   qi_flush_pasid_cache(iommu, 0, QI_PC_GLOBAL, 0);
+   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
 }
 
 void iommu_flush_write_buffer(struct intel_iommu *iommu)
@@ -2423,6 +2428,10 @@ static void domain_context_clear_one(struct intel_iommu 
*iommu, u8 bus, u8 devfn
   (((u16)bus) << 8) | devfn,
   DMA_CCMD_MASK_NOBIT,
   DMA_CCMD_DEVICE_INVL);
+
+   if (sm_supported(iommu))
+   qi_flush_pasid_cache(iommu, did_old, QI_PC_ALL_PASIDS, 0);
+
iommu->flush.flush_iotlb(iommu,
 did_old,
 0,
@@ -3268,8 +3277,6 @@ static int __init init_dmars(void)
register_pasid_allocator(iommu);
 #endif
iommu_set_root_entry(iommu);
-   iommu->flush.flush_context(iommu, 0, 0, 0, 
DMA_CCMD_GLOBAL_INVL);
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
}
 
 #ifdef CONFIG_INTEL_IOMMU_BROKEN_GFX_WA
@@ -3459,12 +3466,7 @@ static int init_iommu_hw(void)
}
 
iommu_flush_write_buffer(iommu);
-
iommu_set_root_entry(iommu);
-
-   iommu->flush.flush_context(iommu, 0, 0, 0,
-  DMA_CCMD_GLOBAL_INVL);
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
iommu_enable_translation(iommu);
iommu_disable_protect_mem_regions(iommu);
}
@@ -3847,8 +3849,6 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
goto disable_iommu;
 
iommu_set_root_entry(iommu);
-   iommu->flush.flush_context(iommu, 0, 0, 0, DMA_CCMD_GLOBAL_INVL);
-   iommu->flush.flush_iotlb(iommu, 0, 0, 0, DMA_TLB_GLOBAL_FLUSH);
iommu_enable_translation(iommu);
 
iommu_disable_protect_mem_regions(iommu);
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 1bc46b88711a..d1f32b33415a 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -372,6 +372,7 @@ enum {
 /* PASID cache invalidation granu */
 #define QI_PC_ALL_PASIDS   0
 #define QI_PC_PASID_SEL1
+#define QI_PC_GLOBAL   3
 
 #define QI_EIOTLB_ADDR(addr)   ((u64)(addr) & VTD_PAGE_MASK)
 #define QI_EIOTLB_IH(ih)   (((u64)ih) << 6)
-- 
2.25.1



[PATCH 0/5] iommu/vt-d: Several misc fixes

2021-02-24 Thread Lu Baolu
Hi Joerg,

This series includes some misc fixes for the VT-d iommu driver. Please
help to review and merge.

Best regards,
baolu

Lu Baolu (5):
  iommu/vt-d: Report the right page fault address
  iommu/vt-d: Remove WO permissions on second-level paging entries
  iommu/vt-d: Invalidate PASID cache when root/context entry changed
  iommu/vt-d: Use user privilege for RID2PASID translation
  iommu/vt-d: Avoid unnecessary cache flush in pasid entry teardown

 drivers/iommu/intel/iommu.c | 28 
 drivers/iommu/intel/pasid.c |  6 +-
 drivers/iommu/intel/svm.c   |  2 +-
 include/linux/intel-iommu.h |  1 +
 4 files changed, 23 insertions(+), 14 deletions(-)

-- 
2.25.1



[PATCH 1/1] iommu: Don't use lazy flush for untrusted device

2021-02-24 Thread Lu Baolu
The lazy IOTLB flushing setup leaves a time window in which the device
can still access some system memory that has already been unmapped by
the device driver. It's not suitable for untrusted devices. A malicious
device might use this to attack the system by obtaining data that it
shouldn't have access to.

Fixes: c588072bba6b5 ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Signed-off-by: Lu Baolu 
---
 drivers/iommu/dma-iommu.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f659395e7959..65234e383d6b 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -311,6 +311,11 @@ static void iommu_dma_flush_iotlb_all(struct iova_domain 
*iovad)
domain->ops->flush_iotlb_all(domain);
 }
 
+static bool dev_is_untrusted(struct device *dev)
+{
+   return dev_is_pci(dev) && to_pci_dev(dev)->untrusted;
+}
+
 /**
  * iommu_dma_init_domain - Initialise a DMA mapping domain
  * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
@@ -365,8 +370,9 @@ static int iommu_dma_init_domain(struct iommu_domain 
*domain, dma_addr_t base,
 
init_iova_domain(iovad, 1UL << order, base_pfn);
 
-   if (!cookie->fq_domain && !iommu_domain_get_attr(domain,
-   DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, &attr) && attr) {
+   if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
+   !iommu_domain_get_attr(domain, DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE, 
&attr) &&
+   attr) {
if (init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all,
  iommu_dma_entry_dtor))
pr_warn("iova flush queue initialization failed\n");
@@ -508,11 +514,6 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
iova_align(iovad, size), dir, attrs);
 }
 
-static bool dev_is_untrusted(struct device *dev)
-{
-   return dev_is_pci(dev) && to_pci_dev(dev)->untrusted;
-}
-
 static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
size_t size, int prot, u64 dma_mask)
 {
-- 
2.25.1



Re: [PATCH 4/4] iommu/vt-d: Calculate and set flags for handle_mm_fault

2021-02-19 Thread Lu Baolu

On 2/19/21 5:31 AM, Jacob Pan wrote:

Page requests are originated from the user page fault. Therefore, we
shall set FAULT_FLAG_USER.

FAULT_FLAG_REMOTE indicates that we are walking an mm which is not
guaranteed to be the same as the current->mm and should not be subject
to protection key enforcement. Therefore, we should set FAULT_FLAG_REMOTE
to avoid faults when both SVM and PKEY are used.

References: commit 1b2ee1266ea6 ("mm/core: Do not enforce PKEY permissions on remote 
mm access")
Reviewed-by: Raj Ashok 
Signed-off-by: Jacob Pan 


Acked-by: Lu Baolu 

Best regards,
baolu


---
  drivers/iommu/intel/svm.c | 9 ++---
  1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index ff7ae7cc17d5..7bfd20a24a60 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1086,6 +1086,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
struct intel_iommu *iommu = d;
struct intel_svm *svm = NULL;
int head, tail, handled = 0;
+   unsigned int flags = 0;
  
  	/* Clear PPR bit before reading head/tail registers, to

 * ensure that we get a new interrupt if needed. */
@@ -1186,9 +1187,11 @@ static irqreturn_t prq_event_thread(int irq, void *d)
if (access_error(vma, req))
goto invalid;
  
-		ret = handle_mm_fault(vma, address,

- req->wr_req ? FAULT_FLAG_WRITE : 0,
- NULL);
+   flags = FAULT_FLAG_USER | FAULT_FLAG_REMOTE;
+   if (req->wr_req)
+   flags |= FAULT_FLAG_WRITE;
+
+   ret = handle_mm_fault(vma, address, flags, NULL);
if (ret & VM_FAULT_ERROR)
goto invalid;
  



Re: [PATCH 3/4] iommu/vt-d: Reject unsupported page request modes

2021-02-19 Thread Lu Baolu

On 2/19/21 5:31 AM, Jacob Pan wrote:

When supervisor/privilege mode SVM is used, we bind init_mm.pgd with
a supervisor PASID. There should not be any page fault for init_mm.
Execution request with DMA read is also not supported.

This patch checks the PRQ descriptor for both unsupported configurations,
rejecting them both with invalid responses.

Signed-off-by: Jacob Pan 


Fixes: 1c4f88b7f1f92 ("iommu/vt-d: Shared virtual address in scalable mode")
Acked-by: Lu Baolu 

Best regards,
baolu


---
  drivers/iommu/intel/svm.c | 12 +++-
  1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 23a1e4f58c54..ff7ae7cc17d5 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1113,7 +1113,17 @@ static irqreturn_t prq_event_thread(int irq, void *d)
   ((unsigned long long *)req)[1]);
goto no_pasid;
}
-
+   /* We shall not receive page request for supervisor SVM */
+   if (req->pm_req && (req->rd_req | req->wr_req)) {
+   pr_err("Unexpected page request in Privilege Mode");
+   /* No need to find the matching sdev as for bad_req */
+   goto no_pasid;
+   }
+   /* DMA read with exec requeset is not supported. */
+   if (req->exe_req && req->rd_req) {
+   pr_err("Execution request not supported\n");
+   goto no_pasid;
+   }
if (!svm || svm->pasid != req->pasid) {
rcu_read_lock();
svm = ioasid_find(NULL, req->pasid, NULL);



Re: [PATCH 1/4] iommu/vt-d: Enable write protect for supervisor SVM

2021-02-19 Thread Lu Baolu

Hi Jacob and Sanjay,

On 2/19/21 5:31 AM, Jacob Pan wrote:

Write protect bit, when set, inhibits supervisor writes to the read-only
pages. In supervisor shared virtual addressing (SVA), where page tables
are shared between CPU and DMA, IOMMU PASID entry WPE bit should match
CR0.WP bit in the CPU.
This patch sets WPE bit for supervisor PASIDs if CR0.WP is set.


 From reading the commit message, the intention of this patch is to match
the PASID entry WPE bit with CPU CR0.WP if 1) SRE is set (supervisor
pasid); 2) the page table is shared between CPU and IOMMU. Do I understand
it right?

But what the real code does is fail pasid entry setup for first
level translation if CPU CR0.WP is not set. That's not consistent with
what is described above.

What I am thinking is that, as long as SRE is set, we should always set
WPE in intel_pasid_setup_first_level(). For the supervisor SVA case, we
should check CPU CR0.WP in intel_svm_bind_mm() and abort binding if
CR0.WP is not set.

Thought?
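
Something along these lines is what I have in mind (rough sketch, not a real
patch; the helper name is made up):

/*
 * Refuse a supervisor-mode bind when CR0.WP is clear, instead of
 * failing deep inside the pasid entry setup.
 */
static int check_cpu_write_protect(void)
{
	if (unlikely(!(read_cr0() & X86_CR0_WP))) {
		pr_err_ratelimited("CR0.WP is clear, cannot bind supervisor PASID\n");
		return -EINVAL;
	}

	return 0;
}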

Best regards,
baolu



Signed-off-by: Sanjay Kumar 
Signed-off-by: Jacob Pan 
---
  drivers/iommu/intel/pasid.c | 26 ++
  1 file changed, 26 insertions(+)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 0cceaabc3ce6..0b7e0e726ade 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -410,6 +410,15 @@ static inline void pasid_set_sre(struct pasid_entry *pe)
	pasid_set_bits(&pe->val[2], 1 << 0, 1);
  }
  
+/*

+ * Setup the WPE(Write Protect Enable) field (Bit 132) of a
+ * scalable mode PASID entry.
+ */
+static inline void pasid_set_wpe(struct pasid_entry *pe)
+{
+   pasid_set_bits(&pe->val[2], 1 << 4, 1 << 4);
+}
+
  /*
   * Setup the P(Present) field (Bit 0) of a scalable mode PASID
   * entry.
@@ -553,6 +562,20 @@ static void pasid_flush_caches(struct intel_iommu *iommu,
}
  }
  
+static inline int pasid_enable_wpe(struct pasid_entry *pte)

+{
+   unsigned long cr0 = read_cr0();
+
+   /* CR0.WP is normally set but just to be sure */
+   if (unlikely(!(cr0 & X86_CR0_WP))) {
+   pr_err_ratelimited("No CPU write protect!\n");
+   return -EINVAL;
+   }
+   pasid_set_wpe(pte);
+
+   return 0;
+};
+
  /*
   * Set up the scalable mode pasid table entry for first only
   * translation type.
@@ -584,6 +607,9 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
return -EINVAL;
}
pasid_set_sre(pte);
+   if (pasid_enable_wpe(pte))
+   return -EINVAL;
+
}
  
  	if (flags & PASID_FLAG_FL5LP) {




Re: [PATCH 11/12] platform-msi: Add platform check for subdevice irq domain

2021-02-08 Thread Lu Baolu

Hi Leon,

On 2/8/21 4:21 PM, Leon Romanovsky wrote:

On Wed, Feb 03, 2021 at 12:56:44PM -0800, Megha Dey wrote:

From: Lu Baolu 

The pci_subdevice_msi_create_irq_domain() should fail if the underlying
platform is not able to support IMS (Interrupt Message Storage). Otherwise,
the isolation of interrupt is not guaranteed.

For x86, IMS is only supported on bare metal for now. We could enable it
in the virtualization environments in the future if interrupt HYPERCALL
domain is supported or the hardware has the capability of interrupt
isolation for subdevices.

Cc: David Woodhouse 
Cc: Leon Romanovsky 
Cc: Kevin Tian 
Suggested-by: Thomas Gleixner 
Link: https://lore.kernel.org/linux-pci/87pn4nk7nn@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/linux-pci/877dqrnzr3@nanos.tec.linutronix.de/
Link: https://lore.kernel.org/linux-pci/877dqqmc2h@nanos.tec.linutronix.de/
Signed-off-by: Lu Baolu 
Signed-off-by: Megha Dey 
---
  arch/x86/pci/common.c   | 74 +
  drivers/base/platform-msi.c |  8 +
  include/linux/msi.h |  1 +
  3 files changed, 83 insertions(+)

diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index 3507f45..263ccf6 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -12,6 +12,8 @@
  #include 
  #include 
  #include 
+#include 
+#include 

  #include 
  #include 
@@ -724,3 +726,75 @@ struct pci_dev *pci_real_dma_dev(struct pci_dev *dev)
return dev;
  }
  #endif
+
+#ifdef CONFIG_DEVICE_MSI


Sorry for my naive question, but I see it in all your patches in this series
and wonder why did you wrap everything with ifdefs?.


The added code is only called when DEVICE_MSI is configured.



All *.c code is wrapped with those ifdefs, which is hard to navigate and
unlikely to give any code/size optimization benefit if kernel is compiled
without CONFIG_DEVICE_MSI. The more common approach is to put those
ifdef in the public header files and leave to the compiler to drop not
called functions.
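One common way to realize that (names reused from this series purely for
illustration, not the actual patch): keep the single #ifdef in the header,
provide a static inline stub for the disabled case, and let the Makefile
select the real definition, so the .c files stay free of #ifdef blocks.

/* hypothetical header sketch, e.g. somewhere under include/linux/ */
struct pci_dev;

#ifdef CONFIG_DEVICE_MSI
bool arch_support_pci_device_ims(struct pci_dev *pdev);
#else
static inline bool arch_support_pci_device_ims(struct pci_dev *pdev)
{
	return false;	/* no subdevice irq domain without DEVICE_MSI */
}
#endif

Callers then compile unconditionally and the stub collapses to a constant
when the option is off.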


Yes. This looks better.



Thanks



Best regards,
baolu


Re: [PATCH 11/12] platform-msi: Add platform check for subdevice irq domain

2021-02-04 Thread Lu Baolu

Hi Jason,

On 2021/2/4 20:14, Jason Gunthorpe wrote:

On Wed, Feb 03, 2021 at 12:56:44PM -0800, Megha Dey wrote:

+bool arch_support_pci_device_ims(struct pci_dev *pdev)
+{


Consistent language please, we are not using IMS elsewhere, this
feature is called device_msi in Linux.



Thanks for pointing this out. I will correct it.


Jason



Best regards,
baolu


Re: [PATCH v2 3/3] iommu/vt-d: Apply SATC policy

2021-02-03 Thread Lu Baolu

Hi Kevin,

On 2/4/21 9:59 AM, Tian, Kevin wrote:

From: Lu Baolu
Sent: Wednesday, February 3, 2021 5:33 PM

From: Yian Chen

Starting from Intel VT-d v3.2, Intel platform BIOS can provide a new SATC
table structure. SATC table lists a set of SoC integrated devices that
require ATC to work (VT-d specification v3.2, section 8.8). Furthermore,

This statement is not accurate. The purpose of SATC is to tell whether a
SoC integrated device has been validated to meet the isolation requirements
of using device TLB. All devices listed in SATC can have ATC safely enabled by
OS. In addition, there is a flag for each listed device for whether ATC is a
functional requirement. However, above description only captured the last
point.


You are right. This series only addresses the devices with the flag set
which have functional requirement for ATS.




the new version of IOMMU supports SoC device ATS in both its Scalable
mode
and legacy mode.

When IOMMU is working in scalable mode, software must enable device ATS
support.

"must enable" is misleading here. You need describe the policies for three
categories:

- SATC devices with ATC_REQUIRED=1
- SATC devices with ATC_REQUIRED=0
- devices not listed in SATC, or when SATC is missing


Yian is working on this part. We planned it for v5.13. He will bring it
up for discussion later.
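Purely to make those three categories concrete (hypothetical names; the
actual policy is deferred as noted above):

/* illustrative sketch only, not part of this series */
enum satc_ats_policy {
	SATC_ATS_REQUIRED,	/* listed, ATC_REQUIRED=1: device needs ATS to function */
	SATC_ATS_VALIDATED,	/* listed, ATC_REQUIRED=0: ATS validated, safe to enable */
	SATC_ATS_UNLISTED,	/* not listed or no SATC: keep existing ATSR-based behavior */
};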




On the other hand, when IOMMU is in legacy mode for whatever
reason, the hardware managed ATS will automatically take effect and the
SATC required devices can work transparently to the software. As the

No background about hardware-managed ATS.


result, software shouldn't enable ATS on that device, otherwise duplicate
device TLB invalidations will occur.

This description draws a equation between legacy mode and hardware
managed ATS. Do we care about the scenario where there is no hardware
managed ATS but people also want to turn on ATC in legacy mode?


The hardware managed ATS is defined in the platform specific
specification. The purpose of this hardware design is backward
compatibility - legacy OSes with no SM or ATS awareness still running
well on them.



Thanks
Kevin



Best regards,
baolu


Re: [PATCH v2 0/3] iommu/vt-d: Add support for ACPI/SATC table

2021-02-03 Thread Lu Baolu

Add the change log. Sorry for the quick sent.

On 2021/2/3 17:33, Lu Baolu wrote:

The Intel VT-d specification v3.2 comes with a new ACPI SATC (SoC-
Integrated Address Translation Cache) reporting structure. The SoC
integrated device enumerated in this table has a functional
requirement to enable its ATC (Address Translation Cache) via the
ATS capability for device operation.

This patch set adds the code to parse the SATC table, enumerates the
devices in it and satisfies the ATS requirement for them. Please help
to review. I will queue it in VT-d update for v5.12 if no objection.



Change log:
v1->v2:
 - Rephrase some words in the cover letter, commit message and
   code comments.
 - Refactored a code style to make it look nicer.

Best regards,
baolu


Yian Chen (3):
   iommu/vt-d: Add new enum value and structure for SATC
   iommu/vt-d: Parse SATC reporting structure
   iommu/vt-d: Apply SATC policy

  drivers/iommu/intel/dmar.c  |   8 ++
  drivers/iommu/intel/iommu.c | 162 +++-
  include/acpi/actbl1.h   |  11 ++-
  include/linux/dmar.h|   2 +
  4 files changed, 178 insertions(+), 5 deletions(-)



[PATCH v2 3/3] iommu/vt-d: Apply SATC policy

2021-02-03 Thread Lu Baolu
From: Yian Chen 

Starting from Intel VT-d v3.2, Intel platform BIOS can provide a new SATC
table structure. SATC table lists a set of SoC integrated devices that
require ATC to work (VT-d specification v3.2, section 8.8). Furthermore,
the new version of IOMMU supports SoC device ATS in both its Scalable mode
and legacy mode.

When IOMMU is working in scalable mode, software must enable device ATS
support. On the other hand, when IOMMU is in legacy mode for whatever
reason, the hardware managed ATS will automatically take effect and the
SATC required devices can work transparently to the software. As the
result, software shouldn't enable ATS on that device, otherwise duplicate
device TLB invalidations will occur.

Signed-off-by: Yian Chen 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/iommu.c | 73 +++--
 1 file changed, 69 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee0932307d64..3e30c340e6a9 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -872,6 +872,60 @@ static bool iommu_is_dummy(struct intel_iommu *iommu, 
struct device *dev)
return false;
 }
 
+static bool iommu_support_ats(struct intel_iommu *iommu)
+{
+   return ecap_dev_iotlb_support(iommu->ecap);
+}
+
+static bool device_support_ats(struct pci_dev *dev)
+{
+   return pci_ats_supported(dev) && dmar_find_matched_atsr_unit(dev);
+}
+
+static int segment_atc_required(u16 segment)
+{
+   struct acpi_dmar_satc *satc;
+   struct dmar_satc_unit *satcu;
+   int ret = 1;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(satcu, &dmar_satc_units, list) {
+   satc = container_of(satcu->hdr, struct acpi_dmar_satc, header);
+   if (satcu->atc_required && satcu->devices_cnt &&
+   satc->segment == segment)
+   goto out;
+   }
+   ret = 0;
+out:
+   rcu_read_unlock();
+   return ret;
+}
+
+static int device_atc_required(struct pci_dev *dev)
+{
+   struct dmar_satc_unit *satcu;
+   struct acpi_dmar_satc *satc;
+   struct device *tmp;
+   int i, ret = 1;
+
+   dev = pci_physfn(dev);
+   rcu_read_lock();
+   list_for_each_entry_rcu(satcu, &dmar_satc_units, list) {
+   satc = container_of(satcu->hdr, struct acpi_dmar_satc, header);
+   if (!satcu->atc_required ||
+   satc->segment != pci_domain_nr(dev->bus))
+   continue;
+
+   for_each_dev_scope(satcu->devices, satcu->devices_cnt, i, tmp)
+   if (to_pci_dev(tmp) == dev)
+   goto out;
+   }
+   ret = 0;
+out:
+   rcu_read_unlock();
+   return ret;
+}
+
 struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn)
 {
struct dmar_drhd_unit *drhd = NULL;
@@ -2555,10 +2609,16 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
 
-   if (ecap_dev_iotlb_support(iommu->ecap) &&
-   pci_ats_supported(pdev) &&
-   dmar_find_matched_atsr_unit(pdev))
-   info->ats_supported = 1;
+   /*
+* Support ATS by default if it's supported by both IOMMU and
+* client sides, except that the device's ATS is required by
+* ACPI/SATC but the IOMMU scalable mode is disabled. In that
+* case the hardware managed ATS will be automatically used.
+*/
+   if (iommu_support_ats(iommu) && device_support_ats(pdev)) {
+   if (!device_atc_required(pdev) || sm_supported(iommu))
+   info->ats_supported = 1;
+   }
 
if (sm_supported(iommu)) {
if (pasid_supported(iommu)) {
@@ -3155,6 +3215,11 @@ static int __init init_dmars(void)
 * endfor
 */
for_each_drhd_unit(drhd) {
+   if (pci_ats_disabled() && segment_atc_required(drhd->segment)) {
+   pr_warn("Scalable mode disabled -- Use hardware managed ATS because PCIe ATS is disabled but some devices in PCIe segment 0x%x require it.",
+   drhd->segment);
+   intel_iommu_sm = 0;
+   }
/*
 * lock not needed as this is only incremented in the single
 * threaded kernel __init code path all other access are read
-- 
2.25.1



[PATCH v2 0/3] iommu/vt-d: Add support for ACPI/SATC table

2021-02-03 Thread Lu Baolu
The Intel VT-d specification v3.2 comes with a new ACPI SATC (SoC-
Integrated Address Translation Cache) reporting structure. The SoC
integrated device enumerated in this table has a functional
requirement to enable its ATC (Address Translation Cache) via the
ATS capability for device operation.

This patch set adds the code to parse the SATC table, enumerates the
devices in it and satisfies the ATS requirement for them. Please help
to review. I will queue it in VT-d update for v5.12 if no objection.

Yian Chen (3):
  iommu/vt-d: Add new enum value and structure for SATC
  iommu/vt-d: Parse SATC reporting structure
  iommu/vt-d: Apply SATC policy

 drivers/iommu/intel/dmar.c  |   8 ++
 drivers/iommu/intel/iommu.c | 162 +++-
 include/acpi/actbl1.h   |  11 ++-
 include/linux/dmar.h|   2 +
 4 files changed, 178 insertions(+), 5 deletions(-)

-- 
2.25.1



[PATCH v2 1/3] iommu/vt-d: Add new enum value and structure for SATC

2021-02-03 Thread Lu Baolu
From: Yian Chen 

Starting from Intel Platform VT-d v3.2, BIOS may provide new remapping
structure SATC for SOC integrated devices, according to section 8.8 of
Intel VT-d architecture specification v3.2. The SATC structure reports
a list of the devices that require ATS for normal device operation. It
is a functional requirement that these devices will not work without OS
enabling ATS capability.

This patch introduces the new enum value and structure to represent the
remapping information. Kernel should parse the information from the
reporting structure and enable ATC for the devices as needed.

Signed-off-by: Yian Chen 
Signed-off-by: Lu Baolu 
---
 include/acpi/actbl1.h | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
index 43549547ed3e..c38e08cf1b9e 100644
--- a/include/acpi/actbl1.h
+++ b/include/acpi/actbl1.h
@@ -514,7 +514,8 @@ enum acpi_dmar_type {
ACPI_DMAR_TYPE_ROOT_ATS = 2,
ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,
ACPI_DMAR_TYPE_NAMESPACE = 4,
-   ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */
+   ACPI_DMAR_TYPE_SATC = 5,
+   ACPI_DMAR_TYPE_RESERVED = 6 /* 6 and greater are reserved */
 };
 
 /* DMAR Device Scope structure */
@@ -607,6 +608,14 @@ struct acpi_dmar_andd {
char device_name[1];
 };
 
+/* 5: SOC Integrated Address Translation Cache Reporting Structure */
+
+struct acpi_dmar_satc {
+   struct acpi_dmar_header header;
+   u8 flags;
+   u8 reserved;
+   u16 segment;
+};
 
/***
  *
  * DRTM - Dynamic Root of Trust for Measurement table
-- 
2.25.1



[PATCH v2 2/3] iommu/vt-d: Parse SATC reporting structure

2021-02-03 Thread Lu Baolu
From: Yian Chen 

Software should parse every SATC table and all devices in the tables
reported by the BIOS and keep the information in kernel list for further
reference.

Signed-off-by: Yian Chen 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/dmar.c  |  8 
 drivers/iommu/intel/iommu.c | 89 +
 include/linux/dmar.h|  2 +
 3 files changed, 99 insertions(+)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 980e8ae090af..d5c51b5c20af 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -526,6 +526,7 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header)
struct acpi_dmar_reserved_memory *rmrr;
struct acpi_dmar_atsr *atsr;
struct acpi_dmar_rhsa *rhsa;
+   struct acpi_dmar_satc *satc;
 
switch (header->type) {
case ACPI_DMAR_TYPE_HARDWARE_UNIT:
@@ -555,6 +556,10 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header 
*header)
/* We don't print this here because we need to sanity-check
   it first. So print it in dmar_parse_one_andd() instead. */
break;
+   case ACPI_DMAR_TYPE_SATC:
+   satc = container_of(header, struct acpi_dmar_satc, header);
+   pr_info("SATC flags: 0x%x\n", satc->flags);
+   break;
}
 }
 
@@ -642,6 +647,7 @@ parse_dmar_table(void)
	.cb[ACPI_DMAR_TYPE_ROOT_ATS] = &dmar_parse_one_atsr,
	.cb[ACPI_DMAR_TYPE_HARDWARE_AFFINITY] = &dmar_parse_one_rhsa,
	.cb[ACPI_DMAR_TYPE_NAMESPACE] = &dmar_parse_one_andd,
+   .cb[ACPI_DMAR_TYPE_SATC] = &dmar_parse_one_satc,
};
 
/*
@@ -2077,6 +2083,7 @@ static guid_t dmar_hp_guid =
 #defineDMAR_DSM_FUNC_DRHD  1
 #defineDMAR_DSM_FUNC_ATSR  2
 #defineDMAR_DSM_FUNC_RHSA  3
+#defineDMAR_DSM_FUNC_SATC  4
 
 static inline bool dmar_detect_dsm(acpi_handle handle, int func)
 {
@@ -2094,6 +2101,7 @@ static int dmar_walk_dsm_resource(acpi_handle handle, int 
func,
[DMAR_DSM_FUNC_DRHD] = ACPI_DMAR_TYPE_HARDWARE_UNIT,
[DMAR_DSM_FUNC_ATSR] = ACPI_DMAR_TYPE_ROOT_ATS,
[DMAR_DSM_FUNC_RHSA] = ACPI_DMAR_TYPE_HARDWARE_AFFINITY,
+   [DMAR_DSM_FUNC_SATC] = ACPI_DMAR_TYPE_SATC,
};
 
if (!dmar_detect_dsm(handle, func))
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ecbd05d8a1fc..ee0932307d64 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -316,8 +316,18 @@ struct dmar_atsr_unit {
u8 include_all:1;   /* include all ports */
 };
 
+struct dmar_satc_unit {
+   struct list_head list;  /* list of SATC units */
+   struct acpi_dmar_header *hdr;   /* ACPI header */
+   struct dmar_dev_scope *devices; /* target devices */
+   struct intel_iommu *iommu;  /* the corresponding iommu */
+   int devices_cnt;/* target device count */
+   u8 atc_required:1;  /* ATS is required */
+};
+
 static LIST_HEAD(dmar_atsr_units);
 static LIST_HEAD(dmar_rmrr_units);
+static LIST_HEAD(dmar_satc_units);
 
 #define for_each_rmrr_units(rmrr) \
	list_for_each_entry(rmrr, &dmar_rmrr_units, list)
@@ -3716,6 +3726,57 @@ int dmar_check_one_atsr(struct acpi_dmar_header *hdr, 
void *arg)
return 0;
 }
 
+static struct dmar_satc_unit *dmar_find_satc(struct acpi_dmar_satc *satc)
+{
+   struct dmar_satc_unit *satcu;
+   struct acpi_dmar_satc *tmp;
+
+   list_for_each_entry_rcu(satcu, &dmar_satc_units, list,
+   dmar_rcu_check()) {
+   tmp = (struct acpi_dmar_satc *)satcu->hdr;
+   if (satc->segment != tmp->segment)
+   continue;
+   if (satc->header.length != tmp->header.length)
+   continue;
+   if (memcmp(satc, tmp, satc->header.length) == 0)
+   return satcu;
+   }
+
+   return NULL;
+}
+
+int dmar_parse_one_satc(struct acpi_dmar_header *hdr, void *arg)
+{
+   struct acpi_dmar_satc *satc;
+   struct dmar_satc_unit *satcu;
+
+   if (system_state >= SYSTEM_RUNNING && !intel_iommu_enabled)
+   return 0;
+
+   satc = container_of(hdr, struct acpi_dmar_satc, header);
+   satcu = dmar_find_satc(satc);
+   if (satcu)
+   return 0;
+
+   satcu = kzalloc(sizeof(*satcu) + hdr->length, GFP_KERNEL);
+   if (!satcu)
+   return -ENOMEM;
+
+   satcu->hdr = (void *)(satcu + 1);
+   memcpy(satcu->hdr, hdr, hdr->length);
+   satcu->atc_required = satc->flags & 0x1;
+   satcu->devices = dmar_alloc_dev_scope((void *)(satc + 1),
+ (void *)satc + 
satc->header.length,
+   

Re: [PATCH 0/3] iommu/vt-d: Add support for ACPI/SATC table

2021-02-03 Thread Lu Baolu

Hi Christoph,

On 2021/2/3 16:41, Christoph Hellwig wrote:

On Tue, Feb 02, 2021 at 12:40:54PM +0800, Lu Baolu wrote:

Intel platform VT-d (v3.2) comes with a new type of DMAR subtable
SATC. The SATC table includes a list of SoC integrated devices
that require SATC. OS software can use this table to enable ATS
only for the devices in the list.


This all sounds like gibberish. Please expand and if necessary explain
acronyms when first used in each commit log / cover letter.



I will rephrase the words.

Best regards,
baolu


Re: [PATCH 3/3] iommu/vt-d: Apply SATC policy

2021-02-02 Thread Lu Baolu

On 2/2/21 9:55 PM, Joerg Roedel wrote:

On Tue, Feb 02, 2021 at 12:40:57PM +0800, Lu Baolu wrote:

+   list_for_each_entry_rcu(satcu, &dmar_satc_units, list) {
+   satc = container_of(satcu->hdr, struct acpi_dmar_satc, header);
+   if (satc->segment == pci_domain_nr(dev->bus) && satcu->atc_required) {


You can safe a level of indentation and make this look nicer if you do:

		if (satc->segment != pci_domain_nr(dev->bus) || !satcu->atc_required)
continue;




Yes. Thanks!

Best regards,
baolu


+   for_each_dev_scope(satcu->devices, satcu->devices_cnt, i, tmp)
+   if (to_pci_dev(tmp) == dev)
+   goto out;
+   }
+   }
+   ret = 0;
+out:
+   rcu_read_unlock();
+   return ret;
+}


Re: [PATCH 2/3] iommu/vt-d: Parse SATC reporting structure

2021-02-02 Thread Lu Baolu

Hi Ashok,

On 2/3/21 12:41 AM, Raj, Ashok wrote:

On Tue, Feb 02, 2021 at 12:40:56PM +0800, Lu Baolu wrote:

From: Yian Chen 

Software should parse every SATC table and all devices in the tables
reported by the BIOS and keep the information in kernel list for further
SATC policy deployment.


The last part seems bit vague? Are you trying to imply,

if kernel is booted with noats for instance, a device listed in SATC table
as "requires ATS" will fail to load?

Otherwise its not clear what the policy enforcement means.



Yes. This is a bit vague. The ATS policy is out of the scope of this
patch. It only parses the table and keeps the device list for further
reference.

Best regards,
baolu


Re: [PATCH 2/3] iommu/vt-d: Parse SATC reporting structure

2021-02-02 Thread Lu Baolu

Hi Joerg,

On 2/2/21 9:53 PM, Joerg Roedel wrote:

On Tue, Feb 02, 2021 at 12:40:56PM +0800, Lu Baolu wrote:

+   case ACPI_DMAR_TYPE_SATC:
+   satc = container_of(header, struct acpi_dmar_satc, header);
+   pr_info("SATC flags: 0x%x\n", satc->flags);
+   break;


Did the pr_info() slip through or is there a real purpose for having
this information in the kernel log?



This is just the easiest way for users to learn about SATC from the log:
does the system have SATC? Is ATS required?

Best regards,
baolu


Re: [PATCH 1/3] iommu/vt-d: Add new enum value and structure for SATC

2021-02-02 Thread Lu Baolu

Hi Ashok,

On 2/3/21 12:02 AM, Raj, Ashok wrote:

On Tue, Feb 02, 2021 at 12:40:55PM +0800, Lu Baolu wrote:

From: Yian Chen 

Starting from Intel Platform VT-d v3.2, BIOS may provide new remapping
structure SATC for SOC integrated devices, according to section 8.8 of
Intel VT-d architecture specification v3.2. The SATC structure reports
a list of the devices that require SATC enabling via ATS capacity.


nit: s/require SATC/require ATS for normal device operation. This is a
functional requirement that these devices will not work without OS enabling
ATS capability.



Yes. This looks clearer.

Best regards,
baolu



This patch introduces the new enum value and structure to represent the
remapping information. Kernel should parse the information from the
reporting structure and enable ATC for the devices as needed.

Signed-off-by: Yian Chen 
---
  include/acpi/actbl1.h | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
index 43549547ed3e..b7ca802b66d2 100644
--- a/include/acpi/actbl1.h
+++ b/include/acpi/actbl1.h
@@ -514,7 +514,8 @@ enum acpi_dmar_type {
ACPI_DMAR_TYPE_ROOT_ATS = 2,
ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,
ACPI_DMAR_TYPE_NAMESPACE = 4,
-   ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */
+   ACPI_DMAR_TYPE_SATC = 5,
+   ACPI_DMAR_TYPE_RESERVED = 6 /* 5 and greater are reserved */
  };
  


Think Joerg spotted the comment update.



Re: [PATCH 1/3] iommu/vt-d: Add new enum value and structure for SATC

2021-02-02 Thread Lu Baolu

Hi Joerg,

On 2/2/21 9:51 PM, Joerg Roedel wrote:

Hi Baolu,

On Tue, Feb 02, 2021 at 12:40:55PM +0800, Lu Baolu wrote:

@@ -514,7 +514,8 @@ enum acpi_dmar_type {
ACPI_DMAR_TYPE_ROOT_ATS = 2,
ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,
ACPI_DMAR_TYPE_NAMESPACE = 4,
-   ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */
+   ACPI_DMAR_TYPE_SATC = 5,
+   ACPI_DMAR_TYPE_RESERVED = 6 /* 5 and greater are reserved */


Nit: The comment needs to be updated too.



Yes. Will update it.

Best regards,
baolu


[PATCH 3/3] iommu/vt-d: Apply SATC policy

2021-02-01 Thread Lu Baolu
From: Yian Chen 

Starting from Intel VT-d v3.2, Intel platform BIOS can provide a new SATC
table structure. SATC table lists a set of SoC integrated devices that
require ATC to work (VT-d specification v3.2, section 8.8). Furthermore,
the new version of IOMMU supports SoC device ATS in both its Scalable mode
and legacy mode.

When IOMMU is working in scalable mode, software must enable device ATS
support. On the other hand, when IOMMU is in legacy mode for whatever
reason, the hardware managed ATS will automatically take effect and the
SATC required devices can work transparently to the software. As the
result, software shouldn't enable ATS on that device, otherwise duplicate
device TLB invalidations will occur.

Signed-off-by: Yian Chen 
---
 drivers/iommu/intel/iommu.c | 72 ++---
 1 file changed, 68 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b20fd305fc60..131fac718a4b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -872,6 +872,59 @@ static bool iommu_is_dummy(struct intel_iommu *iommu, 
struct device *dev)
return false;
 }
 
+static bool iommu_support_ats(struct intel_iommu *iommu)
+{
+   return ecap_dev_iotlb_support(iommu->ecap);
+}
+
+static bool device_support_ats(struct pci_dev *dev)
+{
+   return pci_ats_supported(dev) && dmar_find_matched_atsr_unit(dev);
+}
+
+static int segment_atc_required(u16 segment)
+{
+   struct acpi_dmar_satc *satc;
+   struct dmar_satc_unit *satcu;
+   int ret = 1;
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(satcu, &dmar_satc_units, list) {
+   satc = container_of(satcu->hdr, struct acpi_dmar_satc, header);
+   if (satcu->atc_required &&
+   satcu->devices_cnt &&
+   satc->segment == segment)
+   goto out;
+   }
+   ret = 0;
+out:
+   rcu_read_unlock();
+   return ret;
+}
+
+static int device_atc_required(struct pci_dev *dev)
+{
+   struct dmar_satc_unit *satcu;
+   struct acpi_dmar_satc *satc;
+   struct device *tmp;
+   int i, ret = 1;
+
+   dev = pci_physfn(dev);
+   rcu_read_lock();
+   list_for_each_entry_rcu(satcu, &dmar_satc_units, list) {
+   satc = container_of(satcu->hdr, struct acpi_dmar_satc, header);
+   if (satc->segment == pci_domain_nr(dev->bus) && satcu->atc_required) {
+   for_each_dev_scope(satcu->devices, satcu->devices_cnt, i, tmp)
+   if (to_pci_dev(tmp) == dev)
+   goto out;
+   }
+   }
+   ret = 0;
+out:
+   rcu_read_unlock();
+   return ret;
+}
+
 struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn)
 {
struct dmar_drhd_unit *drhd = NULL;
@@ -2575,10 +2628,16 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
 
-   if (ecap_dev_iotlb_support(iommu->ecap) &&
-   pci_ats_supported(pdev) &&
-   dmar_find_matched_atsr_unit(pdev))
-   info->ats_supported = 1;
+   /*
+* Support ATS by default if it's supported by both IOMMU and
+* client sides, except that the device's ATS is required by
+* ACPI/SATC but the IOMMU scalable mode is disabled. In that
+* case the hardware managed ATS will be automatically used.
+*/
+   if (iommu_support_ats(iommu) && device_support_ats(pdev)) {
+   if (!device_atc_required(pdev) || sm_supported(iommu))
+   info->ats_supported = 1;
+   }
 
if (sm_supported(iommu)) {
if (pasid_supported(iommu)) {
@@ -3175,6 +3234,11 @@ static int __init init_dmars(void)
 * endfor
 */
for_each_drhd_unit(drhd) {
+   if (pci_ats_disabled() && segment_atc_required(drhd->segment)) {
+   pr_warn("Scalable mode disabled -- Use hardware managed ATS because PCIe ATS is disabled but some devices in PCIe segment 0x%x require it.",
+   drhd->segment);
+   intel_iommu_sm = 0;
+   }
/*
 * lock not needed as this is only incremented in the single
 * threaded kernel __init code path all other access are read
-- 
2.25.1



[PATCH 2/3] iommu/vt-d: Parse SATC reporting structure

2021-02-01 Thread Lu Baolu
From: Yian Chen 

Software should parse every SATC table and all devices in the tables
reported by the BIOS and keep the information in kernel list for further
SATC policy deployment.

Signed-off-by: Yian Chen 
---
 drivers/iommu/intel/dmar.c  |  9 
 drivers/iommu/intel/iommu.c | 89 +
 include/linux/dmar.h|  2 +
 3 files changed, 100 insertions(+)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 980e8ae090af..6bd357966d4e 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -526,6 +526,7 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header *header)
struct acpi_dmar_reserved_memory *rmrr;
struct acpi_dmar_atsr *atsr;
struct acpi_dmar_rhsa *rhsa;
+   struct acpi_dmar_satc *satc;
 
switch (header->type) {
case ACPI_DMAR_TYPE_HARDWARE_UNIT:
@@ -555,6 +556,11 @@ dmar_table_print_dmar_entry(struct acpi_dmar_header 
*header)
/* We don't print this here because we need to sanity-check
   it first. So print it in dmar_parse_one_andd() instead. */
break;
+   case ACPI_DMAR_TYPE_SATC:
+   satc = container_of(header, struct acpi_dmar_satc, header);
+   pr_info("SATC flags: 0x%x\n", satc->flags);
+   break;
+
}
 }
 
@@ -642,6 +648,7 @@ parse_dmar_table(void)
	.cb[ACPI_DMAR_TYPE_ROOT_ATS] = &dmar_parse_one_atsr,
	.cb[ACPI_DMAR_TYPE_HARDWARE_AFFINITY] = &dmar_parse_one_rhsa,
	.cb[ACPI_DMAR_TYPE_NAMESPACE] = &dmar_parse_one_andd,
+   .cb[ACPI_DMAR_TYPE_SATC] = &dmar_parse_one_satc,
};
 
/*
@@ -2077,6 +2084,7 @@ static guid_t dmar_hp_guid =
 #defineDMAR_DSM_FUNC_DRHD  1
 #defineDMAR_DSM_FUNC_ATSR  2
 #defineDMAR_DSM_FUNC_RHSA  3
+#defineDMAR_DSM_FUNC_SATC  4
 
 static inline bool dmar_detect_dsm(acpi_handle handle, int func)
 {
@@ -2094,6 +2102,7 @@ static int dmar_walk_dsm_resource(acpi_handle handle, int 
func,
[DMAR_DSM_FUNC_DRHD] = ACPI_DMAR_TYPE_HARDWARE_UNIT,
[DMAR_DSM_FUNC_ATSR] = ACPI_DMAR_TYPE_ROOT_ATS,
[DMAR_DSM_FUNC_RHSA] = ACPI_DMAR_TYPE_HARDWARE_AFFINITY,
+   [DMAR_DSM_FUNC_SATC] = ACPI_DMAR_TYPE_SATC,
};
 
if (!dmar_detect_dsm(handle, func))
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ecd36e456de3..b20fd305fc60 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -316,8 +316,18 @@ struct dmar_atsr_unit {
u8 include_all:1;   /* include all ports */
 };
 
+struct dmar_satc_unit {
+   struct list_head list;  /* list of SATC units */
+   struct acpi_dmar_header *hdr;   /* ACPI header */
+   struct dmar_dev_scope *devices; /* target devices */
+   struct intel_iommu *iommu;  /* the corresponding iommu */
+   int devices_cnt;/* target device count */
+   u8 atc_required:1;  /* ATS is required */
+};
+
 static LIST_HEAD(dmar_atsr_units);
 static LIST_HEAD(dmar_rmrr_units);
+static LIST_HEAD(dmar_satc_units);
 
 #define for_each_rmrr_units(rmrr) \
	list_for_each_entry(rmrr, &dmar_rmrr_units, list)
@@ -3736,6 +3746,57 @@ int dmar_check_one_atsr(struct acpi_dmar_header *hdr, 
void *arg)
return 0;
 }
 
+static struct dmar_satc_unit *dmar_find_satc(struct acpi_dmar_satc *satc)
+{
+   struct dmar_satc_unit *satcu;
+   struct acpi_dmar_satc *tmp;
+
+   list_for_each_entry_rcu(satcu, &dmar_satc_units, list,
+   dmar_rcu_check()) {
+   tmp = (struct acpi_dmar_satc *)satcu->hdr;
+   if (satc->segment != tmp->segment)
+   continue;
+   if (satc->header.length != tmp->header.length)
+   continue;
+   if (memcmp(satc, tmp, satc->header.length) == 0)
+   return satcu;
+   }
+
+   return NULL;
+}
+
+int dmar_parse_one_satc(struct acpi_dmar_header *hdr, void *arg)
+{
+   struct acpi_dmar_satc *satc;
+   struct dmar_satc_unit *satcu;
+
+   if (system_state >= SYSTEM_RUNNING && !intel_iommu_enabled)
+   return 0;
+
+   satc = container_of(hdr, struct acpi_dmar_satc, header);
+   satcu = dmar_find_satc(satc);
+   if (satcu)
+   return 0;
+
+   satcu = kzalloc(sizeof(*satcu) + hdr->length, GFP_KERNEL);
+   if (!satcu)
+   return -ENOMEM;
+
+   satcu->hdr = (void *)(satcu + 1);
+   memcpy(satcu->hdr, hdr, hdr->length);
+   satcu->atc_required = satc->flags & 0x1;
+   satcu->devices = dmar_alloc_dev_scope((void *)(satc + 1),
+ (void *)satc + satc->header.length,
+ &satcu->devices_cnt);
+   if (satcu->devices_cnt && !satcu->devices) 

[PATCH 1/3] iommu/vt-d: Add new enum value and structure for SATC

2021-02-01 Thread Lu Baolu
From: Yian Chen 

Starting from Intel Platform VT-d v3.2, BIOS may provide new remapping
structure SATC for SOC integrated devices, according to section 8.8 of
Intel VT-d architecture specification v3.2. The SATC structure reports
a list of the devices that require SATC enabling via ATS capacity.

This patch introduces the new enum value and structure to represent the
remapping information. Kernel should parse the information from the
reporting structure and enable ATC for the devices as needed.

Signed-off-by: Yian Chen 
---
 include/acpi/actbl1.h | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
index 43549547ed3e..b7ca802b66d2 100644
--- a/include/acpi/actbl1.h
+++ b/include/acpi/actbl1.h
@@ -514,7 +514,8 @@ enum acpi_dmar_type {
ACPI_DMAR_TYPE_ROOT_ATS = 2,
ACPI_DMAR_TYPE_HARDWARE_AFFINITY = 3,
ACPI_DMAR_TYPE_NAMESPACE = 4,
-   ACPI_DMAR_TYPE_RESERVED = 5 /* 5 and greater are reserved */
+   ACPI_DMAR_TYPE_SATC = 5,
+   ACPI_DMAR_TYPE_RESERVED = 6 /* 5 and greater are reserved */
 };
 
 /* DMAR Device Scope structure */
@@ -607,6 +608,14 @@ struct acpi_dmar_andd {
char device_name[1];
 };
 
+/* 5: SOC Integrated Address Translation Cache Reporting Structure */
+
+struct acpi_dmar_satc {
+   struct acpi_dmar_header header;
+   u8 flags;
+   u8 reserved;
+   u16 segment;
+};
 
/***
  *
  * DRTM - Dynamic Root of Trust for Measurement table
-- 
2.25.1



[PATCH 0/3] iommu/vt-d: Add support for ACPI/SATC table

2021-02-01 Thread Lu Baolu
Intel platform VT-d (v3.2) comes with a new type of DMAR subtable
SATC. The SATC table includes a list of SoC integrated devices
that require SATC. OS software can use this table to enable ATS
only for the devices in the list.

Yian Chen (3):
  iommu/vt-d: Add new enum value and structure for SATC
  iommu/vt-d: Parse SATC reporting structure
  iommu/vt-d: Apply SATC policy

 drivers/iommu/intel/dmar.c  |   9 ++
 drivers/iommu/intel/iommu.c | 161 +++-
 include/acpi/actbl1.h   |  11 ++-
 include/linux/dmar.h|   2 +
 4 files changed, 178 insertions(+), 5 deletions(-)

-- 
2.25.1



[PATCH 1/1] iommu/vt-d: Fix compile error [-Werror=implicit-function-declaration]

2021-01-30 Thread Lu Baolu
trace_qi_submit() could be used when interrupt remapping is supported,
but DMA remapping is not. In this case, the following compile error
occurs.

../drivers/iommu/intel/dmar.c: In function 'qi_submit_sync':
../drivers/iommu/intel/dmar.c:1311:3: error: implicit declaration of function 
'trace_qi_submit';
  did you mean 'ftrace_nmi_exit'? [-Werror=implicit-function-declaration]
   trace_qi_submit(iommu, desc[i].qw0, desc[i].qw1,
   ^~~
   ftrace_nmi_exit

Fixes: f2dd871799ba5 ("iommu/vt-d: Add qi_submit trace event")
Reported-by: kernel test robot 
Reported-by: Randy Dunlap 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/Makefile   | 2 +-
 drivers/iommu/intel/iommu.c| 1 -
 include/trace/events/intel_iommu.h | 2 --
 3 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
index fb8e1e8c8029..ae570810a35e 100644
--- a/drivers/iommu/intel/Makefile
+++ b/drivers/iommu/intel/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DMAR_TABLE) += dmar.o
 obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o
-obj-$(CONFIG_INTEL_IOMMU) += trace.o
+obj-$(CONFIG_DMAR_TABLE) += trace.o
 obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
 obj-$(CONFIG_INTEL_IOMMU_SVM) += svm.o
 obj-$(CONFIG_IRQ_REMAP) += irq_remapping.o
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d7fecc109947..37da4caa67c9 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -44,7 +44,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "../irq_remapping.h"
 #include "pasid.h"
diff --git a/include/trace/events/intel_iommu.h 
b/include/trace/events/intel_iommu.h
index aad2ff0c1e2e..e801f4910522 100644
--- a/include/trace/events/intel_iommu.h
+++ b/include/trace/events/intel_iommu.h
@@ -6,7 +6,6 @@
  *
  * Author: Lu Baolu 
  */
-#ifdef CONFIG_INTEL_IOMMU
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM intel_iommu
 
@@ -176,4 +175,3 @@ TRACE_EVENT(qi_submit,
 
 /* This part must be outside protection */
 #include 
-#endif /* CONFIG_INTEL_IOMMU */
-- 
2.25.1



Re: [PATCH v3] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-27 Thread Lu Baolu

On 1/28/21 1:53 AM, Nadav Amit wrote:

From: Nadav Amit 

When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "caching-mode" capability that a virtual IOMMU uses to report
that the IOMMU is virtualized and a TLB flush is needed after mapping to
allow the hypervisor to propagate virtual IOMMU mappings to the physical
IOMMU. To the best of my knowledge no real physical IOMMU reports
"caching-mode" as turned on.

Synchronizing the virtual and the physical IOMMU tables is expensive if
the hypervisor is unaware which PTEs have changed, as the hypervisor is
required to walk all the virtualized tables and look for changes.
Consequently, domain flushes are much more expensive than page-specific
flushes on virtualized IOMMUs with passthrough devices. The kernel
therefore exploited the "caching-mode" indication to avoid domain
flushing and use page-specific flushing in virtualized environments. See
commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching
mode.")

This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use
of iova deferred flushing"). Now, when batched TLB flushing is used (the
default), full TLB domain flushes are performed frequently, requiring
the hypervisor to perform expensive synchronization between the virtual
TLB and the physical one.

Getting batched TLB flushes to use page-specific invalidations again in
such circumstances is not easy, since the TLB invalidation scheme
assumes that "full" domain TLB flushes are performed for scalability.

Disable batched TLB flushes when caching-mode is on, as the performance
benefit from using batched TLB invalidations is likely to be much
smaller than the overhead of the virtual-to-physical IOMMU page-tables
synchronization.

Fixes: 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing")
Signed-off-by: Nadav Amit 
Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: sta...@vger.kernel.org

---
v2->v3:

* Fix the fixes tag in the commit-log (Lu).
* Minor English rephrasing of the commit-log.

v1->v2:

* disable flush queue for all domains if caching-mode is on for any
   IOMMU (Lu).
---
  drivers/iommu/intel/iommu.c | 32 +++-
  1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c5b021..de3dd617cf60 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5373,6 +5373,36 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
return ret;
  }
  
+static bool domain_use_flush_queue(void)

+{
+   struct dmar_drhd_unit *drhd;
+   struct intel_iommu *iommu;
+   bool r = true;
+
+   if (intel_iommu_strict)
+   return false;
+
+   /*
+* The flush queue implementation does not perform page-selective
+* invalidations that are required for efficient TLB flushes in virtual
+* environments. The benefit of batching is likely to be much lower than
+* the overhead of synchronizing the virtual and physical IOMMU
+* page-tables.
+*/
+   rcu_read_lock();
+   for_each_active_iommu(iommu, drhd) {
+   if (!cap_caching_mode(iommu->cap))
+   continue;
+
+   pr_warn_once("IOMMU batching is disabled due to virtualization");
+   r = false;
+   break;
+   }
+   rcu_read_unlock();
+
+   return r;
+}
+
  static int
  intel_iommu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
@@ -5383,7 +5413,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain,
case IOMMU_DOMAIN_DMA:
switch (attr) {
case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = !intel_iommu_strict;
+   *(int *)data = domain_use_flush_queue();
    return 0;
default:
return -ENODEV;



Acked-by: Lu Baolu 

Best regards,
baolu


Re: [PATCH v2] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-27 Thread Lu Baolu

On 2021/1/27 14:17, Nadav Amit wrote:

From: Nadav Amit 

When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "caching-mode" capability that a virtual IOMMU uses to report
that the IOMMU is virtualized and a TLB flush is needed after mapping to
allow the hypervisor to propagate virtual IOMMU mappings to the physical
IOMMU. To the best of my knowledge no real physical IOMMU reports
"caching-mode" as turned on.

Synchronizing the virtual and the physical IOMMU tables is expensive if
the hypervisor is unaware which PTEs have changed, as the hypervisor is
required to walk all the virtualized tables and look for changes.
Consequently, domain flushes are much more expensive than page-specific
flushes on virtualized IOMMUs with passthrough devices. The kernel
therefore exploited the "caching-mode" indication to avoid domain
flushing and use page-specific flushing in virtualized environments. See
commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching
mode.")

This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use
of iova deferred flushing"). Now, when batched TLB flushing is used (the
default), full TLB domain flushes are performed frequently, requiring
the hypervisor to perform expensive synchronization between the virtual
TLB and the physical one.

Getting batched TLB flushes to use in such circumstances page-specific
invalidations again is not easy, since the TLB invalidation scheme
assumes that "full" domain TLB flushes are performed for scalability.

Disable batched TLB flushes when caching-mode is on, as the performance
benefit from using batched TLB invalidations is likely to be much
smaller than the overhead of the virtual-to-physical IOMMU page-tables
synchronization.

Fixes: 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.")


Isn't it

Fixes: 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing")

?

Best regards,
baolu


Signed-off-by: Nadav Amit 
Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: sta...@vger.kernel.org >
---
v1->v2:

* disable flush queue for all domains if caching-mode is on for any
   IOMMU (Lu).
---
  drivers/iommu/intel/iommu.c | 32 +++-
  1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c5b021..de3dd617cf60 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5373,6 +5373,36 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
return ret;
  }
  
+static bool domain_use_flush_queue(void)

+{
+   struct dmar_drhd_unit *drhd;
+   struct intel_iommu *iommu;
+   bool r = true;
+
+   if (intel_iommu_strict)
+   return false;
+
+   /*
+* The flush queue implementation does not perform page-selective
+* invalidations that are required for efficient TLB flushes in virtual
+* environments. The benefit of batching is likely to be much lower than
+* the overhead of synchronizing the virtual and physical IOMMU
+* page-tables.
+*/
+   rcu_read_lock();
+   for_each_active_iommu(iommu, drhd) {
+   if (!cap_caching_mode(iommu->cap))
+   continue;
+
+   pr_warn_once("IOMMU batching is disabled due to virtualization");
+   r = false;
+   break;
+   }
+   rcu_read_unlock();
+
+   return r;
+}
+
  static int
  intel_iommu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
@@ -5383,7 +5413,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain,
case IOMMU_DOMAIN_DMA:
switch (attr) {
case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = !intel_iommu_strict;
+   *(int *)data = domain_use_flush_queue();
return 0;
default:
return -ENODEV;



Re: [PATCH] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-26 Thread Lu Baolu

On 1/27/21 8:26 AM, Lu Baolu wrote:

+{
+    struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+    struct intel_iommu *iommu = domain_get_iommu(dmar_domain);
+
+    if (intel_iommu_strict)
+    return 0;
+
+    /*
+ * The flush queue implementation does not perform page-selective
+ * invalidations that are required for efficient TLB flushes in 
virtual
+ * environments. The benefit of batching is likely to be much 
lower than

+ * the overhead of synchronizing the virtual and physical IOMMU
+ * page-tables.
+ */
+    if (iommu && cap_caching_mode(iommu->cap)) {
+    pr_warn_once("IOMMU batching is partially disabled due to virtualization");

+    return 0;
+    }


domain_get_iommu() only returns the first iommu, and could return NULL
when this is called before domain attaching to any device. A better
choice could be check caching mode globally and return false if caching
mode is supported on any iommu.

    struct dmar_drhd_unit *drhd;
    struct intel_iommu *iommu;

    rcu_read_lock();
    for_each_active_iommu(iommu, drhd) {
     if (cap_caching_mode(iommu->cap))
     return false;


We should unlock rcu before return here. Sorry!


     }
     rcu_read_unlock();


Best regards,
baolu


Re: [PATCH] iommu/vt-d: do not use flush-queue when caching-mode is on

2021-01-26 Thread Lu Baolu

Hi Nadav,

On 1/27/21 4:38 AM, Nadav Amit wrote:

From: Nadav Amit 

When an Intel IOMMU is virtualized, and a physical device is
passed-through to the VM, changes of the virtual IOMMU need to be
propagated to the physical IOMMU. The hypervisor therefore needs to
monitor PTE mappings in the IOMMU page-tables. Intel specifications
provide "caching-mode" capability that a virtual IOMMU uses to report
that the IOMMU is virtualized and a TLB flush is needed after mapping to
allow the hypervisor to propagate virtual IOMMU mappings to the physical
IOMMU. To the best of my knowledge no real physical IOMMU reports
"caching-mode" as turned on.

Synchronizing the virtual and the physical TLBs is expensive if the
hypervisor is unaware which PTEs have changed, as the hypervisor is
required to walk all the virtualized tables and look for changes.
Consequently, domain flushes are much more expensive than page-specific
flushes on virtualized IOMMUs with passthrough devices. The kernel
therefore exploited the "caching-mode" indication to avoid domain
flushing and use page-specific flushing in virtualized environments. See
commit 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching
mode.")

This behavior changed after commit 13cf01744608 ("iommu/vt-d: Make use
of iova deferred flushing"). Now, when batched TLB flushing is used (the
default), full TLB domain flushes are performed frequently, requiring
the hypervisor to perform expensive synchronization between the virtual
TLB and the physical one.


Good catch. Thank you!



Getting batched TLB flushes to use in such circumstances page-specific
invalidations again is not easy, since the TLB invalidation scheme
assumes that "full" domain TLB flushes are performed for scalability.

Disable batched TLB flushes when caching-mode is on, as the performance
benefit from using batched TLB invalidations is likely to be much
smaller than the overhead of the virtual-to-physical IOMMU page-tables
synchronization.

Fixes: 78d5f0f500e6 ("intel-iommu: Avoid global flushes with caching mode.")
Signed-off-by: Nadav Amit 
Cc: David Woodhouse 
Cc: Lu Baolu 
Cc: Joerg Roedel 
Cc: Will Deacon 
Cc: sta...@vger.kernel.org
---
  drivers/iommu/intel/iommu.c | 26 +-
  1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 788119c5b021..4e08f5e17175 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5373,6 +5373,30 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
return ret;
  }
  
+static int

+intel_iommu_domain_get_attr_use_flush_queue(struct iommu_domain *domain)


Just nit:

Can we just use this

static bool domain_use_flush_queue(struct iommu_domain *domain)

?


+{
+   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   struct intel_iommu *iommu = domain_get_iommu(dmar_domain);
+
+   if (intel_iommu_strict)
+   return 0;
+
+   /*
+* The flush queue implementation does not perform page-selective
+* invalidations that are required for efficient TLB flushes in virtual
+* environments. The benefit of batching is likely to be much lower than
+* the overhead of synchronizing the virtual and physical IOMMU
+* page-tables.
+*/
+   if (iommu && cap_caching_mode(iommu->cap)) {
+   pr_warn_once("IOMMU batching is partially disabled due to virtualization");
+   return 0;
+   }


domain_get_iommu() only returns the first iommu, and could return NULL
when this is called before domain attaching to any device. A better
choice could be check caching mode globally and return false if caching
mode is supported on any iommu.

   struct dmar_drhd_unit *drhd;
   struct intel_iommu *iommu;

   rcu_read_lock();
   for_each_active_iommu(iommu, drhd) {
if (cap_caching_mode(iommu->cap))
return false;
}
rcu_read_unlock();


+
+   return 1;
+}
+
  static int
  intel_iommu_domain_get_attr(struct iommu_domain *domain,
enum iommu_attr attr, void *data)
@@ -5383,7 +5407,7 @@ intel_iommu_domain_get_attr(struct iommu_domain *domain,
case IOMMU_DOMAIN_DMA:
switch (attr) {
case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-   *(int *)data = !intel_iommu_strict;
+   *(int *)data = !intel_iommu_domain_get_attr_use_flush_queue(domain);
return 0;
default:
return -ENODEV;



Best regards,
baolu


[PATCH v2 1/2] iommu/vt-d: Clear PRQ overflow only when PRQ is empty

2021-01-26 Thread Lu Baolu
It is incorrect to always clear PRO when it's set w/o first checking
whether the overflow condition has been cleared. Current code assumes
that if an overflow condition occurs it must have been cleared by earlier
loop. However since the code runs in a threaded context, the overflow
condition could occur even after setting the head to the tail under some
extreme condition. To be sane, we should read both head/tail again when
seeing a pending PRO and only clear PRO after all pending PRs have been
handled.

Suggested-by: Kevin Tian 
Link: 
https://lore.kernel.org/linux-iommu/mwhpr11mb18862d2ea5bd432bf22d99a48c...@mwhpr11mb1886.namprd11.prod.outlook.com/
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 033b25886e57..d7c98c5fa4e7 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -1042,8 +1042,17 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 * Clear the page request overflow bit and wake up all threads that
 * are waiting for the completion of this handling.
 */
-   if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO)
-   writel(DMA_PRS_PRO, iommu->reg + DMAR_PRS_REG);
+   if (readl(iommu->reg + DMAR_PRS_REG) & DMA_PRS_PRO) {
+   pr_info_ratelimited("IOMMU: %s: PRQ overflow detected\n",
+   iommu->name);
+   head = dmar_readq(iommu->reg + DMAR_PQH_REG) & PRQ_RING_MASK;
+   tail = dmar_readq(iommu->reg + DMAR_PQT_REG) & PRQ_RING_MASK;
+   if (head == tail) {
+   writel(DMA_PRS_PRO, iommu->reg + DMAR_PRS_REG);
+   pr_info_ratelimited("IOMMU: %s: PRQ overflow cleared",
+   iommu->name);
+   }
+   }
 
	if (!completion_done(&iommu->prq_complete))
		complete(&iommu->prq_complete);
-- 
2.25.1



[PATCH v2 2/2] iommu/vt-d: Use INVALID response code instead of FAILURE

2021-01-26 Thread Lu Baolu
The VT-d IOMMU responds with RESPONSE_FAILURE for a page request in the
following cases:

- When it gets a Page_Request with no PASID;
- When it gets a Page_Request with PASID that is not in use for this
  device.

This is allowed by the spec, but IOMMU driver doesn't support such cases
today. When the device receives RESPONSE_FAILURE, it sends the device
state machine to HALT state. Now if we try to unload the driver, it hangs
since the device doesn't send any outbound transactions to host when the
driver is trying to clear things up. The only possible responses would be
for invalidation requests.

Let's use RESPONSE_INVALID instead for now, so that the device state
machine doesn't enter HALT state.

Suggested-by: Ashok Raj 
Signed-off-by: Lu Baolu 
---
 drivers/iommu/intel/svm.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index d7c98c5fa4e7..574a7e657a9a 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -911,10 +911,8 @@ static irqreturn_t prq_event_thread(int irq, void *d)
u64 address;
 
handled = 1;
-
req = >prq[head / sizeof(*req)];
-
-   result = QI_RESP_FAILURE;
+   result = QI_RESP_INVALID;
address = (u64)req->addr << VTD_PAGE_SHIFT;
if (!req->pasid_present) {
-   pr_err("%s: Page request without PASID: %08llx %08llx\n",
@@ -952,7 +950,6 @@ static irqreturn_t prq_event_thread(int irq, void *d)
rcu_read_unlock();
}
 
-   result = QI_RESP_INVALID;
/* Since we're using init_mm.pgd directly, we should never take
 * any faults on kernel addresses. */
if (!svm->mm)
-- 
2.25.1


