Re: [PATCH] iommu/vt-d: Increase DMAR_UNITS_SUPPORTED

2022-05-05 Thread Baolu Lu

On 2022/5/6 03:46, Steve Wahl wrote:

Increase DMAR_UNITS_SUPPORTED to support 64 sockets with 10 DMAR units
each, for a total of 640.

If the available hardware exceeds DMAR_UNITS_SUPPORTED (previously set
to MAX_IO_APICS, or 128), it causes these messages: "DMAR: Failed to
allocate seq_id", "DMAR: Parse DMAR table failure.", and "x2apic: IRQ
remapping doesn't support X2APIC mode x2apic disabled"; and the system
fails to boot.

Signed-off-by: Steve Wahl 
Reviewed-by: Mike Travis 
---

Note that we could not find a reason for connecting
DMAR_UNITS_SUPPORTED to MAX_IO_APICS as was done previously.  Perhaps
it seemed like the two would continue to match on earlier processors.
There doesn't appear to be kernel code that assumes that the value of
one is related to the other.
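For context, the limit bites during DRHD unit enumeration, where each unit
gets a seq_id from a bitmap sized by DMAR_UNITS_SUPPORTED. The sketch below
paraphrases drivers/iommu/intel/dmar.c from around this time; details may
differ, so treat it as illustrative only:

    /* Each DRHD unit takes a seq_id from a DMAR_UNITS_SUPPORTED-sized
     * bitmap; the 129th unit fails on a 128-unit build, producing the
     * "Failed to allocate seq_id" message quoted above. */
    static int dmar_alloc_seq_id(struct intel_iommu *iommu)
    {
            iommu->seq_id = find_first_zero_bit(dmar_seq_ids,
                                                DMAR_UNITS_SUPPORTED);
            if (iommu->seq_id >= DMAR_UNITS_SUPPORTED) {
                    iommu->seq_id = -1;
            } else {
                    set_bit(iommu->seq_id, dmar_seq_ids);
                    sprintf(iommu->name, "dmar%d", iommu->seq_id);
            }

            return iommu->seq_id;
    }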


+Kevin

This maximum value was introduced by the commit below, and I don't see any
hardware/software restriction that prevents us from enlarging it ten years
later.

commit 1b198bb04ad72669d4bd6575fc9945ed595bfee0
Author: Mike Travis 
Date:   Mon Mar 5 15:05:16 2012 -0800

x86/iommu/intel: Increase the number of iommus supported to MAX_IO_APICS


The number of IOMMUs supported should be the same as the number
of IO APICS.  This limit comes into play when the IOMMUs are
identity mapped, thus the number of possible IOMMUs in the
"static identity" (si) domain should be this same number.
[...]



  include/linux/dmar.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 45e903d84733..9d4867b8f42e 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -19,7 +19,7 @@
  struct acpi_dmar_header;
  
  #ifdef	CONFIG_X86

-# define	DMAR_UNITS_SUPPORTED	MAX_IO_APICS
+# define	DMAR_UNITS_SUPPORTED	640
  #else
  # define	DMAR_UNITS_SUPPORTED	64
  #endif


Best regards,
baolu


Re: [PATCH v5 10/12] iommu: Prepare IOMMU domain for IOPF

2022-05-05 Thread Baolu Lu

On 2022/5/5 21:38, Jean-Philippe Brucker wrote:

Hi Baolu,

On Thu, May 05, 2022 at 04:31:38PM +0800, Baolu Lu wrote:

On 2022/5/4 02:20, Jean-Philippe Brucker wrote:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7cae631c1baa..33449523afbe 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3174,3 +3174,24 @@ void iommu_detach_device_pasid(struct iommu_domain 
*domain,
iommu_group_put(group);
   }
+
+struct iommu_domain *iommu_get_domain_for_dev_pasid(struct device *dev,
+   ioasid_t pasid)
+{
+   struct iommu_domain *domain;
+   struct iommu_group *group;
+
+   if (!pasid_valid(pasid))
+   return NULL;
+
+   group = iommu_group_get(dev);
+   if (!group)
+   return NULL;
+
+   mutex_lock(&group->mutex);

Unfortunately this still causes the deadlock when unbind() flushes the
IOPF queue while holding the group mutex.


Sorry, I didn't get your point here.

Do you mean unbind() could hold group mutex before calling this helper?
The group mutex is only available in iommu.c. The unbind() has no means
to hold this lock. Or, I missed anything?


I wasn't clear, it's iommu_detach_device_pasid() that holds the
group->mutex:

  iommu_sva_unbind_device()          |
   iommu_detach_device_pasid()       |
    mutex_lock(&group->mutex)        |
    domain->ops->detach_dev_pasid()  | iopf_handle_group()
     iopf_queue_flush_dev()          |  iommu_get_domain_for_dev_pasid()
      ... wait for IOPF work         |   mutex_lock(&group->mutex)
                                     |   ... deadlock


Ah! Yes. Thank you for the clarification.



Thanks,
Jean



Best regards,
baolu



If we make this function private to IOPF, then we can get rid of this
mutex_lock(). It's OK because:

* xarray protects its internal state with RCU, so we can call
xa_load() outside the lock.

* The domain obtained from xa_load is finalized. Its content is valid
because xarray stores the domain using rcu_assign_pointer(), which has a
release memory barrier, which pairs with data dependencies in IOPF
(domain->sva_ioas etc).

We'll need to be careful about this when allowing other users to install
a fault handler. Should be fine as long as the handler and data are
installed before the domain is added to pasid_array.

* We know the domain is valid the whole time IOPF is using it, because
unbind() waits for pending faults.

We just need a comment explaining the last point, something like:

	/*
	 * Safe to fetch outside the group mutex because:
	 *  - xarray protects its internal state with RCU
	 *  - the domain obtained is either NULL or fully formed
	 *  - the IOPF work is the only caller and is flushed before the
	 *    domain is freed.
	 */


Agreed. The mutex is needed only when the domain could possibly be freed
before unbind(). In that case, we would need this mutex and take a reference
to the domain. As we have dropped the domain user reference, this lock is
unnecessary.



Thanks,
Jean


+   domain = xa_load(&group->pasid_array, pasid);
+   mutex_unlock(&group->mutex);
+   iommu_group_put(group);
+
+   return domain;
+}
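For reference, a minimal sketch of the lock-free variant discussed above,
assuming the helper becomes private to the IOPF code and that the domain is
published into group->pasid_array with rcu_assign_pointer() (the function
name below is hypothetical; the final code may differ):

    static struct iommu_domain *
    iopf_get_domain_for_dev_pasid(struct device *dev, ioasid_t pasid)
    {
            struct iommu_domain *domain;
            struct iommu_group *group;

            if (!pasid_valid(pasid))
                    return NULL;

            group = iommu_group_get(dev);
            if (!group)
                    return NULL;

            /*
             * Safe to fetch outside the group mutex because:
             *  - xarray protects its internal state with RCU
             *  - the domain obtained is either NULL or fully formed
             *  - the IOPF work is the only caller and is flushed before
             *    the domain is freed.
             */
            domain = xa_load(&group->pasid_array, pasid);
            iommu_group_put(group);

            return domain;
    }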




Best regards,
baolu


Re: [PATCH v3] iommu/mediatek: Fix NULL pointer dereference when printing dev_name

2022-05-05 Thread Yong Wu via iommu
On Thu, 2022-05-05 at 21:27 +0800, Miles Chen wrote:
> When larbdev is NULL (in the case I hit, the node is incorrectly set
> iommus = < NUM>), device_link_add() fails and the kernel crashes
> when we try to print dev_name(larbdev).
> 
> Let's fail the probe if a larbdev is NULL to avoid invalid inputs
> from dts.
> 
> It should work for a normal, correct setting and avoid the crash caused
> by my incorrect setting.
> 
> Error log:
> [   18.189042][  T301] Unable to handle kernel NULL pointer
> dereference at virtual address 0050
> ...
> [   18.344519][  T301] pstate: a045 (NzCv daif +PAN -UAO)
> [   18.345213][  T301] pc : mtk_iommu_probe_device+0xf8/0x118
> [mtk_iommu]
> [   18.346050][  T301] lr : mtk_iommu_probe_device+0xd0/0x118
> [mtk_iommu]
> [   18.346884][  T301] sp : ffc00a5635e0
> [   18.347392][  T301] x29: ffc00a5635e0 x28: ffd44a46c1d8
> [   18.348156][  T301] x27: ff80c39a8000 x26: ffd44a80cc38
> [   18.348917][  T301] x25:  x24: ffd44a80cc38
> [   18.349677][  T301] x23: ffd44e4da4c6 x22: ffd44a80cc38
> [   18.350438][  T301] x21: ff80cecd1880 x20: 
> [   18.351198][  T301] x19: ff80c439f010 x18: ffc00a50d0c0
> [   18.351959][  T301] x17:  x16: 0004
> [   18.352719][  T301] x15: 0004 x14: ffd44eb5d420
> [   18.353480][  T301] x13: 0ad2 x12: 0003
> [   18.354241][  T301] x11: fad2 x10: c000fad2
> [   18.355003][  T301] x9 : a0d288d8d7142d00 x8 : a0d288d8d7142d00
> [   18.355763][  T301] x7 : ffd44c2bc640 x6 : 
> [   18.356524][  T301] x5 : 0080 x4 : 0001
> [   18.357284][  T301] x3 :  x2 : 0005
> [   18.358045][  T301] x1 :  x0 : 
> [   18.360208][  T301] Hardware name: MT6873 (DT)
> [   18.360771][  T301] Call trace:
> [   18.361168][  T301]  dump_backtrace+0xf8/0x1f0
> [   18.361737][  T301]  dump_stack_lvl+0xa8/0x11c
> [   18.362305][  T301]  dump_stack+0x1c/0x2c
> [   18.362816][  T301]  mrdump_common_die+0x184/0x40c [mrdump]
> [   18.363575][  T301]  ipanic_die+0x24/0x38 [mrdump]
> [   18.364230][  T301]  atomic_notifier_call_chain+0x128/0x2b8
> [   18.364937][  T301]  die+0x16c/0x568
> [   18.365394][  T301]  __do_kernel_fault+0x1e8/0x214
> [   18.365402][  T301]  do_page_fault+0xb8/0x678
> [   18.366934][  T301]  do_translation_fault+0x48/0x64
> [   18.368645][  T301]  do_mem_abort+0x68/0x148
> [   18.368652][  T301]  el1_abort+0x40/0x64
> [   18.368660][  T301]  el1h_64_sync_handler+0x54/0x88
> [   18.368668][  T301]  el1h_64_sync+0x68/0x6c
> [   18.368673][  T301]  mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu]
> ...
> 
> Cc: Robin Murphy 
> Cc: Yong Wu 
> Reported-by: kernel test robot 
> Fixes: 635319a4a744 ("media: iommu/mediatek: Add device_link between
> the consumer and the larb devices")
> Signed-off-by: Miles Chen 

Reviewed-by: Yong Wu 

And this doesn't conflict with the MT8195 patchset.

Thanks.
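For reference, the NULL check described in the commit message amounts to
something like the following in mtk_iommu_probe_device(); this sketch is
based on the mainline driver and the exact patch context may differ:

    larbdev = data->larb_imu[larbid].dev;
    if (!larbdev)
            /* Reject an invalid dts setting instead of crashing later
             * in dev_name(larbdev) when device_link_add() fails. */
            return ERR_PTR(-EINVAL);

    link = device_link_add(dev, larbdev,
                           DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS);
    if (!link)
            dev_err(dev, "Unable to link %s\n", dev_name(larbdev));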



Re: [PATCH v4 05/11] iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit

2022-05-05 Thread Baolu Lu

On 2022/5/3 15:49, Jean-Philippe Brucker wrote:

On Sat, Apr 30, 2022 at 03:33:17PM +0800, Baolu Lu wrote:

Jean, another quick question about the iommu_sva_bind_device()

/**
  * iommu_sva_bind_device() - Bind a process address space to a device
  * @dev: the device
  * @mm: the mm to bind, caller must hold a reference to it
  * @drvdata: opaque data pointer to pass to bind callback

This interface requires the caller to take a reference to mm. Which
reference should it take, mm->mm_count or mm->mm_users? It's better to
make it explicit in this comment.


Agreed, it's mm_users as required by mmu_notifier_register()


Thanks! I will add this in my refactoring patch.

Best regards,
baolu
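As an illustration of the requirement (a sketch only; driver-specific
details and error codes vary), a caller would pin the address space around
the bind:

    /* Hold a reference on mm->mm_users (not just mm_count) across the
     * bind: iommu_sva_bind_device() ends up registering an MMU notifier,
     * which requires a live address space. */
    if (!mmget_not_zero(mm))
            return -ESRCH;

    handle = iommu_sva_bind_device(dev, mm, drvdata);
    if (IS_ERR(handle)) {
            mmput(mm);
            return PTR_ERR(handle);
    }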



[PATCH v3 4/4] iommu/vt-d: Remove hard coding PGSNP bit in PASID entries

2022-05-05 Thread Lu Baolu
As enforce_cache_coherency has been introduced into iommu_domain_ops,
the kernel component which owns the iommu domain is able to opt in to
force snooping support. The iommu driver no longer needs to hard-code
the page snoop control bit in the PASID table entries.

Signed-off-by: Lu Baolu 
Reviewed-by: Kevin Tian 
---
 drivers/iommu/intel/pasid.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 41a0e3b02c79..0abfa7fc7fb0 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -710,9 +710,6 @@ int intel_pasid_setup_second_level(struct intel_iommu 
*iommu,
pasid_set_fault_enable(pte);
pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
-   if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
-   pasid_set_pgsnp(pte);
-
/*
 * Since it is a second level only translation setup, we should
 * set SRE bit as well (addresses are expected to be GPAs).
-- 
2.25.1



[PATCH v3 3/4] iommu/vt-d: Remove domain_update_iommu_snooping()

2022-05-05 Thread Lu Baolu
The IOMMU force snooping capability is no longer required to be consistent
among all the IOMMUs. Remove the force snooping capability check in the
IOMMU hot-add path; domain_update_iommu_snooping() thus becomes dead code.

Signed-off-by: Lu Baolu 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Kevin Tian 
---
 drivers/iommu/intel/iommu.c | 34 +-
 1 file changed, 1 insertion(+), 33 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 048ebfbd5fcb..444d51a18c93 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -533,33 +533,6 @@ static void domain_update_iommu_coherency(struct 
dmar_domain *domain)
rcu_read_unlock();
 }
 
-static bool domain_update_iommu_snooping(struct intel_iommu *skip)
-{
-   struct dmar_drhd_unit *drhd;
-   struct intel_iommu *iommu;
-   bool ret = true;
-
-   rcu_read_lock();
-   for_each_active_iommu(iommu, drhd) {
-   if (iommu != skip) {
-   /*
-* If the hardware is operating in the scalable mode,
-* the snooping control is always supported since we
-* always set PASID-table-entry.PGSNP bit if the domain
-* is managed outside (UNMANAGED).
-*/
-   if (!sm_supported(iommu) &&
-   !ecap_sc_support(iommu->ecap)) {
-   ret = false;
-   break;
-   }
-   }
-   }
-   rcu_read_unlock();
-
-   return ret;
-}
-
 static int domain_update_iommu_superpage(struct dmar_domain *domain,
 struct intel_iommu *skip)
 {
@@ -3606,12 +3579,7 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
iommu->name);
return -ENXIO;
}
-   if (!ecap_sc_support(iommu->ecap) &&
-   domain_update_iommu_snooping(iommu)) {
-   pr_warn("%s: Doesn't support snooping.\n",
-   iommu->name);
-   return -ENXIO;
-   }
+
sp = domain_update_iommu_superpage(NULL, iommu) - 1;
if (sp >= 0 && !(cap_super_page_val(iommu->cap) & (1 << sp))) {
pr_warn("%s: Doesn't support large page.\n",
-- 
2.25.1



[PATCH v3 2/4] iommu/vt-d: Check domain force_snooping against attached devices

2022-05-05 Thread Lu Baolu
As domain->force_snooping only impacts the devices attached to the
domain, there's no need to check against all IOMMU units. On the other
hand, force_snooping can be set on a domain no matter whether it has
been attached or not, and once set it is an immutable flag. If no
device is attached, the operation always succeeds. Such an empty domain
can then only be attached to a device whose IOMMU supports snoop
control.

Signed-off-by: Lu Baolu 
---
 include/linux/intel-iommu.h |  1 +
 drivers/iommu/intel/pasid.h |  2 ++
 drivers/iommu/intel/iommu.c | 53 ++---
 drivers/iommu/intel/pasid.c | 23 
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 72e5d7900e71..4f29139bbfc3 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -540,6 +540,7 @@ struct dmar_domain {
u8 has_iotlb_device: 1;
u8 iommu_coherency: 1;  /* indicate coherency of iommu access */
u8 force_snooping : 1;  /* Create IOPTEs with snoop control */
+   u8 set_pte_snp:1;
 
struct list_head devices;   /* all devices' list */
struct iova_domain iovad;   /* iova's that belong to this domain */
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index ab4408c824a5..583ea67fc783 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -123,4 +123,6 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 bool fault_ignore);
 int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid);
 void vcmd_free_pasid(struct intel_iommu *iommu, u32 pasid);
+void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
+ struct device *dev, u32 pasid);
 #endif /* __INTEL_PASID_H */
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b4802f4055a0..048ebfbd5fcb 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2459,7 +2459,7 @@ static int domain_setup_first_level(struct intel_iommu 
*iommu,
if (level == 5)
flags |= PASID_FLAG_FL5LP;
 
-   if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
+   if (domain->force_snooping)
flags |= PASID_FLAG_PAGE_SNOOP;
 
return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid,
@@ -,7 +,7 @@ static int intel_iommu_map(struct iommu_domain *domain,
prot |= DMA_PTE_READ;
if (iommu_prot & IOMMU_WRITE)
prot |= DMA_PTE_WRITE;
-   if (dmar_domain->force_snooping)
+   if (dmar_domain->set_pte_snp)
prot |= DMA_PTE_SNP;
 
max_addr = iova + size;
@@ -4567,13 +4567,60 @@ static phys_addr_t intel_iommu_iova_to_phys(struct 
iommu_domain *domain,
return phys;
 }
 
+static bool domain_support_force_snooping(struct dmar_domain *domain)
+{
+   struct device_domain_info *info;
+   bool support = true;
+
+   assert_spin_locked(&device_domain_lock);
+   list_for_each_entry(info, &domain->devices, link) {
+   if (!ecap_sc_support(info->iommu->ecap)) {
+   support = false;
+   break;
+   }
+   }
+
+   return support;
+}
+
+static void domain_set_force_snooping(struct dmar_domain *domain)
+{
+   struct device_domain_info *info;
+
+   assert_spin_locked(&device_domain_lock);
+
+   /*
+* Second level page table supports per-PTE snoop control. The
+* iommu_map() interface will handle this by setting SNP bit.
+*/
+   if (!domain_use_first_level(domain)) {
+   domain->set_pte_snp = true;
+   return;
+   }
+
+   list_for_each_entry(info, &domain->devices, link)
+   intel_pasid_setup_page_snoop_control(info->iommu, info->dev,
+PASID_RID2PASID);
+}
+
 static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 {
struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+   unsigned long flags;
 
-   if (!domain_update_iommu_snooping(NULL))
+   if (dmar_domain->force_snooping)
+   return true;
+
+   spin_lock_irqsave(&device_domain_lock, flags);
+   if (!domain_support_force_snooping(dmar_domain)) {
+   spin_unlock_irqrestore(&device_domain_lock, flags);
return false;
+   }
+
+   domain_set_force_snooping(dmar_domain);
dmar_domain->force_snooping = true;
+   spin_unlock_irqrestore(&device_domain_lock, flags);
+
return true;
 }
 
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index f8d215d85695..41a0e3b02c79 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -762,3 +762,26 @@ int intel_pasid_setup_pass_through(struct intel_iommu 
*iommu,
 
return 0;
 }
+
+/*
+ * Set the page snoop 

[PATCH v3 1/4] iommu/vt-d: Block force-snoop domain attaching if no SC support

2022-05-05 Thread Lu Baolu
In the attach_dev callback of the default domain ops, if the domain has
been set force_snooping, but the iommu hardware of the device does not
support SC(Snoop Control) capability, the callback should block it and
return a corresponding error code.

Signed-off-by: Lu Baolu 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Kevin Tian 
---
 drivers/iommu/intel/iommu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 2990f80c5e08..b4802f4055a0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4367,6 +4367,9 @@ static int prepare_domain_attach_device(struct 
iommu_domain *domain,
if (!iommu)
return -ENODEV;
 
+   if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
+   return -EOPNOTSUPP;
+
/* check if this iommu agaw is sufficient for max mapped address */
addr_width = agaw_to_width(iommu->agaw);
if (addr_width > cap_mgaw(iommu->cap))
-- 
2.25.1



[PATCH v3 0/4] iommu/vt-d: Force snooping improvement

2022-05-05 Thread Lu Baolu
Hi folks,

Previously, the IOMMU capability of enforcing cache coherency was queried
through iommu_capable(IOMMU_CAP_CACHE_COHERENCY). This is a global
capability, hence the IOMMU driver reports support for this capability
only when all IOMMUs in the system have this support.

Commit 6043257b1de06 ("iommu: Introduce the domain op
enforce_cache_coherency()") converts this into a per-domain test-and-set
option, and the previous iommu_capable(IOMMU_CAP_CACHE_COHERENCY) is
deprecated.

This is a follow-up series which improves the Intel IOMMU driver to
support the per-domain scheme better.
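As a rough illustration of the per-domain scheme (a sketch; the exact
caller-side code, e.g. in VFIO, may differ), the owner of the domain opts
in once and the driver then rejects incompatible attachments:

    /* Owner side: request forced snooping for this domain only. Once it
     * succeeds, the flag is immutable. */
    if (domain->ops->enforce_cache_coherency &&
        domain->ops->enforce_cache_coherency(domain))
            enforce_cache_coherency = true;

    /* Driver side (see patch 1/4): refuse to attach a device whose IOMMU
     * lacks Snoop Control to a force_snooping domain. */
    if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
            return -EOPNOTSUPP;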

Best regards,
baolu

Change log:
v3:
 - Hold the device_domain_lock when check and set force snooping.
 - Refine the commit messages.

v2:
 - 
https://lore.kernel.org/linux-iommu/20220505010710.1477739-1-baolu...@linux.intel.com/
 - Check whether force_snooping has already been set in
   intel_iommu_enforce_cache_coherency().
 - Set PGSNP pasid bit field during domain attaching if forcing_snooping
   is set.
 - Remove redundant list_empty() checks.
 - Add dmar_domain->set_pte_snp and set it if force snooping is enforced
   on a domain with 2nd-level translation.

v1:
 - 
https://lore.kernel.org/linux-iommu/20220501112434.874236-1-baolu...@linux.intel.com
 - Initial post.

Lu Baolu (4):
  iommu/vt-d: Block force-snoop domain attaching if no SC support
  iommu/vt-d: Check domain force_snooping against attached devices
  iommu/vt-d: Remove domain_update_iommu_snooping()
  iommu/vt-d: Remove hard coding PGSNP bit in PASID entries

 include/linux/intel-iommu.h |  1 +
 drivers/iommu/intel/pasid.h |  2 +
 drivers/iommu/intel/iommu.c | 90 ++---
 drivers/iommu/intel/pasid.c | 26 +--
 4 files changed, 80 insertions(+), 39 deletions(-)

-- 
2.25.1



Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-05 Thread David Gibson
On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:
> 
> > > It is a bit more CPU work since maps in the lower range would have to
> > > be copied over, but conceptually the model matches the HW nesting.
> > 
> > Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
> > fixed windows, but we fake dynamic windows in the userspace
> > implementation by flipping the devices over to a new IOAS with the new
> > windows.  Is that right?
> 
> Yes
> 
> > Where exactly would the windows be specified?  My understanding was
> > that when creating a back-end specific IOAS, that would typically be
> > for the case where you're using a user / guest managed IO pagetable,
> > with the backend specifying the format for that.  In the ppc case we'd
> > need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
> > operations to manage the mappings.  The PAPR vIOMMU is
> > paravirtualized, so all updates come via hypercalls, so there's no
> > user/guest managed data structure.
> 
> When the iommu_domain is created I want to have a
> iommu-driver-specific struct, so PPC can customize its iommu_domain
> however it likes.

This requires that the client be aware of the host side IOMMU model.
That's true in VFIO now, and it's nasty; I was really hoping we could
*stop* doing that.

Note that I'm talking here *purely* about the non-optimized case where
all updates to the host side IO pagetables are handled by IOAS_MAP /
IOAS_COPY, with no direct hardware access to user or guest managed IO
pagetables.  The optimized case obviously requires end-to-end
agreement on the pagetable format amongst other domain properties.

What I'm hoping is that qemu (or whatever) can use this non-optimized
as a fallback case where it doesn't need to know the properties of
whatever host side IOMMU models there are.  It just requests what it
needs based on the vIOMMU properties it needs to replicate and the
host kernel either can supply it or can't.

In many cases it should be perfectly possible to emulate a PPC style
vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
aperture that it will encompass wherever the ppc apertures end
up. Similarly we could simulate an x86-style no-vIOMMU guest on a ppc
host (currently somewhere between awkward and impossible) by placing
the host apertures to cover guest memory.

Admittedly those are pretty niche cases, but allowing for them gives
us flexibility for the future.  Emulating an ARM SMMUv3 guest on an
ARM SMMU v4 or v5 or v.whatever host is likely to be a real case in
the future, and AFAICT, ARM are much less conservative than x86 about
maintaining similar hw interfaces over time.  That's why I think
considering these ppc cases will give a more robust interface for
other future possibilities as well.

> > That should work from the point of view of the userspace and guest
> > side interfaces.  It might be fiddly from the point of view of the
> > back end.  The ppc iommu doesn't really have the notion of
> > configurable domains - instead the address spaces are the hardware or
> > firmware fixed PEs, so they have a fixed set of devices.  At the bare
> > metal level it's possible to sort of do domains by making the actual
> > pagetable pointers for several PEs point to a common place.
> 
> I'm not sure I understand this - a domain is just a storage container
> for an IO page table, if the HW has IOPTEs then it should be able to
> have a domain?
> 
> Making page table pointers point to a common IOPTE tree is exactly
> what iommu_domains are for - why is that "sort of" for ppc?

Ok, fair enough, it's only "sort of" in the sense that the hw specs /
docs don't present any equivalent concept.

> > However, in the future, nested KVM under PowerVM is likely to be the
> > norm.  In that situation the L1 as well as the L2 only has the
> > paravirtualized interfaces, which don't have any notion of domains,
> > only PEs.  All updates take place via hypercalls which explicitly
> > specify a PE (strictly speaking they take a "Logical IO Bus Number"
> > (LIOBN), but those generally map one to one with PEs), so it can't use
> > shared pointer tricks either.
> 
> How does the paravirtualized interfaces deal with the page table? Does
> it call a map/unmap hypercall instead of providing guest IOPTEs?

Sort of.  The main interface is H_PUT_TCE ("TCE" - Translation Control
Entry - being IBMese for an IOPTE). This takes an LIOBN (which selects
which PE and aperture), an IOVA and a TCE value - which is a guest
physical address plus some permission bits.  There are some variants
for performance that can set a batch of IOPTEs from a buffer, or clear
a range of IOPTEs, but they're just faster ways of doing the same
thing as a bunch of H_PUT_TCE calls.

You can consider that a map/unmap hypercall, but the size of the
mapping is fixed (the IO pagesize which was previously set for the
aperture).
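For readers unfamiliar with PAPR, the hypercall described above has roughly
this shape (conceptual only; the authoritative definition lives in the PAPR
spec and in KVM's H_PUT_TCE handlers):

    /* Conceptual signature of the map/unmap primitive discussed above. */
    long h_put_tce(unsigned long liobn,  /* selects the PE and aperture   */
                   unsigned long ioba,   /* IO virtual address            */
                   unsigned long tce);   /* guest PA | permission bits;
                                            a clear entry unmaps the page */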

> Assuming yes, 

RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, May 5, 2022 10:08 PM
> 
> On Thu, May 05, 2022 at 07:40:37AM +, Tian, Kevin wrote:
> 
> > In concept this is an iommu property instead of a domain property.
> 
> Not really, domains shouldn't be changing behaviors once they are
> created. If a domain supports dirty tracking and I attach a new device
> then it still must support dirty tracking.

That sort of suggests that userspace should specify whether a domain
supports dirty tracking when it's created. But how does userspace
know that it should create the domain in this way in the first place?
Live migration is triggered on demand and it may not happen in the
lifetime of a VM.

And if the user always creates domains that allow dirty tracking by default,
how does it know that a failed attach is due to missing dirty tracking
support in the IOMMU, so that it can create another domain which disables
dirty tracking and retry the attach?

In any case, IMHO having a device capability still sounds reasonable even
in the above model, so userspace can create the domain with the right
property based on a potential list of devices to be attached. Once the
domain is created, further attached devices must be compatible with the
domain property.

> 
> I suppose we may need something here because we need to control when
> domains are re-used if they don't have the right properties in case
> the system iommu's are discontiguous somehow.
> 
> ie iommufd should be able to assert that dirty tracking is desired and
> an existing non-dirty tracking capable domain will not be
> automatically re-used.
> 
> We don't really have the right infrastructure to do this currently.
> 
> > From this angle IMHO it's more reasonable to report this IOMMU
> > property to userspace via a device capability. If all devices attached
> > to a hwpt claim IOMMU dirty tracking capability, the user can call
> > set_dirty_tracking() on the hwpt object.
> 
> Inherent domain properties need to be immutable or, at least one-way,
> like enforced coherent, or it just all stops making any kind of sense.
> 
> > Once dirty tracking is enabled on a hwpt, further attaching a device
> > which doesn't claim this capability is simply rejected.
> 
> It would be OK to do as enforced coherent does as flip a domain
> permanently into dirty-tracking enabled, or specify a flag at domain
> creation time.
> 

Either way I think a device capability is useful for the user to decide
the necessity of flipping one-way or specifying a flag at domain
creation.

Thanks
Kevin


RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, May 5, 2022 9:55 PM
> 
> On Thu, May 05, 2022 at 11:03:18AM +, Tian, Kevin wrote:
> 
> > iiuc the purpose of 'write-protection' here is to capture in-flight dirty
> > pages in the said race window until the unmap and iotlb invalidation are
> > completed.
> 
> No, the purpose is to perform "unmap" without destroying the dirty bit
> in the process.
> 
> If an IOMMU architecture has a way to render the page unmaped and
> flush back the dirty bit/not destroy then it doesn't require a write
> protect pass.
> 

Yes, I see the point now. As you said let's consider it only when
there is a real use case requiring such atomicity.


RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Tian, Kevin
> From: Joao Martins 
> Sent: Thursday, May 5, 2022 7:51 PM
> 
> On 5/5/22 12:03, Tian, Kevin wrote:
> >> From: Joao Martins 
> >> Sent: Thursday, May 5, 2022 6:07 PM
> >>
> >> On 5/5/22 08:42, Tian, Kevin wrote:
>  From: Jason Gunthorpe 
>  Sent: Tuesday, May 3, 2022 2:53 AM
> 
>  On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> > On Fri, 29 Apr 2022 05:45:20 +
> > "Tian, Kevin"  wrote:
> >>> From: Joao Martins 
> >>>  3) Unmapping an IOVA range while returning its dirty bit prior to
> >>> unmap. This case is specific for the non-nested vIOMMU case where an
> >>> erroneous guest (or device) is DMAing to an address being unmapped at
> >>> the same time.
> >>
> >> an erroneous attempt like above cannot anticipate which DMAs can
> >> succeed in that window thus the end behavior is undefined. For an
> >> undefined behavior nothing will be broken by losing some bits dirtied
> >> in the window between reading back dirty bits of the range and
> >> actually calling unmap. From guest p.o.v. all those are black-box
> >> hardware logic to serve a virtual iotlb invalidation request which just
> >> cannot be completed in one cycle.
> >>
> >> Hence in reality probably this is not required except to meet vfio
> >> compat requirement. Just in concept returning dirty bits at unmap
> >> is more accurate.
> >>
> >> I'm slightly inclined to abandon it in iommufd uAPI.
> >
> > Sorry, I'm not following why an unmap with returned dirty bitmap
> > operation is specific to a vIOMMU case, or in fact indicative of some
> > sort of erroneous, racy behavior of guest or device.
> 
>  It is being compared against the alternative which is to explicitly
>  query dirty then do a normal unmap as two system calls and permit a
>  race.
> 
>  The only case with any difference is if the guest is racing DMA with
>  the unmap - in which case it is already indeterminate for the guest if
>  the DMA will be completed or not.
> 
>  eg on the vIOMMU case if the guest races DMA with unmap then we are
>  already fine with throwing away that DMA because that is how the race
>  resolves during non-migration situations, so resolving it as throwing
>  away the DMA during migration is OK too.
> 
> > We need the flexibility to support memory hot-unplug operations
> > during migration,
> 
>  I would have thought that hotplug during migration would simply
>  discard all the data - how does it use the dirty bitmap?
> 
> > This was implemented as a single operation specifically to avoid
> > races where ongoing access may be available after retrieving a
> > snapshot of the bitmap.  Thanks,
> 
>  The issue is the cost.
> 
>  On a real iommu eliminating the race is expensive as we have to write
>  protect the pages before query dirty, which seems to be an extra IOTLB
>  flush.
> 
>  It is not clear if paying this cost to become atomic is actually
>  something any use case needs.
> 
>  So, I suggest we think about a 3rd op 'write protect and clear
>  dirties' that will be followed by a normal unmap - the extra op will
>  have the extra overhead and userspace can decide if it wants to pay or
>  not vs the non-atomic read dirties operation. And lets have a use case
>  where this must be atomic before we implement it..
> >>>
> >>> and write-protection also relies on the support of I/O page fault...
> >>>
> >> /I think/ all IOMMUs in this series already support
> >> permission/unrecoverable I/O page faults for a long time IIUC.
> >>
> >> The earlier suggestion was just to discard the I/O page fault after
> >> write-protection happens. fwiw, some IOMMUs also support suppressing
> >> the event notification (like AMD).
> >
> > iiuc the purpose of 'write-protection' here is to capture in-flight dirty
> > pages in the said race window until the unmap and iotlb invalidation are
> > completed.
> >
> But then we depend on PRS being there on the device, because without it,
> DMA is aborted on the target on a read-only IOVA prior to the page fault,
> thus the page is not going to be dirty anyways.
> 
> > *unrecoverable* faults are not expected to be used in a feature path
> > as occurrence of such faults may lead to severe reaction in iommu
> > drivers e.g. completely block DMA from the device causing such faults.
> 
> Unless I totally misunderstood ... the latter is actually what we were
> suggesting here /in the context of unmapping a GIOVA/ (*).
> 
> The wrprotect() was there to ensure we get an atomic dirty state of the
> IOVA range afterwards, by blocking DMA (as opposed to sort of mediating
> DMA). The I/O page fault is not supposed to happen unless there's rogue
> DMA AIUI.

You are right. I misunderstood the proposal here.

> 
> TBH, the same could be said for normal DMA 

Re: [PATCH v2] iommu/sva: Fix PASID use-after-free issue

2022-05-05 Thread Zhangfei Gao



On 2022/4/29 9:39 AM, Zhangfei Gao wrote:



On 2022/4/29 2:00 AM, Fenghua Yu wrote:

The PASID is being freed too early.  It needs to stay around until after
device drivers that might be using it have had a chance to clear it out
of the hardware.

As a reminder:

mmget() /mmput()  refcount the mm's address space
mmgrab()/mmdrop() refcount the mm itself

The PASID is currently tied to the life of the mm's address space and
freed in __mmput().  This makes logical sense because the PASID can't be
used once the address space is gone.

But, this misses an important point: even after the address space is
gone, the PASID will still be programmed into a device.  Device drivers
might, for instance, still need to flush operations that are outstanding
and need to use that PASID.  They do this at file->release() time.

Device drivers call the IOMMU driver to hold a reference on the mm itself
and drop it at file->release() time.  But, the IOMMU driver holds a
reference on the mm itself, not the address space.  The address space
(and the PASID) is long gone by the time the driver tries to clean up.
This is effectively a use-after-free bug on the PASID.

To fix this, move the PASID free operation from __mmput() to __mmdrop().
This ensures that the IOMMU driver's existing mmgrab() keeps the PASID
allocated until it drops its mm reference.

Fixes: 701fac40384f ("iommu/sva: Assign a PASID to mm on PASID 
allocation and free it on mm exit")


Reported-by: Zhangfei Gao 
Tested-by: Zhangfei Gao 


Tested-by: Zhangfei Gao 

Use the formal email, thanks


Suggested-by: Jean-Philippe Brucker 
Suggested-by: Jacob Pan 
Reviewed-by: Jean-Philippe Brucker 
Signed-off-by: Fenghua Yu 


Hi,

Will this be taken for 5.18?

Thanks


---

v2:
- Dave Hansen rewrites the change log.
- Add Tested-by: Zhangfei Gao 
- Add Reviewed-by: Jean-Philippe Brucker 

The original patch was posted and discussed in:
https://lore.kernel.org/lkml/ymdzffx7fn586...@fyu1.sc.intel.com/

  kernel/fork.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9796897560ab..35a3beff140b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -792,6 +792,7 @@ void __mmdrop(struct mm_struct *mm)
  mmu_notifier_subscriptions_destroy(mm);
  check_mm(mm);
  put_user_ns(mm->user_ns);
+    mm_pasid_drop(mm);
  free_mm(mm);
  }
  EXPORT_SYMBOL_GPL(__mmdrop);
@@ -1190,7 +1191,6 @@ static inline void __mmput(struct mm_struct *mm)
  }
  if (mm->binfmt)
  module_put(mm->binfmt->module);
-    mm_pasid_drop(mm);
  mmdrop(mm);
  }
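With the ordering above, the driver-side pattern the commit message
describes looks roughly like this (illustrative only):

    mmgrab(mm);     /* bind: pin the mm itself (mm_count); mm->pasid now
                     * stays valid even after the address space is torn
                     * down in __mmput() */

    /* ... device uses the PASID ... */

    /* file->release(): flush outstanding device work that still uses the
     * PASID, then drop the pin. With this patch, the PASID is freed in
     * __mmdrop() once the last mm_count reference goes away. */
    mmdrop(mm);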





[PATCH v6 29/29] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable

2022-05-05 Thread Ricardo Neri
The HPET-based hardlockup detector relies on the TSC to determine if an
observed NMI interrupt originated from the HPET timer. Hence, this detector
can no longer be used with an unstable TSC.

In such a case, permanently stop the HPET-based hardlockup detector and
start the perf-based detector.

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Relocated the declaration of hardlockup_detector_switch_to_perf() to
   x86/nmi.h. It does not depend on HPET.
 * Removed function stub. The shim hardlockup detector is always for x86.

Changes since v4:
 * Added a stub version of hardlockup_detector_switch_to_perf() for
   !CONFIG_HPET_TIMER. (lkp)
 * Reconfigure the whole lockup detector instead of unconditionally
   starting the perf-based hardlockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/nmi.h | 6 ++
 arch/x86/kernel/tsc.c  | 2 ++
 arch/x86/kernel/watchdog_hld.c | 6 ++
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 4a0d5b562c91..47752ff67d8b 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -63,4 +63,10 @@ void stop_nmi(void);
 void restart_nmi(void);
 void local_touch_nmi(void);
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR
+void hardlockup_detector_switch_to_perf(void);
+#else
+static inline void hardlockup_detector_switch_to_perf(void) { }
+#endif
+
 #endif /* _ASM_X86_NMI_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cc1843044d88..74772ffc79d1 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1176,6 +1176,8 @@ void mark_tsc_unstable(char *reason)
 
clocksource_mark_unstable(_tsc_early);
clocksource_mark_unstable(_tsc);
+
+   hardlockup_detector_switch_to_perf();
 }
 
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index ef11f0af4ef5..7940977c6312 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -83,3 +83,9 @@ void watchdog_nmi_start(void)
if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
hardlockup_detector_hpet_start();
 }
+
+void hardlockup_detector_switch_to_perf(void)
+{
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   lockup_detector_reconfigure();
+}
-- 
2.17.1



[PATCH v6 26/29] x86/watchdog: Add a shim hardlockup detector

2022-05-05 Thread Ricardo Neri
The generic hardlockup detector is based on perf. It also provides a set
of weak functions that CPU architectures can override. Add a shim
hardlockup detector for x86 that overrides such functions and can
select between perf and HPET implementations of the detector.

For clarity, add the intermediate Kconfig symbol X86_HARDLOCKUP_DETECTOR
that is selected whenever the core of the hardlockup detector is
selected.

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Nicholas Piggin 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Added watchdog_nmi_start() to be used when tsc_khz is recalibrated.
 * Always build the x86-specific hardlockup detector shim; not only
   when the HPET-based detector is selected.
 * Corrected a typo in comment in watchdog_nmi_probe() (Ani)
 * Removed useless local ret variable in watchdog_nmi_enable(). (Ani)

Changes since v4:
 * Use a switch to enable and disable the various available detectors.
   (Andi)

Changes since v3:
 * Fixed style in multi-line comment. (Randy Dunlap)

Changes since v2:
 * Pass cpu number as argument to hardlockup_detector_[enable|disable].
   (Thomas Gleixner)

Changes since v1:
 * Introduced this patch: Added an x86-specific shim hardlockup
   detector. (Nicholas Piggin)
---
 arch/x86/Kconfig.debug |  3 ++
 arch/x86/kernel/Makefile   |  2 +
 arch/x86/kernel/watchdog_hld.c | 85 ++
 3 files changed, 90 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index bc34239589db..599001157847 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -6,6 +6,9 @@ config TRACE_IRQFLAGS_NMI_SUPPORT
 config EARLY_PRINTK_USB
bool
 
+config X86_HARDLOCKUP_DETECTOR
+   def_bool y if HARDLOCKUP_DETECTOR_CORE
+
 config X86_VERBOSE_BOOTUP
bool "Enable verbose x86 bootup info messages"
default y
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index c700b00a2d86..af3d54e4c836 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -114,6 +114,8 @@ obj-$(CONFIG_KGDB)  += kgdb.o
 obj-$(CONFIG_VM86) += vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR) += watchdog_hld.o
+
 obj-$(CONFIG_HPET_TIMER)   += hpet.o
 obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
new file mode 100644
index ..ef11f0af4ef5
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A shim hardlockup detector. It overrides the weak stubs of the generic
+ * implementation to select between the perf- or the hpet-based implementation.
+ *
+ * Copyright (C) Intel Corporation 2022
+ */
+
+#include 
+#include 
+
+enum x86_hardlockup_detector {
+   X86_HARDLOCKUP_DETECTOR_PERF,
+   X86_HARDLOCKUP_DETECTOR_HPET,
+};
+
+static enum __read_mostly x86_hardlockup_detector detector_type;
+
+int watchdog_nmi_enable(unsigned int cpu)
+{
+   switch (detector_type) {
+   case X86_HARDLOCKUP_DETECTOR_PERF:
+   hardlockup_detector_perf_enable();
+   break;
+   case X86_HARDLOCKUP_DETECTOR_HPET:
+   hardlockup_detector_hpet_enable(cpu);
+   break;
+   default:
+   return -ENODEV;
+   }
+
+   return 0;
+}
+
+void watchdog_nmi_disable(unsigned int cpu)
+{
+   switch (detector_type) {
+   case X86_HARDLOCKUP_DETECTOR_PERF:
+   hardlockup_detector_perf_disable();
+   break;
+   case X86_HARDLOCKUP_DETECTOR_HPET:
+   hardlockup_detector_hpet_disable(cpu);
+   break;
+   }
+}
+
+int __init watchdog_nmi_probe(void)
+{
+   int ret;
+
+   /*
+* Try first with the HPET hardlockup detector. It will only
+* succeed if selected at build time and requested in the
+* nmi_watchdog command-line parameter. This ensures that the
+* perf-based detector is used by default, if selected at
+* build time.
+*/
+   ret = hardlockup_detector_hpet_init();
+   if (!ret) {
+   detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
+   return ret;
+   }
+
+   ret = hardlockup_detector_perf_init();
+   if (!ret) {
+   detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+   return ret;
+   }
+
+   return 0;
+}
+
+void watchdog_nmi_stop(void)
+{
+   /* Only the HPET lockup detector defines a stop function. */
+   if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+   hardlockup_detector_hpet_stop();
+}
+
+void watchdog_nmi_start(void)
+{
+   /* Only the HPET lockup detector defines a start function. */
+ 

[PATCH v6 23/29] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI

2022-05-05 Thread Ricardo Neri
It is not possible to determine the source of a non-maskable interrupt
(NMI) in x86. When dealing with an HPET channel, the only direct method to
determine whether it caused an NMI would be to read the Interrupt Status
register.

However, reading HPET registers is slow and, therefore, not to be done
while in NMI context. Furthermore, status is not available if the HPET
channel is programmed to deliver an MSI interrupt.

An indirect manner to infer if an incoming NMI was caused by the HPET
channel of the detector is to use the time-stamp counter (TSC). Compute the
value that the TSC is expected to have at the next interrupt of the HPET
channel and compare it with the value it has when the interrupt does
happen. If the actual value falls within a small error window, assume
that the HPET channel of the detector is the source of the NMI.

Let tsc_delta be the difference between the value the TSC has now and the
value it will have when the next HPET channel interrupt happens. Define the
error window as a percentage of tsc_delta.

Below is a table that characterizes the error in the expected TSC value
when the HPET channel fires on a variety of systems. It presents the error
as a percentage of tsc_delta and in microseconds.

The table summarizes the error of 4096 interrupts of the HPET channel
collected after the system has been up for 5 minutes as well as since boot.

The maximum observed error on any system is 0.045%. When the error since
boot is considered, the maximum observed error is 0.198%.

To find the most common error value, the collected data is grouped into
buckets of 0.01 percentage points of the error and 10ns, respectively.
The most common error on any system is 0.01317%.

Allow a maximum error that is twice as big as the maximum error observed
in these experiments: 0.4%.

watchdog_thresh 1s10s 60s
Error wrt
expected
TSC value %us %us%us

AMD EPYC 7742 64-Core Processor
Abs max
since boot 0.04517   451.740.00171   171.04   0.00034   201.89
Abs max0.04517   451.740.00171   171.04   0.00034   201.89
Mode   0.2 0.180.2 2.07  -0.3   -19.20

Intel(R) Xeon(R) CPU E7-8890 - INTEL_FAM6_HASWELL_X
abs max
since boot 0.0081181.150.00462   462.40   0.0001481.65
Abs max0.0081181.150.0008484.31   0.0001481.65
Mode  -0.00422   -42.16   -0.00043   -42.50  -0.7   -40.40

Intel(R) Xeon(R) Platinum 8170M - INTEL_FAM6_SKYLAKE_X
Abs max
since boot 0.10530  1053.040.01324  1324.27   0.00407  2443.25
Abs max0.01166   116.590.00114   114.11   0.00024   143.47
Mode  -0.01023  -102.32   -0.00103  -102.44  -0.00022  -132.38

Intel(R) Xeon(R) CPU E5-2699A v4 - INTEL_FAM6_BROADSWELL_X
Abs max
since boot 0.0001099.340.0009998.83   0.0001697.50
Abs max0.0001099.340.0009998.83   0.0001697.50
Mode  -0.7   -74.29   -0.00074   -73.99  -0.00012   -73.12

Intel(R) Xeon(R) Gold 5318H - INTEL_FAM6_COOPERLAKE_X
Abs max
since boot 0.11262  1126.170.01109  1109.17   0.00409  2455.73
Abs max0.01073   107.310.00109   109.02   0.00019   115.34
Mode  -0.00953   -95.26   -0.00094   -93.63  -0.00015   -90.42

Intel(R) Xeon(R) Platinum 8360Y - INTEL_FAM6_ICELAKE_X
Abs max
since boot 0.19853  1985.300.00784   783.53  -0.00017  -104.77
Abs max0.01550   155.020.00158   157.56   0.00020   117.74
Mode  -0.01317  -131.65   -0.00136  -136.42  -0.00018  -105.06
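A minimal sketch of the resulting check follows; the structure and variable
names (hld_data, tsc_next, watchdog_tsc_err) are assumptions and the real
patch differs in detail:

    static bool is_hpet_hld_interrupt(struct hpet_hld_data *hdata)
    {
            u64 tsc_value = rdtsc();

            /* Attribute the NMI to our HPET channel only if the TSC is
             * within +/- watchdog_tsc_err (0.4% of tsc_delta) of the value
             * precomputed for the channel's next expiration. */
            return time_in_range64(tsc_value,
                                   hdata->tsc_next - watchdog_tsc_err,
                                   hdata->tsc_next + watchdog_tsc_err);
    }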

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Andi Kleen 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
NOTE: The error characterization data is repeated here from the cover
letter.
---
Changes since v5:
 * Reworked is_hpet_hld_interrupt() to reduce indentation.
 * Use time_in_range64() to compare the actual TSC value vs the expected
   value. This makes it more readable. (Tony)
 * Reduced the error window of the expected TSC value at the time of the
   HPET channel expiration.
 * Described better the heuristics used to determine if the HPET channel
   caused the NMI. (Tony)
 * Added a table to characterize the error in the expected TSC value when
   the HPET channel fires.
 * Removed references to groups of monitored CPUs. Instead, use tsc_khz
   directly.

Changes since v4:
 * Compute the TSC expected value at the next HPET interrupt based on the
   number of monitored packages and not the number of monitored CPUs.

Changes since v3:
 * None

Changes since v2:
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid an unnecessary conditional. (Peter Zijlstra)
 * Removed TSC error margin from struct hld_data; use a global variable
   instead. (Peter Zijlstra)

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hpet.h |  3 ++
 

[PATCH v6 27/29] watchdog: Expose lockup_detector_reconfigure()

2022-05-05 Thread Ricardo Neri
When there are multiple implementations of the NMI watchdog, there may be
situations in which switching from one to another is needed. If the time-
stamp counter becomes unstable, the HPET-based NMI watchdog can no longer
be used. Similarly, the HPET-based NMI watchdog relies on tsc_khz and
needs to be informed when it is refined.

Reloading the NMI watchdog or switching to another hardlockup detector can
be done cleanly by updating the arch-specific stub and then reconfiguring
the whole lockup detector.

Expose lockup_detector_reconfigure() to achieve this goal.

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * None

Changes since v4:
 * Switching to the perf-based lockup detector under the hood is hacky.
   Instead, reconfigure the whole lockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 include/linux/nmi.h | 2 ++
 kernel/watchdog.c   | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index cf12380e51b3..73827a477288 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -16,6 +16,7 @@ void lockup_detector_init(void);
 void lockup_detector_soft_poweroff(void);
 void lockup_detector_cleanup(void);
 bool is_hardlockup(void);
+void lockup_detector_reconfigure(void);
 
 extern int watchdog_user_enabled;
 extern int nmi_watchdog_user_enabled;
@@ -37,6 +38,7 @@ extern int sysctl_hardlockup_all_cpu_backtrace;
 static inline void lockup_detector_init(void) { }
 static inline void lockup_detector_soft_poweroff(void) { }
 static inline void lockup_detector_cleanup(void) { }
+static inline void lockup_detector_reconfigure(void) { }
 #endif /* !CONFIG_LOCKUP_DETECTOR */
 
 #ifdef CONFIG_SOFTLOCKUP_DETECTOR
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 6443841a755f..e5b67544f8c8 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -537,7 +537,7 @@ int lockup_detector_offline_cpu(unsigned int cpu)
return 0;
 }
 
-static void lockup_detector_reconfigure(void)
+void lockup_detector_reconfigure(void)
 {
cpus_read_lock();
watchdog_nmi_stop();
@@ -579,7 +579,7 @@ static __init void lockup_detector_setup(void)
 }
 
 #else /* CONFIG_SOFTLOCKUP_DETECTOR */
-static void lockup_detector_reconfigure(void)
+void lockup_detector_reconfigure(void)
 {
cpus_read_lock();
watchdog_nmi_stop();
-- 
2.17.1



[PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz

2022-05-05 Thread Ricardo Neri
The HPET hardlockup detector relies on tsc_khz to estimate the value that
the TSC will have when its HPET channel fires. A refined tsc_khz helps to
better estimate the expected TSC value.

Using the early value of tsc_khz may lead to a large error in the expected
TSC value. Restarting the NMI watchdog detector has the effect of kicking
its HPET channel and making use of the refined tsc_khz.

When the HPET hardlockup detector is not in use, restarting the NMI
watchdog is a no-op.

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4
 * N/A

Changes since v3
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/tsc.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..cc1843044d88 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1386,6 +1386,12 @@ static void tsc_refine_calibration_work(struct 
work_struct *work)
/* Inform the TSC deadline clockevent devices about the recalibration */
lapic_update_tsc_freq();
 
+   /*
+* If in use, the HPET hardlockup detector relies on tsc_khz.
+* Reconfigure it to make use of the refined tsc_khz.
+*/
+   lockup_detector_reconfigure();
+
/* Update the sched_clock() rate to match the clocksource one */
for_each_possible_cpu(cpu)
set_cyc2ns_scale(tsc_khz, cpu, tsc_stop);
-- 
2.17.1



[PATCH v6 25/29] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter

2022-05-05 Thread Ricardo Neri
Keep the HPET-based hardlockup detector disabled unless explicitly enabled
via a command-line argument. If such a parameter is not given, the
initialization of the HPET-based hardlockup detector fails and the NMI
watchdog falls back to the perf-based implementation.

Implement the command-line parsing using an early_param, as
__setup("nmi_watchdog=") only parses generic options.

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
--
Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Do not imply that using nmi_watchdog=hpet means the detector is
   enabled. Instead, print a warning in such case.

Changes since v1:
 * Added documentation to the function handing the nmi_watchdog
   kernel command-line argument.
---
 .../admin-guide/kernel-parameters.txt |  8 ++-
 arch/x86/kernel/watchdog_hld_hpet.c   | 22 +++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 269be339d738..89eae950fdb8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3370,7 +3370,7 @@
Format: [state][,regs][,debounce][,die]
 
nmi_watchdog=   [KNL,BUGS=X86] Debugging features for SMP kernels
-   Format: [panic,][nopanic,][num]
+   Format: [panic,][nopanic,][num,][hpet]
Valid num: 0 or 1
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
@@ -3381,6 +3381,12 @@
please see 'nowatchdog'.
This is useful when you use a panic=... timeout and
need the box quickly up again.
+   When hpet is specified, the NMI watchdog will be driven
+   by an HPET timer, if available in the system. Otherwise,
+   it falls back to the default implementation (perf or
+   architecture-specific). Specifying hpet has no effect
+   if the NMI watchdog is not enabled (either at build time
+   or via the command line).
 
These settings can be accessed at runtime via
the nmi_watchdog and hardlockup_panic sysctls.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c 
b/arch/x86/kernel/watchdog_hld_hpet.c
index 3effdbf29095..4413d5fb94f4 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -379,6 +379,28 @@ void hardlockup_detector_hpet_start(void)
enable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str:   A string containing the kernel command line
+ *
+ * Parse the nmi_watchdog parameter from the kernel command line. If
+ * selected by the user, use this implementation to detect hardlockups.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+   if (!str)
+   return -EINVAL;
+
+   if (parse_option_str(str, "hpet"))
+   hardlockup_use_hpet = true;
+
+   if (!nmi_watchdog_user_enabled && hardlockup_use_hpet)
+   pr_err("Selecting HPET NMI watchdog has no effect with NMI 
watchdog disabled\n");
+
+   return 0;
+}
+early_param("nmi_watchdog", hardlockup_detector_hpet_setup);
+
 /**
  * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
  *
-- 
2.17.1



[PATCH v6 22/29] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector

2022-05-05 Thread Ricardo Neri
Implement a hardlockup detector that uses an HPET channel as the source
of the non-maskable interrupt. Implement the basic functionality to
start, stop, and configure the timer.

Designate as the handling CPU one of the CPUs that the detector monitors.
Use it to service the NMI from the HPET channel. When servicing the HPET
NMI, issue an inter-processor interrupt to the rest of the monitored CPUs.
Only enable the detector if IPI shorthands are enabled in the system.

During operation, the HPET registers are only accessed to kick the timer.
This operation can be avoided if a periodic HPET channel is added to the
detector.

To configure the HPET channel interrupt, the detector relies on the
interrupt subsystem to configure the deliver mode as NMI (as requested
in hpet_hld_get_timer()) throughout the IRQ hierarchy. This covers
systems with and without interrupt remapping enabled.

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces implemented in this changeset to start, stop, and
reconfigure the detector. Another subsequent changeset implements logic
to determine if the HPET timer caused the NMI. For now, implement a
stub function.
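
For illustration only (not part of the patch), a minimal sketch of how the
channel can be kicked in one-shot mode, assuming the hpet_hld_data layout
from the reservation patch and the hpet_readl()/hpet_writel() helpers
exposed earlier in the series:

static void kick_timer_sketch(struct hpet_hld_data *hdata)
{
	/* Program the comparator one full watchdog period ahead of now. */
	u32 count = hpet_readl(HPET_COUNTER);
	u32 ticks = (u32)(watchdog_thresh * hdata->ticks_per_second);

	hpet_writel(count + ticks, HPET_Tn_CMP(hdata->channel));
}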

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Squashed a previously separate patch to support interrupt remapping into
   this patch. There is no need to handle interrupt remapping separately.
   All the necessary plumbing is done in the interrupt subsystem. Now it
   uses request_irq().
 * Use IPI shorthands to send an NMI to the CPUs being monitored. (Thomas)
 * Added extra check to only use the HPET hardlockup detector if the IPI
   shorthands are enabled. (Thomas)
 * Relocated flushing of outstanding interrupts from enable_timer() to
   disable_timer(). On some systems, making any change in the
   configuration of the HPET channel causes it to issue an interrupt.
 * Added a new cpumask to function as a per-cpu test bit to determine if
   a CPU should check for hardlockups.
 * Dropped pointless X86_64 || X86_32 check in Kconfig. (Tony)
 * Dropped pointless dependency on CONFIG_HPET.
 * Added dependency on CONFIG_GENERIC_MSI_IRQ, needed to build the [|IR]-
   HPET-MSI irq_chip.
 * Added hardlockup_detector_hpet_start() to be used when tsc_khz is
   recalibrated.
 * Reworked the periodic setting of the HPET channel. Rather than changing it
   every time the channel is disabled or enabled, do it only once. While
   at here, wrap the code in an initial setup function.
 * Implemented hardlockup_detector_hpet_start() to be called when tsc_khz
   is refined.
 * Enhanced inline comments for clarity.
 * Added missing #include files.
 * Relocated function declarations to not depend on CONFIG_HPET_TIMER.

Changes since v4:
 * Dropped hpet_hld_data.enabled_cpus and instead use cpumask_weight().
 * Renamed hpet_hld_data.cpu_monitored_mask to
   hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t.
 * Flushed out any outstanding interrupt before enabling the HPET channel.
 * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
 * Added comments in hardlockup_detector_nmi_handler() to explain how
   CPUs are targeted for an IPI.
 * Updated code to only issue an IPI when needed (i.e., there are monitored
   CPUs to be inspected via an IPI).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handled the case of watchdog_thresh = 0 when disabling the detector.
 * Made this detector available to i386.
 * Reworked logic to kick the timer to remove a local variable. (Andi)
 * Added a comment on what type of timer channel will be assigned to the
   detector. (Andi)
 * Reworded prompt comment in Kconfig. (Andi)
 * Removed unneeded switch to level interrupt mode when disabling the
   timer. (Andi)
 * Disabled the HPET timer to avoid a race between an incoming interrupt
   and an update of the MSI destination ID. (Ashok)
 * Corrected a typo in an inline comment. (Tony)
 * Made the HPET hardlockup detector depend on HARDLOCKUP_DETECTOR instead
   of selecting it.

Changes since v3:
 * Fixed typo in Kconfig.debug. (Randy Dunlap)
 * Added missing slab.h to include the definition of kfree to fix a build
   break.

Changes since v2:
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc. (Peter Zijlstra)
 * Removed redundant documentation of functions. (Thomas Gleixner)
 * Added CPU as argument hardlockup_detector_hpet_enable()/disable().
   (Thomas Gleixner).

Changes since v1:
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Dropped support for IO APIC interrupts and instead use only MSI
   interrupts.
 * Removed 

[PATCH v6 24/29] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"

2022-05-05 Thread Ricardo Neri
Prepare hardlockup_panic_setup() to handle a comma-separated list of
options. Thus, it can continue parsing its own command-line options while
ignoring parameters that are relevant only to specific implementations of
the hardlockup detector. Such implementations may use an early_param to
parse their own options.
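
For illustration (not part of the patch), with this change a combined
command line such as:

	nmi_watchdog=panic,1,hpet

lets hardlockup_panic_setup() act on the "panic" and "1" tokens while
ignoring "hpet", which is consumed by the detector-specific early_param
added in the HPET parsing patch.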

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Corrected typo in commit message. (Tony)

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * None
---
 kernel/watchdog.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 9166220457bc..6443841a755f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -73,13 +73,13 @@ void __init hardlockup_detector_disable(void)
 
 static int __init hardlockup_panic_setup(char *str)
 {
-   if (!strncmp(str, "panic", 5))
+   if (parse_option_str(str, "panic"))
hardlockup_panic = 1;
-   else if (!strncmp(str, "nopanic", 7))
+   else if (parse_option_str(str, "nopanic"))
hardlockup_panic = 0;
-   else if (!strncmp(str, "0", 1))
+   else if (parse_option_str(str, "0"))
nmi_watchdog_user_enabled = 0;
-   else if (!strncmp(str, "1", 1))
+   else if (parse_option_str(str, "1"))
nmi_watchdog_user_enabled = 1;
return 1;
 }
-- 
2.17.1



[PATCH v6 20/29] init/main: Delay initialization of the lockup detector after smp_init()

2022-05-05 Thread Ricardo Neri
Certain implementations of the hardlockup detector require support for
Inter-Processor Interrupt shorthands. On x86, support for these can only
be determined after all the possible CPUs have booted once (in
smp_init()). Other architectures may not need such a check.

lockup_detector_init() only initializes the data structures of the lockup
detector. Hence, it has no dependencies on smp_init() and can safely be
called after it.

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 init/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/init/main.c b/init/main.c
index 98182c3c2c4b..62c52c9e4c2b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1600,9 +1600,11 @@ static noinline void __init kernel_init_freeable(void)
 
rcu_init_tasks_generic();
do_pre_smp_initcalls();
-   lockup_detector_init();
 
smp_init();
+
+   lockup_detector_init();
+
sched_init_smp();
 
padata_init();
-- 
2.17.1



[PATCH v6 21/29] x86/nmi: Add an NMI_WATCHDOG NMI handler category

2022-05-05 Thread Ricardo Neri
Add NMI_WATCHDOG as a new category of NMI handler. This new category
is to be used with the HPET-based hardlockup detector. This detector
does not have a direct way of checking if the HPET timer is the source of
the NMI. Instead, it indirectly estimates it using the time-stamp counter.

Therefore, we may have false-positives in case another NMI occurs within
the estimated time window. For this reason, we want the handler of the
detector to be called after all the NMI_LOCAL handlers. A simple way
of achieving this is with a new NMI handler category.
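
For illustration only, a sketch (not the patch code) of how a detector can
hook into the new category; is_hpet_hld_interrupt() stands in for the
TSC-window test described above and is a made-up name:

static int hld_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
	/* Claim the NMI only if it plausibly came from the HPET channel. */
	if (!is_hpet_hld_interrupt())
		return NMI_DONE;

	inspect_for_hardlockups(regs);
	return NMI_HANDLED;
}

static int __init hld_register_sketch(void)
{
	/* Runs after all NMI_LOCAL handlers because it has its own category. */
	return register_nmi_handler(NMI_WATCHDOG, hld_nmi_handler, 0, "hpet_hld");
}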

Cc: Andi Kleen 
Cc: Andrew Morton 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Updated to call instrumentation_end() as per f051f6979550 ("x86/nmi:
   Protect NMI entry against instrumentation")

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/nmi.h |  1 +
 arch/x86/kernel/nmi.c  | 10 ++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 1cb9c17a4cb4..4a0d5b562c91 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -28,6 +28,7 @@ enum {
NMI_UNKNOWN,
NMI_SERR,
NMI_IO_CHECK,
+   NMI_WATCHDOG,
NMI_MAX
 };
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index e73f7df362f5..fde387e0812a 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -61,6 +61,10 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
.head = LIST_HEAD_INIT(nmi_desc[3].head),
},
+   {
+   .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock),
+   .head = LIST_HEAD_INIT(nmi_desc[4].head),
+   },
 
 };
 
@@ -168,6 +172,8 @@ int __register_nmi_handler(unsigned int type, struct 
nmiaction *action)
 */
WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head));
WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head));
+   WARN_ON_ONCE(type == NMI_WATCHDOG && !list_empty(&desc->head));
+
 
/*
 * some handlers need to be executed first otherwise a fake
@@ -379,6 +385,10 @@ static noinstr void default_do_nmi(struct pt_regs *regs)
}
raw_spin_unlock(&nmi_reason_lock);
 
+   handled = nmi_handle(NMI_WATCHDOG, regs);
+   if (handled == NMI_HANDLED)
+   goto out;
+
/*
 * Only one NMI can be latched at a time.  To handle
 * this we may process multiple nmi handlers at once to
-- 
2.17.1



[PATCH v6 18/29] watchdog/hardlockup: Define a generic function to detect hardlockups

2022-05-05 Thread Ricardo Neri
The procedure to detect hardlockups is independent of the underlying
mechanism that generates the non-maskable interrupt used to drive the
detector. Thus, it can be put in a separate, generic function. In this
manner, it can be invoked by various implementations of the NMI watchdog.

For this purpose, move the bulk of watchdog_overflow_callback() to the
new function inspect_for_hardlockups(). This function can then be called
from the applicable NMI handlers. No functional changes.

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 include/linux/nmi.h   |  1 +
 kernel/watchdog_hld.c | 18 +++---
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 750c7f395ca9..1b68f48ad440 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -207,6 +207,7 @@ int proc_nmi_watchdog(struct ctl_table *, int , void *, 
size_t *, loff_t *);
 int proc_soft_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_watchdog_thresh(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_watchdog_cpumask(struct ctl_table *, int, void *, size_t *, loff_t *);
+void inspect_for_hardlockups(struct pt_regs *regs);
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #include 
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..b352e507b17f 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = {
.disabled   = 1,
 };
 
-/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event,
-  struct perf_sample_data *data,
-  struct pt_regs *regs)
+void inspect_for_hardlockups(struct pt_regs *regs)
 {
-   /* Ensure the watchdog never gets throttled */
-   event->hw.interrupts = 0;
-
if (__this_cpu_read(watchdog_nmi_touch) == true) {
__this_cpu_write(watchdog_nmi_touch, false);
return;
@@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event 
*event,
return;
 }
 
+/* Callback function for perf event subsystem */
+static void watchdog_overflow_callback(struct perf_event *event,
+  struct perf_sample_data *data,
+  struct pt_regs *regs)
+{
+   /* Ensure the watchdog never gets throttled */
+   event->hw.interrupts = 0;
+   inspect_for_hardlockups(regs);
+}
+
 static int hardlockup_detector_event_create(void)
 {
unsigned int cpu = smp_processor_id();
-- 
2.17.1



[PATCH v6 19/29] watchdog/hardlockup: Decouple the hardlockup detector from perf

2022-05-05 Thread Ricardo Neri
The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: create and manage perf events, stop and start the perf-
based detector.

The generic portion of the detector (monitoring the watchdog thresholds,
checking timestamps and detecting hardlockups, as well as the implementation of
arch_touch_nmi_watchdog()) is now selected with the new intermediate config
symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.
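
As an illustration (not part of the patch), an alternative implementation
would follow the same pattern in its Kconfig entry; the symbol name below
is made up:

config HARDLOCKUP_DETECTOR_FOO
	bool "Foo-driven hard lockup detector"
	depends on HARDLOCKUP_DETECTOR
	select HARDLOCKUP_DETECTOR_CORE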

Cc: Andi Kleen 
Cc: Nicholas Piggin 
Cc: Andrew Morton 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * Squashed into this patch a previous patch to make
   arch_touch_nmi_watchdog() part of the core detector code.

Changes since v2:
 * Undid split of the generic hardlockup detector into a separate file.
   (Thomas Gleixner)
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).

Changes since v1:
 * Make the generic detector code with CONFIG_HARDLOCKUP_DETECTOR.
---
 include/linux/nmi.h   |  5 -
 kernel/Makefile   |  2 +-
 kernel/watchdog_hld.c | 32 
 lib/Kconfig.debug |  4 
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 1b68f48ad440..cf12380e51b3 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM  0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index 847a82bfe0e3..27e75b735ef7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -95,7 +95,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-   .type   = PERF_TYPE_HARDWARE,
-   .config = PERF_COUNT_HW_CPU_CYCLES,
-   .size   = sizeof(struct perf_event_attr),
-   .pinned = 1,
-   .disabled   = 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .config = PERF_COUNT_HW_CPU_CYCLES,
+   .size   = sizeof(struct perf_event_attr),
+   .pinned = 1,
+   .disabled   = 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
   struct perf_sample_data *data,
@@ -298,3 +304,5 @@ int __init hardlockup_detector_perf_init(void)
}
return ret;
 }
+
+#endif /* 

[PATCH v6 17/29] x86/hpet: Reserve an HPET channel for the hardlockup detector

2022-05-05 Thread Ricardo Neri
The HPET hardlockup detector needs a dedicated HPET channel. Hence, create
a new HPET_MODE_NMI_WATCHDOG mode category to indicate that it cannot be
used for other purposes. Using MSI interrupts greatly simplifies the
implementation of the detector. Specifically, it helps to avoid the
complexities of routing the interrupt via the IO-APIC (e.g., potential
race conditions that arise from re-programming the IO-APIC while also
servicing an NMI). Therefore, only reserve the timer if it supports Front
Side Bus interrupt delivery.

HPET channels are reserved at various stages. First, from
x86_late_time_init(), hpet_time_init() checks if the HPET timer supports
Legacy Replacement Routing. If this is the case, channels 0 and 1 are
reserved as HPET_MODE_LEGACY.

At a later stage, from lockup_detector_init(), reserve the HPET channel
for the hardlockup detector. Then the HPET clocksource reserves the
channels it needs, and the remaining channels are given to the HPET
char driver via hpet_alloc().

Hence, the channel assigned to the HPET hardlockup detector depends on
whether the first two channels are reserved for legacy mode.

Lastly, only reserve the channel for the hardlockup detector if enabled
in the kernel command line.
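
For illustration only, a sketch of how the detector's init path could
consume the reserved channel; hld_setup_irq() is a placeholder for the NMI
setup done elsewhere in the series:

static struct hpet_hld_data *hld_data;

static int __init hld_reserve_sketch(void)
{
	hld_data = hpet_hld_get_timer();
	if (!hld_data)
		return -ENODEV;

	if (hld_setup_irq(hld_data)) {
		/* Return the channel if the rest of the setup fails. */
		hpet_hld_free_timer(hld_data);
		hld_data = NULL;
		return -ENODEV;
	}

	return 0;
}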

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Added a check for the allowed maximum frequency of the HPET.
 * Added hpet_hld_free_timer() to properly free the reserved HPET channel
   if the initialization is not completed.
 * Call hpet_assign_irq() with as_nmi = true.
 * Relocated declarations of functions and data structures of the detector
   to not depend on CONFIG_HPET_TIMER.

Changes since v4:
 * Reworked timer reservation to use Thomas' rework on HPET channel
   management.
 * Removed hard-coded channel number for the hardlockup detector.
 * Provided more details on the sequence of HPET channel reservations.
   (Thomas Gleixner)
 * Only reserve a channel for the hardlockup detector if enabled via
   kernel command line. The function reserving the channel is called from
   hardlockup detector. (Thomas Gleixner)
 * Shorten the name of hpet_hardlockup_detector_get_timer() to
   hpet_hld_get_timer(). (Andi)
 * Simplify error handling when a channel is not found. (Tony)

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/hpet.h |  22 
 arch/x86/kernel/hpet.c  | 105 
 2 files changed, 127 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 486e001413c7..5762bd0169a1 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -103,4 +103,26 @@ static inline int is_hpet_enabled(void) { return 0; }
 #define default_setup_hpet_msi NULL
 
 #endif
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+/**
+ * struct hpet_hld_data - Data needed to operate the detector
+ * @has_periodic:  The HPET channel supports periodic mode
+ * @channel:   HPET channel assigned to the detector
+ * @channel_priv:  Private data of the assigned channel
+ * @ticks_per_second:  Frequency of the HPET timer
+ * @irq:   IRQ number assigned to the HPET channel
+ */
+struct hpet_hld_data {
+   boolhas_periodic;
+   u32 channel;
+   struct hpet_channel *channel_priv;
+   u64 ticks_per_second;
+   int irq;
+};
+
+extern struct hpet_hld_data *hpet_hld_get_timer(void);
+extern void hpet_hld_free_timer(struct hpet_hld_data *hdata);
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #endif /* _ASM_X86_HPET_H */
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 02d25e00e93f..ee9275c013f5 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -20,6 +20,7 @@ enum hpet_mode {
HPET_MODE_LEGACY,
HPET_MODE_CLOCKEVT,
HPET_MODE_DEVICE,
+   HPET_MODE_NMI_WATCHDOG,
 };
 
 struct hpet_channel {
@@ -216,6 +217,7 @@ static void __init hpet_reserve_platform_timers(void)
break;
case HPET_MODE_CLOCKEVT:
case HPET_MODE_LEGACY:
+   case HPET_MODE_NMI_WATCHDOG:
hpet_reserve_timer(&hd, hc->num);
break;
}
@@ -1496,3 +1498,106 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
 }
 EXPORT_SYMBOL_GPL(hpet_rtc_interrupt);
 #endif
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+
+/*
+ * We program the timer in 32-bit mode to reduce the number of register
+ * accesses. The maximum value of watchdog_thresh is 60 seconds. The HPET counter
+ * should not wrap around more frequently than that. Thus, the frequency of the
+ * HPET timer must be less than 71.582788 MHz. 

[PATCH v6 08/29] iommu/vt-d: Rework prepare_irte() to support per-IRQ delivery mode

2022-05-05 Thread Ricardo Neri
struct irq_cfg::delivery_mode specifies the delivery mode of each IRQ
separately. Configuring the delivery mode of an IRTE would require adding
a third argument to prepare_irte(). Instead, simply take a pointer to the
irq_cfg for which an IRTE is being configured. This change does not cause
functional changes.

Cc: Andi Kleen 
Cc: David Woodhouse 
Cc: "Ravi V. Shankar" 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Ashok Raj 
Reviewed-by: Tony Luck 
Reviewed-by: Lu Baolu 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Only change the signature of prepare_irte(). A separate patch changes
   the setting of the delivery_mode.

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Introduced this patch.
---
 drivers/iommu/intel/irq_remapping.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/irq_remapping.c 
b/drivers/iommu/intel/irq_remapping.c
index d2764a71f91a..66d37186ec28 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -,7 +,7 @@ void intel_irq_remap_add_device(struct 
dmar_pci_notify_info *info)
dev_set_msi_domain(&info->dev->dev, map_dev_to_ir(info->dev));
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
memset(irte, 0, sizeof(*irte));
 
@@ -1126,8 +1126,8 @@ static void prepare_irte(struct irte *irte, int vector, 
unsigned int dest)
*/
irte->trigger_mode = 0;
irte->dlvry_mode = apic->delivery_mode;
-   irte->vector = vector;
-   irte->dest_id = IRTE_DEST(dest);
+   irte->vector = irq_cfg->vector;
+   irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
 
/*
 * When using the destination mode of physical APICID, only the
@@ -1278,8 +1278,7 @@ static void intel_irq_remapping_prepare_irte(struct 
intel_ir_data *data,
 {
struct irte *irte = &data->irte_entry;
 
-   prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
-
+   prepare_irte(irte, irq_cfg);
switch (info->type) {
case X86_IRQ_ALLOC_TYPE_IOAPIC:
/* Set source-id of interrupt request */
-- 
2.17.1



[PATCH v6 01/29] irq/matrix: Expose functions to allocate the best CPU for new vectors

2022-05-05 Thread Ricardo Neri
Certain types of interrupts, such as NMI, do not have an associated vector.
They, however, target specific CPUs. Thus, when assigning the destination
CPU, it is beneficial to select the one with the lowest number of vectors.
Prepend the functions matrix_find_best_cpu() and
matrix_find_best_cpu_managed() with the irq_ prefix and expose them for
IRQ controllers to use when allocating and activating vector-less IRQs.
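
For illustration only (not part of the patch), a caller could use the
exposed helper to pick the destination CPU of a vector-less NMI; locking
around the matrix is omitted:

static int pick_nmi_dest(struct irq_matrix *m, const struct cpumask *mask,
			 unsigned int *dest_cpu)
{
	unsigned int cpu = irq_matrix_find_best_cpu(m, mask);

	if (cpu == UINT_MAX)
		return -ENOSPC;

	/* No vector is allocated; only the destination CPU is recorded. */
	*dest_cpu = cpu;
	return 0;
}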

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 include/linux/irq.h |  4 
 kernel/irq/matrix.c | 32 +++-
 2 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index f92788ccdba2..9e674e73d295 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1223,6 +1223,10 @@ struct irq_matrix *irq_alloc_matrix(unsigned int 
matrix_bits,
 void irq_matrix_online(struct irq_matrix *m);
 void irq_matrix_offline(struct irq_matrix *m);
 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool 
replace);
+unsigned int irq_matrix_find_best_cpu(struct irq_matrix *m,
+ const struct cpumask *msk);
+unsigned int irq_matrix_find_best_cpu_managed(struct irq_matrix *m,
+ const struct cpumask *msk);
 int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask 
*msk);
 void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask 
*msk);
 int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 1698e77645ac..810479f608f4 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -125,9 +125,16 @@ static unsigned int matrix_alloc_area(struct irq_matrix 
*m, struct cpumap *cm,
return area;
 }
 
-/* Find the best CPU which has the lowest vector allocation count */
-static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
-   const struct cpumask *msk)
+/**
+ * irq_matrix_find_best_cpu() - Find the best CPU for an IRQ
+ * @m: Matrix pointer
+ * @msk:   On which CPUs the search will be performed
+ *
+ * Find the best CPU which has the lowest vector allocation count
+ * Returns: The best CPU to use
+ */
+unsigned int irq_matrix_find_best_cpu(struct irq_matrix *m,
+ const struct cpumask *msk)
 {
unsigned int cpu, best_cpu, maxavl = 0;
struct cpumap *cm;
@@ -146,9 +153,16 @@ static unsigned int matrix_find_best_cpu(struct irq_matrix 
*m,
return best_cpu;
 }
 
-/* Find the best CPU which has the lowest number of managed IRQs allocated */
-static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
-   const struct cpumask *msk)
+/**
+ * irq_matrix_find_best_cpu_managed() - Find the best CPU for a managed IRQ
+ * @m: Matrix pointer
+ * @msk:   On which CPUs the search will be performed
+ *
+ * Find the best CPU which has the lowest number of managed IRQs allocated
+ * Returns: The best CPU to use
+ */
+unsigned int irq_matrix_find_best_cpu_managed(struct irq_matrix *m,
+ const struct cpumask *msk)
 {
unsigned int cpu, best_cpu, allocated = UINT_MAX;
struct cpumap *cm;
@@ -292,7 +306,7 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, const 
struct cpumask *msk,
if (cpumask_empty(msk))
return -EINVAL;
 
-   cpu = matrix_find_best_cpu_managed(m, msk);
+   cpu = irq_matrix_find_best_cpu_managed(m, msk);
if (cpu == UINT_MAX)
return -ENOSPC;
 
@@ -381,13 +395,13 @@ int irq_matrix_alloc(struct irq_matrix *m, const struct 
cpumask *msk,
struct cpumap *cm;
 
/*
-* Not required in theory, but matrix_find_best_cpu() uses
+* Not required in theory, but irq_matrix_find_best_cpu() uses
 * for_each_cpu() which ignores the cpumask on UP .
 */
if (cpumask_empty(msk))
return -EINVAL;
 
-   cpu = matrix_find_best_cpu(m, msk);
+   cpu = irq_matrix_find_best_cpu(m, msk);
if (cpu == UINT_MAX)
return -ENOSPC;
 
-- 
2.17.1



[PATCH v6 16/29] x86/hpet: Prepare IRQ assignments to use the X86_ALLOC_AS_NMI flag

2022-05-05 Thread Ricardo Neri
The flag X86_IRQ_ALLOC_AS_NMI indicates that the IRQs to be allocated in an
IRQ domain need to be configured as NMIs.  Add an as_nmi argument to
hpet_assign_irq(). Even though the HPET clock events do not need NMI
IRQs, the HPET hardlockup detector does. A subsequent changeset will
implement the reservation of a channel for it.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/hpet.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 2c6713b40921..02d25e00e93f 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -618,7 +618,7 @@ static inline int hpet_dev_id(struct irq_domain *domain)
 }
 
 static int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
-  int dev_num)
+  int dev_num, bool as_nmi)
 {
struct irq_alloc_info info;
 
@@ -627,6 +627,8 @@ static int hpet_assign_irq(struct irq_domain *domain, 
struct hpet_channel *hc,
info.data = hc;
info.devid = hpet_dev_id(domain);
info.hwirq = dev_num;
+   if (as_nmi)
+   info.flags |= X86_IRQ_ALLOC_AS_NMI;
 
return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info);
 }
@@ -755,7 +757,7 @@ static void __init hpet_select_clockevents(void)
 
sprintf(hc->name, "hpet%d", i);
 
-   irq = hpet_assign_irq(hpet_domain, hc, hc->num);
+   irq = hpet_assign_irq(hpet_domain, hc, hc->num, false);
if (irq <= 0)
continue;
 
-- 
2.17.1



[PATCH v6 15/29] x86/hpet: Add helper function hpet_set_comparator_periodic()

2022-05-05 Thread Ricardo Neri
Programming an HPET channel as periodic requires setting the
HPET_TN_SETVAL bit in the channel configuration. Plus, the comparator
register must be written twice (once for the comparator value and once for
the periodic value). Since this programming might be needed in several
places (e.g., the HPET clocksource and the HPET-based hardlockup detector),
add a helper function for this purpose.

A helper function hpet_set_comparator_oneshot() could also be implemented.
However, such a function would only program the comparator register and
it would be quite small. Hence, it is better not to bloat the code
with such an obvious function.
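
For illustration only, a caller programming a channel to fire hz times per
second could look like this (everything except the new helper is an
assumption):

static void program_periodic_sketch(int channel, u32 hpet_freq, u32 hz)
{
	unsigned int period = hpet_freq / hz;
	unsigned int now = hpet_readl(HPET_COUNTER);

	/* First expiration one period from now, then every 'period' ticks. */
	hpet_set_comparator_periodic(channel, now + period, period);
}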

Cc: Andi Kleen 
Cc: Tony Luck 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Originally-by: Suravee Suthikulpanit 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
When programming the HPET channel in periodic mode, a udelay(1) between
the two successive writes to HPET_Tn_CMP was introduced in commit
e9e2cdb41241 ("[PATCH] clockevents: i386 drivers"). The commit message
does not give any reason for such a delay. The hardware specification does
not seem to require it. The refactoring in this patch simply carries such
delay.
---
Changes since v5:
 * None

Changes since v4:
 * Implement function only for periodic mode. This removed extra logic
   to use a non-zero period value as a proxy for periodic mode
   programming. (Thomas)
 * Added a comment on the history of the udelay() when programming the
   channel in periodic mode. (Ashok)

Changes since v3:
 * Added back a missing hpet_writel() for time configuration.

Changes since v2:
 *  Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/hpet.c  | 49 -
 2 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index be9848f0883f..486e001413c7 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -74,6 +74,8 @@ extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
 extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
+extern void hpet_set_comparator_periodic(int channel, unsigned int cmp,
+unsigned int period);
 
 #ifdef CONFIG_HPET_EMULATE_RTC
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 47678e7927ff..2c6713b40921 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -294,6 +294,39 @@ static void hpet_enable_legacy_int(void)
hpet_legacy_int_enabled = true;
 }
 
+/**
+ * hpet_set_comparator_periodic() - Helper function to set periodic channel
+ * @channel:   The HPET channel
+ * @cmp:   The value to be written to the comparator/accumulator
+ * @period:Number of ticks per period
+ *
+ * Helper function for updating comparator, accumulator and period values.
+ *
+ * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
+ * to the Tn_CMP to update the accumulator. Then, HPET needs a second
+ * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
+ * The HPET_TN_SETVAL bit is automatically cleared after the first write.
+ *
+ * This function takes a 1 microsecond delay. However, this function is 
supposed
+ * to be called only once (or when reprogramming the timer) as it deals with a
+ * periodic timer channel.
+ *
+ * See the following documents:
+ *   - Intel IA-PC HPET (High Precision Event Timers) Specification
+ *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
+ */
+void hpet_set_comparator_periodic(int channel, unsigned int cmp, unsigned int 
period)
+{
+   unsigned int v = hpet_readl(HPET_Tn_CFG(channel));
+
+   hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(channel));
+
+   hpet_writel(cmp, HPET_Tn_CMP(channel));
+
+   udelay(1);
+   hpet_writel(period, HPET_Tn_CMP(channel));
+}
+
 static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt)
 {
unsigned int channel = clockevent_to_channel(evt)->num;
@@ -306,19 +339,11 @@ static int hpet_clkevt_set_state_periodic(struct 
clock_event_device *evt)
now = hpet_readl(HPET_COUNTER);
cmp = now + (unsigned int)delta;
cfg = hpet_readl(HPET_Tn_CFG(channel));
-   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
-  HPET_TN_32BIT;
+   cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
hpet_writel(cfg, HPET_Tn_CFG(channel));
-   hpet_writel(cmp, HPET_Tn_CMP(channel));
-   udelay(1);
-   /*
-* HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
-* cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
-* bit is automatically cleared after the first write.
-* (See AMD-8111 HyperTransport I/O Hub Data Sheet,
-* Publication # 24674)
-

[PATCH v6 03/29] x86/apic/msi: Set the delivery mode individually for each IRQ

2022-05-05 Thread Ricardo Neri
There is no hardware restriction that prevents each MSI message from using
its own delivery mode. Use the mode specified in the provided IRQ hardware
configuration data. Since most IRQs are configured to use the delivery mode
of the APIC driver in use (APIC_DELIVERY_MODE_FIXED for all of them), the
only functional changes are where IRQs are configured to use a specific
delivery mode.

Changing the utility function __irq_msi_compose_msg() takes care of
implementing the change in the local APIC, PCI-MSI, and DMAR-MSI
irq_chips.

The IO-APIC irq_chip configures the entries in the interrupt redirection
table using the delivery mode specified in the corresponding MSI message.
Since the MSI message is composed by a higher irq_chip in the hierarchy,
it does not need to be updated.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/apic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 189d3a5e471a..d1e12da1e9af 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2528,7 +2528,7 @@ void __irq_msi_compose_msg(struct irq_cfg *cfg, struct 
msi_msg *msg,
msg->arch_addr_lo.dest_mode_logical = apic->dest_mode_logical;
msg->arch_addr_lo.destid_0_7 = cfg->dest_apicid & 0xFF;
 
-   msg->arch_data.delivery_mode = APIC_DELIVERY_MODE_FIXED;
+   msg->arch_data.delivery_mode = cfg->delivery_mode;
msg->arch_data.vector = cfg->vector;
 
msg->address_hi = X86_MSI_BASE_ADDRESS_HIGH;
-- 
2.17.1



[PATCH v6 11/29] iommu/amd: Expose [set|get]_dev_entry_bit()

2022-05-05 Thread Ricardo Neri
These functions are used to check and set specific bits in a Device Table
Entry. For instance, they can be used to modify the setting of the NMIPass
field.

Currently, these functions are used only for ACPI-specified devices.
However, when an interrupt is allocated with NMI as its delivery mode, the
Device Table Entry needs to be modified accordingly in irq_remapping_alloc().

As a first step expose these two functions. No functional changes.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Joerg Roedel 
Cc: Suravee Suthikulpanit 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/amd_iommu.h | 3 +++
 drivers/iommu/amd/init.c  | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 1ab31074f5b3..9f3d1564c84e 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -128,4 +128,7 @@ static inline void amd_iommu_apply_ivrs_quirks(void) { }
 
 extern void amd_iommu_domain_set_pgtable(struct protection_domain *domain,
 u64 *root, int mode);
+
+extern void set_dev_entry_bit(u16 devid, u8 bit);
+extern int get_dev_entry_bit(u16 devid, u8 bit);
 #endif
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index b4a798c7b347..823e76b284f1 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -914,7 +914,7 @@ static void iommu_enable_gt(struct amd_iommu *iommu)
 }
 
 /* sets a specific bit in the device table entry. */
-static void set_dev_entry_bit(u16 devid, u8 bit)
+void set_dev_entry_bit(u16 devid, u8 bit)
 {
int i = (bit >> 6) & 0x03;
int _bit = bit & 0x3f;
@@ -922,7 +922,7 @@ static void set_dev_entry_bit(u16 devid, u8 bit)
amd_iommu_dev_table[devid].data[i] |= (1UL << _bit);
 }
 
-static int get_dev_entry_bit(u16 devid, u8 bit)
+int get_dev_entry_bit(u16 devid, u8 bit)
 {
int i = (bit >> 6) & 0x03;
int _bit = bit & 0x3f;
-- 
2.17.1



[PATCH v6 12/29] iommu/amd: Enable NMIPass when allocating an NMI irq

2022-05-05 Thread Ricardo Neri
As per the AMD I/O Virtualization Technology (IOMMU) Specification, the
AMD IOMMU only remaps fixed and arbitrated MSIs. NMIs are controlled
by the NMIPass bit of a Device Table Entry. When set, the IOMMU passes
through NMI interrupt messages unmapped. Otherwise, they are aborted.

Furthermore, Section 2.2.5 Table 19 states that the IOMMU will also
abort NMIs when the destination mode is logical.

Update the NMIPass setting of a device's DTE when an NMI irq is being
allocated. Only do so when the destination mode of the APIC is not
logical.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Joerg Roedel 
Cc: Suravee Suthikulpanit 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/iommu.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index a1ada7bff44e..4d7421b6858d 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3156,6 +3156,15 @@ static int irq_remapping_alloc(struct irq_domain 
*domain, unsigned int virq,
info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX)
return -EINVAL;
 
+   if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
+   /* Only one IRQ per NMI */
+   if (nr_irqs != 1)
+   return -EINVAL;
+
+   /* NMIs are aborted when the destination mode is logical. */
+   if (apic->dest_mode_logical)
+   return -EPERM;
+   }
/*
 * With IRQ remapping enabled, don't need contiguous CPU vectors
 * to support multiple MSI interrupts.
@@ -3208,6 +3217,15 @@ static int irq_remapping_alloc(struct irq_domain 
*domain, unsigned int virq,
goto out_free_parent;
}
 
+   if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
+   struct amd_iommu *iommu = amd_iommu_rlookup_table[devid];
+
+   if (!get_dev_entry_bit(devid, DEV_ENTRY_NMI_PASS)) {
+   set_dev_entry_bit(devid, DEV_ENTRY_NMI_PASS);
+   iommu_flush_dte(iommu, devid);
+   }
+   }
+
for (i = 0; i < nr_irqs; i++) {
irq_data = irq_domain_get_irq_data(domain, virq + i);
cfg = irq_data ? irqd_cfg(irq_data) : NULL;
-- 
2.17.1



[PATCH v6 00/29] x86: Implement an HPET-based hardlockup detector

2022-05-05 Thread Ricardo Neri
Hi,

This is the sixth version of this patchset. It again took me a while to
post this version as I have been working on other projects and implemented
major reworks in the series.

This work has gone through several versions. I have incorporated feedback
from Thomas Gleixner and others. Many of the patches have review tags (the
patches for arch/x86 have review tags from the trusted x86 reviewers).
I believe it is ready for review by the x86 maintainers. Thus, I have
dropped the RFC qualifier.

I would especially appreciate review from the IOMMU experts on the patches
for the AMD IOMMU (patches 11-13) and from the powerpc experts on the
refactoring of the generic hardlockup detector (patches 18-20, 24, 28).

Also, I observed that the error in the expected TSC value is largest right
after the system boots; afterwards, the error becomes very small. This
version therefore allows a larger error than is strictly necessary. I would
also appreciate feedback on this in particular. Perhaps the system could
switch to this detector after boot and free up the PMU counters. Please see
the Testing section for details.

Previous version of the patches can be found in [1], [2], [3], [4], and
[5]. Also, thanks to Thomas' feedback, this version handles the changes to
the interrupt subsystem more cleanly and therefore it is not necessary to
handle interrupt remapping separately [6] as in the last version.

For easier review, please refer to the changelog below for a list of
unchanged, updated, and new patches.

== Problem statement

In CPU architectures that do not have a non-maskable interrupt (NMI)
watchdog, one can be constructed using any resource capable of issuing
NMIs. On x86, this is done using a counter (one per CPU) of the Performance
Monitoring Unit (PMU). PMU counters, however, are scarce and used mainly to
profile workloads. Using them for the NMI watchdog is overkill.

Moreover, since the counter that the NMI watchdog uses cannot be shared
with any other perf users, the current perf-event subsystem may return a
false positive when validating certain metric groups. Certain metric groups
may never get a chance to be scheduled [7], [8].

== Description of the detector

Unlike the perf-based hardlockup detector, this implementation is driven by
a single source of NMIs: an HPET channel. One of the CPUs that the
hardlockup detector monitors handles the HPET NMI. Upon reception, it
issues a shorthand IPI to the rest of the CPUs. All CPUs will get the NMI,
but only those online and being monitored will look for hardlockups.

The time-stamp counter (TSC) is used to determine whether the HPET caused
the NMI. When the HPET channel fires, we compute the value the TSC is
expected to have the next time it fires. Once it does, we read the actual
TSC value. If it falls within a small error window, we determine that the
HPET channel issued the NMI. I have found experimentally that the expected
TSC value consistently has an error of less than 0.2% (see further details
in the Testing section).
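
For illustration, a minimal sketch of that check (the names and the ~0.2%
window are assumptions, not the exact code in the series):

static u64 tsc_next;	/* TSC value expected at the next HPET NMI */
static u64 tsc_window;	/* allowed deviation, roughly 0.2% of a period */

static bool hpet_caused_nmi(u64 tsc_ticks_per_period)
{
	u64 now = rdtsc();
	bool ours = now >= tsc_next - tsc_window &&
		    now <= tsc_next + tsc_window;

	/* Set the expectation for the next period. */
	tsc_next = now + tsc_ticks_per_period;
	tsc_window = tsc_ticks_per_period >> 9;	/* ~0.2% */

	return ours;
}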

A side effect of using this detector is that in a kernel configured with
NO_HZ_FULL, the CPUs specified in the nohz_full command-line argument would
also receive the NMI IPI (but they would not look for hardlockups). Thus,
this hardlockup detector is an opt-in feature.

This detector cannot be used on systems that disable the HPET, unless it is
force-enabled.

On AMD systems with interrupt remapping enabled, the detector can only be
used in APIC physical mode (see patch 12).

== Main updates in this version

Thomas Gleixner pointed out several design issues:

   1) In previous versions, the HPET channel targeted individual CPUs in a
  round-robin manner. This required changing the affinity of the NMI.
  Doing this with interrupt remapping is not feasible since there is
  no atomic, lock-less way of updating an IRTE in the interrupt
  remapping domain.
+ Solution: In this version, the affinity of the HPET timer never
  changes while it is in operation. When a change in affinity is
  needed (e.g., the handling CPU is going offline or stops being
  monitored), the detector is disabled first.

   2) In previous versions, I issued apic::send_IPI_mask_allbutself() to
  trigger a hardlockup check on the monitored CPUs. This does not work
  as the target cpumask and the contents of the APIC ICR would be
  corrupted [9].
+ Solution: Only use this detector when IPI shorthands [10] are
  enabled in the system.

   3) In previous versions, I added checks in the interrupt remapping
  drivers to identify the HPET hardlockup detector IRQ and changed
  its delivery mode. This looked hacky [11].
+ Solution: As per suggestion from Thomas, I added an
  X86_IRQ_ALLOC_AS_NMI to be used when allocating IRQs. Every
  irq_domain in the IRQ hierarchy will configure the delivery mode
  of the IRQ as specified in the IRQ config info.

== Parts of this series ==


[PATCH v6 14/29] x86/hpet: Expose hpet_writel() in header

2022-05-05 Thread Ricardo Neri
In order to allow hpet_writel() to be used by other components (e.g.,
the HPET-based hardlockup detector), expose it in the HPET header file.

Cc: Andi Kleen 
Cc: Stephane Eranian 
Cc: "Ravi V. Shankar" 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * None

Changes since v4:
 * Dropped exposing hpet_readq() as it is not needed.

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/hpet.h | 1 +
 arch/x86/kernel/hpet.c  | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index ab9f3dd87c80..be9848f0883f 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -72,6 +72,7 @@ extern int is_hpet_enabled(void);
 extern int hpet_enable(void);
 extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
+extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
 
 #ifdef CONFIG_HPET_EMULATE_RTC
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 71f336425e58..47678e7927ff 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -79,7 +79,7 @@ inline unsigned int hpet_readl(unsigned int a)
return readl(hpet_virt_address + a);
 }
 
-static inline void hpet_writel(unsigned int d, unsigned int a)
+inline void hpet_writel(unsigned int d, unsigned int a)
 {
writel(d, hpet_virt_address + a);
 }
-- 
2.17.1



[PATCH v6 13/29] iommu/amd: Compose MSI messages for NMI irqs in non-IR format

2022-05-05 Thread Ricardo Neri
If NMIPass is enabled in a device's DTE, the IOMMU lets NMI interrupt
messages pass through unmapped. Therefore, the contents of the MSI
message, not an IRTE, determine how and where the NMI is delivered.

Since the IOMMU driver owns the MSI message of the NMI irq, compose
it using the non-interrupt-remapping format. Also, let descendant
irqchips write the MSI as appropriate for the device.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Joerg Roedel 
Cc: Suravee Suthikulpanit 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/iommu.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 4d7421b6858d..6e07949b3e2a 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3111,7 +3111,16 @@ static void irq_remapping_prepare_irte(struct 
amd_ir_data *data,
case X86_IRQ_ALLOC_TYPE_HPET:
case X86_IRQ_ALLOC_TYPE_PCI_MSI:
case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
-   fill_msi_msg(&data->msi_entry, irte_info->index);
+   if (irq_cfg->delivery_mode == APIC_DELIVERY_MODE_NMI)
+   /*
+* The IOMMU lets NMIs pass through unmapped. Thus, the
+* MSI message, not the IRTE, determines the irq
+* configuration. Since we own the MSI message,
+* compose it. Descendant irqchips will write it.
+*/
+   __irq_msi_compose_msg(irq_cfg, &data->msi_entry, true);
+   else
+   fill_msi_msg(&data->msi_entry, irte_info->index);
break;
 
default:
@@ -3509,6 +3518,18 @@ static int amd_ir_set_affinity(struct irq_data *data,
 */
send_cleanup_vector(cfg);
 
+   /*
+* When the delivery mode of an irq is NMI, the IOMMU lets the NMI
+* interrupt messages pass through unmapped. Hence, changes in the
+* destination are to be reflected in the NMI message itself, not the
+* IRTE. Thus, descendant irqchips must set the affinity and compose and
+* write the MSI message.
+*
+* Also, NMIs do not have an associated vector. No need for cleanup.
+*/
+   if (cfg->delivery_mode == APIC_DELIVERY_MODE_NMI)
+   return IRQ_SET_MASK_OK;
+
return IRQ_SET_MASK_OK_DONE;
 }
 
-- 
2.17.1



[PATCH v6 07/29] iommu/vt-d: Clear the redirection hint when the destination mode is physical

2022-05-05 Thread Ricardo Neri
When the destination mode of an interrupt is physical, the interrupt is
delivered only to the single CPU whose physical APICID is specified in
the destination ID field. Therefore, the redirection hint is
meaningless.

Furthermore, on certain processors, the IOMMU does not deliver the
interrupt when the delivery mode is NMI, the redirection hint is set, and
the destination mode is physical. Clearing the redirection hint ensures
that the NMI is delivered.

Cc: Andi Kleen 
Cc: David Woodhouse 
Cc: "Ravi V. Shankar" 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Ashok Raj 
Reviewed-by: Lu Baolu 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c 
b/drivers/iommu/intel/irq_remapping.c
index a67319597884..d2764a71f91a 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1128,7 +1128,17 @@ static void prepare_irte(struct irte *irte, int vector, 
unsigned int dest)
irte->dlvry_mode = apic->delivery_mode;
irte->vector = vector;
irte->dest_id = IRTE_DEST(dest);
-   irte->redir_hint = 1;
+
+   /*
+* When using the destination mode of physical APICID, only the
+* processor specified in @dest receives the interrupt. Thus, the
+* redirection hint is meaningless.
+*
+* Furthermore, on some processors, NMIs with physical delivery mode
+* and the redirection hint set are delivered as regular interrupts
+* or not delivered at all.
+*/
+   irte->redir_hint = apic->dest_mode_logical;
 }
 
 struct irq_remap_ops intel_irq_remap_ops = {
-- 
2.17.1



[PATCH v6 10/29] iommu/vt-d: Implement minor tweaks for NMI irqs

2022-05-05 Thread Ricardo Neri
The Intel IOMMU interrupt remapping driver already correctly programs the
delivery mode of individual irqs as per their irq_data. Improve the handling
of NMIs: allow only one irq per NMI, and do not clean up irq vectors after
updating the affinity, since NMIs do not have associated vectors.

Cc: Andi Kleen 
Cc: David Woodhouse 
Cc: "Ravi V. Shankar" 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Lu Baolu 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c 
b/drivers/iommu/intel/irq_remapping.c
index fb2d71bea98d..791a9331e257 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1198,8 +1198,12 @@ intel_ir_set_affinity(struct irq_data *data, const 
struct cpumask *mask,
 * After this point, all the interrupts will start arriving
 * at the new destination. So, time to cleanup the previous
 * vector allocation.
+*
+* Do it only for non-NMI irqs. NMIs don't have associated
+* vectors.
 */
-   send_cleanup_vector(cfg);
+   if (cfg->delivery_mode != APIC_DELIVERY_MODE_NMI)
+   send_cleanup_vector(cfg);
 
return IRQ_SET_MASK_OK_DONE;
 }
@@ -1352,6 +1356,9 @@ static int intel_irq_remapping_alloc(struct irq_domain 
*domain,
if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI)
info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS;
 
+   if ((info->flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1)
+   return -EINVAL;
+
ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
if (ret < 0)
return ret;
-- 
2.17.1



[PATCH v6 02/29] x86/apic: Add irq_cfg::delivery_mode

2022-05-05 Thread Ricardo Neri
Currently, the delivery mode of all interrupts is set to the mode of the
APIC driver in use. There are no restrictions in hardware to configure the
delivery mode of each interrupt individually. Also, certain IRQs need to be
configured with a specific delivery mode (e.g., NMI).

Add a new member, delivery_mode, to struct irq_cfg. Subsequent changesets
will update every irq_domain to set the delivery mode of each IRQ to that
specified in its irq_cfg data.

To keep the current behavior, when allocating an IRQ in the root domain
(i.e., the x86_vector_domain), set the delivery mode of the IRQ as that of
the APIC driver.
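
As a rough illustration (not part of this patch), a child irq domain could
later consume the per-IRQ field instead of the APIC-wide default. The helper
my_chip_write_msg() below is invented for the example:

#include <linux/irq.h>
#include <asm/hw_irq.h>

/* Hypothetical chip-programming helper, standing in for real driver code. */
extern void my_chip_write_msg(unsigned int dest_apicid, unsigned int vector,
			      enum apic_delivery_modes delivery_mode);

static void my_irq_compose_msg(struct irq_data *irqd)
{
	struct irq_cfg *cfg = irqd_cfg(irqd);

	/* Honour the per-IRQ delivery mode rather than apic->delivery_mode. */
	my_chip_write_msg(cfg->dest_apicid, cfg->vector, cfg->delivery_mode);
}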

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Ashok Raj 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Updated indentation of the existing members of struct irq_cfg.
 * Reworded the commit message.

Changes since v4:
 * Rebased to use new enumeration apic_delivery_modes.

Changes since v3:
 * None

Changes since v2:
 * Reduced scope to only add the interrupt delivery mode in
   struct irq_alloc_info.

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hw_irq.h | 5 +++--
 arch/x86/kernel/apic/vector.c | 9 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index d465ece58151..5ac5e6c603ee 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -88,8 +88,9 @@ struct irq_alloc_info {
 };
 
 struct irq_cfg {
-   unsigned intdest_apicid;
-   unsigned intvector;
+   unsigned intdest_apicid;
+   unsigned intvector;
+   enum apic_delivery_modesdelivery_mode;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3e6f6b448f6a..838e220e8860 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -567,6 +567,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, 
unsigned int virq,
irqd->chip_data = apicd;
irqd->hwirq = virq + i;
irqd_set_single_target(irqd);
+
/*
 * Prevent that any of these interrupts is invoked in
 * non interrupt context via e.g. generic_handle_irq()
@@ -577,6 +578,14 @@ static int x86_vector_alloc_irqs(struct irq_domain 
*domain, unsigned int virq,
/* Don't invoke affinity setter on deactivated interrupts */
irqd_set_affinity_on_activate(irqd);
 
+   /*
+* Initialize the delivery mode of this irq to match the
+* default delivery mode of the APIC. Children irq domains
+* may take the delivery mode from the individual irq
+* configuration rather than from the APIC driver.
+*/
+   apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode;
+
/*
 * Legacy vectors are already assigned when the IOAPIC
 * takes them over. They stay on the same vector. This is
-- 
2.17.1



[PATCH v6 09/29] iommu/vt-d: Set the IRTE delivery mode individually for each IRQ

2022-05-05 Thread Ricardo Neri
There are no hardware requirements to use the same delivery mode for all
interrupts. Use the mode specified in the provided IRQ hardware
configuration data. Since all IRQs are configured to use the delivery mode
of the APIC driver, the only functional changes are where IRQs are
configured to use a specific delivery mode.

Cc: Andi Kleen 
Cc: David Woodhouse 
Cc: "Ravi V. Shankar" 
Cc: Lu Baolu 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Reviewed-by: Lu Baolu 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c 
b/drivers/iommu/intel/irq_remapping.c
index 66d37186ec28..fb2d71bea98d 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1125,7 +1125,7 @@ static void prepare_irte(struct irte *irte, struct 
irq_cfg *irq_cfg)
 * irq migration in the presence of interrupt-remapping.
*/
irte->trigger_mode = 0;
-   irte->dlvry_mode = apic->delivery_mode;
+   irte->dlvry_mode = irq_cfg->delivery_mode;
irte->vector = irq_cfg->vector;
irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
 
-- 
2.17.1



[PATCH v6 06/29] x86/apic/vector: Implement support for NMI delivery mode

2022-05-05 Thread Ricardo Neri
The flag X86_IRQ_ALLOC_AS_NMI indicates to the interrupt controller that
it should configure the delivery mode of an IRQ as NMI. Implement such a
request. This causes irq_domain children in the hierarchy to configure
their irq_chips accordingly. When no specific delivery mode is requested,
continue using the delivery mode of the APIC driver in use.
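
A hedged sketch of how a caller might request such an interrupt once the
flag exists; the domain pointer and the wrapper function name are assumptions
for the example, not code from this series:

#include <linux/irqdomain.h>
#include <asm/irqdomain.h>
#include <asm/hw_irq.h>

static int my_request_nmi_irq(struct irq_domain *parent, int node)
{
	struct irq_alloc_info info;

	init_irq_alloc_info(&info, NULL);
	info.flags |= X86_IRQ_ALLOC_AS_NMI;

	/* nr_irqs must be 1; the allocator rejects anything else for NMIs. */
	return irq_domain_alloc_irqs(parent, 1, node, &info);
}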

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/vector.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 11f881f45cec..df4d7b9f6e27 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -570,6 +570,10 @@ static int x86_vector_alloc_irqs(struct irq_domain 
*domain, unsigned int virq,
if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1)
return -ENOSYS;
 
+   /* Only one IRQ per NMI */
+   if ((info->flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1)
+   return -EINVAL;
+
/*
 * Catch any attempt to touch the cascade interrupt on a PIC
 * equipped system.
@@ -610,7 +614,15 @@ static int x86_vector_alloc_irqs(struct irq_domain 
*domain, unsigned int virq,
 * default delivery mode of the APIC. Children irq domains
 * may take the delivery mode from the individual irq
 * configuration rather than from the APIC driver.
+*
+* Vectors are meaningless if the delivery mode is NMI. Since
+* nr_irqs is 1, we can return.
 */
+   if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
+   apicd->hw_irq_cfg.delivery_mode = APIC_DELIVERY_MODE_NMI;
+   return 0;
+   }
+
apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode;
 
/*
-- 
2.17.1



[PATCH v6 04/29] x86/apic: Add the X86_IRQ_ALLOC_AS_NMI irq allocation flag

2022-05-05 Thread Ricardo Neri
There are cases in which it is necessary to set the delivery mode of an
interrupt as NMI. Add a new flag that callers can specify when allocating
an IRQ.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner 
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/include/asm/irqdomain.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/irqdomain.h b/arch/x86/include/asm/irqdomain.h
index 125c23b7bad3..de1cf2e80443 100644
--- a/arch/x86/include/asm/irqdomain.h
+++ b/arch/x86/include/asm/irqdomain.h
@@ -10,6 +10,7 @@ enum {
/* Allocate contiguous CPU vectors */
X86_IRQ_ALLOC_CONTIGUOUS_VECTORS= 0x1,
X86_IRQ_ALLOC_LEGACY= 0x2,
+   X86_IRQ_ALLOC_AS_NMI= 0x4,
 };
 
 extern int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec);
-- 
2.17.1



[PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs

2022-05-05 Thread Ricardo Neri
Vectors are meaningless when allocating IRQs with NMI as the delivery mode.
In such case, skip the reservation of IRQ vectors. Do it in the lowest-
level functions where the actual IRQ reservation takes place.

Since NMIs target specific CPUs, keep the functionality to find the best
CPU.

Cc: Andi Kleen 
Cc: "Ravi V. Shankar" 
Cc: Stephane Eranian 
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck 
Signed-off-by: Ricardo Neri 
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/vector.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 838e220e8860..11f881f45cec 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -245,11 +245,20 @@ assign_vector_locked(struct irq_data *irqd, const struct 
cpumask *dest)
	if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist))
return -EBUSY;
 
+   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
+   cpu = irq_matrix_find_best_cpu(vector_matrix, dest);
+   apicd->cpu = cpu;
+   vector = 0;
+   goto no_vector;
+   }
+
	vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu);
trace_vector_alloc(irqd->irq, vector, resvd, vector);
if (vector < 0)
return vector;
apic_update_vector(irqd, vector, cpu);
+
+no_vector:
apic_update_irq_cfg(irqd, vector, cpu);
 
return 0;
@@ -321,12 +330,22 @@ assign_managed_vector(struct irq_data *irqd, const struct 
cpumask *dest)
/* set_affinity might call here for nothing */
if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
return 0;
+
+   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) {
+   cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest);
+   apicd->cpu = cpu;
+   vector = 0;
+   goto no_vector;
+   }
+
vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
  &cpu);
trace_vector_alloc_managed(irqd->irq, vector, vector);
if (vector < 0)
return vector;
apic_update_vector(irqd, vector, cpu);
+
+no_vector:
apic_update_irq_cfg(irqd, vector, cpu);
return 0;
 }
@@ -376,6 +395,10 @@ static void x86_vector_deactivate(struct irq_domain *dom, 
struct irq_data *irqd)
if (apicd->has_reserved)
return;
 
+   /* NMI IRQs do not have associated vectors; nothing to do. */
+   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI)
+   return;
+
	raw_spin_lock_irqsave(&vector_lock, flags);
clear_irq_vector(irqd);
if (apicd->can_reserve)
@@ -472,6 +495,10 @@ static void vector_free_reserved_and_managed(struct 
irq_data *irqd)
trace_vector_teardown(irqd->irq, apicd->is_managed,
  apicd->has_reserved);
 
+   /* NMI IRQs do not have associated vectors; nothing to do. */
+   if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI)
+   return;
+
if (apicd->has_reserved)
irq_matrix_remove_reserved(vector_matrix);
if (apicd->is_managed)
-- 
2.17.1



RE: [PATCH] iommu: Reorganize __iommu_attach_group()

2022-05-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, May 6, 2022 12:16 AM
> 
> Call iommu_group_do_attach_device() only from
> __iommu_group_attach_domain() which should be used to attach any
> domain to
> the group.
> 
> The only unique thing __iommu_attach_group() does is to check if the group
> is already attached to some caller specified group. Put this test into
> __iommu_group_is_core_domain(), matching the
> __iommu_group_attach_core_domain() nomenclature.
> 
> Change the two callers to directly call __iommu_group_attach_domain() and
> test __iommu_group_is_core_domain().
> 
> iommu_attach_device() should trigger a WARN_ON if the group is attached
> as
> the caller is using the function wrong.
> 
> Suggested-by: "Tian, Kevin" 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 

> ---
>  drivers/iommu/iommu.c | 42 +++---
>  1 file changed, 19 insertions(+), 23 deletions(-)
> 
> This goes after "iommu: iommu_group_claim_dma_owner() must always
> assign a
> domain" and simplifies some of the remaining duplication.
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index c1bdec807ea381..09d00be887f924 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -81,9 +81,10 @@ static struct iommu_domain
> *__iommu_domain_alloc(struct bus_type *bus,
>unsigned type);
>  static int __iommu_attach_device(struct iommu_domain *domain,
>struct device *dev);
> -static int __iommu_attach_group(struct iommu_domain *domain,
> - struct iommu_group *group);
> +static int __iommu_group_attach_domain(struct iommu_group *group,
> +struct iommu_domain *new_domain);
>  static void __iommu_group_attach_core_domain(struct iommu_group
> *group);
> +static bool __iommu_group_is_core_domain(struct iommu_group *group);
>  static int iommu_create_device_direct_mappings(struct iommu_group
> *group,
>  struct device *dev);
>  static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
> @@ -1938,10 +1939,11 @@ int iommu_attach_device(struct iommu_domain
> *domain, struct device *dev)
>*/
>   mutex_lock(>mutex);
>   ret = -EINVAL;
> - if (iommu_group_device_count(group) != 1)
> + if (iommu_group_device_count(group) != 1 ||
> + WARN_ON(!__iommu_group_is_core_domain(group)))
>   goto out_unlock;
> 
> - ret = __iommu_attach_group(domain, group);
> + ret = __iommu_group_attach_domain(group, domain);
> 
>  out_unlock:
>   mutex_unlock(>mutex);
> @@ -2032,31 +2034,19 @@ static int iommu_group_do_attach_device(struct
> device *dev, void *data)
>   return __iommu_attach_device(domain, dev);
>  }
> 
> -static int __iommu_attach_group(struct iommu_domain *domain,
> - struct iommu_group *group)
> -{
> - int ret;
> -
> - if (group->domain && group->domain != group->default_domain &&
> - group->domain != group->blocking_domain)
> - return -EBUSY;
> -
> - ret = __iommu_group_for_each_dev(group, domain,
> -  iommu_group_do_attach_device);
> - if (ret == 0)
> - group->domain = domain;
> -
> - return ret;
> -}
> -
>  int iommu_attach_group(struct iommu_domain *domain, struct
> iommu_group *group)
>  {
>   int ret;
> 
>   mutex_lock(>mutex);
> - ret = __iommu_attach_group(domain, group);
> - mutex_unlock(>mutex);
> + if (!__iommu_group_is_core_domain(group)) {
> + ret = -EBUSY;
> + goto out_unlock;
> + }
> 
> + ret = __iommu_group_attach_domain(group, domain);
> +out_unlock:
> + mutex_unlock(>mutex);
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(iommu_attach_group);
> @@ -2110,6 +2100,12 @@ static int __iommu_group_attach_domain(struct
> iommu_group *group,
>   return 0;
>  }
> 
> +static bool __iommu_group_is_core_domain(struct iommu_group *group)
> +{
> + return !group->domain || group->domain == group-
> >default_domain ||
> +group->domain == group->blocking_domain;
> +}
> +
>  /*
>   * Put the group's domain back to the appropriate core-owned domain -
> either the
>   * standard kernel-mode DMA configuration or an all-DMA-blocked domain.
> 
> base-commit: f9169ee95787fe691287eeed1a1c269ea72c8fb4
> --
> 2.36.0



RE: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain

2022-05-05 Thread Tian, Kevin
> From: Jason Gunthorpe
> Sent: Thursday, May 5, 2022 11:33 PM
> > >   /*
> > > -  * If the group has been claimed already, do not re-attach the default
> > > -  * domain.
> > > +  * New drivers should support default domains and so the
> > > detach_dev() op
> > > +  * will never be called. Otherwise the NULL domain indicates the
> > > +  * translation for the group should be set so it will work with the
> >
> > translation should be 'blocked'?
> 
> No, not blocked.
> 
> > > +  * platform DMA ops.
> >
> > I didn't get the meaning of the last sentence.
> 
> It is as discussed with Robin, NULL means to return the group back to
> some platform defined translation, and in some cases this means
> returning back to work with the platform's DMA ops - presumably if it
> isn't using the dma api.

This is clearer than the original text.

> 
> > > + /*
> > > +  * Changing the domain is done by calling attach on the new domain.
> > > +  * Drivers should implement this so that DMA is always translated by
> >
> > what does 'this' mean?
> 
> The code below - attach_dev called in a loop for each dev in the group.

Yes.

> 
> > > +  * either the new, old, or a blocking domain. DMA should never
> >
> > isn't the blocking domain passed in as the new domain?
> 
> Sometimes. This is a statement about the required
> atomicity. New/old/block are all valid translations during the
> operation. Identity is not.

but new/old/block are not the same type of classifications. A group
has an old domain and a new domain at this transition point, and
both old/new domains have a domain type (dma, unmanaged, block,
identity, etc.). Mixing them together only increases confusion here.

> 
> So, I'm going to leave this patch as-is since it has been tested now
> and we need to get the series back into iommu-next ASAP.
> 

Agree. Just hope some improvements on the code comment.

But either way given no more code change:

Reviewed-by: Kevin Tian 


[PATCH] iommu/vt-d: Increase DMAR_UNITS_SUPPORTED

2022-05-05 Thread Steve Wahl
Increase DMAR_UNITS_SUPPORTED to support 64 sockets with 10 DMAR units
each, for a total of 640.

If the available hardware exceeds DMAR_UNITS_SUPPORTED (previously set
to MAX_IO_APICS, or 128), it causes these messages: "DMAR: Failed to
allocate seq_id", "DMAR: Parse DMAR table failure.", and "x2apic: IRQ
remapping doesn't support X2APIC mode x2apic disabled"; and the system
fails to boot.

Signed-off-by: Steve Wahl 
Reviewed-by: Mike Travis 
---

Note that we could not find a reason for connecting
DMAR_UNITS_SUPPORTED to MAX_IO_APICS as was done previously.  Perhaps
it seemed like the two would continue to match on earlier processors.
There doesn't appear to be kernel code that assumes that the value of
one is related to the other.

 include/linux/dmar.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/dmar.h b/include/linux/dmar.h
index 45e903d84733..9d4867b8f42e 100644
--- a/include/linux/dmar.h
+++ b/include/linux/dmar.h
@@ -19,7 +19,7 @@
 struct acpi_dmar_header;
 
 #ifdef CONFIG_X86
-# define   DMAR_UNITS_SUPPORTEDMAX_IO_APICS
+# define   DMAR_UNITS_SUPPORTED640
 #else
 # define   DMAR_UNITS_SUPPORTED64
 #endif
-- 
2.26.2



Re: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain

2022-05-05 Thread Robin Murphy

On 2022-05-05 20:27, Jason Gunthorpe wrote:

On Thu, May 05, 2022 at 07:56:59PM +0100, Robin Murphy wrote:


Ack to that, there are certainly further improvements to consider once we've
got known-working code into a released kernel, but let's not get ahead of
ourselves just now.


Yes please
  

(I'm pretty sure we could get away with a single blocking domain per IOMMU
instance, rather than per group, but I deliberately saved that one for later
- right now one per group to match default domains is simpler to reason
about and less churny in the context of the current ownership patches)


I noticed this too..

But I thought the driver can do a better job of this. There is no
reason it has to allocate memory to return a IOMMU_DOMAIN_BLOCKED
domain - this struct could be a globally allocated singleton for the
entire driver and that would be even better memory savings.


I was thinking about the domain pointers themselves as well, but on 
reflection, little systems are most likely to have the simpler IOMMUs 
with one fixed group per instance anyway, while the kind of systems 
which end up with a couple of hundred groups can probably bear the cost 
of a couple of hundred extra pointers. So yeah, probably not worth the 
code cost of chasing down a group->devices->dev->iommu->blocking_domain 
in practice.



For instance, here is a sketch for the Intel driver based on Baolu's
remark that intel_iommu_detach_device() establishes a blocking
behavior already on detach_dev (Baolu if you like it can you make a
proper patch?):

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index df5c62ecf942b8..317945f6134d42 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4324,6 +4324,8 @@ static struct iommu_domain 
*intel_iommu_domain_alloc(unsigned type)
return domain;
case IOMMU_DOMAIN_IDENTITY:
return &si_domain->domain;
+   case IOMMU_DOMAIN_BLOCKED:
+   return &intel_blocking_domain;
default:
return NULL;
}
@@ -4404,12 +4406,25 @@ static int intel_iommu_attach_device(struct 
iommu_domain *domain,
return domain_add_dev_info(to_dmar_domain(domain), dev);
  }
  
-static void intel_iommu_detach_device(struct iommu_domain *domain,
-                                      struct device *dev)
+static int intel_iommu_block_dev(struct iommu_domain *domain,
+                                struct device *dev)
  {
+   if (!device_to_iommu(dev, NULL, NULL))
+   return -EINVAL;
+
dmar_remove_one_dev_info(dev);
+   return 0;
  }
  
+static struct iommu_domain intel_blocking_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   .ops = &(const struct iommu_domain_ops){
+   .attach_dev = intel_iommu_block_dev,
+   // Fix core code so this works
+   .free = NULL,


TBH unless and until it becomes commonplace, I'd just have drivers 
provide a no-op .free callback for least surprise, if their .alloc does 
something funky. But either way, with domain ops now we can so easily do 
whatever we need, yay!


Cheers,
Robin.


+   },
+};
+
  static int intel_iommu_map(struct iommu_domain *domain,
   unsigned long iova, phys_addr_t hpa,
   size_t size, int iommu_prot, gfp_t gfp)
@@ -4890,7 +4905,6 @@ const struct iommu_ops intel_iommu_ops = {
  #endif
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = intel_iommu_attach_device,
-   .detach_dev = intel_iommu_detach_device,
.map_pages  = intel_iommu_map_pages,
.unmap_pages= intel_iommu_unmap_pages,
.iotlb_sync_map = intel_iommu_iotlb_sync_map,

Thanks,
Jason



Re: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain

2022-05-05 Thread Jason Gunthorpe via iommu
On Thu, May 05, 2022 at 07:56:59PM +0100, Robin Murphy wrote:

> Ack to that, there are certainly further improvements to consider once we've
> got known-working code into a released kernel, but let's not get ahead of
> ourselves just now.

Yes please
 
> (I'm pretty sure we could get away with a single blocking domain per IOMMU
> instance, rather than per group, but I deliberately saved that one for later
> - right now one per group to match default domains is simpler to reason
> about and less churny in the context of the current ownership patches)

I noticed this too.. 

But I thought the driver can do a better job of this. There is no
reason it has to allocate memory to return a IOMMU_DOMAIN_BLOCKED
domain - this struct could be a globally allocated singleton for the
entire driver and that would be even better memory savings.

For instance, here is a sketch for the Intel driver based on Baolu's
remark that intel_iommu_detach_device() establishes a blocking
behavior already on detach_dev (Baolu if you like it can you make a
proper patch?):

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index df5c62ecf942b8..317945f6134d42 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4324,6 +4324,8 @@ static struct iommu_domain 
*intel_iommu_domain_alloc(unsigned type)
return domain;
case IOMMU_DOMAIN_IDENTITY:
	return &si_domain->domain;
+   case IOMMU_DOMAIN_BLOCKED:
+   return &intel_blocking_domain;
default:
return NULL;
}
@@ -4404,12 +4406,25 @@ static int intel_iommu_attach_device(struct 
iommu_domain *domain,
return domain_add_dev_info(to_dmar_domain(domain), dev);
 }
 
-static void intel_iommu_detach_device(struct iommu_domain *domain,
- struct device *dev)
+static int intel_iommu_block_dev(struct iommu_domain *domain,
+struct device *dev)
 {
+   if (!device_to_iommu(dev, NULL, NULL))
+   return -EINVAL;
+
dmar_remove_one_dev_info(dev);
+   return 0;
 }
 
+static struct iommu_domain intel_blocking_domain = {
+   .type = IOMMU_DOMAIN_BLOCKED,
+   .ops = &(const struct iommu_domain_ops){
+   .attach_dev = intel_iommu_block_dev,
+   // Fix core code so this works
+   .free = NULL,
+   },
+};
+
 static int intel_iommu_map(struct iommu_domain *domain,
   unsigned long iova, phys_addr_t hpa,
   size_t size, int iommu_prot, gfp_t gfp)
@@ -4890,7 +4905,6 @@ const struct iommu_ops intel_iommu_ops = {
 #endif
.default_domain_ops = &(const struct iommu_domain_ops) {
.attach_dev = intel_iommu_attach_device,
-   .detach_dev = intel_iommu_detach_device,
.map_pages  = intel_iommu_map_pages,
.unmap_pages= intel_iommu_unmap_pages,
.iotlb_sync_map = intel_iommu_iotlb_sync_map,

Thanks,
Jason


Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility

2022-05-05 Thread Jason Gunthorpe via iommu
On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:

> > It is a bit more CPU work since maps in the lower range would have to
> > be copied over, but conceptually the model matches the HW nesting.
> 
> Ah.. ok.  IIUC what you're saying is that the kernel-side IOASes have
> fixed windows, but we fake dynamic windows in the userspace
> implementation by flipping the devices over to a new IOAS with the new
> windows.  Is that right?

Yes

> Where exactly would the windows be specified?  My understanding was
> that when creating a back-end specific IOAS, that would typically be
> for the case where you're using a user / guest managed IO pagetable,
> with the backend specifying the format for that.  In the ppc case we'd
> need to specify the windows, but we'd still need the IOAS_MAP/UNMAP
> operations to manage the mappings.  The PAPR vIOMMU is
> paravirtualized, so all updates come via hypercalls, so there's no
> user/guest managed data structure.

When the iommu_domain is created I want to have an
iommu-driver-specific struct, so PPC can customize its iommu_domain
however it likes.
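
For illustration only, such a driver-specific domain for PPC might be shaped
roughly like this; every name and field below is invented, not an existing
interface:

#include <linux/container_of.h>
#include <linux/iommu.h>

struct spapr_tce_window {
	unsigned long	start_iova;
	unsigned long	size;
	unsigned int	page_shift;
};

struct spapr_iommu_domain {
	struct iommu_domain	domain;		/* generic core object */
	struct spapr_tce_window	windows[2];	/* e.g. a 32-bit and a 64-bit window */
	unsigned int		nr_windows;
};

#define to_spapr_domain(dom) \
	container_of(dom, struct spapr_iommu_domain, domain)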

> That should work from the point of view of the userspace and guest
> side interfaces.  It might be fiddly from the point of view of the
> back end.  The ppc iommu doesn't really have the notion of
> configurable domains - instead the address spaces are the hardware or
> firmware fixed PEs, so they have a fixed set of devices.  At the bare
> metal level it's possible to sort of do domains by making the actual
> pagetable pointers for several PEs point to a common place.

I'm not sure I understand this - a domain is just a storage container
for an IO page table, if the HW has IOPTEs then it should be able to
have a domain?

Making page table pointers point to a common IOPTE tree is exactly
what iommu_domains are for - why is that "sort of" for ppc?

> However, in the future, nested KVM under PowerVM is likely to be the
> norm.  In that situation the L1 as well as the L2 only has the
> paravirtualized interfaces, which don't have any notion of domains,
> only PEs.  All updates take place via hypercalls which explicitly
> specify a PE (strictly speaking they take a "Logical IO Bus Number"
> (LIOBN), but those generally map one to one with PEs), so it can't use
> shared pointer tricks either.

How do the paravirtualized interfaces deal with the page table? Do
they call a map/unmap hypercall instead of providing guest IOPTEs?

Assuming yes, I'd expect that:

The iommu_domain for nested PPC is just a log of map/unmap hypervisor
calls to make. Whenever a new PE is attached to that domain it gets
the logged maps replayed to set it up, and when a PE is detached the
log is used to unmap everything.
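
A toy model of that log-and-replay idea, written as plain standalone C
rather than kernel code just to make the replay step concrete (all names are
invented):

#include <stdio.h>
#include <stdlib.h>

struct map_entry {
	unsigned long iova, paddr, size;
	struct map_entry *next;
};

struct ppc_domain_log {
	struct map_entry *head;		/* every currently live mapping */
};

/* Record a mapping; a real driver would also issue the hypercall here. */
static int domain_map(struct ppc_domain_log *d, unsigned long iova,
		      unsigned long paddr, unsigned long size)
{
	struct map_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	*e = (struct map_entry){ iova, paddr, size, d->head };
	d->head = e;
	return 0;
}

/* Replay the whole log when a new PE (iommu_group) is attached. */
static void domain_replay_for_new_pe(const struct ppc_domain_log *d, int pe)
{
	for (const struct map_entry *e = d->head; e; e = e->next)
		printf("PE %d: map iova %#lx -> %#lx (%#lx bytes)\n",
		       pe, e->iova, e->paddr, e->size);
}

int main(void)
{
	struct ppc_domain_log d = { .head = NULL };

	domain_map(&d, 0x1000, 0x80001000, 0x1000);
	domain_map(&d, 0x2000, 0x80002000, 0x1000);
	domain_replay_for_new_pe(&d, 7);
	return 0;
}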

It is not perfectly memory efficient - and we could perhaps talk about
an API modification to allow re-use of the iommufd data structure
somehow, but I think this is a good logical starting point.

The PE would have to be modeled as an iommu_group.

> So, here's an alternative set of interfaces that should work for ppc,
> maybe you can tell me whether they also work for x86 and others:

Fundamentally PPC has to fit into the iommu standard framework of
group and domains, we can talk about modifications, but drifting too
far away is a big problem.

>   * Each domain/IOAS has a concept of one or more IOVA windows, which
> each have a base address, size, pagesize (granularity) and optionally
> other flags/attributes.
>   * This has some bearing on hardware capabilities, but is
> primarily a software notion

iommu_domain has the aperture, PPC will require extending this to a
list of apertures since it is currently only one window.

Once a domain is created and attached to a group the aperture should
be immutable.

>   * MAP/UNMAP operations are only permitted within an existing IOVA
> window (and with addresses aligned to the window's pagesize)
>   * This is enforced by software whether or not it is required by
> the underlying hardware
>   * Likewise IOAS_COPY operations are only permitted if the source and
> destination windows have compatible attributes

Already done, domain's aperture restricts all the iommufd operations

>   * A newly created kernel-managed IOAS has *no* IOVA windows

Already done, the iommufd IOAS has no iommu_domains inside it at
creation time.

>   * A CREATE_WINDOW operation is added
>   * This takes a size, pagesize/granularity, optional base address
> and optional additional attributes 
>   * If any of the specified attributes are incompatible with the
> underlying hardware, the operation fails

iommu layer has nothing called a window. The closest thing is a
domain.

I really don't want to try to make a new iommu layer object that is so
unique and special to PPC - we have to figure out how to fit PPC into
the iommu_domain model with reasonable extensions.
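
One possible shape for that extension, purely as a sketch; nothing below
exists in the iommu core today:

#include <linux/iommu.h>
#include <linux/list.h>
#include <linux/types.h>

/* A single IOVA window/aperture the domain can translate. */
struct iommu_aperture {
	dma_addr_t		start;
	dma_addr_t		end;
	unsigned long		pgsize_bitmap;
	struct list_head	node;
};

/*
 * A domain carrying a list of apertures instead of the single geometry it
 * has today; the list would become immutable once a group is attached.
 */
struct iommu_multi_aperture_domain {
	struct iommu_domain	domain;
	struct list_head	apertures;	/* of struct iommu_aperture */
};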

> > > > Maybe every device gets a copy of the error notification?
> > > 
> > > Alas, it's 

Re: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain

2022-05-05 Thread Robin Murphy

On 2022-05-05 16:33, Jason Gunthorpe wrote:

On Thu, May 05, 2022 at 10:56:28AM +, Tian, Kevin wrote:

From: Jason Gunthorpe 
Sent: Thursday, May 5, 2022 3:09 AM

Once the group enters 'owned' mode it can never be assigned back to the
default_domain or to a NULL domain. It must always be actively assigned to


worth pointing out that a NULL domain is not always translated to DMA
blocked on all platforms. That was a wrong assumption before this patch.


This is described in a comment blow


@@ -2040,7 +2037,8 @@ static int __iommu_attach_group(struct
iommu_domain *domain,
  {
int ret;

-   if (group->domain && group->domain != group->default_domain)
+   if (group->domain && group->domain != group->default_domain &&
+   group->domain != group->blocking_domain)
return -EBUSY;

ret = __iommu_group_for_each_dev(group, domain,


Suppose this can be also replaced by __iommu_group_attach_domain()?


It can, but lets do this as a follow patching since it is a bit big


@@ -2072,38 +2070,68 @@ static int
iommu_group_do_detach_device(struct device *dev, void *data)
return 0;
  }

-static void __iommu_detach_group(struct iommu_domain *domain,
-struct iommu_group *group)
+static int __iommu_group_attach_domain(struct iommu_group *group,
+  struct iommu_domain *new_domain)
  {
int ret;

+   if (group->domain == new_domain)
+   return 0;
+
/*
-* If the group has been claimed already, do not re-attach the default
-* domain.
+* New drivers should support default domains and so the
detach_dev() op
+* will never be called. Otherwise the NULL domain indicates the
+* translation for the group should be set so it will work with the


translation should be 'blocked'?


No, not blocked.


+* platform DMA ops.


I didn't get the meaning of the last sentence.


It is as discussed with Robin, NULL means to return the group back to
some platform defined translation, and in some cases this means
returning back to work with the platform's DMA ops - presumably if it
isn't using the dma api.


+   /*
+* Changing the domain is done by calling attach on the new domain.
+* Drivers should implement this so that DMA is always translated by


what does 'this' mean?


The code below - attach_dev called in a loop for each dev in the group.


+* either the new, old, or a blocking domain. DMA should never


isn't the blocking domain passed in as the new domain?


Sometimes. This is a statement about the required
atomicity. New/old/block are all valid translations during the
operation. Identity is not.
  

+* untranslated.
+*
+* Note that this is called in error unwind paths, attaching to a
+* domain that has already been attached cannot fail.


this is called in the normal path. Where does error unwind happen?


Any place that checks the return and does WARN_ON

Also the loop over each dev doesn't error unwind so it assumes that if
the first dev succeeds then all subsequent devs will succeed.

  /**
   * iommu_group_claim_dma_owner() - Set DMA ownership of a group
   * @group: The group.
@@ -3111,9 +3162,15 @@ int iommu_group_claim_dma_owner(struct
iommu_group *group, void *owner)
goto unlock_out;
}

+   ret = __iommu_group_alloc_blocking_domain(group);
+   if (ret)
+   goto unlock_out;
+
+   ret = __iommu_group_attach_domain(group,
+ group->blocking_domain);
+   if (ret)
+   goto unlock_out;
group->owner = owner;


Here can use __iommu_group_attach_core_domain() if calling it after
setting group->owner.


That is obfuscating. This function must always and only attach the
blocking_domain.

So, I'm going to leave this patch as-is since it has been tested now
and we need to get the series back into iommu-next ASAP.


Ack to that, there are certainly further improvements to consider once 
we've got known-working code into a released kernel, but let's not get 
ahead of ourselves just now.


(I'm pretty sure we could get away with a single blocking domain per 
IOMMU instance, rather than per group, but I deliberately saved that one 
for later - right now one per group to match default domains is simpler 
to reason about and less churny in the context of the current ownership 
patches)


Cheers,
Robin.


Re: [PATCH v7 5/7] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver

2022-05-05 Thread liuqi (BA) via iommu



Hi Leo,

Thanks for your review, some replies below.

On 2022/4/30 15:35, Leo Yan wrote:

On Thu, Apr 07, 2022 at 08:58:39PM +0800, Yicong Yang via iommu wrote:

From: Qi Liu 

'perf record' and 'perf report --dump-raw-trace' supported in this
patch.

Example usage:

Output will contain raw PTT data and its textual representation, such
as:

0 0 0x5810 [0x30]: PERF_RECORD_AUXTRACE size: 0x400000  offset: 0
ref: 0xa5d50c725  idx: 0  tid: -1  cpu: 0
.
. ... HISI PTT data: size 4194304 bytes
.  0000: 00 00 00 00 Prefix
.  0004: 08 20 00 60 Header DW0
.  0008: ff 02 00 01 Header DW1
.  000c: 20 08 00 00 Header DW2
.  0010: 10 e7 44 ab Header DW3
.  0014: 2a a8 1e 01 Time
.  0020: 00 00 00 00 Prefix
.  0024: 01 00 00 60 Header DW0
.  0028: 0f 1e 00 01 Header DW1
.  002c: 04 00 00 00 Header DW2
.  0030: 40 00 81 02 Header DW3
.  0034: ee 02 00 00 Time


Signed-off-by: Qi Liu 
Signed-off-by: Yicong Yang 
---
  tools/perf/arch/arm/util/auxtrace.c   |  76 +-
  tools/perf/arch/arm/util/pmu.c|   3 +
  tools/perf/arch/arm64/util/Build  |   2 +-
  tools/perf/arch/arm64/util/hisi_ptt.c | 195 
  tools/perf/util/Build |   2 +
  tools/perf/util/auxtrace.c|   4 +
  tools/perf/util/auxtrace.h|   1 +
  tools/perf/util/hisi-ptt-decoder/Build|   1 +
  .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.c   | 170 ++
  .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.h   |  28 +++
  tools/perf/util/hisi_ptt.c| 218 ++
  tools/perf/util/hisi_ptt.h|  28 +++


It's good to divide the big patch into smaller patches, e.g. one patch
is to add PTT auxtrace (so mainly for perf record), and the second
patch is to add PTT decoder for perf decoding.


got it, I'll do this, thanks.


  12 files changed, 724 insertions(+), 4 deletions(-)
  create mode 100644 tools/perf/arch/arm64/util/hisi_ptt.c
  create mode 100644 tools/perf/util/hisi-ptt-decoder/Build
  create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.c
  create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.h
  create mode 100644 tools/perf/util/hisi_ptt.c
  create mode 100644 tools/perf/util/hisi_ptt.h

diff --git a/tools/perf/arch/arm/util/auxtrace.c 
b/tools/perf/arch/arm/util/auxtrace.c
index 5fc6a2a3dbc5..393f5757c039 100644
--- a/tools/perf/arch/arm/util/auxtrace.c
+++ b/tools/perf/arch/arm/util/auxtrace.c
@@ -4,9 +4,11 @@
   * Author: Mathieu Poirier 
   */

[...]

+
+   rewinddir(dir);
+   while ((dent = readdir(dir))) {
+   if (strstr(dent->d_name, HISI_PTT_PMU_NAME) && idx < 
(*nr_ptts)) {
+   hisi_ptt_pmus[idx] = perf_pmu__find(dent->d_name);
+   if (hisi_ptt_pmus[idx]) {
+   pr_debug2("%s %d: hisi_ptt_pmu %d type %d name 
%s\n",
+   __func__, __LINE__, idx,
+   hisi_ptt_pmus[idx]->type,
+   hisi_ptt_pmus[idx]->name);
+   idx++;


Indentation for "idx++" is not right.

will fix this, thanks.




+   }
+


Redundant new line.

will fix this, thanks.



+   }
+   }
+
+out:
+   closedir(dir);
+   return hisi_ptt_pmus;
+}
+
  struct auxtrace_record
  *auxtrace_record__init(struct evlist *evlist, int *err)
  {
@@ -57,8 +112,12 @@ struct auxtrace_record
struct evsel *evsel;
bool found_etm = false;
struct perf_pmu *found_spe = NULL;
+   struct perf_pmu *found_ptt = NULL;
struct perf_pmu **arm_spe_pmus = NULL;
+   struct perf_pmu **hisi_ptt_pmus = NULL;
+
int nr_spes = 0;
+   int nr_ptts = 0;
int i = 0;
  
  	if (!evlist)

@@ -66,13 +125,14 @@ struct auxtrace_record
  
  	cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME);

arm_spe_pmus = find_all_arm_spe_pmus(_spes, err);
+   hisi_ptt_pmus = find_all_hisi_ptt_pmus(_ptts, err);
  
  	evlist__for_each_entry(evlist, evsel) {

if (cs_etm_pmu &&
evsel->core.attr.type == cs_etm_pmu->type)
found_etm = true;
  
-		if (!nr_spes || found_spe)

+   if ((!nr_spes || found_spe) && (!nr_ptts || found_ptt))
continue;
  
  		for (i = 0; i < nr_spes; i++) {

@@ -81,11 +141,18 @@ struct auxtrace_record
break;
 

[PATCH] iommu: Reorganize __iommu_attach_group()

2022-05-05 Thread Jason Gunthorpe via iommu
Call iommu_group_do_attach_device() only from
__iommu_group_attach_domain() which should be used to attach any domain to
the group.

The only unique thing __iommu_attach_group() does is to check if the group
is already attached to some caller-specified domain. Put this test into
__iommu_group_is_core_domain(), matching the
__iommu_group_attach_core_domain() nomenclature.

Change the two callers to directly call __iommu_group_attach_domain() and
test __iommu_group_is_core_domain().

iommu_attach_device() should trigger a WARN_ON if the group is already
attached, as the caller is using the function incorrectly.

Suggested-by: "Tian, Kevin" 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 42 +++---
 1 file changed, 19 insertions(+), 23 deletions(-)

This goes after "iommu: iommu_group_claim_dma_owner() must always assign a
domain" and simplifies some of the remaining duplication.

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c1bdec807ea381..09d00be887f924 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -81,9 +81,10 @@ static struct iommu_domain *__iommu_domain_alloc(struct 
bus_type *bus,
 unsigned type);
 static int __iommu_attach_device(struct iommu_domain *domain,
 struct device *dev);
-static int __iommu_attach_group(struct iommu_domain *domain,
-   struct iommu_group *group);
+static int __iommu_group_attach_domain(struct iommu_group *group,
+  struct iommu_domain *new_domain);
 static void __iommu_group_attach_core_domain(struct iommu_group *group);
+static bool __iommu_group_is_core_domain(struct iommu_group *group);
 static int iommu_create_device_direct_mappings(struct iommu_group *group,
   struct device *dev);
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
@@ -1938,10 +1939,11 @@ int iommu_attach_device(struct iommu_domain *domain, 
struct device *dev)
 */
	mutex_lock(&group->mutex);
ret = -EINVAL;
-   if (iommu_group_device_count(group) != 1)
+   if (iommu_group_device_count(group) != 1 ||
+   WARN_ON(!__iommu_group_is_core_domain(group)))
goto out_unlock;
 
-   ret = __iommu_attach_group(domain, group);
+   ret = __iommu_group_attach_domain(group, domain);
 
 out_unlock:
	mutex_unlock(&group->mutex);
@@ -2032,31 +2034,19 @@ static int iommu_group_do_attach_device(struct device 
*dev, void *data)
return __iommu_attach_device(domain, dev);
 }
 
-static int __iommu_attach_group(struct iommu_domain *domain,
-   struct iommu_group *group)
-{
-   int ret;
-
-   if (group->domain && group->domain != group->default_domain &&
-   group->domain != group->blocking_domain)
-   return -EBUSY;
-
-   ret = __iommu_group_for_each_dev(group, domain,
-iommu_group_do_attach_device);
-   if (ret == 0)
-   group->domain = domain;
-
-   return ret;
-}
-
 int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
 {
int ret;
 
	mutex_lock(&group->mutex);
-   ret = __iommu_attach_group(domain, group);
-   mutex_unlock(&group->mutex);
+   if (!__iommu_group_is_core_domain(group)) {
+   ret = -EBUSY;
+   goto out_unlock;
+   }
 
+   ret = __iommu_group_attach_domain(group, domain);
+out_unlock:
+   mutex_unlock(&group->mutex);
return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_attach_group);
@@ -2110,6 +2100,12 @@ static int __iommu_group_attach_domain(struct 
iommu_group *group,
return 0;
 }
 
+static bool __iommu_group_is_core_domain(struct iommu_group *group)
+{
+   return !group->domain || group->domain == group->default_domain ||
+  group->domain == group->blocking_domain;
+}
+
 /*
  * Put the group's domain back to the appropriate core-owned domain - either 
the
  * standard kernel-mode DMA configuration or an all-DMA-blocked domain.

base-commit: f9169ee95787fe691287eeed1a1c269ea72c8fb4
-- 
2.36.0



Re: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain

2022-05-05 Thread Jason Gunthorpe via iommu
On Thu, May 05, 2022 at 10:56:28AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Thursday, May 5, 2022 3:09 AM
> > 
> > Once the group enters 'owned' mode it can never be assigned back to the
> > default_domain or to a NULL domain. It must always be actively assigned to
> 
> worth pointing out that a NULL domain is not always translated to DMA
> blocked on all platforms. That was a wrong assumption before this patch.

This is described in a comment blow

> > @@ -2040,7 +2037,8 @@ static int __iommu_attach_group(struct
> > iommu_domain *domain,
> >  {
> > int ret;
> > 
> > -   if (group->domain && group->domain != group->default_domain)
> > +   if (group->domain && group->domain != group->default_domain &&
> > +   group->domain != group->blocking_domain)
> > return -EBUSY;
> > 
> > ret = __iommu_group_for_each_dev(group, domain,
> 
> Suppose this can be also replaced by __iommu_group_attach_domain()? 

It can, but lets do this as a follow patching since it is a bit big

> > @@ -2072,38 +2070,68 @@ static int
> > iommu_group_do_detach_device(struct device *dev, void *data)
> > return 0;
> >  }
> > 
> > -static void __iommu_detach_group(struct iommu_domain *domain,
> > -struct iommu_group *group)
> > +static int __iommu_group_attach_domain(struct iommu_group *group,
> > +  struct iommu_domain *new_domain)
> >  {
> > int ret;
> > 
> > +   if (group->domain == new_domain)
> > +   return 0;
> > +
> > /*
> > -* If the group has been claimed already, do not re-attach the default
> > -* domain.
> > +* New drivers should support default domains and so the
> > detach_dev() op
> > +* will never be called. Otherwise the NULL domain indicates the
> > +* translation for the group should be set so it will work with the
> 
> translation should be 'blocked'?

No, not blocked.

> > +* platform DMA ops.
> 
> I didn't get the meaning of the last sentence.

It is as discussed with Robin, NULL means to return the group back to
some platform defined translation, and in some cases this means
returning back to work with the platform's DMA ops - presumably if it
isn't using the dma api.

> > +   /*
> > +* Changing the domain is done by calling attach on the new domain.
> > +* Drivers should implement this so that DMA is always translated by
>
> what does 'this' mean?

The code below - attach_dev called in a loop for each dev in the group.

> > +* either the new, old, or a blocking domain. DMA should never
> 
> isn't the blocking domain passed in as the new domain?

Sometimes. This is a statement about the required
atomicity. New/old/block are all valid translations during the
operation. Identity is not.
 
> > +* untranslated.
> > +*
> > +* Note that this is called in error unwind paths, attaching to a
> > +* domain that has already been attached cannot fail.
> 
> this is called in the normal path. Where does error unwind happen?

Any place that checks the return and does WARN_ON

Also the loop over each dev doesn't error unwind so it assumes that if
the first dev succeeds then all subsequent devs will succeed.
> >  /**
> >   * iommu_group_claim_dma_owner() - Set DMA ownership of a group
> >   * @group: The group.
> > @@ -3111,9 +3162,15 @@ int iommu_group_claim_dma_owner(struct
> > iommu_group *group, void *owner)
> > goto unlock_out;
> > }
> > 
> > +   ret = __iommu_group_alloc_blocking_domain(group);
> > +   if (ret)
> > +   goto unlock_out;
> > +
> > +   ret = __iommu_group_attach_domain(group,
> > + group->blocking_domain);
> > +   if (ret)
> > +   goto unlock_out;
> > group->owner = owner;
> 
> Here can use __iommu_group_attach_core_domain() if calling it after
> setting group->owner.

That is obfuscating. This function must always and only attach the
blocking_domain.

So, I'm going to leave this patch as-is since it has been tested now
and we need to get the series back into iommu-next ASAP.

Thanks,
Jason


Re: [PATCH v3 0/3] iommu/arm-smmu: Support Tegra234 SMMU

2022-05-05 Thread Thierry Reding
On Thu, May 05, 2022 at 03:53:08PM +0100, Will Deacon wrote:
> On Thu, May 05, 2022 at 04:15:29PM +0200, Thierry Reding wrote:
> > On Fri, Apr 29, 2022 at 10:22:40AM +0200, Thierry Reding wrote:
> > > From: Thierry Reding 
> > > 
> > > Hi Joerg,
> > > 
> > > this is essentially a resend of v2 with a Acked-by:s from Robin and Will
> > > added. These have been on the list for quite a while now, but apparently
> > > there was a misunderstanding, so neither you nor Will picked this up.
> > > 
> > > Since Will acked these, I think it's probably best for you to pick these
> > > up directly. If not, let me know and I'll work with Will to merge via
> > > the ARM SMMU tree.
> > > 
> > > Thanks,
> > > Thierry
> > > 
> > > Thierry Reding (3):
> > >   dt-bindings: arm-smmu: Document nvidia,memory-controller property
> > >   dt-bindings: arm-smmu: Add compatible for Tegra234 SOC
> > >   iommu/arm-smmu: Support Tegra234 SMMU
> > > 
> > >  .../devicetree/bindings/iommu/arm,smmu.yaml   | 23 +--
> > >  drivers/iommu/arm/arm-smmu/arm-smmu-impl.c|  3 ++-
> > >  2 files changed, 23 insertions(+), 3 deletions(-)
> > 
> > Joerg,
> > 
> > anything left to do on this from your perspective, or can this go into
> > v5.19?
> 
> I'll pick them up in the Arm SMMU queue, as there are some other SMMU
> patches kicking around and we may as well keep them all together.

Sounds good, thanks!

Thierry



Re: [PATCH v3] iommu/mediatek: Fix NULL pointer dereference when printing dev_name

2022-05-05 Thread AngeloGioacchino Del Regno

Il 05/05/22 15:27, Miles Chen ha scritto:

When larbdev is NULL (in the case I hit, the node is incorrectly set
iommus = <&iommu NUM>), it will cause device_link_add() to fail and
kernel crashes when we try to print dev_name(larbdev).

Let's fail the probe if a larbdev is NULL to avoid invalid inputs from
dts.

It should work for normal correct setting and avoid the crash caused
by my incorrect setting.
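
A sketch of the kind of check being described; the helper below is
illustrative only (the real fix presumably places the test directly in
mtk_iommu_probe_device()):

#include <linux/device.h>
#include <linux/err.h>

static int mtk_link_to_larb(struct device *dev, struct device *larbdev)
{
	struct device_link *link;

	if (!larbdev)
		return -EINVAL;	/* e.g. a bad "iommus =" phandle in the DT */

	link = device_link_add(dev, larbdev,
			       DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS);
	if (!link)
		dev_err(dev, "Unable to link %s\n", dev_name(larbdev));
	return 0;
}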

Error log:
[   18.189042][  T301] Unable to handle kernel NULL pointer dereference at 
virtual address 0050
...
[   18.344519][  T301] pstate: a045 (NzCv daif +PAN -UAO)
[   18.345213][  T301] pc : mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu]
[   18.346050][  T301] lr : mtk_iommu_probe_device+0xd0/0x118 [mtk_iommu]
[   18.346884][  T301] sp : ffc00a5635e0
[   18.347392][  T301] x29: ffc00a5635e0 x28: ffd44a46c1d8
[   18.348156][  T301] x27: ff80c39a8000 x26: ffd44a80cc38
[   18.348917][  T301] x25:  x24: ffd44a80cc38
[   18.349677][  T301] x23: ffd44e4da4c6 x22: ffd44a80cc38
[   18.350438][  T301] x21: ff80cecd1880 x20: 
[   18.351198][  T301] x19: ff80c439f010 x18: ffc00a50d0c0
[   18.351959][  T301] x17:  x16: 0004
[   18.352719][  T301] x15: 0004 x14: ffd44eb5d420
[   18.353480][  T301] x13: 0ad2 x12: 0003
[   18.354241][  T301] x11: fad2 x10: c000fad2
[   18.355003][  T301] x9 : a0d288d8d7142d00 x8 : a0d288d8d7142d00
[   18.355763][  T301] x7 : ffd44c2bc640 x6 : 
[   18.356524][  T301] x5 : 0080 x4 : 0001
[   18.357284][  T301] x3 :  x2 : 0005
[   18.358045][  T301] x1 :  x0 : 
[   18.360208][  T301] Hardware name: MT6873 (DT)
[   18.360771][  T301] Call trace:
[   18.361168][  T301]  dump_backtrace+0xf8/0x1f0
[   18.361737][  T301]  dump_stack_lvl+0xa8/0x11c
[   18.362305][  T301]  dump_stack+0x1c/0x2c
[   18.362816][  T301]  mrdump_common_die+0x184/0x40c [mrdump]
[   18.363575][  T301]  ipanic_die+0x24/0x38 [mrdump]
[   18.364230][  T301]  atomic_notifier_call_chain+0x128/0x2b8
[   18.364937][  T301]  die+0x16c/0x568
[   18.365394][  T301]  __do_kernel_fault+0x1e8/0x214
[   18.365402][  T301]  do_page_fault+0xb8/0x678
[   18.366934][  T301]  do_translation_fault+0x48/0x64
[   18.368645][  T301]  do_mem_abort+0x68/0x148
[   18.368652][  T301]  el1_abort+0x40/0x64
[   18.368660][  T301]  el1h_64_sync_handler+0x54/0x88
[   18.368668][  T301]  el1h_64_sync+0x68/0x6c
[   18.368673][  T301]  mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu]
...

Cc: Robin Murphy 
Cc: Yong Wu 
Reported-by: kernel test robot 
Fixes: 635319a4a744 ("media: iommu/mediatek: Add device_link between the consumer 
and the larb devices")
Signed-off-by: Miles Chen 


Reviewed-by: AngeloGioacchino Del Regno 





Re: [PATCH v3 0/3] iommu/arm-smmu: Support Tegra234 SMMU

2022-05-05 Thread Will Deacon
On Thu, May 05, 2022 at 04:15:29PM +0200, Thierry Reding wrote:
> On Fri, Apr 29, 2022 at 10:22:40AM +0200, Thierry Reding wrote:
> > From: Thierry Reding 
> > 
> > Hi Joerg,
> > 
> > this is essentially a resend of v2 with a Acked-by:s from Robin and Will
> > added. These have been on the list for quite a while now, but apparently
> > there was a misunderstanding, so neither you nor Will picked this up.
> > 
> > Since Will acked these, I think it's probably best for you to pick these
> > up directly. If not, let me know and I'll work with Will to merge via
> > the ARM SMMU tree.
> > 
> > Thanks,
> > Thierry
> > 
> > Thierry Reding (3):
> >   dt-bindings: arm-smmu: Document nvidia,memory-controller property
> >   dt-bindings: arm-smmu: Add compatible for Tegra234 SOC
> >   iommu/arm-smmu: Support Tegra234 SMMU
> > 
> >  .../devicetree/bindings/iommu/arm,smmu.yaml   | 23 +--
> >  drivers/iommu/arm/arm-smmu/arm-smmu-impl.c|  3 ++-
> >  2 files changed, 23 insertions(+), 3 deletions(-)
> 
> Joerg,
> 
> anything left to do on this from your perspective, or can this go into
> v5.19?

I'll pick them up in the Arm SMMU queue, as there are some other SMMU
patches kicking around and we may as well keep them all together.

Will


Re: [PATCH v3 0/3] iommu/arm-smmu: Support Tegra234 SMMU

2022-05-05 Thread Thierry Reding
On Fri, Apr 29, 2022 at 10:22:40AM +0200, Thierry Reding wrote:
> From: Thierry Reding 
> 
> Hi Joerg,
> 
> this is essentially a resend of v2 with a Acked-by:s from Robin and Will
> added. These have been on the list for quite a while now, but apparently
> there was a misunderstanding, so neither you nor Will picked this up.
> 
> Since Will acked these, I think it's probably best for you to pick these
> up directly. If not, let me know and I'll work with Will to merge via
> the ARM SMMU tree.
> 
> Thanks,
> Thierry
> 
> Thierry Reding (3):
>   dt-bindings: arm-smmu: Document nvidia,memory-controller property
>   dt-bindings: arm-smmu: Add compatible for Tegra234 SOC
>   iommu/arm-smmu: Support Tegra234 SMMU
> 
>  .../devicetree/bindings/iommu/arm,smmu.yaml   | 23 +--
>  drivers/iommu/arm/arm-smmu/arm-smmu-impl.c|  3 ++-
>  2 files changed, 23 insertions(+), 3 deletions(-)

Joerg,

anything left to do on this from your perspective, or can this go into
v5.19?

Thanks,
Thierry



Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Jason Gunthorpe via iommu
On Thu, May 05, 2022 at 07:40:37AM +, Tian, Kevin wrote:
 
> In concept this is an iommu property instead of a domain property.

Not really, domains shouldn't be changing behaviors once they are
created. If a domain supports dirty tracking and I attach a new device
then it still must support dirty tracking.

I suppose we may need something here because we need to control when
domains are re-used if they don't have the right properties in case
the system iommu's are discontiguous somehow.

ie iommufd should be able to assert that dirty tracking is desired and
an existing non-dirty tracking capable domain will not be
automatically re-used.

We don't really have the right infrastructure to do this currently.
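
Purely as an illustration of that assertion (nothing below exists today),
the re-use path could carry a capability check along these lines:

#include <stdbool.h>

/* Invented structure: stands in for whatever iommufd tracks per hwpt. */
struct hwpt_candidate {
	bool supports_dirty_tracking;
};

/* Skip re-using a hwpt that cannot satisfy the requested dirty tracking. */
static bool hwpt_reusable(const struct hwpt_candidate *hwpt, bool want_dirty)
{
	return !want_dirty || hwpt->supports_dirty_tracking;
}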

> From this angle IMHO it's more reasonable to report this IOMMU
> property to userspace via a device capability. If all devices attached
> to a hwpt claim IOMMU dirty tracking capability, the user can call
> set_dirty_tracking() on the hwpt object. 

Inherent domain properties need to be immutable or, at least one-way,
like enforced coherent, or it just all stops making any kind of sense.

> Once dirty tracking is enabled on a hwpt, further attaching a device
> which doesn't claim this capability is simply rejected.

It would be OK to do as enforced coherent does and flip a domain
permanently into dirty-tracking enabled, or to specify a flag at domain
creation time.

Jason


Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Jason Gunthorpe via iommu
On Thu, May 05, 2022 at 11:03:18AM +, Tian, Kevin wrote:

> iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages
> in the said race window until the unmap and iotlb invalidation are completed.

No, the purpose is to perform "unmap" without destroying the dirty bit
in the process.
 
If an IOMMU architecture has a way to render the page unmapped and
flush back the dirty bit without destroying it, then it doesn't require a
write-protect pass.
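
A toy, userspace-C illustration of that idea: tear the translation down and
report its dirty state in a single atomic step, so no separate write-protect
pass is needed (the PTE layout below is made up):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT	(1u << 0)
#define PTE_DIRTY	(1u << 1)

/* Returns true if the just-removed mapping had been written to. */
static bool unmap_and_collect_dirty(_Atomic uint64_t *pte)
{
	uint64_t old = atomic_exchange(pte, 0);	/* unmap in one shot */

	return (old & PTE_PRESENT) && (old & PTE_DIRTY);
}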

Jason


Re: [PATCH v5 10/12] iommu: Prepare IOMMU domain for IOPF

2022-05-05 Thread Jean-Philippe Brucker
Hi Baolu,

On Thu, May 05, 2022 at 04:31:38PM +0800, Baolu Lu wrote:
> On 2022/5/4 02:20, Jean-Philippe Brucker wrote:
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index 7cae631c1baa..33449523afbe 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -3174,3 +3174,24 @@ void iommu_detach_device_pasid(struct iommu_domain 
> > > *domain,
> > >   iommu_group_put(group);
> > >   }
> > > +
> > > +struct iommu_domain *iommu_get_domain_for_dev_pasid(struct device *dev,
> > > + ioasid_t pasid)
> > > +{
> > > + struct iommu_domain *domain;
> > > + struct iommu_group *group;
> > > +
> > > + if (!pasid_valid(pasid))
> > > + return NULL;
> > > +
> > > + group = iommu_group_get(dev);
> > > + if (!group)
> > > + return NULL;
> > > +
> > > + mutex_lock(&group->mutex);
> > Unfortunately this still causes the deadlock when unbind() flushes the
> > IOPF queue while holding the group mutex.
> 
> Sorry, I didn't get your point here.
> 
> Do you mean unbind() could hold group mutex before calling this helper?
> The group mutex is only available in iommu.c. The unbind() has no means
> to hold this lock. Or did I miss anything?

I wasn't clear, it's iommu_detach_device_pasid() that holds the
group->mutex:

 iommu_sva_unbind_device()  |
  iommu_detach_device_pasid()   |
    mutex_lock(&group->mutex)|
   domain->ops->detach_dev_pasid()  | iopf_handle_group()
iopf_queue_flush_dev()  |  iommu_get_domain_for_dev_pasid()
 ... wait for IOPF work |   mutex_lock(&group->mutex)
|... deadlock
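
To make the suggestion further down concrete, a minimal sketch of the
lock-free lookup (illustrative only, helper name hypothetical):

	/*
	 * Sketch: fetch the domain without taking group->mutex, relying on
	 * the xarray's internal RCU protection.  Only safe because the IOPF
	 * work is flushed before the domain is freed.
	 */
	static struct iommu_domain *
	iopf_get_domain_for_dev_pasid(struct device *dev, ioasid_t pasid)
	{
		struct iommu_domain *domain = NULL;
		struct iommu_group *group;

		if (!pasid_valid(pasid))
			return NULL;

		group = iommu_group_get(dev);
		if (!group)
			return NULL;

		domain = xa_load(&group->pasid_array, pasid);
		iommu_group_put(group);

		return domain;
	}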

Thanks,
Jean

> 
> Best regards,
> baolu
> 
> > 
> > If we make this function private to IOPF, then we can get rid of this
> > mutex_lock(). It's OK because:
> > 
> > * xarray protects its internal state with RCU, so we can call
> >xa_load() outside the lock.
> > 
> > * The domain obtained from xa_load is finalized. Its content is valid
> >because xarray stores the domain using rcu_assign_pointer(), which has a
> >release memory barrier, which pairs with data dependencies in IOPF
> >(domain->sva_ioas etc).
> > 
> >We'll need to be careful about this when allowing other users to install
> >a fault handler. Should be fine as long as the handler and data are
> >installed before the domain is added to pasid_array.
> > 
> > * We know the domain is valid the whole time IOPF is using it, because
> >unbind() waits for pending faults.
> > 
> > We just need a comment explaining the last point, something like:
> > 
> > /*
> > * Safe to fetch outside the group mutex because:
> >  * - xarray protects its internal state with RCU
> >  * - the domain obtained is either NULL or fully formed
> > * - the IOPF work is the only caller and is flushed before the
> > *   domain is freed.
> >  */
> > 
> > Thanks,
> > Jean
> > 
> > > + domain = xa_load(&group->pasid_array, pasid);
> > > + mutex_unlock(&group->mutex);
> > > + iommu_group_put(group);
> > > +
> > > + return domain;
> > > +}
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH v3] iommu/mediatek: Fix NULL pointer dereference when printing dev_name

2022-05-05 Thread Miles Chen via iommu
When larbdev is NULL (in the case I hit, the node is incorrectly set
iommus = < NUM>), it will cause device_link_add() to fail, and the
kernel crashes when we try to print dev_name(larbdev).

Let's fail the probe if a larbdev is NULL to avoid invalid inputs from
dts.

It should work for normal correct setting and avoid the crash caused
by my incorrect setting.

Error log:
[   18.189042][  T301] Unable to handle kernel NULL pointer dereference at 
virtual address 0050
...
[   18.344519][  T301] pstate: a045 (NzCv daif +PAN -UAO)
[   18.345213][  T301] pc : mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu]
[   18.346050][  T301] lr : mtk_iommu_probe_device+0xd0/0x118 [mtk_iommu]
[   18.346884][  T301] sp : ffc00a5635e0
[   18.347392][  T301] x29: ffc00a5635e0 x28: ffd44a46c1d8
[   18.348156][  T301] x27: ff80c39a8000 x26: ffd44a80cc38
[   18.348917][  T301] x25:  x24: ffd44a80cc38
[   18.349677][  T301] x23: ffd44e4da4c6 x22: ffd44a80cc38
[   18.350438][  T301] x21: ff80cecd1880 x20: 
[   18.351198][  T301] x19: ff80c439f010 x18: ffc00a50d0c0
[   18.351959][  T301] x17:  x16: 0004
[   18.352719][  T301] x15: 0004 x14: ffd44eb5d420
[   18.353480][  T301] x13: 0ad2 x12: 0003
[   18.354241][  T301] x11: fad2 x10: c000fad2
[   18.355003][  T301] x9 : a0d288d8d7142d00 x8 : a0d288d8d7142d00
[   18.355763][  T301] x7 : ffd44c2bc640 x6 : 
[   18.356524][  T301] x5 : 0080 x4 : 0001
[   18.357284][  T301] x3 :  x2 : 0005
[   18.358045][  T301] x1 :  x0 : 
[   18.360208][  T301] Hardware name: MT6873 (DT)
[   18.360771][  T301] Call trace:
[   18.361168][  T301]  dump_backtrace+0xf8/0x1f0
[   18.361737][  T301]  dump_stack_lvl+0xa8/0x11c
[   18.362305][  T301]  dump_stack+0x1c/0x2c
[   18.362816][  T301]  mrdump_common_die+0x184/0x40c [mrdump]
[   18.363575][  T301]  ipanic_die+0x24/0x38 [mrdump]
[   18.364230][  T301]  atomic_notifier_call_chain+0x128/0x2b8
[   18.364937][  T301]  die+0x16c/0x568
[   18.365394][  T301]  __do_kernel_fault+0x1e8/0x214
[   18.365402][  T301]  do_page_fault+0xb8/0x678
[   18.366934][  T301]  do_translation_fault+0x48/0x64
[   18.368645][  T301]  do_mem_abort+0x68/0x148
[   18.368652][  T301]  el1_abort+0x40/0x64
[   18.368660][  T301]  el1h_64_sync_handler+0x54/0x88
[   18.368668][  T301]  el1h_64_sync+0x68/0x6c
[   18.368673][  T301]  mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu]
...

Cc: Robin Murphy 
Cc: Yong Wu 
Reported-by: kernel test robot 
Fixes: 635319a4a744 ("media: iommu/mediatek: Add device_link between the 
consumer and the larb devices")
Signed-off-by: Miles Chen 

---

Change since v2
probe fail if larbdev is NULL so we do not have to worry about release logic

Change since v1
fix a build warning reported by kernel test robot 
https://lore.kernel.org/lkml/202204231446.iykdz674-...@intel.com/

---
 drivers/iommu/mtk_iommu.c| 6 ++
 drivers/iommu/mtk_iommu_v1.c | 7 +++
 2 files changed, 13 insertions(+)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 6fd75a60abd6..155acfbce44f 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -572,6 +572,9 @@ static struct iommu_device *mtk_iommu_probe_device(struct 
device *dev)
 * All the ports in each a device should be in the same larbs.
 */
larbid = MTK_M4U_TO_LARB(fwspec->ids[0]);
+   if (larbid >= MTK_LARB_NR_MAX)
+   return ERR_PTR(-EINVAL);
+
for (i = 1; i < fwspec->num_ids; i++) {
larbidx = MTK_M4U_TO_LARB(fwspec->ids[i]);
if (larbid != larbidx) {
@@ -581,6 +584,9 @@ static struct iommu_device *mtk_iommu_probe_device(struct 
device *dev)
}
}
larbdev = data->larb_imu[larbid].dev;
+   if (!larbdev)
+   return ERR_PTR(-EINVAL);
+
link = device_link_add(dev, larbdev,
   DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS);
if (!link)
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index ecff800656e6..74563f689fbd 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -80,6 +80,7 @@
 /* MTK generation one iommu HW only support 4K size mapping */
 #define MT2701_IOMMU_PAGE_SHIFT12
 #define MT2701_IOMMU_PAGE_SIZE (1UL << MT2701_IOMMU_PAGE_SHIFT)
+#define MT2701_LARB_NR_MAX 3
 
 /*
  * MTK m4u support 4GB iova address space, and only support 4K page
@@ -457,6 +458,9 @@ static struct iommu_device *mtk_iommu_probe_device(struct 
device *dev)
 
/* Link the consumer device with the smi-larb device(supplier) */
larbid = mt2701_m4u_to_larb(fwspec->ids[0]);
+   if (larbid >= MT2701_LARB_NR_MAX)
+   return ERR_PTR(-EINVAL);
+

Re: [PATCH 1/7] dt-bindings: gpio: renesas,rcar-gpio: R-Car V3U is R-Car Gen4

2022-05-05 Thread Bartosz Golaszewski
On Mon, May 2, 2022 at 3:35 PM Geert Uytterhoeven
 wrote:
>
> Despite the name, R-Car V3U is the first member of the R-Car Gen4
> family.  Hence move its compatible value to the R-Car Gen4 section.
>
> Signed-off-by: Geert Uytterhoeven 
> ---
>  Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml 
> b/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml
> index 0681a4790cd62e23..75e5da6a7cc04bbd 100644
> --- a/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml
> +++ b/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml
> @@ -48,11 +48,9 @@ properties:
>- renesas,gpio-r8a77995 # R-Car D3
>- const: renesas,rcar-gen3-gpio # R-Car Gen3 or RZ/G2
>
> -  - items:
> -  - const: renesas,gpio-r8a779a0  # R-Car V3U
> -
>- items:
>- enum:
> +  - renesas,gpio-r8a779a0 # R-Car V3U
>- renesas,gpio-r8a779f0 # R-Car S4-8
>- const: renesas,rcar-gen4-gpio # R-Car Gen4
>
> --
> 2.25.1
>

Acked-by: Bartosz Golaszewski 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 4/4] iommu/vt-d: Remove hard coding PGSNP bit in PASID entries

2022-05-05 Thread Baolu Lu

On 2022/5/5 16:46, Tian, Kevin wrote:

From: Lu Baolu 
Sent: Thursday, May 5, 2022 9:07 AM

As enforce_cache_coherency has been introduced into the
iommu_domain_ops,
the kernel component which owns the iommu domain is able to opt-in its
requirement for force snooping support. The iommu driver has no need to
hard code the page snoop control bit in the PASID table entries anymore.

Signed-off-by: Lu Baolu 


Reviewed-by: Kevin Tian , with one nit:


---
  drivers/iommu/intel/pasid.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 41a0e3b02c79..0abfa7fc7fb0 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -710,9 +710,6 @@ int intel_pasid_setup_second_level(struct
intel_iommu *iommu,
pasid_set_fault_enable(pte);
pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));


Probably in a separate patch but above should really be renamed
to pasid_set_page_walk_snoop().


Yeah! Need a cleanup here. Above name is confusing.





-   if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
-   pasid_set_pgsnp(pte);
-
/*
 * Since it is a second level only translation setup, we should
 * set SRE bit as well (addresses are expected to be GPAs).
--
2.25.1




Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device

2022-05-05 Thread Yicong Yang via iommu
Hi Leo,

Thanks for the comments. Some questions and replies below.

On 2022/4/30 0:00, Leo Yan wrote:
> On Thu, Apr 07, 2022 at 08:58:36PM +0800, Yicong Yang via iommu wrote:
>> HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated
>> Endpoint(RCiEP) device, providing the capability to dynamically monitor and
>> tune the PCIe traffic, and trace the TLP headers.
>>
>> Add the driver for the device to enable the trace function. Register PMU
>> device of PTT trace, then users can use trace through perf command. The
>> driver makes use of perf AUX trace and support following events to
>> configure the trace:
>>
>> - filter: select Root port or Endpoint to trace
>> - type: select the type of traced TLP headers
>> - direction: select the direction of traced TLP headers
>> - format: select the data format of the traced TLP headers
>>
>> This patch adds the driver part of PTT trace. The perf command support of
>> PTT trace is added in the following patch.
>>
>> Signed-off-by: Yicong Yang 
>> Reviewed-by: Jonathan Cameron 
>> ---
[...]
>> +static int hisi_ptt_update_aux(struct hisi_ptt *hisi_ptt, int index, bool 
>> stop)
>> +{
>> +struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl;
>> +struct perf_output_handle *handle = &ctrl->handle;
>> +struct perf_event *event = handle->event;
>> +struct hisi_ptt_pmu_buf *buf;
>> +void *addr;
>> +
>> +buf = perf_get_aux(handle);
>> +if (!buf || !handle->size)
>> +return -EINVAL;
>> +
>> +addr = ctrl->trace_buf[ctrl->buf_index].addr;
>> +
>> +memcpy(buf->base + buf->pos, addr, HISI_PTT_TRACE_BUF_SIZE);
>> +memset(addr, 0, HISI_PTT_TRACE_BUF_SIZE);
> 
> I am a bit worry buffer usages, at least for below aspects:
> 
> The first thing is for memset(), which cleans up the buffer and the
> buffer size is 4MiB, this means it will consume much CPU time to
> cleanup the buffer, and trace_buf is mapped as non-cacheable, the
> performance would be get worse.
> 
> The second thing is here it always copies the trace data with size
> HISI_PTT_TRACE_BUF_SIZE, usually, the trace module can provide a read
> pointer register, so you can know the trace data length based on the
> delta value between write and read pointers.
> 
> The last thing is the ctrl->trace_buf[] works as bounce buffer, so it
> means actually there have an extra copy from bounce buffer to AUX
> buffer, is it possible to directly output PTT trace data to AUX buffer?
> 
> Sorry if I bring up duplicate questions and before have the simliar
> discussion when reviewed the patch.
> 

The hardware is designed to use 4 DMA buffers and fill the buffer with the
traced data one by one. When one buffer is full the device will notify the
driver by interrupt and continue to trace the PCIe TLPs in the following
buffer without pausing. If the interrupt status bit of the buffer going to
use is uncleared, the trace will be paused until the corresponding interrupt
status bit is cleared. So the buffer size of 4 MiB is chosen so that even
at the maximum flow the driver can handle the traced data and clear the status
in time, so there won't be any data loss.

For the first thing, the performance is acceptable as we only handle 4MiB at
one time. It's in thread context so we won't block others. The hardware can
keep tracing in the following DMA buffer so handling here won't pause the
trace and will not cause data loss.

For the second thing, this function is called in 2 places: 1) in the IRQ thread
2) the trace is going to stop. In the 1st case the data length will always be
HISI_PTT_TRACE_BUF_SIZE as the interrupt only occurs when one buffer is full.
In the 2nd case we may not have HISI_PTT_TRACE_BUF_SIZE data; the unfilled part
of the buffer is zeroed so it can be distinguished. We keep committing
HISI_PTT_TRACE_BUF_SIZE data to keep consistency with the interrupt handling and
make it simpler here. (The HISI_PTT_TRACE_WR_STS register indicates which buffer
is currently in use and the offset in the buffer that data is currently being
written to.)

For the third thing, I'm not sure if we can map the AUX buffer as the DMA buffer.
In that case, when the buffer is full and we need to commit the AUX buffer and
apply for a new one, the trace would be paused during the procedure and any TLPs
on the link would be missed. But with the current approach we won't have this
problem, as tracing is still on even while we're committing the AUX buffer, since
the hardware is not writing to it directly.

>> +buf->pos += HISI_PTT_TRACE_BUF_SIZE;
>> +
>> +if (stop) {
>> +perf_aux_output_end(handle, buf->pos);
>> +} else if (buf->length - buf->pos < HISI_PTT_TRACE_BUF_SIZE) {
>> +perf_aux_output_skip(handle, buf->length - buf->pos);
>> +perf_aux_output_end(handle, buf->pos);
>> +
>> +buf = perf_aux_output_begin(handle, event);
>> +if (!buf)
>> +return -EINVAL;
>> +
>> +buf->pos = handle->head % buf->length;
>> +if (buf->length - 

Re: [PATCH v2 2/4] iommu/vt-d: Check domain force_snooping against attached devices

2022-05-05 Thread Baolu Lu

On 2022/5/5 16:43, Tian, Kevin wrote:

From: Lu Baolu 
Sent: Thursday, May 5, 2022 9:07 AM

As domain->force_snooping only impacts the devices attached with the
domain, there's no need to check against all IOMMU units. At the same
time, for a brand new domain (hasn't been attached to any device), the
force_snooping field could be set, but the attach_dev callback will
return failure if it wants to attach to a device which IOMMU has no
snoop control capability.


The description about the brand new domain is not very clear. I think the
point here is that force_snooping could be set on a domain no matter
whether it has been attached or not, and once set it is an immutable
flag. If no device is attached the operation always succeeds, and then
this empty domain can only be attached to a device whose IOMMU
supports snoop control.


Exactly. Will update this description.




  static bool intel_iommu_enforce_cache_coherency(struct iommu_domain
*domain)
  {
struct dmar_domain *dmar_domain = to_dmar_domain(domain);

-   if (!domain_update_iommu_snooping(NULL))
+   if (dmar_domain->force_snooping)
+   return true;
+
+   if (!domain_support_force_snooping(dmar_domain))
return false;
+


Who guarantees that domain->devices won't change between
above condition check and following set operation?


Good catch. Should lift the lock up here.
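
Something along the lines below, assuming the existing device_domain_lock
is what protects the domain's device list (sketch only, not the final
patch):

	static bool
	intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
	{
		struct dmar_domain *dmar_domain = to_dmar_domain(domain);
		unsigned long flags;

		if (dmar_domain->force_snooping)
			return true;

		spin_lock_irqsave(&device_domain_lock, flags);
		if (!domain_support_force_snooping(dmar_domain)) {
			spin_unlock_irqrestore(&device_domain_lock, flags);
			return false;
		}

		domain_set_force_snooping(dmar_domain);
		dmar_domain->force_snooping = true;
		spin_unlock_irqrestore(&device_domain_lock, flags);

		return true;
	}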




+   domain_set_force_snooping(dmar_domain);
dmar_domain->force_snooping = true;
+
return true;
  }



Thanks
Kevin


Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Joao Martins
On 5/5/22 12:03, Tian, Kevin wrote:
>> From: Joao Martins 
>> Sent: Thursday, May 5, 2022 6:07 PM
>>
>> On 5/5/22 08:42, Tian, Kevin wrote:
 From: Jason Gunthorpe 
 Sent: Tuesday, May 3, 2022 2:53 AM

 On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> On Fri, 29 Apr 2022 05:45:20 +
> "Tian, Kevin"  wrote:
>>> From: Joao Martins 
>>>  3) Unmapping an IOVA range while returning its dirty bit prior to
>>> unmap. This case is specific for non-nested vIOMMU case where an
>>> erroneous guest (or device) DMAing to an address being unmapped at
 the
>>> same time.
>>
>> an erroneous attempt like above cannot anticipate which DMAs can
>> succeed in that window thus the end behavior is undefined. For an
>> undefined behavior nothing will be broken by losing some bits dirtied
>> in the window between reading back dirty bits of the range and
>> actually calling unmap. From guest p.o.v. all those are black-box
>> hardware logic to serve a virtual iotlb invalidation request which just
>> cannot be completed in one cycle.
>>
>> Hence in reality probably this is not required except to meet vfio
>> compat requirement. Just in concept returning dirty bits at unmap
>> is more accurate.
>>
>> I'm slightly inclined to abandon it in iommufd uAPI.
>
> Sorry, I'm not following why an unmap with returned dirty bitmap
> operation is specific to a vIOMMU case, or in fact indicative of some
> sort of erroneous, racy behavior of guest or device.

 It is being compared against the alternative which is to explicitly
 query dirty then do a normal unmap as two system calls and permit a
 race.

 The only case with any difference is if the guest is racing DMA with
 the unmap - in which case it is already indeterminate for the guest if
 the DMA will be completed or not.

 eg on the vIOMMU case if the guest races DMA with unmap then we are
 already fine with throwing away that DMA because that is how the race
 resolves during non-migration situations, so resolving it as throwing
 away the DMA during migration is OK too.

> We need the flexibility to support memory hot-unplug operations
> during migration,

 I would have thought that hotplug during migration would simply
 discard all the data - how does it use the dirty bitmap?

> This was implemented as a single operation specifically to avoid
> races where ongoing access may be available after retrieving a
> snapshot of the bitmap.  Thanks,

 The issue is the cost.

 On a real iommu eliminating the race is expensive as we have to write
 protect the pages before query dirty, which seems to be an extra IOTLB
 flush.

 It is not clear if paying this cost to become atomic is actually
 something any use case needs.

 So, I suggest we think about a 3rd op 'write protect and clear
 dirties' that will be followed by a normal unmap - the extra op will
 have the extra overhead and userspace can decide if it wants to pay or
 not vs the non-atomic read dirties operation. And lets have a use case
 where this must be atomic before we implement it..
>>>
>>> and write-protection also relies on the support of I/O page fault...
>>>
>> /I think/ all IOMMUs in this series already support permission/unrecoverable
>> I/O page faults for a long time IIUC.
>>
>> The earlier suggestion was just to discard the I/O page fault after
>> write-protection happens. fwiw, some IOMMUs also support suppressing
>> the event notification (like AMD).
> 
> iiuc the purpose of 'write-protection' here is to capture in-flight dirty pages
> in the said race window until the unmap and iotlb invalidation are completed.
> 
But then we depend on PRS being there on the device, because without it, DMA is
aborted on the target on a read-only IOVA prior to the page fault, thus the page
is not going to be dirty anyways.

> *unrecoverable* faults are not expected to be used in a feature path
> as occurrence of such faults may lead to severe reaction in iommu
> drivers e.g. completely block DMA from the device causing such faults.

Unless I totally misunderstood ... the latter is actually what we were suggesting
here /in the context of unmapping a GIOVA/ (*).

The wrprotect() was there to ensure we get an atomic dirty state of the IOVA range
afterwards, by blocking DMA (as opposed to sort of mediating DMA). The I/O page
fault is not supposed to happen unless there's rogue DMA, AIUI.

TBH, the same could be said for a normal DMA unmap, as that does not make any sort
of guarantee of stopping DMA until the IOTLB flush happens.

(*) Although I am not saying the use-case of wrprotect() and mediating dirty pages
you mention isn't useful. I guess it is in a world where we want to support
post-copy migration with VFs, which would require some form of PRI (via the PF?) of the 

RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Tian, Kevin
> From: Joao Martins 
> Sent: Thursday, May 5, 2022 6:07 PM
> 
> On 5/5/22 08:42, Tian, Kevin wrote:
> >> From: Jason Gunthorpe 
> >> Sent: Tuesday, May 3, 2022 2:53 AM
> >>
> >> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> >>> On Fri, 29 Apr 2022 05:45:20 +
> >>> "Tian, Kevin"  wrote:
> > From: Joao Martins 
> >  3) Unmapping an IOVA range while returning its dirty bit prior to
> > unmap. This case is specific for non-nested vIOMMU case where an
> > erroneous guest (or device) DMAing to an address being unmapped at
> >> the
> > same time.
> 
>  an erroneous attempt like above cannot anticipate which DMAs can
>  succeed in that window thus the end behavior is undefined. For an
>  undefined behavior nothing will be broken by losing some bits dirtied
>  in the window between reading back dirty bits of the range and
>  actually calling unmap. From guest p.o.v. all those are black-box
>  hardware logic to serve a virtual iotlb invalidation request which just
>  cannot be completed in one cycle.
> 
>  Hence in reality probably this is not required except to meet vfio
>  compat requirement. Just in concept returning dirty bits at unmap
>  is more accurate.
> 
>  I'm slightly inclined to abandon it in iommufd uAPI.
> >>>
> >>> Sorry, I'm not following why an unmap with returned dirty bitmap
> >>> operation is specific to a vIOMMU case, or in fact indicative of some
> >>> sort of erroneous, racy behavior of guest or device.
> >>
> >> It is being compared against the alternative which is to explicitly
> >> query dirty then do a normal unmap as two system calls and permit a
> >> race.
> >>
> >> The only case with any difference is if the guest is racing DMA with
> >> the unmap - in which case it is already indeterminate for the guest if
> >> the DMA will be completed or not.
> >>
> >> eg on the vIOMMU case if the guest races DMA with unmap then we are
> >> already fine with throwing away that DMA because that is how the race
> >> resolves during non-migration situations, so resolving it as throwing
> >> away the DMA during migration is OK too.
> >>
> >>> We need the flexibility to support memory hot-unplug operations
> >>> during migration,
> >>
> >> I would have thought that hotplug during migration would simply
> >> discard all the data - how does it use the dirty bitmap?
> >>
> >>> This was implemented as a single operation specifically to avoid
> >>> races where ongoing access may be available after retrieving a
> >>> snapshot of the bitmap.  Thanks,
> >>
> >> The issue is the cost.
> >>
> >> On a real iommu eliminating the race is expensive as we have to write
> >> protect the pages before query dirty, which seems to be an extra IOTLB
> >> flush.
> >>
> >> It is not clear if paying this cost to become atomic is actually
> >> something any use case needs.
> >>
> >> So, I suggest we think about a 3rd op 'write protect and clear
> >> dirties' that will be followed by a normal unmap - the extra op will
> >> have the extra overhead and userspace can decide if it wants to pay or
> >> not vs the non-atomic read dirties operation. And lets have a use case
> >> where this must be atomic before we implement it..
> >
> > and write-protection also relies on the support of I/O page fault...
> >
> /I think/ all IOMMUs in this series already support permission/unrecoverable
> I/O page faults for a long time IIUC.
> 
> The earlier suggestion was just to discard the I/O page fault after
> write-protection happens. fwiw, some IOMMUs also support suppressing
> the event notification (like AMD).

iiuc the purpose of 'write-protection' here is to capture in-flight dirty pages
in the said race window until the unmap and iotlb invalidation are completed.

*unrecoverable* faults are not expected to be used in a feature path
as occurrence of such faults may lead to severe reaction in iommu
drivers e.g. completely block DMA from the device causing such faults.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain

2022-05-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Thursday, May 5, 2022 3:09 AM
> 
> Once the group enters 'owned' mode it can never be assigned back to the
> default_domain or to a NULL domain. It must always be actively assigned to

It is worth pointing out that a NULL domain does not always translate to DMA
being blocked on all platforms. That was a wrong assumption before this patch.

> a current domain. If the caller hasn't provided a domain then the core
> must provide an explicit DMA blocking domain that has no DMA map.
> 
> Lazily create a group-global blocking DMA domain when
> iommu_group_claim_dma_owner is first called and immediately assign the
> group to it. This ensures that DMA is immediately fully isolated on all
> IOMMU drivers.
> 
> If the user attaches/detaches while owned then detach will set the group
> back to the blocking domain.
> 
> Slightly reorganize the call chains so that
> __iommu_group_attach_core_domain() is the function that removes any
> caller
> configured domain and sets the domains back a core owned domain with an
> appropriate lifetime.
> 
> __iommu_group_attach_domain() is the worker function that can change the
> domain assigned to a group to any target domain, including NULL.
> 
> Add comments clarifying how the NULL vs detach_dev vs default_domain
> works
> based on Robin's remarks.
> 
> This fixes an oops with VFIO and SMMUv3 because VFIO will call
> iommu_detach_group() and then immediately iommu_domain_free(), but
> SMMUv3 has no way to know that the domain it is holding a pointer to
> has been freed. Now the iommu_detach_group() will assign the blocking
> domain and SMMUv3 will no longer hold a stale domain reference.

Overall I like what this patch does. Just some nits below.

> 
> Fixes: 1ea2a07a532b ("iommu: Add DMA ownership management
> interfaces")
> Reported-by: Qian Cai 
> Tested-by: Baolu Lu 
> Tested-by: Nicolin Chen 
> Co-developed-by: Robin Murphy 
> Signed-off-by: Robin Murphy 
> Signed-off-by: Jason Gunthorpe 
> ---
>  drivers/iommu/iommu.c | 122 ++
>  1 file changed, 87 insertions(+), 35 deletions(-)
> 
> Joerg, this should address the issue, Nicolin reproduced the original issue
> and verified this fix on ARM SMMUv3.
> 
> v2:
>  - Remove redundant detach_dev ops check in __iommu_detach_device and
>make the added WARN_ON fail instead
>  - Check for blocking_domain in __iommu_attach_group() so VFIO can
>actually attach a new group
>  - Update comments and spelling
>  - Fix missed change to new_domain in iommu_group_do_detach_device()
> v1: https://lore.kernel.org/r/0-v1-6e9d2d0a759d+11b-
> iommu_dma_block_...@nvidia.com
> 
> Thanks,
> Jason
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 0c42ece2585406..c1bdec807ea381 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -44,6 +44,7 @@ struct iommu_group {
>   char *name;
>   int id;
>   struct iommu_domain *default_domain;
> + struct iommu_domain *blocking_domain;
>   struct iommu_domain *domain;
>   struct list_head entry;
>   unsigned int owner_cnt;
> @@ -82,8 +83,7 @@ static int __iommu_attach_device(struct
> iommu_domain *domain,
>struct device *dev);
>  static int __iommu_attach_group(struct iommu_domain *domain,
>   struct iommu_group *group);
> -static void __iommu_detach_group(struct iommu_domain *domain,
> -  struct iommu_group *group);
> +static void __iommu_group_attach_core_domain(struct iommu_group
> *group);
>  static int iommu_create_device_direct_mappings(struct iommu_group
> *group,
>  struct device *dev);
>  static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
> @@ -596,6 +596,8 @@ static void iommu_group_release(struct kobject
> *kobj)
> 
>   if (group->default_domain)
>   iommu_domain_free(group->default_domain);
> + if (group->blocking_domain)
> + iommu_domain_free(group->blocking_domain);
> 
>   kfree(group->name);
>   kfree(group);
> @@ -1963,9 +1965,6 @@ static void __iommu_detach_device(struct
> iommu_domain *domain,
>   if (iommu_is_attach_deferred(dev))
>   return;
> 
> - if (unlikely(domain->ops->detach_dev == NULL))
> - return;
> -
>   domain->ops->detach_dev(domain, dev);
>   trace_detach_device_from_domain(dev);
>  }
> @@ -1979,12 +1978,10 @@ void iommu_detach_device(struct
> iommu_domain *domain, struct device *dev)
>   return;
> 
>   mutex_lock(&group->mutex);
> - if (iommu_group_device_count(group) != 1) {
> - WARN_ON(1);
> + if (WARN_ON(domain != group->domain) ||
> + WARN_ON(iommu_group_device_count(group) != 1))
>   goto out_unlock;
> - }
> -
> - __iommu_detach_group(domain, group);
> + __iommu_group_attach_core_domain(group);
> 
>  out_unlock:
>   mutex_unlock(&group->mutex);
> @@ -2040,7 

Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Joao Martins
On 5/5/22 08:42, Tian, Kevin wrote:
>> From: Jason Gunthorpe 
>> Sent: Tuesday, May 3, 2022 2:53 AM
>>
>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>>> On Fri, 29 Apr 2022 05:45:20 +
>>> "Tian, Kevin"  wrote:
> From: Joao Martins 
>  3) Unmapping an IOVA range while returning its dirty bit prior to
> unmap. This case is specific for non-nested vIOMMU case where an
> erroneous guest (or device) DMAing to an address being unmapped at
>> the
> same time.

 an erroneous attempt like above cannot anticipate which DMAs can
 succeed in that window thus the end behavior is undefined. For an
 undefined behavior nothing will be broken by losing some bits dirtied
 in the window between reading back dirty bits of the range and
 actually calling unmap. From guest p.o.v. all those are black-box
 hardware logic to serve a virtual iotlb invalidation request which just
 cannot be completed in one cycle.

 Hence in reality probably this is not required except to meet vfio
 compat requirement. Just in concept returning dirty bits at unmap
 is more accurate.

 I'm slightly inclined to abandon it in iommufd uAPI.
>>>
>>> Sorry, I'm not following why an unmap with returned dirty bitmap
>>> operation is specific to a vIOMMU case, or in fact indicative of some
>>> sort of erroneous, racy behavior of guest or device.
>>
>> It is being compared against the alternative which is to explicitly
>> query dirty then do a normal unmap as two system calls and permit a
>> race.
>>
>> The only case with any difference is if the guest is racing DMA with
>> the unmap - in which case it is already indeterminate for the guest if
>> the DMA will be completed or not.
>>
>> eg on the vIOMMU case if the guest races DMA with unmap then we are
>> already fine with throwing away that DMA because that is how the race
>> resolves during non-migration situations, so resolving it as throwing
>> away the DMA during migration is OK too.
>>
>>> We need the flexibility to support memory hot-unplug operations
>>> during migration,
>>
>> I would have thought that hotplug during migration would simply
>> discard all the data - how does it use the dirty bitmap?
>>
>>> This was implemented as a single operation specifically to avoid
>>> races where ongoing access may be available after retrieving a
>>> snapshot of the bitmap.  Thanks,
>>
>> The issue is the cost.
>>
>> On a real iommu eliminating the race is expensive as we have to write
>> protect the pages before query dirty, which seems to be an extra IOTLB
>> flush.
>>
>> It is not clear if paying this cost to become atomic is actually
>> something any use case needs.
>>
>> So, I suggest we think about a 3rd op 'write protect and clear
>> dirties' that will be followed by a normal unmap - the extra op will
>> have the extra overhead and userspace can decide if it wants to pay or
>> not vs the non-atomic read dirties operation. And lets have a use case
>> where this must be atomic before we implement it..
> 
> and write-protection also relies on the support of I/O page fault...
> 
/I think/ all IOMMUs in this series already support permission/unrecoverable
I/O page faults for a long time IIUC.

The earlier suggestion was just to discard the I/O page fault after
write-protection happens. fwiw, some IOMMUs also support suppressing
the event notification (like AMD).
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-05-05 Thread Joao Martins
On 5/5/22 08:25, Shameerali Kolothum Thodi wrote:
>> -Original Message-
>> From: Joao Martins [mailto:joao.m.mart...@oracle.com]
>> Sent: 29 April 2022 12:05
>> To: Tian, Kevin 
>> Cc: Joerg Roedel ; Suravee Suthikulpanit
>> ; Will Deacon ; Robin
>> Murphy ; Jean-Philippe Brucker
>> ; zhukeqian ;
>> Shameerali Kolothum Thodi ;
>> David Woodhouse ; Lu Baolu
>> ; Jason Gunthorpe ; Nicolin Chen
>> ; Yishai Hadas ; Eric Auger
>> ; Liu, Yi L ; Alex Williamson
>> ; Cornelia Huck ;
>> k...@vger.kernel.org; iommu@lists.linux-foundation.org
>> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
>> set_dirty_tracking_range() support
>>
>> On 4/29/22 09:28, Tian, Kevin wrote:
 From: Joao Martins 
 Sent: Friday, April 29, 2022 5:09 AM

 Similar to .read_and_clear_dirty() use the page table
 walker helper functions and set DBM|RDONLY bit, thus
 switching the IOPTE to writeable-clean.
>>>
>>> this should not be one-off if the operation needs to be
>>> applied to IOPTE. Say a map request comes right after
>>> set_dirty_tracking() is called. If it's agreed to remove
>>> the range op then smmu driver should record the tracking
>>> status internally and then apply the modifier to all the new
>>> mappings automatically before dirty tracking is disabled.
>>> Otherwise the same logic needs to be kept in iommufd to
>>> call set_dirty_tracking_range() explicitly for every new
>>> iopt_area created within the tracking window.
>>
>> Gah, I totally missed that by mistake. New mappings aren't
>> carrying over the "DBM is set". This needs a new io-pgtable
>> quirk added post dirty-tracking toggling.
>>
>> I can adjust, but I am at odds on including this in a future
>> iteration given that I can't really test any of this stuff.
>> Might drop the driver until I have hardware/emulation I can
>> use (or maybe others can take over this). It was included
>> for revising the iommu core ops and whether iommufd was
>> affected by it.
> 
> [+Kunkun Jiang]. I think he is now looking into this and might have
> a test setup to verify this.

I'll keep him CC'ed next iterations. Thanks!

FWIW, this should change a bit in the next iteration (simpler)
by always enabling DBM from the start. SMMUv3's ::set_dirty_tracking() becomes
a simpler function that tests quirks (i.e. DBM set) and whatnot, and calls
read_and_clear_dirty() without a bitmap argument to clear dirties.
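
Roughly something like the below (sketch only: the helper name and the
bitmap-less clearing call are illustrative, not the RFC code):

	/*
	 * With DBM set on all IOPTEs from the start, "enabling" dirty
	 * tracking only needs to sanity-check the capability and clear any
	 * stale dirty state; a NULL bitmap means "clear without reporting".
	 */
	static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
					       bool enable)
	{
		struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

		if (!smmu_domain->pgtbl_ops)
			return -ENODEV;

		if (!enable)
			return 0;

		return arm_smmu_read_and_clear_dirty(domain, 0, ULONG_MAX,
						     NULL);
	}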
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 10/37] iommu/amd: Introduce per PCI segment last_bdf

2022-05-05 Thread Vasant Hegde via iommu
Hi Joerg,


On 5/2/2022 4:24 PM, Joerg Roedel wrote:
> Hi Vasant,
> 
> On Fri, Apr 29, 2022 at 08:15:49PM +0530, Vasant Hegde wrote:
>> We still need to parse IVHD to find max devices supported by each PCI segment
>> (same as the way its doing it today). Hence we need all these variables.
> 
> From what I have seen since a few years the IVRS tables enumerate the
> whole PCI segment, up to device ff:1f.7. This results in the maximum
> being allocated for all data structures anyway. Therefore we can
> probably think about skipping the scan to find the largest bdf and just
> assume it is ff:1f.7, saving us all the size-tracking variables?

With PCI segments, I think we will have segments with less than ff:1f.7.
Hence we need these variables.

-Vasant

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2 4/4] iommu/vt-d: Remove hard coding PGSNP bit in PASID entries

2022-05-05 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, May 5, 2022 9:07 AM
> 
> As enforce_cache_coherency has been introduced into the
> iommu_domain_ops,
> the kernel component which owns the iommu domain is able to opt-in its
> requirement for force snooping support. The iommu driver has no need to
> hard code the page snoop control bit in the PASID table entries anymore.
> 
> Signed-off-by: Lu Baolu 

Reviewed-by: Kevin Tian , with one nit:

> ---
>  drivers/iommu/intel/pasid.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 41a0e3b02c79..0abfa7fc7fb0 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -710,9 +710,6 @@ int intel_pasid_setup_second_level(struct
> intel_iommu *iommu,
>   pasid_set_fault_enable(pte);
>   pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));

Probably in a separate patch but above should really be renamed
to pasid_set_page_walk_snoop().

> 
> - if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
> - pasid_set_pgsnp(pte);
> -
>   /*
>* Since it is a second level only translation setup, we should
>* set SRE bit as well (addresses are expected to be GPAs).
> --
> 2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2 3/4] iommu/vt-d: Remove domain_update_iommu_snooping()

2022-05-05 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, May 5, 2022 9:07 AM
> 
> The IOMMU force snooping capability is not required to be consistent
> among all the IOMMUs anymore. Remove force snooping capability check
> in the IOMMU hot-add path and domain_update_iommu_snooping()
> becomes
> dead code now.
> 
> Signed-off-by: Lu Baolu 
> Reviewed-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 

> ---
>  drivers/iommu/intel/iommu.c | 34 +-
>  1 file changed, 1 insertion(+), 33 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 98112228ae93..da4bfb34ae4b 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -533,33 +533,6 @@ static void
> domain_update_iommu_coherency(struct dmar_domain *domain)
>   rcu_read_unlock();
>  }
> 
> -static bool domain_update_iommu_snooping(struct intel_iommu *skip)
> -{
> - struct dmar_drhd_unit *drhd;
> - struct intel_iommu *iommu;
> - bool ret = true;
> -
> - rcu_read_lock();
> - for_each_active_iommu(iommu, drhd) {
> - if (iommu != skip) {
> - /*
> -  * If the hardware is operating in the scalable mode,
> -  * the snooping control is always supported since we
> -  * always set PASID-table-entry.PGSNP bit if the
> domain
> -  * is managed outside (UNMANAGED).
> -  */
> - if (!sm_supported(iommu) &&
> - !ecap_sc_support(iommu->ecap)) {
> - ret = false;
> - break;
> - }
> - }
> - }
> - rcu_read_unlock();
> -
> - return ret;
> -}
> -
>  static int domain_update_iommu_superpage(struct dmar_domain *domain,
>struct intel_iommu *skip)
>  {
> @@ -3593,12 +3566,7 @@ static int intel_iommu_add(struct
> dmar_drhd_unit *dmaru)
>   iommu->name);
>   return -ENXIO;
>   }
> - if (!ecap_sc_support(iommu->ecap) &&
> - domain_update_iommu_snooping(iommu)) {
> - pr_warn("%s: Doesn't support snooping.\n",
> - iommu->name);
> - return -ENXIO;
> - }
> +
>   sp = domain_update_iommu_superpage(NULL, iommu) - 1;
>   if (sp >= 0 && !(cap_super_page_val(iommu->cap) & (1 << sp))) {
>   pr_warn("%s: Doesn't support large page.\n",
> --
> 2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2 2/4] iommu/vt-d: Check domain force_snooping against attached devices

2022-05-05 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, May 5, 2022 9:07 AM
> 
> As domain->force_snooping only impacts the devices attached with the
> domain, there's no need to check against all IOMMU units. At the same
> time, for a brand new domain (hasn't been attached to any device), the
> force_snooping field could be set, but the attach_dev callback will
> return failure if it wants to attach to a device which IOMMU has no
> snoop control capability.

The description about the brand new domain is not very clear. I think the
point here is that force_snooping could be set on a domain no matter
whether it has been attached or not, and once set it is an immutable
flag. If no device is attached the operation always succeeds, and then
this empty domain can only be attached to a device whose IOMMU
supports snoop control.

>  static bool intel_iommu_enforce_cache_coherency(struct iommu_domain
> *domain)
>  {
>   struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> 
> - if (!domain_update_iommu_snooping(NULL))
> + if (dmar_domain->force_snooping)
> + return true;
> +
> + if (!domain_support_force_snooping(dmar_domain))
>   return false;
> +

Who guarantees that domain->devices won't change between
above condition check and following set operation?

> + domain_set_force_snooping(dmar_domain);
>   dmar_domain->force_snooping = true;
> +
>   return true;
>  }
> 

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH v2 1/4] iommu/vt-d: Block force-snoop domain attaching if no SC support

2022-05-05 Thread Tian, Kevin
> From: Lu Baolu 
> Sent: Thursday, May 5, 2022 9:07 AM
> 
> In the attach_dev callback of the default domain ops, if the domain has
> been set force_snooping, but the iommu hardware of the device does not
> support SC(Snoop Control) capability, the callback should block it and
> return a corresponding error code.
> 
> Signed-off-by: Lu Baolu 
> Reviewed-by: Jason Gunthorpe 

Reviewed-by: Kevin Tian 

> ---
>  drivers/iommu/intel/iommu.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index cf43e8f9091b..d68f5bbf3e93 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4354,6 +4354,9 @@ static int prepare_domain_attach_device(struct
> iommu_domain *domain,
>   if (!iommu)
>   return -ENODEV;
> 
> + if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
> + return -EOPNOTSUPP;
> +
>   /* check if this iommu agaw is sufficient for max mapped address */
>   addr_width = agaw_to_width(iommu->agaw);
>   if (addr_width > cap_mgaw(iommu->cap))
> --
> 2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 10/12] iommu: Prepare IOMMU domain for IOPF

2022-05-05 Thread Baolu Lu

On 2022/5/4 02:20, Jean-Philippe Brucker wrote:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7cae631c1baa..33449523afbe 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -3174,3 +3174,24 @@ void iommu_detach_device_pasid(struct iommu_domain 
*domain,
  
  	iommu_group_put(group);

  }
+
+struct iommu_domain *iommu_get_domain_for_dev_pasid(struct device *dev,
+   ioasid_t pasid)
+{
+   struct iommu_domain *domain;
+   struct iommu_group *group;
+
+   if (!pasid_valid(pasid))
+   return NULL;
+
+   group = iommu_group_get(dev);
+   if (!group)
+   return NULL;
+
+   mutex_lock(&group->mutex);

Unfortunately this still causes the deadlock when unbind() flushes the
IOPF queue while holding the group mutex.


Sorry, I didn't get your point here.

Do you mean unbind() could hold group mutex before calling this helper?
The group mutex is only available in iommu.c. The unbind() has no means
to hold this lock. Or did I miss anything?

Best regards,
baolu



If we make this function private to IOPF, then we can get rid of this
mutex_lock(). It's OK because:

* xarray protects its internal state with RCU, so we can call
   xa_load() outside the lock.

* The domain obtained from xa_load is finalized. Its content is valid
   because xarray stores the domain using rcu_assign_pointer(), which has a
   release memory barrier, which pairs with data dependencies in IOPF
   (domain->sva_ioas etc).

   We'll need to be careful about this when allowing other users to install
   a fault handler. Should be fine as long as the handler and data are
   installed before the domain is added to pasid_array.

* We know the domain is valid the whole time IOPF is using it, because
   unbind() waits for pending faults.

We just need a comment explaining the last point, something like:

/*
* Safe to fetch outside the group mutex because:
 * - xarray protects its internal state with RCU
 * - the domain obtained is either NULL or fully formed
* - the IOPF work is the only caller and is flushed before the
*   domain is freed.
 */

Thanks,
Jean


+   domain = xa_load(&group->pasid_array, pasid);
+   mutex_unlock(&group->mutex);
+   iommu_group_put(group);
+
+   return domain;
+}


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, May 3, 2022 2:53 AM
> 
> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
> > On Fri, 29 Apr 2022 05:45:20 +
> > "Tian, Kevin"  wrote:
> > > > From: Joao Martins 
> > > >  3) Unmapping an IOVA range while returning its dirty bit prior to
> > > > unmap. This case is specific for non-nested vIOMMU case where an
> > > > erroneous guest (or device) DMAing to an address being unmapped at
> the
> > > > same time.
> > >
> > > an erroneous attempt like above cannot anticipate which DMAs can
> > > succeed in that window thus the end behavior is undefined. For an
> > > undefined behavior nothing will be broken by losing some bits dirtied
> > > in the window between reading back dirty bits of the range and
> > > actually calling unmap. From guest p.o.v. all those are black-box
> > > hardware logic to serve a virtual iotlb invalidation request which just
> > > cannot be completed in one cycle.
> > >
> > > Hence in reality probably this is not required except to meet vfio
> > > compat requirement. Just in concept returning dirty bits at unmap
> > > is more accurate.
> > >
> > > I'm slightly inclined to abandon it in iommufd uAPI.
> >
> > Sorry, I'm not following why an unmap with returned dirty bitmap
> > operation is specific to a vIOMMU case, or in fact indicative of some
> > sort of erroneous, racy behavior of guest or device.
> 
> It is being compared against the alternative which is to explicitly
> query dirty then do a normal unmap as two system calls and permit a
> race.
> 
> The only case with any difference is if the guest is racing DMA with
> the unmap - in which case it is already indeterminate for the guest if
> the DMA will be completed or not.
> 
> eg on the vIOMMU case if the guest races DMA with unmap then we are
> already fine with throwing away that DMA because that is how the race
> resolves during non-migration situations, so resolving it as throwing
> away the DMA during migration is OK too.
> 
> > We need the flexibility to support memory hot-unplug operations
> > during migration,
> 
> I would have thought that hotplug during migration would simply
> discard all the data - how does it use the dirty bitmap?
> 
> > This was implemented as a single operation specifically to avoid
> > races where ongoing access may be available after retrieving a
> > snapshot of the bitmap.  Thanks,
> 
> The issue is the cost.
> 
> On a real iommu eliminating the race is expensive as we have to write
> protect the pages before query dirty, which seems to be an extra IOTLB
> flush.
> 
> It is not clear if paying this cost to become atomic is actually
> something any use case needs.
> 
> So, I suggest we think about a 3rd op 'write protect and clear
> dirties' that will be followed by a normal unmap - the extra op will
> have the extra overhead and userspace can decide if it wants to pay or
> not vs the non-atomic read dirties operation. And lets have a use case
> where this must be atomic before we implement it..

and write-protection also relies on the support of I/O page fault...

> 
> The downside is we loose a little bit of efficiency by unbundling
> these steps, the upside is that it doesn't require quite as many
> special iommu_domain/etc paths.
> 
> (Also Joao, you should probably have a read and do not clear dirty
> operation with the idea that the next operation will be unmap - then
> maybe we can avoid IOTLB flushing..)
> 
> Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking

2022-05-05 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Friday, April 29, 2022 8:39 PM
> 
> > >> * There's no capabilities API in IOMMUFD, and in this RFC each vendor
> tracks
> > >
> > > there was discussion adding device capability uAPI somewhere.
> > >
> > ack let me know if there was snippets to the conversation as I seem to have
> missed that.
> 
> It was just discussion pending something we actually needed to report.
> 
> Would be a very simple ioctl taking in the device ID and filling a
> struct of stuff.
> 
> > > probably this can be reported as a device cap, as support for the dirty bit
> > > is an immutable property of the iommu serving that device.
> 
> It is an easier fit to read it out of the iommu_domain after device
> attach though - since we don't need to build new kernel infrastructure
> to query it from a device.
> 
> > > Userspace can
> > > enable dirty tracking on a hwpt if all attached devices claim the support
> > > and kernel will does the same verification.
> >
> > Sorry to be dense but this is not up to 'devices' given they take no
> > part in the tracking?  I guess by 'devices' you mean the software
> > idea of it i.e. the iommu context created for attaching a said
> > physical device, not the physical device itself.
> 
> Indeed, an hwpt represents an iommu_domain and if the iommu_domain
> has
> dirty tracking ops set then that is an inherent property of the domain
> and does not suddenly go away when a new device is attached.
> 

In concept this is an iommu property instead of a domain property.
The two are equivalent only if the iommu driver registers
dirty tracking ops only when all IOMMUs in the platform support
the capability, i.e. sort of managing this iommu property in a global
way.

But the global way sort of conflicts with the on-going direction making
iommu capability truly per-iommu (though I'm not sure whether
heterogeneity would exist for dirty tracking). Following that trend
a domain property is not inherent as it is meaningless if no device is
attached at all.

From this angle IMHO it's more reasonable to report this IOMMU
property to userspace via a device capability. If all devices attached
to a hwpt claim IOMMU dirty tracking capability, the user can call
set_dirty_tracking() on the hwpt object. Once dirty tracking is
enabled on a hwpt, further attaching a device which doesn't claim
this capability is simply rejected.
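
In pseudocode the policy would be something like the below (every name
here is hypothetical, since none of this exists yet):

	/* Hypothetical sketch of the proposed policy, not real iommufd code. */
	static int hwpt_attach_device(struct iommufd_hw_pagetable *hwpt,
				      struct iommufd_device *idev)
	{
		/* dirty tracking already on: only capable devices may join */
		if (hwpt->dirty_tracking_enabled &&
		    !device_iommu_capable(idev->dev, IOMMU_CAP_DIRTY_TRACKING))
			return -EINVAL;

		return iommu_attach_group(hwpt->domain, idev->group);
	}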

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support

2022-05-05 Thread Shameerali Kolothum Thodi via iommu



> -Original Message-
> From: Joao Martins [mailto:joao.m.mart...@oracle.com]
> Sent: 29 April 2022 12:05
> To: Tian, Kevin 
> Cc: Joerg Roedel ; Suravee Suthikulpanit
> ; Will Deacon ; Robin
> Murphy ; Jean-Philippe Brucker
> ; zhukeqian ;
> Shameerali Kolothum Thodi ;
> David Woodhouse ; Lu Baolu
> ; Jason Gunthorpe ; Nicolin Chen
> ; Yishai Hadas ; Eric Auger
> ; Liu, Yi L ; Alex Williamson
> ; Cornelia Huck ;
> k...@vger.kernel.org; iommu@lists.linux-foundation.org
> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add
> set_dirty_tracking_range() support
> 
> On 4/29/22 09:28, Tian, Kevin wrote:
> >> From: Joao Martins 
> >> Sent: Friday, April 29, 2022 5:09 AM
> >>
> >> Similar to .read_and_clear_dirty() use the page table
> >> walker helper functions and set DBM|RDONLY bit, thus
> >> switching the IOPTE to writeable-clean.
> >
> > this should not be one-off if the operation needs to be
> > applied to IOPTE. Say a map request comes right after
> > set_dirty_tracking() is called. If it's agreed to remove
> > the range op then smmu driver should record the tracking
> > status internally and then apply the modifier to all the new
> > mappings automatically before dirty tracking is disabled.
> > Otherwise the same logic needs to be kept in iommufd to
> > call set_dirty_tracking_range() explicitly for every new
> > iopt_area created within the tracking window.
> 
> Gah, I totally missed that by mistake. New mappings aren't
> carrying over the "DBM is set". This needs a new io-pgtable
> quirk added post dirty-tracking toggling.
> 
> I can adjust, but I am at odds on including this in a future
> iteration given that I can't really test any of this stuff.
> Might drop the driver until I have hardware/emulation I can
> use (or maybe others can take over this). It was included
> for revising the iommu core ops and whether iommufd was
> affected by it.

[+Kunkun Jiang]. I think he is now looking into this and might have
a test setup to verify this.

Thanks,
Shameer


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 07/12] arm-smmu-v3/sva: Add SVA domain support

2022-05-05 Thread Baolu Lu

On 2022/5/4 02:12, Jean-Philippe Brucker wrote:

On Mon, May 02, 2022 at 09:48:37AM +0800, Lu Baolu wrote:

Add support for SVA domain allocation and provide an SVA-specific
iommu_domain_ops.

Signed-off-by: Lu Baolu 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   | 14 +++
  .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   | 42 +++
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 21 ++
  3 files changed, 77 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index cd48590ada30..7631c00fdcbd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -759,6 +759,10 @@ struct iommu_sva *arm_smmu_sva_bind(struct device *dev, 
struct mm_struct *mm,
  void arm_smmu_sva_unbind(struct iommu_sva *handle);
  u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle);
  void arm_smmu_sva_notifier_synchronize(void);
+int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain,
+ struct device *dev, ioasid_t id);
+void arm_smmu_sva_detach_dev_pasid(struct iommu_domain *domain,
+  struct device *dev, ioasid_t id);
  #else /* CONFIG_ARM_SMMU_V3_SVA */
  static inline bool arm_smmu_sva_supported(struct arm_smmu_device *smmu)
  {
@@ -804,5 +808,15 @@ static inline u32 arm_smmu_sva_get_pasid(struct iommu_sva 
*handle)
  }
  
  static inline void arm_smmu_sva_notifier_synchronize(void) {}

+
+static inline int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain,
+   struct device *dev, ioasid_t id)
+{
+   return -ENODEV;
+}
+
+static inline void arm_smmu_sva_detach_dev_pasid(struct iommu_domain *domain,
+struct device *dev,
+ioasid_t id) {}
  #endif /* CONFIG_ARM_SMMU_V3_SVA */
  #endif /* _ARM_SMMU_V3_H */
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index c623dae1e115..3b843cd3ed67 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -541,3 +541,45 @@ void arm_smmu_sva_notifier_synchronize(void)
 */
mmu_notifier_synchronize();
  }
+
+int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain,
+ struct device *dev, ioasid_t id)
+{
+   int ret = 0;
+   struct iommu_sva *handle;
+   struct mm_struct *mm = iommu_sva_domain_mm(domain);
+
+   if (domain->type != IOMMU_DOMAIN_SVA || !mm)


We wouldn't get that far with a non-SVA domain since iommu_sva_domain_mm()
would dereference a NULL pointer. Could you move it after the domain->type
check, and maybe add a WARN_ON()?  It could help catch issues in future
API changes.


Sure. I will make it like this,

int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain,
  struct device *dev, ioasid_t id)
{
int ret = 0;
struct mm_struct *mm;
struct iommu_sva *handle;

if (domain->type != IOMMU_DOMAIN_SVA)
return -EINVAL;

mm = iommu_sva_domain_mm(domain);
if (WARN_ON(!mm))
return -ENODEV;
... ...




+   return -EINVAL;
+
+   mutex_lock(&sva_lock);
+   handle = __arm_smmu_sva_bind(dev, mm);
+   if (IS_ERR(handle))
+   ret = PTR_ERR(handle);
+   mutex_unlock(&sva_lock);
+
+   return ret;
+}
+
+void arm_smmu_sva_detach_dev_pasid(struct iommu_domain *domain,
+  struct device *dev, ioasid_t id)
+{
+   struct arm_smmu_bond *bond = NULL, *t;
+   struct mm_struct *mm = iommu_sva_domain_mm(domain);
+   struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+   mutex_lock(&sva_lock);
+   list_for_each_entry(t, &master->bonds, list) {
+   if (t->mm == mm) {
+   bond = t;
+   break;
+   }
+   }
+
+   if (!WARN_ON(!bond) && refcount_dec_and_test(&bond->refs)) {
+   list_del(&bond->list);
+   arm_smmu_mmu_notifier_put(bond->smmu_mn);
+   kfree(bond);
+   }
+   mutex_unlock(&sva_lock);
+}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index afc63fce6107..bd80de0bad98 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1995,10 +1995,31 @@ static bool arm_smmu_capable(enum iommu_cap cap)
}
  }
  
+static void arm_smmu_sva_domain_free(struct iommu_domain *domain)

+{
+   kfree(domain);
+}
+
+static const struct iommu_domain_ops arm_smmu_sva_domain_ops = {
+   .attach_dev_pasid   = arm_smmu_sva_attach_dev_pasid,
+   .detach_dev_pasid   = arm_smmu_sva_detach_dev_pasid,
+   .free   = arm_smmu_sva_domain_free,

Re: [PATCH v5 04/12] iommu/sva: Basic data structures for SVA

2022-05-05 Thread Baolu Lu

On 2022/5/4 02:09, Jean-Philippe Brucker wrote:

On Mon, May 02, 2022 at 09:48:34AM +0800, Lu Baolu wrote:

Use the below data structures for the SVA implementation in the IOMMU core:

- struct iommu_sva_ioas
   Represents the I/O address space shared with an application CPU address
   space. This structure has a 1:1 relationship with an mm_struct. It
   grabs a "mm->mm_count" refcount during creation and drops it on release.


Do we actually need this structure?  At the moment it only keeps track of
bonds, which we can move to struct dev_iommu. Replacing it with an mm
pointer in struct iommu_domain simplifies the driver and seems to work.

Fair enough.

+struct iommu_sva_ioas {
+   struct mm_struct *mm;
+   ioasid_t pasid;
+
+   /* Counter of domains attached to this ioas. */
+   refcount_t users;
+
+   /* All bindings are linked here. */
+   struct list_head bonds;
+};

By moving @mm to struct iommu_domain and @bonds to struct dev_iommu, the
code looks simpler. The mm, the sva domain and the per-device dev_iommu
are all guaranteed to be valid during bind() and unbind().

Will head in this direction in the next version.
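Roughly, that direction looks like the sketch below (illustrative only; the field names are tentative, not from a posted patch):

/* Sketch only: tentative layout after dropping struct iommu_sva_ioas. */
struct iommu_domain {
	unsigned type;
	const struct iommu_domain_ops *ops;
	/* ... existing fields ... */
	struct mm_struct *mm;		/* set only for IOMMU_DOMAIN_SVA */
};

struct dev_iommu {
	/* ... existing fields ... */
	struct list_head sva_bonds;	/* struct iommu_sva, linked via @node */
};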



Thanks,
Jean


Best regards,
baolu





- struct iommu_domain (IOMMU_DOMAIN_SVA type)
   Represents a hardware pagetable that the IOMMU hardware could use for
   SVA translation. Multiple iommu domains can be bound to an SVA ioas,
   and each grabs a refcount on the ioas to make sure the ioas is only
   freed after all domains have been unbound.

- struct iommu_sva
   Represents a bond between an SVA ioas and an iommu domain.
   If a bond already exists, it is reused and a reference is taken.
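For illustration, the reuse path boils down to something like the sketch below (the helper name is made up; only fields added by this patch are used, and error handling is trimmed):

/* Sketch only: look up an existing bond for @dev and @domain under
 * iommu_sva_lock; the caller allocates a new bond if this returns NULL.
 */
static struct iommu_sva *iommu_sva_bond_get(struct iommu_sva_ioas *ioas,
					    struct iommu_domain *domain,
					    struct device *dev)
{
	struct iommu_sva *handle;

	lockdep_assert_held(&iommu_sva_lock);

	list_for_each_entry(handle, &ioas->bonds, node) {
		if (handle->dev == dev && handle->domain == domain) {
			refcount_inc(&handle->users);
			return handle;
		}
	}

	return NULL;
}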

Signed-off-by: Lu Baolu 
---
  include/linux/iommu.h | 14 +-
  drivers/iommu/iommu-sva-lib.h |  1 +
  drivers/iommu/iommu-sva-lib.c | 18 ++
  3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ab36244d4e94..f582f434c513 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -42,6 +42,7 @@ struct notifier_block;
  struct iommu_sva;
  struct iommu_fault_event;
  struct iommu_dma_cookie;
+struct iommu_sva_ioas;
  
  /* iommu fault flags */

  #define IOMMU_FAULT_READ  0x0
@@ -64,6 +65,9 @@ struct iommu_domain_geometry {
  #define __IOMMU_DOMAIN_PT (1U << 2)  /* Domain is identity mapped   */
  #define __IOMMU_DOMAIN_DMA_FQ (1U << 3)  /* DMA-API uses flush queue*/
  
+#define __IOMMU_DOMAIN_SHARED	(1U << 4)  /* Page table shared from CPU  */

+#define __IOMMU_DOMAIN_HOST_VA (1U << 5)  /* Host CPU virtual address */
+
  /*
   * This are the possible domain-types
   *
@@ -86,6 +90,8 @@ struct iommu_domain_geometry {
  #define IOMMU_DOMAIN_DMA_FQ   (__IOMMU_DOMAIN_PAGING |\
 __IOMMU_DOMAIN_DMA_API |   \
 __IOMMU_DOMAIN_DMA_FQ)
+#define IOMMU_DOMAIN_SVA   (__IOMMU_DOMAIN_SHARED |\
+__IOMMU_DOMAIN_HOST_VA)
  
  struct iommu_domain {

unsigned type;
@@ -95,6 +101,7 @@ struct iommu_domain {
void *handler_token;
struct iommu_domain_geometry geometry;
struct iommu_dma_cookie *iova_cookie;
+   struct iommu_sva_ioas *sva_ioas;
  };
  
  static inline bool iommu_is_dma_domain(struct iommu_domain *domain)

@@ -628,7 +635,12 @@ struct iommu_fwspec {
   * struct iommu_sva - handle to a device-mm bond
   */
  struct iommu_sva {
-   struct device   *dev;
+   struct device   *dev;
+   struct iommu_sva_ioas   *sva_ioas;
+   struct iommu_domain *domain;
+   /* Link to sva ioas's bonds list */
+   struct list_headnode;
+   refcount_t  users;
  };
  
  int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode,

diff --git a/drivers/iommu/iommu-sva-lib.h b/drivers/iommu/iommu-sva-lib.h
index 8909ea1094e3..9c5e108e2c8a 100644
--- a/drivers/iommu/iommu-sva-lib.h
+++ b/drivers/iommu/iommu-sva-lib.h
@@ -10,6 +10,7 @@
  
  int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max);

  struct mm_struct *iommu_sva_find(ioasid_t pasid);
+struct mm_struct *iommu_sva_domain_mm(struct iommu_domain *domain);
  
  /* I/O Page fault */

  struct device;
diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c
index 106506143896..d524a402be3b 100644
--- a/drivers/iommu/iommu-sva-lib.c
+++ b/drivers/iommu/iommu-sva-lib.c
@@ -3,6 +3,8 @@
   * Helpers for IOMMU drivers implementing SVA
   */
  #include 
+#include 
+#include 
  #include 
  
  #include "iommu-sva-lib.h"

@@ -10,6 +12,22 @@
  static DEFINE_MUTEX(iommu_sva_lock);
  static DECLARE_IOASID_SET(iommu_sva_pasid);
  
+struct iommu_sva_ioas {

+   struct mm_struct *mm;
+   ioasid_t pasid;
+
+   /* Counter of domains attached to this ioas. */
+   refcount_t users;
+
+   /* All bindings are linked here. */
+   struct list_head bonds;
+};
+

Re: [PATCH v5 03/12] iommu: Add attach/detach_dev_pasid domain ops

2022-05-05 Thread Baolu Lu

On 2022/5/4 02:07, Jean-Philippe Brucker wrote:

On Mon, May 02, 2022 at 09:48:33AM +0800, Lu Baolu wrote:

Attaching an IOMMU domain to a PASID of a device is a generic operation
for modern IOMMU drivers which support PASID-granular DMA address
translation. Currently visible usage scenarios include (but are not
limited to):

  - SVA (Shared Virtual Address)
  - kernel DMA with PASID
  - hardware-assisted mediated devices

This adds a pair of common domain ops for this purpose and adds helpers
to attach/detach a domain to/from a {device, PASID}. Some buses, like
PCI, route packets without considering the PASID value. Thus a DMA target
address with PASID might be treated as P2P if the address falls into the
MMIO BAR of other devices in the group. To keep things simple, these
interfaces only apply to devices belonging to singleton groups whose
singleton status is immutable in fabric, i.e. not affected by hotplug.
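
For illustration, a caller would use the new pair roughly as follows (sketch only; error handling trimmed, and @sva_domain/@pasid are assumed to be set up by the caller):

	/* Sketch only: route DMA tagged with @pasid through @sva_domain. */
	ret = iommu_attach_device_pasid(sva_domain, dev, pasid);
	if (ret)
		return ret;

	/* ... the device now issues DMA tagged with @pasid ... */

	iommu_detach_device_pasid(sva_domain, dev, pasid);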

Signed-off-by: Lu Baolu 


Reviewed-by: Jean-Philippe Brucker 

just a nit below


---
  include/linux/iommu.h | 21 
  drivers/iommu/iommu.c | 76 +++
  2 files changed, 97 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b8ffaf2cb1d0..ab36244d4e94 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -263,6 +263,8 @@ struct iommu_ops {
   * struct iommu_domain_ops - domain specific operations
   * @attach_dev: attach an iommu domain to a device
   * @detach_dev: detach an iommu domain from a device
+ * @attach_dev_pasid: attach an iommu domain to a pasid of device
+ * @detach_dev_pasid: detach an iommu domain from a pasid of device
   * @map: map a physically contiguous memory region to an iommu domain
   * @map_pages: map a physically contiguous set of pages of the same size to
   * an iommu domain.
@@ -283,6 +285,10 @@ struct iommu_ops {
  struct iommu_domain_ops {
int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
+   int (*attach_dev_pasid)(struct iommu_domain *domain,
+   struct device *dev, ioasid_t pasid);
+   void (*detach_dev_pasid)(struct iommu_domain *domain,
+struct device *dev, ioasid_t pasid);
  
  	int (*map)(struct iommu_domain *domain, unsigned long iova,

   phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
@@ -678,6 +684,10 @@ int iommu_group_claim_dma_owner(struct iommu_group *group, 
void *owner);
  void iommu_group_release_dma_owner(struct iommu_group *group);
  bool iommu_group_dma_owner_claimed(struct iommu_group *group);
  
+int iommu_attach_device_pasid(struct iommu_domain *domain,

+ struct device *dev, ioasid_t pasid);
+void iommu_detach_device_pasid(struct iommu_domain *domain,
+  struct device *dev, ioasid_t pasid);
  #else /* CONFIG_IOMMU_API */
  
  struct iommu_ops {};

@@ -1051,6 +1061,17 @@ static inline bool iommu_group_dma_owner_claimed(struct 
iommu_group *group)
  {
return false;
  }
+
+static inline int iommu_attach_device_pasid(struct iommu_domain *domain,
+   struct device *dev, ioasid_t pasid)
+{
+   return -ENODEV;
+}
+
+static inline void iommu_detach_device_pasid(struct iommu_domain *domain,
+struct device *dev, ioasid_t pasid)
+{
+}
  #endif /* CONFIG_IOMMU_API */
  
  /**

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 29906bc16371..89c9d19ddb28 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -38,6 +38,7 @@ struct iommu_group {
struct kobject kobj;
struct kobject *devices_kobj;
struct list_head devices;
+   struct xarray pasid_array;
struct mutex mutex;
void *iommu_data;
void (*iommu_data_release)(void *iommu_data);
@@ -630,6 +631,7 @@ struct iommu_group *iommu_group_alloc(void)
	mutex_init(&group->mutex);
	INIT_LIST_HEAD(&group->devices);
	INIT_LIST_HEAD(&group->entry);
+	xa_init(&group->pasid_array);

	ret = ida_simple_get(&iommu_group_ida, 0, 0, GFP_KERNEL);

if (ret < 0) {
@@ -3190,3 +3192,77 @@ bool iommu_group_dma_owner_claimed(struct iommu_group 
*group)
return user;
  }
  EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
+
+/*
+ * Use standard PCI bus topology and isolation features to check immutable
+ * singleton. Otherwise, assume the bus is static and then singleton can
+ * know from the device count in the group.
+ */


The comment doesn't really add anything that can't be directly understood
from the code.


Yes. It's fine to remove it.

Best regards,
baolu


Re: [PATCH v5 02/12] iommu: Add pasid_bits field in struct dev_iommu

2022-05-05 Thread Baolu Lu

On 2022/5/4 02:02, Jean-Philippe Brucker wrote:

On Mon, May 02, 2022 at 09:48:32AM +0800, Lu Baolu wrote:

Use this field to save the pasid/ssid bits that a device is able to
support with its IOMMU hardware. It is a generic attribute of a device,
and lifting it into the per-device dev_iommu struct makes it possible
to allocate a PASID for a device without calls into the IOMMU drivers.
Any iommu driver which supports PASID-related features should set this
field before those features are enabled on the device.

For the initialization of this field in the VT-d driver, note that
info->pasid_supported is only set for PCI devices, so non-PCI SVA is
not supported yet. Setting this field only for PCI devices therefore
causes no functional change.
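
For illustration, a driver would set the field at probe time roughly like below (sketch only; the exact VT-d hunk is not quoted here, and @info is assumed to be the driver's per-device data):

	/* Sketch only: record the supported PASID width before any
	 * PASID feature is enabled for this device.
	 */
	if (dev_is_pci(dev) && info->pasid_supported)
		dev->iommu->pasid_bits =
			fls(pci_max_pasids(to_pci_dev(dev))) - 1;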

Signed-off-by: Lu Baolu

Reviewed-by: Jean-Philippe Brucker


Thank you! I really appreciate you reviewing my patches.

Best regards,
baolu
