Re: [PATCH v2 1/3] drm/nouveau/tegra: Use iommu_paging_domain_alloc()

2024-09-15 Thread Jason Gunthorpe
On Thu, Sep 05, 2024 at 12:26:31PM -0400, Lyude Paul wrote:
> I did take the one patch - but I'm happy to push the others to drm-misc
> (provided they all get reviewed. 2/3 seems to be reviewed already but not 3/3)

Did it get lost?

$ git reset --hard next-20240913
$ git grep 'iommu_domain_alloc(' drivers/gpu/
drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c: tdev->iommu.domain = iommu_domain_alloc(&platform_bus_type);

??

 drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c |  4 ++--
 drivers/gpu/drm/rockchip/rockchip_drm_drv.c        | 10 +++---
 drivers/gpu/drm/tegra/drm.c                        |  5 +++--

Now we have the other two but not nouveau?

Jason


Re: [PATCH v2 1/3] drm/nouveau/tegra: Use iommu_paging_domain_alloc()

2024-09-09 Thread Jason Gunthorpe
On Thu, Sep 05, 2024 at 12:26:31PM -0400, Lyude Paul wrote:
> I did take the one patch - but I'm happy to push the others to drm-misc
> (provided they all get reviewed. 2/3 seems to be reviewed already but not 3/3)

The whole series is acked now, can you pick it up please?

Thanks,
Jason


Re: [PATCH v2 1/3] drm/nouveau/tegra: Use iommu_paging_domain_alloc()

2024-09-05 Thread Jason Gunthorpe
On Wed, Sep 04, 2024 at 03:06:07PM -0400, Lyude Paul wrote:
> Reviewed-by: Lyude Paul 
> 
> Will handle pushing it to drm-misc in just a moment

Did you just take this one patch? 

Who will take the rest of the series for DRM?

Jason


Re: [PATCH v2 02/37] drm/nouveau: handle pci/tegra drm_dev_{alloc, register} from common code

2024-07-28 Thread Jason Gunthorpe
On Sun, Jul 28, 2024 at 11:34:14PM +0200, Danilo Krummrich wrote:
> On Sun, Jul 28, 2024 at 03:13:08PM -0300, Jason Gunthorpe wrote:

> I think we're on the same page with all that. As clarified in [1], that's not
> a big concern, I was referring to the changes required to integrate the
> auxbus stuff.

Well, I see this thread having the realization that things are not
set up properly to use devres. To be fair, devres creates almost as many
bugs as it solves :\ cleanup.h is possibly a better option for most
simple things and harder to misuse...
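
For illustration, a minimal sketch of the cleanup.h style (hypothetical names,
not code from this series): the allocation is dropped with kfree() and the
mutex is released automatically on every exit path.

#include <linux/cleanup.h>
#include <linux/mutex.h>
#include <linux/slab.h>

struct demo_state {
	int id;
};

static DEFINE_MUTEX(demo_lock);

static int demo_setup(void)
{
	/* freed with kfree() automatically when it goes out of scope */
	struct demo_state *state __free(kfree) =
		kzalloc(sizeof(*state), GFP_KERNEL);

	if (!state)
		return -ENOMEM;

	guard(mutex)(&demo_lock);	/* dropped on every return path */

	state->id = 1;
	/* on success, ownership could be handed out with no_free_ptr(state) */
	return 0;
}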

> > normal (though most subsystems would call that unregister, not put)
> 
> A DRM device is reference counted and can out-live the driver, hence the
> drm_dev_put() call in .remove(). There is also a special drm_dev_unplug()
> function, which does not only unregister the DRM device, but also sets a guard
> to prevent HW accesses once the HW isn't accessible anymore.

Every subsystem has a refcounted object; struct device is inherently
refcounted. You call the thing the driver calls during .remove()
'unregister' because it is special. Once it returns, the subsystem has
to promise no more code is running in driver callbacks and the driver
is permitted to start destroying anything it might need to use when
processing any callbacks.
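
As a rough illustration of that contract for DRM (hypothetical driver, not
code from this thread): .remove() unregisters and drops the driver's
reference, while drm_dev_enter()/drm_dev_exit() guard any later HW access.

#include <drm/drm_drv.h>
#include <linux/platform_device.h>

struct demo_gpu {
	struct drm_device drm;	/* embedded, refcounted DRM device */
	void __iomem *mmio;
};

static void demo_remove(struct platform_device *pdev)
{
	struct demo_gpu *gpu = platform_get_drvdata(pdev);

	drm_dev_unplug(&gpu->drm);	/* unregister + block further HW access */
	drm_dev_put(&gpu->drm);		/* may or may not be the final reference */
}

static int demo_touch_hw(struct demo_gpu *gpu)
{
	int idx;

	if (!drm_dev_enter(&gpu->drm, &idx))	/* fails after drm_dev_unplug() */
		return -ENODEV;
	/* ... safe to touch gpu->mmio here ... */
	drm_dev_exit(idx);
	return 0;
}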

This is really tricky and people routinely misunderstand the
requirements and get this wrong. The consequence is UAF problems in
obscure cases with unbind races (that few actually care about), but
getting it right starts with labeling things properly :)

We went through this long ago in RDMA because someone actually had a
usecase of live driver unbind; making that work reliably under a full
active work load took some thoughtfulness.

Jason


Re: [PATCH v2 02/37] drm/nouveau: handle pci/tegra drm_dev_{alloc, register} from common code

2024-07-28 Thread Jason Gunthorpe
On Fri, Jul 26, 2024 at 11:07:19PM +1000, Ben Skeggs wrote:

> > Right, I think I took that too literally.
> > 
> > The lifetime of the DRM device (or more precisely one of its references) is
> > bound to the binding between the parent device and its corresponding driver.
> > 
> > But the lifetime of the parent device itself is bound to the DRM device.
> > 
> > So, yes this doesn't work, and proves the point that initializing the DRM
> > device with the parent's parent is just a workaround.
> 
> You're greatly overstating the "complexity" that's added here. It's a minor
> inconvenience that doesn't require much code at all to implement, and is
> essentially irrelevant outside of module load/unload.
> 
> I agree it's not ideal, and userspace should gain auxiliary bus support
> before a new driver implements a similar architecture, but it's really not
> that big a deal.

Ben asked me to share what other places are doing this stuff.

To recap, when converting a legacy driver into an aux split we've
found in several places that there is existing userspace that has
hardwired certain sysfs paths, i.e. an assumption that an infiniband
device appears under the sys/../pci/ directory.

Arguably this userspace is not in good shape, but we have to preserve
it.

So the approach is to make the sysfs-visible elements tied to the
original sysfs location (i.e. the pci device) and continue to use aux
otherwise for discovery, probing and tying subsystems together.

Obviously you have to be careful about the difference between the
sysfs parent (for owning a subordinate struct device, sysfs files,
etc.) and the probe time parent (for owning devres, and other tasks).
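
A rough sketch of that arrangement (hypothetical names, not from this
thread): the auxiliary device used for probing is parented to the original
PCI device, so the user-visible sysfs path stays under the PCI directory.

#include <linux/auxiliary_bus.h>
#include <linux/pci.h>
#include <linux/slab.h>

static void demo_adev_release(struct device *dev)
{
	kfree(container_of(dev, struct auxiliary_device, dev));
}

static int demo_register_aux(struct pci_dev *pdev)
{
	struct auxiliary_device *adev;
	int ret;

	adev = kzalloc(sizeof(*adev), GFP_KERNEL);
	if (!adev)
		return -ENOMEM;

	adev->name = "demo";
	adev->dev.parent = &pdev->dev;	/* sysfs parent stays the PCI device */
	adev->dev.release = demo_adev_release;

	ret = auxiliary_device_init(adev);
	if (ret) {
		kfree(adev);
		return ret;
	}

	ret = auxiliary_device_add(adev);	/* probes the matching aux driver */
	if (ret)
		auxiliary_device_uninit(adev);
	return ret;
}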

We've been fortunate enough that subsystems so far have had a clean
enough setup that this is easy enough to do. It sounds like DRM is the
same if it just requires calling a put in .remove() - that is pretty
normal (though most subsystems would call that unregister, not put)

Jason


[PATCH v2 0/7] IOMMU related FW parsing cleanup

2023-12-07 Thread Jason Gunthorpe
These are the patches from the prior series without the "fwspec
polishing":
 https://lore.kernel.org/r/0-v2-36a0088ecaa7+22c6e-iommu_fwspec_...@nvidia.com

This series does a few things to prepare for the next one:

- Clean up the call chains around dma_configure so the iommu_ops isn't being
  exposed.

- Add additional lockdep annotations now that we can.

- Fix some missed places that need to call tegra_dev_iommu_get_stream_id()

Based on Joerg's for-next with Robin's bus changes.

Robin's dma_base/size cleanup squashes the first patch, but we can't do
the ops removal in the other parts without it, so let's keep it
unsquashed.

v2:
 - Remove comments and bracket around tegra_dev_iommu_get_stream_id()
   in gp10b.c
 - Remove WARN_ON() in tegra186_mc_client_sid_override(), just return 0
 - Push the locking change to a later series
 - Drop the COMPILE_TEST improvement, not important enough to argue.
v1: 
https://lore.kernel.org/r/0-v1-720585788a7d+811b-iommu_fwspec_p1_...@nvidia.com

Jason Gunthorpe (7):
  iommu: Remove struct iommu_ops *iommu from arch_setup_dma_ops()
  iommmu/of: Do not return struct iommu_ops from of_iommu_configure()
  iommu/of: Use -ENODEV consistently in of_iommu_configure()
  iommu: Mark dev_iommu_get() with lockdep
  iommu: Mark dev_iommu_priv_set() with a lockdep
  acpi: Do not return struct iommu_ops from acpi_iommu_configure_id()
  iommu/tegra: Use tegra_dev_iommu_get_stream_id() in the remaining
places

 arch/arc/mm/dma.c |  2 +-
 arch/arm/mm/dma-mapping-nommu.c   |  2 +-
 arch/arm/mm/dma-mapping.c | 10 +--
 arch/arm64/mm/dma-mapping.c   |  4 +-
 arch/mips/mm/dma-noncoherent.c|  2 +-
 arch/riscv/mm/dma-noncoherent.c   |  2 +-
 drivers/acpi/scan.c   | 32 ++
 drivers/dma/tegra186-gpc-dma.c|  8 +--
 .../gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c   |  9 +--
 drivers/hv/hv_common.c|  2 +-
 drivers/iommu/amd/iommu.c |  2 -
 drivers/iommu/apple-dart.c|  1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c |  1 -
 drivers/iommu/intel/iommu.c   |  2 -
 drivers/iommu/iommu.c | 11 
 drivers/iommu/of_iommu.c  | 64 ---
 drivers/iommu/omap-iommu.c|  1 -
 drivers/memory/tegra/tegra186.c   | 14 ++--
 drivers/of/device.c   | 24 ---
 include/linux/dma-map-ops.h   |  4 +-
 include/linux/iommu.h |  5 +-
 include/linux/of_iommu.h  | 13 ++--
 23 files changed, 105 insertions(+), 111 deletions(-)


base-commit: 173ff345925a394284250bfa6e47d231e62031c7
-- 
2.43.0



[PATCH v2 5/7] iommu: Mark dev_iommu_priv_set() with a lockdep

2023-12-07 Thread Jason Gunthorpe
A perfect driver would only call dev_iommu_priv_set() from its probe
callback. We've made it functionally correct to call it from the of_xlate
by adding a lock around that call.

lockdep assert that iommu_probe_device_lock is held to discourage misuse.

Exclude PPC kernels with CONFIG_FSL_PAMU turned on because FSL_PAMU uses a
global static for its priv and abuses priv for its domain.

Remove the pointless stores of NULL, all these are on paths where the core
code will free dev->iommu after the op returns.

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/amd/iommu.c   | 2 --
 drivers/iommu/apple-dart.c  | 1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 1 -
 drivers/iommu/intel/iommu.c | 2 --
 drivers/iommu/iommu.c   | 9 +
 drivers/iommu/omap-iommu.c  | 1 -
 include/linux/iommu.h   | 5 +
 8 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 9f706436082833..be58644a6fa518 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -551,8 +551,6 @@ static void amd_iommu_uninit_device(struct device *dev)
if (dev_data->domain)
detach_device(dev);
 
-   dev_iommu_priv_set(dev, NULL);
-
/*
 * We keep dev_data around for unplugged devices and reuse it when the
 * device is re-plugged - not doing so would introduce a ton of races.
diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 7438e9c82ba982..25135440b5dd54 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -743,7 +743,6 @@ static void apple_dart_release_device(struct device *dev)
 {
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
 
-   dev_iommu_priv_set(dev, NULL);
kfree(cfg);
 }
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fc4317c25b6d53..1855d3892b15f8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2695,7 +2695,6 @@ static struct iommu_device *arm_smmu_probe_device(struct 
device *dev)
 
 err_free_master:
kfree(master);
-   dev_iommu_priv_set(dev, NULL);
return ERR_PTR(ret);
 }
 
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 4d09c004789274..adc7937fd8a3a3 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1420,7 +1420,6 @@ static void arm_smmu_release_device(struct device *dev)
 
arm_smmu_rpm_put(cfg->smmu);
 
-   dev_iommu_priv_set(dev, NULL);
kfree(cfg);
 }
 
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 897159dba47de4..511589341074f0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4461,7 +4461,6 @@ static struct iommu_device 
*intel_iommu_probe_device(struct device *dev)
ret = intel_pasid_alloc_table(dev);
if (ret) {
dev_err(dev, "PASID table allocation failed\n");
-   dev_iommu_priv_set(dev, NULL);
kfree(info);
return ERR_PTR(ret);
}
@@ -4479,7 +4478,6 @@ static void intel_iommu_release_device(struct device *dev)
dmar_remove_one_dev_info(dev);
intel_pasid_free_table(dev);
intel_iommu_debugfs_remove_dev(info);
-   dev_iommu_priv_set(dev, NULL);
kfree(info);
set_dma_ops(dev, NULL);
 }
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4323b6276e977f..08f29a1dfcd5f8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -387,6 +387,15 @@ static u32 dev_iommu_get_max_pasids(struct device *dev)
return min_t(u32, max_pasids, dev->iommu->iommu_dev->max_pasids);
 }
 
+void dev_iommu_priv_set(struct device *dev, void *priv)
+{
+   /* FSL_PAMU does something weird */
+   if (!IS_ENABLED(CONFIG_FSL_PAMU))
+   lockdep_assert_held(&iommu_probe_device_lock);
+   dev->iommu->priv = priv;
+}
+EXPORT_SYMBOL_GPL(dev_iommu_priv_set);
+
 /*
  * Init the dev->iommu and dev->iommu_group in the struct device and get the
  * driver probed
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index c66b070841dd41..c9528065a59afa 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1719,7 +1719,6 @@ static void omap_iommu_release_device(struct device *dev)
if (!dev->of_node || !arch_data)
return;
 
-   dev_iommu_priv_set(dev, NULL);
kfree(arch_data);
 
 }
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c7394

[PATCH v2 1/7] iommu: Remove struct iommu_ops *iommu from arch_setup_dma_ops()

2023-12-07 Thread Jason Gunthorpe
This is not being used to pass ops, it is just a way to tell if an
iommu driver was probed. These days this can be detected directly via
device_iommu_mapped(). Call device_iommu_mapped() in the two places that
need to check it and remove the iommu parameter everywhere.
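
For reference, device_iommu_mapped() is roughly just a check on
dev->iommu_group (sketch of the existing helper in include/linux/device.h),
which is why it can stand in for the old ops pointer:

static inline bool device_iommu_mapped(struct device *dev)
{
	return dev->iommu_group != NULL;
}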

Reviewed-by: Jerry Snitselaar 
Reviewed-by: Lu Baolu 
Reviewed-by: Moritz Fischer 
Acked-by: Christoph Hellwig 
Acked-by: Rob Herring 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 arch/arc/mm/dma.c   |  2 +-
 arch/arm/mm/dma-mapping-nommu.c |  2 +-
 arch/arm/mm/dma-mapping.c   | 10 +-
 arch/arm64/mm/dma-mapping.c |  4 ++--
 arch/mips/mm/dma-noncoherent.c  |  2 +-
 arch/riscv/mm/dma-noncoherent.c |  2 +-
 drivers/acpi/scan.c |  3 +--
 drivers/hv/hv_common.c  |  2 +-
 drivers/of/device.c |  2 +-
 include/linux/dma-map-ops.h |  4 ++--
 10 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 2a7fbbb83b7056..197707bc765889 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -91,7 +91,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
  * Plug in direct dma map ops.
  */
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {
/*
 * IOC hardware snoops all DMA traffic keeping the caches consistent
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index cfd9c933d2f09c..b94850b579952a 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -34,7 +34,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 }
 
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {
if (IS_ENABLED(CONFIG_CPU_V7M)) {
/*
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 5409225b4abc06..6c359a3af8d9c7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1713,7 +1713,7 @@ void arm_iommu_detach_device(struct device *dev)
 EXPORT_SYMBOL_GPL(arm_iommu_detach_device);
 
 static void arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool 
coherent)
+   bool coherent)
 {
struct dma_iommu_mapping *mapping;
 
@@ -1748,7 +1748,7 @@ static void arm_teardown_iommu_dma_ops(struct device *dev)
 #else
 
 static void arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool 
coherent)
+   bool coherent)
 {
 }
 
@@ -1757,7 +1757,7 @@ static void arm_teardown_iommu_dma_ops(struct device 
*dev) { }
 #endif /* CONFIG_ARM_DMA_USE_IOMMU */
 
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {
/*
 * Due to legacy code that sets the ->dma_coherent flag from a bus
@@ -1776,8 +1776,8 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, 
u64 size,
if (dev->dma_ops)
return;
 
-   if (iommu)
-   arm_setup_iommu_dma_ops(dev, dma_base, size, iommu, coherent);
+   if (device_iommu_mapped(dev))
+   arm_setup_iommu_dma_ops(dev, dma_base, size, coherent);
 
xen_setup_dma_ops(dev);
dev->archdata.dma_ops_setup = true;
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 3cb101e8cb29ba..61886e43e3a10f 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -47,7 +47,7 @@ void arch_teardown_dma_ops(struct device *dev)
 #endif
 
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {
int cls = cache_line_size_of_cpu();
 
@@ -58,7 +58,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 
size,
   ARCH_DMA_MINALIGN, cls);
 
dev->dma_coherent = coherent;
-   if (iommu)
+   if (device_iommu_mapped(dev))
iommu_setup_dma_ops(dev, dma_base, dma_base + size - 1);
 
xen_setup_dma_ops(dev);
diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
index 3c4fc97b9f394b..0f3cec663a12cd 100644
--- a/arch/mips/mm/dma-noncoherent.c
+++ b/arch/mips/mm/dma-noncoherent.c
@@ -138,7 +138,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 
 #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {

[PATCH v2 6/7] acpi: Do not return struct iommu_ops from acpi_iommu_configure_id()

2023-12-07 Thread Jason Gunthorpe
Nothing needs this pointer. Return a normal error code with the usual
IOMMU semantic that ENODEV means 'there is no IOMMU driver'.

Acked-by: Rafael J. Wysocki 
Reviewed-by: Jerry Snitselaar 
Reviewed-by: Lu Baolu 
Reviewed-by: Moritz Fischer 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/acpi/scan.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 444a0b3c72f2d8..340ba720c72129 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1562,8 +1562,7 @@ static inline const struct iommu_ops 
*acpi_iommu_fwspec_ops(struct device *dev)
return fwspec ? fwspec->ops : NULL;
 }
 
-static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
-  const u32 *id_in)
+static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in)
 {
int err;
const struct iommu_ops *ops;
@@ -1577,7 +1576,7 @@ static const struct iommu_ops 
*acpi_iommu_configure_id(struct device *dev,
ops = acpi_iommu_fwspec_ops(dev);
if (ops) {
mutex_unlock(&iommu_probe_device_lock);
-   return ops;
+   return 0;
}
 
err = iort_iommu_configure_id(dev, id_in);
@@ -1594,12 +1593,14 @@ static const struct iommu_ops 
*acpi_iommu_configure_id(struct device *dev,
 
/* Ignore all other errors apart from EPROBE_DEFER */
if (err == -EPROBE_DEFER) {
-   return ERR_PTR(err);
+   return err;
} else if (err) {
dev_dbg(dev, "Adding to IOMMU failed: %d\n", err);
-   return NULL;
+   return -ENODEV;
}
-   return acpi_iommu_fwspec_ops(dev);
+   if (!acpi_iommu_fwspec_ops(dev))
+   return -ENODEV;
+   return 0;
 }
 
 #else /* !CONFIG_IOMMU_API */
@@ -1611,10 +1612,9 @@ int acpi_iommu_fwspec_init(struct device *dev, u32 id,
return -ENODEV;
 }
 
-static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
-  const u32 *id_in)
+static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in)
 {
-   return NULL;
+   return -ENODEV;
 }
 
 #endif /* !CONFIG_IOMMU_API */
@@ -1628,7 +1628,7 @@ static const struct iommu_ops 
*acpi_iommu_configure_id(struct device *dev,
 int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr,
  const u32 *input_id)
 {
-   const struct iommu_ops *iommu;
+   int ret;
 
if (attr == DEV_DMA_NOT_SUPPORTED) {
set_dma_ops(dev, &dma_dummy_ops);
@@ -1637,10 +1637,15 @@ int acpi_dma_configure_id(struct device *dev, enum 
dev_dma_attr attr,
 
acpi_arch_dma_setup(dev);
 
-   iommu = acpi_iommu_configure_id(dev, input_id);
-   if (PTR_ERR(iommu) == -EPROBE_DEFER)
+   ret = acpi_iommu_configure_id(dev, input_id);
+   if (ret == -EPROBE_DEFER)
return -EPROBE_DEFER;
 
+   /*
+* Historically this routine doesn't fail driver probing due to errors
+* in acpi_iommu_configure_id()
+*/
+
arch_setup_dma_ops(dev, 0, U64_MAX, attr == DEV_DMA_COHERENT);
 
return 0;
-- 
2.43.0



[PATCH v2 4/7] iommu: Mark dev_iommu_get() with lockdep

2023-12-07 Thread Jason Gunthorpe
Allocation of dev->iommu must be done under the
iommu_probe_device_lock. Mark this with lockdep to discourage future
mistakes.

Reviewed-by: Jerry Snitselaar 
Tested-by: Hector Martin 
Reviewed-by: Lu Baolu 
Reviewed-by: Moritz Fischer 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0d25468d53a68a..4323b6276e977f 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -334,6 +334,8 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
 {
struct dev_iommu *param = dev->iommu;
 
+   lockdep_assert_held(&iommu_probe_device_lock);
+
if (param)
return param;
 
-- 
2.43.0



[PATCH v2 2/7] iommmu/of: Do not return struct iommu_ops from of_iommu_configure()

2023-12-07 Thread Jason Gunthorpe
Nothing needs this pointer. Return a normal error code with the usual
IOMMU semantic that ENODEV means 'there is no IOMMU driver'.

Reviewed-by: Jerry Snitselaar 
Reviewed-by: Lu Baolu 
Acked-by: Rob Herring 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/of_iommu.c | 31 +++
 drivers/of/device.c  | 22 +++---
 include/linux/of_iommu.h | 13 ++---
 3 files changed, 40 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index 5ecca53847d325..c6510d7e7b241b 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -107,16 +107,22 @@ static int of_iommu_configure_device(struct device_node 
*master_np,
  of_iommu_configure_dev(master_np, dev);
 }
 
-const struct iommu_ops *of_iommu_configure(struct device *dev,
-  struct device_node *master_np,
-  const u32 *id)
+/*
+ * Returns:
+ *  0 on success, an iommu was configured
+ *  -ENODEV if the device does not have any IOMMU
+ *  -EPROBEDEFER if probing should be tried again
+ *  -errno fatal errors
+ */
+int of_iommu_configure(struct device *dev, struct device_node *master_np,
+  const u32 *id)
 {
const struct iommu_ops *ops = NULL;
struct iommu_fwspec *fwspec;
int err = NO_IOMMU;
 
if (!master_np)
-   return NULL;
+   return -ENODEV;
 
/* Serialise to make dev->iommu stable under our potential fwspec */
mutex_lock(&iommu_probe_device_lock);
@@ -124,7 +130,7 @@ const struct iommu_ops *of_iommu_configure(struct device 
*dev,
if (fwspec) {
if (fwspec->ops) {
mutex_unlock(&iommu_probe_device_lock);
-   return fwspec->ops;
+   return 0;
}
/* In the deferred case, start again from scratch */
iommu_fwspec_free(dev);
@@ -169,14 +175,15 @@ const struct iommu_ops *of_iommu_configure(struct device 
*dev,
err = iommu_probe_device(dev);
 
/* Ignore all other errors apart from EPROBE_DEFER */
-   if (err == -EPROBE_DEFER) {
-   ops = ERR_PTR(err);
-   } else if (err < 0) {
-   dev_dbg(dev, "Adding to IOMMU failed: %d\n", err);
-   ops = NULL;
+   if (err < 0) {
+   if (err == -EPROBE_DEFER)
+   return err;
+   dev_dbg(dev, "Adding to IOMMU failed: %pe\n", ERR_PTR(err));
+   return err;
}
-
-   return ops;
+   if (!ops)
+   return -ENODEV;
+   return 0;
 }
 
 static enum iommu_resv_type __maybe_unused
diff --git a/drivers/of/device.c b/drivers/of/device.c
index 65c71be71a8d45..873d933e8e6d1d 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -93,12 +93,12 @@ of_dma_set_restricted_buffer(struct device *dev, struct 
device_node *np)
 int of_dma_configure_id(struct device *dev, struct device_node *np,
bool force_dma, const u32 *id)
 {
-   const struct iommu_ops *iommu;
const struct bus_dma_region *map = NULL;
struct device_node *bus_np;
u64 dma_start = 0;
u64 mask, end, size = 0;
bool coherent;
+   int iommu_ret;
int ret;
 
if (np == dev->of_node)
@@ -181,21 +181,29 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
dev_dbg(dev, "device is%sdma coherent\n",
coherent ? " " : " not ");
 
-   iommu = of_iommu_configure(dev, np, id);
-   if (PTR_ERR(iommu) == -EPROBE_DEFER) {
+   iommu_ret = of_iommu_configure(dev, np, id);
+   if (iommu_ret == -EPROBE_DEFER) {
/* Don't touch range map if it wasn't set from a valid 
dma-ranges */
if (!ret)
dev->dma_range_map = NULL;
kfree(map);
return -EPROBE_DEFER;
-   }
+   } else if (iommu_ret == -ENODEV) {
+   dev_dbg(dev, "device is not behind an iommu\n");
+   } else if (iommu_ret) {
+   dev_err(dev, "iommu configuration for device failed with %pe\n",
+   ERR_PTR(iommu_ret));
 
-   dev_dbg(dev, "device is%sbehind an iommu\n",
-   iommu ? " " : " not ");
+   /*
+* Historically this routine doesn't fail driver probing
+* due to errors in of_iommu_configure()
+*/
+   } else
+   dev_dbg(dev, "device is behind an iommu\n");
 
arch_setup_dma_ops(dev, dma_start, size, coherent);
 
-   if (!iommu)
+   if (iommu_ret)
of_dma_set_restricted_buffer(dev, 

[PATCH v2 3/7] iommu/of: Use -ENODEV consistently in of_iommu_configure()

2023-12-07 Thread Jason Gunthorpe
Instead of returning 1 and trying to handle positive error codes just
stick to the convention of returning -ENODEV. Remove references to ops
from of_iommu_configure(), a NULL ops will already generate an error code.

There is no reason to check dev->bus, if err=0 at this point then the
called configure functions thought there was an iommu and we should try to
probe it. Remove it.

Reviewed-by: Jerry Snitselaar 
Reviewed-by: Moritz Fischer 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/of_iommu.c | 49 
 1 file changed, 15 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index c6510d7e7b241b..164317bfb8a81f 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -17,8 +17,6 @@
 #include 
 #include 
 
-#define NO_IOMMU   1
-
 static int of_iommu_xlate(struct device *dev,
  struct of_phandle_args *iommu_spec)
 {
@@ -29,7 +27,7 @@ static int of_iommu_xlate(struct device *dev,
ops = iommu_ops_from_fwnode(fwnode);
if ((ops && !ops->of_xlate) ||
!of_device_is_available(iommu_spec->np))
-   return NO_IOMMU;
+   return -ENODEV;
 
ret = iommu_fwspec_init(dev, &iommu_spec->np->fwnode, ops);
if (ret)
@@ -61,7 +59,7 @@ static int of_iommu_configure_dev_id(struct device_node 
*master_np,
 "iommu-map-mask", &iommu_spec.np,
 iommu_spec.args);
if (err)
-   return err == -ENODEV ? NO_IOMMU : err;
+   return err;
 
err = of_iommu_xlate(dev, &iommu_spec);
of_node_put(iommu_spec.np);
@@ -72,7 +70,7 @@ static int of_iommu_configure_dev(struct device_node 
*master_np,
  struct device *dev)
 {
struct of_phandle_args iommu_spec;
-   int err = NO_IOMMU, idx = 0;
+   int err = -ENODEV, idx = 0;
 
while (!of_parse_phandle_with_args(master_np, "iommus",
   "#iommu-cells",
@@ -117,9 +115,8 @@ static int of_iommu_configure_device(struct device_node 
*master_np,
 int of_iommu_configure(struct device *dev, struct device_node *master_np,
   const u32 *id)
 {
-   const struct iommu_ops *ops = NULL;
struct iommu_fwspec *fwspec;
-   int err = NO_IOMMU;
+   int err;
 
if (!master_np)
return -ENODEV;
@@ -153,37 +150,21 @@ int of_iommu_configure(struct device *dev, struct 
device_node *master_np,
} else {
err = of_iommu_configure_device(master_np, dev, id);
}
-
-   /*
-* Two success conditions can be represented by non-negative err here:
-* >0 : there is no IOMMU, or one was unavailable for non-fatal reasons
-*  0 : we found an IOMMU, and dev->fwspec is initialised appropriately
-* <0 : any actual error
-*/
-   if (!err) {
-   /* The fwspec pointer changed, read it again */
-   fwspec = dev_iommu_fwspec_get(dev);
-   ops= fwspec->ops;
-   }
mutex_unlock(&iommu_probe_device_lock);
 
-   /*
-* If we have reason to believe the IOMMU driver missed the initial
-* probe for dev, replay it to get things in order.
-*/
-   if (!err && dev->bus)
-   err = iommu_probe_device(dev);
-
-   /* Ignore all other errors apart from EPROBE_DEFER */
-   if (err < 0) {
-   if (err == -EPROBE_DEFER)
-   return err;
-   dev_dbg(dev, "Adding to IOMMU failed: %pe\n", ERR_PTR(err));
+   if (err == -ENODEV || err == -EPROBE_DEFER)
return err;
-   }
-   if (!ops)
-   return -ENODEV;
+   if (err)
+   goto err_log;
+
+   err = iommu_probe_device(dev);
+   if (err)
+   goto err_log;
return 0;
+
+err_log:
+   dev_dbg(dev, "Adding to IOMMU failed: %pe\n", ERR_PTR(err));
+   return err;
 }
 
 static enum iommu_resv_type __maybe_unused
-- 
2.43.0



[PATCH v2 7/7] iommu/tegra: Use tegra_dev_iommu_get_stream_id() in the remaining places

2023-12-07 Thread Jason Gunthorpe
This API was defined to formalize the access to internal iommu details on
some Tegra SOCs, but a few callers got missed. Add them.

The helper already masks by 0xffff so remove this code from the callers.
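
For reference, a sketch of what that helper does (the exact definition lives
in include/linux/iommu.h):

static inline bool tegra_dev_iommu_get_stream_id(struct device *dev, u32 *stream_id)
{
#ifdef CONFIG_IOMMU_API
	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);

	if (fwspec && fwspec->num_ids == 1) {
		*stream_id = fwspec->ids[0] & 0xffff;
		return true;
	}
#endif
	return false;
}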

Suggested-by: Thierry Reding 
Reviewed-by: Thierry Reding 
Signed-off-by: Jason Gunthorpe 
---
 drivers/dma/tegra186-gpc-dma.c  |  8 +++-
 drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c |  9 ++---
 drivers/memory/tegra/tegra186.c | 14 --
 3 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/drivers/dma/tegra186-gpc-dma.c b/drivers/dma/tegra186-gpc-dma.c
index fa4d4142a68a21..88547a23825b18 100644
--- a/drivers/dma/tegra186-gpc-dma.c
+++ b/drivers/dma/tegra186-gpc-dma.c
@@ -1348,8 +1348,8 @@ static int tegra_dma_program_sid(struct tegra_dma_channel 
*tdc, int stream_id)
 static int tegra_dma_probe(struct platform_device *pdev)
 {
const struct tegra_dma_chip_data *cdata = NULL;
-   struct iommu_fwspec *iommu_spec;
-   unsigned int stream_id, i;
+   unsigned int i;
+   u32 stream_id;
struct tegra_dma *tdma;
int ret;
 
@@ -1378,12 +1378,10 @@ static int tegra_dma_probe(struct platform_device *pdev)
 
tdma->dma_dev.dev = &pdev->dev;
 
-   iommu_spec = dev_iommu_fwspec_get(&pdev->dev);
-   if (!iommu_spec) {
+   if (!tegra_dev_iommu_get_stream_id(&pdev->dev, &stream_id)) {
dev_err(&pdev->dev, "Missing iommu stream-id\n");
return -EINVAL;
}
-   stream_id = iommu_spec->ids[0] & 0xffff;
 
ret = device_property_read_u32(&pdev->dev, "dma-channel-mask",
   &tdma->chan_mask);
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c 
b/drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c
index e7e8fdf3adab7a..29682722b0b36b 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c
@@ -28,19 +28,14 @@ static void
 gp10b_ltc_init(struct nvkm_ltc *ltc)
 {
struct nvkm_device *device = ltc->subdev.device;
-   struct iommu_fwspec *spec;
+   u32 sid;
 
nvkm_wr32(device, 0x17e27c, ltc->ltc_nr);
nvkm_wr32(device, 0x17e000, ltc->ltc_nr);
nvkm_wr32(device, 0x100800, ltc->ltc_nr);
 
-   spec = dev_iommu_fwspec_get(device->dev);
-   if (spec) {
-   u32 sid = spec->ids[0] & 0xffff;
-
-   /* stream ID */
+   if (tegra_dev_iommu_get_stream_id(device->dev, &sid))
nvkm_wr32(device, 0x160000, sid << 2);
-   }
 }
 
 static const struct nvkm_ltc_func
diff --git a/drivers/memory/tegra/tegra186.c b/drivers/memory/tegra/tegra186.c
index 533f85a4b2bdb7..9cbf22a10a8270 100644
--- a/drivers/memory/tegra/tegra186.c
+++ b/drivers/memory/tegra/tegra186.c
@@ -111,9 +111,12 @@ static void tegra186_mc_client_sid_override(struct 
tegra_mc *mc,
 static int tegra186_mc_probe_device(struct tegra_mc *mc, struct device *dev)
 {
 #if IS_ENABLED(CONFIG_IOMMU_API)
-   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct of_phandle_args args;
unsigned int i, index = 0;
+   u32 sid;
+
+   if (!tegra_dev_iommu_get_stream_id(dev, &sid))
+   return 0;
 
while (!of_parse_phandle_with_args(dev->of_node, "interconnects", 
"#interconnect-cells",
   index, &args)) {
@@ -121,11 +124,10 @@ static int tegra186_mc_probe_device(struct tegra_mc *mc, 
struct device *dev)
for (i = 0; i < mc->soc->num_clients; i++) {
const struct tegra_mc_client *client = 
&mc->soc->clients[i];
 
-   if (client->id == args.args[0]) {
-   u32 sid = fwspec->ids[0] & 
MC_SID_STREAMID_OVERRIDE_MASK;
-
-   tegra186_mc_client_sid_override(mc, 
client, sid);
-   }
+   if (client->id == args.args[0])
+   tegra186_mc_client_sid_override(
+   mc, client,
+   sid & 
MC_SID_STREAMID_OVERRIDE_MASK);
}
}
 
-- 
2.43.0



Re: [Nouveau] [PATCH 10/10] ACPI: IORT: Allow COMPILE_TEST of IORT

2023-11-30 Thread Jason Gunthorpe
On Thu, Nov 30, 2023 at 02:10:48PM +, Robin Murphy wrote:
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > index 7673bb82945b6c..309378e76a9bc9 100644
> > --- a/drivers/iommu/Kconfig
> > +++ b/drivers/iommu/Kconfig
> > @@ -318,6 +318,7 @@ config ARM_SMMU
> > select IOMMU_API
> > select IOMMU_IO_PGTABLE_LPAE
> > select ARM_DMA_USE_IOMMU if ARM
> > +   select ACPI_IORT if ACPI
> 
> This is incomplete. If you want the driver to be responsible for enabling
> its own probing mechanisms then you need to select OF and ACPI too. 

Well, yes, we do have that minor issue today that drivers can be
compiled without any way to parse any FW and are thus completely
useless.

Certainly one could make the case this should be
   depends on OF || ACPI
   select ACPI_IORT if ACPI

And similar in other drivers so they have the minimum dependencies to
actually be able to work. This would be the correct way to use
kconfig.

But who cares? I'm not trying to fix everything here, I'm trying to
allow COMPILE_TEST for more sub components of this one driver.

> And all the other drivers which probe from IORT should surely also
> select ACPI_IORT, and thus ACPI as well. And maybe the PCI core
> should as well because there are general properties of PCI host
> bridges and devices described in there?

Now you are just arguing to an absurdity.

> But of course that's clearly backwards nonsense, because drivers do not and
> should not do that, so this change is not appropriate either.

This patch is about COMPILE_TEST.

> theoretical bug becomes real. There's really no practical value to be had
> from compile-testing IORT.

COMPILE_TEST is to make it easier to maintain the kernel code by
reducing the necessary combinations required to get complete compile
coverage. 100% compile test coverage is a laudable goal on its own.

I have no idea what you are talking about with "no practical value";
just because you don't use COMPILE_TEST doesn't mean it has "no
practical value". It exists, people like me use it, we can make it
better. Why is this even a point of debate? :(

Jason


Re: [Nouveau] [PATCH 10/10] ACPI: IORT: Allow COMPILE_TEST of IORT

2023-11-30 Thread Jason Gunthorpe
On Thu, Nov 30, 2023 at 12:12:26PM +0100, Lorenzo Pieralisi wrote:
> On Wed, Nov 29, 2023 at 03:12:40PM -0400, Jason Gunthorpe wrote:
> > On Wed, Nov 29, 2023 at 01:55:04PM +0100, Lorenzo Pieralisi wrote:
> > 
> > > I don't think it should be done this way. Is the goal compile testing
> > > IORT code ? 
> > 
> > Yes
> > 
> > > If so, why are we forcing it through the SMMU (only because
> > > it can be compile tested while eg SMMUv3 driver can't ?) menu entry ?
> > 
> > Because something needs to select it, and SMMU is one of the places
> > that are implicitly using it.
> > 
> > It isn't (and shouldn't be) a user selectable kconfig. Currently the
> > only thing that selects it is the ARM64 master kconfig.
> 
> I never said it should be a user selectable kconfig. I said that
> I don't like using the SMMU entry (only) to select it just because
> that entry allows COMPILE_TEST.

So you would like each of the drivers that use it to select it?

> > SMMUv3 doesn't COMPILE_TEST so it picks up the dependency transitivity
> > through ARM64. I'm not sure why IORT was put as a global ARM64 kconfig
> > dependency and not put in the places that directly need it.
> 
> Because IORT is used by a few ARM64 system IPs (that are enabled by
> default, eg GIC), it makes sense to have a generic ARM64 select (if ACPI).

IMHO that is not a good way to use kconfig, it is obfuscating and
doesn't support things like COMPILE_TEST.

> > > Maybe we can move IORT code into drivers/acpi and add a silent config
> > > option there with a dependency on ARM64 || COMPILE_TEST.
> > 
> > That seems pretty weird to me, this is the right way to approach it,
> > IMHO. Making an entire directory conditional is pretty incompatible with
> > COMPILE_TEST as a philosophy.
> 
> That's not what I was suggesting. I was suggesting to move iort.c (or
> some portions of it) into drivers/acpi if we care enough to compile test
> it on arches !ARM64.
> 
> It is also weird to have a file in drivers/acpi/arm64 that you want
> to compile test on other arches IMO (and I don't think it is very useful
> to compile test it either).

Why? Just because the directory is named "arm64" doesn't mean it
should be excluded from COMPILE_TEST. arch/arm64 yes, but not random
directories in the driver tree.

Stuff under drivers/ should strive to get 100% COMPILE_TEST coverage
as much as practical. This makes everyone else's life easier.

Jason


Re: [Nouveau] [PATCH 08/10] iommu/tegra: Use tegra_dev_iommu_get_stream_id() in the remaining places

2023-11-29 Thread Jason Gunthorpe
On Wed, Nov 29, 2023 at 05:23:13PM +0100, Thierry Reding wrote:
> > diff --git a/drivers/memory/tegra/tegra186.c 
> > b/drivers/memory/tegra/tegra186.c
> > index 533f85a4b2bdb7..3e4fbe94dd666e 100644
> > --- a/drivers/memory/tegra/tegra186.c
> > +++ b/drivers/memory/tegra/tegra186.c
> > @@ -111,21 +111,21 @@ static void tegra186_mc_client_sid_override(struct 
> > tegra_mc *mc,
> >  static int tegra186_mc_probe_device(struct tegra_mc *mc, struct device 
> > *dev)
> >  {
> >  #if IS_ENABLED(CONFIG_IOMMU_API)
> > -   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > struct of_phandle_args args;
> > unsigned int i, index = 0;
> > +   u32 sid;
> >  
> > +   WARN_ON(!tegra_dev_iommu_get_stream_id(dev, &sid));
> 
> I know the code previously didn't check for any errors, but we may want
> to do so now. If tegra_dev_iommu_get_stream_id() ever fails we may end
> up writing some undefined value into the override register.

My assumption was it never fails, otherwise this probably already
doesn't work?

> I'm also unsure if WARN_ON() is appropriate here. I vaguely recall that
> ->probe_device() was called for all devices on the bus and not all of
> them may have been associated with the IOMMU. Not all of them may in
> fact access memory in the first place.

So you are thinking that of_parse_phandle_with_args() is a NOP
sometimes so it will tolerate the failure?

Seems like the best thing to do is just continue to ignore it then?

> Perhaps I'm misremembering and the IOMMU core now takes care of only
> calling this when fwspec is indeed valid?

Can't advise, I have no idea what tegra_mc_ops is for :)

Jason



Re: [Nouveau] [PATCH 10/10] ACPI: IORT: Allow COMPILE_TEST of IORT

2023-11-29 Thread Jason Gunthorpe
On Wed, Nov 29, 2023 at 01:55:04PM +0100, Lorenzo Pieralisi wrote:

> I don't think it should be done this way. Is the goal compile testing
> IORT code ? 

Yes

> If so, why are we forcing it through the SMMU (only because
> it can be compile tested while eg SMMUv3 driver can't ?) menu entry ?

Because something needs to select it, and SMMU is one of the places
that are implicitly using it.

It isn't (and shouldn't be) a user selectable kconfig. Currently the
only thing that selects it is the ARM64 master kconfig.

> This looks a bit artificial (and it is unclear from the Kconfig
> file why only that driver selects IORT, it looks like eg the SMMUv3
> does not have the same dependency - there is also the SMMUv3 perf
> driver to consider).

SMMUv3 doesn't COMPILE_TEST so it picks up the dependency transitivity
through ARM64. I'm not sure why IORT was put as a global ARM64 kconfig
dependency and not put in the places that directly need it.

"perf driver" ? There is a bunch of GIC stuff that uses this too but I
don't know if it compile tests.

> Maybe we can move IORT code into drivers/acpi and add a silent config
> option there with a dependency on ARM64 || COMPILE_TEST.

That seems pretty weird to me, this is the right way to approach it,
IMHO. Making an entire directory conditional is pretty incompatible with
COMPILE_TEST as a philosophy.

> Don't know but at least it is clearer. As for the benefits of compile
> testing IORT code - yes the previous patch is a warning to fix but
> I am not so sure about the actual benefits.

IMHO COMPILE_TEST is an inherently good thing. It makes development
easier for everyone because you have a less fractured code base to
work with.

Jason


Re: [Nouveau] [PATCH 06/10] iommu: Replace iommu_device_lock with iommu_probe_device_lock

2023-11-29 Thread Jason Gunthorpe
On Wed, Nov 29, 2023 at 05:58:08PM +, Robin Murphy wrote:
> On 29/11/2023 12:48 am, Jason Gunthorpe wrote:
> > The iommu_device_lock protects the iommu_device_list which is only read by
> > iommu_ops_from_fwnode().
> > 
> > This is now always called under the iommu_probe_device_lock, so we don't
> > need to double lock the linked list. Use the iommu_probe_device_lock on
> > the write side too.
> 
> Please no, iommu_probe_device_lock() is a hack and we need to remove the
> *reason* it exists at all.

Yes, I agree that goal is good

However, it is doing a lot of things, removing it is not so easy.

One thing it is quietly doing is keeping the ops and iommu_device
pointers alive during the entire probe process against (deeply broken,
but whatever) concurrent iommu driver removal.

It is also protecting access to dev->iommu_group during the group
formation process.

So, it is a little more complex. My specific interest was to make it
not a spinlock.

> And IMO just because iommu_present() is
> deprecated doesn't justify making it look utterly nonsensical - in no way
> does that have any relationship with probe_device, much less need to
> serialise against it!

The naming is poor now, I agree, but it is not nonsensical since it
still holds the correct lock for the data it is accessing.

Thanks,
Jason


[Nouveau] [PATCH 03/10] iommu/of: Use -ENODEV consistently in of_iommu_configure()

2023-11-28 Thread Jason Gunthorpe
Instead of returning 1 and trying to handle positive error codes just
stick to the convention of returning -ENODEV. Remove references to ops
from of_iommu_configure(), a NULL ops will already generate an error code.

There is no reason to check dev->bus, if err=0 at this point then the
called configure functions thought there was an iommu and we should try to
probe it. Remove it.

Reviewed-by: Jerry Snitselaar 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/of_iommu.c | 49 
 1 file changed, 15 insertions(+), 34 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index c6510d7e7b241b..164317bfb8a81f 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -17,8 +17,6 @@
 #include 
 #include 
 
-#define NO_IOMMU   1
-
 static int of_iommu_xlate(struct device *dev,
  struct of_phandle_args *iommu_spec)
 {
@@ -29,7 +27,7 @@ static int of_iommu_xlate(struct device *dev,
ops = iommu_ops_from_fwnode(fwnode);
if ((ops && !ops->of_xlate) ||
!of_device_is_available(iommu_spec->np))
-   return NO_IOMMU;
+   return -ENODEV;
 
ret = iommu_fwspec_init(dev, &iommu_spec->np->fwnode, ops);
if (ret)
@@ -61,7 +59,7 @@ static int of_iommu_configure_dev_id(struct device_node 
*master_np,
 "iommu-map-mask", &iommu_spec.np,
 iommu_spec.args);
if (err)
-   return err == -ENODEV ? NO_IOMMU : err;
+   return err;
 
err = of_iommu_xlate(dev, &iommu_spec);
of_node_put(iommu_spec.np);
@@ -72,7 +70,7 @@ static int of_iommu_configure_dev(struct device_node 
*master_np,
  struct device *dev)
 {
struct of_phandle_args iommu_spec;
-   int err = NO_IOMMU, idx = 0;
+   int err = -ENODEV, idx = 0;
 
while (!of_parse_phandle_with_args(master_np, "iommus",
   "#iommu-cells",
@@ -117,9 +115,8 @@ static int of_iommu_configure_device(struct device_node 
*master_np,
 int of_iommu_configure(struct device *dev, struct device_node *master_np,
   const u32 *id)
 {
-   const struct iommu_ops *ops = NULL;
struct iommu_fwspec *fwspec;
-   int err = NO_IOMMU;
+   int err;
 
if (!master_np)
return -ENODEV;
@@ -153,37 +150,21 @@ int of_iommu_configure(struct device *dev, struct 
device_node *master_np,
} else {
err = of_iommu_configure_device(master_np, dev, id);
}
-
-   /*
-* Two success conditions can be represented by non-negative err here:
-* >0 : there is no IOMMU, or one was unavailable for non-fatal reasons
-*  0 : we found an IOMMU, and dev->fwspec is initialised appropriately
-* <0 : any actual error
-*/
-   if (!err) {
-   /* The fwspec pointer changed, read it again */
-   fwspec = dev_iommu_fwspec_get(dev);
-   ops= fwspec->ops;
-   }
mutex_unlock(&iommu_probe_device_lock);
 
-   /*
-* If we have reason to believe the IOMMU driver missed the initial
-* probe for dev, replay it to get things in order.
-*/
-   if (!err && dev->bus)
-   err = iommu_probe_device(dev);
-
-   /* Ignore all other errors apart from EPROBE_DEFER */
-   if (err < 0) {
-   if (err == -EPROBE_DEFER)
-   return err;
-   dev_dbg(dev, "Adding to IOMMU failed: %pe\n", ERR_PTR(err));
+   if (err == -ENODEV || err == -EPROBE_DEFER)
return err;
-   }
-   if (!ops)
-   return -ENODEV;
+   if (err)
+   goto err_log;
+
+   err = iommu_probe_device(dev);
+   if (err)
+   goto err_log;
return 0;
+
+err_log:
+   dev_dbg(dev, "Adding to IOMMU failed: %pe\n", ERR_PTR(err));
+   return err;
 }
 
 static enum iommu_resv_type __maybe_unused
-- 
2.42.0



[Nouveau] [PATCH 02/10] iommmu/of: Do not return struct iommu_ops from of_iommu_configure()

2023-11-28 Thread Jason Gunthorpe
Nothing needs this pointer. Return a normal error code with the usual
IOMMU semantic that ENODEV means 'there is no IOMMU driver'.

Reviewed-by: Jerry Snitselaar 
Acked-by: Rob Herring 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/of_iommu.c | 31 +++
 drivers/of/device.c  | 22 +++---
 include/linux/of_iommu.h | 13 ++---
 3 files changed, 40 insertions(+), 26 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index 5ecca53847d325..c6510d7e7b241b 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -107,16 +107,22 @@ static int of_iommu_configure_device(struct device_node 
*master_np,
  of_iommu_configure_dev(master_np, dev);
 }
 
-const struct iommu_ops *of_iommu_configure(struct device *dev,
-  struct device_node *master_np,
-  const u32 *id)
+/*
+ * Returns:
+ *  0 on success, an iommu was configured
+ *  -ENODEV if the device does not have any IOMMU
+ *  -EPROBEDEFER if probing should be tried again
+ *  -errno fatal errors
+ */
+int of_iommu_configure(struct device *dev, struct device_node *master_np,
+  const u32 *id)
 {
const struct iommu_ops *ops = NULL;
struct iommu_fwspec *fwspec;
int err = NO_IOMMU;
 
if (!master_np)
-   return NULL;
+   return -ENODEV;
 
/* Serialise to make dev->iommu stable under our potential fwspec */
mutex_lock(&iommu_probe_device_lock);
@@ -124,7 +130,7 @@ const struct iommu_ops *of_iommu_configure(struct device 
*dev,
if (fwspec) {
if (fwspec->ops) {
mutex_unlock(&iommu_probe_device_lock);
-   return fwspec->ops;
+   return 0;
}
/* In the deferred case, start again from scratch */
iommu_fwspec_free(dev);
@@ -169,14 +175,15 @@ const struct iommu_ops *of_iommu_configure(struct device 
*dev,
err = iommu_probe_device(dev);
 
/* Ignore all other errors apart from EPROBE_DEFER */
-   if (err == -EPROBE_DEFER) {
-   ops = ERR_PTR(err);
-   } else if (err < 0) {
-   dev_dbg(dev, "Adding to IOMMU failed: %d\n", err);
-   ops = NULL;
+   if (err < 0) {
+   if (err == -EPROBE_DEFER)
+   return err;
+   dev_dbg(dev, "Adding to IOMMU failed: %pe\n", ERR_PTR(err));
+   return err;
}
-
-   return ops;
+   if (!ops)
+   return -ENODEV;
+   return 0;
 }
 
 static enum iommu_resv_type __maybe_unused
diff --git a/drivers/of/device.c b/drivers/of/device.c
index 65c71be71a8d45..873d933e8e6d1d 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -93,12 +93,12 @@ of_dma_set_restricted_buffer(struct device *dev, struct 
device_node *np)
 int of_dma_configure_id(struct device *dev, struct device_node *np,
bool force_dma, const u32 *id)
 {
-   const struct iommu_ops *iommu;
const struct bus_dma_region *map = NULL;
struct device_node *bus_np;
u64 dma_start = 0;
u64 mask, end, size = 0;
bool coherent;
+   int iommu_ret;
int ret;
 
if (np == dev->of_node)
@@ -181,21 +181,29 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
dev_dbg(dev, "device is%sdma coherent\n",
coherent ? " " : " not ");
 
-   iommu = of_iommu_configure(dev, np, id);
-   if (PTR_ERR(iommu) == -EPROBE_DEFER) {
+   iommu_ret = of_iommu_configure(dev, np, id);
+   if (iommu_ret == -EPROBE_DEFER) {
/* Don't touch range map if it wasn't set from a valid 
dma-ranges */
if (!ret)
dev->dma_range_map = NULL;
kfree(map);
return -EPROBE_DEFER;
-   }
+   } else if (iommu_ret == -ENODEV) {
+   dev_dbg(dev, "device is not behind an iommu\n");
+   } else if (iommu_ret) {
+   dev_err(dev, "iommu configuration for device failed with %pe\n",
+   ERR_PTR(iommu_ret));
 
-   dev_dbg(dev, "device is%sbehind an iommu\n",
-   iommu ? " " : " not ");
+   /*
+* Historically this routine doesn't fail driver probing
+* due to errors in of_iommu_configure()
+*/
+   } else
+   dev_dbg(dev, "device is behind an iommu\n");
 
arch_setup_dma_ops(dev, dma_start, size, coherent);
 
-   if (!iommu)
+   if (iommu_ret)
of_dma_set_restricted_buffer(dev, np);
 
retur

[Nouveau] [PATCH 08/10] iommu/tegra: Use tegra_dev_iommu_get_stream_id() in the remaining places

2023-11-28 Thread Jason Gunthorpe
This API was defined to formalize the access to internal iommu details on
some Tegra SOCs, but a few callers got missed. Add them.

The helper already masks by 0xffff so remove this code from the callers.

Suggested-by: Thierry Reding 
Signed-off-by: Jason Gunthorpe 
---
 drivers/dma/tegra186-gpc-dma.c  |  8 +++-
 drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c |  7 ++-
 drivers/memory/tegra/tegra186.c | 12 ++--
 3 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/drivers/dma/tegra186-gpc-dma.c b/drivers/dma/tegra186-gpc-dma.c
index fa4d4142a68a21..88547a23825b18 100644
--- a/drivers/dma/tegra186-gpc-dma.c
+++ b/drivers/dma/tegra186-gpc-dma.c
@@ -1348,8 +1348,8 @@ static int tegra_dma_program_sid(struct tegra_dma_channel 
*tdc, int stream_id)
 static int tegra_dma_probe(struct platform_device *pdev)
 {
const struct tegra_dma_chip_data *cdata = NULL;
-   struct iommu_fwspec *iommu_spec;
-   unsigned int stream_id, i;
+   unsigned int i;
+   u32 stream_id;
struct tegra_dma *tdma;
int ret;
 
@@ -1378,12 +1378,10 @@ static int tegra_dma_probe(struct platform_device *pdev)
 
tdma->dma_dev.dev = &pdev->dev;
 
-   iommu_spec = dev_iommu_fwspec_get(&pdev->dev);
-   if (!iommu_spec) {
+   if (!tegra_dev_iommu_get_stream_id(&pdev->dev, &stream_id)) {
dev_err(&pdev->dev, "Missing iommu stream-id\n");
return -EINVAL;
}
-   stream_id = iommu_spec->ids[0] & 0xffff;
 
ret = device_property_read_u32(&pdev->dev, "dma-channel-mask",
   &tdma->chan_mask);
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c 
b/drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c
index e7e8fdf3adab7a..b40fd1dbb21617 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c
@@ -28,16 +28,13 @@ static void
 gp10b_ltc_init(struct nvkm_ltc *ltc)
 {
struct nvkm_device *device = ltc->subdev.device;
-   struct iommu_fwspec *spec;
+   u32 sid;
 
nvkm_wr32(device, 0x17e27c, ltc->ltc_nr);
nvkm_wr32(device, 0x17e000, ltc->ltc_nr);
nvkm_wr32(device, 0x100800, ltc->ltc_nr);
 
-   spec = dev_iommu_fwspec_get(device->dev);
-   if (spec) {
-   u32 sid = spec->ids[0] & 0xffff;
-
+   if (tegra_dev_iommu_get_stream_id(device->dev, &sid)) {
/* stream ID */
nvkm_wr32(device, 0x160000, sid << 2);
}
diff --git a/drivers/memory/tegra/tegra186.c b/drivers/memory/tegra/tegra186.c
index 533f85a4b2bdb7..3e4fbe94dd666e 100644
--- a/drivers/memory/tegra/tegra186.c
+++ b/drivers/memory/tegra/tegra186.c
@@ -111,21 +111,21 @@ static void tegra186_mc_client_sid_override(struct 
tegra_mc *mc,
 static int tegra186_mc_probe_device(struct tegra_mc *mc, struct device *dev)
 {
 #if IS_ENABLED(CONFIG_IOMMU_API)
-   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
struct of_phandle_args args;
unsigned int i, index = 0;
+   u32 sid;
 
+   WARN_ON(!tegra_dev_iommu_get_stream_id(dev, &sid));
while (!of_parse_phandle_with_args(dev->of_node, "interconnects", 
"#interconnect-cells",
   index, &args)) {
if (args.np == mc->dev->of_node && args.args_count != 0) {
for (i = 0; i < mc->soc->num_clients; i++) {
const struct tegra_mc_client *client = 
&mc->soc->clients[i];
 
-   if (client->id == args.args[0]) {
-   u32 sid = fwspec->ids[0] & 
MC_SID_STREAMID_OVERRIDE_MASK;
-
-   tegra186_mc_client_sid_override(mc, 
client, sid);
-   }
+   if (client->id == args.args[0])
+   tegra186_mc_client_sid_override(
+   mc, client,
+   sid & 
MC_SID_STREAMID_OVERRIDE_MASK);
}
}
 
-- 
2.42.0



[Nouveau] [PATCH 10/10] ACPI: IORT: Allow COMPILE_TEST of IORT

2023-11-28 Thread Jason Gunthorpe
The arm-smmu driver can COMPILE_TEST on x86, so expand this to also
enable the IORT code so it can be COMPILE_TEST'd too.

Signed-off-by: Jason Gunthorpe 
---
 drivers/acpi/Kconfig| 2 --
 drivers/acpi/Makefile   | 2 +-
 drivers/acpi/arm64/Kconfig  | 1 +
 drivers/acpi/arm64/Makefile | 2 +-
 drivers/iommu/Kconfig   | 1 +
 5 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index f819e760ff195a..3b7f77b227d13a 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -541,9 +541,7 @@ config ACPI_PFRUT
  To compile the drivers as modules, choose M here:
  the modules will be called pfr_update and pfr_telemetry.
 
-if ARM64
 source "drivers/acpi/arm64/Kconfig"
-endif
 
 config ACPI_PPTT
bool
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index eaa09bf52f1760..4e77ae37b80726 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -127,7 +127,7 @@ obj-y   += pmic/
 video-objs += acpi_video.o video_detect.o
 obj-y  += dptf/
 
-obj-$(CONFIG_ARM64)+= arm64/
+obj-y  += arm64/
 
 obj-$(CONFIG_ACPI_VIOT)+= viot.o
 
diff --git a/drivers/acpi/arm64/Kconfig b/drivers/acpi/arm64/Kconfig
index b3ed6212244c1e..537d49d8ace69e 100644
--- a/drivers/acpi/arm64/Kconfig
+++ b/drivers/acpi/arm64/Kconfig
@@ -11,6 +11,7 @@ config ACPI_GTDT
 
 config ACPI_AGDI
bool "Arm Generic Diagnostic Dump and Reset Device Interface"
+   depends on ARM64
depends on ARM_SDE_INTERFACE
help
  Arm Generic Diagnostic Dump and Reset Device Interface (AGDI) is
diff --git a/drivers/acpi/arm64/Makefile b/drivers/acpi/arm64/Makefile
index 143debc1ba4a9d..71d0e635599390 100644
--- a/drivers/acpi/arm64/Makefile
+++ b/drivers/acpi/arm64/Makefile
@@ -4,4 +4,4 @@ obj-$(CONFIG_ACPI_IORT) += iort.o
 obj-$(CONFIG_ACPI_GTDT)+= gtdt.o
 obj-$(CONFIG_ACPI_APMT)+= apmt.o
 obj-$(CONFIG_ARM_AMBA) += amba.o
-obj-y  += dma.o init.o
+obj-$(CONFIG_ARM64)+= dma.o init.o
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 7673bb82945b6c..309378e76a9bc9 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -318,6 +318,7 @@ config ARM_SMMU
select IOMMU_API
select IOMMU_IO_PGTABLE_LPAE
select ARM_DMA_USE_IOMMU if ARM
+   select ACPI_IORT if ACPI
help
  Support for implementations of the ARM System MMU architecture
  versions 1 and 2.
-- 
2.42.0



[Nouveau] [PATCH 04/10] iommu: Mark dev_iommu_get() with lockdep

2023-11-28 Thread Jason Gunthorpe
Allocation of dev->iommu must be done under the
iommu_probe_device_lock. Mark this with lockdep to discourage future
mistakes.

Reviewed-by: Jerry Snitselaar 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0d25468d53a68a..4323b6276e977f 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -334,6 +334,8 @@ static struct dev_iommu *dev_iommu_get(struct device *dev)
 {
struct dev_iommu *param = dev->iommu;
 
+   lockdep_assert_held(&iommu_probe_device_lock);
+
if (param)
return param;
 
-- 
2.42.0



[Nouveau] [PATCH 01/10] iommu: Remove struct iommu_ops *iommu from arch_setup_dma_ops()

2023-11-28 Thread Jason Gunthorpe
This is not being used to pass ops, it is just a way to tell if an
iommu driver was probed. These days this can be detected directly via
device_iommu_mapped(). Call device_iommu_mapped() in the two places that
need to check it and remove the iommu parameter everywhere.

Reviewed-by: Jerry Snitselaar 
Reviewed-by: Lu Baolu 
Reviewed-by: Moritz Fischer 
Acked-by: Christoph Hellwig 
Acked-by: Rob Herring 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 arch/arc/mm/dma.c   |  2 +-
 arch/arm/mm/dma-mapping-nommu.c |  2 +-
 arch/arm/mm/dma-mapping.c   | 10 +-
 arch/arm64/mm/dma-mapping.c |  4 ++--
 arch/mips/mm/dma-noncoherent.c  |  2 +-
 arch/riscv/mm/dma-noncoherent.c |  2 +-
 drivers/acpi/scan.c |  3 +--
 drivers/hv/hv_common.c  |  2 +-
 drivers/of/device.c |  2 +-
 include/linux/dma-map-ops.h |  4 ++--
 10 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 2a7fbbb83b7056..197707bc765889 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -91,7 +91,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
  * Plug in direct dma map ops.
  */
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {
/*
 * IOC hardware snoops all DMA traffic keeping the caches consistent
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index cfd9c933d2f09c..b94850b579952a 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -34,7 +34,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 }
 
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {
if (IS_ENABLED(CONFIG_CPU_V7M)) {
/*
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 5409225b4abc06..6c359a3af8d9c7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1713,7 +1713,7 @@ void arm_iommu_detach_device(struct device *dev)
 EXPORT_SYMBOL_GPL(arm_iommu_detach_device);
 
 static void arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool 
coherent)
+   bool coherent)
 {
struct dma_iommu_mapping *mapping;
 
@@ -1748,7 +1748,7 @@ static void arm_teardown_iommu_dma_ops(struct device *dev)
 #else
 
 static void arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool 
coherent)
+   bool coherent)
 {
 }
 
@@ -1757,7 +1757,7 @@ static void arm_teardown_iommu_dma_ops(struct device 
*dev) { }
 #endif /* CONFIG_ARM_DMA_USE_IOMMU */
 
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {
/*
 * Due to legacy code that sets the ->dma_coherent flag from a bus
@@ -1776,8 +1776,8 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, 
u64 size,
if (dev->dma_ops)
return;
 
-   if (iommu)
-   arm_setup_iommu_dma_ops(dev, dma_base, size, iommu, coherent);
+   if (device_iommu_mapped(dev))
+   arm_setup_iommu_dma_ops(dev, dma_base, size, coherent);
 
xen_setup_dma_ops(dev);
dev->archdata.dma_ops_setup = true;
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 3cb101e8cb29ba..61886e43e3a10f 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -47,7 +47,7 @@ void arch_teardown_dma_ops(struct device *dev)
 #endif
 
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {
int cls = cache_line_size_of_cpu();
 
@@ -58,7 +58,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 
size,
   ARCH_DMA_MINALIGN, cls);
 
dev->dma_coherent = coherent;
-   if (iommu)
+   if (device_iommu_mapped(dev))
iommu_setup_dma_ops(dev, dma_base, dma_base + size - 1);
 
xen_setup_dma_ops(dev);
diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
index 3c4fc97b9f394b..0f3cec663a12cd 100644
--- a/arch/mips/mm/dma-noncoherent.c
+++ b/arch/mips/mm/dma-noncoherent.c
@@ -138,7 +138,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 
 #ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
 void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
-   const struct iommu_ops *iommu, bool coherent)
+   bool coherent)
 {

[PATCH 06/10] iommu: Replace iommu_device_lock with iommu_probe_device_lock

2023-11-28 Thread Jason Gunthorpe
The iommu_device_lock protects the iommu_device_list which is only read by
iommu_ops_from_fwnode().

This is now always called under the iommu_probe_device_lock, so we don't
need to double lock the linked list. Use the iommu_probe_device_lock on
the write side too.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommu.c | 30 +-
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 08f29a1dfcd5f8..9557c2ec08d915 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -146,7 +146,6 @@ struct iommu_group_attribute iommu_group_attr_##_name = 
\
container_of(_kobj, struct iommu_group, kobj)
 
 static LIST_HEAD(iommu_device_list);
-static DEFINE_SPINLOCK(iommu_device_lock);
 
 static const struct bus_type * const iommu_buses[] = {
&platform_bus_type,
@@ -262,9 +261,9 @@ int iommu_device_register(struct iommu_device *iommu,
if (hwdev)
iommu->fwnode = dev_fwnode(hwdev);
 
-   spin_lock(&iommu_device_lock);
+   mutex_lock(&iommu_probe_device_lock);
list_add_tail(&iommu->list, &iommu_device_list);
-   spin_unlock(&iommu_device_lock);
+   mutex_unlock(&iommu_probe_device_lock);
 
for (int i = 0; i < ARRAY_SIZE(iommu_buses) && !err; i++)
err = bus_iommu_probe(iommu_buses[i]);
@@ -279,9 +278,9 @@ void iommu_device_unregister(struct iommu_device *iommu)
for (int i = 0; i < ARRAY_SIZE(iommu_buses); i++)
bus_for_each_dev(iommu_buses[i], NULL, iommu, 
remove_iommu_group);
 
-   spin_lock(&iommu_device_lock);
+   mutex_lock(&iommu_probe_device_lock);
list_del(&iommu->list);
-   spin_unlock(&iommu_device_lock);
+   mutex_unlock(&iommu_probe_device_lock);
 
/* Pairs with the alloc in generic_single_device_group() */
iommu_group_put(iommu->singleton_group);
@@ -316,9 +315,9 @@ int iommu_device_register_bus(struct iommu_device *iommu,
if (err)
return err;
 
-   spin_lock(&iommu_device_lock);
+   mutex_lock(&iommu_probe_device_lock);
list_add_tail(&iommu->list, &iommu_device_list);
-   spin_unlock(&iommu_device_lock);
+   mutex_unlock(&iommu_probe_device_lock);
 
err = bus_iommu_probe(bus);
if (err) {
@@ -2033,9 +2032,9 @@ bool iommu_present(const struct bus_type *bus)
 
for (int i = 0; i < ARRAY_SIZE(iommu_buses); i++) {
if (iommu_buses[i] == bus) {
-   spin_lock(&iommu_device_lock);
+   mutex_lock(&iommu_probe_device_lock);
ret = !list_empty(&iommu_device_list);
-   spin_unlock(&iommu_device_lock);
+   mutex_unlock(&iommu_probe_device_lock);
}
}
return ret;
@@ -2980,17 +2979,14 @@ EXPORT_SYMBOL_GPL(iommu_default_passthrough);
 
 const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode)
 {
-   const struct iommu_ops *ops = NULL;
struct iommu_device *iommu;
 
-   spin_lock(&iommu_device_lock);
+   lockdep_assert_held(&iommu_probe_device_lock);
+
list_for_each_entry(iommu, &iommu_device_list, list)
-   if (iommu->fwnode == fwnode) {
-   ops = iommu->ops;
-   break;
-   }
-   spin_unlock(&iommu_device_lock);
-   return ops;
+   if (iommu->fwnode == fwnode)
+   return iommu->ops;
+   return NULL;
 }
 
 int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode,
-- 
2.42.0




[Nouveau] [PATCH 09/10] ACPI: IORT: Cast from ULL to phys_addr_t

2023-11-28 Thread Jason Gunthorpe
gcc on i386 (when compile testing) warns:

 drivers/acpi/arm64/iort.c:2014:18: warning: implicit conversion from 'unsigned long long' to 'phys_addr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion]
   local_limit = DMA_BIT_MASK(ncomp->memory_address_limit);

Because DMA_BIT_MASK returns a large ULL constant. Explicitly truncate it
to phys_addr_t.

Signed-off-by: Jason Gunthorpe 
---
 drivers/acpi/arm64/iort.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 6496ff5a6ba20d..bdaf9256870d92 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -2011,7 +2011,8 @@ phys_addr_t __init acpi_iort_dma_get_max_cpu_address(void)
 
case ACPI_IORT_NODE_NAMED_COMPONENT:
ncomp = (struct acpi_iort_named_component 
*)node->node_data;
-   local_limit = DMA_BIT_MASK(ncomp->memory_address_limit);
+   local_limit = (phys_addr_t)DMA_BIT_MASK(
+   ncomp->memory_address_limit);
limit = min_not_zero(limit, local_limit);
break;
 
@@ -2020,7 +2021,8 @@ phys_addr_t __init acpi_iort_dma_get_max_cpu_address(void)
break;
 
rc = (struct acpi_iort_root_complex *)node->node_data;
-   local_limit = DMA_BIT_MASK(rc->memory_address_limit);
+   local_limit = (phys_addr_t)DMA_BIT_MASK(
+   rc->memory_address_limit);
limit = min_not_zero(limit, local_limit);
break;
}
-- 
2.42.0



[Nouveau] [PATCH 05/10] iommu: Mark dev_iommu_priv_set() with a lockdep

2023-11-28 Thread Jason Gunthorpe
A perfect driver would only call dev_iommu_priv_set() from its probe
callback. We've made it functionally correct to call it from of_xlate()
by adding a lock around that call.

lockdep assert that iommu_probe_device_lock is held to discourage misuse.

Exclude PPC kernels with CONFIG_FSL_PAMU turned on because FSL_PAMU uses a
global static for its priv and abuses priv for its domain.

Remove the pointless stores of NULL, all these are on paths where the core
code will free dev->iommu after the op returns.
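
As a rough sketch of the intended pattern (hypothetical "foo" driver, not
code from this series): priv is set only on the probe path, where the core
already holds iommu_probe_device_lock, and is never cleared on release:

#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/slab.h>

struct foo_master {
	u32 stream_id;
};

static struct iommu_device foo_iommu_dev;	/* registered elsewhere */

static struct iommu_device *foo_probe_device(struct device *dev)
{
	struct foo_master *master = kzalloc(sizeof(*master), GFP_KERNEL);

	if (!master)
		return ERR_PTR(-ENOMEM);
	/* The core calls probe_device with iommu_probe_device_lock held, so
	 * the lockdep assertion added by this patch is satisfied here. */
	dev_iommu_priv_set(dev, master);
	return &foo_iommu_dev;
}

static void foo_release_device(struct device *dev)
{
	/* No dev_iommu_priv_set(dev, NULL): the core frees dev->iommu after
	 * this op returns, so clearing priv would be a pointless store. */
	kfree(dev_iommu_priv_get(dev));
}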

Reviewed-by: Lu Baolu 
Reviewed-by: Jerry Snitselaar 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/amd/iommu.c   | 2 --
 drivers/iommu/apple-dart.c  | 1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 1 -
 drivers/iommu/intel/iommu.c | 2 --
 drivers/iommu/iommu.c   | 9 +
 drivers/iommu/omap-iommu.c  | 1 -
 include/linux/iommu.h   | 5 +
 8 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 9f706436082833..be58644a6fa518 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -551,8 +551,6 @@ static void amd_iommu_uninit_device(struct device *dev)
if (dev_data->domain)
detach_device(dev);
 
-   dev_iommu_priv_set(dev, NULL);
-
/*
 * We keep dev_data around for unplugged devices and reuse it when the
 * device is re-plugged - not doing so would introduce a ton of races.
diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index 7438e9c82ba982..25135440b5dd54 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -743,7 +743,6 @@ static void apple_dart_release_device(struct device *dev)
 {
struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
 
-   dev_iommu_priv_set(dev, NULL);
kfree(cfg);
 }
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fc4317c25b6d53..1855d3892b15f8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2695,7 +2695,6 @@ static struct iommu_device *arm_smmu_probe_device(struct 
device *dev)
 
 err_free_master:
kfree(master);
-   dev_iommu_priv_set(dev, NULL);
return ERR_PTR(ret);
 }
 
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 4d09c004789274..adc7937fd8a3a3 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1420,7 +1420,6 @@ static void arm_smmu_release_device(struct device *dev)
 
arm_smmu_rpm_put(cfg->smmu);
 
-   dev_iommu_priv_set(dev, NULL);
kfree(cfg);
 }
 
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 897159dba47de4..511589341074f0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4461,7 +4461,6 @@ static struct iommu_device 
*intel_iommu_probe_device(struct device *dev)
ret = intel_pasid_alloc_table(dev);
if (ret) {
dev_err(dev, "PASID table allocation failed\n");
-   dev_iommu_priv_set(dev, NULL);
kfree(info);
return ERR_PTR(ret);
}
@@ -4479,7 +4478,6 @@ static void intel_iommu_release_device(struct device *dev)
dmar_remove_one_dev_info(dev);
intel_pasid_free_table(dev);
intel_iommu_debugfs_remove_dev(info);
-   dev_iommu_priv_set(dev, NULL);
kfree(info);
set_dma_ops(dev, NULL);
 }
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 4323b6276e977f..08f29a1dfcd5f8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -387,6 +387,15 @@ static u32 dev_iommu_get_max_pasids(struct device *dev)
return min_t(u32, max_pasids, dev->iommu->iommu_dev->max_pasids);
 }
 
+void dev_iommu_priv_set(struct device *dev, void *priv)
+{
+   /* FSL_PAMU does something weird */
+   if (!IS_ENABLED(CONFIG_FSL_PAMU))
+   lockdep_assert_held(&iommu_probe_device_lock);
+   dev->iommu->priv = priv;
+}
+EXPORT_SYMBOL_GPL(dev_iommu_priv_set);
+
 /*
  * Init the dev->iommu and dev->iommu_group in the struct device and get the
  * driver probed
diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
index c66b070841dd41..c9528065a59afa 100644
--- a/drivers/iommu/omap-iommu.c
+++ b/drivers/iommu/omap-iommu.c
@@ -1719,7 +1719,6 @@ static void omap_iommu_release_device(struct device *dev)
if (!dev->of_node || !arch_data)
return;
 
-   dev_iommu_priv_set(dev, NULL);
kfree(arch_data);
 
 }
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c7394

[Nouveau] [PATCH 00/10] IOMMU related FW parsing cleanup

2023-11-28 Thread Jason Gunthorpe
These are the patches from the prior series without the "fwspec
polishing":
 https://lore.kernel.org/r/0-v2-36a0088ecaa7+22c6e-iommu_fwspec_...@nvidia.com

Rebased onto Robin's patch:
 
https://lore.kernel.org/all/16f433658661d7cadfea51e7c65da95826112a2b.1700071477.git.robin.mur...@arm.com/

Does a few things to prepare for the next:

- Clean up the call chains around dma_configure so the iommu_ops isn't being
  exposed.

- Add additional lockdep annotations now that we can.

- Replace the iommu_device_lock with iommu_probe_device_lock.

- Fix some missed places that need to call tegra_dev_iommu_get_stream_id()
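
As a small, hedged illustration of that last item (the helper's bool/u32
signature is assumed here; the real call sites are in patch 8), a consumer
asks the core for the stream ID instead of digging through the fwspec by
hand:

#include <linux/iommu.h>

static u32 example_get_sid(struct device *dev, u32 fallback_sid)
{
	u32 sid;

	/* Returns false when the device has no IOMMU translation. */
	if (tegra_dev_iommu_get_stream_id(dev, &sid))
		return sid;
	return fallback_sid;
}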

Jason Gunthorpe (10):
  iommu: Remove struct iommu_ops *iommu from arch_setup_dma_ops()
  iommmu/of: Do not return struct iommu_ops from of_iommu_configure()
  iommu/of: Use -ENODEV consistently in of_iommu_configure()
  iommu: Mark dev_iommu_get() with lockdep
  iommu: Mark dev_iommu_priv_set() with a lockdep
  iommu: Replace iommu_device_lock with iommu_probe_device_lock
  acpi: Do not return struct iommu_ops from acpi_iommu_configure_id()
  iommu/tegra: Use tegra_dev_iommu_get_stream_id() in the remaining
places
  ACPI: IORT: Cast from ULL to phys_addr_t
  ACPI: IORT: Allow COMPILE_TEST of IORT

 arch/arc/mm/dma.c |  2 +-
 arch/arm/mm/dma-mapping-nommu.c   |  2 +-
 arch/arm/mm/dma-mapping.c | 10 +--
 arch/arm64/mm/dma-mapping.c   |  4 +-
 arch/mips/mm/dma-noncoherent.c|  2 +-
 arch/riscv/mm/dma-noncoherent.c   |  2 +-
 drivers/acpi/Kconfig  |  2 -
 drivers/acpi/Makefile |  2 +-
 drivers/acpi/arm64/Kconfig|  1 +
 drivers/acpi/arm64/Makefile   |  2 +-
 drivers/acpi/arm64/iort.c |  6 +-
 drivers/acpi/scan.c   | 32 ++
 drivers/dma/tegra186-gpc-dma.c|  8 +--
 .../gpu/drm/nouveau/nvkm/subdev/ltc/gp10b.c   |  7 +-
 drivers/hv/hv_common.c|  2 +-
 drivers/iommu/Kconfig |  1 +
 drivers/iommu/amd/iommu.c |  2 -
 drivers/iommu/apple-dart.c|  1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  1 -
 drivers/iommu/arm/arm-smmu/arm-smmu.c |  1 -
 drivers/iommu/intel/iommu.c   |  2 -
 drivers/iommu/iommu.c | 41 +++-
 drivers/iommu/of_iommu.c  | 64 ---
 drivers/iommu/omap-iommu.c|  1 -
 drivers/memory/tegra/tegra186.c   | 12 ++--
 drivers/of/device.c   | 24 ---
 include/linux/dma-map-ops.h   |  4 +-
 include/linux/iommu.h |  5 +-
 include/linux/of_iommu.h  | 13 ++--
 29 files changed, 124 insertions(+), 132 deletions(-)


base-commit: 173ff345925a394284250bfa6e47d231e62031c7
-- 
2.42.0



[PATCH 07/10] acpi: Do not return struct iommu_ops from acpi_iommu_configure_id()

2023-11-28 Thread Jason Gunthorpe
Nothing needs this pointer. Return a normal error code with the usual
IOMMU semantic that ENODEV means 'there is no IOMMU driver'.

Acked-by: Rafael J. Wysocki 
Reviewed-by: Jerry Snitselaar 
Tested-by: Hector Martin 
Signed-off-by: Jason Gunthorpe 
---
 drivers/acpi/scan.c | 29 +
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index 444a0b3c72f2d8..340ba720c72129 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1562,8 +1562,7 @@ static inline const struct iommu_ops 
*acpi_iommu_fwspec_ops(struct device *dev)
return fwspec ? fwspec->ops : NULL;
 }
 
-static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
-  const u32 *id_in)
+static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in)
 {
int err;
const struct iommu_ops *ops;
@@ -1577,7 +1576,7 @@ static const struct iommu_ops 
*acpi_iommu_configure_id(struct device *dev,
ops = acpi_iommu_fwspec_ops(dev);
if (ops) {
mutex_unlock(&iommu_probe_device_lock);
-   return ops;
+   return 0;
}
 
err = iort_iommu_configure_id(dev, id_in);
@@ -1594,12 +1593,14 @@ static const struct iommu_ops 
*acpi_iommu_configure_id(struct device *dev,
 
/* Ignore all other errors apart from EPROBE_DEFER */
if (err == -EPROBE_DEFER) {
-   return ERR_PTR(err);
+   return err;
} else if (err) {
dev_dbg(dev, "Adding to IOMMU failed: %d\n", err);
-   return NULL;
+   return -ENODEV;
}
-   return acpi_iommu_fwspec_ops(dev);
+   if (!acpi_iommu_fwspec_ops(dev))
+   return -ENODEV;
+   return 0;
 }
 
 #else /* !CONFIG_IOMMU_API */
@@ -1611,10 +1612,9 @@ int acpi_iommu_fwspec_init(struct device *dev, u32 id,
return -ENODEV;
 }
 
-static const struct iommu_ops *acpi_iommu_configure_id(struct device *dev,
-  const u32 *id_in)
+static int acpi_iommu_configure_id(struct device *dev, const u32 *id_in)
 {
-   return NULL;
+   return -ENODEV;
 }
 
 #endif /* !CONFIG_IOMMU_API */
@@ -1628,7 +1628,7 @@ static const struct iommu_ops 
*acpi_iommu_configure_id(struct device *dev,
 int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr,
  const u32 *input_id)
 {
-   const struct iommu_ops *iommu;
+   int ret;
 
if (attr == DEV_DMA_NOT_SUPPORTED) {
set_dma_ops(dev, &dma_dummy_ops);
@@ -1637,10 +1637,15 @@ int acpi_dma_configure_id(struct device *dev, enum 
dev_dma_attr attr,
 
acpi_arch_dma_setup(dev);
 
-   iommu = acpi_iommu_configure_id(dev, input_id);
-   if (PTR_ERR(iommu) == -EPROBE_DEFER)
+   ret = acpi_iommu_configure_id(dev, input_id);
+   if (ret == -EPROBE_DEFER)
return -EPROBE_DEFER;
 
+   /*
+* Historically this routine doesn't fail driver probing due to errors
+* in acpi_iommu_configure_id()
+*/
+
arch_setup_dma_ops(dev, 0, U64_MAX, attr == DEV_DMA_COHERENT);
 
return 0;
-- 
2.42.0




[Nouveau] [PATCH v3 09/10] iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s

2023-01-23 Thread Jason Gunthorpe
dma_alloc_cpu_table() and dma_alloc_page_table() are eventually called by
iommufd through s390_iommu_map_pages() and it should not be forced to
atomic. Thread the gfp parameter through the call chain starting from
s390_iommu_map_pages().

Reviewed-by: Niklas Schnelle 
Reviewed-by: Matthew Rosato 
Signed-off-by: Jason Gunthorpe 
---
 arch/s390/include/asm/pci_dma.h |  5 +++--
 arch/s390/pci/pci_dma.c | 31 +--
 drivers/iommu/s390-iommu.c  | 15 +--
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h
index 91e63426bdc53f..7119c04c51c5c8 100644
--- a/arch/s390/include/asm/pci_dma.h
+++ b/arch/s390/include/asm/pci_dma.h
@@ -186,9 +186,10 @@ static inline unsigned long *get_st_pto(unsigned long 
entry)
 
 /* Prototypes */
 void dma_free_seg_table(unsigned long);
-unsigned long *dma_alloc_cpu_table(void);
+unsigned long *dma_alloc_cpu_table(gfp_t gfp);
 void dma_cleanup_tables(unsigned long *);
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr);
+unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr,
+ gfp_t gfp);
 void dma_update_cpu_trans(unsigned long *entry, phys_addr_t page_addr, int 
flags);
 
 extern const struct dma_map_ops s390_pci_dma_ops;
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index ea478d11fbd132..2f6d05d6da4f76 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -27,11 +27,11 @@ static int zpci_refresh_global(struct zpci_dev *zdev)
  zdev->iommu_pages * PAGE_SIZE);
 }
 
-unsigned long *dma_alloc_cpu_table(void)
+unsigned long *dma_alloc_cpu_table(gfp_t gfp)
 {
unsigned long *table, *entry;
 
-   table = kmem_cache_alloc(dma_region_table_cache, GFP_ATOMIC);
+   table = kmem_cache_alloc(dma_region_table_cache, gfp);
if (!table)
return NULL;
 
@@ -45,11 +45,11 @@ static void dma_free_cpu_table(void *table)
kmem_cache_free(dma_region_table_cache, table);
 }
 
-static unsigned long *dma_alloc_page_table(void)
+static unsigned long *dma_alloc_page_table(gfp_t gfp)
 {
unsigned long *table, *entry;
 
-   table = kmem_cache_alloc(dma_page_table_cache, GFP_ATOMIC);
+   table = kmem_cache_alloc(dma_page_table_cache, gfp);
if (!table)
return NULL;
 
@@ -63,7 +63,7 @@ static void dma_free_page_table(void *table)
kmem_cache_free(dma_page_table_cache, table);
 }
 
-static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
+static unsigned long *dma_get_seg_table_origin(unsigned long *rtep, gfp_t gfp)
 {
unsigned long old_rte, rte;
unsigned long *sto;
@@ -72,7 +72,7 @@ static unsigned long *dma_get_seg_table_origin(unsigned long 
*rtep)
if (reg_entry_isvalid(rte)) {
sto = get_rt_sto(rte);
} else {
-   sto = dma_alloc_cpu_table();
+   sto = dma_alloc_cpu_table(gfp);
if (!sto)
return NULL;
 
@@ -90,7 +90,7 @@ static unsigned long *dma_get_seg_table_origin(unsigned long 
*rtep)
return sto;
 }
 
-static unsigned long *dma_get_page_table_origin(unsigned long *step)
+static unsigned long *dma_get_page_table_origin(unsigned long *step, gfp_t gfp)
 {
unsigned long old_ste, ste;
unsigned long *pto;
@@ -99,7 +99,7 @@ static unsigned long *dma_get_page_table_origin(unsigned long 
*step)
if (reg_entry_isvalid(ste)) {
pto = get_st_pto(ste);
} else {
-   pto = dma_alloc_page_table();
+   pto = dma_alloc_page_table(gfp);
if (!pto)
return NULL;
set_st_pto(&ste, virt_to_phys(pto));
@@ -116,18 +116,19 @@ static unsigned long *dma_get_page_table_origin(unsigned 
long *step)
return pto;
 }
 
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr)
+unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr,
+ gfp_t gfp)
 {
unsigned long *sto, *pto;
unsigned int rtx, sx, px;
 
rtx = calc_rtx(dma_addr);
-   sto = dma_get_seg_table_origin(&rto[rtx]);
+   sto = dma_get_seg_table_origin(&rto[rtx], gfp);
if (!sto)
return NULL;
 
sx = calc_sx(dma_addr);
-   pto = dma_get_page_table_origin(&sto[sx]);
+   pto = dma_get_page_table_origin(&sto[sx], gfp);
if (!pto)
return NULL;
 
@@ -170,7 +171,8 @@ static int __dma_update_trans(struct zpci_dev *zdev, 
phys_addr_t pa,
return -EINVAL;
 
for (i = 0; i < nr_pages; i++) {
-   entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
+   entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr,
+ 

[Nouveau] [PATCH v3 03/10] iommu: Add a gfp parameter to iommu_map_sg()

2023-01-23 Thread Jason Gunthorpe
Follow the pattern for iommu_map() and remove iommu_map_sg_atomic().

This allows __iommu_dma_alloc_noncontiguous() to use a GFP_KERNEL
allocation here, based on the provided gfp flags.

Reviewed-by: Kevin Tian 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c |  5 +++--
 drivers/iommu/iommu.c | 26 ++
 include/linux/iommu.h | 18 +-
 3 files changed, 18 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7016db569f81fc..72cfa24503b8bc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -833,7 +833,8 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
arch_dma_prep_coherent(sg_page(sg), sg->length);
}
 
-   ret = iommu_map_sg_atomic(domain, iova, sgt->sgl, sgt->orig_nents, 
ioprot);
+   ret = iommu_map_sg(domain, iova, sgt->sgl, sgt->orig_nents, ioprot,
+  GFP_ATOMIC);
if (ret < 0 || ret < size)
goto out_free_sg;
 
@@ -1281,7 +1282,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 * We'll leave any physical concatenation to the IOMMU driver's
 * implementation - it knows better than we do.
 */
-   ret = iommu_map_sg_atomic(domain, iova, sg, nents, prot);
+   ret = iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC);
if (ret < 0 || ret < iova_len)
goto out_free_iova;
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9412b420d07257..cc6e7c6bf72758 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2470,9 +2470,9 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
-static ssize_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot,
-   gfp_t gfp)
+ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+struct scatterlist *sg, unsigned int nents, int prot,
+gfp_t gfp)
 {
const struct iommu_domain_ops *ops = domain->ops;
size_t len = 0, mapped = 0;
@@ -2480,6 +2480,13 @@ static ssize_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
unsigned int i = 0;
int ret;
 
+   might_sleep_if(gfpflags_allow_blocking(gfp));
+
+   /* Discourage passing strange GFP flags */
+   if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
+   __GFP_HIGHMEM)))
+   return -EINVAL;
+
while (i <= nents) {
phys_addr_t s_phys = sg_phys(sg);
 
@@ -2519,21 +2526,8 @@ static ssize_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
 
return ret;
 }
-
-ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-struct scatterlist *sg, unsigned int nents, int prot)
-{
-   might_sleep();
-   return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_KERNEL);
-}
 EXPORT_SYMBOL_GPL(iommu_map_sg);
 
-ssize_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot)
-{
-   return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC);
-}
-
 /**
  * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
  * @domain: the iommu domain where the fault has happened
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 521cd79700f4d8..d5c16dc33c87de 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -474,10 +474,8 @@ extern size_t iommu_unmap_fast(struct iommu_domain *domain,
   unsigned long iova, size_t size,
   struct iommu_iotlb_gather *iotlb_gather);
 extern ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot);
-extern ssize_t iommu_map_sg_atomic(struct iommu_domain *domain,
-  unsigned long iova, struct scatterlist *sg,
-  unsigned int nents, int prot);
+   struct scatterlist *sg, unsigned int nents,
+   int prot, gfp_t gfp);
 extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t 
iova);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
iommu_fault_handler_t handler, void *token);
@@ -791,14 +789,7 @@ static inline size_t iommu_unmap_fast(struct iommu_domain 
*domain,
 
 static inline ssize_t iommu_map_sg(struct iommu_domain *domain,
   unsigned long iova, struct scatterlist *sg,
-  unsigned int nents, int prot)
-{
-   return -ENODEV;
-}
-
-s

[Nouveau] [PATCH v3 08/10] iommu/intel: Use GFP_KERNEL in sleepable contexts

2023-01-23 Thread Jason Gunthorpe
These contexts are sleepable, so use the proper annotation. The GFP_ATOMIC
was added mechanically in the prior patches.

Reviewed-by: Lu Baolu 
Reviewed-by: Kevin Tian 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index e95f7703ce7b83..a1a66798e1f06c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2372,7 +2372,7 @@ static int iommu_domain_identity_map(struct dmar_domain 
*domain,
 
return __domain_mapping(domain, first_vpfn,
first_vpfn, last_vpfn - first_vpfn + 1,
-   DMA_PTE_READ|DMA_PTE_WRITE, GFP_ATOMIC);
+   DMA_PTE_READ|DMA_PTE_WRITE, GFP_KERNEL);
 }
 
 static int md_domain_init(struct dmar_domain *domain, int guest_width);
@@ -2680,7 +2680,7 @@ static int copy_context_table(struct intel_iommu *iommu,
if (!old_ce)
goto out;
 
-   new_ce = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
+   new_ce = alloc_pgtable_page(iommu->node, GFP_KERNEL);
if (!new_ce)
goto out_unmap;
 
-- 
2.39.0



[Nouveau] [PATCH v3 01/10] iommu: Add a gfp parameter to iommu_map()

2023-01-23 Thread Jason Gunthorpe
The internal mechanisms support this, but instead of exposing the gfp to
the caller it is wrapped up inside iommu_map() and iommu_map_atomic().

Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.

Reviewed-by: Kevin Tian 
Signed-off-by: Jason Gunthorpe 
---
 arch/arm/mm/dma-mapping.c | 11 ++
 .../drm/nouveau/nvkm/subdev/instmem/gk20a.c   |  3 ++-
 drivers/gpu/drm/tegra/drm.c   |  2 +-
 drivers/gpu/host1x/cdma.c |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c  |  4 ++--
 drivers/iommu/dma-iommu.c |  2 +-
 drivers/iommu/iommu.c | 22 +--
 drivers/iommu/iommufd/pages.c |  6 +++--
 drivers/media/platform/qcom/venus/firmware.c  |  2 +-
 drivers/net/ipa/ipa_mem.c |  6 +++--
 drivers/net/wireless/ath/ath10k/snoc.c|  2 +-
 drivers/net/wireless/ath/ath11k/ahb.c |  4 ++--
 drivers/remoteproc/remoteproc_core.c  |  5 +++--
 drivers/vfio/vfio_iommu_type1.c   |  9 
 drivers/vhost/vdpa.c  |  2 +-
 include/linux/iommu.h |  4 ++--
 16 files changed, 48 insertions(+), 38 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index c135f6e37a00ca..8bc01071474ab7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -984,7 +984,8 @@ __iommu_create_mapping(struct device *dev, struct page 
**pages, size_t size,
 
len = (j - i) << PAGE_SHIFT;
ret = iommu_map(mapping->domain, iova, phys, len,
-   __dma_info_to_prot(DMA_BIDIRECTIONAL, attrs));
+   __dma_info_to_prot(DMA_BIDIRECTIONAL, attrs),
+   GFP_KERNEL);
if (ret < 0)
goto fail;
iova += len;
@@ -1207,7 +1208,8 @@ static int __map_sg_chunk(struct device *dev, struct 
scatterlist *sg,
 
prot = __dma_info_to_prot(dir, attrs);
 
-   ret = iommu_map(mapping->domain, iova, phys, len, prot);
+   ret = iommu_map(mapping->domain, iova, phys, len, prot,
+   GFP_KERNEL);
if (ret < 0)
goto fail;
count += len >> PAGE_SHIFT;
@@ -1379,7 +1381,8 @@ static dma_addr_t arm_iommu_map_page(struct device *dev, 
struct page *page,
 
prot = __dma_info_to_prot(dir, attrs);
 
-   ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len, 
prot);
+   ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len,
+   prot, GFP_KERNEL);
if (ret < 0)
goto fail;
 
@@ -1443,7 +1446,7 @@ static dma_addr_t arm_iommu_map_resource(struct device 
*dev,
 
prot = __dma_info_to_prot(dir, attrs) | IOMMU_MMIO;
 
-   ret = iommu_map(mapping->domain, dma_addr, addr, len, prot);
+   ret = iommu_map(mapping->domain, dma_addr, addr, len, prot, GFP_KERNEL);
if (ret < 0)
goto fail;
 
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c 
b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
index 648ecf5a8fbc2a..a4ac94a2ab57fc 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
@@ -475,7 +475,8 @@ gk20a_instobj_ctor_iommu(struct gk20a_instmem *imem, u32 
npages, u32 align,
u32 offset = (r->offset + i) << imem->iommu_pgshift;
 
ret = iommu_map(imem->domain, offset, node->dma_addrs[i],
-   PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
+   PAGE_SIZE, IOMMU_READ | IOMMU_WRITE,
+   GFP_KERNEL);
if (ret < 0) {
nvkm_error(subdev, "IOMMU mapping failure: %d\n", ret);
 
diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 7bd2e65c2a16c5..6ca9f396e55be4 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -1057,7 +1057,7 @@ void *tegra_drm_alloc(struct tegra_drm *tegra, size_t 
size, dma_addr_t *dma)
 
*dma = iova_dma_addr(&tegra->carveout.domain, alloc);
err = iommu_map(tegra->domain, *dma, virt_to_phys(virt),
-   size, IOMMU_READ | IOMMU_WRITE);
+   size, IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
if (err < 0)
goto free_iova;
 
diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c
index 103fda055394ab..4ddfcd2138c95b 100644
--- a/drivers/gpu/host1x/cdma.c
+++ b/drivers/gpu/host1x/cdma.c
@@ -105,7 +105,7 @@ static int host1x_pushbuffer_init(struct push_buffer *pb)
 
pb->dma = iova_dma_addr(&host1x->iova, alloc);

[Nouveau] [PATCH v3 05/10] iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()

2023-01-23 Thread Jason Gunthorpe
iommufd follows the same design as KVM and uses memory cgroups to limit
the amount of kernel memory an iommufd file descriptor can pin down. The
various internal data structures already use GFP_KERNEL_ACCOUNT.

However, one of the biggest consumers of kernel memory is the IOPTEs
stored under the iommu_domain. Many drivers will allocate these at
iommu_map() time and will trivially do the right thing if we pass in
GFP_KERNEL_ACCOUNT.

Reviewed-by: Kevin Tian 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/pages.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 22cc3bb0c6c55a..f8d92c9bb65b60 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -457,7 +457,7 @@ static int batch_iommu_map_small(struct iommu_domain 
*domain,
 
while (size) {
rc = iommu_map(domain, iova, paddr, PAGE_SIZE, prot,
-  GFP_KERNEL);
+  GFP_KERNEL_ACCOUNT);
if (rc)
goto err_unmap;
iova += PAGE_SIZE;
@@ -502,7 +502,7 @@ static int batch_to_domain(struct pfn_batch *batch, struct 
iommu_domain *domain,
rc = iommu_map(domain, iova,
   PFN_PHYS(batch->pfns[cur]) + page_offset,
   next_iova - iova, area->iommu_prot,
-  GFP_KERNEL);
+  GFP_KERNEL_ACCOUNT);
if (rc)
goto err_unmap;
iova = next_iova;
-- 
2.39.0



[Nouveau] [PATCH v3 06/10] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-23 Thread Jason Gunthorpe
This is eventually called by iommufd through intel_iommu_map_pages() and
it should not be forced to be atomic. Push the GFP_ATOMIC to all callers.

Reviewed-by: Kevin Tian 
Reviewed-by: Lu Baolu 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 14 +++---
 drivers/iommu/intel/iommu.h |  2 +-
 drivers/iommu/intel/pasid.c |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 59df7e42fd533c..aa29561d3549b3 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -362,12 +362,12 @@ static int __init intel_iommu_setup(char *str)
 }
 __setup("intel_iommu=", intel_iommu_setup);
 
-void *alloc_pgtable_page(int node)
+void *alloc_pgtable_page(int node, gfp_t gfp)
 {
struct page *page;
void *vaddr = NULL;
 
-   page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+   page = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
if (page)
vaddr = page_address(page);
return vaddr;
@@ -612,7 +612,7 @@ struct context_entry *iommu_context_addr(struct intel_iommu 
*iommu, u8 bus,
if (!alloc)
return NULL;
 
-   context = alloc_pgtable_page(iommu->node);
+   context = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!context)
return NULL;
 
@@ -935,7 +935,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain 
*domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;
 
-   tmp_page = alloc_pgtable_page(domain->nid);
+   tmp_page = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
 
if (!tmp_page)
return NULL;
@@ -1186,7 +1186,7 @@ static int iommu_alloc_root_entry(struct intel_iommu 
*iommu)
 {
struct root_entry *root;
 
-   root = (struct root_entry *)alloc_pgtable_page(iommu->node);
+   root = (struct root_entry *)alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!root) {
pr_err("Allocating root entry for %s failed\n",
iommu->name);
@@ -2676,7 +2676,7 @@ static int copy_context_table(struct intel_iommu *iommu,
if (!old_ce)
goto out;
 
-   new_ce = alloc_pgtable_page(iommu->node);
+   new_ce = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!new_ce)
goto out_unmap;
 
@@ -4136,7 +4136,7 @@ static int md_domain_init(struct dmar_domain *domain, int 
guest_width)
domain->max_addr = 0;
 
/* always allocate the top pgd */
-   domain->pgd = alloc_pgtable_page(domain->nid);
+   domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
if (!domain->pgd)
return -ENOMEM;
domain_flush_cache(domain, domain->pgd, PAGE_SIZE);
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 06e61e4748567a..ca9a035e0110af 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -737,7 +737,7 @@ int qi_submit_sync(struct intel_iommu *iommu, struct 
qi_desc *desc,
 
 extern int dmar_ir_support(void);
 
-void *alloc_pgtable_page(int node);
+void *alloc_pgtable_page(int node, gfp_t gfp);
 void free_pgtable_page(void *vaddr);
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
 struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn);
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index fb3c7020028d07..c5bf74e9372d62 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -200,7 +200,7 @@ static struct pasid_entry *intel_pasid_get_entry(struct 
device *dev, u32 pasid)
 retry:
entries = get_pasid_table_from_pde(&dir[dir_index]);
if (!entries) {
-   entries = alloc_pgtable_page(info->iommu->node);
+   entries = alloc_pgtable_page(info->iommu->node, GFP_ATOMIC);
if (!entries)
return NULL;
 
-- 
2.39.0



[Nouveau] [PATCH v3 10/10] iommu/s390: Use GFP_KERNEL in sleepable contexts

2023-01-23 Thread Jason Gunthorpe
These contexts are sleepable, so use the proper annotation. The GFP_ATOMIC
was added mechanically in the prior patches.

Reviewed-by: Niklas Schnelle 
Reviewed-by: Matthew Rosato 
Signed-off-by: Jason Gunthorpe 
---
 arch/s390/pci/pci_dma.c| 2 +-
 drivers/iommu/s390-iommu.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index 2f6d05d6da4f76..2d9b01d7ca4c5c 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -579,7 +579,7 @@ int zpci_dma_init_device(struct zpci_dev *zdev)
 
spin_lock_init(&zdev->iommu_bitmap_lock);
 
-   zdev->dma_table = dma_alloc_cpu_table(GFP_ATOMIC);
+   zdev->dma_table = dma_alloc_cpu_table(GFP_KERNEL);
if (!zdev->dma_table) {
rc = -ENOMEM;
goto out;
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 654ec4411fe36c..7dcfffed260e6b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -52,7 +52,7 @@ static struct iommu_domain *s390_domain_alloc(unsigned 
domain_type)
if (!s390_domain)
return NULL;
 
-   s390_domain->dma_table = dma_alloc_cpu_table(GFP_ATOMIC);
+   s390_domain->dma_table = dma_alloc_cpu_table(GFP_KERNEL);
if (!s390_domain->dma_table) {
kfree(s390_domain);
return NULL;
-- 
2.39.0



[Nouveau] [PATCH v3 07/10] iommu/intel: Support the gfp argument to the map_pages op

2023-01-23 Thread Jason Gunthorpe
Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and
__domain_mapping().

Reviewed-by: Kevin Tian 
Reviewed-by: Lu Baolu 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index aa29561d3549b3..e95f7703ce7b83 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -908,7 +908,8 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 
source_id,
 #endif
 
 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
- unsigned long pfn, int *target_level)
+ unsigned long pfn, int *target_level,
+ gfp_t gfp)
 {
struct dma_pte *parent, *pte;
int level = agaw_to_level(domain->agaw);
@@ -935,7 +936,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain 
*domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;
 
-   tmp_page = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
+   tmp_page = alloc_pgtable_page(domain->nid, gfp);
 
if (!tmp_page)
return NULL;
@@ -2150,7 +2151,8 @@ static void switch_to_super_page(struct dmar_domain 
*domain,
 
while (start_pfn <= end_pfn) {
if (!pte)
-   pte = pfn_to_dma_pte(domain, start_pfn, &level);
+   pte = pfn_to_dma_pte(domain, start_pfn, &level,
+GFP_ATOMIC);
 
if (dma_pte_present(pte)) {
dma_pte_free_pagetable(domain, start_pfn,
@@ -2172,7 +2174,8 @@ static void switch_to_super_page(struct dmar_domain 
*domain,
 
 static int
 __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
-unsigned long phys_pfn, unsigned long nr_pages, int prot)
+unsigned long phys_pfn, unsigned long nr_pages, int prot,
+gfp_t gfp)
 {
struct dma_pte *first_pte = NULL, *pte = NULL;
unsigned int largepage_lvl = 0;
@@ -2202,7 +2205,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned 
long iov_pfn,
largepage_lvl = hardware_largepage_caps(domain, iov_pfn,
phys_pfn, nr_pages);
 
-   pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl);
+   pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl,
+gfp);
if (!pte)
return -ENOMEM;
first_pte = pte;
@@ -2368,7 +2372,7 @@ static int iommu_domain_identity_map(struct dmar_domain 
*domain,
 
return __domain_mapping(domain, first_vpfn,
first_vpfn, last_vpfn - first_vpfn + 1,
-   DMA_PTE_READ|DMA_PTE_WRITE);
+   DMA_PTE_READ|DMA_PTE_WRITE, GFP_ATOMIC);
 }
 
 static int md_domain_init(struct dmar_domain *domain, int guest_width);
@@ -4298,7 +4302,7 @@ static int intel_iommu_map(struct iommu_domain *domain,
   the low bits of hpa would take us onto the next page */
size = aligned_nrpages(hpa, size);
return __domain_mapping(dmar_domain, iova >> VTD_PAGE_SHIFT,
-   hpa >> VTD_PAGE_SHIFT, size, prot);
+   hpa >> VTD_PAGE_SHIFT, size, prot, gfp);
 }
 
 static int intel_iommu_map_pages(struct iommu_domain *domain,
@@ -4333,7 +4337,8 @@ static size_t intel_iommu_unmap(struct iommu_domain 
*domain,
 
/* Cope with horrid API which requires us to unmap more than the
   size argument if it happens to be a large-page mapping. */
-   BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level));
+   BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
+  GFP_ATOMIC));
 
if (size < VTD_PAGE_SIZE << level_to_offset_bits(level))
size = VTD_PAGE_SIZE << level_to_offset_bits(level);
@@ -4392,7 +4397,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct 
iommu_domain *domain,
int level = 0;
u64 phys = 0;
 
-   pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level);
+   pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
+GFP_ATOMIC);
if (pte && dma_pte_present(pte))
phys = dma_pte_addr(pte) +
(iova & (BIT_MASK(level_to_offset_bits(level) +
-- 
2.39.0



[Nouveau] [PATCH v3 02/10] iommu: Remove iommu_map_atomic()

2023-01-23 Thread Jason Gunthorpe
There is only one call site and it can now just pass the GFP_ATOMIC to the
normal iommu_map().

Reviewed-by: Kevin Tian 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 2 +-
 drivers/iommu/iommu.c | 7 ---
 include/linux/iommu.h | 9 -
 3 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8bdb65e7686ff9..7016db569f81fc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -713,7 +713,7 @@ static dma_addr_t __iommu_dma_map(struct device *dev, 
phys_addr_t phys,
if (!iova)
return DMA_MAPPING_ERROR;
 
-   if (iommu_map_atomic(domain, iova, phys - iova_off, size, prot)) {
+   if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7dac062b58f039..9412b420d07257 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2381,13 +2381,6 @@ int iommu_map(struct iommu_domain *domain, unsigned long 
iova,
 }
 EXPORT_SYMBOL_GPL(iommu_map);
 
-int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot)
-{
-   return iommu_map(domain, iova, paddr, size, prot, GFP_ATOMIC);
-}
-EXPORT_SYMBOL_GPL(iommu_map_atomic);
-
 static size_t __iommu_unmap_pages(struct iommu_domain *domain,
  unsigned long iova, size_t size,
  struct iommu_iotlb_gather *iotlb_gather)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d2020994f292db..521cd79700f4d8 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -468,8 +468,6 @@ extern struct iommu_domain *iommu_get_domain_for_dev(struct 
device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
-extern int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
-   phys_addr_t paddr, size_t size, int prot);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
@@ -778,13 +776,6 @@ static inline int iommu_map(struct iommu_domain *domain, 
unsigned long iova,
return -ENODEV;
 }
 
-static inline int iommu_map_atomic(struct iommu_domain *domain,
-  unsigned long iova, phys_addr_t paddr,
-  size_t size, int prot)
-{
-   return -ENODEV;
-}
-
 static inline size_t iommu_unmap(struct iommu_domain *domain,
 unsigned long iova, size_t size)
 {
-- 
2.39.0



[Nouveau] [PATCH v3 00/10] Let iommufd charge IOPTE allocations to the memory cgroup

2023-01-23 Thread Jason Gunthorpe
iommufd follows the same design as KVM and uses memory cgroups to limit
the amount of kernel memory an iommufd file descriptor can pin down. The
various internal data structures already use GFP_KERNEL_ACCOUNT to charge
their own memory.

However, one of the biggest consumers of kernel memory is the IOPTEs
stored under the iommu_domain and these allocations are not tracked.

This series is the first step in fixing it.

The iommu driver contract already includes a 'gfp' argument to the
map_pages op. Allowing iommufd to specify GFP_KERNEL_ACCOUNT, and having
the driver allocate the IOPTE tables with that flag, captures a
significant amount of the allocations.

Update the iommu_map() API to pass in the GFP argument, and fix all call
sites. Replace iommu_map_atomic().

Audit the "enterprise" iommu drivers to make sure they do the right thing.
Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is
problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are
already correct.

A follow up series will be needed to capture the allocations made when the
iommu_domain itself is allocated, which will complete the job.
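
To make the end-to-end flow concrete, here is a minimal sketch (hypothetical
"foo" driver names, not code from this series) of how the gfp is meant to
travel from an iommufd-style caller down to the IOPTE allocation once the
series is applied:

#include <linux/gfp.h>
#include <linux/iommu.h>

static int foo_map_pages(struct iommu_domain *domain, unsigned long iova,
			 phys_addr_t paddr, size_t pgsize, size_t pgcount,
			 int prot, gfp_t gfp, size_t *mapped)
{
	/* IOPTE table memory: allocated with the caller's gfp instead of a
	 * hardwired GFP_ATOMIC, so __GFP_ACCOUNT charges the caller's memcg. */
	u64 *table = (u64 *)get_zeroed_page(gfp);

	if (!table)
		return -ENOMEM;
	/* ... hook the table into the domain and write the PTEs ... */
	*mapped = pgsize * pgcount;
	return 0;
}

/* iommufd-style caller on a syscall path: the gfp above is now this one. */
static int foo_account_map(struct iommu_domain *domain, unsigned long iova,
			   phys_addr_t paddr, size_t size, int prot)
{
	return iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL_ACCOUNT);
}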

v3:
 - Leave a GFP_ATOMIC in "Add a gfp parameter to iommu_map_sg()"
   and move the conversion to gfp argument to "Use the gfp parameter in
   __iommu_dma_alloc_noncontiguous()"
 - Mask off the zone/policy flags from gfp before doing internal
   allocations and add a comment about Robin's note that this is to keep
   the buffer and internal allocations separate.
v2: https://lore.kernel.org/r/0-v2-ce66f632bd0d+484-iommu_map_gfp_...@nvidia.com
 - Prohibit bad GFP flags in the iommu wrappers
 - Split out the new GFP_KERNEL usages into dedicated patches so it is
   easier to check. No code change after the full series
v1: https://lore.kernel.org/r/0-v1-6e8b3997c46d+89e-iommu_map_gfp_...@nvidia.com

Jason Gunthorpe (10):
  iommu: Add a gfp parameter to iommu_map()
  iommu: Remove iommu_map_atomic()
  iommu: Add a gfp parameter to iommu_map_sg()
  iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()
  iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()
  iommu/intel: Add a gfp parameter to alloc_pgtable_page()
  iommu/intel: Support the gfp argument to the map_pages op
  iommu/intel: Use GFP_KERNEL in sleepable contexts
  iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s
  iommu/s390: Use GFP_KERNEL in sleepable contexts

 arch/arm/mm/dma-mapping.c | 11 ++--
 arch/s390/include/asm/pci_dma.h   |  5 +-
 arch/s390/pci/pci_dma.c   | 31 ++-
 .../drm/nouveau/nvkm/subdev/instmem/gk20a.c   |  3 +-
 drivers/gpu/drm/tegra/drm.c   |  2 +-
 drivers/gpu/host1x/cdma.c |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c  |  4 +-
 drivers/iommu/dma-iommu.c | 18 +--
 drivers/iommu/intel/iommu.c   | 36 +++--
 drivers/iommu/intel/iommu.h   |  2 +-
 drivers/iommu/intel/pasid.c   |  2 +-
 drivers/iommu/iommu.c | 53 +++
 drivers/iommu/iommufd/pages.c |  6 ++-
 drivers/iommu/s390-iommu.c| 15 +++---
 drivers/media/platform/qcom/venus/firmware.c  |  2 +-
 drivers/net/ipa/ipa_mem.c |  6 ++-
 drivers/net/wireless/ath/ath10k/snoc.c|  2 +-
 drivers/net/wireless/ath/ath11k/ahb.c |  4 +-
 drivers/remoteproc/remoteproc_core.c  |  5 +-
 drivers/vfio/vfio_iommu_type1.c   |  9 ++--
 drivers/vhost/vdpa.c  |  2 +-
 include/linux/iommu.h | 31 +++
 22 files changed, 126 insertions(+), 125 deletions(-)


base-commit: 5dc4c995db9eb45f6373a956eb1f69460e69e6d4
-- 
2.39.0



[Nouveau] [PATCH v3 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

2023-01-23 Thread Jason Gunthorpe
This function does an allocation of a buffer to return to the caller and
then goes on to allocate some internal memory, e.g. the scatterlist and
IOPTEs.

Instead of hard wiring GFP_KERNEL and a wrong GFP_ATOMIC, continue to use
the passed in gfp flags for all of the allocations. Clear the zone and
policy bits that are only relevant for the buffer allocation before
re-using them for internal allocations.

Auditing says this is never called from an atomic context, so the
GFP_ATOMIC is the incorrect flag.

Reviewed-by: Kevin Tian 
Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 72cfa24503b8bc..c99e4bc55d8cb0 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -822,7 +822,14 @@ static struct page 
**__iommu_dma_alloc_noncontiguous(struct device *dev,
if (!iova)
goto out_free_pages;
 
-   if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
+   /*
+* Remove the zone/policy flags from the GFP - these are applied to the
+* __iommu_dma_alloc_pages() but are not used for the supporting
+* internal allocations that follow.
+*/
+   gfp &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM | __GFP_COMP);
+
+   if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, gfp))
goto out_free_iova;
 
if (!(ioprot & IOMMU_CACHE)) {
@@ -834,7 +841,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
}
 
ret = iommu_map_sg(domain, iova, sgt->sgl, sgt->orig_nents, ioprot,
-  GFP_ATOMIC);
+  gfp);
if (ret < 0 || ret < size)
goto out_free_sg;
 
-- 
2.39.0



Re: [Nouveau] [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

2023-01-23 Thread Jason Gunthorpe
On Fri, Jan 20, 2023 at 07:28:19PM +, Robin Murphy wrote:

> Overall I'm starting to wonder if it might not be better to stick a "use
> GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of the
> API internals to pick up as appropriate, rather than propagate per-call gfp
> flags everywhere. 

I was thinking about this some more, and I don't think hiding the
GFP_KERNEL_ACCOUNT in the iommu driver will be very maintainable.

The GFP_KERNEL_ACCOUNT is sensitive to current since that is where it
gets the cgroup from; if we start putting it in driver code directly
it becomes very hard to understand whether the call chains are actually
originating from a syscall or not. I'd prefer we try to keep things so
that iommufd provides the GFP_KERNEL_ACCOUNT on a call-by-call basis
where it is clearer what call chains originate from a system call vs
not.

So, I think we will strive for adding a gfp flag to the future 'alloc
domain iommufd' and pass GFP_KERNEL_ACCOUNT there. Then we can see
what is left.
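
For illustration (generic names, not from any posted patch), the distinction
reads roughly like this -- both calls reach the same driver code, but only
the first one is charged, and only because current is the right task there:

#include <linux/gfp.h>
#include <linux/iommu.h>

/* GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT; __GFP_ACCOUNT charges
 * the allocation to the memory cgroup of current, i.e. the running task. */

/* ioctl/syscall path: current is the userspace task that should pay. */
static int example_map_from_ioctl(struct iommu_domain *domain,
				  unsigned long iova, phys_addr_t paddr,
				  size_t size, int prot)
{
	return iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL_ACCOUNT);
}

/* kernel-internal path: current is arbitrary, so accounting means nothing. */
static int example_map_internal(struct iommu_domain *domain,
				unsigned long iova, phys_addr_t paddr,
				size_t size, int prot)
{
	return iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL);
}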

Jason


Re: [Nouveau] [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

2023-01-20 Thread Jason Gunthorpe
On Fri, Jan 20, 2023 at 07:28:19PM +, Robin Murphy wrote:
> On 2023-01-18 18:00, Jason Gunthorpe wrote:
> > Change the sg_alloc_table_from_pages() allocation that was hardwired to
> > GFP_KERNEL to use the gfp parameter like the other allocations in this
> > function.
> > 
> > Auditing says this is never called from an atomic context, so it is safe
> > as is, but reads wrong.
> 
> I think the point may have been that the sgtable metadata is a
> logically-distinct allocation from the buffer pages themselves. Much like
> the allocation of the pages array itself further down in
> __iommu_dma_alloc_pages().

That makes sense, and it is a good reason to mask off the allocation
policy flags from the gfp.

On the other hand it also makes sense to continue to pass in things
like NOWAIT|NOWARN to all the allocations. Even to the iommu driver.

So I'd prefer to change this to mask and make all the following calls
consistently use the input gfp

> I'd say the more confusing thing about this particular context is why we're
> using iommu_map_sg_atomic() further down - that seems to have been an
> oversight in 781ca2de89ba, since this particular path has never supported
> being called in atomic context.

Huh. I had fixed that in v1, this patch was supposed to have that
hunk, that was the main point of making this patch actually..

> Overall I'm starting to wonder if it might not be better to stick a "use
> GFP_KERNEL_ACCOUNT if you allocate" flag in the domain for any level of the
> API internals to pick up as appropriate, rather than propagate per-call gfp
> flags everywhere. 

We might get to something like that, but it requires more parts that
are not ready yet. Most likely this would take the form of some kind
of 'this is an iommufd created domain' indication. This happens
naturally as part of the nesting patches.

Right now I want to get people to start testing with this because the
charge from the IOPTEs is far and away the largest memory draw.  Parts
like fixing the iommu drivers to actually use gfp are necessary to
make it work.

If we flip the two places using KERNEL_ACCOUNT to something else later
it doesn't really matter. I think the removal of the two _atomic
wrappers is still appropriate stand-alone.

> As it stands we're still missing potential pagetable and other
> domain-related allocations by drivers in .attach_dev and even (in

Yes, I plan to get to those when we add an alloc_domain_iommufd() or
whatever op. The driver will know the calling context and can set the
gfp flags for any allocations under alloc_domain at that time.

Then we can go and figure out if there are other allocations and if
all or only some drivers need a flag - eg at attach time. Though this
is less worrying because you can only scale attach up to num_pasids *
num open vfios.

iommufd will let userspace create and populate an unlimited number of
iommu_domains, so everything linked to an unattached iommu_domain
should be charged.

> probably-shouldn't-really-happen cases) .unmap_pages...

Gah, unmap_pages isn't allowed to fail. There is no way to recover from
this. iommufd will spew a warn and then have a small race where
userspace can UAF kernel memory.

I'd call such a driver implementation broken. Why would you need to do
this?? :(

Thanks,
Jason


Re: [Nouveau] [PATCH 1/8] iommu: Add a gfp parameter to iommu_map()

2023-01-20 Thread Jason Gunthorpe
On Fri, Jan 20, 2023 at 10:24:55AM +0100, Joerg Roedel wrote:
> On Fri, Jan 06, 2023 at 01:24:11PM -0400, Jason Gunthorpe wrote:
> > I think it is just better to follow kernel convention and have
> > allocation functions include the GFP because it is a clear signal to
> > the user that there is an allocation hidden inside the API. The whole
> > point of gfp is not to have multitudes of every function for every
> > allocation mode.
> 
> Well, having GFP parameters is not a strict kernel convention. There are
> places doing it differently and have sleeping and atomic variants of
> APIs. I have to say I like the latter more. But given that this leads to
> an invasion of API functions here which all do the same under the hood, I
> agree it is better to go with a GFP parameter here.

Ok, I think we are done with this series, I'll stick it in linux-next
for a bit and send you a PR so the trees stay in sync

Thanks,
Jason


[Nouveau] [PATCH v2 00/10] Let iommufd charge IOPTE allocations to the memory cgroup

2023-01-18 Thread Jason Gunthorpe
iommufd follows the same design as KVM and uses memory cgroups to limit
the amount of kernel memory an iommufd file descriptor can pin down. The
various internal data structures already use GFP_KERNEL_ACCOUNT to charge
their own memory.

However, one of the biggest consumers of kernel memory is the IOPTEs
stored under the iommu_domain and these allocations are not tracked.

This series is the first step in fixing it.

The iommu driver contract already includes a 'gfp' argument to the
map_pages op. Allowing iommufd to specify GFP_KERNEL_ACCOUNT, and having
the driver allocate the IOPTE tables with that flag, captures a
significant amount of the allocations.

Update the iommu_map() API to pass in the GFP argument, and fix all call
sites. Replace iommu_map_atomic().

Audit the "enterprise" iommu drivers to make sure they do the right thing.
Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is
problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are
already correct.

A follow up series will be needed to capture the allocations made when the
iommu_domain itself is allocated, which will complete the job.
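
For callers the end result is simply that the allocation mode is explicit
at every site. For example (example_map_charged() is just an illustrative
wrapper, not something added by the series), a sleepable path that wants
its IOPTEs charged to the memory cgroup does:

	static int example_map_charged(struct iommu_domain *domain,
				       unsigned long iova, phys_addr_t paddr,
				       size_t size)
	{
		return iommu_map(domain, iova, paddr, size,
				 IOMMU_READ | IOMMU_WRITE, GFP_KERNEL_ACCOUNT);
	}

while atomic contexts keep passing GFP_ATOMIC, as dma-iommu does.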

v2:
 - Prohibit bad GFP flags in the iommu wrappers
 - Split out the new GFP_KERNEL usages into dedicated patches so it is
   easier to check. No code change after the full series
v1: https://lore.kernel.org/r/0-v1-6e8b3997c46d+89e-iommu_map_gfp_...@nvidia.com

Jason Gunthorpe (10):
  iommu: Add a gfp parameter to iommu_map()
  iommu: Remove iommu_map_atomic()
  iommu: Add a gfp parameter to iommu_map_sg()
  iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()
  iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()
  iommu/intel: Add a gfp parameter to alloc_pgtable_page()
  iommu/intel: Support the gfp argument to the map_pages op
  iommu/intel: Use GFP_KERNEL in sleepable contexts
  iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s
  iommu/s390: Use GFP_KERNEL in sleepable contexts

 arch/arm/mm/dma-mapping.c | 11 ++--
 arch/s390/include/asm/pci_dma.h   |  5 +-
 arch/s390/pci/pci_dma.c   | 31 ++-
 .../drm/nouveau/nvkm/subdev/instmem/gk20a.c   |  3 +-
 drivers/gpu/drm/tegra/drm.c   |  2 +-
 drivers/gpu/host1x/cdma.c |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c  |  4 +-
 drivers/iommu/dma-iommu.c | 11 ++--
 drivers/iommu/intel/iommu.c   | 36 +++--
 drivers/iommu/intel/iommu.h   |  2 +-
 drivers/iommu/intel/pasid.c   |  2 +-
 drivers/iommu/iommu.c | 53 +++
 drivers/iommu/iommufd/pages.c |  6 ++-
 drivers/iommu/s390-iommu.c| 15 +++---
 drivers/media/platform/qcom/venus/firmware.c  |  2 +-
 drivers/net/ipa/ipa_mem.c |  6 ++-
 drivers/net/wireless/ath/ath10k/snoc.c|  2 +-
 drivers/net/wireless/ath/ath11k/ahb.c |  4 +-
 drivers/remoteproc/remoteproc_core.c  |  5 +-
 drivers/vfio/vfio_iommu_type1.c   |  9 ++--
 drivers/vhost/vdpa.c  |  2 +-
 include/linux/iommu.h | 31 +++
 22 files changed, 119 insertions(+), 125 deletions(-)


base-commit: 5dc4c995db9eb45f6373a956eb1f69460e69e6d4
-- 
2.39.0



[Nouveau] [PATCH v2 07/10] iommu/intel: Support the gfp argument to the map_pages op

2023-01-18 Thread Jason Gunthorpe
Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and
__domain_mapping().

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index aa29561d3549b3..e95f7703ce7b83 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -908,7 +908,8 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 
source_id,
 #endif
 
 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
- unsigned long pfn, int *target_level)
+ unsigned long pfn, int *target_level,
+ gfp_t gfp)
 {
struct dma_pte *parent, *pte;
int level = agaw_to_level(domain->agaw);
@@ -935,7 +936,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain 
*domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;
 
-   tmp_page = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
+   tmp_page = alloc_pgtable_page(domain->nid, gfp);
 
if (!tmp_page)
return NULL;
@@ -2150,7 +2151,8 @@ static void switch_to_super_page(struct dmar_domain 
*domain,
 
while (start_pfn <= end_pfn) {
if (!pte)
-   pte = pfn_to_dma_pte(domain, start_pfn, &level);
+   pte = pfn_to_dma_pte(domain, start_pfn, &level,
+GFP_ATOMIC);
 
if (dma_pte_present(pte)) {
dma_pte_free_pagetable(domain, start_pfn,
@@ -2172,7 +2174,8 @@ static void switch_to_super_page(struct dmar_domain 
*domain,
 
 static int
 __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
-unsigned long phys_pfn, unsigned long nr_pages, int prot)
+unsigned long phys_pfn, unsigned long nr_pages, int prot,
+gfp_t gfp)
 {
struct dma_pte *first_pte = NULL, *pte = NULL;
unsigned int largepage_lvl = 0;
@@ -2202,7 +2205,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned 
long iov_pfn,
largepage_lvl = hardware_largepage_caps(domain, iov_pfn,
phys_pfn, nr_pages);
 
-   pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl);
+   pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl,
+gfp);
if (!pte)
return -ENOMEM;
first_pte = pte;
@@ -2368,7 +2372,7 @@ static int iommu_domain_identity_map(struct dmar_domain 
*domain,
 
return __domain_mapping(domain, first_vpfn,
first_vpfn, last_vpfn - first_vpfn + 1,
-   DMA_PTE_READ|DMA_PTE_WRITE);
+   DMA_PTE_READ|DMA_PTE_WRITE, GFP_ATOMIC);
 }
 
 static int md_domain_init(struct dmar_domain *domain, int guest_width);
@@ -4298,7 +4302,7 @@ static int intel_iommu_map(struct iommu_domain *domain,
   the low bits of hpa would take us onto the next page */
size = aligned_nrpages(hpa, size);
return __domain_mapping(dmar_domain, iova >> VTD_PAGE_SHIFT,
-   hpa >> VTD_PAGE_SHIFT, size, prot);
+   hpa >> VTD_PAGE_SHIFT, size, prot, gfp);
 }
 
 static int intel_iommu_map_pages(struct iommu_domain *domain,
@@ -4333,7 +4337,8 @@ static size_t intel_iommu_unmap(struct iommu_domain 
*domain,
 
/* Cope with horrid API which requires us to unmap more than the
   size argument if it happens to be a large-page mapping. */
-   BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level));
+   BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
+  GFP_ATOMIC));
 
if (size < VTD_PAGE_SIZE << level_to_offset_bits(level))
size = VTD_PAGE_SIZE << level_to_offset_bits(level);
@@ -4392,7 +4397,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct 
iommu_domain *domain,
int level = 0;
u64 phys = 0;
 
-   pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level);
+   pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
+GFP_ATOMIC);
if (pte && dma_pte_present(pte))
phys = dma_pte_addr(pte) +
(iova & (BIT_MASK(level_to_offset_bits(level) +
-- 
2.39.0



[Nouveau] [PATCH v2 01/10] iommu: Add a gfp parameter to iommu_map()

2023-01-18 Thread Jason Gunthorpe
The internal mechanisms support this, but instead of exposing the gfp to
the caller it is wrapped inside iommu_map() and iommu_map_atomic().

Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.

Signed-off-by: Jason Gunthorpe 
---
 arch/arm/mm/dma-mapping.c | 11 ++
 .../drm/nouveau/nvkm/subdev/instmem/gk20a.c   |  3 ++-
 drivers/gpu/drm/tegra/drm.c   |  2 +-
 drivers/gpu/host1x/cdma.c |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c  |  4 ++--
 drivers/iommu/dma-iommu.c |  2 +-
 drivers/iommu/iommu.c | 22 +--
 drivers/iommu/iommufd/pages.c |  6 +++--
 drivers/media/platform/qcom/venus/firmware.c  |  2 +-
 drivers/net/ipa/ipa_mem.c |  6 +++--
 drivers/net/wireless/ath/ath10k/snoc.c|  2 +-
 drivers/net/wireless/ath/ath11k/ahb.c |  4 ++--
 drivers/remoteproc/remoteproc_core.c  |  5 +++--
 drivers/vfio/vfio_iommu_type1.c   |  9 
 drivers/vhost/vdpa.c  |  2 +-
 include/linux/iommu.h |  4 ++--
 16 files changed, 48 insertions(+), 38 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index c135f6e37a00ca..8bc01071474ab7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -984,7 +984,8 @@ __iommu_create_mapping(struct device *dev, struct page 
**pages, size_t size,
 
len = (j - i) << PAGE_SHIFT;
ret = iommu_map(mapping->domain, iova, phys, len,
-   __dma_info_to_prot(DMA_BIDIRECTIONAL, attrs));
+   __dma_info_to_prot(DMA_BIDIRECTIONAL, attrs),
+   GFP_KERNEL);
if (ret < 0)
goto fail;
iova += len;
@@ -1207,7 +1208,8 @@ static int __map_sg_chunk(struct device *dev, struct 
scatterlist *sg,
 
prot = __dma_info_to_prot(dir, attrs);
 
-   ret = iommu_map(mapping->domain, iova, phys, len, prot);
+   ret = iommu_map(mapping->domain, iova, phys, len, prot,
+   GFP_KERNEL);
if (ret < 0)
goto fail;
count += len >> PAGE_SHIFT;
@@ -1379,7 +1381,8 @@ static dma_addr_t arm_iommu_map_page(struct device *dev, 
struct page *page,
 
prot = __dma_info_to_prot(dir, attrs);
 
-   ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len, 
prot);
+   ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len,
+   prot, GFP_KERNEL);
if (ret < 0)
goto fail;
 
@@ -1443,7 +1446,7 @@ static dma_addr_t arm_iommu_map_resource(struct device 
*dev,
 
prot = __dma_info_to_prot(dir, attrs) | IOMMU_MMIO;
 
-   ret = iommu_map(mapping->domain, dma_addr, addr, len, prot);
+   ret = iommu_map(mapping->domain, dma_addr, addr, len, prot, GFP_KERNEL);
if (ret < 0)
goto fail;
 
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c 
b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
index 648ecf5a8fbc2a..a4ac94a2ab57fc 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
@@ -475,7 +475,8 @@ gk20a_instobj_ctor_iommu(struct gk20a_instmem *imem, u32 
npages, u32 align,
u32 offset = (r->offset + i) << imem->iommu_pgshift;
 
ret = iommu_map(imem->domain, offset, node->dma_addrs[i],
-   PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
+   PAGE_SIZE, IOMMU_READ | IOMMU_WRITE,
+   GFP_KERNEL);
if (ret < 0) {
nvkm_error(subdev, "IOMMU mapping failure: %d\n", ret);
 
diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 7bd2e65c2a16c5..6ca9f396e55be4 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -1057,7 +1057,7 @@ void *tegra_drm_alloc(struct tegra_drm *tegra, size_t 
size, dma_addr_t *dma)
 
*dma = iova_dma_addr(&tegra->carveout.domain, alloc);
err = iommu_map(tegra->domain, *dma, virt_to_phys(virt),
-   size, IOMMU_READ | IOMMU_WRITE);
+   size, IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
if (err < 0)
goto free_iova;
 
diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c
index 103fda055394ab..4ddfcd2138c95b 100644
--- a/drivers/gpu/host1x/cdma.c
+++ b/drivers/gpu/host1x/cdma.c
@@ -105,7 +105,7 @@ static int host1x_pushbuffer_init(struct push_buffer *pb)
 
pb->dma = iova_dma_addr(&host1x->iova, alloc);
err = iommu_map(host1x->domain, pb->dma

[Nouveau] [PATCH v2 04/10] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

2023-01-18 Thread Jason Gunthorpe
Change the sg_alloc_table_from_pages() allocation that was hardwired to
GFP_KERNEL to use the gfp parameter like the other allocations in this
function.

Auditing says this is never called from an atomic context, so it is safe
as is, but reads wrong.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8c2788633c1766..e4bf1bb159f7c7 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -822,7 +822,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
if (!iova)
goto out_free_pages;
 
-   if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
+   if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, gfp))
goto out_free_iova;
 
if (!(ioprot & IOMMU_CACHE)) {
-- 
2.39.0



[Nouveau] [PATCH v2 06/10] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-18 Thread Jason Gunthorpe
This is eventually called by iommufd through intel_iommu_map_pages() and
it should not be forced to atomic. Push the GFP_ATOMIC to all callers.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 14 +++---
 drivers/iommu/intel/iommu.h |  2 +-
 drivers/iommu/intel/pasid.c |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 59df7e42fd533c..aa29561d3549b3 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -362,12 +362,12 @@ static int __init intel_iommu_setup(char *str)
 }
 __setup("intel_iommu=", intel_iommu_setup);
 
-void *alloc_pgtable_page(int node)
+void *alloc_pgtable_page(int node, gfp_t gfp)
 {
struct page *page;
void *vaddr = NULL;
 
-   page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+   page = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
if (page)
vaddr = page_address(page);
return vaddr;
@@ -612,7 +612,7 @@ struct context_entry *iommu_context_addr(struct intel_iommu 
*iommu, u8 bus,
if (!alloc)
return NULL;
 
-   context = alloc_pgtable_page(iommu->node);
+   context = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!context)
return NULL;
 
@@ -935,7 +935,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain 
*domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;
 
-   tmp_page = alloc_pgtable_page(domain->nid);
+   tmp_page = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
 
if (!tmp_page)
return NULL;
@@ -1186,7 +1186,7 @@ static int iommu_alloc_root_entry(struct intel_iommu 
*iommu)
 {
struct root_entry *root;
 
-   root = (struct root_entry *)alloc_pgtable_page(iommu->node);
+   root = (struct root_entry *)alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!root) {
pr_err("Allocating root entry for %s failed\n",
iommu->name);
@@ -2676,7 +2676,7 @@ static int copy_context_table(struct intel_iommu *iommu,
if (!old_ce)
goto out;
 
-   new_ce = alloc_pgtable_page(iommu->node);
+   new_ce = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!new_ce)
goto out_unmap;
 
@@ -4136,7 +4136,7 @@ static int md_domain_init(struct dmar_domain *domain, int 
guest_width)
domain->max_addr = 0;
 
/* always allocate the top pgd */
-   domain->pgd = alloc_pgtable_page(domain->nid);
+   domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
if (!domain->pgd)
return -ENOMEM;
domain_flush_cache(domain, domain->pgd, PAGE_SIZE);
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 06e61e4748567a..ca9a035e0110af 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -737,7 +737,7 @@ int qi_submit_sync(struct intel_iommu *iommu, struct 
qi_desc *desc,
 
 extern int dmar_ir_support(void);
 
-void *alloc_pgtable_page(int node);
+void *alloc_pgtable_page(int node, gfp_t gfp);
 void free_pgtable_page(void *vaddr);
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
 struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn);
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index fb3c7020028d07..c5bf74e9372d62 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -200,7 +200,7 @@ static struct pasid_entry *intel_pasid_get_entry(struct 
device *dev, u32 pasid)
 retry:
entries = get_pasid_table_from_pde(&dir[dir_index]);
if (!entries) {
-   entries = alloc_pgtable_page(info->iommu->node);
+   entries = alloc_pgtable_page(info->iommu->node, GFP_ATOMIC);
if (!entries)
return NULL;
 
-- 
2.39.0



[Nouveau] [PATCH v2 10/10] iommu/s390: Use GFP_KERNEL in sleepable contexts

2023-01-18 Thread Jason Gunthorpe
These contexts are sleepable, so use the proper annotation. The GFP_ATOMIC
was added mechanically in the prior patches.

Reviewed-by: Niklas Schnelle 
Signed-off-by: Jason Gunthorpe 
---
 arch/s390/pci/pci_dma.c| 2 +-
 drivers/iommu/s390-iommu.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index 2f6d05d6da4f76..2d9b01d7ca4c5c 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -579,7 +579,7 @@ int zpci_dma_init_device(struct zpci_dev *zdev)
 
spin_lock_init(&zdev->iommu_bitmap_lock);
 
-   zdev->dma_table = dma_alloc_cpu_table(GFP_ATOMIC);
+   zdev->dma_table = dma_alloc_cpu_table(GFP_KERNEL);
if (!zdev->dma_table) {
rc = -ENOMEM;
goto out;
diff --git a/drivers/iommu/s390-iommu.c b/drivers/iommu/s390-iommu.c
index 654ec4411fe36c..7dcfffed260e6b 100644
--- a/drivers/iommu/s390-iommu.c
+++ b/drivers/iommu/s390-iommu.c
@@ -52,7 +52,7 @@ static struct iommu_domain *s390_domain_alloc(unsigned 
domain_type)
if (!s390_domain)
return NULL;
 
-   s390_domain->dma_table = dma_alloc_cpu_table(GFP_ATOMIC);
+   s390_domain->dma_table = dma_alloc_cpu_table(GFP_KERNEL);
if (!s390_domain->dma_table) {
kfree(s390_domain);
return NULL;
-- 
2.39.0



[Nouveau] [PATCH v2 05/10] iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()

2023-01-18 Thread Jason Gunthorpe
iommufd follows the same design as KVM and uses memory cgroups to limit
the amount of kernel memory an iommufd file descriptor can pin down. The
various internal data structures already use GFP_KERNEL_ACCOUNT.

However, one of the biggest consumers of kernel memory is the IOPTEs
stored under the iommu_domain. Many drivers will allocate these at
iommu_map() time and will trivially do the right thing if we pass in
GFP_KERNEL_ACCOUNT.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/pages.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 22cc3bb0c6c55a..f8d92c9bb65b60 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -457,7 +457,7 @@ static int batch_iommu_map_small(struct iommu_domain 
*domain,
 
while (size) {
rc = iommu_map(domain, iova, paddr, PAGE_SIZE, prot,
-  GFP_KERNEL);
+  GFP_KERNEL_ACCOUNT);
if (rc)
goto err_unmap;
iova += PAGE_SIZE;
@@ -502,7 +502,7 @@ static int batch_to_domain(struct pfn_batch *batch, struct 
iommu_domain *domain,
rc = iommu_map(domain, iova,
   PFN_PHYS(batch->pfns[cur]) + page_offset,
   next_iova - iova, area->iommu_prot,
-  GFP_KERNEL);
+  GFP_KERNEL_ACCOUNT);
if (rc)
goto err_unmap;
iova = next_iova;
-- 
2.39.0



[Nouveau] [PATCH v2 09/10] iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s

2023-01-18 Thread Jason Gunthorpe
dma_alloc_cpu_table() and dma_alloc_page_table() are eventually called by
iommufd through s390_iommu_map_pages() and it should not be forced to
atomic. Thread the gfp parameter through the call chain starting from
s390_iommu_map_pages().

Reviewed-by: Niklas Schnelle 
Signed-off-by: Jason Gunthorpe 
---
 arch/s390/include/asm/pci_dma.h |  5 +++--
 arch/s390/pci/pci_dma.c | 31 +--
 drivers/iommu/s390-iommu.c  | 15 +--
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h
index 91e63426bdc53f..7119c04c51c5c8 100644
--- a/arch/s390/include/asm/pci_dma.h
+++ b/arch/s390/include/asm/pci_dma.h
@@ -186,9 +186,10 @@ static inline unsigned long *get_st_pto(unsigned long 
entry)
 
 /* Prototypes */
 void dma_free_seg_table(unsigned long);
-unsigned long *dma_alloc_cpu_table(void);
+unsigned long *dma_alloc_cpu_table(gfp_t gfp);
 void dma_cleanup_tables(unsigned long *);
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr);
+unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr,
+ gfp_t gfp);
 void dma_update_cpu_trans(unsigned long *entry, phys_addr_t page_addr, int 
flags);
 
 extern const struct dma_map_ops s390_pci_dma_ops;
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index ea478d11fbd132..2f6d05d6da4f76 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -27,11 +27,11 @@ static int zpci_refresh_global(struct zpci_dev *zdev)
  zdev->iommu_pages * PAGE_SIZE);
 }
 
-unsigned long *dma_alloc_cpu_table(void)
+unsigned long *dma_alloc_cpu_table(gfp_t gfp)
 {
unsigned long *table, *entry;
 
-   table = kmem_cache_alloc(dma_region_table_cache, GFP_ATOMIC);
+   table = kmem_cache_alloc(dma_region_table_cache, gfp);
if (!table)
return NULL;
 
@@ -45,11 +45,11 @@ static void dma_free_cpu_table(void *table)
kmem_cache_free(dma_region_table_cache, table);
 }
 
-static unsigned long *dma_alloc_page_table(void)
+static unsigned long *dma_alloc_page_table(gfp_t gfp)
 {
unsigned long *table, *entry;
 
-   table = kmem_cache_alloc(dma_page_table_cache, GFP_ATOMIC);
+   table = kmem_cache_alloc(dma_page_table_cache, gfp);
if (!table)
return NULL;
 
@@ -63,7 +63,7 @@ static void dma_free_page_table(void *table)
kmem_cache_free(dma_page_table_cache, table);
 }
 
-static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
+static unsigned long *dma_get_seg_table_origin(unsigned long *rtep, gfp_t gfp)
 {
unsigned long old_rte, rte;
unsigned long *sto;
@@ -72,7 +72,7 @@ static unsigned long *dma_get_seg_table_origin(unsigned long 
*rtep)
if (reg_entry_isvalid(rte)) {
sto = get_rt_sto(rte);
} else {
-   sto = dma_alloc_cpu_table();
+   sto = dma_alloc_cpu_table(gfp);
if (!sto)
return NULL;
 
@@ -90,7 +90,7 @@ static unsigned long *dma_get_seg_table_origin(unsigned long 
*rtep)
return sto;
 }
 
-static unsigned long *dma_get_page_table_origin(unsigned long *step)
+static unsigned long *dma_get_page_table_origin(unsigned long *step, gfp_t gfp)
 {
unsigned long old_ste, ste;
unsigned long *pto;
@@ -99,7 +99,7 @@ static unsigned long *dma_get_page_table_origin(unsigned long 
*step)
if (reg_entry_isvalid(ste)) {
pto = get_st_pto(ste);
} else {
-   pto = dma_alloc_page_table();
+   pto = dma_alloc_page_table(gfp);
if (!pto)
return NULL;
set_st_pto(&ste, virt_to_phys(pto));
@@ -116,18 +116,19 @@ static unsigned long *dma_get_page_table_origin(unsigned 
long *step)
return pto;
 }
 
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr)
+unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr,
+ gfp_t gfp)
 {
unsigned long *sto, *pto;
unsigned int rtx, sx, px;
 
rtx = calc_rtx(dma_addr);
-   sto = dma_get_seg_table_origin(&rto[rtx]);
+   sto = dma_get_seg_table_origin(&rto[rtx], gfp);
if (!sto)
return NULL;
 
sx = calc_sx(dma_addr);
-   pto = dma_get_page_table_origin(&sto[sx]);
+   pto = dma_get_page_table_origin(&sto[sx], gfp);
if (!pto)
return NULL;
 
@@ -170,7 +171,8 @@ static int __dma_update_trans(struct zpci_dev *zdev, 
phys_addr_t pa,
return -EINVAL;
 
for (i = 0; i < nr_pages; i++) {
-   entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
+   entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr,
+  GFP_ATO

[Nouveau] [PATCH v2 02/10] iommu: Remove iommu_map_atomic()

2023-01-18 Thread Jason Gunthorpe
There is only one call site and it can now just pass the GFP_ATOMIC to the
normal iommu_map().

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 2 +-
 drivers/iommu/iommu.c | 7 ---
 include/linux/iommu.h | 9 -
 3 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8bdb65e7686ff9..7016db569f81fc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -713,7 +713,7 @@ static dma_addr_t __iommu_dma_map(struct device *dev, 
phys_addr_t phys,
if (!iova)
return DMA_MAPPING_ERROR;
 
-   if (iommu_map_atomic(domain, iova, phys - iova_off, size, prot)) {
+   if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7dac062b58f039..9412b420d07257 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2381,13 +2381,6 @@ int iommu_map(struct iommu_domain *domain, unsigned long 
iova,
 }
 EXPORT_SYMBOL_GPL(iommu_map);
 
-int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot)
-{
-   return iommu_map(domain, iova, paddr, size, prot, GFP_ATOMIC);
-}
-EXPORT_SYMBOL_GPL(iommu_map_atomic);
-
 static size_t __iommu_unmap_pages(struct iommu_domain *domain,
  unsigned long iova, size_t size,
  struct iommu_iotlb_gather *iotlb_gather)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d2020994f292db..521cd79700f4d8 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -468,8 +468,6 @@ extern struct iommu_domain *iommu_get_domain_for_dev(struct 
device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
-extern int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
-   phys_addr_t paddr, size_t size, int prot);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
@@ -778,13 +776,6 @@ static inline int iommu_map(struct iommu_domain *domain, 
unsigned long iova,
return -ENODEV;
 }
 
-static inline int iommu_map_atomic(struct iommu_domain *domain,
-  unsigned long iova, phys_addr_t paddr,
-  size_t size, int prot)
-{
-   return -ENODEV;
-}
-
 static inline size_t iommu_unmap(struct iommu_domain *domain,
 unsigned long iova, size_t size)
 {
-- 
2.39.0



[Nouveau] [PATCH v2 08/10] iommu/intel: Use GFP_KERNEL in sleepable contexts

2023-01-18 Thread Jason Gunthorpe
These contexts are sleepable, so use the proper annotation. The GFP_ATOMIC
was added mechanically in the prior patches.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index e95f7703ce7b83..a1a66798e1f06c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2372,7 +2372,7 @@ static int iommu_domain_identity_map(struct dmar_domain 
*domain,
 
return __domain_mapping(domain, first_vpfn,
first_vpfn, last_vpfn - first_vpfn + 1,
-   DMA_PTE_READ|DMA_PTE_WRITE, GFP_ATOMIC);
+   DMA_PTE_READ|DMA_PTE_WRITE, GFP_KERNEL);
 }
 
 static int md_domain_init(struct dmar_domain *domain, int guest_width);
@@ -2680,7 +2680,7 @@ static int copy_context_table(struct intel_iommu *iommu,
if (!old_ce)
goto out;
 
-   new_ce = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
+   new_ce = alloc_pgtable_page(iommu->node, GFP_KERNEL);
if (!new_ce)
goto out_unmap;
 
-- 
2.39.0



[Nouveau] [PATCH v2 03/10] iommu: Add a gfp parameter to iommu_map_sg()

2023-01-18 Thread Jason Gunthorpe
Follow the pattern for iommu_map() and remove iommu_map_sg_atomic().

This allows __iommu_dma_alloc_noncontiguous() to use a GFP_KERNEL
allocation here, based on the provided gfp flags.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c |  5 +++--
 drivers/iommu/iommu.c | 26 ++
 include/linux/iommu.h | 18 +-
 3 files changed, 18 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7016db569f81fc..8c2788633c1766 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -833,7 +833,8 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
arch_dma_prep_coherent(sg_page(sg), sg->length);
}
 
-   ret = iommu_map_sg_atomic(domain, iova, sgt->sgl, sgt->orig_nents, 
ioprot);
+   ret = iommu_map_sg(domain, iova, sgt->sgl, sgt->orig_nents, ioprot,
+  gfp);
if (ret < 0 || ret < size)
goto out_free_sg;
 
@@ -1281,7 +1282,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 * We'll leave any physical concatenation to the IOMMU driver's
 * implementation - it knows better than we do.
 */
-   ret = iommu_map_sg_atomic(domain, iova, sg, nents, prot);
+   ret = iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC);
if (ret < 0 || ret < iova_len)
goto out_free_iova;
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9412b420d07257..cc6e7c6bf72758 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2470,9 +2470,9 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
-static ssize_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot,
-   gfp_t gfp)
+ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+struct scatterlist *sg, unsigned int nents, int prot,
+gfp_t gfp)
 {
const struct iommu_domain_ops *ops = domain->ops;
size_t len = 0, mapped = 0;
@@ -2480,6 +2480,13 @@ static ssize_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
unsigned int i = 0;
int ret;
 
+   might_sleep_if(gfpflags_allow_blocking(gfp));
+
+   /* Discourage passing strange GFP flags */
+   if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
+   __GFP_HIGHMEM)))
+   return -EINVAL;
+
while (i <= nents) {
phys_addr_t s_phys = sg_phys(sg);
 
@@ -2519,21 +2526,8 @@ static ssize_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
 
return ret;
 }
-
-ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-struct scatterlist *sg, unsigned int nents, int prot)
-{
-   might_sleep();
-   return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_KERNEL);
-}
 EXPORT_SYMBOL_GPL(iommu_map_sg);
 
-ssize_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot)
-{
-   return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC);
-}
-
 /**
  * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
  * @domain: the iommu domain where the fault has happened
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 521cd79700f4d8..d5c16dc33c87de 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -474,10 +474,8 @@ extern size_t iommu_unmap_fast(struct iommu_domain *domain,
   unsigned long iova, size_t size,
   struct iommu_iotlb_gather *iotlb_gather);
 extern ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot);
-extern ssize_t iommu_map_sg_atomic(struct iommu_domain *domain,
-  unsigned long iova, struct scatterlist *sg,
-  unsigned int nents, int prot);
+   struct scatterlist *sg, unsigned int nents,
+   int prot, gfp_t gfp);
 extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t 
iova);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
iommu_fault_handler_t handler, void *token);
@@ -791,14 +789,7 @@ static inline size_t iommu_unmap_fast(struct iommu_domain 
*domain,
 
 static inline ssize_t iommu_map_sg(struct iommu_domain *domain,
   unsigned long iova, struct scatterlist *sg,
-  unsigned int nents, int prot)
-{
-   return -ENODEV;
-}
-
-static inline ssize_t 

Re: [Nouveau] [PATCH 6/8] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-18 Thread Jason Gunthorpe
On Wed, Jan 18, 2023 at 01:18:18AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, January 17, 2023 9:30 PM
> > 
> > On Tue, Jan 17, 2023 at 03:35:08AM +, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe 
> > > > Sent: Saturday, January 7, 2023 12:43 AM
> > > >
> > > > @@ -2676,7 +2676,7 @@ static int copy_context_table(struct
> > intel_iommu
> > > > *iommu,
> > > > if (!old_ce)
> > > > goto out;
> > > >
> > > > -   new_ce = alloc_pgtable_page(iommu->node);
> > > > +   new_ce = alloc_pgtable_page(iommu->node,
> > > > GFP_KERNEL);
> > >
> > > GFP_ATOMIC
> > 
> > Can't be:
> > 
> > old_ce = memremap(old_ce_phys, PAGE_SIZE,
> > MEMREMAP_WB);
> > if (!old_ce)
> > goto out;
> > 
> > new_ce = alloc_pgtable_page(iommu->node,
> > GFP_KERNEL);
> > if (!new_ce)
> > 
> > memremap() is sleeping.
> > 
> > And the only caller is:
> > 
> > ctxt_tbls = kcalloc(ctxt_table_entries, sizeof(void *), GFP_KERNEL);
> > if (!ctxt_tbls)
> > goto out_unmap;
> > 
> > for (bus = 0; bus < 256; bus++) {
> > ret = copy_context_table(iommu, &old_rt[bus],
> >  ctxt_tbls, bus, ext);
> > 
> 
> Yes, but the patch description says "Push the GFP_ATOMIC to all
> callers." implying it's purely a refactoring w/o changing those
> semantics.

Sure, let's split the patch, it is a good idea.

Jason


Re: [Nouveau] [PATCH 6/8] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-17 Thread Jason Gunthorpe
On Tue, Jan 17, 2023 at 03:35:08AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Saturday, January 7, 2023 12:43 AM
> > 
> > @@ -2676,7 +2676,7 @@ static int copy_context_table(struct intel_iommu
> > *iommu,
> > if (!old_ce)
> > goto out;
> > 
> > -   new_ce = alloc_pgtable_page(iommu->node);
> > +   new_ce = alloc_pgtable_page(iommu->node,
> > GFP_KERNEL);
> 
> GFP_ATOMIC

Can't be:

old_ce = memremap(old_ce_phys, PAGE_SIZE,
MEMREMAP_WB);
if (!old_ce)
goto out;

new_ce = alloc_pgtable_page(iommu->node, GFP_KERNEL);
if (!new_ce)

memremap() is sleeping.

And the only caller is:

ctxt_tbls = kcalloc(ctxt_table_entries, sizeof(void *), GFP_KERNEL);
if (!ctxt_tbls)
goto out_unmap;

for (bus = 0; bus < 256; bus++) {
ret = copy_context_table(iommu, &old_rt[bus],
 ctxt_tbls, bus, ext);

Jason


Re: [Nouveau] [PATCH 7/8] iommu/intel: Support the gfp argument to the map_pages op

2023-01-17 Thread Jason Gunthorpe
On Tue, Jan 17, 2023 at 03:38:51AM +, Tian, Kevin wrote:
> > From: Jason Gunthorpe 
> > Sent: Saturday, January 7, 2023 12:43 AM
> > 
> > @@ -2368,7 +2372,7 @@ static int iommu_domain_identity_map(struct
> > dmar_domain *domain,
> > 
> > return __domain_mapping(domain, first_vpfn,
> > first_vpfn, last_vpfn - first_vpfn + 1,
> > -   DMA_PTE_READ|DMA_PTE_WRITE);
> > +   DMA_PTE_READ|DMA_PTE_WRITE,
> > GFP_KERNEL);
> >  }
> 
> Baolu, can you help confirm whether switching from GFP_ATOMIC to
> GFP_KERNEL is OK in this path? it looks fine to me in a quick glance
> but want to be conservative here.

I checked it carefully myself as well, good to check again.

> > @@ -4333,7 +4337,8 @@ static size_t intel_iommu_unmap(struct
> > iommu_domain *domain,
> > 
> > /* Cope with horrid API which requires us to unmap more than the
> >size argument if it happens to be a large-page mapping. */
> > -   BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
> > &level));
> > +   BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
> > &level,
> > +  GFP_ATOMIC));
> 
> with level==0 it implies it's only lookup w/o pgtable allocation. From this
> angle it reads better to use a more relaxed gfp e.g. GFP_KERNEL here.

We should only write GFP_KERNEL if it is actually a sleepable context
because it will be mighty confusing if it isn't. I couldn't tell what
the context is so I left it as ATOMIC.

You are correct this is only just a lookup and so the value is never
used / doesn't matter.

Jason


Re: [Nouveau] [PATCH 1/8] iommu: Add a gfp parameter to iommu_map()

2023-01-16 Thread Jason Gunthorpe
On Fri, Jan 06, 2023 at 05:15:28PM +, Robin Murphy wrote:

> However, echoing the recent activity over on the DMA API side of things, I
> think it's still worth proactively constraining the set of permissible
> flags, lest we end up with more weird problems if stuff that doesn't really
> make sense, like GFP_COMP or zone flags, manages to leak through (that may
> have been part of the reason for having the current wrappers rather than a
> bare gfp argument in the first place, I forget now).

I did it like this:

--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2368,6 +2368,11 @@ int iommu_map(struct iommu_domain *domain, unsigned long 
iova,
 
might_sleep_if(gfpflags_allow_blocking(gfp));
 
+   /* Discourage passing strange GFP flags */
+   if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
+   __GFP_HIGHMEM)))
+   return -EINVAL;
+
ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
if (ret == 0 && ops->iotlb_sync_map)
ops->iotlb_sync_map(domain, iova, size);
@@ -2477,6 +2482,11 @@ ssize_t iommu_map_sg(struct iommu_domain *domain, 
unsigned long iova,
 
might_sleep_if(gfpflags_allow_blocking(gfp));
 
+   /* Discourage passing strange GFP flags */
+   if (WARN_ON_ONCE(gfp & (__GFP_COMP | __GFP_DMA | __GFP_DMA32 |
+   __GFP_HIGHMEM)))
+   return -EINVAL;
+
while (i <= nents) {
phys_addr_t s_phys = sg_phys(sg);
 
Will post a v2 when the driver people take a look

Thanks,
Jason


Re: [Nouveau] [PATCH 1/8] iommu: Add a gfp parameter to iommu_map()

2023-01-06 Thread Jason Gunthorpe
On Fri, Jan 06, 2023 at 05:15:28PM +, Robin Murphy wrote:
> On 2023-01-06 16:42, Jason Gunthorpe wrote:
> > The internal mechanisms support this, but instead of exposting the gfp to
> > the caller it wrappers it into iommu_map() and iommu_map_atomic()
> > 
> > Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.
> 
> FWIW, since we *do* have two variants already, I think I'd have a mild
> preference for leaving the regular map calls as-is (i.e. implicit
> GFP_KERNEL), and just generalising the _atomic versions for the special
> cases.

I think it is just better to follow kernel convention and have
allocation functions include the GFP because it is a clear signal to
the user that there is an allocation hidden inside the API. The whole
point of gfp is not to have multitudes of every function for every
allocation mode.

There are not so many callers that it seems worth worrying about
removing the extra GFP_KERNEL argument.

> However, echoing the recent activity over on the DMA API side of things, I
> think it's still worth proactively constraining the set of permissible
> flags, lest we end up with more weird problems if stuff that doesn't really
> make sense, like GFP_COMP or zone flags, manages to leak through (that may
> have been part of the reason for having the current wrappers rather than a
> bare gfp argument in the first place, I forget now).

Yeah, that can be done

Thanks,
Jason


[Nouveau] [PATCH 2/8] iommu: Remove iommu_map_atomic()

2023-01-06 Thread Jason Gunthorpe
There is only one call site and it can now just pass the GFP_ATOMIC to the
normal iommu_map().

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 2 +-
 drivers/iommu/iommu.c | 7 ---
 include/linux/iommu.h | 9 -
 3 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8bdb65e7686ff9..7016db569f81fc 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -713,7 +713,7 @@ static dma_addr_t __iommu_dma_map(struct device *dev, 
phys_addr_t phys,
if (!iova)
return DMA_MAPPING_ERROR;
 
-   if (iommu_map_atomic(domain, iova, phys - iova_off, size, prot)) {
+   if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fe29fc2140b132..fee37bb246f3ea 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2376,13 +2376,6 @@ int iommu_map(struct iommu_domain *domain, unsigned long 
iova,
 }
 EXPORT_SYMBOL_GPL(iommu_map);
 
-int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot)
-{
-   return iommu_map(domain, iova, paddr, size, prot, GFP_ATOMIC);
-}
-EXPORT_SYMBOL_GPL(iommu_map_atomic);
-
 static size_t __iommu_unmap_pages(struct iommu_domain *domain,
  unsigned long iova, size_t size,
  struct iommu_iotlb_gather *iotlb_gather)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d2020994f292db..521cd79700f4d8 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -468,8 +468,6 @@ extern struct iommu_domain *iommu_get_domain_for_dev(struct 
device *dev);
 extern struct iommu_domain *iommu_get_dma_domain(struct device *dev);
 extern int iommu_map(struct iommu_domain *domain, unsigned long iova,
 phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
-extern int iommu_map_atomic(struct iommu_domain *domain, unsigned long iova,
-   phys_addr_t paddr, size_t size, int prot);
 extern size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova,
  size_t size);
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
@@ -778,13 +776,6 @@ static inline int iommu_map(struct iommu_domain *domain, 
unsigned long iova,
return -ENODEV;
 }
 
-static inline int iommu_map_atomic(struct iommu_domain *domain,
-  unsigned long iova, phys_addr_t paddr,
-  size_t size, int prot)
-{
-   return -ENODEV;
-}
-
 static inline size_t iommu_unmap(struct iommu_domain *domain,
 unsigned long iova, size_t size)
 {
-- 
2.39.0



[Nouveau] [PATCH 8/8] iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s

2023-01-06 Thread Jason Gunthorpe
dma_alloc_cpu_table() and dma_alloc_page_table() are eventually called by
iommufd through s390_iommu_map_pages() and it should not be forced to
atomic. Thread the gfp parameter through the call chain starting from
s390_iommu_map_pages().

Signed-off-by: Jason Gunthorpe 
---
 arch/s390/include/asm/pci_dma.h |  5 +++--
 arch/s390/pci/pci_dma.c | 31 +--
 drivers/iommu/s390-iommu.c  | 15 +--
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/arch/s390/include/asm/pci_dma.h b/arch/s390/include/asm/pci_dma.h
index 91e63426bdc53f..7119c04c51c5c8 100644
--- a/arch/s390/include/asm/pci_dma.h
+++ b/arch/s390/include/asm/pci_dma.h
@@ -186,9 +186,10 @@ static inline unsigned long *get_st_pto(unsigned long 
entry)
 
 /* Prototypes */
 void dma_free_seg_table(unsigned long);
-unsigned long *dma_alloc_cpu_table(void);
+unsigned long *dma_alloc_cpu_table(gfp_t gfp);
 void dma_cleanup_tables(unsigned long *);
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr);
+unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr,
+ gfp_t gfp);
 void dma_update_cpu_trans(unsigned long *entry, phys_addr_t page_addr, int 
flags);
 
 extern const struct dma_map_ops s390_pci_dma_ops;
diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index ea478d11fbd132..2d9b01d7ca4c5c 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -27,11 +27,11 @@ static int zpci_refresh_global(struct zpci_dev *zdev)
  zdev->iommu_pages * PAGE_SIZE);
 }
 
-unsigned long *dma_alloc_cpu_table(void)
+unsigned long *dma_alloc_cpu_table(gfp_t gfp)
 {
unsigned long *table, *entry;
 
-   table = kmem_cache_alloc(dma_region_table_cache, GFP_ATOMIC);
+   table = kmem_cache_alloc(dma_region_table_cache, gfp);
if (!table)
return NULL;
 
@@ -45,11 +45,11 @@ static void dma_free_cpu_table(void *table)
kmem_cache_free(dma_region_table_cache, table);
 }
 
-static unsigned long *dma_alloc_page_table(void)
+static unsigned long *dma_alloc_page_table(gfp_t gfp)
 {
unsigned long *table, *entry;
 
-   table = kmem_cache_alloc(dma_page_table_cache, GFP_ATOMIC);
+   table = kmem_cache_alloc(dma_page_table_cache, gfp);
if (!table)
return NULL;
 
@@ -63,7 +63,7 @@ static void dma_free_page_table(void *table)
kmem_cache_free(dma_page_table_cache, table);
 }
 
-static unsigned long *dma_get_seg_table_origin(unsigned long *rtep)
+static unsigned long *dma_get_seg_table_origin(unsigned long *rtep, gfp_t gfp)
 {
unsigned long old_rte, rte;
unsigned long *sto;
@@ -72,7 +72,7 @@ static unsigned long *dma_get_seg_table_origin(unsigned long 
*rtep)
if (reg_entry_isvalid(rte)) {
sto = get_rt_sto(rte);
} else {
-   sto = dma_alloc_cpu_table();
+   sto = dma_alloc_cpu_table(gfp);
if (!sto)
return NULL;
 
@@ -90,7 +90,7 @@ static unsigned long *dma_get_seg_table_origin(unsigned long 
*rtep)
return sto;
 }
 
-static unsigned long *dma_get_page_table_origin(unsigned long *step)
+static unsigned long *dma_get_page_table_origin(unsigned long *step, gfp_t gfp)
 {
unsigned long old_ste, ste;
unsigned long *pto;
@@ -99,7 +99,7 @@ static unsigned long *dma_get_page_table_origin(unsigned long 
*step)
if (reg_entry_isvalid(ste)) {
pto = get_st_pto(ste);
} else {
-   pto = dma_alloc_page_table();
+   pto = dma_alloc_page_table(gfp);
if (!pto)
return NULL;
set_st_pto(&ste, virt_to_phys(pto));
@@ -116,18 +116,19 @@ static unsigned long *dma_get_page_table_origin(unsigned 
long *step)
return pto;
 }
 
-unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr)
+unsigned long *dma_walk_cpu_trans(unsigned long *rto, dma_addr_t dma_addr,
+ gfp_t gfp)
 {
unsigned long *sto, *pto;
unsigned int rtx, sx, px;
 
rtx = calc_rtx(dma_addr);
-   sto = dma_get_seg_table_origin(&rto[rtx]);
+   sto = dma_get_seg_table_origin(&rto[rtx], gfp);
if (!sto)
return NULL;
 
sx = calc_sx(dma_addr);
-   pto = dma_get_page_table_origin(&sto[sx]);
+   pto = dma_get_page_table_origin(&sto[sx], gfp);
if (!pto)
return NULL;
 
@@ -170,7 +171,8 @@ static int __dma_update_trans(struct zpci_dev *zdev, 
phys_addr_t pa,
return -EINVAL;
 
for (i = 0; i < nr_pages; i++) {
-   entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr);
+   entry = dma_walk_cpu_trans(zdev->dma_table, dma_addr,
+  GFP_ATOMIC);
if (!entry) {

[Nouveau] [PATCH 7/8] iommu/intel: Support the gfp argument to the map_pages op

2023-01-06 Thread Jason Gunthorpe
Flow it down to alloc_pgtable_page() via pfn_to_dma_pte() and
__domain_mapping().

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index e3807776971563..a1a66798e1f06c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -908,7 +908,8 @@ void dmar_fault_dump_ptes(struct intel_iommu *iommu, u16 
source_id,
 #endif
 
 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
- unsigned long pfn, int *target_level)
+ unsigned long pfn, int *target_level,
+ gfp_t gfp)
 {
struct dma_pte *parent, *pte;
int level = agaw_to_level(domain->agaw);
@@ -935,7 +936,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain 
*domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;
 
-   tmp_page = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
+   tmp_page = alloc_pgtable_page(domain->nid, gfp);
 
if (!tmp_page)
return NULL;
@@ -2150,7 +2151,8 @@ static void switch_to_super_page(struct dmar_domain 
*domain,
 
while (start_pfn <= end_pfn) {
if (!pte)
-   pte = pfn_to_dma_pte(domain, start_pfn, &level);
+   pte = pfn_to_dma_pte(domain, start_pfn, &level,
+GFP_ATOMIC);
 
if (dma_pte_present(pte)) {
dma_pte_free_pagetable(domain, start_pfn,
@@ -2172,7 +2174,8 @@ static void switch_to_super_page(struct dmar_domain 
*domain,
 
 static int
 __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
-unsigned long phys_pfn, unsigned long nr_pages, int prot)
+unsigned long phys_pfn, unsigned long nr_pages, int prot,
+gfp_t gfp)
 {
struct dma_pte *first_pte = NULL, *pte = NULL;
unsigned int largepage_lvl = 0;
@@ -2202,7 +2205,8 @@ __domain_mapping(struct dmar_domain *domain, unsigned 
long iov_pfn,
largepage_lvl = hardware_largepage_caps(domain, iov_pfn,
phys_pfn, nr_pages);
 
-   pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl);
+   pte = pfn_to_dma_pte(domain, iov_pfn, &largepage_lvl,
+gfp);
if (!pte)
return -ENOMEM;
first_pte = pte;
@@ -2368,7 +2372,7 @@ static int iommu_domain_identity_map(struct dmar_domain 
*domain,
 
return __domain_mapping(domain, first_vpfn,
first_vpfn, last_vpfn - first_vpfn + 1,
-   DMA_PTE_READ|DMA_PTE_WRITE);
+   DMA_PTE_READ|DMA_PTE_WRITE, GFP_KERNEL);
 }
 
 static int md_domain_init(struct dmar_domain *domain, int guest_width);
@@ -4298,7 +4302,7 @@ static int intel_iommu_map(struct iommu_domain *domain,
   the low bits of hpa would take us onto the next page */
size = aligned_nrpages(hpa, size);
return __domain_mapping(dmar_domain, iova >> VTD_PAGE_SHIFT,
-   hpa >> VTD_PAGE_SHIFT, size, prot);
+   hpa >> VTD_PAGE_SHIFT, size, prot, gfp);
 }
 
 static int intel_iommu_map_pages(struct iommu_domain *domain,
@@ -4333,7 +4337,8 @@ static size_t intel_iommu_unmap(struct iommu_domain 
*domain,
 
/* Cope with horrid API which requires us to unmap more than the
   size argument if it happens to be a large-page mapping. */
-   BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level));
+   BUG_ON(!pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
+  GFP_ATOMIC));
 
if (size < VTD_PAGE_SIZE << level_to_offset_bits(level))
size = VTD_PAGE_SIZE << level_to_offset_bits(level);
@@ -4392,7 +4397,8 @@ static phys_addr_t intel_iommu_iova_to_phys(struct 
iommu_domain *domain,
int level = 0;
u64 phys = 0;
 
-   pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level);
+   pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &level,
+GFP_ATOMIC);
if (pte && dma_pte_present(pte))
phys = dma_pte_addr(pte) +
(iova & (BIT_MASK(level_to_offset_bits(level) +
-- 
2.39.0



[Nouveau] [PATCH 4/8] iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()

2023-01-06 Thread Jason Gunthorpe
Change the sg_alloc_table_from_pages() allocation that was hardwired to
GFP_KERNEL to use the gfp parameter like the other allocations in this
function.

Auditing says this is never called from an atomic context, so it is safe
as is, but reads wrong.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 8c2788633c1766..e4bf1bb159f7c7 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -822,7 +822,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
if (!iova)
goto out_free_pages;
 
-   if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, GFP_KERNEL))
+   if (sg_alloc_table_from_pages(sgt, pages, count, 0, size, gfp))
goto out_free_iova;
 
if (!(ioprot & IOMMU_CACHE)) {
-- 
2.39.0



[Nouveau] [PATCH 3/8] iommu: Add a gfp parameter to iommu_map_sg()

2023-01-06 Thread Jason Gunthorpe
Follow the pattern for iommu_map() and remove iommu_map_sg_atomic().

This allows __iommu_dma_alloc_noncontiguous() to use a GFP_KERNEL
allocation here, based on the provided gfp flags.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/dma-iommu.c |  5 +++--
 drivers/iommu/iommu.c | 21 +
 include/linux/iommu.h | 18 +-
 3 files changed, 13 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7016db569f81fc..8c2788633c1766 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -833,7 +833,8 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct 
device *dev,
arch_dma_prep_coherent(sg_page(sg), sg->length);
}
 
-   ret = iommu_map_sg_atomic(domain, iova, sgt->sgl, sgt->orig_nents, 
ioprot);
+   ret = iommu_map_sg(domain, iova, sgt->sgl, sgt->orig_nents, ioprot,
+  gfp);
if (ret < 0 || ret < size)
goto out_free_sg;
 
@@ -1281,7 +1282,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 * We'll leave any physical concatenation to the IOMMU driver's
 * implementation - it knows better than we do.
 */
-   ret = iommu_map_sg_atomic(domain, iova, sg, nents, prot);
+   ret = iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC);
if (ret < 0 || ret < iova_len)
goto out_free_iova;
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fee37bb246f3ea..11fb3981e25642 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2465,9 +2465,9 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
-static ssize_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot,
-   gfp_t gfp)
+ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+struct scatterlist *sg, unsigned int nents, int prot,
+gfp_t gfp)
 {
const struct iommu_domain_ops *ops = domain->ops;
size_t len = 0, mapped = 0;
@@ -2475,6 +2475,8 @@ static ssize_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
unsigned int i = 0;
int ret;
 
+   might_sleep_if(gfpflags_allow_blocking(gfp));
+
while (i <= nents) {
phys_addr_t s_phys = sg_phys(sg);
 
@@ -2514,21 +2516,8 @@ static ssize_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
 
return ret;
 }
-
-ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-struct scatterlist *sg, unsigned int nents, int prot)
-{
-   might_sleep();
-   return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_KERNEL);
-}
 EXPORT_SYMBOL_GPL(iommu_map_sg);
 
-ssize_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot)
-{
-   return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC);
-}
-
 /**
  * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
  * @domain: the iommu domain where the fault has happened
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 521cd79700f4d8..d5c16dc33c87de 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -474,10 +474,8 @@ extern size_t iommu_unmap_fast(struct iommu_domain *domain,
   unsigned long iova, size_t size,
   struct iommu_iotlb_gather *iotlb_gather);
 extern ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot);
-extern ssize_t iommu_map_sg_atomic(struct iommu_domain *domain,
-  unsigned long iova, struct scatterlist *sg,
-  unsigned int nents, int prot);
+   struct scatterlist *sg, unsigned int nents,
+   int prot, gfp_t gfp);
 extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t 
iova);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
iommu_fault_handler_t handler, void *token);
@@ -791,14 +789,7 @@ static inline size_t iommu_unmap_fast(struct iommu_domain 
*domain,
 
 static inline ssize_t iommu_map_sg(struct iommu_domain *domain,
   unsigned long iova, struct scatterlist *sg,
-  unsigned int nents, int prot)
-{
-   return -ENODEV;
-}
-
-static inline ssize_t iommu_map_sg_atomic(struct iommu_domain *domain,
- unsigned long iova, struct scatterlist *sg,
- unsigned int nents, int prot)
+  

[Nouveau] [PATCH 1/8] iommu: Add a gfp parameter to iommu_map()

2023-01-06 Thread Jason Gunthorpe
The internal mechanisms support this, but instead of exposing the gfp to
the caller it is wrapped inside iommu_map() and iommu_map_atomic().

Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT.

Signed-off-by: Jason Gunthorpe 
---
 arch/arm/mm/dma-mapping.c   | 11 +++
 .../gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c |  3 ++-
 drivers/gpu/drm/tegra/drm.c |  2 +-
 drivers/gpu/host1x/cdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c|  4 ++--
 drivers/iommu/dma-iommu.c   |  2 +-
 drivers/iommu/iommu.c   | 17 ++---
 drivers/iommu/iommufd/pages.c   |  6 --
 drivers/media/platform/qcom/venus/firmware.c|  2 +-
 drivers/net/ipa/ipa_mem.c   |  6 --
 drivers/net/wireless/ath/ath10k/snoc.c  |  2 +-
 drivers/net/wireless/ath/ath11k/ahb.c   |  4 ++--
 drivers/remoteproc/remoteproc_core.c|  5 +++--
 drivers/vfio/vfio_iommu_type1.c |  9 +
 drivers/vhost/vdpa.c|  2 +-
 include/linux/iommu.h   |  4 ++--
 16 files changed, 43 insertions(+), 38 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index c135f6e37a00ca..8bc01071474ab7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -984,7 +984,8 @@ __iommu_create_mapping(struct device *dev, struct page 
**pages, size_t size,
 
len = (j - i) << PAGE_SHIFT;
ret = iommu_map(mapping->domain, iova, phys, len,
-   __dma_info_to_prot(DMA_BIDIRECTIONAL, attrs));
+   __dma_info_to_prot(DMA_BIDIRECTIONAL, attrs),
+   GFP_KERNEL);
if (ret < 0)
goto fail;
iova += len;
@@ -1207,7 +1208,8 @@ static int __map_sg_chunk(struct device *dev, struct 
scatterlist *sg,
 
prot = __dma_info_to_prot(dir, attrs);
 
-   ret = iommu_map(mapping->domain, iova, phys, len, prot);
+   ret = iommu_map(mapping->domain, iova, phys, len, prot,
+   GFP_KERNEL);
if (ret < 0)
goto fail;
count += len >> PAGE_SHIFT;
@@ -1379,7 +1381,8 @@ static dma_addr_t arm_iommu_map_page(struct device *dev, 
struct page *page,
 
prot = __dma_info_to_prot(dir, attrs);
 
-   ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len, 
prot);
+   ret = iommu_map(mapping->domain, dma_addr, page_to_phys(page), len,
+   prot, GFP_KERNEL);
if (ret < 0)
goto fail;
 
@@ -1443,7 +1446,7 @@ static dma_addr_t arm_iommu_map_resource(struct device 
*dev,
 
prot = __dma_info_to_prot(dir, attrs) | IOMMU_MMIO;
 
-   ret = iommu_map(mapping->domain, dma_addr, addr, len, prot);
+   ret = iommu_map(mapping->domain, dma_addr, addr, len, prot, GFP_KERNEL);
if (ret < 0)
goto fail;
 
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c 
b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
index 648ecf5a8fbc2a..a4ac94a2ab57fc 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/instmem/gk20a.c
@@ -475,7 +475,8 @@ gk20a_instobj_ctor_iommu(struct gk20a_instmem *imem, u32 
npages, u32 align,
u32 offset = (r->offset + i) << imem->iommu_pgshift;
 
ret = iommu_map(imem->domain, offset, node->dma_addrs[i],
-   PAGE_SIZE, IOMMU_READ | IOMMU_WRITE);
+   PAGE_SIZE, IOMMU_READ | IOMMU_WRITE,
+   GFP_KERNEL);
if (ret < 0) {
nvkm_error(subdev, "IOMMU mapping failure: %d\n", ret);
 
diff --git a/drivers/gpu/drm/tegra/drm.c b/drivers/gpu/drm/tegra/drm.c
index 7bd2e65c2a16c5..6ca9f396e55be4 100644
--- a/drivers/gpu/drm/tegra/drm.c
+++ b/drivers/gpu/drm/tegra/drm.c
@@ -1057,7 +1057,7 @@ void *tegra_drm_alloc(struct tegra_drm *tegra, size_t 
size, dma_addr_t *dma)
 
*dma = iova_dma_addr(&tegra->carveout.domain, alloc);
err = iommu_map(tegra->domain, *dma, virt_to_phys(virt),
-   size, IOMMU_READ | IOMMU_WRITE);
+   size, IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
if (err < 0)
goto free_iova;
 
diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c
index 103fda055394ab..4ddfcd2138c95b 100644
--- a/drivers/gpu/host1x/cdma.c
+++ b/drivers/gpu/host1x/cdma.c
@@ -105,7 +105,7 @@ static int host1x_pushbuffer_init(struct push_buffer *pb)
 
pb->dma = iova_dma_addr(&host1x->iova, alloc);

[Nouveau] [PATCH 5/8] iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()

2023-01-06 Thread Jason Gunthorpe
iommufd follows the same design as KVM and uses memory cgroups to limit
the amount of kernel memory an iommufd file descriptor can pin down. The
various internal data structures already use GFP_KERNEL_ACCOUNT.

However, one of the biggest consumers of kernel memory is the IOPTEs
stored under the iommu_domain. Many drivers will allocate these at
iommu_map() time and will trivially do the right thing if we pass in
GFP_KERNEL_ACCOUNT.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/iommufd/pages.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 22cc3bb0c6c55a..f8d92c9bb65b60 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -457,7 +457,7 @@ static int batch_iommu_map_small(struct iommu_domain 
*domain,
 
while (size) {
rc = iommu_map(domain, iova, paddr, PAGE_SIZE, prot,
-  GFP_KERNEL);
+  GFP_KERNEL_ACCOUNT);
if (rc)
goto err_unmap;
iova += PAGE_SIZE;
@@ -502,7 +502,7 @@ static int batch_to_domain(struct pfn_batch *batch, struct 
iommu_domain *domain,
rc = iommu_map(domain, iova,
   PFN_PHYS(batch->pfns[cur]) + page_offset,
   next_iova - iova, area->iommu_prot,
-  GFP_KERNEL);
+  GFP_KERNEL_ACCOUNT);
if (rc)
goto err_unmap;
iova = next_iova;
-- 
2.39.0



[Nouveau] [PATCH 0/8] Let iommufd charge IOPTE allocations to the memory cgroup

2023-01-06 Thread Jason Gunthorpe
iommufd follows the same design as KVM and uses memory cgroups to limit
the amount of kernel memory an iommufd file descriptor can pin down. The
various internal data structures already use GFP_KERNEL_ACCOUNT to charge
its own memory.

However, one of the biggest consumers of kernel memory is the IOPTEs
stored under the iommu_domain and these allocations are not tracked.

This series is the first step in fixing it.

The iommu driver contract already includes a 'gfp' argument to the
map_pages op, allowing iommufd to specify GFP_KERNEL_ACCOUNT and then
having the driver allocate the IOPTE tables with that flag will capture a
significant amount of the allocations.

Update the iommu_map() API to pass in the GFP argument, and fix all call
sites. Replace iommu_map_atomic().

Audit the "enterprise" iommu drivers to make sure they do the right thing.
Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is
problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are
already correct.

A follow up series will be needed to capture the allocations made when the
iommu_domain itself is allocated, which will complete the job.
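
(Sketch of the intended end state for an accounting-aware caller such as
iommufd - variable names are illustrative, not taken from the patches:)

   /*
    * IOPTE memory the driver allocates inside its map_pages op on behalf
    * of this mapping is charged to the calling task's memory cgroup.
    */
   rc = iommu_map(domain, iova, paddr, length, prot, GFP_KERNEL_ACCOUNT);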

Jason Gunthorpe (8):
  iommu: Add a gfp parameter to iommu_map()
  iommu: Remove iommu_map_atomic()
  iommu: Add a gfp parameter to iommu_map_sg()
  iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous()
  iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map()
  iommu/intel: Add a gfp parameter to alloc_pgtable_page()
  iommu/intel: Support the gfp argument to the map_pages op
  iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s

 arch/arm/mm/dma-mapping.c | 11 +++--
 arch/s390/include/asm/pci_dma.h   |  5 ++-
 arch/s390/pci/pci_dma.c   | 31 +++--
 .../drm/nouveau/nvkm/subdev/instmem/gk20a.c   |  3 +-
 drivers/gpu/drm/tegra/drm.c   |  2 +-
 drivers/gpu/host1x/cdma.c |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c  |  4 +-
 drivers/iommu/dma-iommu.c | 11 ++---
 drivers/iommu/intel/iommu.c   | 36 +---
 drivers/iommu/intel/iommu.h   |  2 +-
 drivers/iommu/intel/pasid.c   |  2 +-
 drivers/iommu/iommu.c | 43 +--
 drivers/iommu/iommufd/pages.c |  6 ++-
 drivers/iommu/s390-iommu.c| 15 ---
 drivers/media/platform/qcom/venus/firmware.c  |  2 +-
 drivers/net/ipa/ipa_mem.c |  6 ++-
 drivers/net/wireless/ath/ath10k/snoc.c|  2 +-
 drivers/net/wireless/ath/ath11k/ahb.c |  4 +-
 drivers/remoteproc/remoteproc_core.c  |  5 ++-
 drivers/vfio/vfio_iommu_type1.c   |  9 ++--
 drivers/vhost/vdpa.c  |  2 +-
 include/linux/iommu.h | 31 +++--
 22 files changed, 109 insertions(+), 125 deletions(-)


base-commit: 88603b6dc419445847923fcb7fe5080067a30f98
-- 
2.39.0



[Nouveau] [PATCH 6/8] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-06 Thread Jason Gunthorpe
This is eventually called by iommufd through intel_iommu_map_pages() and
it should not be forced to atomic. Push the GFP_ATOMIC to all callers.

Signed-off-by: Jason Gunthorpe 
---
 drivers/iommu/intel/iommu.c | 14 +++---
 drivers/iommu/intel/iommu.h |  2 +-
 drivers/iommu/intel/pasid.c |  2 +-
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 59df7e42fd533c..e3807776971563 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -362,12 +362,12 @@ static int __init intel_iommu_setup(char *str)
 }
 __setup("intel_iommu=", intel_iommu_setup);
 
-void *alloc_pgtable_page(int node)
+void *alloc_pgtable_page(int node, gfp_t gfp)
 {
struct page *page;
void *vaddr = NULL;
 
-   page = alloc_pages_node(node, GFP_ATOMIC | __GFP_ZERO, 0);
+   page = alloc_pages_node(node, gfp | __GFP_ZERO, 0);
if (page)
vaddr = page_address(page);
return vaddr;
@@ -612,7 +612,7 @@ struct context_entry *iommu_context_addr(struct intel_iommu 
*iommu, u8 bus,
if (!alloc)
return NULL;
 
-   context = alloc_pgtable_page(iommu->node);
+   context = alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!context)
return NULL;
 
@@ -935,7 +935,7 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain 
*domain,
if (!dma_pte_present(pte)) {
uint64_t pteval;
 
-   tmp_page = alloc_pgtable_page(domain->nid);
+   tmp_page = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
 
if (!tmp_page)
return NULL;
@@ -1186,7 +1186,7 @@ static int iommu_alloc_root_entry(struct intel_iommu 
*iommu)
 {
struct root_entry *root;
 
-   root = (struct root_entry *)alloc_pgtable_page(iommu->node);
+   root = (struct root_entry *)alloc_pgtable_page(iommu->node, GFP_ATOMIC);
if (!root) {
pr_err("Allocating root entry for %s failed\n",
iommu->name);
@@ -2676,7 +2676,7 @@ static int copy_context_table(struct intel_iommu *iommu,
if (!old_ce)
goto out;
 
-   new_ce = alloc_pgtable_page(iommu->node);
+   new_ce = alloc_pgtable_page(iommu->node, GFP_KERNEL);
if (!new_ce)
goto out_unmap;
 
@@ -4136,7 +4136,7 @@ static int md_domain_init(struct dmar_domain *domain, int 
guest_width)
domain->max_addr = 0;
 
/* always allocate the top pgd */
-   domain->pgd = alloc_pgtable_page(domain->nid);
+   domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC);
if (!domain->pgd)
return -ENOMEM;
domain_flush_cache(domain, domain->pgd, PAGE_SIZE);
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 06e61e4748567a..ca9a035e0110af 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -737,7 +737,7 @@ int qi_submit_sync(struct intel_iommu *iommu, struct 
qi_desc *desc,
 
 extern int dmar_ir_support(void);
 
-void *alloc_pgtable_page(int node);
+void *alloc_pgtable_page(int node, gfp_t gfp);
 void free_pgtable_page(void *vaddr);
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
 struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn);
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index fb3c7020028d07..c5bf74e9372d62 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -200,7 +200,7 @@ static struct pasid_entry *intel_pasid_get_entry(struct 
device *dev, u32 pasid)
 retry:
entries = get_pasid_table_from_pde(&dir[dir_index]);
if (!entries) {
-   entries = alloc_pgtable_page(info->iommu->node);
+   entries = alloc_pgtable_page(info->iommu->node, GFP_ATOMIC);
if (!entries)
return NULL;
 
-- 
2.39.0



Re: [Nouveau] [PATCH 2/7] mm: Free device private pages have zero refcount

2022-09-26 Thread Jason Gunthorpe
On Mon, Sep 26, 2022 at 04:03:06PM +1000, Alistair Popple wrote:
> Since 27674ef6c73f ("mm: remove the extra ZONE_DEVICE struct page
> refcount") device private pages have no longer had an extra reference
> count when the page is in use. However before handing them back to the
> owning device driver we add an extra reference count such that free
> pages have a reference count of one.
> 
> This makes it difficult to tell if a page is free or not because both
> free and in use pages will have a non-zero refcount. Instead we should
> return pages to the drivers page allocator with a zero reference count.
> Kernel code can then safely use kernel functions such as
> get_page_unless_zero().
> 
> Signed-off-by: Alistair Popple 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 1 +
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
>  lib/test_hmm.c   | 1 +
>  mm/memremap.c| 5 -
>  mm/page_alloc.c  | 6 ++
>  6 files changed, 10 insertions(+), 5 deletions(-)

I think this is a great idea, but I'm surprised no dax stuff is
touched here?

Jason


Re: [Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-09 Thread Jason Gunthorpe
On Wed, Feb 09, 2022 at 02:53:51PM +0100, Christoph Hellwig wrote:
> On Wed, Feb 09, 2022 at 08:29:56AM -0400, Jason Gunthorpe wrote:
> > It is nice, but the other series are still impacted by the fsdax mess
> > - they still stuff pages into ptes without proper refcounts and have
> > to carry nonsense to dance around this problem.
> > 
> > I certainly would be unhappy if the amd driver, for instance, gained
> > the fsdax problem as well and started pushing 4k pages into PMDs.
> 
> As said before: I think this all needs to be fixed.  But I'd rather
> fix it gradually and I think this series is a nice step forward.
> After that we can look at the pte mappings.

Right, I agree with this

Jason


Re: [Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-09 Thread Jason Gunthorpe
On Wed, Feb 09, 2022 at 07:23:45AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 08, 2022 at 07:30:11PM -0800, Dan Williams wrote:
> > Interesting. I had expected that to really fix the refcount problem
> > that fs/dax.c would need to start taking real page references as pages
> > were added to a mapping, just like page cache.
> 
> I think we should do that eventually.  But I think this series that
> just attacks the device private type and extends to the device coherent
> and p2p enhacements is a good first step to stop the proliferation of
> the one off refcount and to allow to deal with the fsdax pages in another
> more focuessed series.

It is nice, but the other series are still impacted by the fsdax mess
- they still stuff pages into ptes without proper refcounts and have
to carry nonsense to dance around this problem.

I certainly would be unhappy if the amd driver, for instance, gained
the fsdax problem as well and started pushing 4k pages into PMDs.

Jason


Re: [Nouveau] [PATCH 8/8] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED

2022-02-07 Thread Jason Gunthorpe
On Mon, Feb 07, 2022 at 07:32:49AM +0100, Christoph Hellwig wrote:
> Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
> support, as one of the two is required at runtime for fsdax code to
> actually work.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/Kconfig | 1 +
>  1 file changed, 1 insertion(+)

Makes sense, but leaves me wondering why a kconfig randomizer didn't hit
this.. Or maybe it means some of the function stubs on !ZONE_DEVICE
are unnecessary now..

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Nouveau] [PATCH 6/8] mm: don't include <memremap.h> in <mm.h>

2022-02-07 Thread Jason Gunthorpe
On Mon, Feb 07, 2022 at 07:32:47AM +0100, Christoph Hellwig wrote:
> Move the check for the actual pgmap types that need the free at refcount
> one behavior into the out of line helper, and thus avoid the need to
> pull memremap.h into mm.h.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/arm64/mm/mmu.c|  1 +
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  1 +
>  drivers/gpu/drm/drm_cache.c|  2 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  1 +
>  drivers/gpu/drm/nouveau/nouveau_svm.c  |  1 +
>  drivers/infiniband/core/rw.c   |  1 +
>  drivers/nvdimm/pmem.h  |  1 +
>  drivers/nvme/host/pci.c|  1 +
>  drivers/nvme/target/io-cmd-bdev.c  |  1 +
>  fs/fuse/virtio_fs.c|  1 +
>  include/linux/memremap.h   | 18 ++
>  include/linux/mm.h | 20 
>  lib/test_hmm.c |  1 +
>  mm/memremap.c  |  6 +-
>  14 files changed, 34 insertions(+), 22 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Nouveau] [PATCH 5/8] mm: simplify freeing of devmap managed pages

2022-02-07 Thread Jason Gunthorpe
On Mon, Feb 07, 2022 at 07:32:46AM +0100, Christoph Hellwig wrote:
> Make put_devmap_managed_page return if it took charge of the page
> or not and remove the separate page_is_devmap_managed helper.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/mm.h | 34 ++
>  mm/memremap.c  | 20 +---
>  mm/swap.c  | 10 +-
>  3 files changed, 20 insertions(+), 44 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Nouveau] [PATCH 4/8] mm: move free_devmap_managed_page to memremap.c

2022-02-07 Thread Jason Gunthorpe
On Mon, Feb 07, 2022 at 07:32:45AM +0100, Christoph Hellwig wrote:
> free_devmap_managed_page has nothing to do with the code in swap.c,
> move it to live with the rest of the code for devmap handling.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/mm.h |  1 -
>  mm/memremap.c  | 21 +
>  mm/swap.c  | 23 ---
>  3 files changed, 21 insertions(+), 24 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Nouveau] [PATCH 2/8] mm: remove the __KERNEL__ guard from <linux/mm.h>

2022-02-07 Thread Jason Gunthorpe
On Mon, Feb 07, 2022 at 07:32:43AM +0100, Christoph Hellwig wrote:
> __KERNEL__ ifdefs don't make sense outside of include/uapi/.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/mm.h | 4 
>  1 file changed, 4 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Nouveau] [PATCH 1/8] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

2022-02-07 Thread Jason Gunthorpe
On Mon, Feb 07, 2022 at 07:32:42AM +0100, Christoph Hellwig wrote:
> memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
> the superflous extra check.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/memremap.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-07 Thread Jason Gunthorpe
On Mon, Feb 07, 2022 at 07:32:48AM +0100, Christoph Hellwig wrote:
> ZONE_DEVICE struct pages have an extra reference count that complicates
> the code for put_page() and several places in the kernel that need to
> check the reference count to see that a page is not being used (gup,
> compaction, migration, etc.). Clean up the code so the reference count
> doesn't need to be treated specially for ZONE_DEVICE pages.
> 
> Note that this excludes the special idle page wakeup for fsdax pages,
> which still happens at refcount 1.  This is a separate issue and will
> be sorted out later.  Given that only fsdax pages require the
> notification when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
> symbol can go away and be replaced with a FS_DAX check for this hook
> in the put_page fastpath.
> 
> Based on an earlier patch from Ralph Campbell .
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c   |  1 -
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  1 -
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   |  1 -
>  fs/Kconfig   |  1 -
>  include/linux/memremap.h | 12 +++--
>  include/linux/mm.h   |  6 +--
>  lib/test_hmm.c   |  1 -
>  mm/Kconfig   |  4 --
>  mm/internal.h|  2 +
>  mm/memcontrol.c  | 11 ++---
>  mm/memremap.c| 57 
>  mm/migrate.c |  6 ---
>  mm/swap.c| 16 ++-
>  13 files changed, 36 insertions(+), 83 deletions(-)

It looks like a good next step to me

Reviewed-by: Jason Gunthorpe 

>  struct dev_pagemap_ops {
>   /*
> -  * Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
> -  * reach 0 refcount unless there is a refcount bug. This allows the
> -  * device driver to implement its own memory management.)
> +  * Called once the page refcount reaches 0.  The reference count will be
> +  * reset to one by the core code after the method is called to prepare
> +  * for handing out the page again.

I did prefer Ralph's version of this that kept the refcount at 0 while
the page was on the free-list. I hope we can get there again after
later series :)

Jason


Re: [Nouveau] [PATCH 3/8] mm: remove pointless includes from <linux/hmm.h>

2022-02-07 Thread Jason Gunthorpe
On Mon, Feb 07, 2022 at 07:32:44AM +0100, Christoph Hellwig wrote:
> hmm.h pulls in the world for no good reason at all.  Remove the
> includes and push a few ones into the users instead.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
>  drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
>  include/linux/hmm.h  | 9 ++---
>  lib/test_hmm.c   | 2 ++
>  4 files changed, 6 insertions(+), 7 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Jason Gunthorpe
> Sorry for the noise.

Not at all, it is good that more people understand things!

Jason


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-19 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 07:45:05PM -0400, Peter Xu wrote:
> On Tue, May 18, 2021 at 08:03:27PM -0300, Jason Gunthorpe wrote:
> > Logically during fork all these device exclusive pages should be
> > reverted back to their CPU pages, write protected and the CPU page PTE
> > copied to the fork.
> > 
> > We should not copy the device exclusive page PTE to the fork. I think
> > I pointed to this on an earlier rev..
> 
> Agreed.  Though please see the question I posted in the other thread: now I am
> not very sure whether we'll be able to mark a page as device exclusive if that
> page has mapcount>1.

IMHO it is similar to write protect done by filesystems on shared
mappings - all VMAs with a copy of the CPU page have to get switched
to the device exclusive PTE. This is why the rmap stuff is involved in
the migration helpers

Jason


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 04:29:14PM -0400, Peter Xu wrote:
> On Tue, May 18, 2021 at 04:45:09PM -0300, Jason Gunthorpe wrote:
> > On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > > > Indeed it'll be odd for a COW page since for COW page then it means 
> > > > > after
> > > > > parent/child writting to the page it'll clone into two, then it's a 
> > > > > mistery on
> > > > > which one will be the one that "exclusived owned" by the device..
> > > > 
> > > > For COW pages it is like every other fork case.. We can't reliably
> > > > write-protect the device_exclusive page during fork so we must copy it
> > > > at fork time.
> > > > 
> > > > Thus three reasonable choices:
> > > >  - Copy to a new CPU page
> > > >  - Migrate back to a CPU page and write protect it
> > > >  - Copy to a new device exclusive page
> > > 
> > > IMHO the ownership question would really help us to answer this one..
> > 
> > I'm confused about what device ownership you are talking about
> 
> My question was more about the user scenario rather than anything related to
> the kernel code, nor does it related to page struct at all.
> 
> Let me try to be a little bit more verbose...
> 
> Firstly, I think one simple solution to handle fork() of device exclusive ptes
> is to do just like device private ptes: if COW we convert writable ptes into
> readable ptes.  Then when CPU access happens (in either parent/child) page
> restore triggers which will convert those readable ptes into read-only present
> ptes (with the original page backing it).  Then do_wp_page() will take care of
> page copy.

I suspect it doesn't work. This is much more like pinning than
anything, the data in the page is still under active use by a device
and if we cannot globally write protect it, both from CPU and
device access, then we cannot do COW. IIRC the mm can't trigger a full
global write protect through the pgmap?
 
> Then here comes the ownership question: If we still want to have the parent
> process behave like before it fork()ed, IMHO we must make sure that original
> page (that exclusively owned by the device once) still belongs to the parent
> process not the child.  That's why I think if that's the case we'd do early 
> cow
> in fork(), because it guarantees that.

Logically during fork all these device exclusive pages should be
reverted back to their CPU pages, write protected and the CPU page PTE
copied to the fork.

We should not copy the device exclusive page PTE to the fork. I think
I pointed to this on an earlier rev..

We can optimize this into the various variants above, but logically
device exclusive entries stop existing during fork.

Jason


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 02:01:36PM -0400, Peter Xu wrote:
> > > Indeed it'll be odd for a COW page since for COW page then it means after
> > > parent/child writting to the page it'll clone into two, then it's a 
> > > mistery on
> > > which one will be the one that "exclusived owned" by the device..
> > 
> > For COW pages it is like every other fork case.. We can't reliably
> > write-protect the device_exclusive page during fork so we must copy it
> > at fork time.
> > 
> > Thus three reasonable choices:
> >  - Copy to a new CPU page
> >  - Migrate back to a CPU page and write protect it
> >  - Copy to a new device exclusive page
> 
> IMHO the ownership question would really help us to answer this one..

I'm confused about what device ownership you are talking about

It is just a page and it is tied to some specific pgmap?

If the thing providing the migration stuff goes away then all
device_exclusive pages should revert back to CPU pages and destroy the
pgmap?

Jason


Re: [Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access

2021-05-18 Thread Jason Gunthorpe
On Tue, May 18, 2021 at 01:27:42PM -0400, Peter Xu wrote:

> I also have a pure and high level question regarding a process fork() when
> there're device exclusive ptes: would the two processes then own the device
> together?  Is this a real usecase?

If the pages are MAP_SHARED then yes. All VMAs should point at the
same device_exclusive page and all VMA should migrate back to CPU
pages together.

> Indeed it'll be odd for a COW page since for COW page then it means after
> parent/child writting to the page it'll clone into two, then it's a mistery on
> which one will be the one that "exclusived owned" by the device..

For COW pages it is like every other fork case.. We can't reliably
write-protect the device_exclusive page during fork so we must copy it
at fork time.

Thus three reasonable choices:
 - Copy to a new CPU page
 - Migrate back to a CPU page and write protect it
 - Copy to a new device exclusive page

Jason


Re: [Nouveau] [PATCH v7 5/8] mm: Device exclusive memory access

2021-04-01 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 01:20:05PM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > > > 
> > > > > > > I guess that makes sense as the split could go either way at the
> > > > > > > moment but I should add a check to make sure this isn't used with
> > > > > > > pinned pages anyway.
> > > > > > 
> > > > > > Is it possible to have a pinned page under one of these things? If I
> > > > > > pin it before you migrate it then it remains pinned but hidden under
> > > > > > the swap entry?
> > > > > 
> > > > > At the moment yes. But I had planned (and this reminded me) to add a check to
> > > > > prevent marking pinned pages for exclusive access. 
> > > > 
> > > > How do you even do that without races with GUP fast?
> > > 
> > > Unless I've missed something I think I've convinced myself it should be 
> safe 
> > > to do the pin check after make_device_exclusive() has replaced all the 
> PTEs 
> > > with exclusive entries.
> > > 
> > > GUP fast sequence:
> > > 1. Read PTE
> > > 2. Pin page
> > > 3. Check PTE
> > > 4. if PTE changed -> unpin and fallback
> > > 
> > > If make_device_exclusive() runs after (1) it will either succeed or see the
> > > pin from (2) and fail (as desired). GUP should always see the PTE change and
> > > fallback which will revoke the exclusive access.
> > 
> > AFAICT the user can trigger fork at that instant and fork will try to
> > copy the deposited migration entry before it has been checked
> 
> In that case the child will get a read-only exclusive entry and eventually a 
> page copy via do_wp_page() 

Having do_wp_page() do a copy is a security bug. We closed it with the
at-fork checks.

Jason


Re: [Nouveau] [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > 
> > > > > I guess that makes sense as the split could go either way at the
> > > > > moment but I should add a check to make sure this isn't used with
> > > > > pinned pages anyway.
> > > > 
> > > > Is it possible to have a pinned page under one of these things? If I
> > > > pin it before you migrate it then it remains pinned but hidden under
> > > > the swap entry?
> > > 
> > > At the moment yes. But I had planned (and this reminded me) to add a check to
> > > prevent marking pinned pages for exclusive access. 
> > 
> > How do you even do that without races with GUP fast?
> 
> Unless I've missed something I think I've convinced myself it should be safe 
> to do the pin check after make_device_exclusive() has replaced all the PTEs 
> with exclusive entries.
> 
> GUP fast sequence:
> 1. Read PTE
> 2. Pin page
> 3. Check PTE
> 4. if PTE changed -> unpin and fallback
> 
> If make_device_exclusive() runs after (1) it will either succeed or see the 
> pin from (2) and fail (as desired). GUP should always see the PTE change and 
> fallback which will revoke the exclusive access.

AFAICT the user can trigger fork at that instant and fork will try to
copy the deposited migration entry before it has been checked

Jason


Re: [Nouveau] [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Jason Gunthorpe
On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > 
> > > I guess that makes sense as the split could go either way at the
> > > moment but I should add a check to make sure this isn't used with
> > > pinned pages anyway.
> > 
> > Is it possible to have a pinned page under one of these things? If I
> > pin it before you migrate it then it remains pinned but hidden under
> > the swap entry?
> 
> At the moment yes. But I had planned (and this reminded me) to add a check to 
> prevent marking pinned pages for exclusive access. 

How do you even do that without races with GUP fast?

Jason


Re: [Nouveau] [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:

> I guess that makes sense as the split could go either way at the
> moment but I should add a check to make sure this isn't used with
> pinned pages anyway.

Is it possible to have a pinned page under one of these things? If I
pin it before you migrate it then it remains pinned but hidden under
the swap entry?

So the special logic is needed and the pinned page has to be copied
and written as a normal pte, not dropped as a migration entry

Jason


Re: [Nouveau] [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-31 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 03:15:47PM +1100, Alistair Popple wrote:
> On Wednesday, 31 March 2021 2:56:38 PM AEDT John Hubbard wrote:
> > On 3/30/21 3:56 PM, Alistair Popple wrote:
> > ...
> > >> +1 for renaming "munlock*" items to "mlock*", where applicable. good grief.
> > > 
> > > At least the situation was weird enough to prompt further investigation :)
> > > 
> > > Renaming to mlock* doesn't feel like the right solution to me either though. I
> > > am not sure if you saw me responding to myself earlier but I am thinking
> > > renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
> > > page_mlock_one() might be better. Thoughts?
> > > 
> > 
> > Quite confused by this naming idea. Because: try_to_munlock() returns
> > void, so a boolean-style name such as "page_mlocked()" is already not a
> > good fit.
> > 
> > Even more important, though, is that try_to_munlock() is mlock-ing the
> > page, right? Is there some subtle point I'm missing? It really is doing
> > an mlock to the best of my knowledge here. Although the kerneldoc
> > comment for try_to_munlock() seems questionable too:
> 
> It's mlocking the page if it turns out it still needs to be locked after 
> unlocking it. But I don't think you're missing anything.

It is really searching all VMAs to see if the VMA flag is set and if
any are found then it mlocks the page.

But presenting this routine in its simplified form raises lots of
questions:

 - What locking is being used to read the VMA flag?
 - Why do we need to manipulate global struct page flags under the
   page table locks of a single VMA?
 - Why do we need to check for huge pages inside the VMA loop, not
   before going to the rmap? PageTransCompoundHead() is not sensitive to
   the PTEs. (and what happens if the huge page breaks up concurrently?)
 - Why do we clear the mlock bit then run around to try and set it?
   Feels racey.

Jason


Re: [Nouveau] [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread Jason Gunthorpe
On Wed, Mar 31, 2021 at 09:09:30AM +1100, Alistair Popple wrote:

> > > @@ -1796,8 +1821,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags flags)
> > >  void try_to_munlock(struct page *page)
> > >  {
> > 
> > But this is also called try_to_munlock ??
> 
> As far as I can tell this has always been called try_to_munlock() even though 
> it appears to do the opposite.

Maybe we should change it then?

> > /**
> >  * try_to_munlock - try to munlock a page
> >  * @page: the page to be munlocked
> >  *
> >  * Called from munlock code.  Checks all of the VMAs mapping the page
> >  * to make sure nobody else has this page mlocked. The page will be
> >  * returned with PG_mlocked cleared if no other vmas have it mlocked.
> >  */
> 
> In other words it sets PG_mlocked if one or more vmas has it mlocked. So
> try_to_mlock() might be a better name, except that seems to have the 
> potential 
> for confusion as well because it's only called from the munlock code path and 
> never for mlock.

That explanation makes more sense.. This function looks like it is
'set PG_mlocked of the page if any vm->flags has VM_LOCKED'

Maybe call it check_vm_locked or something then and reword the above
comment?

(and why is it OK to read vm->flags for this without any locking?)

> > Something needs attention here..
> 
> I think the code is correct, but perhaps the naming could be better. Would be 
> interested hearing any thoughts on renaming try_to_munlock() to 
> try_to_mlock() 
> as the current name appears based on the context it is called from (munlock) 
> rather than what it does (mlock).

The point of this patch is to make it clearer, after all, so I'd
change something and maybe slightly clarify the comment.

Jason


Re: [Nouveau] [PATCH v7 5/8] mm: Device exclusive memory access

2021-03-30 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 3a5705cfc891..33d11527ef77 100644
> +++ b/mm/memory.c
> @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
> mm_struct *src_mm,
>   pte = pte_swp_mkuffd_wp(pte);
>   set_pte_at(src_mm, addr, src_pte, pte);
>   }
> + } else if (is_device_exclusive_entry(entry)) {
> + page = pfn_swap_entry_to_page(entry);
> +
> + get_page(page);
> + rss[mm_counter(page)]++;
> +
> + if (is_writable_device_exclusive_entry(entry) &&
> + is_cow_mapping(vm_flags)) {
> + /*
> +  * COW mappings require pages in both
> +  * parent and child to be set to read.
> +  */
> + entry = make_readable_device_exclusive_entry(
> + swp_offset(entry));
> + pte = swp_entry_to_pte(entry);
> + if (pte_swp_soft_dirty(*src_pte))
> + pte = pte_swp_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(*src_pte))
> + pte = pte_swp_mkuffd_wp(pte);
> + set_pte_at(src_mm, addr, src_pte, pte);
> + }

This needs to have the same logic as we now have in
copy_present_page(). The page *is* present and we can't copy the PTE
value hidden in a swap entry if we can't copy the PTE normally.

The code should be shared because nobody is going to remember about
this corner case.

Jason


Re: [Nouveau] [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-30 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 11:08:00AM +1100, Alistair Popple wrote:

> +static bool try_to_munlock_one(struct page *page, struct vm_area_struct *vma,
> +  unsigned long address, void *arg)
> +{

Is this function name right?

> + struct page_vma_mapped_walk pvmw = {
> + .page = page,
> + .vma = vma,
> + .address = address,
> + };
> +
> + /* munlock has nothing to gain from examining un-locked vmas */
> + if (!(vma->vm_flags & VM_LOCKED))
> + return true;
> +
> + while (page_vma_mapped_walk(&pvmw)) {
> + /* PTE-mapped THP are never mlocked */
> + if (!PageTransCompound(page)) {
> + /*
> +  * Holding pte lock, we do *not* need
> +  * mmap_lock here
> +  */
> + mlock_vma_page(page);

Because the only action this function seems to take is to call
*mlock*_vma_page()

> + }
> + page_vma_mapped_walk_done(&pvmw);
> +
> + /* found a mlocked page, no point continuing munlock check */
> + return false;
> + }
> +
> + return true;
> +}
> +
>  /**
>   * try_to_munlock - try to munlock a page
>   * @page: the page to be munlocked
> @@ -1796,8 +1821,7 @@ bool try_to_unmap(struct page *page, enum ttu_flags 
> flags)
>  void try_to_munlock(struct page *page)
>  {

But this is also called try_to_munlock ??

/**
 * try_to_munlock - try to munlock a page
 * @page: the page to be munlocked
 *
 * Called from munlock code.  Checks all of the VMAs mapping the page
 * to make sure nobody else has this page mlocked. The page will be
 * returned with PG_mlocked cleared if no other vmas have it mlocked.
 */

So what clears PG_mlocked on this call path?

Something needs attention here..

Jason


Re: [Nouveau] [PATCH v7 1/8] mm: Remove special swap entry functions

2021-03-30 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 11:07:58AM +1100, Alistair Popple wrote:
> Remove multiple similar inline functions for dealing with different
> types of special swap entries.
> 
> Both migration and device private swap entries use the swap offset to
> store a pfn. Instead of multiple inline functions to obtain a struct
> page for each swap entry type use a common function
> pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
> functions as this results is shorter code that is easier to understand.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Ralph Campbell 
> Reviewed-by: Christoph Hellwig 
> 
> ---
> 
> v7:
> * Reworded commit message to include pfn_swap_entry_to_page()
> * Added Christoph's Reviewed-by
> 
> v6:
> * Removed redundant compound_page() call from inside PageLocked()
> * Fixed a minor build issue for s390 reported by kernel test bot
> 
> v4:
> * Added pfn_swap_entry_to_page()
> * Reinstated check that migration entries point to locked pages
> * Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
>   builds
> ---
>  arch/s390/mm/pgtable.c  |  2 +-
>  fs/proc/task_mmu.c  | 23 +-
>  include/linux/swap.h|  4 +--
>  include/linux/swapops.h | 69 ++---
>  mm/hmm.c|  5 ++-
>  mm/huge_memory.c|  4 +--
>  mm/memcontrol.c |  2 +-
>  mm/memory.c | 10 +++---
>  mm/migrate.c|  6 ++--
>  mm/page_vma_mapped.c|  6 ++--
>  10 files changed, 50 insertions(+), 81 deletions(-)

Looks good

Reviewed-by: Jason Gunthorpe 

> diff --git a/mm/hmm.c b/mm/hmm.c
> index 943cb2ba4442..3b2dda71d0ed 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -214,7 +214,7 @@ static inline bool hmm_is_device_private_entry(struct 
> hmm_range *range,
>   swp_entry_t entry)
>  {
>   return is_device_private_entry(entry) &&
> - device_private_entry_to_page(entry)->pgmap->owner ==
> + pfn_swap_entry_to_page(entry)->pgmap->owner ==
>   range->dev_private_owner;
>  }
>  
> @@ -257,8 +257,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
> unsigned long addr,
>   cpu_flags = HMM_PFN_VALID;
>   if (is_write_device_private_entry(entry))
>   cpu_flags |= HMM_PFN_WRITE;
> - *hmm_pfn = device_private_entry_to_pfn(entry) |
> - cpu_flags;
> + *hmm_pfn = swp_offset(entry) | cpu_flags;

Though swp_offset() seems poor here

Something like this seems nicer, maybe as an additional patch in this
series?

diff --git a/mm/hmm.c b/mm/hmm.c
index 943cb2ba444232..c06cbc4e3981b7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -210,14 +210,6 @@ int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long 
addr,
unsigned long end, unsigned long hmm_pfns[], pmd_t pmd);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool hmm_is_device_private_entry(struct hmm_range *range,
-   swp_entry_t entry)
-{
-   return is_device_private_entry(entry) &&
-   device_private_entry_to_page(entry)->pgmap->owner ==
-   range->dev_private_owner;
-}
-
 static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
 pte_t pte)
 {
@@ -226,6 +218,32 @@ static inline unsigned long pte_to_hmm_pfn_flags(struct 
hmm_range *range,
return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
 }
 
+static bool hmm_pte_handle_device_private(struct hmm_range *range, pte_t pte,
+ unsigned long *hmm_pfn)
+{
+   swp_entry_t entry = pte_to_swp_entry(pte);
+   struct page *device_page;
+   unsigned long cpu_flags;
+
+   if (!is_device_private_entry(entry))
+   return false;
+
+   /*
+* If the device private page matches the device the caller understands
+* then return the private pfn directly. The caller must know what to do
+* with it.
+*/
+   device_page = pfn_swap_entry_to_page(entry);
+   if (device_page->pgmap->owner != range->dev_private_owner)
+   return false;
+
+   cpu_flags = HMM_PFN_VALID;
+   if (is_write_device_private_entry(entry))
+   cpu_flags |= HMM_PFN_WRITE;
+   *hmm_pfn = page_to_pfn(device_page) | cpu_flags;
+   return true;
+}
+
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
  unsigned long end, pmd_t *pmdp, pte_t *ptep,
  unsigned long *hmm_pfn)
@@ -247,20 +265,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
unsigned l

Re: [Nouveau] [PATCH v3 5/8] mm: Device exclusive memory access

2021-03-02 Thread Jason Gunthorpe
On Tue, Mar 02, 2021 at 07:57:58PM +1100, Alistair Popple wrote:

> The intent was a driver could use HMM or some other mechanism to keep PTEs 
> synchronised if required. However I just looked at patch 8 in the series 
> again 
> and it appears I got this wrong when converting from the old migration 
> approach:
> 
> +   mutex_unlock(&svmm->mutex);
> +   ret = nouveau_atomic_range_fault(svmm, drm, args,
> +   size, hmm_flags, mm);
> 
> The mutex needs to be unlocked after the range fault to ensure the PTE hasn't 
> changed. But this ends up being a problem because try_to_protect() calls 
> notifiers which need to take that mutex and hence deadlocks.

you have to check the notifier sequence under the mutex and loop
again. The mutex should only cover programming the HW to use the
pages, nothing else.
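
Roughly (a sketch of the standard mmu_interval_notifier retry loop only,
reusing the svmm/range names from the nouveau code under discussion, not
the actual driver code):

   again:
      range.notifier_seq = mmu_interval_read_begin(range.notifier);
      mmap_read_lock(mm);
      ret = hmm_range_fault(&range);
      mmap_read_unlock(mm);
      if (ret) {
              if (ret == -EBUSY)
                      goto again;
              return ret;
      }

      mutex_lock(&svmm->mutex);
      if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
              /* a notifier fired while we were faulting, start over */
              mutex_unlock(&svmm->mutex);
              goto again;
      }
      /* program the HW from range.hmm_pfns, nothing else, then unlock */
      mutex_unlock(&svmm->mutex);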

> However try_to_protect() scans the PTEs again under the PTL so checking the 
> mapping of interest actually gets replaced during the rmap walk seems like a 
> reasonable solution. Thanks for the comments.

It does seem cleaner if you can manage it, the notifier will still be
needed to program the HW though

Jason


Re: [Nouveau] [PATCH v3 5/8] mm: Device exclusive memory access

2021-03-01 Thread Jason Gunthorpe
On Fri, Feb 26, 2021 at 06:18:29PM +1100, Alistair Popple wrote:

> +/**
> + * make_device_exclusive_range() - Mark a range for exclusive use by a device
> + * @mm: mm_struct of associated target process
> + * @start: start of the region to mark for exclusive device access
> + * @end: end address of region
> + * @pages: returns the pages which were successfully marked for exclusive access
> + *
> + * Returns: number of pages successfully marked for exclusive access
> + *
> + * This function finds the ptes mapping page(s) to the given address range 
> and
> + * replaces them with special swap entries preventing userspace CPU access. 
> On
> + * fault these entries are replaced with the original mapping after calling 
> MMU
> + * notifiers.
> + */
> +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct page **pages)
> +{
> + long npages = (end - start) >> PAGE_SHIFT;
> + long i;
> +
> + npages = get_user_pages_remote(mm, start, npages,
> +FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +pages, NULL, NULL);
> + for (i = 0; i < npages; i++) {
> + if (!trylock_page(pages[i])) {
> + put_page(pages[i]);
> + pages[i] = NULL;
> + continue;
> + }
> +
> + if (!try_to_protect(pages[i])) {

Isn't this racy? get_user_pages returns the ptes at an instant in
time, they could have already been changed to something else?

I would think you'd want to switch to the swap entry atomically under
the PTLs?
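
Something along these lines, done from the rmap walk so the exchange
happens while page_vma_mapped_walk() holds the PTL (sketch only; the
entry constructor name is illustrative and error/dirty handling is
omitted):

   static bool make_exclusive_one(struct page *page, struct vm_area_struct *vma,
                                  unsigned long address, void *arg)
   {
           struct page_vma_mapped_walk pvmw = {
                   .page = page,
                   .vma = vma,
                   .address = address,
           };

           while (page_vma_mapped_walk(&pvmw)) {
                   pte_t pteval;
                   swp_entry_t entry;

                   /* pvmw.ptl is held here, the PTE cannot change under us */
                   pteval = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
                   /* illustrative constructor name for the exclusive entry */
                   entry = make_device_exclusive_entry(pte_pfn(pteval));
                   set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte,
                              swp_entry_to_pte(entry));
           }
           return true;
   }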

Jason


Re: [Nouveau] [PATCH v3 6/8] mm: Selftests for exclusive device memory

2021-03-01 Thread Jason Gunthorpe
On Fri, Feb 26, 2021 at 06:18:30PM +1100, Alistair Popple wrote:
> Adds some selftests for exclusive device memory.
> 
> Signed-off-by: Alistair Popple 
> ---
>  lib/test_hmm.c | 124 ++
>  lib/test_hmm_uapi.h|   2 +
>  tools/testing/selftests/vm/hmm-tests.c | 219 +
>  3 files changed, 345 insertions(+)

Please get Ralph to review this, otherwise:

Acked-by: Jason Gunthorpe 

Jason


Re: [Nouveau] [PATCH v3 5/8] mm: Device exclusive memory access

2021-03-01 Thread Jason Gunthorpe
On Fri, Feb 26, 2021 at 06:18:29PM +1100, Alistair Popple wrote:
> Some devices require exclusive write access to shared virtual
> memory (SVM) ranges to perform atomic operations on that memory. This
> requires CPU page tables to be updated to deny access whilst atomic
> operations are occurring.
> 
> In order to do this introduce a new swap entry
> type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
> exclusive access by a device all page table mappings for the particular
> range are replaced with device exclusive swap entries. This causes any
> CPU access to the page to result in a fault.
> 
> Faults are resolved by replacing the faulting entry with the original
> mapping. This results in MMU notifiers being called which a driver uses
> to update access permissions such as revoking atomic access. After
> notifiers have been called the device will no longer have exclusive
> access to the region.

This makes a lot more sense than the prior versions!

I don't know the migration area especially well, but nothing caught my
eye in here

Jason


Re: [Nouveau] [PATCH v3 2/8] mm/swapops: Rework swap entry manipulation code

2021-03-01 Thread Jason Gunthorpe
On Fri, Feb 26, 2021 at 06:18:26PM +1100, Alistair Popple wrote:
> Both migration and device private pages use special swap entries which
> are manipulated by a range of inline functions. The arguments to these
> are somewhat inconsistent so rework them to remove flag type arguments
> and to make the arguments similar for both a read and write entry
> creation.
> 
> Signed-off-by: Alistair Popple 
> ---
>  include/linux/swapops.h | 56 ++---
>  mm/debug_vm_pgtable.c   | 12 -
>  mm/hmm.c|  2 +-
>  mm/huge_memory.c| 26 +--
>  mm/hugetlb.c| 10 +---
>  mm/memory.c | 10 +---
>  mm/migrate.c| 26 ++-
>  mm/mprotect.c   | 10 +---
>  mm/rmap.c   | 10 +---
>  9 files changed, 100 insertions(+), 62 deletions(-)


Reviewed-by: Jason Gunthorpe 


Re: [Nouveau] [PATCH v3 1/8] mm: Remove special swap entry functions

2021-03-01 Thread Jason Gunthorpe
On Fri, Feb 26, 2021 at 06:18:25PM +1100, Alistair Popple wrote:
> Remove the migration and device private entry_to_page() and
> entry_to_pfn() inline functions and instead open code them directly.
> This results in shorter code which is easier to understand.
> 
> Signed-off-by: Alistair Popple 
> ---
>  arch/s390/mm/pgtable.c  |  2 +-
>  fs/proc/task_mmu.c  | 23 +++
>  include/linux/swap.h|  4 ++--
>  include/linux/swapops.h | 51 -
>  mm/hmm.c|  5 ++--
>  mm/memcontrol.c |  2 +-
>  mm/memory.c | 10 
>  mm/migrate.c|  6 ++---
>  mm/page_vma_mapped.c|  6 ++---
>  9 files changed, 30 insertions(+), 79 deletions(-)

I wish you could come up with a more descriptive word than special
here

What I understand is this is true when the swap_offset is a pfn?

> -static inline struct page *migration_entry_to_page(swp_entry_t entry)
> -{
> - struct page *p = pfn_to_page(swp_offset(entry));
> - /*
> -  * Any use of migration entries may only occur while the
> -  * corresponding page is locked
> -  */
> - BUG_ON(!PageLocked(compound_head(p)));
> - return p;

And this constraint has been completely lost?

A comment in front of is_special_entry() explaining all the rules
would help a lot

Transformation looks fine otherwise

Jason


Re: [Nouveau] [PATCH v3 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-01 Thread Jason Gunthorpe
On Fri, Feb 26, 2021 at 06:18:27PM +1100, Alistair Popple wrote:
> The behaviour of try_to_unmap_one() is difficult to follow because it
> performs different operations based on a fairly large set of flags used
> in different combinations.
> 
> TTU_MUNLOCK is one such flag. However it is exclusively used by
> try_to_munlock() which specifies no other flags. Therefore rather than
> overload try_to_unmap_one() with unrelated behaviour split this out into
> its own function and remove the flag.
> 
> Signed-off-by: Alistair Popple 
> 
> 
> Given the comments on not needing to hold mmap_lock it was not 100% clear
> to me if it is safe to check vma->vm_flags & VM_LOCKED and if re-checking
> under the ptl was significant. I left the extra check in case it was, but
> it seems one of the checks is redundant as either the first check is racey
> or the second check is unnecessary.

The rmap doesn't hold the mmap_lock so I think both of these cases are
racey.

eg 

apply_vma_lock_flags()

vma = find_vma(current->mm, start);
if (!vma || vma->vm_start > start)
return -ENOMEM;

prev = vma->vm_prev;
if (start > vma->vm_start)
prev = vma;

for (nstart = start ; ; ) {
vm_flags_t newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;

newflags |= flags;
 [...]
mlock_fixup()
/*
 * vm_flags is protected by the mmap_lock held in write mode.
 * It's okay if try_to_unmap_one unmaps a page just after we
 * set VM_LOCKED, populate_vma_page_range will bring it back.
 */

if (lock)
vma->vm_flags = newflags;
else
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;

Which is only done under the mmap_sem

> +static bool try_to_munlock_one(struct page *page, struct vm_area_struct *vma,
> +  unsigned long address, void *arg)
> +{
> + struct page_vma_mapped_walk pvmw = {
> + .page = page,
> + .vma = vma,
> + .address = address,
> + };
> + bool ret = true;
> +
> + /* munlock has nothing to gain from examining un-locked vmas */
> + if (!(vma->vm_flags & VM_LOCKED))
> + return true;

The mmap_sem can't be obtained in the rmap walkers due to lock
ordering; the various rmap locks are nested under the mmap_sem.

So, when reading data that is not locked it should be written as:

   READ_ONCE(vma->vm_flags) & VM_LOCKED
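
i.e. applied to the hunk quoted above (sketch):

   /*
    * vm_flags can be rewritten under the mmap_lock elsewhere, so
    * annotate the lockless read.
    */
   if (!(READ_ONCE(vma->vm_flags) & VM_LOCKED))
           return true;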

> + while (page_vma_mapped_walk(&pvmw)) {
> + /*
> +  * If the page is mlock()d, we cannot swap it out.
> +  * If it's recently referenced (perhaps page_referenced
> +  * skipped over this mm) then we should reactivate it.
> +  */
> + if (vma->vm_flags & VM_LOCKED) {

And since we write the data without holding the PTLs this looks
pointless, unless there is some other VM_LOCKED manipulation

Jason

