[PATCH] iommu/vt-d: Fix PCI bus rescan device hot add

2022-05-20 Thread Yian Chen
Notifier calling chain uses priority to determine the execution
order of the notifiers or listeners registered to the chain.
PCI bus device hot add utilizes the notification mechanism.

The current code sets low priority (INT_MIN) to Intel
dmar_pci_bus_notifier and postpones DMAR decoding after adding
new device into IOMMU. The result is that struct device pointer
cannot be found in DRHD search for the new device's DMAR/IOMMU.
Subsequently, the device is put under the "catch-all" IOMMU
instead of the correct one. This could cause system hang when
device TLB invalidation is sent to the wrong IOMMU. Invalidation
timeout error and hard lockup have been observed and data
inconsistency/crush may occur as well.

This patch fixes the issue by setting a positive priority(1) for
dmar_pci_bus_notifier while the priority of IOMMU bus notifier
uses the default value(0), therefore DMAR decoding will be in
advance of DRHD search for a new device to find the correct IOMMU.

Following is a 2-step example that triggers the bug by simulating
PCI device hot add behavior in Intel Sapphire Rapids server.

echo 1 > /sys/bus/pci/devices/:6a:01.0/remove
echo 1 > /sys/bus/pci/rescan

Fixes: 59ce0515cdaf ("iommu/vt-d: Update DRHD/RMRR/ATSR device scope")
Cc: sta...@vger.kernel.org # v3.15+
Reported-by: Zhang, Bernice 
Signed-off-by: Jacob Pan 
Signed-off-by: Yian Chen 
---
This is a quick fix for the bug reported. Intel internally evaluated
another redesigned solution that eliminates dmar pci bus notifier to
simplify the workflow of pci hotplug and improve its runtime efficiency.

While considering the fix could apply to downstream and the complexity
of pci hotplug workflow change may significantly increase the
engineering effort to downstream the patch, the choice is to submit this
simple patch to help the deployment of this bug fix.
---

 drivers/iommu/intel/dmar.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index 4de960834a1b..497c5bd95caf 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -383,7 +383,7 @@ static int dmar_pci_bus_notifier(struct notifier_block *nb,
 
 static struct notifier_block dmar_pci_bus_nb = {
.notifier_call = dmar_pci_bus_notifier,
-   .priority = INT_MIN,
+   .priority = 1,
 };
 
 static struct dmar_drhd_unit *
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1] driver core: Extend deferred probe timeout on driver registration

2022-05-20 Thread Saravana Kannan via iommu
On Fri, May 20, 2022 at 5:04 PM Nathan Chancellor  wrote:
>
> On Fri, May 20, 2022 at 04:49:48PM -0700, Saravana Kannan wrote:
> > On Fri, May 20, 2022 at 4:30 PM Nathan Chancellor  wrote:
> > >
> > > Hi Saravana,
> > >
> > > On Fri, Apr 29, 2022 at 03:09:32PM -0700, Saravana Kannan wrote:
> > > > The deferred probe timer that's used for this currently starts at
> > > > late_initcall and runs for driver_deferred_probe_timeout seconds. The
> > > > assumption being that all available drivers would be loaded and
> > > > registered before the timer expires. This means, the
> > > > driver_deferred_probe_timeout has to be pretty large for it to cover the
> > > > worst case. But if we set the default value for it to cover the worst
> > > > case, it would significantly slow down the average case. For this
> > > > reason, the default value is set to 0.
> > > >
> > > > Also, with CONFIG_MODULES=y and the current default values of
> > > > driver_deferred_probe_timeout=0 and fw_devlink=on, devices with missing
> > > > drivers will cause their consumer devices to always defer their probes.
> > > > This is because device links created by fw_devlink defer the probe even
> > > > before the consumer driver's probe() is called.
> > > >
> > > > Instead of a fixed timeout, if we extend an unexpired deferred probe
> > > > timer on every successful driver registration, with the expectation more
> > > > modules would be loaded in the near future, then the default value of
> > > > driver_deferred_probe_timeout only needs to be as long as the worst case
> > > > time difference between two consecutive module loads.
> > > >
> > > > So let's implement that and set the default value to 10 seconds when
> > > > CONFIG_MODULES=y.
> > > >
> > > > Cc: Greg Kroah-Hartman 
> > > > Cc: "Rafael J. Wysocki" 
> > > > Cc: Rob Herring 
> > > > Cc: Linus Walleij 
> > > > Cc: Will Deacon 
> > > > Cc: Ulf Hansson 
> > > > Cc: Kevin Hilman 
> > > > Cc: Thierry Reding 
> > > > Cc: Mark Brown 
> > > > Cc: Pavel Machek 
> > > > Cc: Geert Uytterhoeven 
> > > > Cc: Yoshihiro Shimoda 
> > > > Cc: Paul Kocialkowski 
> > > > Cc: linux-g...@vger.kernel.org
> > > > Cc: linux...@vger.kernel.org
> > > > Cc: iommu@lists.linux-foundation.org
> > > > Signed-off-by: Saravana Kannan 
> > >
> > > I bisected a boot hang with ARCH=s390 defconfig in QEMU down to this
> > > change as commit 2b28a1a84a0e ("driver core: Extend deferred probe
> > > timeout on driver registration") in next-20220520 (bisect log below).
> > >
> > > $ make -skj"$(nproc)" ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- defconfig 
> > > bzImage
> > >
> > > $ timeout --foreground 15m stdbuf -oL -eL \
> > > qemu-system-s390x \
> > > -initrd ... \
> > > -M s390-ccw-virtio \
> > > -display none \
> > > -kernel arch/s390/boot/bzImage \
> > > -m 512m \
> > > -nodefaults \
> > > -serial mon:stdio
> > > ...
> > > [2.077303] In-situ OAM (IOAM) with IPv6
> > > [2.077639] NET: Registered PF_PACKET protocol family
> > > [2.078063] bridge: filtering via arp/ip/ip6tables is no longer 
> > > available by default. Update your scripts to load br_netfilter if you 
> > > need this.
> > > [2.078795] Key type dns_resolver registered
> > > [2.079317] cio: Channel measurement facility initialized using format 
> > > extended (mode autodetected)
> > > [2.081494] Discipline DIAG cannot be used without z/VM
> > > [  260.626363] random: crng init done
> > > qemu-system-s390x: terminating on signal 15 from pid 3815762 (timeout)
> > >
> > > We have a simple rootfs available if necessary:
> > >
> > > https://github.com/ClangBuiltLinux/boot-utils/raw/bc0d17785eb67f1edd0ee0a134970a807895f741/images/s390/rootfs.cpio.zst
> > >
> > > If there is any other information I can provide, please let me know!
> >
> > Hmm... strange. Can you please try the following command line options
> > and tell me which of these has the issue and which don't?
>
> Sure thing!
>
> > 1) deferred_probe_timeout=0
>
> No issue.
>
> > 2) deferred_probe_timeout=1
> > 3) deferred_probe_timeout=300
>
> Both of these appear to hang in the same way, I let each sit for five
> minutes.

Strange that a sufficiently large timeout isn't helping. Is it trying
to boot off a network mount? I'll continue looking into this next
week.

-Saravana

>
> Cheers,
> Nathan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 4/4] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

2022-05-20 Thread Damien Le Moal via iommu
On 5/20/22 17:23, John Garry wrote:
> ATA devices (struct ata_device) have a max_sectors field which is
> configured internally in libata. This is then used to (re)configure the
> associated sdev request queue max_sectors value from how it is earlier set
> in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set
> according to shost limits, which includes host DMA mapping limits.
> 
> Cap the ata_device max_sectors according to shost->max_sectors to respect
> this shost limit.
> 
> Signed-off-by: John Garry 
> ---
>  drivers/ata/libata-scsi.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
> index 06c9d90238d9..25fe89791641 100644
> --- a/drivers/ata/libata-scsi.c
> +++ b/drivers/ata/libata-scsi.c
> @@ -1036,6 +1036,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, 
> struct ata_device *dev)
>   dev->flags |= ATA_DFLAG_NO_UNLOAD;
>  
>   /* configure max sectors */
> + dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
>   blk_queue_max_hw_sectors(q, dev->max_sectors);
>  
>   if (dev->class == ATA_DEV_ATAPI) {

Acked-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/4] dma-mapping: Add dma_opt_mapping_size()

2022-05-20 Thread Damien Le Moal via iommu
On 5/20/22 17:23, John Garry wrote:
> Streaming DMA mapping involving an IOMMU may be much slower for larger
> total mapping size. This is because every IOMMU DMA mapping requires an
> IOVA to be allocated and freed. IOVA sizes above a certain limit are not
> cached, which can have a big impact on DMA mapping performance.
> 
> Provide an API for device drivers to know this "optimal" limit, such that
> they may try to produce mapping which don't exceed it.
> 
> Signed-off-by: John Garry 
> ---
>  Documentation/core-api/dma-api.rst |  9 +
>  include/linux/dma-map-ops.h|  1 +
>  include/linux/dma-mapping.h|  5 +
>  kernel/dma/mapping.c   | 12 
>  4 files changed, 27 insertions(+)
> 
> diff --git a/Documentation/core-api/dma-api.rst 
> b/Documentation/core-api/dma-api.rst
> index 6d6d0edd2d27..b3cd9763d28b 100644
> --- a/Documentation/core-api/dma-api.rst
> +++ b/Documentation/core-api/dma-api.rst
> @@ -204,6 +204,15 @@ Returns the maximum size of a mapping for the device. 
> The size parameter
>  of the mapping functions like dma_map_single(), dma_map_page() and
>  others should not be larger than the returned value.
>  
> +::
> +
> + size_t
> + dma_opt_mapping_size(struct device *dev);
> +
> +Returns the maximum optimal size of a mapping for the device. Mapping large
> +buffers may take longer so device drivers are advised to limit total DMA
> +streaming mappings length to the returned value.
> +
>  ::
>  
>   bool
> diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
> index 0d5b06b3a4a6..98ceba6fa848 100644
> --- a/include/linux/dma-map-ops.h
> +++ b/include/linux/dma-map-ops.h
> @@ -69,6 +69,7 @@ struct dma_map_ops {
>   int (*dma_supported)(struct device *dev, u64 mask);
>   u64 (*get_required_mask)(struct device *dev);
>   size_t (*max_mapping_size)(struct device *dev);
> + size_t (*opt_mapping_size)(void);
>   unsigned long (*get_merge_boundary)(struct device *dev);
>  };
>  
> diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
> index dca2b1355bb1..fe3849434b2a 100644
> --- a/include/linux/dma-mapping.h
> +++ b/include/linux/dma-mapping.h
> @@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
>  int dma_set_coherent_mask(struct device *dev, u64 mask);
>  u64 dma_get_required_mask(struct device *dev);
>  size_t dma_max_mapping_size(struct device *dev);
> +size_t dma_opt_mapping_size(struct device *dev);
>  bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
>  unsigned long dma_get_merge_boundary(struct device *dev);
>  struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
> @@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device 
> *dev)
>  {
>   return 0;
>  }
> +static inline size_t dma_opt_mapping_size(struct device *dev)
> +{
> + return 0;
> +}
>  static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
>  {
>   return false;
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index db7244291b74..1bfe11b1edb6 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(dma_max_mapping_size);
>  
> +size_t dma_opt_mapping_size(struct device *dev)
> +{
> + const struct dma_map_ops *ops = get_dma_ops(dev);
> + size_t size = SIZE_MAX;
> +
> + if (ops && ops->opt_mapping_size)
> + size = ops->opt_mapping_size();
> +
> + return min(dma_max_mapping_size(dev), size);
> +}
> +EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
> +
>  bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
>  {
>   const struct dma_map_ops *ops = get_dma_ops(dev);

Looks OK to me.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()

2022-05-20 Thread Damien Le Moal via iommu
On 5/20/22 17:23, John Garry wrote:
> Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which
> allows the drivers to know the optimal mapping limit and thus limit the
> requested IOVA lengths.
> 
> This value is based on the IOVA rcache range limit, as IOVAs allocated
> above this limit must always be newly allocated, which may be quite slow.
> 
> Signed-off-by: John Garry 
> ---
>  drivers/iommu/dma-iommu.c | 6 ++
>  drivers/iommu/iova.c  | 5 +
>  include/linux/iova.h  | 2 ++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 09f6e1c0f9c0..f619e41b9172 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1442,6 +1442,11 @@ static unsigned long 
> iommu_dma_get_merge_boundary(struct device *dev)
>   return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
>  }
>  
> +static size_t iommu_dma_opt_mapping_size(void)
> +{
> + return iova_rcache_range();
> +}
> +
>  static const struct dma_map_ops iommu_dma_ops = {
>   .alloc  = iommu_dma_alloc,
>   .free   = iommu_dma_free,
> @@ -1462,6 +1467,7 @@ static const struct dma_map_ops iommu_dma_ops = {
>   .map_resource   = iommu_dma_map_resource,
>   .unmap_resource = iommu_dma_unmap_resource,
>   .get_merge_boundary = iommu_dma_get_merge_boundary,
> + .opt_mapping_size   = iommu_dma_opt_mapping_size,
>  };
>  
>  /*
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index db77aa675145..9f00b58d546e 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain 
> *iovad,
>  static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain 
> *iovad);
>  static void free_iova_rcaches(struct iova_domain *iovad);
>  
> +unsigned long iova_rcache_range(void)

Why not a size_t return type ?

> +{
> + return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
> +}
> +
>  static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
>  {
>   struct iova_domain *iovad;
> diff --git a/include/linux/iova.h b/include/linux/iova.h
> index 320a70e40233..c6ba6d95d79c 100644
> --- a/include/linux/iova.h
> +++ b/include/linux/iova.h
> @@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain 
> *iovad, dma_addr_t iova)
>  int iova_cache_get(void);
>  void iova_cache_put(void);
>  
> +unsigned long iova_rcache_range(void);
> +
>  void free_iova(struct iova_domain *iovad, unsigned long pfn);
>  void __free_iova(struct iova_domain *iovad, struct iova *iova);
>  struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,


-- 
Damien Le Moal
Western Digital Research
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits

2022-05-20 Thread Damien Le Moal via iommu
On 5/20/22 17:23, John Garry wrote:
> Streaming DMA mappings may be considerably slower when mappings go through
> an IOMMU and the total mapping length is somewhat long. This is because the
> IOMMU IOVA code allocates and free an IOVA for each mapping, which may
> affect performance.
> 
> For performance reasons set the request_queue max_sectors from
> dma_opt_mapping_size(), which knows this mapping limit.
> 
> In addition, the shost->max_sectors is repeatedly set for each sdev in
> __scsi_init_queue(). This is unnecessary, so set once when adding the
> host.
> 
> Signed-off-by: John Garry 
> ---
>  drivers/scsi/hosts.c| 5 +
>  drivers/scsi/scsi_lib.c | 4 
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index f69b77cbf538..a3ae6345473b 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -225,6 +225,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, 
> struct device *dev,
>   shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
>  shost->can_queue);
>  
> + if (dma_dev->dma_mask) {
> + shost->max_sectors = min_t(unsigned int, shost->max_sectors,
> + dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
> + }

Nit: you could drop the curly brackets here.

> +
>   error = scsi_init_sense_cache(shost);
>   if (error)
>   goto fail;
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 8d18cc7e510e..2d43bb8799bd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct 
> request_queue *q)
>   blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
>   }
>  
> - if (dev->dma_mask) {
> - shost->max_sectors = min_t(unsigned int, shost->max_sectors,
> - dma_max_mapping_size(dev) >> SECTOR_SHIFT);
> - }
>   blk_queue_max_hw_sectors(q, shost->max_sectors);
>   blk_queue_segment_boundary(q, shost->dma_boundary);
>   dma_set_seg_boundary(dev, shost->dma_boundary);

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1] driver core: Extend deferred probe timeout on driver registration

2022-05-20 Thread Nathan Chancellor
On Fri, May 20, 2022 at 04:49:48PM -0700, Saravana Kannan wrote:
> On Fri, May 20, 2022 at 4:30 PM Nathan Chancellor  wrote:
> >
> > Hi Saravana,
> >
> > On Fri, Apr 29, 2022 at 03:09:32PM -0700, Saravana Kannan wrote:
> > > The deferred probe timer that's used for this currently starts at
> > > late_initcall and runs for driver_deferred_probe_timeout seconds. The
> > > assumption being that all available drivers would be loaded and
> > > registered before the timer expires. This means, the
> > > driver_deferred_probe_timeout has to be pretty large for it to cover the
> > > worst case. But if we set the default value for it to cover the worst
> > > case, it would significantly slow down the average case. For this
> > > reason, the default value is set to 0.
> > >
> > > Also, with CONFIG_MODULES=y and the current default values of
> > > driver_deferred_probe_timeout=0 and fw_devlink=on, devices with missing
> > > drivers will cause their consumer devices to always defer their probes.
> > > This is because device links created by fw_devlink defer the probe even
> > > before the consumer driver's probe() is called.
> > >
> > > Instead of a fixed timeout, if we extend an unexpired deferred probe
> > > timer on every successful driver registration, with the expectation more
> > > modules would be loaded in the near future, then the default value of
> > > driver_deferred_probe_timeout only needs to be as long as the worst case
> > > time difference between two consecutive module loads.
> > >
> > > So let's implement that and set the default value to 10 seconds when
> > > CONFIG_MODULES=y.
> > >
> > > Cc: Greg Kroah-Hartman 
> > > Cc: "Rafael J. Wysocki" 
> > > Cc: Rob Herring 
> > > Cc: Linus Walleij 
> > > Cc: Will Deacon 
> > > Cc: Ulf Hansson 
> > > Cc: Kevin Hilman 
> > > Cc: Thierry Reding 
> > > Cc: Mark Brown 
> > > Cc: Pavel Machek 
> > > Cc: Geert Uytterhoeven 
> > > Cc: Yoshihiro Shimoda 
> > > Cc: Paul Kocialkowski 
> > > Cc: linux-g...@vger.kernel.org
> > > Cc: linux...@vger.kernel.org
> > > Cc: iommu@lists.linux-foundation.org
> > > Signed-off-by: Saravana Kannan 
> >
> > I bisected a boot hang with ARCH=s390 defconfig in QEMU down to this
> > change as commit 2b28a1a84a0e ("driver core: Extend deferred probe
> > timeout on driver registration") in next-20220520 (bisect log below).
> >
> > $ make -skj"$(nproc)" ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- defconfig 
> > bzImage
> >
> > $ timeout --foreground 15m stdbuf -oL -eL \
> > qemu-system-s390x \
> > -initrd ... \
> > -M s390-ccw-virtio \
> > -display none \
> > -kernel arch/s390/boot/bzImage \
> > -m 512m \
> > -nodefaults \
> > -serial mon:stdio
> > ...
> > [2.077303] In-situ OAM (IOAM) with IPv6
> > [2.077639] NET: Registered PF_PACKET protocol family
> > [2.078063] bridge: filtering via arp/ip/ip6tables is no longer 
> > available by default. Update your scripts to load br_netfilter if you need 
> > this.
> > [2.078795] Key type dns_resolver registered
> > [2.079317] cio: Channel measurement facility initialized using format 
> > extended (mode autodetected)
> > [2.081494] Discipline DIAG cannot be used without z/VM
> > [  260.626363] random: crng init done
> > qemu-system-s390x: terminating on signal 15 from pid 3815762 (timeout)
> >
> > We have a simple rootfs available if necessary:
> >
> > https://github.com/ClangBuiltLinux/boot-utils/raw/bc0d17785eb67f1edd0ee0a134970a807895f741/images/s390/rootfs.cpio.zst
> >
> > If there is any other information I can provide, please let me know!
> 
> Hmm... strange. Can you please try the following command line options
> and tell me which of these has the issue and which don't?

Sure thing!

> 1) deferred_probe_timeout=0

No issue.

> 2) deferred_probe_timeout=1
> 3) deferred_probe_timeout=300

Both of these appear to hang in the same way, I let each sit for five
minutes.

Cheers,
Nathan
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1] driver core: Extend deferred probe timeout on driver registration

2022-05-20 Thread Saravana Kannan via iommu
On Fri, May 20, 2022 at 4:30 PM Nathan Chancellor  wrote:
>
> Hi Saravana,
>
> On Fri, Apr 29, 2022 at 03:09:32PM -0700, Saravana Kannan wrote:
> > The deferred probe timer that's used for this currently starts at
> > late_initcall and runs for driver_deferred_probe_timeout seconds. The
> > assumption being that all available drivers would be loaded and
> > registered before the timer expires. This means, the
> > driver_deferred_probe_timeout has to be pretty large for it to cover the
> > worst case. But if we set the default value for it to cover the worst
> > case, it would significantly slow down the average case. For this
> > reason, the default value is set to 0.
> >
> > Also, with CONFIG_MODULES=y and the current default values of
> > driver_deferred_probe_timeout=0 and fw_devlink=on, devices with missing
> > drivers will cause their consumer devices to always defer their probes.
> > This is because device links created by fw_devlink defer the probe even
> > before the consumer driver's probe() is called.
> >
> > Instead of a fixed timeout, if we extend an unexpired deferred probe
> > timer on every successful driver registration, with the expectation more
> > modules would be loaded in the near future, then the default value of
> > driver_deferred_probe_timeout only needs to be as long as the worst case
> > time difference between two consecutive module loads.
> >
> > So let's implement that and set the default value to 10 seconds when
> > CONFIG_MODULES=y.
> >
> > Cc: Greg Kroah-Hartman 
> > Cc: "Rafael J. Wysocki" 
> > Cc: Rob Herring 
> > Cc: Linus Walleij 
> > Cc: Will Deacon 
> > Cc: Ulf Hansson 
> > Cc: Kevin Hilman 
> > Cc: Thierry Reding 
> > Cc: Mark Brown 
> > Cc: Pavel Machek 
> > Cc: Geert Uytterhoeven 
> > Cc: Yoshihiro Shimoda 
> > Cc: Paul Kocialkowski 
> > Cc: linux-g...@vger.kernel.org
> > Cc: linux...@vger.kernel.org
> > Cc: iommu@lists.linux-foundation.org
> > Signed-off-by: Saravana Kannan 
>
> I bisected a boot hang with ARCH=s390 defconfig in QEMU down to this
> change as commit 2b28a1a84a0e ("driver core: Extend deferred probe
> timeout on driver registration") in next-20220520 (bisect log below).
>
> $ make -skj"$(nproc)" ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- defconfig 
> bzImage
>
> $ timeout --foreground 15m stdbuf -oL -eL \
> qemu-system-s390x \
> -initrd ... \
> -M s390-ccw-virtio \
> -display none \
> -kernel arch/s390/boot/bzImage \
> -m 512m \
> -nodefaults \
> -serial mon:stdio
> ...
> [2.077303] In-situ OAM (IOAM) with IPv6
> [2.077639] NET: Registered PF_PACKET protocol family
> [2.078063] bridge: filtering via arp/ip/ip6tables is no longer available 
> by default. Update your scripts to load br_netfilter if you need this.
> [2.078795] Key type dns_resolver registered
> [2.079317] cio: Channel measurement facility initialized using format 
> extended (mode autodetected)
> [2.081494] Discipline DIAG cannot be used without z/VM
> [  260.626363] random: crng init done
> qemu-system-s390x: terminating on signal 15 from pid 3815762 (timeout)
>
> We have a simple rootfs available if necessary:
>
> https://github.com/ClangBuiltLinux/boot-utils/raw/bc0d17785eb67f1edd0ee0a134970a807895f741/images/s390/rootfs.cpio.zst
>
> If there is any other information I can provide, please let me know!

Hmm... strange. Can you please try the following command line options
and tell me which of these has the issue and which don't?
1) deferred_probe_timeout=0
2) deferred_probe_timeout=1
3) deferred_probe_timeout=300

That should help me narrow down where the error might be.

Thanks,
Saravana

>
> Cheers,
> Nathan
>
> # bad: [18ecd30af1a8402c162cca1bd58771c0e5be7815] Add linux-next specific 
> files for 20220520
> # good: [b015dcd62b86d298829990f8261d5d154b8d7af5] Merge tag 
> 'for-5.18/parisc-4' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
> git bisect start '18ecd30af1a8402c162cca1bd58771c0e5be7815' 
> 'b015dcd62b86d298829990f8261d5d154b8d7af5'
> # good: [f9b63740b666dd9887eb0282d21b5f65bb0cadd0] Merge branch 'master' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
> git bisect good f9b63740b666dd9887eb0282d21b5f65bb0cadd0
> # good: [1f5eb3e76303572f0318e8c50da51c516580aa03] Merge branch 'master' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
> git bisect good 1f5eb3e76303572f0318e8c50da51c516580aa03
> # bad: [4c1d9cc0363691893ef94fa0d798faca013e27d3] Merge branch 'staging-next' 
> of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
&

Re: [PATCH v1] driver core: Extend deferred probe timeout on driver registration

2022-05-20 Thread Nathan Chancellor
Hi Saravana,

On Fri, Apr 29, 2022 at 03:09:32PM -0700, Saravana Kannan wrote:
> The deferred probe timer that's used for this currently starts at
> late_initcall and runs for driver_deferred_probe_timeout seconds. The
> assumption being that all available drivers would be loaded and
> registered before the timer expires. This means, the
> driver_deferred_probe_timeout has to be pretty large for it to cover the
> worst case. But if we set the default value for it to cover the worst
> case, it would significantly slow down the average case. For this
> reason, the default value is set to 0.
> 
> Also, with CONFIG_MODULES=y and the current default values of
> driver_deferred_probe_timeout=0 and fw_devlink=on, devices with missing
> drivers will cause their consumer devices to always defer their probes.
> This is because device links created by fw_devlink defer the probe even
> before the consumer driver's probe() is called.
> 
> Instead of a fixed timeout, if we extend an unexpired deferred probe
> timer on every successful driver registration, with the expectation more
> modules would be loaded in the near future, then the default value of
> driver_deferred_probe_timeout only needs to be as long as the worst case
> time difference between two consecutive module loads.
> 
> So let's implement that and set the default value to 10 seconds when
> CONFIG_MODULES=y.
> 
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: Rob Herring 
> Cc: Linus Walleij 
> Cc: Will Deacon 
> Cc: Ulf Hansson 
> Cc: Kevin Hilman 
> Cc: Thierry Reding 
> Cc: Mark Brown 
> Cc: Pavel Machek 
> Cc: Geert Uytterhoeven 
> Cc: Yoshihiro Shimoda 
> Cc: Paul Kocialkowski 
> Cc: linux-g...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> Cc: iommu@lists.linux-foundation.org
> Signed-off-by: Saravana Kannan 

I bisected a boot hang with ARCH=s390 defconfig in QEMU down to this
change as commit 2b28a1a84a0e ("driver core: Extend deferred probe
timeout on driver registration") in next-20220520 (bisect log below).

$ make -skj"$(nproc)" ARCH=s390 CROSS_COMPILE=s390x-linux-gnu- defconfig bzImage

$ timeout --foreground 15m stdbuf -oL -eL \
qemu-system-s390x \
-initrd ... \
-M s390-ccw-virtio \
-display none \
-kernel arch/s390/boot/bzImage \
-m 512m \
-nodefaults \
-serial mon:stdio
...
[2.077303] In-situ OAM (IOAM) with IPv6
[2.077639] NET: Registered PF_PACKET protocol family
[2.078063] bridge: filtering via arp/ip/ip6tables is no longer available by 
default. Update your scripts to load br_netfilter if you need this.
[2.078795] Key type dns_resolver registered
[2.079317] cio: Channel measurement facility initialized using format 
extended (mode autodetected)
[2.081494] Discipline DIAG cannot be used without z/VM
[  260.626363] random: crng init done
qemu-system-s390x: terminating on signal 15 from pid 3815762 (timeout)

We have a simple rootfs available if necessary:

https://github.com/ClangBuiltLinux/boot-utils/raw/bc0d17785eb67f1edd0ee0a134970a807895f741/images/s390/rootfs.cpio.zst

If there is any other information I can provide, please let me know!

Cheers,
Nathan

# bad: [18ecd30af1a8402c162cca1bd58771c0e5be7815] Add linux-next specific files 
for 20220520
# good: [b015dcd62b86d298829990f8261d5d154b8d7af5] Merge tag 
'for-5.18/parisc-4' of 
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
git bisect start '18ecd30af1a8402c162cca1bd58771c0e5be7815' 
'b015dcd62b86d298829990f8261d5d154b8d7af5'
# good: [f9b63740b666dd9887eb0282d21b5f65bb0cadd0] Merge branch 'master' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
git bisect good f9b63740b666dd9887eb0282d21b5f65bb0cadd0
# good: [1f5eb3e76303572f0318e8c50da51c516580aa03] Merge branch 'master' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
git bisect good 1f5eb3e76303572f0318e8c50da51c516580aa03
# bad: [4c1d9cc0363691893ef94fa0d798faca013e27d3] Merge branch 'staging-next' 
of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
git bisect bad 4c1d9cc0363691893ef94fa0d798faca013e27d3
# bad: [dcb68304485c0ba5f84f1a54687c751b68263d93] Merge branch 'usb-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial.git
git bisect bad dcb68304485c0ba5f84f1a54687c751b68263d93
# good: [61271996dc46aecb40fd26f89a4ec0a6bd8f3a8f] Merge branch 'next' of 
git://git.kernel.org/pub/scm/virt/kvm/kvm.git
git bisect good 61271996dc46aecb40fd26f89a4ec0a6bd8f3a8f
# good: [d4db45a71f56032b552e161968bb0e5fdd2767f8] Merge branch 'for-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds.git
git bisect good d4db45a71f56032b552e161968bb0e5fdd2767f8
# good: [d090c7a2ab84663185e4abda21d7d83880937c8a] USB / dwc3: Fix a checkpatch 
warning in core.c
git bisect good d090c7a2ab84663185e4abda21d7d83880937c8a
# bad

Re: [PATCH v2 1/2] dt-bindings: mediatek: Add bindings for MT6795 M4U

2022-05-20 Thread Rob Herring
On Wed, 18 May 2022 12:18:48 +0200, AngeloGioacchino Del Regno wrote:
> Add bindings for the MediaTek Helio X10 (MT6795) IOMMU/M4U.
> 
> Signed-off-by: AngeloGioacchino Del Regno 
> 
> ---
>  .../bindings/iommu/mediatek,iommu.yaml|  4 +
>  include/dt-bindings/memory/mt6795-larb-port.h | 96 +++
>  2 files changed, 100 insertions(+)
>  create mode 100644 include/dt-bindings/memory/mt6795-larb-port.h
> 

Acked-by: Rob Herring 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] dma-direct: Don't over-decrypt memory

2022-05-20 Thread Robin Murphy
The original x86 sev_alloc() only called set_memory_decrypted() on
memory returned by alloc_pages_node(), so the page order calculation
fell out of that logic. However, the common dma-direct code has several
potential allocators, not all of which are guaranteed to round up the
underlying allocation to a power-of-two size, so carrying over that
calculation for the encryption/decryption size was a mistake. Fix it by
rounding to a *number* of pages, rather than an order.

Until recently there was an even worse interaction with DMA_DIRECT_REMAP
where we could have ended up decrypting part of the next adjacent
vmalloc area, only averted by no architecture actually supporting both
configs at once. Don't ask how I found that one out...

CC: sta...@vger.kernel.org
Fixes: c10f07aa27da ("dma/direct: Handle force decryption for DMA coherent 
buffers in common code")
Signed-off-by: Robin Murphy 
---
 kernel/dma/direct.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 9743c6ccce1a..09d78aa40466 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -79,7 +79,7 @@ static int dma_set_decrypted(struct device *dev, void *vaddr, 
size_t size)
 {
if (!force_dma_unencrypted(dev))
return 0;
-   return set_memory_decrypted((unsigned long)vaddr, 1 << get_order(size));
+   return set_memory_decrypted((unsigned long)vaddr, PFN_UP(size));
 }
 
 static int dma_set_encrypted(struct device *dev, void *vaddr, size_t size)
@@ -88,7 +88,7 @@ static int dma_set_encrypted(struct device *dev, void *vaddr, 
size_t size)
 
if (!force_dma_unencrypted(dev))
return 0;
-   ret = set_memory_encrypted((unsigned long)vaddr, 1 << get_order(size));
+   ret = set_memory_encrypted((unsigned long)vaddr, PFN_UP(size));
if (ret)
pr_warn_ratelimited("leaking DMA memory that can't be 
re-encrypted\n");
return ret;
-- 
2.35.3.dirty

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v4 2/6] iommu: Add a helper to do PASID lookup from domain

2022-05-20 Thread Jacob Pan
Hi Christoph,

On Wed, 18 May 2022 23:48:44 -0700, Christoph Hellwig 
wrote:

> On Wed, May 18, 2022 at 11:21:16AM -0700, Jacob Pan wrote:
> > +ioasid_t iommu_get_pasid_from_domain(struct device *dev, struct
> > iommu_domain *domain)  
> 
> Overly long line here.
will fix,

Thanks,

Jacob
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU

2022-05-20 Thread Jason Gunthorpe via iommu
On Fri, May 20, 2022 at 01:00:32PM +0100, Will Deacon wrote:
> On Fri, May 20, 2022 at 12:02:23PM +0100, Robin Murphy wrote:
> > On 2022-05-10 17:55, Jason Gunthorpe via iommu wrote:
> > > This control causes the ARM SMMU drivers to choose a stage 2
> > > implementation for the IO pagetable (vs the stage 1 usual default),
> > > however this choice has no visible impact to the VFIO user.
> > 
> > Oh, I should have read more carefully... this isn't entirely true. Stage 2
> > has a different permission model from stage 1, so although it's arguably
> > undocumented behaviour, VFIO users that know enough about the underlying
> > system could use this to get write-only mappings if they so wish.
> 
> There's also an impact on combining memory attributes, but it's not hugely
> clear how that impacts userspace via things like VFIO_DMA_CC_IOMMU.

VFIO_DMA_CC_IOMMU has been clarified now to only be about the ability
to reject device requested no-snoop transactions. ie ignore the
NoSnoop bit in PCIe TLPs.

AFAICT SMMU can't implement this currently, and maybe doesn't need to.

VFIO requires the IO page tables be setup for cache coherent operation
for ordinary device DMA otherwises users like DPDK will not work.

It is surprising to hear you say that VFIO_TYPE1_NESTING_IOMMU might
also cause the IO to become non-coherent..

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU

2022-05-20 Thread Jason Gunthorpe via iommu
On Fri, May 20, 2022 at 12:02:23PM +0100, Robin Murphy wrote:
> On 2022-05-10 17:55, Jason Gunthorpe via iommu wrote:
> > This control causes the ARM SMMU drivers to choose a stage 2
> > implementation for the IO pagetable (vs the stage 1 usual default),
> > however this choice has no visible impact to the VFIO user.
> 
> Oh, I should have read more carefully... this isn't entirely true. Stage 2
> has a different permission model from stage 1, so although it's arguably
> undocumented behaviour, VFIO users that know enough about the underlying
> system could use this to get write-only mappings if they so wish.

I think this is getting very pedantic, the intention of this was to
enable nesting, not to expose subtle differences in ARM
architecture. I'm very doubtful something exists here that uses this
without also using the other parts.

> > Just in-case there is some userspace using this continue to treat
> > requesting it as a NOP, but do not advertise support any more.
> 
> This also seems a bit odd, especially given that it's not actually a no-op;
> surely either it's supported and functional or it isn't?

Indeed, I was loath to completely break possibly existing broken
userspace, but I agree we could just fail here as well. Normal
userspace should see the missing capability and not call the API at
all. But maybe somehow hardwired their VMM or something IDK.

> In all honesty, I'm not personally attached to this code either way. If this
> patch had come 5 years ago, when the interface already looked like a bit of
> a dead end, I'd probably have agreed more readily. But now, when we're
> possibly mere months away from implementing the functional equivalent for
> IOMMUFD, which if done right might be able to support a trivial compat layer
> for this anyway,

There won't be compat for this - the generic code is not going to know
about the driver specific details of how nesting works.  The plan is
to have some new op like:

   alloc_user_domain(struct device *dev, void *params, size_t params_len, ..)

And all the special driver-specific iommu domains will be allocated
opaquely this way. The stage 1 or stage 2 choice will be specified in
the params along with any other HW specific parameters required to
hook it up properly.

So, the core code can't just do what VFIO_TYPE1_NESTING_IOMMU is doing
now without also being hardwired to make it work with ARM.

This is why I want to remove it now to be clear we are not going to be
accommodating this API.

> I just don't see what we gain from not at least waiting to see where
> that ends up. The given justification reads as "get rid of this code
> that we already know we'll need to bring back in some form, and
> half-break an unpopular VFIO ABI because it doesn't do *everything*
> that its name might imply", which just isn't convincing me.

No it wont come back. The code that we will need is the driver code in
SMMU, and that wasn't touched here.

We can defer this, as long as we all agree the uAPI is going away.

But I don't see a strong argument why we should wait either.

Jason
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 00/35] iommu/amd: Add multiple PCI segments support

2022-05-20 Thread Vasant Hegde via iommu
Joerg,


On 5/20/2022 3:33 PM, Joerg Roedel wrote:
> Hi Vasant,
> 
> On Fri, May 20, 2022 at 03:25:38PM +0530, Vasant Hegde wrote:
>> Ping. Did you get a chance to look into this series?
> 
> Sorry, too late for this round. The changes are pretty invasive and
> merging them at -rc7 stage would not give them enough testing before
> being merged. Please send me a reminder after the next merge window.

Sure. I will remind you after v5.19 merge window closes.

-Vasant

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU

2022-05-20 Thread Will Deacon
On Fri, May 20, 2022 at 12:02:23PM +0100, Robin Murphy wrote:
> On 2022-05-10 17:55, Jason Gunthorpe via iommu wrote:
> > This control causes the ARM SMMU drivers to choose a stage 2
> > implementation for the IO pagetable (vs the stage 1 usual default),
> > however this choice has no visible impact to the VFIO user.
> 
> Oh, I should have read more carefully... this isn't entirely true. Stage 2
> has a different permission model from stage 1, so although it's arguably
> undocumented behaviour, VFIO users that know enough about the underlying
> system could use this to get write-only mappings if they so wish.

There's also an impact on combining memory attributes, but it's not hugely
clear how that impacts userspace via things like VFIO_DMA_CC_IOMMU.

> There may potentially also be visible differences in translation performance
> between stages, although I imagine that's firmly over in the niche of things
> that users might look at for system validation purposes, rather than for
> practical usefulness.
> 
> > Further qemu
> > never implemented this and no other userspace user is known.
> > 
> > The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add
> > new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
> > SMMU translation services to the guest operating system" however the rest
> > of the API to set the guest table pointer for the stage 1 was never
> > completed, or at least never upstreamed, rendering this part useless dead
> > code.
> > 
> > Since the current patches to enable nested translation, aka userspace page
> > tables, rely on iommufd and will not use the enable_nesting()
> > iommu_domain_op, remove this infrastructure. However, don't cut too deep
> > into the SMMU drivers for now expecting the iommufd work to pick it up -
> > we still need to create S2 IO page tables.
> > 
> > Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
> > enable_nesting iommu_domain_op.
> > 
> > Just in-case there is some userspace using this continue to treat
> > requesting it as a NOP, but do not advertise support any more.
> 
> This also seems a bit odd, especially given that it's not actually a no-op;
> surely either it's supported and functional or it isn't?
> 
> In all honesty, I'm not personally attached to this code either way. If this
> patch had come 5 years ago, when the interface already looked like a bit of
> a dead end, I'd probably have agreed more readily. But now, when we're
> possibly mere months away from implementing the functional equivalent for
> IOMMUFD, which if done right might be able to support a trivial compat layer
> for this anyway, I just don't see what we gain from not at least waiting to
> see where that ends up. The given justification reads as "get rid of this
> code that we already know we'll need to bring back in some form, and
> half-break an unpopular VFIO ABI because it doesn't do *everything* that its
> name might imply", which just isn't convincing me.

I'm inclined to agree although we can very easily revert this patch when we
need to bring the stuff back, it would just be a bit churny.

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 06/10] iommu/sva: Refactoring iommu_sva_bind/unbind_device()

2022-05-20 Thread Jean-Philippe Brucker
On Fri, May 20, 2022 at 02:38:12PM +0800, Baolu Lu wrote:
> On 2022/5/20 00:39, Jean-Philippe Brucker wrote:
> > > +struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct 
> > > mm_struct *mm)
> > > +{
> > > + struct iommu_sva_domain *sva_domain;
> > > + struct iommu_domain *domain;
> > > + ioasid_t max_pasid = 0;
> > > + int ret = -EINVAL;
> > > +
> > > + /* Allocate mm->pasid if necessary. */
> > > + if (!dev->iommu->iommu_dev->pasids)
> > > + return ERR_PTR(-EOPNOTSUPP);
> > > +
> > > + if (dev_is_pci(dev)) {
> > > + max_pasid = pci_max_pasids(to_pci_dev(dev));
> > > + if (max_pasid < 0)
> > > + return ERR_PTR(max_pasid);
> > > + } else {
> > > + ret = device_property_read_u32(dev, "pasid-num-bits",
> > > +_pasid);
> > > + if (ret)
> > > + return ERR_PTR(ret);
> > > + max_pasid = (1UL << max_pasid);
> > > + }
> > The IOMMU driver needs this PASID width information earlier, when creating
> > the PASID table (in .probe_device(), .attach_dev()). Since we're moving it
> > to the IOMMU core to avoid code duplication, it should be done earlier and
> > stored in dev->iommu
> 
> Yes, really. How about below changes?
> 
> From f1382579e8a15ca49acdf758d38fd36451ea174d Mon Sep 17 00:00:00 2001
> From: Lu Baolu 
> Date: Mon, 28 Feb 2022 15:01:35 +0800
> Subject: [PATCH 1/1] iommu: Add pasids field in struct dev_iommu
> 
> Use this field to save the number of PASIDs that a device is able to
> consume. It is a generic attribute of a device and lifting it into the
> per-device dev_iommu struct could help to avoid the boilerplate code
> in various IOMMU drivers.
> 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/iommu.c | 15 +++
>  include/linux/iommu.h |  2 ++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index e49c5a5b8cc1..6b731171d42f 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -194,6 +195,8 @@ EXPORT_SYMBOL_GPL(iommu_device_unregister);
>  static struct dev_iommu *dev_iommu_get(struct device *dev)
>  {
>   struct dev_iommu *param = dev->iommu;
> + u32 max_pasids = 0;
> + int ret;
> 
>   if (param)
>   return param;
> @@ -202,6 +205,18 @@ static struct dev_iommu *dev_iommu_get(struct device
> *dev)
>   if (!param)
>   return NULL;
> 
> + if (dev_is_pci(dev)) {
> + ret = pci_max_pasids(to_pci_dev(dev));
> + if (ret > 0)
> + max_pasids = ret;
> + } else {
> + ret = device_property_read_u32(dev, "pasid-num-bits",
> +_pasids);
> + if (!ret)
> + max_pasids = (1UL << max_pasids);
> + }
> + param->pasids = max_pasids;
> +

we could also do a min() with the IOMMU PASID size here

>   mutex_init(>lock);
>   dev->iommu = param;
>   return param;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 45f274b2640d..d4296136ba75 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -371,6 +371,7 @@ struct iommu_fault_param {
>   * @fwspec:   IOMMU fwspec data
>   * @iommu_dev:IOMMU device this device is linked to
>   * @priv: IOMMU Driver private data
> + * @pasids:   number of supported PASIDs

'max_pasids' to stay consistent?

Thanks,
Jean

>   *
>   * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
>   *   struct iommu_group  *iommu_group;
> @@ -382,6 +383,7 @@ struct dev_iommu {
>   struct iommu_fwspec *fwspec;
>   struct iommu_device *iommu_dev;
>   void*priv;
> + u32 pasids;
>  };
> 
>  int iommu_device_register(struct iommu_device *iommu,
> -- 
> 2.25.1
> 
> Best regards,
> baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] iommu: Add blocking_domain_ops field in iommu_ops

2022-05-20 Thread Baolu Lu

On 2022/5/20 16:45, Joerg Roedel wrote:

On Mon, May 16, 2022 at 09:57:56AM +0800, Lu Baolu wrote:

const struct iommu_domain_ops *default_domain_ops;
+   const struct iommu_domain_ops *blocking_domain_ops;


I don't understand why extra domain-ops are needed for this. I think it
would be more straight-forward to implement allocation of
IOMMU_DOMAIN_BLOCKED domains in each driver and let the details be
handled in the set_dev() call-back. The IOMMU driver can make sure DMA
is blocked for a device when it encounters a IOMMU_DOMAIN_BLOCKED
domain.

For IOMMUs that have no explicit way to block DMA could just use an
unmanaged domain with an empty page-table.


Yes, this is what will go.

Best regards,
baolu

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU

2022-05-20 Thread Robin Murphy

On 2022-05-10 17:55, Jason Gunthorpe via iommu wrote:

This control causes the ARM SMMU drivers to choose a stage 2
implementation for the IO pagetable (vs the stage 1 usual default),
however this choice has no visible impact to the VFIO user.


Oh, I should have read more carefully... this isn't entirely true. Stage 
2 has a different permission model from stage 1, so although it's 
arguably undocumented behaviour, VFIO users that know enough about the 
underlying system could use this to get write-only mappings if they so wish.


There may potentially also be visible differences in translation 
performance between stages, although I imagine that's firmly over in the 
niche of things that users might look at for system validation purposes, 
rather than for practical usefulness.



Further qemu
never implemented this and no other userspace user is known.

The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add
new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
SMMU translation services to the guest operating system" however the rest
of the API to set the guest table pointer for the stage 1 was never
completed, or at least never upstreamed, rendering this part useless dead
code.

Since the current patches to enable nested translation, aka userspace page
tables, rely on iommufd and will not use the enable_nesting()
iommu_domain_op, remove this infrastructure. However, don't cut too deep
into the SMMU drivers for now expecting the iommufd work to pick it up -
we still need to create S2 IO page tables.

Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
enable_nesting iommu_domain_op.

Just in-case there is some userspace using this continue to treat
requesting it as a NOP, but do not advertise support any more.


This also seems a bit odd, especially given that it's not actually a 
no-op; surely either it's supported and functional or it isn't?


In all honesty, I'm not personally attached to this code either way. If 
this patch had come 5 years ago, when the interface already looked like 
a bit of a dead end, I'd probably have agreed more readily. But now, 
when we're possibly mere months away from implementing the functional 
equivalent for IOMMUFD, which if done right might be able to support a 
trivial compat layer for this anyway, I just don't see what we gain from 
not at least waiting to see where that ends up. The given justification 
reads as "get rid of this code that we already know we'll need to bring 
back in some form, and half-break an unpopular VFIO ABI because it 
doesn't do *everything* that its name might imply", which just isn't 
convincing me.


Thanks,
Robin.


Signed-off-by: Jason Gunthorpe 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 16 
  drivers/iommu/iommu.c   | 10 --
  drivers/vfio/vfio_iommu_type1.c | 12 +---
  include/linux/iommu.h   |  3 ---
  include/uapi/linux/vfio.h   |  2 +-
  6 files changed, 2 insertions(+), 57 deletions(-)

It would probably make sense for this to go through the VFIO tree with Robin's
ack for the SMMU changes.

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 627a3ed5ee8fd1..b901e8973bb4ea 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2724,21 +2724,6 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
  }
  
-static int arm_smmu_enable_nesting(struct iommu_domain *domain)

-{
-   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-   int ret = 0;
-
-   mutex_lock(_domain->init_mutex);
-   if (smmu_domain->smmu)
-   ret = -EPERM;
-   else
-   smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
-   mutex_unlock(_domain->init_mutex);
-
-   return ret;
-}
-
  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
  {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2865,7 +2850,6 @@ static struct iommu_ops arm_smmu_ops = {
.flush_iotlb_all= arm_smmu_flush_iotlb_all,
.iotlb_sync = arm_smmu_iotlb_sync,
.iova_to_phys   = arm_smmu_iova_to_phys,
-   .enable_nesting = arm_smmu_enable_nesting,
.free   = arm_smmu_domain_free,
}
  };
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 568cce590ccc13..239e6f6585b48d 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1507,21 +1507,6 @@ static struct iommu_group *arm_smmu_device_group(struct 
device *dev)
return group;
  }
  
-static int arm_smmu_enable_nesting(struct iommu_domain *domain)

-{
-   

Re: [PATCH] iommu/s390: tolerate repeat attach_dev calls

2022-05-20 Thread Niklas Schnelle
On Thu, 2022-05-19 at 14:29 -0400, Matthew Rosato wrote:
> Since commit 0286300e6045 ("iommu: iommu_group_claim_dma_owner() must
> always assign a domain") s390-iommu will get called to allocate multiple
> unmanaged iommu domains for a vfio-pci device -- however the current
> s390-iommu logic tolerates only one.  Recognize that multiple domains can
> be allocated and handle switching between DMA or different iommu domain
> tables during attach_dev.
> 
> Signed-off-by: Matthew Rosato 
> ---

I know it's applied already and no need to add my R-b but:

Reviewed-by: Niklas Schnelle 

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] iommu/amd: Increase timeout waiting for GA log enablement

2022-05-20 Thread Joerg Roedel
From: Joerg Roedel 

On some systems it can take a long time for the hardware to enable the
GA log of the AMD IOMMU. The current wait time is only 0.1ms, but
testing showed that it can take up to 14ms for the GA log to enter
running state after it has been enabled.

Sometimes the long delay happens when booting the system, sometimes
only on resume. Adjust the timeout accordingly to not print a warning
when hardware takes a longer than usual.

There has already been an attempt to fix this with commit

9b45a7738eec ("iommu/amd: Fix loop timeout issue in 
iommu_ga_log_enable()")

But that commit was based on some wrong math and did not fix the issue
in all cases.

Cc: "D. Ziegfeld" 
Cc: Jörg-Volker Peetz 
Fixes: 8bda0cfbdc1a ("iommu/amd: Detect and initialize guest vAPIC log")
Signed-off-by: Joerg Roedel 
---
 drivers/iommu/amd/init.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index b4a798c7b347..d8060503ba51 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -84,7 +84,7 @@
 #define ACPI_DEVFLAG_LINT1  0x80
 #define ACPI_DEVFLAG_ATSDIS 0x1000
 
-#define LOOP_TIMEOUT   10
+#define LOOP_TIMEOUT   200
 /*
  * ACPI table definitions
  *
-- 
2.36.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v3 00/35] iommu/amd: Add multiple PCI segments support

2022-05-20 Thread Joerg Roedel
Hi Vasant,

On Fri, May 20, 2022 at 03:25:38PM +0530, Vasant Hegde wrote:
> Ping. Did you get a chance to look into this series?

Sorry, too late for this round. The changes are pretty invasive and
merging them at -rc7 stage would not give them enough testing before
being merged. Please send me a reminder after the next merge window.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 00/35] iommu/amd: Add multiple PCI segments support

2022-05-20 Thread Vasant Hegde via iommu
Hello Joerg,



On 5/11/2022 12:51 PM, Vasant Hegde wrote:
> Newer AMD systems can support multiple PCI segments, where each segment
> contains one or more IOMMU instances. However, an IOMMU instance can only
> support a single PCI segment.

Ping. Did you get a chance to look into this series? 

-Vasant


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] iommu/amd: Set translation valid bit only when IO page tables are in used

2022-05-20 Thread Robin Murphy

On 2022-05-20 09:58, Joerg Roedel wrote:

On Fri, May 20, 2022 at 09:54:51AM +0100, Robin Murphy wrote:

The .def_domain type op already allows drivers to do exactly this sort of
override. You could also conditionally reject IOMMU_DOMAIN_PASSTHROUGH in
.domain_alloc for good measure, provided that (for now at least*) SNP is a
global thing rather than per-instance.


Yeah, that could work. I am just not sure the IOMMU core behaves well in
all situations when allocation IOMMU_DOMAIN_PASSTHROUGH suddenly starts
to fail. I would feel better if this is checked and tested :)


Well, iommu_group_alloc_default_domain() has the fallback and is 
currently the only place that __iommu_domain_alloc() can be called with 
a type other than IOMMU_DOMAIN_UNMANAGED, so by inspection it should be 
fine. However if iommu_get_def_domain_type() says the right thing then 
neither sysfs nor automatic default domains should get as far as even 
trying to allocate an identity domain anyway - note that that's already 
what happens for untrusted external devices. But either way should be 
easy enough to verify with a quick hack, too.


Cheers,
Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] iommu/amd: Set translation valid bit only when IO page tables are in used

2022-05-20 Thread Joerg Roedel
On Fri, May 20, 2022 at 09:54:51AM +0100, Robin Murphy wrote:
> The .def_domain type op already allows drivers to do exactly this sort of
> override. You could also conditionally reject IOMMU_DOMAIN_PASSTHROUGH in
> .domain_alloc for good measure, provided that (for now at least*) SNP is a
> global thing rather than per-instance.

Yeah, that could work. I am just not sure the IOMMU core behaves well in
all situations when allocation IOMMU_DOMAIN_PASSTHROUGH suddenly starts
to fail. I would feel better if this is checked and tested :)

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] iommu/amd: Set translation valid bit only when IO page tables are in used

2022-05-20 Thread Robin Murphy

On 2022-05-20 09:09, Joerg Roedel wrote:

Hi Suravee,

On Mon, May 16, 2022 at 07:27:51PM +0700, Suravee Suthikulpanit wrote:

Due to the new restriction (please see the IOMMU spec Rev 3.06-PUB - Apr 2021
https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf) where the use of
DTE[Mode]=0 is not supported on systems that are SNP-enabled (i.e. 
EFR[SNPSup]=1),
the IOMMU HW looks at the DTE[TV] bit to determine if it needs to handle the v1 
page table.
When the HW encounters DTE entry with TV=1, V=1, Mode=0, it would generate
ILLEGAL_DEV_TABLE_ENTRY event.


Ah, that is the part I was missing, thanks.


- I am still trying to see what is the best way to force Linux to not allow
Mode=0 (i.e. iommu=pt mode). Any thoughts?


I think this needs a general approach. First start in the AMD IOMMU
driver:

1) Do not set DTE.TV=1 bit iff SNP-Support is enabled
2) Fail to allocate passthrough domains when SNP support is
   enabled.

Then test how the IOMMU core layer handles that. In fact the IOMMU layer
needs to adjust its decisions so that:

1) It uses translated-mode by default
2) passthrough domains are disallowed and can not be chosen, not
   on the kernel command line and not at runtime.

Ideally this needs some kind of arch-callback to set the appropriate
defaults.


The .def_domain type op already allows drivers to do exactly this sort 
of override. You could also conditionally reject 
IOMMU_DOMAIN_PASSTHROUGH in .domain_alloc for good measure, provided 
that (for now at least*) SNP is a global thing rather than per-instance.


Cheers,
Robin.

*Instance-aware .domain_alloc probably about 2 releases away at the 
current pace of landing the tree-wide prep ;)



- Also, it seems that the current iommu v2 page table use case, where 
GVA->GPA=SPA
will no longer be supported on system w/ SNPSup=1. Any thoughts?


Support for that is not upstream yet, it should be easy to disallow this
configuration and just use the v1 page-tables when SNP is active. This
can be handled entirely inside the AMD IOMMU driver.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()

2022-05-20 Thread Joerg Roedel
On Wed, May 18, 2022 at 03:13:53PM +0200, Christoph Hellwig wrote:
> On Tue, May 17, 2022 at 01:02:00PM +0100, Robin Murphy wrote:
> >> So how to inform the SCSI driver of this caching limit then so that it may 
> >> limit the SGL length?
> >
> > Driver-specific mechanism; block-layer-specific mechanism; redefine this 
> > whole API to something like dma_opt_mapping_size(), as a limit above which 
> > mappings might become less efficient or start to fail (callback to my 
> > thoughts on [1] as well, I suppose); many options. Just not imposing a 
> > ridiculously low *maximum* on everyone wherein mapping calls "should not be 
> > larger than the returned value" when that's clearly bollocks.
> 
> Well, for swiotlb it is a hard limit.  So if we want to go down that
> route we need two APIs, one for the optimal size and one for the
> hard limit.

I agree with Robin, and if it really helps some drivers I am all for
doing a dma_opt_mapping_size() instead. Limiting DMA mapping sizes to
make drivers perform better gets a clear NAK from my side.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/5] iommu: Add blocking_domain_ops field in iommu_ops

2022-05-20 Thread Joerg Roedel
On Mon, May 16, 2022 at 09:57:56AM +0800, Lu Baolu wrote:
>   const struct iommu_domain_ops *default_domain_ops;
> + const struct iommu_domain_ops *blocking_domain_ops;

I don't understand why extra domain-ops are needed for this. I think it
would be more straight-forward to implement allocation of
IOMMU_DOMAIN_BLOCKED domains in each driver and let the details be
handled in the set_dev() call-back. The IOMMU driver can make sure DMA
is blocked for a device when it encounters a IOMMU_DOMAIN_BLOCKED
domain.

For IOMMUs that have no explicit way to block DMA could just use an
unmanaged domain with an empty page-table.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 4/4] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

2022-05-20 Thread John Garry via iommu
ATA devices (struct ata_device) have a max_sectors field which is
configured internally in libata. This is then used to (re)configure the
associated sdev request queue max_sectors value from how it is earlier set
in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set
according to shost limits, which includes host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect
this shost limit.

Signed-off-by: John Garry 
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 06c9d90238d9..25fe89791641 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1036,6 +1036,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct 
ata_device *dev)
dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
/* configure max sectors */
+   dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
blk_queue_max_hw_sectors(q, dev->max_sectors);
 
if (dev->class == ATA_DEV_ATAPI) {
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits

2022-05-20 Thread John Garry via iommu
Streaming DMA mappings may be considerably slower when mappings go through
an IOMMU and the total mapping length is somewhat long. This is because the
IOMMU IOVA code allocates and free an IOVA for each mapping, which may
affect performance.

For performance reasons set the request_queue max_sectors from
dma_opt_mapping_size(), which knows this mapping limit.

In addition, the shost->max_sectors is repeatedly set for each sdev in
__scsi_init_queue(). This is unnecessary, so set once when adding the
host.

Signed-off-by: John Garry 
---
 drivers/scsi/hosts.c| 5 +
 drivers/scsi/scsi_lib.c | 4 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..a3ae6345473b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -225,6 +225,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct 
device *dev,
shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
   shost->can_queue);
 
+   if (dma_dev->dma_mask) {
+   shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+   dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+   }
+
error = scsi_init_sense_cache(shost);
if (error)
goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..2d43bb8799bd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct 
request_queue *q)
blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
}
 
-   if (dev->dma_mask) {
-   shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-   dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-   }
blk_queue_max_hw_sectors(q, shost->max_sectors);
blk_queue_segment_boundary(q, shost->dma_boundary);
dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()

2022-05-20 Thread John Garry via iommu
Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which
allows the drivers to know the optimal mapping limit and thus limit the
requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated
above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry 
---
 drivers/iommu/dma-iommu.c | 6 ++
 drivers/iommu/iova.c  | 5 +
 include/linux/iova.h  | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 09f6e1c0f9c0..f619e41b9172 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1442,6 +1442,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct 
device *dev)
return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+   return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
.alloc  = iommu_dma_alloc,
.free   = iommu_dma_free,
@@ -1462,6 +1467,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.map_resource   = iommu_dma_map_resource,
.unmap_resource = iommu_dma_unmap_resource,
.get_merge_boundary = iommu_dma_get_merge_boundary,
+   .opt_mapping_size   = iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain 
*iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+   return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain 
*iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 0/4] DMA mapping changes for SCSI core

2022-05-20 Thread John Garry via iommu
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching
limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so
that drivers may know this limit when performance is a factor in the
mapping.

Robin didn't like using dma_max_mapping_size() for this [1]. An
alternative to adding the new API would be to add a "hard_limit" arg
to dma_max_mapping_size(). This would mean fixing up all current users,
but it would be good to do that anyway as not all users require a hard
limit.

The SCSI core coded is modified to use this limit.

I also added a patch for libata-scsi as it does not currently honour the
shost max_sectors limit.

[0] 
https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] 
https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

John Garry (4):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA optimum mapping
limits
  libata-scsi: Cap ata_device->max_sectors according to
shost->max_sectors

 Documentation/core-api/dma-api.rst |  9 +
 drivers/ata/libata-scsi.c  |  1 +
 drivers/iommu/dma-iommu.c  |  6 ++
 drivers/iommu/iova.c   |  5 +
 drivers/scsi/hosts.c   |  5 +
 drivers/scsi/scsi_lib.c|  4 
 include/linux/dma-map-ops.h|  1 +
 include/linux/dma-mapping.h|  5 +
 include/linux/iova.h   |  2 ++
 kernel/dma/mapping.c   | 12 
 10 files changed, 46 insertions(+), 4 deletions(-)

-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 1/4] dma-mapping: Add dma_opt_mapping_size()

2022-05-20 Thread John Garry via iommu
Streaming DMA mapping involving an IOMMU may be much slower for larger
total mapping size. This is because every IOMMU DMA mapping requires an
IOVA to be allocated and freed. IOVA sizes above a certain limit are not
cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that
they may try to produce mapping which don't exceed it.

Signed-off-by: John Garry 
---
 Documentation/core-api/dma-api.rst |  9 +
 include/linux/dma-map-ops.h|  1 +
 include/linux/dma-mapping.h|  5 +
 kernel/dma/mapping.c   | 12 
 4 files changed, 27 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst 
b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..b3cd9763d28b 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,15 @@ Returns the maximum size of a mapping for the device. The 
size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and
 others should not be larger than the returned value.
 
+::
+
+   size_t
+   dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.
+
 ::
 
bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
int (*dma_supported)(struct device *dev, u64 mask);
u64 (*get_required_mask)(struct device *dev);
size_t (*max_mapping_size)(struct device *dev);
+   size_t (*opt_mapping_size)(void);
unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device 
*dev)
 {
return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+   return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+   size_t size = SIZE_MAX;
+
+   if (ops && ops->opt_mapping_size)
+   size = ops->opt_mapping_size();
+
+   return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2] iommu/amd: Set translation valid bit only when IO page tables are in used

2022-05-20 Thread Joerg Roedel
Hi Suravee,

On Mon, May 16, 2022 at 07:27:51PM +0700, Suravee Suthikulpanit wrote:
> Due to the new restriction (please see the IOMMU spec Rev 3.06-PUB - Apr 2021
> https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf) where the use of
> DTE[Mode]=0 is not supported on systems that are SNP-enabled (i.e. 
> EFR[SNPSup]=1),
> the IOMMU HW looks at the DTE[TV] bit to determine if it needs to handle the 
> v1 page table.
> When the HW encounters DTE entry with TV=1, V=1, Mode=0, it would generate
> ILLEGAL_DEV_TABLE_ENTRY event.

Ah, that is the part I was missing, thanks.

> - I am still trying to see what is the best way to force Linux to not allow
> Mode=0 (i.e. iommu=pt mode). Any thoughts?

I think this needs a general approach. First start in the AMD IOMMU
driver:

1) Do not set DTE.TV=1 bit iff SNP-Support is enabled
2) Fail to allocate passthrough domains when SNP support is
   enabled.

Then test how the IOMMU core layer handles that. In fact the IOMMU layer
needs to adjust its decisions so that:

1) It uses translated-mode by default
2) passthrough domains are disallowed and can not be chosen, not
   on the kernel command line and not at runtime.

Ideally this needs some kind of arch-callback to set the appropriate
defaults.

> - Also, it seems that the current iommu v2 page table use case, where 
> GVA->GPA=SPA
> will no longer be supported on system w/ SNPSup=1. Any thoughts?

Support for that is not upstream yet, it should be easy to disallow this
configuration and just use the v1 page-tables when SNP is active. This
can be handled entirely inside the AMD IOMMU driver.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/s390: tolerate repeat attach_dev calls

2022-05-20 Thread Joerg Roedel
On Thu, May 19, 2022 at 02:29:29PM -0400, Matthew Rosato wrote:
> Since commit 0286300e6045 ("iommu: iommu_group_claim_dma_owner() must
> always assign a domain") s390-iommu will get called to allocate multiple
> unmanaged iommu domains for a vfio-pci device -- however the current
> s390-iommu logic tolerates only one.  Recognize that multiple domains can
> be allocated and handle switching between DMA or different iommu domain
> tables during attach_dev.
> 
> Signed-off-by: Matthew Rosato 
> ---
>  drivers/iommu/s390-iommu.c | 15 ---
>  1 file changed, 12 insertions(+), 3 deletions(-)

Applied to the vfio-notifier-fix topic branch, thanks.

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU

2022-05-20 Thread Joerg Roedel
On Tue, May 17, 2022 at 02:26:56PM -0600, Alex Williamson wrote:
> I'd be in favor of applying this, but it seems Robin and Eric are
> looking for a stay of execution and I'd also be looking for an ack from
> Joerg.  Thanks,

This is mainly an ARM-SMMU thing, so I defer my ack to Will and Robin.

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v7 06/10] iommu/sva: Refactoring iommu_sva_bind/unbind_device()

2022-05-20 Thread Baolu Lu

On 2022/5/20 00:39, Jean-Philippe Brucker wrote:

+struct iommu_sva *iommu_sva_bind_device(struct device *dev, struct mm_struct 
*mm)
+{
+   struct iommu_sva_domain *sva_domain;
+   struct iommu_domain *domain;
+   ioasid_t max_pasid = 0;
+   int ret = -EINVAL;
+
+   /* Allocate mm->pasid if necessary. */
+   if (!dev->iommu->iommu_dev->pasids)
+   return ERR_PTR(-EOPNOTSUPP);
+
+   if (dev_is_pci(dev)) {
+   max_pasid = pci_max_pasids(to_pci_dev(dev));
+   if (max_pasid < 0)
+   return ERR_PTR(max_pasid);
+   } else {
+   ret = device_property_read_u32(dev, "pasid-num-bits",
+  _pasid);
+   if (ret)
+   return ERR_PTR(ret);
+   max_pasid = (1UL << max_pasid);
+   }

The IOMMU driver needs this PASID width information earlier, when creating
the PASID table (in .probe_device(), .attach_dev()). Since we're moving it
to the IOMMU core to avoid code duplication, it should be done earlier and
stored in dev->iommu


Yes, really. How about below changes?

From f1382579e8a15ca49acdf758d38fd36451ea174d Mon Sep 17 00:00:00 2001
From: Lu Baolu 
Date: Mon, 28 Feb 2022 15:01:35 +0800
Subject: [PATCH 1/1] iommu: Add pasids field in struct dev_iommu

Use this field to save the number of PASIDs that a device is able to
consume. It is a generic attribute of a device and lifting it into the
per-device dev_iommu struct could help to avoid the boilerplate code
in various IOMMU drivers.

Signed-off-by: Lu Baolu 
---
 drivers/iommu/iommu.c | 15 +++
 include/linux/iommu.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index e49c5a5b8cc1..6b731171d42f 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -194,6 +195,8 @@ EXPORT_SYMBOL_GPL(iommu_device_unregister);
 static struct dev_iommu *dev_iommu_get(struct device *dev)
 {
struct dev_iommu *param = dev->iommu;
+   u32 max_pasids = 0;
+   int ret;

if (param)
return param;
@@ -202,6 +205,18 @@ static struct dev_iommu *dev_iommu_get(struct 
device *dev)

if (!param)
return NULL;

+   if (dev_is_pci(dev)) {
+   ret = pci_max_pasids(to_pci_dev(dev));
+   if (ret > 0)
+   max_pasids = ret;
+   } else {
+   ret = device_property_read_u32(dev, "pasid-num-bits",
+  _pasids);
+   if (!ret)
+   max_pasids = (1UL << max_pasids);
+   }
+   param->pasids = max_pasids;
+
mutex_init(>lock);
dev->iommu = param;
return param;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 45f274b2640d..d4296136ba75 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -371,6 +371,7 @@ struct iommu_fault_param {
  * @fwspec: IOMMU fwspec data
  * @iommu_dev:  IOMMU device this device is linked to
  * @priv:   IOMMU Driver private data
+ * @pasids: number of supported PASIDs
  *
  * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
  * struct iommu_group  *iommu_group;
@@ -382,6 +383,7 @@ struct dev_iommu {
struct iommu_fwspec *fwspec;
struct iommu_device *iommu_dev;
void*priv;
+   u32 pasids;
 };

 int iommu_device_register(struct iommu_device *iommu,
--
2.25.1

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu