Re: [PATCH v5 0/5] DMA mapping changes for SCSI core
On 07/07/2022 21:35, Martin K. Petersen wrote:

Christoph, Yes, I've mostly been waiting for an ACK from Martin.

Sorry, I'm on vacation this week. The series looks OK to me although I do agree that it would be great if the max was reflected in the queue's hard limit and opt in the soft limit.

Ah, I think that I misunderstood Damien's question. I thought he was asking why not keep shost max_sectors at dma_max_mapping_size() and then init each sdev request queue max hw sectors at dma_opt_mapping_size(). But it seems that you want to know why not have the request queue max sectors at dma_opt_mapping_size().

The answer is related to the meaning of dma_opt_mapping_size(). If we get any mappings which exceed this size then it can have a big DMA mapping performance hit. So I set max hw sectors at this 'opt' mapping size to ensure that we get no mappings which exceed this size. Indeed, I think max sectors is 128KB today for my host, which would be the same as the dma_opt_mapping_size() value with an IOMMU enabled. And I find that only a small percentage of requests exceed this 128KB size, but it still has a big performance impact.

Acked-by: Martin K. Petersen

Thanks, John

___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 0/5] DMA mapping changes for SCSI core
On 30/06/2022 13:08, John Garry wrote:

Hi Christoph, Can you please consider picking up this series? A few things to note beforehand:
- I changed to only apply the mapping limit to SAS hosts in this version. I would need a fresh ack from Martin for those SCSI parts, but wanted to make sure you were ok with it.
- Damien had some doubt on updating the shost max_sectors as opposed to the per-request queue default, but I think he's ok with it - see patch 4/5

Thanks, John

As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit. This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping.

Only the SCSI SAS transport code is modified to use this limit. For now I did not want to touch other hosts as I have a concern that this change could cause a performance regression.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Changes since v4:
- tweak libata and other patch titles
- Add Robin's tag (thanks!)
- Clarify description of new DMA mapping API
Re: [PATCH RESEND v5 1/5] iommu: Refactor iommu_group_store_type()
On 06/07/2022 13:00, Will Deacon wrote: On Mon, Apr 04, 2022 at 07:27:10PM +0800, John Garry wrote: Function iommu_group_store_type() supports changing the default domain of an IOMMU group. Many conditions need to be satisfied and steps taken for this action to be successful. Satisfying these conditions and steps will be required for setting other IOMMU group attributes, so factor into a common part and a part specific to update the IOMMU group attribute. No functional change intended. Some code comments are tidied up also. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 96 --- 1 file changed, 62 insertions(+), 34 deletions(-) Acked-by: Will Deacon Thanks, but currently I have no plans to progress this series, in favour of this https://lore.kernel.org/linux-iommu/1656590892-42307-1-git-send-email-john.ga...@huawei.com/T/#me0e806913050c95f6e6ba2c7f7d96d51ce191204 cheers
Re: [PATCH] iommu/iova: change IOVA_MAG_SIZE to 127 to save memory
On 01/07/2022 04:56, Feng Tang wrote:

inclination. ok, what you are saying sounds reasonable. I just remember that when we analyzed the longterm aging issue we concluded that the FQ size and its relation to the magazine size was a factor, and this change makes me a little worried about new issues. Better the devil you know and all that... Anyway, if I get some time I might do some testing to see if this change has any influence.

Another thought is whether we even need to store the size in the iova_magazine. Mags in the depot are always full. As such, we only need worry about mags loaded in the cpu rcache and their sizes, so maybe we could have something like this:

struct iova_magazine {
-       unsigned long size;
        unsigned long pfns[IOVA_MAG_SIZE];
};
@@ -631,6 +630,8 @@ struct iova_cpu_rcache {
        spinlock_t lock;
        struct iova_magazine *loaded;
        struct iova_magazine *prev;
+       int loaded_size;
+       int prev_size;
};

I haven't tried to implement it though.. I have very little knowledge of iova, so you can choose what's the better solution. I just wanted to raise the problem and will be happy to see it solved:)

I quickly tested your patch for performance and saw no noticeable difference, which is no surprise. But I'll defer to Robin if he thinks that your patch is a better solution - I would guess that he does. For me personally I would prefer that this value was not changed, as I mentioned before.

thanks, John
Re: [PATCH v5 4/5] scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal limit
On 01/07/2022 00:49, Damien Le Moal wrote:

+	if (dma_dev) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}

Hi Damien,

> Hmm... shost->max_sectors becomes the max_hw_sectors limit for the block
> dev. So using dma_max_mapping_size(dma_dev) for that limit makes sense.
> Shouldn't dma_opt_mapping_size(dma_dev) be used to limit only the default
> "soft" limit (queue max_sectors limit) instead of the hard limit ?

Sure, it would be sensible to use dma_opt_mapping_size() to limit the default queue max sectors limit, while dma_max_mapping_size() limits the host max sectors. But I didn't see in practice how limiting the shost max sectors to dma_opt_mapping_size() makes a difference:
- block queue max_hw_sectors_kb file is read-only, so we cannot change the queue max sectors from there
- And no SAS driver actually tries to modify upwards from the default. I do note the USB storage driver as an example of a SCSI driver which does (modify from shost max sectors): see scsiglue.c::slave_configure()

Finally, there is no common method to limit the default request queue max sectors for those SAS drivers - I would need to add this limit in each of their slave_configure callbacks, and I didn't think that it's worth it.

Thanks, John
Re: [PATCH v5 3/5] scsi: core: Cap shost max_sectors according to DMA limits only once
On 01/07/2022 00:41, Damien Le Moal wrote:

 	shost->dma_dev = dma_dev;
 
+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_max_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}

Nit: you could remove the curly brackets... But it being a multi-line statement, having them is OK too I think.

tglx seems to think that they are ok, and I generally agree (now): https://lore.kernel.org/linux-arm-kernel/877djwdorz@nanos.tec.linutronix.de/

AFAICT coding-style.rst is ok with them in this scenario too.

Cheers, John
[PATCH v5 5/5] ata: libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value from how it is earlier set in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which includes host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
Acked-by: Damien Le Moal
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 86dbb1cdfabd..24a43d540d9f 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1060,6 +1060,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
 		dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
 	/* configure max sectors */
+	dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
 	blk_queue_max_hw_sectors(q, dev->max_sectors);
 
 	if (dev->class == ATA_DEV_ATAPI) {
-- 
2.35.3
[PATCH v5 4/5] scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal limit
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the request queue max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

Signed-off-by: John Garry
---
 drivers/scsi/scsi_transport_sas.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 12bff64dade6..1b45248748e0 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -225,6 +225,7 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 {
 	struct Scsi_Host *shost = dev_to_shost(dev);
 	struct sas_host_attrs *sas_host = to_sas_host_attrs(shost);
+	struct device *dma_dev = shost->dma_dev;
 
 	INIT_LIST_HEAD(&sas_host->rphy_list);
 	mutex_init(&sas_host->lock);
@@ -236,6 +237,11 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 		dev_printk(KERN_ERR, dev,
 			   "fail to a bsg device %d\n", shost->host_no);
 
+	if (dma_dev) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	return 0;
 }
-- 
2.35.3
[PATCH v5 3/5] scsi: core: Cap shost max_sectors according to DMA limits only once
The shost->max_sectors is repeatedly capped according to the host DMA mapping limit for each sdev in __scsi_init_queue(). This is unnecessary, so set only once when adding the host.

Signed-off-by: John Garry
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 8352f90d997d..d04bd2c7c9f1 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -236,6 +236,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	shost->dma_dev = dma_dev;
 
+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_max_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	error = scsi_mq_setup_tags(shost);
 	if (error)
 		goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6ffc9e4258a8..6ce8acea322a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 		blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
 	}
 
-	if (dev->dma_mask) {
-		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-				dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-	}
 	blk_queue_max_hw_sectors(q, shost->max_sectors);
 	blk_queue_segment_boundary(q, shost->dma_boundary);
 	dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.35.3
[PATCH v5 2/5] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which allows the drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
Acked-by: Robin Murphy
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+	return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
@@ -1479,6 +1484,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.35.3
[PATCH v5 1/5] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mapping involving an IOMMU may be much slower for larger total mapping size. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 Documentation/core-api/dma-api.rst | 14 ++++++++++++++
 include/linux/dma-map-ops.h        |  1 +
 include/linux/dma-mapping.h        |  5 +++++
 kernel/dma/mapping.c               | 12 ++++++++++++
 4 files changed, 32 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..829f20a193ca 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,20 @@ Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and
 others should not be larger than the returned value.
 
+::
+
+	size_t
+	dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device.
+
+Mapping larger buffers may take much longer in certain scenarios. In
+addition, for high-rate short-lived streaming mappings, the upfront time
+spent on the mapping may account for an appreciable part of the total
+request lifetime. As such, if splitting larger requests incurs no
+significant performance penalty, then device drivers are advised to
+limit total DMA streaming mappings length to the returned value.
+
 ::
 
 	bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
 	int (*dma_supported)(struct device *dev, u64 mask);
 	u64 (*get_required_mask)(struct device *dev);
 	size_t (*max_mapping_size)(struct device *dev);
+	size_t (*opt_mapping_size)(void);
 	unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
 	return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+	return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+	size_t size = SIZE_MAX;
+
+	if (ops && ops->opt_mapping_size)
+		size = ops->opt_mapping_size();
+
+	return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.35.3
[PATCH v5 0/5] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit. This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping.

Only the SCSI SAS transport code is modified to use this limit. For now I did not want to touch other hosts as I have a concern that this change could cause a performance regression.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Changes since v4:
- tweak libata and other patch titles
- Add Robin's tag (thanks!)
- Clarify description of new DMA mapping API

Changes since v3:
- Apply max DMA optimal limit to SAS hosts only
  Note: Even though "scsi: core: Cap shost max_sectors only once when adding" is a subset of a previous patch I did not transfer the RB tags
- Rebase on v5.19-rc4

John Garry (5):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA limits only once
  scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal limit
  ata: libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst | 14 ++++++++++++++
 drivers/ata/libata-scsi.c          |  1 +
 drivers/iommu/dma-iommu.c          |  6 ++++++
 drivers/iommu/iova.c               |  5 +++++
 drivers/scsi/hosts.c               |  5 +++++
 drivers/scsi/scsi_lib.c            |  4 ----
 drivers/scsi/scsi_transport_sas.c  |  6 ++++++
 include/linux/dma-map-ops.h        |  1 +
 include/linux/dma-mapping.h        |  5 +++++
 include/linux/iova.h               |  2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 11 files changed, 57 insertions(+), 4 deletions(-)
-- 
2.35.3
Re: [PATCH] iommu/iova: change IOVA_MAG_SIZE to 127 to save memory
[    4.319253] iommu: Adding device :06:00.2 to group 5
[    4.325869] iommu: Adding device :20:01.0 to group 15
[    4.332648] iommu: Adding device :20:02.0 to group 16
[    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
[    4.350251] swapper/0 cpuset=/ mems_allowed=0
[    4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.57.mx64.282 #1
[    4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.9.3 06/25/2019
[    4.355612] Call Trace:
[    4.355612]  dump_stack+0x46/0x5b
[    4.355612]  dump_header+0x6b/0x289
[    4.355612]  out_of_memory+0x470/0x4c0
[    4.355612]  __alloc_pages_nodemask+0x970/0x1030
[    4.355612]  cache_grow_begin+0x7d/0x520
[    4.355612]  fallback_alloc+0x148/0x200
[    4.355612]  kmem_cache_alloc_trace+0xac/0x1f0
[    4.355612]  init_iova_domain+0x112/0x170

Note for Feng Tang: This callchain does not exist anymore since we separated out the rcache init from the IOVA domain init. Indeed, not so much memory is wasted for unused rcaches now. My point really is that it would be nicer to see a modern callchain - but don't read that as me saying that the change in this patch is bad.

[    4.355612]  amd_iommu_domain_alloc+0x138/0x1a0
[    4.355612]  iommu_group_get_for_dev+0xc4/0x1a0
[    4.355612]  amd_iommu_add_device+0x13a/0x610
[    4.355612]  add_iommu_group+0x20/0x30
[    4.355612]  bus_for_each_dev+0x76/0xc0
[    4.355612]  bus_set_iommu+0xb6/0xf0
[    4.355612]  amd_iommu_init_api+0x112/0x132
[    4.355612]  state_next+0xfb1/0x1165
[    4.355612]  amd_iommu_init+0x1f/0x67
[    4.355612]  pci_iommu_init+0x16/0x3f
...
[    4.670295] Unreclaimable slab info:
...
[    4.857565] kmalloc-2048           59164KB      59164KB

Change IOVA_MAG_SIZE from 128 to 127 to make the size of 'iova_magazine' 1024 bytes so that no memory will be wasted.

[1]. https://lkml.org/lkml/2019/8/12/266

Signed-off-by: Feng Tang
---
 drivers/iommu/iova.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145b..27634ddd9b904 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova);
  * dynamic size tuning described in the paper.
  */
 
-#define IOVA_MAG_SIZE 128
+/*
+ * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to
+ * assure size of 'iova_magzine' to be 1024 bytes, so that no memory

Typo: iova_magazine

+ * will be wasted.
+ */
+#define IOVA_MAG_SIZE 127

I do wonder if we will see some strange new behaviour since IOVA_FQ_SIZE % IOVA_MAG_SIZE != 0 now...

I doubt it - even if a flush queue does happen to be entirely full of equal-sized IOVAs, a CPU's loaded magazines also both being perfectly empty when it comes to dump a full fq seem further unlikely, so in practice I don't see this making any appreciable change to the likelihood of spilling back to the depot or not. In fact the smaller the magazines get, the less time would be spent flushing the depot back to the rbtree, where your interesting workload falls off the cliff and never catches back up with the fq timer, so at some point it might even improve (unless it's also already close to the point where smaller caches would bottleneck allocation)... might be interesting to experiment with a wider range of magazine sizes if you had the time and inclination.

ok, what you are saying sounds reasonable. I just remember that when we analyzed the longterm aging issue we concluded that the FQ size and its relation to the magazine size was a factor, and this change makes me a little worried about new issues. Better the devil you know and all that... Anyway, if I get some time I might do some testing to see if this change has any influence.

Another thought is whether we even need to store the size in the iova_magazine. Mags in the depot are always full. As such, we only need worry about mags loaded in the cpu rcache and their sizes, so maybe we could have something like this:

struct iova_magazine {
-       unsigned long size;
        unsigned long pfns[IOVA_MAG_SIZE];
};
@@ -631,6 +630,8 @@ struct iova_cpu_rcache {
        spinlock_t lock;
        struct iova_magazine *loaded;
        struct iova_magazine *prev;
+       int loaded_size;
+       int prev_size;
};

I haven't tried to implement it though..

Thanks, John
Re: [PATCH] iommu/iova: change IOVA_MAG_SIZE to 127 to save memory
On 30/06/2022 10:02, Robin Murphy wrote: On 2022-06-30 08:33, Feng Tang wrote:

kmalloc will round up the request size to power of 2, and current iova_magazine's size is 1032 (1024+8) bytes, so each instance allocated will get 2048 bytes from kmalloc, causing around 1KB waste. And in some extreme case, the memory wasted can trigger OOM as reported in 2019 on a crash kernel with 256 MB memory [1].

I don't think it really needs pointing out that excessive memory consumption can cause OOM. Especially not in the particularly silly context of a system with only 2MB of RAM per CPU - that's pretty much guaranteed to be doomed one way or another.

[    4.319253] iommu: Adding device :06:00.2 to group 5
[    4.325869] iommu: Adding device :20:01.0 to group 15
[    4.332648] iommu: Adding device :20:02.0 to group 16
[    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
[    4.350251] swapper/0 cpuset=/ mems_allowed=0
[    4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.57.mx64.282 #1
[    4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.9.3 06/25/2019
[    4.355612] Call Trace:
[    4.355612]  dump_stack+0x46/0x5b
[    4.355612]  dump_header+0x6b/0x289
[    4.355612]  out_of_memory+0x470/0x4c0
[    4.355612]  __alloc_pages_nodemask+0x970/0x1030
[    4.355612]  cache_grow_begin+0x7d/0x520
[    4.355612]  fallback_alloc+0x148/0x200
[    4.355612]  kmem_cache_alloc_trace+0xac/0x1f0
[    4.355612]  init_iova_domain+0x112/0x170
[    4.355612]  amd_iommu_domain_alloc+0x138/0x1a0
[    4.355612]  iommu_group_get_for_dev+0xc4/0x1a0
[    4.355612]  amd_iommu_add_device+0x13a/0x610
[    4.355612]  add_iommu_group+0x20/0x30
[    4.355612]  bus_for_each_dev+0x76/0xc0
[    4.355612]  bus_set_iommu+0xb6/0xf0
[    4.355612]  amd_iommu_init_api+0x112/0x132
[    4.355612]  state_next+0xfb1/0x1165
[    4.355612]  amd_iommu_init+0x1f/0x67
[    4.355612]  pci_iommu_init+0x16/0x3f
...
[    4.670295] Unreclaimable slab info:
...
[    4.857565] kmalloc-2048           59164KB      59164KB

Change IOVA_MAG_SIZE from 128 to 127 to make the size of 'iova_magazine' 1024 bytes so that no memory will be wasted.

[1]. https://lkml.org/lkml/2019/8/12/266

Signed-off-by: Feng Tang
---
 drivers/iommu/iova.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145b..27634ddd9b904 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova);
  * dynamic size tuning described in the paper.
  */
 
-#define IOVA_MAG_SIZE 128
+/*
+ * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to
+ * assure size of 'iova_magzine' to be 1024 bytes, so that no memory

Typo: iova_magazine

+ * will be wasted.
+ */
+#define IOVA_MAG_SIZE 127

I do wonder if we will see some strange new behaviour since IOVA_FQ_SIZE % IOVA_MAG_SIZE != 0 now... The change itself seems perfectly reasonable, though.

Acked-by: Robin Murphy
Re: [PATCH v4 1/5] dma-mapping: Add dma_opt_mapping_size()
On 28/06/2022 12:27, John Garry via iommu wrote: On 28/06/2022 12:23, Robin Murphy wrote:

+
+	size_t
+	dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.

Nit: I'm not sure "advised" is necessarily the right thing to say in general - that's only really true for a caller who cares about throughput of churning through short-lived mappings more than anything else, and doesn't take a significant hit overall from splitting up larger requests. I do think it's good to clarify the exact context of "optimal" here, but I'd prefer to be objectively clear that it's for workloads where the up-front mapping overhead dominates.

I'm going to go with something like this:

size_t
dma_opt_mapping_size(struct device *dev);

Returns the maximum optimal size of a mapping for the device.

Mapping larger buffers may take much longer in certain scenarios. In addition, for high-rate short-lived streaming mappings the upfront time spent on the mapping may account for an appreciable part of the total request lifetime. As such, if splitting larger requests incurs no significant performance penalty, then device drivers are advised to limit total DMA streaming mappings length to the returned value.

Let me know if you would like it further amended.

Cheers, John
Re: [PATCH v4 5/5] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
On 29/06/2022 06:58, Damien Le Moal wrote: On 6/29/22 14:40, Christoph Hellwig wrote: On Tue, Jun 28, 2022 at 12:33:58PM +0100, John Garry wrote: Well Christoph originally offered to take this series via the dma-mapping tree. @Christoph, is that still ok with you? If so, would you rather I send this libata patch separately? The offer still stands, and I don't really care where the libata patch is routed. Just tell me what you prefer. Cheers. If it is 100% independent from the other patches, I can take it. Otherwise, feel free to take it ! I'll just keep them all together - it's easier in case I need to change anything. Thanks!
Re: [PATCH v4 5/5] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
On 28/06/2022 10:14, Damien Le Moal wrote: BTW, this patch has no real dependency on the rest of the series, so could be taken separately if you prefer. Sure, you can send it separately. Adding it through the scsi tree is fine too. Well Christoph originally offered to take this series via the dma-mapping tree. @Christoph, is that still ok with you? If so, would you rather I send this libata patch separately? Thanks, John
Re: [PATCH v4 1/5] dma-mapping: Add dma_opt_mapping_size()
On 28/06/2022 12:23, Robin Murphy wrote:

+
+	size_t
+	dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.

Nit: I'm not sure "advised" is necessarily the right thing to say in general - that's only really true for a caller who cares about throughput of churning through short-lived mappings more than anything else, and doesn't take a significant hit overall from splitting up larger requests. I do think it's good to clarify the exact context of "optimal" here, but I'd prefer to be objectively clear that it's for workloads where the up-front mapping overhead dominates.

Ok, sure, I can make that clear.

Thanks, John
Re: [PATCH v4 5/5] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
On 28/06/2022 00:24, Damien Le Moal wrote: On 6/28/22 00:25, John Garry wrote: ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value from how it is earlier set in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which includes host DMA mapping limits. Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit. Signed-off-by: John Garry Acked-by: Damien Le Moal Nit: please change the patch title to "ata: libata-scsi: Cap ..." ok, but it's going to be an even longer title :) BTW, this patch has no real dependency on the rest of the series, so could be taken separately if you prefer. Thanks, John
[PATCH v4 5/5] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value, overriding what was earlier set in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which include host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
Acked-by: Damien Le Moal
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 86dbb1cdfabd..24a43d540d9f 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1060,6 +1060,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
 		dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
 	/* configure max sectors */
+	dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
 	blk_queue_max_hw_sectors(q, dev->max_sectors);
 
 	if (dev->class == ATA_DEV_ATAPI) {
-- 
2.35.3
[PATCH v4 4/5] scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal mapping limit
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the shost max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

Signed-off-by: John Garry
---
 drivers/scsi/scsi_transport_sas.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 12bff64dade6..1b45248748e0 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -225,6 +225,7 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 {
 	struct Scsi_Host *shost = dev_to_shost(dev);
 	struct sas_host_attrs *sas_host = to_sas_host_attrs(shost);
+	struct device *dma_dev = shost->dma_dev;
 
 	INIT_LIST_HEAD(&sas_host->rphy_list);
 	mutex_init(&sas_host->lock);
@@ -236,6 +237,11 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 		dev_printk(KERN_ERR, dev, "fail to a bsg device %d\n",
 			   shost->host_no);
 
+	if (dma_dev) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	return 0;
 }
-- 
2.35.3
[PATCH v4 2/5] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+	return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
@@ -1479,6 +1484,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.35.3
[PATCH v4 3/5] scsi: core: Cap shost max_sectors according to DMA mapping limits only once
The shost->max_sectors is repeatedly capped according to the host DMA mapping limit for each sdev in __scsi_init_queue(). This is unnecessary, so cap it only once, when adding the host.

Signed-off-by: John Garry
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 8352f90d997d..d04bd2c7c9f1 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -236,6 +236,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	shost->dma_dev = dma_dev;
 
+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_max_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	error = scsi_mq_setup_tags(shost);
 	if (error)
 		goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6ffc9e4258a8..6ce8acea322a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 		blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
 	}
 
-	if (dev->dma_mask) {
-		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-				dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-	}
 	blk_queue_max_hw_sectors(q, shost->max_sectors);
 	blk_queue_segment_boundary(q, shost->dma_boundary);
 	dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.35.3
[PATCH v4 1/5] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mappings involving an IOMMU may be much slower for larger total mapping sizes. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 Documentation/core-api/dma-api.rst |  9 +++++++++
 include/linux/dma-map-ops.h        |  1 +
 include/linux/dma-mapping.h        |  5 +++++
 kernel/dma/mapping.c               | 12 ++++++++++++
 4 files changed, 27 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..b3cd9763d28b 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,15 @@ Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and
 others should not be larger than the returned value.
 
+::
+
+	size_t
+	dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.
+
 ::
 
 	bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
 	int (*dma_supported)(struct device *dev, u64 mask);
 	u64 (*get_required_mask)(struct device *dev);
 	size_t (*max_mapping_size)(struct device *dev);
+	size_t (*opt_mapping_size)(void);
 	unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
 	return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+	return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+	size_t size = SIZE_MAX;
+
+	if (ops && ops->opt_mapping_size)
+		size = ops->opt_mapping_size();
+
+	return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.35.3
[PATCH v4 0/5] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping.

The SCSI SAS transport code is modified only to use this limit. For now I did not want to touch other hosts as I have a concern that this change could cause a performance regression.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

Changes since v3:
- Apply max DMA optimal limit to SAS hosts only
  Note: Even though "scsi: core: Cap shost max_sectors only once when adding" is a subset of a previous patch I did not transfer the RB tags
- Rebase on v5.19-rc4

Changes since v2:
- Rebase on v5.19-rc1
- Add Damien's tag to 2/4 (thanks)

Changes since v1:
- Relocate scsi_add_host_with_dma() dma_dev check (Reported by Dan)
- Add tags from Damien and Martin (thanks)
  note: I only added Martin's tag to the SCSI patch

John Garry (5):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA mapping limits only once
  scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal mapping limit
  libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst |  9 +++++++++
 drivers/ata/libata-scsi.c          |  1 +
 drivers/iommu/dma-iommu.c          |  6 ++++++
 drivers/iommu/iova.c               |  5 +++++
 drivers/scsi/hosts.c               |  5 +++++
 drivers/scsi/scsi_lib.c            |  4 ----
 drivers/scsi/scsi_transport_sas.c  |  6 ++++++
 include/linux/dma-map-ops.h        |  1 +
 include/linux/dma-mapping.h        |  5 +++++
 include/linux/iova.h               |  2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 11 files changed, 52 insertions(+), 4 deletions(-)

-- 
2.35.3
Re: [PATCH v3 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
On 14/06/2022 14:12, John Garry wrote:

On 06/06/2022 10:30, John Garry wrote:

Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Can I please get some sort of ack from the IOMMU people on this one?

Another request for an ack please.

Thanks,
John

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+	return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
@@ -1479,6 +1484,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 10/06/2022 16:37, John Garry via iommu wrote:

On 6/9/22 10:54, John Garry wrote:

ok, but do you have a system where the UFS host controller is behind an IOMMU? I had the impression that UFS controllers would be mostly found in embedded systems and that IOMMUs are not as common there.

Modern phones have an IOMMU. Below one can find an example from a Pixel 6 phone. The UFS storage controller is not controlled by the IOMMU as far as I can see, but I wouldn't be surprised if the security team would ask us one day to enable the IOMMU for the UFS controller.

OK, then unfortunately it seems that you have no method to test. I might be able to test USB MSC, but I am not even sure I can get DMA mappings whose length exceeds the IOVA rcache limit there.

I was able to do some testing on USB MSC for an XHCI controller. The result is that limiting the max HW sectors there does not affect performance in normal conditions. However if I hack the USB driver and fiddle with request queue settings then it can:
- lift max_sectors limit in usb_stor_host_template 120KB -> 256KB
- lift request queue read_ahead_kb 128KB -> 256KB

In this scenario I can get 42.5MB/s read throughput, as opposed to 39.5MB/s in normal conditions. Since .can_queue=1 for that host it would not fall foul of some issues I experience in IOVA allocator performance, so limiting max_sectors would not be required for that reason.

So this is an artificial test, but it may be worth considering only applying this DMA mapping optimal max_sectors limit to SAS controllers, which I know can benefit. Christoph, any opinion?

thanks,
John
Re: [PATCH v3 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
On 06/06/2022 10:30, John Garry wrote:

Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Can I please get some sort of ack from the IOMMU people on this one?

Thanks,
John

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+	return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
@@ -1479,6 +1484,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 09/06/2022 21:34, Bart Van Assche wrote:

On 6/9/22 10:54, John Garry wrote:

ok, but do you have a system where the UFS host controller is behind an IOMMU? I had the impression that UFS controllers would be mostly found in embedded systems and that IOMMUs are not as common there.

Modern phones have an IOMMU. Below one can find an example from a Pixel 6 phone. The UFS storage controller is not controlled by the IOMMU as far as I can see, but I wouldn't be surprised if the security team would ask us one day to enable the IOMMU for the UFS controller.

OK, then unfortunately it seems that you have no method to test. I might be able to test USB MSC, but I am not even sure I can get DMA mappings whose length exceeds the IOVA rcache limit there.

Thanks,
John
Re: [PATCH v2] iommu/dma: Add config for PCI SAC address trick
On 09/06/2022 16:12, Robin Murphy wrote:

For devices stuck behind a conventional PCI bus, saving extra cycles at 33MHz is probably fairly significant. However since native PCI Express is now the norm for high-performance devices, the optimisation to always prefer 32-bit addresses for the sake of avoiding DAC is starting to look rather anachronistic. Technically 32-bit addresses do have shorter TLPs on PCIe, but unless the device is saturating its link bandwidth with small transfers it seems unlikely that the difference is appreciable.

What definitely is appreciable, however, is that the IOVA allocator doesn't behave all that well once the 32-bit space starts getting full. As DMA working sets get bigger, this optimisation increasingly backfires and adds considerable overhead to the dma_map path for use-cases like high-bandwidth networking. We've increasingly bandaged the allocator in attempts to mitigate this, but it remains fundamentally at odds with other valid requirements to try as hard as possible to satisfy a request within the given limit; what we really need is to just avoid this odd notion of a speculative allocation when it isn't beneficial anyway.

Unfortunately that's where things get awkward... Having been present on x86 for 15 years or so now, it turns out there are systems which fail to properly define the upper limit of usable IOVA space for certain devices and this trick was the only thing letting them work OK. I had a similar ulterior motive for a couple of early arm64 systems when originally adding it to iommu-dma, but those really should be fixed with proper firmware bindings by now. Let's be brave and default it to off in the hope that CI systems and developers will find and fix those bugs, but expect that desktop-focused distro configs are likely to want to turn it back on for maximum compatibility.
Signed-off-by: Robin Murphy

FWIW,
Reviewed-by: John Garry

If we're not enabling by default for x86 then doesn't Joerg have some XHCI issue which we would now need to quirk? I don't remember which device exactly. Or, alternatively, simply ask him to enable this new config.

---
v2: Tweak wording to clarify that it's not really an optimisation in general, remove "default X86".

 drivers/iommu/Kconfig     | 26 ++++++++++++++++++++++++++
 drivers/iommu/dma-iommu.c |  2 +-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index c79a0df090c0..5a225b48dd00 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -144,6 +144,32 @@ config IOMMU_DMA
 	select IRQ_MSI_IOMMU
 	select NEED_SG_DMA_LENGTH
 
+config IOMMU_DMA_PCI_SAC
+	bool "Enable 64-bit legacy PCI optimisation by default"
+	depends on IOMMU_DMA
+	help
+	  Enable by default an IOMMU optimisation for 64-bit legacy PCI devices,
+	  wherein the DMA API layer will always first try to allocate a 32-bit
+	  DMA address suitable for a single address cycle, before falling back
+	  to allocating from the device's full usable address range. If your
+	  system has 64-bit legacy PCI devices in 32-bit slots where using dual
+	  address cycles reduces DMA throughput significantly, this may be
+	  beneficial to overall performance.
+
+	  If you have a modern PCI Express based system, this feature mostly just
+	  represents extra overhead in the allocation path for no practical
+	  benefit, and it should usually be preferable to say "n" here.
+
+	  However, beware that this feature has also historically papered over
+	  bugs where the IOMMU address width and/or device DMA mask is not set
+	  correctly. If device DMA problems and IOMMU faults start occurring
+	  after disabling this option, it is almost certainly indicative of a
+	  latent driver or firmware/BIOS bug, which would previously have only
+	  manifested with several gigabytes worth of concurrent DMA mappings.
+
+	  If this option is not set, the feature can still be re-enabled at
+	  boot time with the "iommu.forcedac=0" command-line argument.
+
 # Shared Virtual Addressing
 config IOMMU_SVA
 	bool
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9f9d9ba7f376 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -67,7 +67,7 @@ struct iommu_dma_cookie {
 };
 
 static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
-bool iommu_dma_forcedac __read_mostly;
+bool iommu_dma_forcedac __read_mostly = !IS_ENABLED(CONFIG_IOMMU_DMA_PCI_SAC);
 
 static int __init iommu_dma_forcedac_setup(char *str)
 {
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 09/06/2022 18:18, Bart Van Assche wrote:

SCSI host bus adapters that support 64-bit DMA may support much larger transfer sizes than 128 KiB.

Indeed, and that is my problem today, as my storage controller is generating DMA mapping lengths which exceed 128K and they slow everything down. If you say that SRP enjoys best performance with larger transfers then can you please test this with an IOMMU enabled (iommu group type DMA or DMA-FQ)?

Hmm ... what exactly do you want me to test? Do you perhaps want me to measure how much performance drops with an IOMMU enabled?

Yes, I would like to know of any performance change with an IOMMU enabled, and then with an IOMMU enabled and including my series.

I don't have access anymore to the SRP setup I referred to in my previous email. But I do have access to devices that boot from UFS storage. For these devices we need to transfer 2 MiB per request to achieve full bandwidth.

ok, but do you have a system where the UFS host controller is behind an IOMMU? I had the impression that UFS controllers would be mostly found in embedded systems and that IOMMUs are not as common there.

Thanks,
John
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 08/06/2022 22:07, Bart Van Assche wrote:

On 6/8/22 10:50, John Garry wrote:

Please note that this limit only applies if we have an IOMMU enabled for the scsi host dma device. Otherwise we are limited by dma direct or swiotlb max mapping size, as before.

SCSI host bus adapters that support 64-bit DMA may support much larger transfer sizes than 128 KiB.

Indeed, and that is my problem today, as my storage controller is generating DMA mapping lengths which exceed 128K and they slow everything down. If you say that SRP enjoys best performance with larger transfers then can you please test this with an IOMMU enabled (iommu group type DMA or DMA-FQ)?

Thanks,
John
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 08/06/2022 18:33, Bart Van Assche wrote:

On 6/6/22 02:30, John Garry wrote:

+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}

Since IOVA_RANGE_CACHE_MAX_SIZE = 6 this limits max_sectors to 2**6 * PAGE_SIZE or 256 KiB if the page size is 4 KiB.

It's actually 128K for 4K page size, as any IOVA size is rounded up to a power-of-2 when testing if we may cache it, which means anything >128K would round up to 256K and cannot be cached.

I think that's too small. Some (SRP) storage arrays require much larger transfers to achieve optimal performance.

Have you tried to achieve this optimal performance with an IOMMU enabled? Please note that this limit only applies if we have an IOMMU enabled for the scsi host dma device. Otherwise we are limited by dma direct or swiotlb max mapping size, as before.

Thanks,
John
Re: [PATCH v3 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
On 08/06/2022 18:26, Bart Van Assche wrote:

On 6/6/22 02:30, John Garry via iommu wrote:

+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}

My understanding is that iova cache entries may be smaller than IOVA_RANGE_CACHE_MAX_SIZE and hence that even if code that uses the DMA mapping API respects this limit, a cache miss can still happen.

Sure, a cache miss may still happen - however once we have stressed the system for a while the rcaches fill up and don't miss often, or often enough to be noticeable compared to not having cached IOVAs at all.

Thanks,
John
Re: [PATCH v3 0/4] DMA mapping changes for SCSI core
On 07/06/2022 23:43, Bart Van Assche wrote:

On 6/6/22 02:30, John Garry wrote:

As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit. This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping. Robin didn't like using dma_max_mapping_size() for this [1]. The SCSI core code is modified to use this limit. I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

Note: Christoph has previously kindly offered to take this series via the dma-mapping tree, so I think that we just need an ack from the IOMMU guys now.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

Regarding [0], that patch reverts commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"). Reading the description of that patch, it seems to me that the iova allocator can be improved. Shouldn't the iova allocator be improved such that we don't need this patch series? There are algorithms that handle fragmentation much better than the current iova allocator algorithm, e.g. the https://en.wikipedia.org/wiki/Buddy_memory_allocation algorithm.

Regardless of whether the IOVA allocator can be improved - which it probably can be - this series is still useful. That is due to the IOVA rcache - a cache of pre-allocated IOVAs which can be quickly used in the DMA mapping. The rcache contains IOVAs up to a certain fixed size. In this series we limit the DMA mapping length to the rcache size upper limit to always bypass the allocator (when we have a cached IOVA available) - see alloc_iova_fast().

Even if the IOVA allocator were greatly optimised for speed, there would still be an overhead in the alloc and free for those larger IOVAs which would outweigh the advantage of having larger DMA mappings. But is there even an advantage in very large streaming DMA mappings? Maybe for iotlb efficiency. But some say it's better to have the DMA engine start processing the data ASAP and not wait for larger lists to be built.

Thanks,
John
[PATCH v3 4/4] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value, overriding the value set earlier in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which include host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
Acked-by: Damien Le Moal
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 42cecf95a4e5..8b4b318f378d 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1060,6 +1060,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
                 dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
         /* configure max sectors */
+        dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
         blk_queue_max_hw_sectors(q, dev->max_sectors);
 
         if (dev->class == ATA_DEV_ATAPI) {
-- 
2.26.2
[PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the request_queue max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

In addition, the shost->max_sectors is repeatedly set for each sdev in __scsi_init_queue(). This is unnecessary, so set it once when adding the host.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
Reviewed-by: Martin K. Petersen
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 8352f90d997d..ea1a207634d1 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -236,6 +236,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
         shost->dma_dev = dma_dev;
 
+        if (dma_dev->dma_mask) {
+                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+                                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+        }
+
         error = scsi_mq_setup_tags(shost);
         if (error)
                 goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6ffc9e4258a8..6ce8acea322a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
                 blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
         }
 
-        if (dev->dma_mask) {
-                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-                                dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-        }
         blk_queue_max_hw_sectors(q, shost->max_sectors);
         blk_queue_segment_boundary(q, shost->dma_boundary);
         dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.26.2
[PATCH v3 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for the DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
         return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+        return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
         .alloc = iommu_dma_alloc,
         .free = iommu_dma_free,
@@ -1479,6 +1484,7 @@
         .map_resource = iommu_dma_map_resource,
         .unmap_resource = iommu_dma_unmap_resource,
         .get_merge_boundary = iommu_dma_get_merge_boundary,
+        .opt_mapping_size = iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+        return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
         struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.26.2
[PATCH v3 1/4] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mapping involving an IOMMU may be much slower for larger total mapping sizes. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 Documentation/core-api/dma-api.rst | 9 +++++++++
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 kernel/dma/mapping.c               | 12 ++++++++++++
 4 files changed, 27 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..b3cd9763d28b 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,15 @@
 Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and others
 should not be larger than the returned value.
 
+::
+
+        size_t
+        dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.
+
 ::
 
         bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
         int (*dma_supported)(struct device *dev, u64 mask);
         u64 (*get_required_mask)(struct device *dev);
         size_t (*max_mapping_size)(struct device *dev);
+        size_t (*opt_mapping_size)(void);
         unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
         return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+        return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
         return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+        const struct dma_map_ops *ops = get_dma_ops(dev);
+        size_t size = SIZE_MAX;
+
+        if (ops && ops->opt_mapping_size)
+                size = ops->opt_mapping_size();
+
+        return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
         const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.26.2
[PATCH v3 0/4] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping. Robin didn't like using dma_max_mapping_size() for this [1].

The SCSI core code is modified to use this limit.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

Note: Christoph has previously kindly offered to take this series via the dma-mapping tree, so I think that we just need an ack from the IOMMU guys now.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

Changes since v2:
- Rebase on v5.19-rc1
- Add Damien's tag to 2/4 (thanks)

Changes since v1:
- Relocate scsi_add_host_with_dma() dma_dev check (Reported by Dan)
- Add tags from Damien and Martin (thanks)
- note: I only added Martin's tag to the SCSI patch

John Garry (4):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
  libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst | 9 +++++++++
 drivers/ata/libata-scsi.c          | 1 +
 drivers/iommu/dma-iommu.c          | 6 ++++++
 drivers/iommu/iova.c               | 5 +++++
 drivers/scsi/hosts.c               | 5 +++++
 drivers/scsi/scsi_lib.c            | 4 ----
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 include/linux/iova.h               | 2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 10 files changed, 46 insertions(+), 4 deletions(-)

-- 
2.26.2
[PATCH v2 4/4] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value, overriding the value set earlier in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which include host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
Acked-by: Damien Le Moal
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 06c9d90238d9..25fe89791641 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1036,6 +1036,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
                 dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
         /* configure max sectors */
+        dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
         blk_queue_max_hw_sectors(q, dev->max_sectors);
 
         if (dev->class == ATA_DEV_ATAPI) {
-- 
2.26.2
[PATCH v2 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the request_queue max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

In addition, the shost->max_sectors is repeatedly set for each sdev in __scsi_init_queue(). This is unnecessary, so set it once when adding the host.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
Reviewed-by: Martin K. Petersen
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..9563c0ac567a 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -240,6 +240,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
         shost->dma_dev = dma_dev;
 
+        if (dma_dev->dma_mask) {
+                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+                                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+        }
+
         /*
          * Increase usage count temporarily here so that calling
          * scsi_autopm_put_host() will trigger runtime idle if there is
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..2d43bb8799bd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
                 blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
         }
 
-        if (dev->dma_mask) {
-                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-                                dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-        }
         blk_queue_max_hw_sectors(q, shost->max_sectors);
         blk_queue_segment_boundary(q, shost->dma_boundary);
         dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.26.2
[PATCH v2 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for the DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 09f6e1c0f9c0..f619e41b9172 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1442,6 +1442,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
         return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+        return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
         .alloc = iommu_dma_alloc,
         .free = iommu_dma_free,
@@ -1462,6 +1467,7 @@
         .map_resource = iommu_dma_map_resource,
         .unmap_resource = iommu_dma_unmap_resource,
         .get_merge_boundary = iommu_dma_get_merge_boundary,
+        .opt_mapping_size = iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+        return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
         struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.26.2
[PATCH v2 0/4] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping. Robin didn't like using dma_max_mapping_size() for this [1].

The SCSI core code is modified to use this limit.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

Changes since v1:
- Relocate scsi_add_host_with_dma() dma_dev check (Reported by Dan)
- Add tags from Damien and Martin (thanks)
- note: I only added Martin's tag to the SCSI patch

John Garry (4):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
  libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst | 9 +++++++++
 drivers/ata/libata-scsi.c          | 1 +
 drivers/iommu/dma-iommu.c          | 6 ++++++
 drivers/iommu/iova.c               | 5 +++++
 drivers/scsi/hosts.c               | 5 +++++
 drivers/scsi/scsi_lib.c            | 4 ----
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 include/linux/iova.h               | 2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 10 files changed, 46 insertions(+), 4 deletions(-)

-- 
2.26.2
[PATCH v2 1/4] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mapping involving an IOMMU may be much slower for larger total mapping sizes. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 Documentation/core-api/dma-api.rst | 9 +++++++++
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 kernel/dma/mapping.c               | 12 ++++++++++++
 4 files changed, 27 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..b3cd9763d28b 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,15 @@
 Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and others
 should not be larger than the returned value.
 
+::
+
+        size_t
+        dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.
+
 ::
 
         bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
         int (*dma_supported)(struct device *dev, u64 mask);
         u64 (*get_required_mask)(struct device *dev);
         size_t (*max_mapping_size)(struct device *dev);
+        size_t (*opt_mapping_size)(void);
         unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
         return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+        return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
         return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+        const struct dma_map_ops *ops = get_dma_ops(dev);
+        size_t size = SIZE_MAX;
+
+        if (ops && ops->opt_mapping_size)
+                size = ops->opt_mapping_size();
+
+        return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
         const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.26.2
Re: [PATCH 0/4] DMA mapping changes for SCSI core
On 22/05/2022 23:22, Damien Le Moal wrote:
On 2022/05/22 22:13, Christoph Hellwig wrote:
The whole series looks fine to me. I'll happily queue it up in the dma-mapping tree if the SCSI and ATA maintainers are ok with that.

Fine with me. I sent an acked-by for the libata bit.

Thanks. I'm going to have to post a v2, and I figure that with the timing I'll have to wait for v5.20 now.

Cheers,
John
Re: [PATCH 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 23/05/2022 12:08, Dan Carpenter wrote:

Thanks for the report

50b6cb3516365c Dexuan Cui        2021-10-07  224	/* Use min_t(int, ...) in case shost->can_queue exceeds SHRT_MAX */
50b6cb3516365c Dexuan Cui        2021-10-07  225	shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
ea2f0f77538c50 John Garry        2021-05-19  226				   shost->can_queue);
ea2f0f77538c50 John Garry        2021-05-19  227
2ad7ba6ca08593 John Garry        2022-05-20 @228	if (dma_dev->dma_mask) {
							    ^

I knew that we fixed up dma_dev to be non-NULL, but I thought it was earlier in this function...

The patch adds a new unchecked dereference

2ad7ba6ca08593 John Garry        2022-05-20  229		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
2ad7ba6ca08593 John Garry        2022-05-20  230				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
2ad7ba6ca08593 John Garry        2022-05-20  231	}
2ad7ba6ca08593 John Garry        2022-05-20  232
0a6ac4ee7c2109 Christoph Hellwig 2017-01-03  233	error = scsi_init_sense_cache(shost);
0a6ac4ee7c2109 Christoph Hellwig 2017-01-03  234	if (error)
0a6ac4ee7c2109 Christoph Hellwig 2017-01-03  235		goto fail;
0a6ac4ee7c2109 Christoph Hellwig 2017-01-03  236
d285203cf647d7 Christoph Hellwig 2014-01-17  237	error = scsi_mq_setup_tags(shost);
542bd1377a9630 James Bottomley   2008-04-21  238	if (error)
542bd1377a9630 James Bottomley   2008-04-21  239		goto fail;
d285203cf647d7 Christoph Hellwig 2014-01-17  240
^1da177e4c3f41 Linus Torvalds    2005-04-16  241	if (!shost->shost_gendev.parent)
^1da177e4c3f41 Linus Torvalds    2005-04-16  242		shost->shost_gendev.parent = dev ? dev : &platform_bus;
3c8d9a957d0ae6 James Bottomley   2012-05-04 @243	if (!dma_dev)

Cheers,
John
Re: [PATCH 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
On 21/05/2022 00:33, Damien Le Moal wrote:

Hi Damien,

+unsigned long iova_rcache_range(void)

Why not a size_t return type?

The IOVA code generally uses unsigned long for size/range while dma-iommu uses size_t as appropriate, so I'm just sticking to that.

+{
+        return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+

Thanks,
John
Re: [PATCH 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 21/05/2022 00:30, Damien Le Moal wrote:

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..a3ae6345473b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -225,6 +225,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
         shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
                                    shost->can_queue);

Hi Damien,

+        if (dma_dev->dma_mask) {
+                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+                                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+        }

Nit: you could drop the curly brackets here.

Some people prefer this style - multi-line statements have curly brackets, while single-line statements conform to the official coding style (and don't use brackets). I'll just stick with what we have unless there is a consensus to change.

Thanks,
John

+
         error = scsi_init_sense_cache(shost);
         if (error)
                 goto fail;
[PATCH 4/4] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value, overriding the value set earlier in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which include host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 06c9d90238d9..25fe89791641 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1036,6 +1036,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
                 dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
         /* configure max sectors */
+        dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
         blk_queue_max_hw_sectors(q, dev->max_sectors);
 
         if (dev->class == ATA_DEV_ATAPI) {
-- 
2.26.2
[PATCH 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the request_queue max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

In addition, the shost->max_sectors is repeatedly set for each sdev in __scsi_init_queue(). This is unnecessary, so set it once when adding the host.

Signed-off-by: John Garry
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..a3ae6345473b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -225,6 +225,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
         shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
                                    shost->can_queue);
 
+        if (dma_dev->dma_mask) {
+                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+                                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+        }
+
         error = scsi_init_sense_cache(shost);
         if (error)
                 goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..2d43bb8799bd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
                 blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
         }
 
-        if (dev->dma_mask) {
-                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-                                dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-        }
         blk_queue_max_hw_sectors(q, shost->max_sectors);
         blk_queue_segment_boundary(q, shost->dma_boundary);
         dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.26.2
[PATCH 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for the DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 09f6e1c0f9c0..f619e41b9172 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1442,6 +1442,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
         return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+        return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
         .alloc = iommu_dma_alloc,
         .free = iommu_dma_free,
@@ -1462,6 +1467,7 @@
         .map_resource = iommu_dma_map_resource,
         .unmap_resource = iommu_dma_unmap_resource,
         .get_merge_boundary = iommu_dma_get_merge_boundary,
+        .opt_mapping_size = iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+        return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
         struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.26.2
[PATCH 0/4] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping. Robin didn't like using dma_max_mapping_size() for this [1].

An alternative to adding the new API would be to add a "hard_limit" arg to dma_max_mapping_size(). This would mean fixing up all current users, but it would be good to do that anyway as not all users require a hard limit.

The SCSI core code is modified to use this limit.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

John Garry (4):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
  libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst | 9 +++++++++
 drivers/ata/libata-scsi.c          | 1 +
 drivers/iommu/dma-iommu.c          | 6 ++++++
 drivers/iommu/iova.c               | 5 +++++
 drivers/scsi/hosts.c               | 5 +++++
 drivers/scsi/scsi_lib.c            | 4 ----
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 include/linux/iova.h               | 2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 10 files changed, 46 insertions(+), 4 deletions(-)

-- 
2.26.2
[PATCH 1/4] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mappings involving an IOMMU may be much slower for larger total mapping sizes. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance. Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it. Signed-off-by: John Garry --- Documentation/core-api/dma-api.rst | 9 + include/linux/dma-map-ops.h| 1 + include/linux/dma-mapping.h| 5 + kernel/dma/mapping.c | 12 4 files changed, 27 insertions(+) diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst index 6d6d0edd2d27..b3cd9763d28b 100644 --- a/Documentation/core-api/dma-api.rst +++ b/Documentation/core-api/dma-api.rst @@ -204,6 +204,15 @@ Returns the maximum size of a mapping for the device. The size parameter of the mapping functions like dma_map_single(), dma_map_page() and others should not be larger than the returned value. +:: + + size_t + dma_opt_mapping_size(struct device *dev); + +Returns the maximum optimal size of a mapping for the device. Mapping large +buffers may take longer so device drivers are advised to limit total DMA +streaming mappings length to the returned value. 
+ :: bool diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index 0d5b06b3a4a6..98ceba6fa848 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -69,6 +69,7 @@ struct dma_map_ops { int (*dma_supported)(struct device *dev, u64 mask); u64 (*get_required_mask)(struct device *dev); size_t (*max_mapping_size)(struct device *dev); + size_t (*opt_mapping_size)(void); unsigned long (*get_merge_boundary)(struct device *dev); }; diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index dca2b1355bb1..fe3849434b2a 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask); int dma_set_coherent_mask(struct device *dev, u64 mask); u64 dma_get_required_mask(struct device *dev); size_t dma_max_mapping_size(struct device *dev); +size_t dma_opt_mapping_size(struct device *dev); bool dma_need_sync(struct device *dev, dma_addr_t dma_addr); unsigned long dma_get_merge_boundary(struct device *dev); struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size, @@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev) { return 0; } +static inline size_t dma_opt_mapping_size(struct device *dev) +{ + return 0; +} static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr) { return false; diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index db7244291b74..1bfe11b1edb6 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev) } EXPORT_SYMBOL_GPL(dma_max_mapping_size); +size_t dma_opt_mapping_size(struct device *dev) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + size_t size = SIZE_MAX; + + if (ops && ops->opt_mapping_size) + size = ops->opt_mapping_size(); + + return min(dma_max_mapping_size(dev), size); +} +EXPORT_SYMBOL_GPL(dma_opt_mapping_size); + bool dma_need_sync(struct device *dev, dma_addr_t 
dma_addr) { const struct dma_map_ops *ops = get_dma_ops(dev); -- 2.26.2
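The fallback logic of the new helper in kernel/dma/mapping.c is simple enough to model in userspace: if the dma_map_ops provides opt_mapping_size(), the result is clamped by the hard max_mapping_size(); otherwise the hard limit alone is returned. The struct and function names below are simplified models, not the kernel's real types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of struct dma_map_ops with just the two callbacks */
struct dma_map_ops_model {
	size_t (*max_mapping_size)(void);
	size_t (*opt_mapping_size)(void);
};

static size_t dma_max_mapping_size_model(const struct dma_map_ops_model *ops)
{
	/* no hard-limit callback means "no limit" */
	return (ops && ops->max_mapping_size) ? ops->max_mapping_size() : SIZE_MAX;
}

/* Mirrors dma_opt_mapping_size(): min(hard limit, optional opt limit) */
static size_t dma_opt_mapping_size_model(const struct dma_map_ops_model *ops)
{
	size_t size = SIZE_MAX;
	size_t max;

	if (ops && ops->opt_mapping_size)
		size = ops->opt_mapping_size();

	max = dma_max_mapping_size_model(ops);
	return size < max ? size : max;
}

static size_t opt_128k(void)
{
	return 128 * 1024;	/* assumed IOVA rcache ceiling */
}
```

Note the deliberate asymmetry: a missing opt_mapping_size() degrades gracefully to the hard limit, so callers can use the new API unconditionally.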
Re: [PATCH] iommu/dma: Add config for PCI SAC address trick
On 18/05/2022 18:36, Robin Murphy wrote: For devices stuck behind a conventional PCI bus, saving extra cycles at 33MHz is probably fairly significant. However since native PCI Express is now the norm for high-performance devices, the optimisation to always prefer 32-bit addresses for the sake of avoiding DAC is starting to look rather anachronistic. Technically 32-bit addresses do have shorter TLPs on PCIe, but unless the device is saturating its link bandwidth with small transfers it seems unlikely that the difference is appreciable. What definitely is appreciable, however, is that the IOVA allocator doesn't behave all that well once the 32-bit space starts getting full. As DMA working sets get bigger, this optimisation increasingly backfires and adds considerable overhead to the dma_map path for use-cases like high-bandwidth networking. We've increasingly bandaged the allocator in attempts to mitigate this, but it remains fundamentally at odds with other valid requirements to try as hard as possible to satisfy a request within the given limit; what we really need is to just avoid this odd notion of a speculative allocation when it isn't beneficial anyway. Unfortunately that's where things get awkward... Having been present on x86 for 15 years or so now, it turns out there are systems which fail to properly define the upper limit of usable IOVA space for certain devices and this trick was the only thing letting them work OK. I had a similar ulterior motive for a couple of early arm64 systems when originally adding it to iommu-dma, but those really should now be fixed with proper firmware bindings, and other arm64 users really need it out of the way, so let's just leave it default-on for x86. 
Signed-off-by: Robin Murphy --- drivers/iommu/Kconfig | 24 drivers/iommu/dma-iommu.c | 2 +- It might be worth printing this default value always and not just for when it is set from commandline, like what we do for default domain type and IOTLB invalidation policy 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index c79a0df090c0..bf9b295f1c89 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -144,6 +144,30 @@ config IOMMU_DMA select IRQ_MSI_IOMMU select NEED_SG_DMA_LENGTH +config IOMMU_DMA_PCI_SAC_OPT + bool "Enable 64-bit legacy PCI optimisation by default" + depends on IOMMU_DMA + default X86 Do we have a strategy for if and when issues start popping up on other architectures? Is it to simply tell them to just turn this flag on (and also fix your platform)? + help + Enable by default an IOMMU optimisation for 64-bit legacy PCI devices, + wherein the DMA API layer will always first try to allocate a 32-bit + DMA address suitable for a single address cycle, before falling back + to allocating from the full usable address range. If your system has + 64-bit legacy PCI devices in 32-bit slots where using dual address + cycles reduces DMA throughput significantly, this optimisation may be + beneficial to overall performance. + + If you have a modern PCI Express based system, it should usually be + safe to say "n" here and avoid the potential extra allocation overhead. + However, beware that this optimisation has also historically papered + over bugs where the IOMMU address range above 32 bits is not fully + usable. If device DMA problems and/or IOMMU faults start occurring + with IOMMU translation enabled after disabling this option, it is + likely a sign of a latent driver or firmware/BIOS bug. + + If this option is not set, the optimisation can be enabled at + boot time with the "iommu.forcedac=0" command-line argument. 
+ Thanks, John
Re: [RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()
On 17/05/2022 13:02, Robin Murphy wrote: Indeed, sorry but NAK for this being nonsense. As I've said at least once before, if the unnecessary SAC address allocation attempt slows down your workload, make it not do that in the first place. If you don't like the existing command-line parameter then fine, > there are plenty of other options, it just needs to be done in a way that doesn't break x86 systems with dodgy firmware, as my first attempt turned out to. Sorry, but I am not interested in this. It was discussed in Jan last year without any viable solution. Er, OK, if you're not interested in solving that problem I don't see why you'd bring it up, but hey ho. *I* still think it's important, so I guess I'll revive my old patch with a CONFIG_X86 bodge and have another go at some point. Let me rephrase, I would be happy to help fix that problem if we really can get it fixed, however for my problem it's better to try to get the SCSI driver to stop requesting uncached IOVAs foremost. Anyway we still have the long-term IOVA aging issue, and requesting non-cached IOVAs is involved in that. So I would rather keep the SCSI driver to requesting cached IOVAs all the time. I did try to do it the other way around - configuring the IOVA caching range according to the drivers requirement but that got nowhere. Note that this is still not a final solution as it's not always viable to ask a user to unbind + bind the driver. FWIW I thought that all looked OK, it just kept getting drowned out by more critical things in my inbox so I hoped someone else might comment. If it turns out that I've become the de-facto IOVA maintainer in everyone else's minds now and they're all waiting for my word then fair enough, I just need to know and reset my expectations accordingly. Joerg? It would be great to see an improvement here... Furthermore, if a particular SCSI driver doesn't benefit from mappings larger than 256KB, then that driver is also free to limit its own mapping size. 
There are other folks out there with use-cases for mapping *gigabytes* at once; you don't get to cripple the API and say that that's suddenly not allowed just because it happens to make your thing go faster, that's absurd. I'd say less catastrophically slow, not faster. So how to inform the SCSI driver of this caching limit then so that it may limit the SGL length? Driver-specific mechanism; block-layer-specific mechanism; redefine this whole API to something like dma_opt_mapping_size(), as a limit above which mappings might become less efficient or start to fail (callback to my thoughts on [1] as well, I suppose); many options. ok, fine. Just not imposing a ridiculously low *maximum* on everyone wherein mapping calls "should not be larger than the returned value" when that's clearly bollocks. I agree that this change is in violation as the documentation clearly implies a hard limit. However, FWIW, from looking at users of dma_max_mapping_size(), they seem to use it in a similar way to the SCSI/block layer core, i.e. use this value to limit the max SGL total len per IO command. And not as a method to learn max DMA consistent mappings size for ring buffers, etc. Anyway I'll look at dma_opt_mapping_size() but I am not sure how keen Christoph will be on this... it is strange to introduce that API due to a peculiarity of the IOVA allocator. [1] https://lore.kernel.org/linux-iommu/20220510142109.38-1-ltyker...@gmail.com/ Thanks, John
Re: [RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()
On 17/05/2022 11:40, Robin Murphy wrote: On 2022-05-16 14:06, John Garry wrote: For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. Add the IOMMU callback for DMA mapping API dma_max_mapping_size(), which allows the drivers to know the mapping limit and thus limit the requested IOVA lengths. This resolves the performance issue originally reported in [0] for a SCSI HBA driver which was regularly mapping SGLs which required IOVAs in excess of the IOVA caching limit. In this case the block layer limits the max sectors per request - as configured in __scsi_init_queue() - which will limit the total SGL length the driver tries to map and in turn limits IOVA lengths requested. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- Sending as an RFC as iommu_dma_max_mapping_size() is a soft limit, and not a hard limit which I expect is the semantics of dma_map_ops.max_mapping_size Indeed, sorry but NAK for this being nonsense. As I've said at least once before, if the unnecessary SAC address allocation attempt slows down your workload, make it not do that in the first place. If you don't like the existing command-line parameter then fine, > there are plenty of other options, it just needs to be done in a way that doesn't break x86 systems with dodgy firmware, as my first attempt turned out to. Sorry, but I am not interested in this. It was discussed in Jan last year without any viable solution. Anyway we still have the long-term IOVA aging issue, and requesting non-cached IOVAs is involved in that. So I would rather keep the SCSI driver to requesting cached IOVAs all the time. I did try to do it the other way around - configuring the IOVA caching range according to the drivers requirement but that got nowhere. 
Furthermore, if a particular SCSI driver doesn't benefit from mappings larger than 256KB, then that driver is also free to limit its own mapping size. There are other folks out there with use-cases for mapping *gigabytes* at once; you don't get to cripple the API and say that that's suddenly not allowed just because it happens to make your thing go faster, that's absurd. I'd say less catastrophically slow, not faster. So how to inform the SCSI driver of this caching limit then so that it may limit the SGL length? Thanks, John
Re: [RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()
On 17/05/2022 09:38, Christoph Hellwig wrote: On Mon, May 16, 2022 at 09:06:01PM +0800, John Garry wrote: For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. Add the IOMMU callback for DMA mapping API dma_max_mapping_size(), which allows the drivers to know the mapping limit and thus limit the requested IOVA lengths. This resolves the performance issue originally reported in [0] for a SCSI HBA driver which was regularly mapping SGLs which required IOVAs in excess of the IOVA caching limit. In this case the block layer limits the max sectors per request - as configured in __scsi_init_queue() - which will limit the total SGL length the driver tries to map and in turn limits IOVA lengths requested. BTW, on a separate topic, I noticed that even with this change my ATA devices have max_hw_sectors_kb of 32767, as opposed to 128 for SAS devices. It seems that libata-scsi - specifically ata_scsi_dev_config() - doesn't honour the shost max_sectors limit. I guess that is not intentional. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- Sending as an RFC as iommu_dma_max_mapping_size() is a soft limit, and not a hard limit which I expect is the semantics of dma_map_ops.max_mapping_size diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 09f6e1c0f9c0..e2d5205cde37 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1442,6 +1442,21 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev) return (1UL << __ffs(domain->pgsize_bitmap)) - 1; } + if (!domain) + return 0; + + cookie = domain->iova_cookie; + if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) + return 0; Can these conditions even be true here? I don't think so. Paranoia on my part. 
+static inline unsigned long iova_rcache_range(void) +{ + return 0; +} Given that IOMMU_DMA selects IOMMU_IOVA there is no need for this stub. hmmm.. ok. Policy was to stub everything but I think that it has changed. Otherwise this looks sensible to me. Great, thanks.
Re: [PATCH v8 2/8] hwtracing: hisi_ptt: Add trace function support for HiSilicon PCIe Tune and Trace device
On 17/05/2022 09:09, Yicong Yang wrote: + target = cpumask_any(cpumask_of_node(dev_to_node(&hisi_ptt->pdev->dev))); + if (target < nr_cpumask_bits) { the comment for cpumask_any() hints to check against nr_cpu_ids - any specific reason to check against nr_cpumask_bits? here should be: if (target >= nr_cpumask_bits) { will fix this up. I am still not sure that using nr_cpumask_bits is correct.
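The point under review is the failure convention: cpumask_any() returns a bit index when a CPU is found and a value >= the mask width when none is, which is why the caller should test for failure against the advertised CPU count rather than treating any in-range value as success. A toy userspace model of that convention (names here are illustrative, not the kernel API):

```c
#include <assert.h>

/*
 * Toy model of cpumask_any(): return the index of the first set bit,
 * or nbits when the mask is empty. Callers must treat any return
 * value >= nbits as "no CPU found" -- mirroring the kernel pattern of
 * checking against nr_cpu_ids rather than assuming success.
 */
static unsigned int mask_any(unsigned long mask, unsigned int nbits)
{
	unsigned int i;

	for (i = 0; i < nbits; i++) {
		if (mask & (1UL << i))
			return i;
	}

	return nbits;	/* empty mask: sentinel value, not a valid CPU */
}
```

So the corrected check in the patch under review tests the >= case and bails out, instead of silently using an out-of-range index.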
Re: [PATCH v8 4/8] perf arm: Refactor event list iteration in auxtrace_record__init()
On 16/05/2022 13:52, Yicong Yang wrote: As requested before, please mention "perf tool" in the commit subject From: Qi Liu Use find_pmu_for_event() to simplify logic in auxtrace_record__init(). Signed-off-by: Qi Liu Signed-off-by: Yicong Yang --- tools/perf/arch/arm/util/auxtrace.c | 53 ++--- 1 file changed, 34 insertions(+), 19 deletions(-) diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c index 5fc6a2a3dbc5..384c7cfda0fd 100644 --- a/tools/perf/arch/arm/util/auxtrace.c +++ b/tools/perf/arch/arm/util/auxtrace.c @@ -50,16 +50,32 @@ static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) return arm_spe_pmus; } +static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus, + int pmu_nr, struct evsel *evsel) +{ + int i; + + if (!pmus) + return NULL; + + for (i = 0; i < pmu_nr; i++) { + if (evsel->core.attr.type == pmus[i]->type) + return pmus[i]; + } + + return NULL; +} + struct auxtrace_record *auxtrace_record__init(struct evlist *evlist, int *err) { - struct perf_pmu *cs_etm_pmu; + struct perf_pmu *cs_etm_pmu = NULL; + struct perf_pmu **arm_spe_pmus = NULL; struct evsel *evsel; - bool found_etm = false; + struct perf_pmu *found_etm = NULL; struct perf_pmu *found_spe = NULL; - struct perf_pmu **arm_spe_pmus = NULL; + int auxtrace_event_cnt = 0; int nr_spes = 0; - int i = 0; if (!evlist) return NULL; @@ -68,24 +84,23 @@ struct auxtrace_record arm_spe_pmus = find_all_arm_spe_pmus(&nr_spes, err); evlist__for_each_entry(evlist, evsel) { - if (cs_etm_pmu && - evsel->core.attr.type == cs_etm_pmu->type) - found_etm = true; - - if (!nr_spes || found_spe) - continue; - - for (i = 0; i < nr_spes; i++) { - if (evsel->core.attr.type == arm_spe_pmus[i]->type) { - found_spe = arm_spe_pmus[i]; - break; - } - } + if (cs_etm_pmu && !found_etm) + found_etm = find_pmu_for_event(&cs_etm_pmu, 1, evsel); + + if (arm_spe_pmus && !found_spe) + found_spe = find_pmu_for_event(arm_spe_pmus, nr_spes, evsel); } + free(arm_spe_pmus); - if 
(found_etm && found_spe) { - pr_err("Concurrent ARM Coresight ETM and SPE operation not currently supported\n"); + if (found_etm) + auxtrace_event_cnt++; + + if (found_spe) + auxtrace_event_cnt++; + + if (auxtrace_event_cnt > 1) { + pr_err("Concurrent AUX trace operation not currently supported\n"); *err = -EOPNOTSUPP; return NULL; }
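The refactor above factors the per-event PMU lookup into one helper. A self-contained userspace sketch of that helper, with the perf structures reduced to hypothetical minimal stand-ins, shows the intended behavior:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for perf's structures (illustrative field subset) */
struct perf_pmu {
	unsigned int type;
};

struct evsel {
	unsigned int attr_type;	/* models evsel->core.attr.type */
};

/*
 * Mirrors the refactored find_pmu_for_event(): scan an array of PMUs
 * for one whose type matches the event's attribute type; NULL if the
 * array is absent or nothing matches.
 */
static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus,
					   int pmu_nr, struct evsel *evsel)
{
	int i;

	if (!pmus)
		return NULL;

	for (i = 0; i < pmu_nr; i++) {
		if (evsel->attr_type == pmus[i]->type)
			return pmus[i];
	}

	return NULL;
}
```

With this shape, the ETM case is just a one-element array (hence the `&cs_etm_pmu, 1` call in the patch), and the SPE case passes the discovered PMU list, removing the open-coded inner loop.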
Re: [PATCH v8 3/8] hwtracing: hisi_ptt: Add tune function support for HiSilicon PCIe Tune and Trace device
On 16/05/2022 13:52, Yicong Yang wrote: Add tune function for the HiSilicon Tune and Trace device. The interface of tune is exposed through sysfs attributes of PTT PMU device. Signed-off-by: Yicong Yang Reviewed-by: Jonathan Cameron Apart from a comment on preferential style: Reviewed-by: John Garry --- drivers/hwtracing/ptt/hisi_ptt.c | 157 +++ drivers/hwtracing/ptt/hisi_ptt.h | 23 + 2 files changed, 180 insertions(+) diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c index ef25ce98f664..c3fdb9bfb1b4 100644 --- a/drivers/hwtracing/ptt/hisi_ptt.c +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -25,6 +25,161 @@ /* Dynamic CPU hotplug state used by PTT */ static enum cpuhp_state hisi_ptt_pmu_online; +static bool hisi_ptt_wait_tuning_finish(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout(hisi_ptt->iobase + HISI_PTT_TUNING_INT_STAT, + val, !(val & HISI_PTT_TUNING_INT_STAT_MASK), + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TUNE_TIMEOUT_US); +} + +static int hisi_ptt_tune_data_get(struct hisi_ptt *hisi_ptt, + u32 event, u16 *data) this only has 1x caller so may inline it +{ + u32 reg; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + reg &= ~(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB); + reg |= FIELD_PREP(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB, + event); + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + + /* Write all 1 to indicates it's the read process */ + writel(~0U, hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + + if (!hisi_ptt_wait_tuning_finish(hisi_ptt)) + return -ETIMEDOUT; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + reg &= HISI_PTT_TUNING_DATA_VAL_MASK; + *data = FIELD_GET(HISI_PTT_TUNING_DATA_VAL_MASK, reg); + + return 0; +} + +static int hisi_ptt_tune_data_set(struct hisi_ptt *hisi_ptt, + u32 event, u16 data) again only 1x caller +{ + u32 reg; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + reg &= ~(HISI_PTT_TUNING_CTRL_CODE | 
HISI_PTT_TUNING_CTRL_SUB); + reg |= FIELD_PREP(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB, + event); + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + + writel(FIELD_PREP(HISI_PTT_TUNING_DATA_VAL_MASK, data), + hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + + if (!hisi_ptt_wait_tuning_finish(hisi_ptt)) + return -ETIMEDOUT; + + return 0; +} + +static ssize_t hisi_ptt_tune_attr_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct dev_ext_attribute *ext_attr; + struct hisi_ptt_tune_desc *desc; + int ret; + u16 val; + + ext_attr = container_of(attr, struct dev_ext_attribute, attr); + desc = ext_attr->var; + + mutex_lock(&hisi_ptt->tune_lock); + ret = hisi_ptt_tune_data_get(hisi_ptt, desc->event_code, &val); + mutex_unlock(&hisi_ptt->tune_lock); + + if (ret) + return ret; + + return sysfs_emit(buf, "%u\n", val); +} + +static ssize_t hisi_ptt_tune_attr_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct dev_ext_attribute *ext_attr; + struct hisi_ptt_tune_desc *desc; + int ret; + u16 val; + + ext_attr = container_of(attr, struct dev_ext_attribute, attr); + desc = ext_attr->var; + + if (kstrtou16(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&hisi_ptt->tune_lock); + ret = hisi_ptt_tune_data_set(hisi_ptt, desc->event_code, val); + mutex_unlock(&hisi_ptt->tune_lock); + + if (ret) + return ret; + + return count; +} + +#define HISI_PTT_TUNE_ATTR(_name, _val, _show, _store) \ + static struct hisi_ptt_tune_desc _name##_desc = { \ + .name = #_name, \ + .event_code = _val, \ + }; \ + static struct dev_ext_attribute hisi_ptt_##_name##_attr = { \ + .attr = __ATTR(_name, 0600, _show, _store), \ + .var=
Re: [PATCH v8 2/8] hwtracing: hisi_ptt: Add trace function support for HiSilicon PCIe Tune and Trace device
On 16/05/2022 13:52, Yicong Yang wrote: HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic and trace the TLP headers. Add the driver for the device to enable the trace function. Register PMU device of PTT trace, then users can use trace through perf command. The driver makes use of perf AUX trace function and support the following events to configure the trace: - filter: select Root port or Endpoint to trace - type: select the type of traced TLP headers - direction: select the direction of traced TLP headers - format: select the data format of the traced TLP headers This patch initially add a basic driver of PTT trace. Initially add basic trace support. Signed-off-by: Yicong Yang Generally this looks ok, apart from nitpicking below, so, FWIW: Reviewed-by: John Garry --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 12 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 964 +++ drivers/hwtracing/ptt/hisi_ptt.h | 178 ++ 6 files changed, 1159 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index 020780b6b4d2..662d50599467 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source 
"drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..6d46a09ffeb9 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS + help + HiSilicon PCIe Tune and Trace device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..ef25ce98f664 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,964 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +/* Dynamic CPU hotplug state used by PTT */ +static enum cpuhp_state hisi_ptt_pmu_online; + +static u16 hisi_ptt_get_filter_val(u16 devid, bool is_port) +{ + if (is_port) + return BIT(HISI_PCIE_CORE_PORT_ID(devid & 0xff)); + + return devid; +} + +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, + val, val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TRACE_TIMEOUT_US); +} + +static void hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, + val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, + HISI_PTT_RESET_TIMEOUT_US); +} +
[RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()
For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. Add the IOMMU callback for DMA mapping API dma_max_mapping_size(), which allows the drivers to know the mapping limit and thus limit the requested IOVA lengths. This resolves the performance issue originally reported in [0] for a SCSI HBA driver which was regularly mapping SGLs which required IOVAs in excess of the IOVA caching limit. In this case the block layer limits the max sectors per request - as configured in __scsi_init_queue() - which will limit the total SGL length the driver tries to map and in turn limits IOVA lengths requested. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- Sending as an RFC as iommu_dma_max_mapping_size() is a soft limit, and not a hard limit which I expect is the semantics of dma_map_ops.max_mapping_size diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 09f6e1c0f9c0..e2d5205cde37 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1442,6 +1442,21 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev) return (1UL << __ffs(domain->pgsize_bitmap)) - 1; } +static size_t iommu_dma_max_mapping_size(struct device *dev) +{ + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); + struct iommu_dma_cookie *cookie; + + if (!domain) + return 0; + + cookie = domain->iova_cookie; + if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) + return 0; + + return iova_rcache_range(); +} + static const struct dma_map_ops iommu_dma_ops = { .alloc = iommu_dma_alloc, .free = iommu_dma_free, @@ -1462,6 +1477,7 @@ static const struct dma_map_ops iommu_dma_ops = { .map_resource = iommu_dma_map_resource, .unmap_resource = iommu_dma_unmap_resource, .get_merge_boundary = iommu_dma_get_merge_boundary, + .max_mapping_size = iommu_dma_max_mapping_size, 
}; /* diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index db77aa675145..9f00b58d546e 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad, static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); static void free_iova_rcaches(struct iova_domain *iovad); +unsigned long iova_rcache_range(void) +{ + return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1); +} + static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node) { struct iova_domain *iovad; diff --git a/include/linux/iova.h b/include/linux/iova.h index 320a70e40233..ae3e18d77e6c 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova) int iova_cache_get(void); void iova_cache_put(void); +unsigned long iova_rcache_range(void); + void free_iova(struct iova_domain *iovad, unsigned long pfn); void __free_iova(struct iova_domain *iovad, struct iova *iova); struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size, @@ -105,6 +107,11 @@ static inline void iova_cache_put(void) { } +static inline unsigned long iova_rcache_range(void) +{ + return 0; +} + static inline void free_iova(struct iova_domain *iovad, unsigned long pfn) { } -- 2.26.2
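The limit this RFC exposes is easy to compute by hand. Assuming 4 KiB pages and an IOVA_RANGE_CACHE_MAX_SIZE of 6 (the in-tree value at the time, but worth double-checking against the kernel you are reading), the rcache ceiling works out to 128 KiB, the same figure quoted elsewhere in the thread. A small userspace model:

```c
#include <assert.h>

/* Assumed constants -- verify against drivers/iommu/iova.c for your tree */
#define PAGE_SIZE_MODEL			4096UL
#define IOVA_RANGE_CACHE_MAX_SIZE	6

/*
 * Model of iova_rcache_range() from the patch: the largest IOVA
 * allocation still served from the rcaches. Orders 0..5 are cached,
 * so the ceiling is PAGE_SIZE << 5 = 128 KiB with 4 KiB pages.
 */
static unsigned long iova_rcache_range_model(void)
{
	return PAGE_SIZE_MODEL << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
}
```

Any streaming mapping larger than this falls back to the slow-path IOVA allocator, which is the performance cliff the series is working around.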
Re: [PATCH v7 1/7] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 07/04/2022 13:58, Yicong Yang wrote: The DMA operations of HiSilicon PTT device can only work properly with identical mappings. So add a quirk for the device to force the domain I'm not sure if you meant to write "identity mappings". as passthrough. Signed-off-by: Yicong Yang FWIW, Reviewed-by: John Garry --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 627a3ed5ee8f..5ec15ae2a9b1 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2839,6 +2839,21 @@ static int arm_smmu_dev_disable_feature(struct device *dev, } } +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ +(pdev)->device == 0xa12e) + +static int arm_smmu_def_domain_type(struct device *dev) +{ + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + + if (IS_HISI_PTT_DEVICE(pdev)) + return IOMMU_DOMAIN_IDENTITY; + } + + return 0; +} + static struct iommu_ops arm_smmu_ops = { .capable= arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -2856,6 +2871,7 @@ static struct iommu_ops arm_smmu_ops = { .sva_unbind = arm_smmu_sva_unbind, .sva_get_pasid = arm_smmu_sva_get_pasid, .page_response = arm_smmu_page_response, + .def_domain_type= arm_smmu_def_domain_type, .pgsize_bitmap = -1UL, /* Restricted during device attach */ .owner = THIS_MODULE, .default_domain_ops = &(const struct iommu_domain_ops) {
Re: [PATCH v7 5/7] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver
On 12/04/2022 08:41, Yicong Yang wrote: + hisi_ptt_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_ptts)); + if (!hisi_ptt_pmus) { + pr_err("hisi_ptt alloc failed\n"); + *err = -ENOMEM; using PTR_ERR seems better, if possible ok will change to that. *err = -ENOMEM is used here to keep consistency with what spe does. Ah, I see that we are constrained by the interface of auxtrace_record__init() to pass err as a pointer, so I suppose the code in this patch is ok to fit into that.
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
+static int hisi_ptt_alloc_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + hisi_ptt->trace_ctrl.buf_index = 0; + + /* If the trace buffer has already been allocated, zero it. */ I am not sure why this is not called from the probe The buffer allocation is done when necessary, as the driver probes the device at boot but the user may never use it. In that case it's a waste of memory if we allocate the buffers in probe. Currently we'll allocate 16M of memory for 4 buffers. But that's just not how we do things. We set up the driver fully in the probe. If the user cannot really afford the memory then he/she should not load the driver. In addition, this driver would be used in a machine which will have gigabytes of memory, so I think that the memory mentioned here is relatively insignificant. So this function is called every time before we start tracing. The first time it will allocate the DMA buffers, and on subsequent calls it just zeroes the buffers to clear the data from the last run. + if (ctrl->trace_buf) { + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) + memset(ctrl->trace_buf[i].addr, 0, HISI_PTT_TRACE_BUF_SIZE); + return 0; + } + + ctrl->trace_buf = devm_kcalloc(dev, HISI_PTT_TRACE_BUF_CNT, + sizeof(struct hisi_ptt_dma_buffer), GFP_KERNEL); sizeof(*ctrl->trace_buf) may be better ok. 
+ if (!ctrl->trace_buf) + return -ENOMEM; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; ++i) { + ctrl->trace_buf[i].addr = dmam_alloc_coherent(dev, HISI_PTT_TRACE_BUF_SIZE, + &ctrl->trace_buf[i].dma, + GFP_KERNEL); + if (!ctrl->trace_buf[i].addr) { + hisi_ptt_free_trace_buf(hisi_ptt); + return -ENOMEM; + } + } + + return 0; +} + +static void hisi_ptt_trace_end(struct hisi_ptt *hisi_ptt) +{ + writel(0, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + hisi_ptt->trace_ctrl.started = false; +} + +static int hisi_ptt_trace_start(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + u32 val; + int i; + + /* Check device idle before start trace */ + if (!hisi_ptt_wait_trace_hw_idle(hisi_ptt)) { + pci_err(hisi_ptt->pdev, "Failed to start trace, the device is still busy\n"); + return -EBUSY; + } + + ctrl->started = true; + + /* Reset the DMA before start tracing */ + val = readl(hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + val |= HISI_PTT_TRACE_CTRL_RST; + writel(val, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + + hisi_ptt_wait_dma_reset_done(hisi_ptt); + + val = readl(hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + val &= ~HISI_PTT_TRACE_CTRL_RST; + writel(val, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + + /* Clear the interrupt status */ + writel(HISI_PTT_TRACE_INT_STAT_MASK, hisi_ptt->iobase + HISI_PTT_TRACE_INT_STAT); + writel(0, hisi_ptt->iobase + HISI_PTT_TRACE_INT_MASK); + + /* Configure the trace DMA buffer */ I am not sure why this sort of thing is done outside probing ... + + val = FIELD_GET(HISI_PTT_PMU_DIRECTION_MASK, event->attr.config); + ret = hisi_ptt_trace_valid_config_onehot(val, hisi_ptt_trace_available_direction, how about put all those arrays in hisi_ptt_trace_valid_config_onehot() and pass some flag to say which array you want to use? Or something like that. 
Passing the arrays in this fashion is messy. Since there are 3 configs (type, direction, format) with different available ranges and setting methods (onehot, non-onehot), moving the arrays into the validity-checking function means we need to recognize the config types (passed by the caller, but the function would also need to know the available value array) and the checking method together. That may make the code more complex than now: first picking the right array and judging which checking method applies to it, and second doing the check. Currently it's designed to decouple the checking method from the available value array. The hisi_ptt_trace_valid_config{_onehot}() won't care about which array to use since the caller takes responsibility for this. So perhaps the current approach is simple and clear enough. A couple of points: - hisi_ptt_trace_valid_config_type() only has one caller, so we can make it dedicated to that caller - there is not much code in hisi_ptt_trace_valid_config_onehot(), so it's ok to duplicate it if that makes the overall code look better So I think dedicated functions make the code simpler and easier to follow and maintain: static int hisi_ptt_trace_valid_config_dir(u32 val) { int i; /* * The supported value of the direction parameter. See hisi_ptt.rst * documentation for more details. */ static const u32 hisi_ptt_trace_available_direction[] = { 0,
Re: [PATCH v7 5/7] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver
On 07/04/2022 13:58, Yicong Yang wrote: From: Qi Liu 'perf record' and 'perf report --dump-raw-trace' supported in this patch. Example usage: Output will contain raw PTT data and its textual representation, such as: 0 0 0x5810 [0x30]: PERF_RECORD_AUXTRACE size: 0x40 offset: 0 ref: 0xa5d50c725 idx: 0 tid: -1 cpu: 0 . . ... HISI PTT data: size 4194304 bytes . : 00 00 00 00 Prefix . 0004: 08 20 00 60 Header DW0 . 0008: ff 02 00 01 Header DW1 . 000c: 20 08 00 00 Header DW2 . 0010: 10 e7 44 ab Header DW3 . 0014: 2a a8 1e 01 Time . 0020: 00 00 00 00 Prefix . 0024: 01 00 00 60 Header DW0 . 0028: 0f 1e 00 01 Header DW1 . 002c: 04 00 00 00 Header DW2 . 0030: 40 00 81 02 Header DW3 . 0034: ee 02 00 00 Time Signed-off-by: Qi Liu Signed-off-by: Yicong Yang --- tools/perf/arch/arm/util/auxtrace.c | 76 +- tools/perf/arch/arm/util/pmu.c| 3 + tools/perf/arch/arm64/util/Build | 2 +- tools/perf/arch/arm64/util/hisi_ptt.c | 195 tools/perf/util/Build | 2 + tools/perf/util/auxtrace.c| 4 + tools/perf/util/auxtrace.h| 1 + tools/perf/util/hisi-ptt-decoder/Build| 1 + .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.c | 170 ++ .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.h | 28 +++ tools/perf/util/hisi_ptt.c| 218 ++ tools/perf/util/hisi_ptt.h| 28 +++ 12 files changed, 724 insertions(+), 4 deletions(-) create mode 100644 tools/perf/arch/arm64/util/hisi_ptt.c create mode 100644 tools/perf/util/hisi-ptt-decoder/Build create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.c create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.h create mode 100644 tools/perf/util/hisi_ptt.c create mode 100644 tools/perf/util/hisi_ptt.h diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c index 5fc6a2a3dbc5..393f5757c039 100644 --- a/tools/perf/arch/arm/util/auxtrace.c +++ b/tools/perf/arch/arm/util/auxtrace.c @@ -4,9 +4,11 @@ * Author: Mathieu Poirier */ +#include #include #include #include +#include #include "../../../util/auxtrace.h" #include 
"../../../util/debug.h" @@ -14,6 +16,7 @@ #include "../../../util/pmu.h" #include "cs-etm.h" #include "arm-spe.h" +#include "hisi_ptt.h" static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) { @@ -50,6 +53,58 @@ static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) return arm_spe_pmus; } +static struct perf_pmu **find_all_hisi_ptt_pmus(int *nr_ptts, int *err) +{ + const char *sysfs = sysfs__mountpoint(); + struct perf_pmu **hisi_ptt_pmus = NULL; + struct dirent *dent; + char path[PATH_MAX]; + DIR *dir = NULL; + int idx = 0; + + snprintf(path, PATH_MAX, "%s" EVENT_SOURCE_DEVICE_PATH, sysfs); + dir = opendir(path); + if (!dir) { + pr_err("can't read directory '%s'\n", EVENT_SOURCE_DEVICE_PATH); + *err = -EINVAL; + goto out; + } + + while ((dent = readdir(dir))) { + if (strstr(dent->d_name, HISI_PTT_PMU_NAME)) + (*nr_ptts)++; + } + + if (!(*nr_ptts)) + goto out; + + hisi_ptt_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_ptts)); + if (!hisi_ptt_pmus) { + pr_err("hisi_ptt alloc failed\n"); + *err = -ENOMEM; using PTR_ERR seems better, if possible + goto out; + } + + rewinddir(dir); + while ((dent = readdir(dir))) { + if (strstr(dent->d_name, HISI_PTT_PMU_NAME) && idx < (*nr_ptts)) { + hisi_ptt_pmus[idx] = perf_pmu__find(dent->d_name); + if (hisi_ptt_pmus[idx]) { + pr_debug2("%s %d: hisi_ptt_pmu %d type %d name %s\n", do you really need this? + __func__, __LINE__, idx, + hisi_ptt_pmus[idx]->type, + hisi_ptt_pmus[idx]->name); + idx++; + } + + } + } + +out: + closedir(dir); + return hisi_ptt_pmus; +} + struct auxtrace_record *auxtrace_record__init(struct evlist *evlist, int *err) { @@ -57,8 +112,12 @@ s
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 07/04/2022 13:58, Yicong Yang wrote: HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. Register PMU device of PTT trace, then users can use trace through perf command. The driver makes use of perf AUX trace and support following events to "The driver makes use of perf AUX trace function and supports the following events to .." configure the trace: - filter: select Root port or Endpoint to trace - type: select the type of traced TLP headers - direction: select the direction of traced TLP headers - format: select the data format of the traced TLP headers This patch adds the driver part of PTT trace. The perf command support of PTT trace is added in the following patch. I would not mention "following patch" and the like. Just have "initially add a basic driver" Signed-off-by: Yicong Yang Reviewed-by: Jonathan Cameron --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 12 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 874 +++ drivers/hwtracing/ptt/hisi_ptt.h | 166 ++ 6 files changed, 1057 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index 020780b6b4d2..662d50599467 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig 
index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..8902a6f27563 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS + help + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..242b41870380 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,874 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, + val, val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TRACE_TIMEOUT_US); +} + +static bool hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, + val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, +
Re: [PATCH RESEND v5 4/5] iommu: Allow max opt DMA len be set for a group via sysfs
On 07/04/2022 09:21, Leizhen (ThunderTown) wrote: On 2022/4/4 19:27, John Garry wrote: Add support to allow the maximum optimised DMA len to be set for an IOMMU group via sysfs. This is much the same as the method to change the default domain type for a group. Signed-off-by: John Garry --- .../ABI/testing/sysfs-kernel-iommu_groups | 16 + drivers/iommu/iommu.c | 59 ++- include/linux/iommu.h | 6 ++ 3 files changed, 79 insertions(+), 2 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups index b15af6a5bc08..ed6f72794f6c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups +++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups @@ -63,3 +63,19 @@ Description: /sys/kernel/iommu_groups//type shows the type of default system could lead to catastrophic effects (the users might need to reboot the machine to get it to normal state). So, it's expected that the users understand what they're doing. + +What: /sys/kernel/iommu_groups//max_opt_dma_size +Date: Feb 2022 +KernelVersion: v5.18 +Contact: iommu@lists.linux-foundation.org +Description: /sys/kernel/iommu_groups//max_opt_dma_size shows the + max optimised DMA size for the default IOMMU domain associated + with the group. + Each IOMMU domain has an IOVA domain. The IOVA domain caches + IOVAs up to a certain size as a performance optimisation. + This sysfs file allows the range of the IOVA domain caching to be + set, such that larger than default IOVAs may be cached. + A value of 0 means that the default caching range is chosen. + A privileged user could request the kernel to change the range + by writing to this file. For this to happen, the same rules + and procedure apply as in changing the default domain type. 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 10bb10c2a210..7c7258f19bed 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -48,6 +48,7 @@ struct iommu_group { struct iommu_domain *default_domain; struct iommu_domain *domain; struct list_head entry; + size_t max_opt_dma_size; }; struct group_device { @@ -89,6 +90,9 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static ssize_t iommu_group_store_max_opt_dma_size(struct iommu_group *group, + const char *buf, + size_t count); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name = \ @@ -571,6 +575,12 @@ static ssize_t iommu_group_show_type(struct iommu_group *group, return strlen(type); } +static ssize_t iommu_group_show_max_opt_dma_size(struct iommu_group *group, +char *buf) +{ + return sprintf(buf, "%zu\n", group->max_opt_dma_size); +} + static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL); static IOMMU_GROUP_ATTR(reserved_regions, 0444, @@ -579,6 +589,9 @@ static IOMMU_GROUP_ATTR(reserved_regions, 0444, static IOMMU_GROUP_ATTR(type, 0644, iommu_group_show_type, iommu_group_store_type); +static IOMMU_GROUP_ATTR(max_opt_dma_size, 0644, iommu_group_show_max_opt_dma_size, + iommu_group_store_max_opt_dma_size); + static void iommu_group_release(struct kobject *kobj) { struct iommu_group *group = to_iommu_group(kobj); @@ -665,6 +678,10 @@ struct iommu_group *iommu_group_alloc(void) if (ret) return ERR_PTR(ret); + ret = iommu_group_create_file(group, &iommu_group_attr_max_opt_dma_size); + if (ret) + return ERR_PTR(ret); + pr_debug("Allocated group %d\n", group->id); return group; @@ -2087,6 +2104,11 @@ struct iommu_domain *iommu_get_dma_domain(struct device *dev) return dev->iommu_group->default_domain; } +size_t 
iommu_group_get_max_opt_dma_size(struct iommu_group *group) +{ + return group->max_opt_dma_size; +} + /* * IOMMU groups are really the natural working unit of the IOMMU, but * the IOMMU API works on domains and devices. Bridge that gap by @@ -2871,12 +2893,14 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * @prev_dev: The device in the group (this is used to make sure that the device * hasn't changed after the caller has called this functio
Re: [PATCH RESEND v5 5/5] iova: Add iova_len argument to iova_domain_init_rcaches()
On 07/04/2022 09:27, Leizhen (ThunderTown) wrote: Thanks for having a look On 2022/4/4 19:27, John Garry wrote: Add max opt argument to iova_domain_init_rcaches(), and use it to set the rcaches range. Also fix up all users to set this value (at 0, meaning use default), including a wrapper for that, iova_domain_init_rcaches_default(). For dma-iommu.c we derive the iova_len argument from the IOMMU group max opt DMA size. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c| 15 ++- drivers/iommu/iova.c | 19 --- drivers/vdpa/vdpa_user/iova_domain.c | 4 ++-- include/linux/iova.h | 3 ++- 4 files changed, 34 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 42ca42ff1b5d..19f35624611c 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + size_t max_opt_dma_size; + unsigned long iova_len = 0; int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) @@ -560,7 +562,18 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); - ret = iova_domain_init_rcaches(iovad); + + max_opt_dma_size = iommu_group_get_max_opt_dma_size(dev->iommu_group); + if (max_opt_dma_size) { + unsigned long shift = __ffs(1UL << order); + + iova_len = roundup_pow_of_two(max_opt_dma_size); + iova_len >>= shift; + if (!iova_len) + iova_len = 1; How about move "iovad->rcache_max_size = iova_len_to_rcache_max(iova_len);" here? So that, iova_domain_init_rcaches() can remain the same. And iova_domain_init_rcaches_default() does not need to be added. I see your idea. I will say that I would rather not add iova_domain_init_rcaches_default(). 
But personally I think it's better to setup all rcache stuff only once and inside iova_domain_init_rcaches(), as it is today. In addition, it doesn't look reasonable to expose iova_len_to_rcache_max(). But maybe it's ok. Other opinion would be welcome... Thanks, John + } + + ret = iova_domain_init_rcaches(iovad, iova_len); if (ret) return ret; diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 5c22b9187b79..d65e79e132ee 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -706,12 +706,20 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -int iova_domain_init_rcaches(struct iova_domain *iovad) +static unsigned long iova_len_to_rcache_max(unsigned long iova_len) +{ + return order_base_2(iova_len) + 1; +} + +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len) { unsigned int cpu; int i, ret; - iovad->rcache_max_size = 6; /* Arbitrarily high default */ + if (iova_len) + iovad->rcache_max_size = iova_len_to_rcache_max(iova_len); + else + iovad->rcache_max_size = 6; /* Arbitrarily high default */ iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), @@ -755,7 +763,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) free_iova_rcaches(iovad); return ret; } -EXPORT_SYMBOL_GPL(iova_domain_init_rcaches); + +int iova_domain_init_rcaches_default(struct iova_domain *iovad) +{ + return iova_domain_init_rcaches(iovad, 0); +} +EXPORT_SYMBOL_GPL(iova_domain_init_rcaches_default); /* * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 6daa3978d290..3a2acef98a4a 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -514,12 +514,12 @@ vduse_domain_create(unsigned long iova_limit, size_t bounce_size) spin_lock_init(&domain->iotlb_lock); init_iova_domain(&domain->stream_iovad, PAGE_SIZE, 
IOVA_START_PFN); - ret = iova_domain_init_rcaches(&domain->stream_iovad); + ret = iova_domain_init_rcaches_default(&domain->stream_iovad); if (ret) goto err_iovad_stream; init_iova_domain(&domain->consistent_iovad, PAGE_SIZE, bounce_pfns); - ret = iova_domain_init_rcaches(&domain->consistent_iovad); + ret = iova_domain_init_rcaches_default(&domain->consistent_io
[PATCH RESEND v5 5/5] iova: Add iova_len argument to iova_domain_init_rcaches()
Add max opt argument to iova_domain_init_rcaches(), and use it to set the rcaches range. Also fix up all users to set this value (at 0, meaning use default), including a wrapper for that, iova_domain_init_rcaches_default(). For dma-iommu.c we derive the iova_len argument from the IOMMU group max opt DMA size. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c| 15 ++- drivers/iommu/iova.c | 19 --- drivers/vdpa/vdpa_user/iova_domain.c | 4 ++-- include/linux/iova.h | 3 ++- 4 files changed, 34 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 42ca42ff1b5d..19f35624611c 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + size_t max_opt_dma_size; + unsigned long iova_len = 0; int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) @@ -560,7 +562,18 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); - ret = iova_domain_init_rcaches(iovad); + + max_opt_dma_size = iommu_group_get_max_opt_dma_size(dev->iommu_group); + if (max_opt_dma_size) { + unsigned long shift = __ffs(1UL << order); + + iova_len = roundup_pow_of_two(max_opt_dma_size); + iova_len >>= shift; + if (!iova_len) + iova_len = 1; + } + + ret = iova_domain_init_rcaches(iovad, iova_len); if (ret) return ret; diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 5c22b9187b79..d65e79e132ee 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -706,12 +706,20 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -int iova_domain_init_rcaches(struct iova_domain *iovad) +static unsigned long iova_len_to_rcache_max(unsigned long iova_len) +{ + return order_base_2(iova_len) 
+ 1; +} + +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len) { unsigned int cpu; int i, ret; - iovad->rcache_max_size = 6; /* Arbitrarily high default */ + if (iova_len) + iovad->rcache_max_size = iova_len_to_rcache_max(iova_len); + else + iovad->rcache_max_size = 6; /* Arbitrarily high default */ iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), @@ -755,7 +763,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) free_iova_rcaches(iovad); return ret; } -EXPORT_SYMBOL_GPL(iova_domain_init_rcaches); + +int iova_domain_init_rcaches_default(struct iova_domain *iovad) +{ + return iova_domain_init_rcaches(iovad, 0); +} +EXPORT_SYMBOL_GPL(iova_domain_init_rcaches_default); /* * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 6daa3978d290..3a2acef98a4a 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -514,12 +514,12 @@ vduse_domain_create(unsigned long iova_limit, size_t bounce_size) spin_lock_init(&domain->iotlb_lock); init_iova_domain(&domain->stream_iovad, PAGE_SIZE, IOVA_START_PFN); - ret = iova_domain_init_rcaches(&domain->stream_iovad); + ret = iova_domain_init_rcaches_default(&domain->stream_iovad); if (ret) goto err_iovad_stream; init_iova_domain(&domain->consistent_iovad, PAGE_SIZE, bounce_pfns); - ret = iova_domain_init_rcaches(&domain->consistent_iovad); + ret = iova_domain_init_rcaches_default(&domain->consistent_iovad); if (ret) goto err_iovad_consistent; diff --git a/include/linux/iova.h b/include/linux/iova.h index 02f7222fa85a..56281434ce0c 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -95,7 +95,8 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo, unsigned long pfn_hi); void init_iova_domain(struct iova_domain *iovad, unsigned long granule, unsigned long start_pfn); -int 
iova_domain_init_rcaches(struct iova_domain *iovad); +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len); +int iova_domain_init_rcaches_default(struct iova_domain *iovad); struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn); void put_iov
[PATCH RESEND v5 4/5] iommu: Allow max opt DMA len be set for a group via sysfs
Add support to allow the maximum optimised DMA len to be set for an IOMMU group via sysfs. This is much the same as the method to change the default domain type for a group. Signed-off-by: John Garry --- .../ABI/testing/sysfs-kernel-iommu_groups | 16 + drivers/iommu/iommu.c | 59 ++- include/linux/iommu.h | 6 ++ 3 files changed, 79 insertions(+), 2 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups index b15af6a5bc08..ed6f72794f6c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups +++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups @@ -63,3 +63,19 @@ Description: /sys/kernel/iommu_groups//type shows the type of default system could lead to catastrophic effects (the users might need to reboot the machine to get it to normal state). So, it's expected that the users understand what they're doing. + +What: /sys/kernel/iommu_groups//max_opt_dma_size +Date: Feb 2022 +KernelVersion: v5.18 +Contact: iommu@lists.linux-foundation.org +Description: /sys/kernel/iommu_groups//max_opt_dma_size shows the + max optimised DMA size for the default IOMMU domain associated + with the group. + Each IOMMU domain has an IOVA domain. The IOVA domain caches + IOVAs up to a certain size as a performance optimisation. + This sysfs file allows the range of the IOVA domain caching to be + set, such that larger than default IOVAs may be cached. + A value of 0 means that the default caching range is chosen. + A privileged user could request the kernel to change the range + by writing to this file. For this to happen, the same rules + and procedure apply as in changing the default domain type. 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 10bb10c2a210..7c7258f19bed 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -48,6 +48,7 @@ struct iommu_group { struct iommu_domain *default_domain; struct iommu_domain *domain; struct list_head entry; + size_t max_opt_dma_size; }; struct group_device { @@ -89,6 +90,9 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static ssize_t iommu_group_store_max_opt_dma_size(struct iommu_group *group, + const char *buf, + size_t count); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name =\ @@ -571,6 +575,12 @@ static ssize_t iommu_group_show_type(struct iommu_group *group, return strlen(type); } +static ssize_t iommu_group_show_max_opt_dma_size(struct iommu_group *group, +char *buf) +{ + return sprintf(buf, "%zu\n", group->max_opt_dma_size); +} + static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL); static IOMMU_GROUP_ATTR(reserved_regions, 0444, @@ -579,6 +589,9 @@ static IOMMU_GROUP_ATTR(reserved_regions, 0444, static IOMMU_GROUP_ATTR(type, 0644, iommu_group_show_type, iommu_group_store_type); +static IOMMU_GROUP_ATTR(max_opt_dma_size, 0644, iommu_group_show_max_opt_dma_size, + iommu_group_store_max_opt_dma_size); + static void iommu_group_release(struct kobject *kobj) { struct iommu_group *group = to_iommu_group(kobj); @@ -665,6 +678,10 @@ struct iommu_group *iommu_group_alloc(void) if (ret) return ERR_PTR(ret); + ret = iommu_group_create_file(group, &iommu_group_attr_max_opt_dma_size); + if (ret) + return ERR_PTR(ret); + pr_debug("Allocated group %d\n", group->id); return group; @@ -2087,6 +2104,11 @@ struct iommu_domain *iommu_get_dma_domain(struct device *dev) return dev->iommu_group->default_domain; } +size_t 
iommu_group_get_max_opt_dma_size(struct iommu_group *group) +{ + return group->max_opt_dma_size; +} + /* * IOMMU groups are really the natural working unit of the IOMMU, but * the IOMMU API works on domains and devices. Bridge that gap by @@ -2871,12 +2893,14 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * @prev_dev: The device in the group (this is used to make sure that the device * hasn't changed after the caller has called this function) * @type: The type of the new default domain that gets associated with the group + * @max_opt_dma_
[PATCH RESEND v5 3/5] iommu: Allow iommu_change_dev_def_domain() realloc same default domain type
Allow iommu_change_dev_def_domain() to create a new default domain, keeping the same as current. Also remove comment about the function purpose, which will become stale. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 49 ++- include/linux/iommu.h | 1 + 2 files changed, 26 insertions(+), 24 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 0dd766030baf..10bb10c2a210 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2863,6 +2863,7 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle) } EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); + /* * Changes the default domain of an iommu group that has *only* one device * @@ -2873,10 +2874,6 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * * Returns 0 on success and error code on failure * - * Note: - * 1. Presently, this function is called only when user requests to change the - *group's default domain type through /sys/kernel/iommu_groups//type - *Please take a closer look if intended to use for other purposes. */ static int iommu_change_dev_def_domain(struct iommu_group *group, struct device *prev_dev, int type) @@ -2929,28 +2926,32 @@ static int iommu_change_dev_def_domain(struct iommu_group *group, goto out; } - dev_def_dom = iommu_get_def_domain_type(dev); - if (!type) { + if (type == __IOMMU_DOMAIN_SAME) { + type = prev_dom->type; + } else { + dev_def_dom = iommu_get_def_domain_type(dev); + if (!type) { + /* +* If the user hasn't requested any specific type of domain and +* if the device supports both the domains, then default to the +* domain the device was booted with +*/ + type = dev_def_dom ? 
: iommu_def_domain_type; + } else if (dev_def_dom && type != dev_def_dom) { + dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n", + iommu_domain_type_str(type)); + ret = -EINVAL; + goto out; + } + /* -* If the user hasn't requested any specific type of domain and -* if the device supports both the domains, then default to the -* domain the device was booted with +* Switch to a new domain only if the requested domain type is different +* from the existing default domain type */ - type = dev_def_dom ? : iommu_def_domain_type; - } else if (dev_def_dom && type != dev_def_dom) { - dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n", - iommu_domain_type_str(type)); - ret = -EINVAL; - goto out; - } - - /* -* Switch to a new domain only if the requested domain type is different -* from the existing default domain type -*/ - if (prev_dom->type == type) { - ret = 0; - goto out; + if (prev_dom->type == type) { + ret = 0; + goto out; + } } /* We can bring up a flush queue without tearing down the domain */ diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 9208eca4b0d1..b141cf71c7af 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -63,6 +63,7 @@ struct iommu_domain_geometry { implementation */ #define __IOMMU_DOMAIN_PT (1U << 2) /* Domain is identity mapped */ #define __IOMMU_DOMAIN_DMA_FQ (1U << 3) /* DMA-API uses flush queue*/ +#define __IOMMU_DOMAIN_SAME(1U << 4) /* Keep same type (internal) */ /* * This are the possible domain-types -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH RESEND v5 2/5] iova: Allow rcache range upper limit to be flexible
Some low-level drivers may request DMA mappings whose IOVA length exceeds that of the current rcache upper limit. This means that allocations for those IOVAs will never be cached, and always must be allocated and freed from the RB tree per DMA mapping cycle. This has a significant effect on performance, more so since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. As a first step towards allowing the rcache range upper limit be configured, hold this value in the IOVA rcache structure, and allocate the rcaches separately. Delete macro IOVA_RANGE_CACHE_MAX_SIZE in case it's reused by mistake. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- drivers/iommu/iova.c | 20 ++-- include/linux/iova.h | 3 +++ 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index db77aa675145..5c22b9187b79 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,8 +15,6 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR~0UL -#define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size (in pages) */ - static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); @@ -443,7 +441,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size, * rounding up anything cacheable to make sure that can't happen. The * order of the unadjusted size will still match upon freeing. 
*/ - if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) + if (size < (1 << (iovad->rcache_max_size - 1))) size = roundup_pow_of_two(size); iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1); @@ -713,13 +711,15 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, ret; - iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE, + iovad->rcache_max_size = 6; /* Arbitrarily high default */ + + iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), GFP_KERNEL); if (!iovad->rcaches) return -ENOMEM; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { struct iova_cpu_rcache *cpu_rcache; struct iova_rcache *rcache; @@ -816,7 +816,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE) + if (log_size >= iovad->rcache_max_size) return false; return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn); @@ -872,7 +872,7 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE || !iovad->rcaches) + if (log_size >= iovad->rcache_max_size || !iovad->rcaches) return 0; return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size); @@ -888,7 +888,7 @@ static void free_iova_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; if (!rcache->cpu_rcaches) break; @@ -916,7 +916,7 @@ static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad) unsigned long flags; int i; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu); spin_lock_irqsave(&cpu_rcache->lock, flags); @@ -935,7 
+935,7 @@ static void free_global_cached_iovas(struct iova_domain *iovad) unsigned long flags; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; spin_lock_irqsave(&rcache->lock, flags); for (j = 0; j < rcache->depot_size; ++j) { diff --git a/include/linux/iova.h b/include/linux/iova.h index 320a70e40233..02f7222fa85a 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -38,6 +38,9 @@ struct iova_domain { struct iova_rcache *rcaches; struct hlist_node
[PATCH RESEND v5 0/5] iommu: Allow IOVA rcache range be configured
For streaming DMA mappings involving an IOMMU and whose IOVA length regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This may be much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1].

This series allows the IOVA rcache range to be configured, so that we may cache all IOVAs per domain, thus improving performance.

A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range:

/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated the same way as the IOMMU group default domain type, i.e. the only device in the group must be unbound first.

The inspiration here comes from the block layer request queue sysfs "optimal_io_size" file, in /sys/block/sdX/queue/optimal_io_size

Some old figures* for the storage scenario (when increasing the IOVA rcache range to cover all DMA mapping sizes from the LLD):

v5.13-rc1 baseline: 1200K IOPS
With series:        1800K IOPS

All of the above are for IOMMU strict mode. Non-strict mode gives ~1800K IOPS in all scenarios.
Based on v5.18-rc1

* I lost my high data throughput test setup

Differences to v4: https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/
- Major rebase
- Change the "Refactor iommu_group_store_type()" patch to not use a callback, and use an op type enum instead
- I didn't pick up Will's Ack as it has changed so much
- Use a domain feature flag to keep the same default group type
- Add a wrapper for the default IOVA rcache range
- Combine the last 2x patches

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/

John Garry (5):
  iommu: Refactor iommu_group_store_type()
  iova: Allow rcache range upper limit to be flexible
  iommu: Allow iommu_change_dev_def_domain() realloc same default domain type
  iommu: Allow max opt DMA len be set for a group via sysfs
  iova: Add iova_len argument to iova_domain_init_rcaches()

 .../ABI/testing/sysfs-kernel-iommu_groups |  16 ++
 drivers/iommu/dma-iommu.c                 |  15 +-
 drivers/iommu/iommu.c                     | 202 +-
 drivers/iommu/iova.c                      |  37 ++--
 drivers/vdpa/vdpa_user/iova_domain.c      |   4 +-
 include/linux/iommu.h                     |   7 +
 include/linux/iova.h                      |   6 +-
 7 files changed, 212 insertions(+), 75 deletions(-)
-- 
2.26.2
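Based on the cover letter's description (the file is updated the same way as the default domain type, with the group's only device unbound first), the expected usage flow would look roughly like the below. The device address, driver name, and group number are examples only, and the exact semantics should be checked against the ABI document added in patch 4/5:

```shell
# Unbind the only device in the IOMMU group from its driver
echo 0000:74:02.0 > /sys/bus/pci/drivers/hisi_sas_v3_hw/unbind

# Set the max optimised DMA size for the group (e.g. 256 KiB),
# which indirectly sizes the IOVA rcache range
echo 262144 > /sys/kernel/iommu_groups/X/max_opt_dma_size

# Rebind; the default domain is reallocated with resized rcaches
echo 0000:74:02.0 > /sys/bus/pci/drivers/hisi_sas_v3_hw/bind
```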
[PATCH RESEND v5 1/5] iommu: Refactor iommu_group_store_type()
Function iommu_group_store_type() supports changing the default domain of an IOMMU group. Many conditions need to be satisfied and steps taken for this action to be successful. Satisfying these conditions and steps will be required for setting other IOMMU group attributes, so factor into a common part and a part specific to update the IOMMU group attribute. No functional change intended. Some code comments are tidied up also. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 96 --- 1 file changed, 62 insertions(+), 34 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index f2c45b85b9fc..0dd766030baf 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -3000,21 +3000,57 @@ static int iommu_change_dev_def_domain(struct iommu_group *group, return ret; } +enum iommu_group_op { + CHANGE_GROUP_TYPE, +}; + +static int __iommu_group_store_type(const char *buf, struct iommu_group *group, + struct device *dev) +{ + int type; + + if (sysfs_streq(buf, "identity")) + type = IOMMU_DOMAIN_IDENTITY; + else if (sysfs_streq(buf, "DMA")) + type = IOMMU_DOMAIN_DMA; + else if (sysfs_streq(buf, "DMA-FQ")) + type = IOMMU_DOMAIN_DMA_FQ; + else if (sysfs_streq(buf, "auto")) + type = 0; + else + return -EINVAL; + + /* +* Check if the only device in the group still has a driver bound or +* we're transistioning from DMA -> DMA-FQ +*/ + if (device_is_bound(dev) && !(type == IOMMU_DOMAIN_DMA_FQ && + group->default_domain->type == IOMMU_DOMAIN_DMA)) { + pr_err_ratelimited("Device is still bound to driver\n"); + return -EINVAL; + } + + return iommu_change_dev_def_domain(group, dev, type); +} + /* * Changing the default domain through sysfs requires the users to unbind the * drivers from the devices in the iommu group, except for a DMA -> DMA-FQ - * transition. Return failure if this isn't met. + * transition. Changing or any other IOMMU group attribute still requires the + * user to unbind the drivers from the devices in the iommu group. 
Return + * failure if these conditions are not met. * * We need to consider the race between this and the device release path. * device_lock(dev) is used here to guarantee that the device release path * will not be entered at the same time. */ -static ssize_t iommu_group_store_type(struct iommu_group *group, - const char *buf, size_t count) +static ssize_t iommu_group_store_common(struct iommu_group *group, + enum iommu_group_op op, + const char *buf, size_t count) { struct group_device *grp_dev; struct device *dev; - int ret, req_type; + int ret; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; @@ -3022,27 +3058,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group, if (WARN_ON(!group)) return -EINVAL; - if (sysfs_streq(buf, "identity")) - req_type = IOMMU_DOMAIN_IDENTITY; - else if (sysfs_streq(buf, "DMA")) - req_type = IOMMU_DOMAIN_DMA; - else if (sysfs_streq(buf, "DMA-FQ")) - req_type = IOMMU_DOMAIN_DMA_FQ; - else if (sysfs_streq(buf, "auto")) - req_type = 0; - else - return -EINVAL; - /* * Lock/Unlock the group mutex here before device lock to -* 1. Make sure that the iommu group has only one device (this is a +* 1. Make sure that the IOMMU group has only one device (this is a *prerequisite for step 2) * 2. Get struct *dev which is needed to lock device */ mutex_lock(&group->mutex); if (iommu_group_device_count(group) != 1) { mutex_unlock(&group->mutex); - pr_err_ratelimited("Cannot change default domain: Group has more than one device\n"); + pr_err_ratelimited("Cannot change IOMMU group default domain attribute: Group has more than one device\n"); return -EINVAL; } @@ -3054,16 +3079,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group, /* * Don't hold the group mutex because taking group mutex first and then * the device lock could potentially cause a deadlock as below. Assume -* two threads T1 and T2. 
T1 is trying to change default domain of an -* iommu group and T2 is trying to hot unplug a device or release [1] VF -* of a PCIe device which is in th
Re: [PATCH v5 2/8] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
+ endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..8902a6f27563 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS + help + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..935dc9b44a54 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,330 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, + val, val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TRACE_TIMEOUT_US); +} + +static bool hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, + val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, + HISI_PTT_RESET_TIMEOUT_US); +} + +static void hisi_ptt_free_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + if (!ctrl->trace_buf) + return; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) { + if (ctrl->trace_buf[i].addr) + dmam_free_coherent(dev, HISI_PTT_TRACE_BUF_SIZE, + ctrl->trace_buf[i].addr, + ctrl->trace_buf[i].dma); + } + + devm_kfree(dev, ctrl->trace_buf); + ctrl->trace_buf = NULL; +} + +static int hisi_ptt_alloc_trace_buf(struct hisi_ptt *hisi_ptt) no caller +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + hisi_ptt->trace_ctrl.buf_index = 0; + + /* If the trace buffer has already been allocated, zero it. 
*/ + if (ctrl->trace_buf) { + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) + memset(ctrl->trace_buf[i].addr, 0, HISI_PTT_TRACE_BUF_SIZE); + return 0; + } + + ctrl->trace_buf = devm_kcalloc(dev, HISI_PTT_TRACE_BUF_CNT, + sizeof(struct hisi_ptt_dma_buffer), GFP_KERNEL); + if (!ctrl->trace_buf) + return -ENOMEM; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; ++i) { + ctrl->trace_buf[i].addr = dmam_alloc_coherent(dev, HISI_PTT_TRACE_BUF_SIZE, + &ctrl->trace_buf[i].dma, +GFP_KERNEL); + if (!ctrl->trace_buf[i].addr) { + hisi_ptt_free_trace_buf(hisi_ptt); + return -ENOMEM; + } + } + + return 0; +} + +static void hisi_ptt_trace_end(struct hisi_ptt *hisi_ptt) +{ + writel(0, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + hisi_ptt->trace_ctrl.started = false; +} + +static int hisi_ptt_trace_start(struct hisi_ptt *hisi_ptt) again this function has no caller, so I assume a warn is generated if we only apply up to this patch (when compiling) please only add code per-patch which is actually referenced +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + u32 val; +
Re: [PATCH v5 1/8] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 08/03/2022 08:49, Yicong Yang wrote: The DMA of HiSilicon PTT device can only work with identical mapping. nit: I'd have "DMA operations of the HiSilicon PTT device can only work properly with identity mappings". So add a quirk for the device to force the domain passthrough. ".. domain as passthrough." Signed-off-by: Yicong Yang --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6dc6d8b6b368..6f67a2b1dd27 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2838,6 +2838,21 @@ static int arm_smmu_dev_disable_feature(struct device *dev, } } +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ +(pdev)->device == 0xa12e) + +static int arm_smmu_def_domain_type(struct device *dev) +{ + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + + if (IS_HISI_PTT_DEVICE(pdev)) + return IOMMU_DOMAIN_IDENTITY; + } + + return 0; +} + static struct iommu_ops arm_smmu_ops = { .capable= arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -2863,6 +2878,7 @@ static struct iommu_ops arm_smmu_ops = { .sva_unbind = arm_smmu_sva_unbind, .sva_get_pasid = arm_smmu_sva_get_pasid, .page_response = arm_smmu_page_response, + .def_domain_type= arm_smmu_def_domain_type, .pgsize_bitmap = -1UL, /* Restricted during device attach */ .owner = THIS_MODULE, }; ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/iova: Free all CPU rcache for retry when iova alloc failure
On 04/03/2022 04:46, yf.wang--- via iommu wrote:

* MEDIATEK Confidentiality Notice The information contained in this e-mail message (including any attachments) may be confidential, proprietary, privileged, or otherwise exempt from disclosure under applicable laws. It is intended to be

Can you please stop sending patches with this?
Re: [PATCH v5 0/5] iommu: Allow IOVA rcache range be configured
On 14/02/2022 17:29, John Garry wrote: Hi guys, And a friendly reminder on this series also. Cheers, john For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This may be much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1]. This series allows the IOVA rcache range be configured, so that we may cache all IOVAs per domain, thus improving performance. A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range: /sys/kernel/iommu_groups/X/max_opt_dma_size This file is updated same as how the IOMMU group default domain type is updated, i.e. must unbind the only device in the group first. The inspiration here comes from block layer request queue sysfs "optimal_io_size" file, in /sys/block/sdX/queue/optimal_io_size Some old figures* for storage scenario (when increasing IOVA rcache range to cover all DMA mapping sizes from the LLD): v5.13-rc1 baseline: 1200K IOPS With series:1800K IOPS All above are for IOMMU strict mode. Non-strict mode gives ~1800K IOPS in all scenarios. 
Based on v5.17-rc4 + [2] * I lost my high data throughout test setup Differences to v4: https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/ - Major rebase - Change the "Refactor iommu_group_store_type()" to not use a callback and an op type enum instead - I didn't pick up Will's Ack as it has changed so much - Use a domain feature flag to keep same default group type - Add wrapper for default IOVA rcache range - Combine last 2x patches [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ [1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/ [2] https://lore.kernel.org/linux-iommu/20220203063345-mutt-send-email-...@kernel.org/T/#m5b2b59576d35cad544314470f32e5f40ac5d1fe9 John Garry (5): iommu: Refactor iommu_group_store_type() iova: Allow rcache range upper limit to be flexible iommu: Allow iommu_change_dev_def_domain() realloc same default domain type iommu: Allow max opt DMA len be set for a group via sysfs iova: Add iova_len argument to iova_domain_init_rcaches() .../ABI/testing/sysfs-kernel-iommu_groups | 16 ++ drivers/iommu/dma-iommu.c | 15 +- drivers/iommu/iommu.c | 202 +- drivers/iommu/iova.c | 37 ++-- drivers/vdpa/vdpa_user/iova_domain.c | 4 +- include/linux/iommu.h | 7 + include/linux/iova.h | 6 +- 7 files changed, 212 insertions(+), 75 deletions(-) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iova: Remove forward declarations
On 18/02/2022 16:28, John Garry wrote: Hi guys, A friendly reminder on this one. Cheers, john Now that the FQ code has been moved to dma-iommu.c and also the rcache- related structures have been brought into iova.c, let's rearrange the code to remove all the forward declarations. The general order is as follows: - RB tree code - iova management - magazine helpers - rcache code and "fast" APIs - iova domain public APIs Rearrange prototypes in iova.h to follow the same general group ordering. A couple of pre-existing checkpatch warnings are also remedied: A suspect indentation is also corrected: WARNING: suspect code indent for conditional statements (16, 32) #374: FILE: drivers/iommu/iova.c:194: + } else if (overlap) + break; WARNING: Block comments should align the * on each line #1038: FILE: drivers/iommu/iova.c:787: + * fails too and the flush_rcache flag is set then the rcache will be flushed. +*/ No functional change intended. Signed-off-by: John Garry diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 7e9c3a97c040..d543131025b3 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -17,75 +17,40 @@ #define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (in pages) */ -static bool iova_rcache_insert(struct iova_domain *iovad, - unsigned long pfn, - unsigned long size); -static unsigned long iova_rcache_get(struct iova_domain *iovad, -unsigned long size, -unsigned long limit_pfn); -static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); -static void free_iova_rcaches(struct iova_domain *iovad); +/* + * Magazine caches for IOVA ranges. For an introduction to magazines, + * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab + * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams. + * For simplicity, we use a static magazine size and don't implement the + * dynamic size tuning described in the paper. 
+ */ -static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node) -{ - struct iova_domain *iovad; +#define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ - iovad = hlist_entry_safe(node, struct iova_domain, cpuhp_dead); +struct iova_magazine { + unsigned long size; + unsigned long pfns[IOVA_MAG_SIZE]; +}; - free_cpu_cached_iovas(cpu, iovad); - return 0; -} +struct iova_cpu_rcache { + spinlock_t lock; + struct iova_magazine *loaded; + struct iova_magazine *prev; +}; -static void free_global_cached_iovas(struct iova_domain *iovad); +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; static struct iova *to_iova(struct rb_node *node) { return rb_entry(node, struct iova, node); } -void -init_iova_domain(struct iova_domain *iovad, unsigned long granule, - unsigned long start_pfn) -{ - /* -* IOVA granularity will normally be equal to the smallest -* supported IOMMU page size; both *must* be capable of -* representing individual CPU pages exactly. 
-*/ - BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule)); - - spin_lock_init(&iovad->iova_rbtree_lock); - iovad->rbroot = RB_ROOT; - iovad->cached_node = &iovad->anchor.node; - iovad->cached32_node = &iovad->anchor.node; - iovad->granule = granule; - iovad->start_pfn = start_pfn; - iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad)); - iovad->max32_alloc_size = iovad->dma_32bit_pfn; - iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; - rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); - rb_insert_color(&iovad->anchor.node, &iovad->rbroot); -} -EXPORT_SYMBOL_GPL(init_iova_domain); - -static struct rb_node * -__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn) -{ - if (limit_pfn <= iovad->dma_32bit_pfn) - return iovad->cached32_node; - - return iovad->cached_node; -} - -static void -__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new) -{ - if (new->pfn_hi < iovad->dma_32bit_pfn) - iovad->cached32_node = &new->node; - else - iovad->cached_node = &new->node; -} - static void __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) { @@ -104,43 +69,6 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) iovad->cached_node = rb_next(&free->node); }
Re: [PATCH v4 2/8] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 24/02/2022 03:53, Yicong Yang wrote:
On 2022/2/22 19:06, John Garry wrote:
On 21/02/2022 08:43, Yicong Yang wrote:

HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex integrated Endpoint (RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. This patch adds the basic trace function, including the device's probe and initialization, functions for trace buffer allocation and trace enable/disable, and registering an interrupt handler to simply respond to the DMA events. The user interface of trace will be added in the following patch.

Fill commit message lines up to 75 characters

Hi John, Thanks for the comments. The commit message is within 75 characters. I checked again and checkpatch didn't warn for this.

I mean to fill the lines up as much as possible, up to 75 char max, if not already done so. I am not sure if you are doing this already, but it looks like you were not. Checkpatch will not warn about a commit message like this :)

Thanks, john
Re: [PATCH v4 3/8] hisi_ptt: Register PMU device for PTT trace
+ static irqreturn_t hisi_ptt_irq(int irq, void *context) { struct hisi_ptt *hisi_ptt = context; @@ -169,7 +233,7 @@ static irqreturn_t hisi_ptt_irq(int irq, void *context) if (!(status & HISI_PTT_TRACE_INT_STAT_MASK)) return IRQ_NONE; - return IRQ_HANDLED; + return IRQ_WAKE_THREAD; } static void hisi_ptt_irq_free_vectors(void *pdev) @@ -192,8 +256,10 @@ static int hisi_ptt_register_irq(struct hisi_ptt *hisi_ptt) if (ret < 0) return ret; - ret = devm_request_irq(&pdev->dev, pci_irq_vector(pdev, HISI_PTT_TRACE_DMA_IRQ), - hisi_ptt_irq, 0, DRV_NAME, hisi_ptt); + ret = devm_request_threaded_irq(&pdev->dev, why add code in patch 2/8 and then immediately change 3/8? + pci_irq_vector(pdev, HISI_PTT_TRACE_DMA_IRQ), + hisi_ptt_irq, hisi_ptt_isr, 0, + DRV_NAME, hisi_ptt); if (ret) { pci_err(pdev, "failed to request irq %d, ret = %d.\n", pci_irq_vector(pdev, HISI_PTT_TRACE_DMA_IRQ), ret); @@ -270,6 +336,429 @@ static void hisi_ptt_init_ctrls(struct hisi_ptt *hisi_ptt) hisi_ptt->trace_ctrl.default_cpu = cpumask_first(cpumask_of_node(dev_to_node(&pdev->dev))); } +#define HISI_PTT_PMU_FILTER_IS_PORT BIT(19) +#define HISI_PTT_PMU_FILTER_VAL_MASK GENMASK(15, 0) +#define HISI_PTT_PMU_DIRECTION_MASKGENMASK(23, 20) +#define HISI_PTT_PMU_TYPE_MASK GENMASK(31, 24) +#define HISI_PTT_PMU_FORMAT_MASK GENMASK(35, 32) + +static ssize_t available_root_port_filters_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct hisi_ptt_filter_desc *filter; + int pos = 0; + + if (list_empty(&hisi_ptt->port_filters)) + return sysfs_emit(buf, "\n"); + + mutex_lock(&hisi_ptt->mutex); + list_for_each_entry(filter, &hisi_ptt->port_filters, list) + pos += sysfs_emit_at(buf, pos, "%s 0x%05lx\n", +pci_name(filter->pdev), +hisi_ptt_get_filter_val(filter->pdev) | +HISI_PTT_PMU_FILTER_IS_PORT); + + mutex_unlock(&hisi_ptt->mutex); + return pos; +} +static DEVICE_ATTR_ADMIN_RO(available_root_port_filters); + +static 
ssize_t available_requester_filters_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct hisi_ptt_filter_desc *filter; + int pos = 0; + + if (list_empty(&hisi_ptt->port_filters)) is this supposed to be req_filters? And is it safe to access without locking? + return sysfs_emit(buf, "\n"); + + mutex_lock(&hisi_ptt->mutex); + list_for_each_entry(filter, &hisi_ptt->req_filters, list) + pos += sysfs_emit_at(buf, pos, "%s 0x%05x\n", +pci_name(filter->pdev), +hisi_ptt_get_filter_val(filter->pdev)); + + mutex_unlock(&hisi_ptt->mutex); + return pos; +} +static DEVICE_ATTR_ADMIN_RO(available_requester_filters); + +PMU_FORMAT_ATTR(filter,"config:0-19"); +PMU_FORMAT_ATTR(direction, "config:20-23"); +PMU_FORMAT_ATTR(type, "config:24-31"); +PMU_FORMAT_ATTR(format,"config:32-35"); + +static struct attribute *hisi_ptt_pmu_format_attrs[] = { + &format_attr_filter.attr, + &format_attr_direction.attr, + &format_attr_type.attr, + &format_attr_format.attr, + NULL +}; + +static struct attribute_group hisi_ptt_pmu_format_group = { + .name = "format", + .attrs = hisi_ptt_pmu_format_attrs, +}; + +static struct attribute *hisi_ptt_pmu_filter_attrs[] = { + &dev_attr_available_root_port_filters.attr, + &dev_attr_available_requester_filters.attr, + NULL +}; + +static struct attribute_group hisi_ptt_pmu_filter_group = { + .attrs = hisi_ptt_pmu_filter_attrs, +}; + +static const struct attribute_group *hisi_ptt_pmu_groups[] = { + &hisi_ptt_pmu_format_group, + &hisi_ptt_pmu_filter_group, + NULL +}; + +/* + * The supported value of the direction parameter. See hisi_ptt.rst + * documentation for more details. + */ +static u32 hisi_ptt_trace_available_direction[] = { + 0, + 1, + 2, + 3, this seems a very odd array. And I assume it is const as it is modified - can this be non-global and tied to the device context? +}; + +/* Different types can be set simultaneously */ +static u32 hisi_ptt_trace_available_type
Re: [PATCH v4 2/8] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 21/02/2022 08:43, Yicong Yang wrote: HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. This patch adds basic function of trace, including the device's probe and initialization, functions for trace buffer allocation and trace enable/disable, register an interrupt handler to simply response to the DMA events. The user interface of trace will be added in the following patch. Fill commit message lines upto 75 characters Signed-off-by: Yicong Yang --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 11 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 370 +++ drivers/hwtracing/ptt/hisi_ptt.h | 149 + 6 files changed, 535 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index a110338c860c..ab3411e4eba5 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..41fa83921a07 --- 
/dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 && PCI && HAS_DMA && HAS_IOMEM why no compile test support? + help + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..a5b4f09ccd1e --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,370 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static int hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, +val, val & HISI_PTT_TRACE_IDLE, +HISI_PTT_WAIT_POLL_INTERVAL_US, +HISI_PTT_WAIT_TIMEOUT_US); +} + +static int hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, +val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, +HISI_PTT_RESET_TIMEOUT_US); +} + +static void hisi_ptt_free_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + if (!ctrl->trace_buf) + return; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) it's good practice to use {} for if-else or similar in the loop + if (ctr
[PATCH] iova: Remove forward declarations
Now that the FQ code has been moved to dma-iommu.c and also the rcache- related structures have been brought into iova.c, let's rearrange the code to remove all the forward declarations. The general order is as follows: - RB tree code - iova management - magazine helpers - rcache code and "fast" APIs - iova domain public APIs Rearrange prototypes in iova.h to follow the same general group ordering. A couple of pre-existing checkpatch warnings are also remedied: A suspect indentation is also corrected: WARNING: suspect code indent for conditional statements (16, 32) #374: FILE: drivers/iommu/iova.c:194: + } else if (overlap) + break; WARNING: Block comments should align the * on each line #1038: FILE: drivers/iommu/iova.c:787: + * fails too and the flush_rcache flag is set then the rcache will be flushed. +*/ No functional change intended. Signed-off-by: John Garry diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 7e9c3a97c040..d543131025b3 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -17,75 +17,40 @@ #define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size (in pages) */ -static bool iova_rcache_insert(struct iova_domain *iovad, - unsigned long pfn, - unsigned long size); -static unsigned long iova_rcache_get(struct iova_domain *iovad, -unsigned long size, -unsigned long limit_pfn); -static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); -static void free_iova_rcaches(struct iova_domain *iovad); +/* + * Magazine caches for IOVA ranges. For an introduction to magazines, + * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab + * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams. + * For simplicity, we use a static magazine size and don't implement the + * dynamic size tuning described in the paper. 
+ */ -static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node) -{ - struct iova_domain *iovad; +#define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ - iovad = hlist_entry_safe(node, struct iova_domain, cpuhp_dead); +struct iova_magazine { + unsigned long size; + unsigned long pfns[IOVA_MAG_SIZE]; +}; - free_cpu_cached_iovas(cpu, iovad); - return 0; -} +struct iova_cpu_rcache { + spinlock_t lock; + struct iova_magazine *loaded; + struct iova_magazine *prev; +}; -static void free_global_cached_iovas(struct iova_domain *iovad); +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; static struct iova *to_iova(struct rb_node *node) { return rb_entry(node, struct iova, node); } -void -init_iova_domain(struct iova_domain *iovad, unsigned long granule, - unsigned long start_pfn) -{ - /* -* IOVA granularity will normally be equal to the smallest -* supported IOMMU page size; both *must* be capable of -* representing individual CPU pages exactly. 
-*/ - BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule)); - - spin_lock_init(&iovad->iova_rbtree_lock); - iovad->rbroot = RB_ROOT; - iovad->cached_node = &iovad->anchor.node; - iovad->cached32_node = &iovad->anchor.node; - iovad->granule = granule; - iovad->start_pfn = start_pfn; - iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad)); - iovad->max32_alloc_size = iovad->dma_32bit_pfn; - iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; - rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); - rb_insert_color(&iovad->anchor.node, &iovad->rbroot); -} -EXPORT_SYMBOL_GPL(init_iova_domain); - -static struct rb_node * -__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn) -{ - if (limit_pfn <= iovad->dma_32bit_pfn) - return iovad->cached32_node; - - return iovad->cached_node; -} - -static void -__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new) -{ - if (new->pfn_hi < iovad->dma_32bit_pfn) - iovad->cached32_node = &new->node; - else - iovad->cached_node = &new->node; -} - static void __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) { @@ -104,43 +69,6 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) iovad->cached_node = rb_next(&free->node); } -static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned long limit_pfn) -{ - s
[PATCH v5 5/5] iova: Add iova_len argument to iova_domain_init_rcaches()
Add max opt argument to iova_domain_init_rcaches(), and use it to set the rcaches range. Also fix up all users to set this value (at 0, meaning use default), including a wrapper for that, iova_domain_init_rcaches_default(). For dma-iommu.c we derive the iova_len argument from the IOMMU group max opt DMA size. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c| 15 ++- drivers/iommu/iova.c | 19 --- drivers/vdpa/vdpa_user/iova_domain.c | 4 ++-- include/linux/iova.h | 3 ++- 4 files changed, 34 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 908272f0153e..7afa2226c6bd 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + size_t max_opt_dma_size; + unsigned long iova_len = 0; int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) @@ -560,7 +562,18 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); - ret = iova_domain_init_rcaches(iovad); + + max_opt_dma_size = iommu_group_get_max_opt_dma_size(dev->iommu_group); + if (max_opt_dma_size) { + unsigned long shift = __ffs(1UL << order); + + iova_len = roundup_pow_of_two(max_opt_dma_size); + iova_len >>= shift; + if (!iova_len) + iova_len = 1; + } + + ret = iova_domain_init_rcaches(iovad, iova_len); if (ret) return ret; diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index f6eb93c737cb..f05aae044e25 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -705,12 +705,20 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -int iova_domain_init_rcaches(struct iova_domain *iovad) +static unsigned long iova_len_to_rcache_max(unsigned long iova_len) +{ + return order_base_2(iova_len) 
+ 1; +} + +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len) { unsigned int cpu; int i, ret; - iovad->rcache_max_size = 6; /* Arbitrarily high default */ + if (iova_len) + iovad->rcache_max_size = iova_len_to_rcache_max(iova_len); + else + iovad->rcache_max_size = 6; /* Arbitrarily high default */ iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), @@ -754,7 +762,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) free_iova_rcaches(iovad); return ret; } -EXPORT_SYMBOL_GPL(iova_domain_init_rcaches); + +int iova_domain_init_rcaches_default(struct iova_domain *iovad) +{ + return iova_domain_init_rcaches(iovad, 0); +} +EXPORT_SYMBOL_GPL(iova_domain_init_rcaches_default); /* * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 22f7d43f8a68..74471bc463ec 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -514,12 +514,12 @@ vduse_domain_create(unsigned long iova_limit, size_t bounce_size) spin_lock_init(&domain->iotlb_lock); init_iova_domain(&domain->stream_iovad, PAGE_SIZE, IOVA_START_PFN); - ret = iova_domain_init_rcaches(&domain->stream_iovad); + ret = iova_domain_init_rcaches_default(&domain->stream_iovad); if (ret) goto err_iovad_stream; init_iova_domain(&domain->consistent_iovad, PAGE_SIZE, bounce_pfns); - ret = iova_domain_init_rcaches(&domain->consistent_iovad); + ret = iova_domain_init_rcaches_default(&domain->consistent_iovad); if (ret) goto err_iovad_consistent; diff --git a/include/linux/iova.h b/include/linux/iova.h index 02f7222fa85a..56281434ce0c 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -95,7 +95,8 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo, unsigned long pfn_hi); void init_iova_domain(struct iova_domain *iovad, unsigned long granule, unsigned long start_pfn); -int 
iova_domain_init_rcaches(struct iova_domain *iovad); +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len); +int iova_domain_init_rcaches_default(struct iova_domain *iovad); struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn); void put_iov
[PATCH v5 4/5] iommu: Allow max opt DMA len be set for a group via sysfs
Add support to allow the maximum optimised DMA len be set for an IOMMU group via sysfs. This is much the same with the method to change the default domain type for a group. Signed-off-by: John Garry --- .../ABI/testing/sysfs-kernel-iommu_groups | 16 + drivers/iommu/iommu.c | 59 ++- include/linux/iommu.h | 6 ++ 3 files changed, 79 insertions(+), 2 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups index b15af6a5bc08..ed6f72794f6c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups +++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups @@ -63,3 +63,19 @@ Description: /sys/kernel/iommu_groups//type shows the type of default system could lead to catastrophic effects (the users might need to reboot the machine to get it to normal state). So, it's expected that the users understand what they're doing. + +What: /sys/kernel/iommu_groups//max_opt_dma_size +Date: Feb 2022 +KernelVersion: v5.18 +Contact: iommu@lists.linux-foundation.org +Description: /sys/kernel/iommu_groups//max_opt_dma_size shows the + max optimised DMA size for the default IOMMU domain associated + with the group. + Each IOMMU domain has an IOVA domain. The IOVA domain caches + IOVAs upto a certain size as a performance optimisation. + This sysfs file allows the range of the IOVA domain caching be + set, such that larger than default IOVAs may be cached. + A value of 0 means that the default caching range is chosen. + A privileged user could request the kernel the change the range + by writing to this file. For this to happen, the same rules + and procedure applies as in changing the default domain type. 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index df9ffd76c184..79f5cbea5c95 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -48,6 +48,7 @@ struct iommu_group { struct iommu_domain *default_domain; struct iommu_domain *domain; struct list_head entry; + size_t max_opt_dma_size; }; struct group_device { @@ -89,6 +90,9 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static ssize_t iommu_group_store_max_opt_dma_size(struct iommu_group *group, + const char *buf, + size_t count); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name =\ @@ -570,6 +574,12 @@ static ssize_t iommu_group_show_type(struct iommu_group *group, return strlen(type); } +static ssize_t iommu_group_show_max_opt_dma_size(struct iommu_group *group, +char *buf) +{ + return sprintf(buf, "%zu\n", group->max_opt_dma_size); +} + static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL); static IOMMU_GROUP_ATTR(reserved_regions, 0444, @@ -578,6 +588,9 @@ static IOMMU_GROUP_ATTR(reserved_regions, 0444, static IOMMU_GROUP_ATTR(type, 0644, iommu_group_show_type, iommu_group_store_type); +static IOMMU_GROUP_ATTR(max_opt_dma_size, 0644, iommu_group_show_max_opt_dma_size, + iommu_group_store_max_opt_dma_size); + static void iommu_group_release(struct kobject *kobj) { struct iommu_group *group = to_iommu_group(kobj); @@ -664,6 +677,10 @@ struct iommu_group *iommu_group_alloc(void) if (ret) return ERR_PTR(ret); + ret = iommu_group_create_file(group, &iommu_group_attr_max_opt_dma_size); + if (ret) + return ERR_PTR(ret); + pr_debug("Allocated group %d\n", group->id); return group; @@ -2302,6 +2319,11 @@ struct iommu_domain *iommu_get_dma_domain(struct device *dev) return dev->iommu_group->default_domain; } +size_t 
iommu_group_get_max_opt_dma_size(struct iommu_group *group) +{ + return group->max_opt_dma_size; +} + /* * IOMMU groups are really the natural working unit of the IOMMU, but * the IOMMU API works on domains and devices. Bridge that gap by @@ -3132,12 +3154,14 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * @prev_dev: The device in the group (this is used to make sure that the device * hasn't changed after the caller has called this function) * @type: The type of the new default domain that gets associated with the group + * @max_opt_dma_
(Note: the remainder of this patch was truncated by the archive; the hunks above cover the sysfs attribute plumbing and the iommu_group_get_max_opt_dma_size() accessor.)
[PATCH v5 3/5] iommu: Allow iommu_change_dev_def_domain() realloc same default domain type
Allow iommu_change_dev_def_domain() to create a new default domain, keeping the same as current. Also remove comment about the function purpose, which will become stale. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 49 ++- include/linux/iommu.h | 1 + 2 files changed, 26 insertions(+), 24 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 5e7ed969b870..df9ffd76c184 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -3124,6 +3124,7 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle) } EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); + /* * Changes the default domain of an iommu group that has *only* one device * @@ -3134,10 +3135,6 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * * Returns 0 on success and error code on failure * - * Note: - * 1. Presently, this function is called only when user requests to change the - *group's default domain type through /sys/kernel/iommu_groups//type - *Please take a closer look if intended to use for other purposes. */ static int iommu_change_dev_def_domain(struct iommu_group *group, struct device *prev_dev, int type) @@ -3190,28 +3187,32 @@ static int iommu_change_dev_def_domain(struct iommu_group *group, goto out; } - dev_def_dom = iommu_get_def_domain_type(dev); - if (!type) { + if (type == __IOMMU_DOMAIN_SAME) { + type = prev_dom->type; + } else { + dev_def_dom = iommu_get_def_domain_type(dev); + if (!type) { + /* +* If the user hasn't requested any specific type of domain and +* if the device supports both the domains, then default to the +* domain the device was booted with +*/ + type = dev_def_dom ? 
: iommu_def_domain_type; + } else if (dev_def_dom && type != dev_def_dom) { + dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n", + iommu_domain_type_str(type)); + ret = -EINVAL; + goto out; + } + /* -* If the user hasn't requested any specific type of domain and -* if the device supports both the domains, then default to the -* domain the device was booted with +* Switch to a new domain only if the requested domain type is different +* from the existing default domain type */ - type = dev_def_dom ? : iommu_def_domain_type; - } else if (dev_def_dom && type != dev_def_dom) { - dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n", - iommu_domain_type_str(type)); - ret = -EINVAL; - goto out; - } - - /* -* Switch to a new domain only if the requested domain type is different -* from the existing default domain type -*/ - if (prev_dom->type == type) { - ret = 0; - goto out; + if (prev_dom->type == type) { + ret = 0; + goto out; + } } /* We can bring up a flush queue without tearing down the domain */ diff --git a/include/linux/iommu.h b/include/linux/iommu.h index de0c57a567c8..d242fccc7c2d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -62,6 +62,7 @@ struct iommu_domain_geometry { implementation */ #define __IOMMU_DOMAIN_PT (1U << 2) /* Domain is identity mapped */ #define __IOMMU_DOMAIN_DMA_FQ (1U << 3) /* DMA-API uses flush queue*/ +#define __IOMMU_DOMAIN_SAME(1U << 4) /* Keep same type (internal) */ /* * This are the possible domain-types -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v5 2/5] iova: Allow rcache range upper limit to be flexible
Some low-level drivers may request DMA mappings whose IOVA length exceeds that of the current rcache upper limit. This means that allocations for those IOVAs will never be cached, and always must be allocated and freed from the RB tree per DMA mapping cycle. This has a significant effect on performance, more so since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. As a first step towards allowing the rcache range upper limit be configured, hold this value in the IOVA rcache structure, and allocate the rcaches separately. Delete macro IOVA_RANGE_CACHE_MAX_SIZE in case it's reused by mistake. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- drivers/iommu/iova.c | 20 ++-- include/linux/iova.h | 3 +++ 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 7e9c3a97c040..f6eb93c737cb 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,8 +15,6 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR~0UL -#define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size (in pages) */ - static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); @@ -442,7 +440,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size, * rounding up anything cacheable to make sure that can't happen. The * order of the unadjusted size will still match upon freeing. 
*/ - if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) + if (size < (1 << (iovad->rcache_max_size - 1))) size = roundup_pow_of_two(size); iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1); @@ -712,13 +710,15 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, ret; - iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE, + iovad->rcache_max_size = 6; /* Arbitrarily high default */ + + iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), GFP_KERNEL); if (!iovad->rcaches) return -ENOMEM; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { struct iova_cpu_rcache *cpu_rcache; struct iova_rcache *rcache; @@ -815,7 +815,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE) + if (log_size >= iovad->rcache_max_size) return false; return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn); @@ -871,7 +871,7 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE || !iovad->rcaches) + if (log_size >= iovad->rcache_max_size || !iovad->rcaches) return 0; return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size); @@ -887,7 +887,7 @@ static void free_iova_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; if (!rcache->cpu_rcaches) break; @@ -915,7 +915,7 @@ static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad) unsigned long flags; int i; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu); spin_lock_irqsave(&cpu_rcache->lock, flags); @@ -934,7 
+934,7 @@ static void free_global_cached_iovas(struct iova_domain *iovad) unsigned long flags; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; spin_lock_irqsave(&rcache->lock, flags); for (j = 0; j < rcache->depot_size; ++j) { diff --git a/include/linux/iova.h b/include/linux/iova.h index 320a70e40233..02f7222fa85a 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -38,6 +38,9 @@ struct iova_domain { struct iova_rcache *rcaches; struct hlist_node
[PATCH v5 1/5] iommu: Refactor iommu_group_store_type()
Function iommu_group_store_type() supports changing the default domain of an IOMMU group. Many conditions need to be satisfied and steps taken for this action to be successful. Satisfying these conditions and steps will be required for setting other IOMMU group attributes, so factor into a common part and a part specific to update the IOMMU group attribute. No functional change intended. Some code comments are tidied up also. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 96 --- 1 file changed, 62 insertions(+), 34 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 107dcf5938d6..5e7ed969b870 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -3261,21 +3261,57 @@ static int iommu_change_dev_def_domain(struct iommu_group *group, return ret; } +enum iommu_group_op { + CHANGE_GROUP_TYPE, +}; + +static int __iommu_group_store_type(const char *buf, struct iommu_group *group, + struct device *dev) +{ + int type; + + if (sysfs_streq(buf, "identity")) + type = IOMMU_DOMAIN_IDENTITY; + else if (sysfs_streq(buf, "DMA")) + type = IOMMU_DOMAIN_DMA; + else if (sysfs_streq(buf, "DMA-FQ")) + type = IOMMU_DOMAIN_DMA_FQ; + else if (sysfs_streq(buf, "auto")) + type = 0; + else + return -EINVAL; + + /* +* Check if the only device in the group still has a driver bound or +* we're transistioning from DMA -> DMA-FQ +*/ + if (device_is_bound(dev) && !(type == IOMMU_DOMAIN_DMA_FQ && + group->default_domain->type == IOMMU_DOMAIN_DMA)) { + pr_err_ratelimited("Device is still bound to driver\n"); + return -EINVAL; + } + + return iommu_change_dev_def_domain(group, dev, type); +} + /* * Changing the default domain through sysfs requires the users to unbind the * drivers from the devices in the iommu group, except for a DMA -> DMA-FQ - * transition. Return failure if this isn't met. + * transition. Changing or any other IOMMU group attribute still requires the + * user to unbind the drivers from the devices in the iommu group. 
Return + * failure if these conditions are not met. * * We need to consider the race between this and the device release path. * device_lock(dev) is used here to guarantee that the device release path * will not be entered at the same time. */ -static ssize_t iommu_group_store_type(struct iommu_group *group, - const char *buf, size_t count) +static ssize_t iommu_group_store_common(struct iommu_group *group, + enum iommu_group_op op, + const char *buf, size_t count) { struct group_device *grp_dev; struct device *dev; - int ret, req_type; + int ret; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; @@ -3283,27 +3319,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group, if (WARN_ON(!group)) return -EINVAL; - if (sysfs_streq(buf, "identity")) - req_type = IOMMU_DOMAIN_IDENTITY; - else if (sysfs_streq(buf, "DMA")) - req_type = IOMMU_DOMAIN_DMA; - else if (sysfs_streq(buf, "DMA-FQ")) - req_type = IOMMU_DOMAIN_DMA_FQ; - else if (sysfs_streq(buf, "auto")) - req_type = 0; - else - return -EINVAL; - /* * Lock/Unlock the group mutex here before device lock to -* 1. Make sure that the iommu group has only one device (this is a +* 1. Make sure that the IOMMU group has only one device (this is a *prerequisite for step 2) * 2. Get struct *dev which is needed to lock device */ mutex_lock(&group->mutex); if (iommu_group_device_count(group) != 1) { mutex_unlock(&group->mutex); - pr_err_ratelimited("Cannot change default domain: Group has more than one device\n"); + pr_err_ratelimited("Cannot change IOMMU group default domain attribute: Group has more than one device\n"); return -EINVAL; } @@ -3315,16 +3340,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group, /* * Don't hold the group mutex because taking group mutex first and then * the device lock could potentially cause a deadlock as below. Assume -* two threads T1 and T2. 
T1 is trying to change default domain of an -* iommu group and T2 is trying to hot unplug a device or release [1] VF -* of a PCIe device which is in th
[PATCH v5 0/5] iommu: Allow IOVA rcache range be configured
For streaming DMA mappings involving an IOMMU, where the IOVA length regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This may be much more pronounced since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1].

This series allows the IOVA rcache range to be configured, so that we may cache all IOVAs per domain, thus improving performance.

A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range:
/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated in the same way as the IOMMU group default domain type, i.e. the only device in the group must be unbound first. The inspiration here comes from the block layer request queue sysfs "optimal_io_size" file, in /sys/block/sdX/queue/optimal_io_size

Some old figures* for a storage scenario (when increasing the IOVA rcache range to cover all DMA mapping sizes from the LLD):
v5.13-rc1 baseline: 1200K IOPS
With series:        1800K IOPS

All of the above are for IOMMU strict mode. Non-strict mode gives ~1800K IOPS in all scenarios.
Based on v5.17-rc4 + [2]

* I lost my high data throughput test setup

Differences to v4: https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/
- Major rebase
- Change "Refactor iommu_group_store_type()" to not use a callback, using an op type enum instead
- I didn't pick up Will's Ack as it has changed so much
- Use a domain feature flag to keep the same default group type
- Add a wrapper for the default IOVA rcache range
- Combine the last 2x patches

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/
[2] https://lore.kernel.org/linux-iommu/20220203063345-mutt-send-email-...@kernel.org/T/#m5b2b59576d35cad544314470f32e5f40ac5d1fe9

John Garry (5):
  iommu: Refactor iommu_group_store_type()
  iova: Allow rcache range upper limit to be flexible
  iommu: Allow iommu_change_dev_def_domain() realloc same default domain type
  iommu: Allow max opt DMA len be set for a group via sysfs
  iova: Add iova_len argument to iova_domain_init_rcaches()

 .../ABI/testing/sysfs-kernel-iommu_groups | 16 ++
 drivers/iommu/dma-iommu.c | 15 +-
 drivers/iommu/iommu.c | 202 +-
 drivers/iommu/iova.c | 37 ++--
 drivers/vdpa/vdpa_user/iova_domain.c | 4 +-
 include/linux/iommu.h | 7 +
 include/linux/iova.h | 6 +-
 7 files changed, 212 insertions(+), 75 deletions(-)

--
2.26.2
Re: [PATCH v2] iommu/iova: Separate out rcache init
On 03/02/2022 11:34, Michael S. Tsirkin wrote: On Thu, Feb 03, 2022 at 05:59:20PM +0800, John Garry wrote: Currently the rcache structures are allocated for all IOVA domains, even if they do not use "fast" alloc+free interface. This is wasteful of memory. In addition, fails in init_iova_rcaches() are not handled safely, which is less than ideal. Make "fast" users call a separate rcache init explicitly, which includes error checking. Signed-off-by: John Garry virtio things: Acked-by: Michael S. Tsirkin Cheers Hi Robin, Can you kindly give this your blessing if you are happy with it? Thanks! --- Differences to v1: - Drop stubs for iova_domain_init_rcaches() and iova_domain_free_rcaches() - Use put_iova_domain() in vdpa code diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d85d54f2b549..b22034975301 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) return -EINVAL; @@ -559,6 +560,9 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); + ret = iova_domain_init_rcaches(iovad); + if (ret) + return ret; /* If the FQ fails we can simply fall back to strict mode */ if (domain->type == IOMMU_DOMAIN_DMA_FQ && iommu_dma_init_fq(domain)) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b28c9435b898..7e9c3a97c040 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,13 +15,14 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR ~0UL +#define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (in pages) */ + static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); 
static unsigned long iova_rcache_get(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn); -static void init_iova_rcaches(struct iova_domain *iovad); static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); static void free_iova_rcaches(struct iova_domain *iovad); @@ -64,8 +65,6 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); rb_insert_color(&iovad->anchor.node, &iovad->rbroot); - cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, &iovad->cpuhp_dead); - init_iova_rcaches(iovad); } EXPORT_SYMBOL_GPL(init_iova_domain); @@ -488,6 +487,13 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size) } EXPORT_SYMBOL_GPL(free_iova_fast); +static void iova_domain_free_rcaches(struct iova_domain *iovad) +{ + cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, + &iovad->cpuhp_dead); + free_iova_rcaches(iovad); +} + /** * put_iova_domain - destroys the iova domain * @iovad: - iova domain in question. 
@@ -497,9 +503,9 @@ void put_iova_domain(struct iova_domain *iovad) { struct iova *iova, *tmp; - cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, - &iovad->cpuhp_dead); - free_iova_rcaches(iovad); + if (iovad->rcaches) + iova_domain_free_rcaches(iovad); + rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node) free_iova_mem(iova); } @@ -608,6 +614,7 @@ EXPORT_SYMBOL_GPL(reserve_iova); */ #define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ struct iova_magazine { unsigned long size; @@ -620,6 +627,13 @@ struct iova_cpu_rcache { struct iova_magazine *prev; }; +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; + static struct iova_magazine *iova_magazine_alloc(gfp_t flags) { return kzalloc(sizeof(struct iova_magazine), flags); @@ -693,28 +707,54 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -static void init_iova_rcaches(struct iova_domain *iovad) +int iova_domain_init_rcaches(struct iova_domain *iovad) { -
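The shape of the change above — moving cache setup out of the unconditional init path and behind an explicit, error-checked opt-in — can be sketched outside the kernel roughly as follows. The toy_* names and structures are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Illustrative stand-in for the kernel structures; not the real API. */
struct toy_rcache {
	int depth;
};

struct toy_domain {
	struct toy_rcache *rcaches;	/* stays NULL unless a "fast" user opts in */
};

/* Base init can no longer fail on cache allocation, mirroring the
 * patched init_iova_domain(). */
static void toy_domain_init(struct toy_domain *d)
{
	d->rcaches = NULL;
}

/* Explicit opt-in with error checking, in the spirit of
 * iova_domain_init_rcaches(). */
static int toy_domain_init_rcaches(struct toy_domain *d)
{
	d->rcaches = calloc(1, sizeof(*d->rcaches));
	if (!d->rcaches)
		return -ENOMEM;
	return 0;
}

/* Teardown frees the caches only if they were actually set up, as the
 * patched put_iova_domain() does with its "if (iovad->rcaches)" check. */
static void toy_domain_destroy(struct toy_domain *d)
{
	free(d->rcaches);
	d->rcaches = NULL;
}
```

The point of the split is that non-"fast" users pay no memory for the caches, and teardown stays safe whether or not the opt-in ever ran.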
Re: [PATCH v3 8/8] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 08/02/2022 11:21, Yicong Yang wrote: This patch should be earlier in the series, before the PTT driver, and the comment on hisi_ptt_check_iommu_mapping() should mention what is going on here. OK, I'll reorder the series and modify the comments of hisi_ptt_check_iommu_mapping() like: /* * The DMA of PTT trace can only use direct mapping, due to some * hardware restriction. Check whether there is an iommu or the * policy of the iommu domain is passthrough, otherwise the trace * cannot work. IOMMU, capitalize acronyms * * The PTT device is supposed to be behind the arm smmu v3, which * should have passthrough the device by a quirk. Otherwise user * should manually set the iommu domain type to identity through * sysfs. Sorry, but I don't really understand your meaning here. I did not think that if we have a default domain then we can change via sysfs to anything else. */ Signed-off-by: Yicong Yang --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6dc6d8b6b368..6f67a2b1dd27 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2838,6 +2838,21 @@ static int arm_smmu_dev_disable_feature(struct device *dev, } } +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ + (pdev)->device == 0xa12e) I assume that not all revisions will require this check, right? So if you are very confident that the next revision will be fixed then I would add a check for this current broken revision. For current revisions it's necessary. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 8/8] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 24/01/2022 13:11, Yicong Yang wrote: The DMA of the HiSilicon PTT device can only work with identity mapping. So add a quirk for the device to force the domain passthrough. This patch should be earlier in the series, before the PTT driver, and the comment on hisi_ptt_check_iommu_mapping() should mention what is going on here. Signed-off-by: Yicong Yang --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6dc6d8b6b368..6f67a2b1dd27 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2838,6 +2838,21 @@ static int arm_smmu_dev_disable_feature(struct device *dev, } } +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ + (pdev)->device == 0xa12e) I assume that not all revisions will require this check, right? + +static int arm_smmu_def_domain_type(struct device *dev) +{ + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + + if (IS_HISI_PTT_DEVICE(pdev)) + return IOMMU_DOMAIN_IDENTITY; + } + + return 0; +} + static struct iommu_ops arm_smmu_ops = { .capable= arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -2863,6 +2878,7 @@ static struct iommu_ops arm_smmu_ops = { .sva_unbind = arm_smmu_sva_unbind, .sva_get_pasid = arm_smmu_sva_get_pasid, .page_response = arm_smmu_page_response, + .def_domain_type= arm_smmu_def_domain_type, .pgsize_bitmap = -1UL, /* Restricted during device attach */ .owner = THIS_MODULE, };
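The mechanism in the quirk above — keying an IOMMU domain-type decision off a PCI vendor/device ID pair — boils down to a simple match function. A standalone sketch (the toy_* names are stand-ins for the kernel types; 0x19e5 is the Huawei PCI vendor ID, and the 0xa12e device ID comes from the patch; as the review notes, a real quirk might also need to check the device revision once fixed silicon exists):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for struct pci_dev; only the ID fields matter here. */
struct toy_pci_dev {
	uint16_t vendor;
	uint16_t device;
};

#define TOY_VENDOR_ID_HUAWEI	0x19e5	/* PCI_VENDOR_ID_HUAWEI */
#define TOY_HISI_PTT_DEVID	0xa12e	/* device ID from the patch */

/* Mirrors the .def_domain_type contract: 0 means "no preference",
 * a nonzero value forces that domain type for the device. */
enum toy_domain_type { TOY_DOMAIN_ANY = 0, TOY_DOMAIN_IDENTITY = 1 };

static int toy_def_domain_type(const struct toy_pci_dev *pdev)
{
	if (pdev->vendor == TOY_VENDOR_ID_HUAWEI &&
	    pdev->device == TOY_HISI_PTT_DEVID)
		return TOY_DOMAIN_IDENTITY;
	return TOY_DOMAIN_ANY;	/* let the IOMMU core pick the default */
}
```

Returning 0 for every other device is what keeps the quirk narrowly scoped: only the PTT endpoint is forced into identity mapping.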
Re: [PATCH v3 1/8] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 24/01/2022 13:11, Yicong Yang wrote: HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex integrated Endpoint (RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. This patch adds the basic trace function, including the device's probe and initialization, functions for trace buffer allocation and trace enable/disable, and registers an interrupt handler to simply respond to the DMA events. The user interface of trace will be added in the following patch. Signed-off-by: Yicong Yang --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 11 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 398 +++ drivers/hwtracing/ptt/hisi_ptt.h | 159 6 files changed, 573 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index a110338c860c..ab3411e4eba5 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..4f4f2459ac47 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0
+1,11 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 && PCI && HAS_DMA && HAS_IOMEM + help + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..6d0a0ca5c0a9 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,398 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd.
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static int hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return readl_poll_timeout(hisi_ptt->iobase + HISI_PTT_TRACE_STS, val, + val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TIMEOUT_US); +} + +static void hisi_ptt_free_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + struct hisi_ptt_dma_buffer *buffer, *tbuffer; + + list_for_each_entry_safe(buffer, tbuffer, &ctrl->trace_buf, list) { + list_del(&buffer->list); + dma_free_coherent(dev, buffer->size, buffer->addr, + buffer->dma); + kfree(buffer); + } +} + +static int hisi_ptt_alloc_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + struct hisi_ptt_dma_buffer *buffer; + int i, ret; + + hisi_ptt->trace_ctrl.buf_index = 0;
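The allocation path quoted above pairs a list-building allocator with a free routine that walks the list (hisi_ptt_alloc_trace_buf() / hisi_ptt_free_trace_buf()); the important property is that a mid-loop failure unwinds everything already allocated. A userspace sketch of that pattern, with plain calloc standing in for dma_alloc_coherent() and toy_* names that are purely illustrative:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_buffer {
	void *addr;
	size_t size;
	struct toy_buffer *next;
};

/* Free every buffer on the list, mirroring the walk in
 * hisi_ptt_free_trace_buf(). */
static void toy_free_bufs(struct toy_buffer **head)
{
	struct toy_buffer *b, *tmp;

	for (b = *head; b; b = tmp) {
		tmp = b->next;
		free(b->addr);
		free(b);
	}
	*head = NULL;
}

/* Allocate n buffers onto the list; on any failure, unwind fully so
 * the caller never sees a half-built list. */
static int toy_alloc_bufs(struct toy_buffer **head, int n, size_t size)
{
	int i;

	*head = NULL;
	for (i = 0; i < n; i++) {
		struct toy_buffer *b = malloc(sizeof(*b));

		if (!b)
			goto err;
		b->addr = calloc(1, size);
		if (!b->addr) {
			free(b);
			goto err;
		}
		b->size = size;
		b->next = *head;
		*head = b;
	}
	return 0;
err:
	toy_free_bufs(head);
	return -ENOMEM;
}
```

Because the error path reuses the normal free routine, there is one place to get the teardown right, which is what makes this idiom robust in drivers.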
Re: [PATCH v2] iommu/core: Remove comment reference to iommu_dev_has_feature
On 07/02/2022 03:23, Akeem G Abodunrin wrote: The iommu_dev_has_feature() API was removed by commit 262948f8ba573 ("iommu: Delete iommu_dev_has_feature()") - so this patch removes the comment about that API to avoid any confusion. Signed-off-by: Akeem G Abodunrin Cc: Lu Baolu Reviewed-by: Christoph Hellwig Reviewed-by: John Garry BTW, it looks like we can get rid of iommu_ops.dev_has_feat also. It does not seem to be called, even though the arm-smmu-v3 driver does provide a callback. --- include/linux/iommu.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index de0c57a567c8..bea054f2bd4d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -153,8 +153,7 @@ struct iommu_resv_region { * supported, this feature must be enabled before and * disabled after %IOMMU_DEV_FEAT_SVA. * - * Device drivers query whether a feature is supported using - * iommu_dev_has_feature(), and enable it using iommu_dev_enable_feature(). + * Device drivers enable the feature via iommu_dev_enable_feature(). */ enum iommu_dev_features { IOMMU_DEV_FEAT_AUX,
Re: [PATCH v1 04/10] iommu/vt-d: Remove iova_cache_get/put()
On 07/02/2022 06:41, Lu Baolu wrote: diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 583ec0fa4ac1..e8d58654361c 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -3348,9 +3348,6 @@ static inline int iommu_devinfo_cache_init(void) static int __init iommu_init_mempool(void) { int ret; - ret = iova_cache_get(); - if (ret) - return ret; ret = iommu_domain_cache_init(); if (ret) @@ -3362,7 +3359,6 @@ static int __init iommu_init_mempool(void) kmem_cache_destroy(iommu_domain_cache); domain_error: Nit: is this label still really required? Only failures in iommu_domain_cache_init() jump to it, and that can return directly now. Thanks, John - iova_cache_put();
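The nit above is about the standard kernel error-unwinding idiom: a goto label only earns its keep if something already set up needs unwinding when the label is reached. A minimal sketch of the shape being suggested (stage names are made up for illustration; fail_b is a test knob, not part of the pattern):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static int a_initialized;
static bool fail_b;	/* test knob: force the second stage to fail */

static int stage_a_init(void)  { a_initialized = 1; return 0; }
static void stage_a_destroy(void) { a_initialized = 0; }
static int stage_b_init(void)  { return fail_b ? -ENOMEM : 0; }

/* With the first allocation removed from the function (as iova_cache_get()
 * was in the patch), a failure before anything is set up can simply
 * return; only a later failure needs a label to unwind prior stages. */
static int toy_init(void)
{
	int ret;

	ret = stage_a_init();
	if (ret)
		return ret;	/* nothing to unwind yet: no label needed */

	ret = stage_b_init();
	if (ret)
		goto out_destroy_a;	/* unwind stage A only */

	return 0;

out_destroy_a:
	stage_a_destroy();
	return ret;
}
```

When a stage is deleted, any label whose only remaining jumpers have nothing to unwind can collapse into a direct return, which is exactly the cleanup the review suggests.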
[PATCH v2] iommu/iova: Separate out rcache init
Currently the rcache structures are allocated for all IOVA domains, even if they do not use "fast" alloc+free interface. This is wasteful of memory. In addition, fails in init_iova_rcaches() are not handled safely, which is less than ideal. Make "fast" users call a separate rcache init explicitly, which includes error checking. Signed-off-by: John Garry --- Differences to v1: - Drop stubs for iova_domain_init_rcaches() and iova_domain_free_rcaches() - Use put_iova_domain() in vdpa code diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d85d54f2b549..b22034975301 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) return -EINVAL; @@ -559,6 +560,9 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); + ret = iova_domain_init_rcaches(iovad); + if (ret) + return ret; /* If the FQ fails we can simply fall back to strict mode */ if (domain->type == IOMMU_DOMAIN_DMA_FQ && iommu_dma_init_fq(domain)) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b28c9435b898..7e9c3a97c040 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,13 +15,14 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR~0UL +#define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size (in pages) */ + static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); static unsigned long iova_rcache_get(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn); -static void init_iova_rcaches(struct iova_domain *iovad); static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain 
*iovad); static void free_iova_rcaches(struct iova_domain *iovad); @@ -64,8 +65,6 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); rb_insert_color(&iovad->anchor.node, &iovad->rbroot); - cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, &iovad->cpuhp_dead); - init_iova_rcaches(iovad); } EXPORT_SYMBOL_GPL(init_iova_domain); @@ -488,6 +487,13 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size) } EXPORT_SYMBOL_GPL(free_iova_fast); +static void iova_domain_free_rcaches(struct iova_domain *iovad) +{ + cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, + &iovad->cpuhp_dead); + free_iova_rcaches(iovad); +} + /** * put_iova_domain - destroys the iova domain * @iovad: - iova domain in question. @@ -497,9 +503,9 @@ void put_iova_domain(struct iova_domain *iovad) { struct iova *iova, *tmp; - cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, - &iovad->cpuhp_dead); - free_iova_rcaches(iovad); + if (iovad->rcaches) + iova_domain_free_rcaches(iovad); + rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node) free_iova_mem(iova); } @@ -608,6 +614,7 @@ EXPORT_SYMBOL_GPL(reserve_iova); */ #define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ struct iova_magazine { unsigned long size; @@ -620,6 +627,13 @@ struct iova_cpu_rcache { struct iova_magazine *prev; }; +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; + static struct iova_magazine *iova_magazine_alloc(gfp_t flags) { return kzalloc(sizeof(struct iova_magazine), flags); @@ -693,28 +707,54 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -static void init_iova_rcaches(struct iova_domain *iovad) +int 
iova_domain_init_rcaches(struct iova_domain *iovad) { - struct iova_cpu_rcache *cpu_rcache; - struct iova_rcache *rcache; unsigned int cpu; - int i; + int i, ret; + + iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE, +sizeof(struct iova_rcache), +
Re: [PATCH] iommu/amd: Fix loop timeout issue in iommu_ga_log_enable()
On 31/01/2022 16:17, Joerg Roedel wrote: From: Joerg Roedel The polling loop for the register change in iommu_ga_log_enable() needs to have a udelay() in it. Otherwise the CPU might be faster than the IOMMU hardware and wrongly trigger the WARN_ON() further down the code stream. Fixes: 8bda0cfbdc1a ("iommu/amd: Detect and initialize guest vAPIC log") Signed-off-by: Joerg Roedel --- drivers/iommu/amd/init.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c index dc338acf3338..d2e09d53851f 100644 --- a/drivers/iommu/amd/init.c +++ b/drivers/iommu/amd/init.c @@ -834,6 +834,7 @@ static int iommu_ga_log_enable(struct amd_iommu *iommu) status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET); if (status & (MMIO_STATUS_GALOG_RUN_MASK)) break; + udelay(1); Maybe readl_relaxed_poll_timeout_atomic() could be used here instead. Thanks, John } if (WARN_ON(i >= LOOP_TIMEOUT))
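The pattern under discussion — polling a status register until a bit comes up, with a bounded number of attempts and a delay between reads — is what readl_poll_timeout() and its variants encapsulate. A userspace sketch of the same loop, where a countdown stands in for the hardware eventually setting MMIO_STATUS_GALOG_RUN_MASK (names are illustrative, not the kernel API):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Fake "register": reads as 0 a few times, then reports the RUN bit,
 * standing in for the IOMMU hardware finishing its state change. */
static int reads_until_ready;

static uint32_t toy_readl(void)
{
	if (reads_until_ready > 0) {
		reads_until_ready--;
		return 0;
	}
	return 0x100;	/* stand-in for the RUN bit */
}

/* Bounded polling loop in the readl_poll_timeout() style: success as
 * soon as the condition holds, -ETIMEDOUT once the budget is spent.
 * In kernel code a udelay() would sit between attempts. */
static int toy_poll_timeout(uint32_t (*readval)(void), uint32_t mask,
			    int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		if (readval() & mask)
			return 0;
		/* udelay(1); -- elided in this userspace sketch */
	}
	return -ETIMEDOUT;
}
```

Folding the read, condition, delay, and timeout into one helper is what makes the suggested readl_relaxed_poll_timeout_atomic() attractive over an open-coded loop: the delay can no longer be forgotten.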
Re: [PATCH] iommu/iova: Separate out rcache init
On 26/01/2022 17:00, Robin Murphy wrote: As above, I vote for just forward-declaring the free routine in iova.c and keeping it entirely private. BTW, speaking of forward declarations, it's possible to remove all the forward declarations in iova.c now that the FQ code is gone - but with a good bit of rearranging. However, I am not sure how much people care about that or whether the code layout is sane...