Re: [PATCH v5 0/5] DMA mapping changes for SCSI core
On 07/07/2022 21:35, Martin K. Petersen wrote:

Christoph, Yes, I've mostly been waiting for an ACK from Martin.

Sorry, I'm on vacation this week. The series looks OK to me although I do agree that it would be great if the max was reflected in the queue's hard limit and opt in the soft limit.

Ah, I think that I misunderstood Damien's question. I thought he was asking why not keep shost max_sectors at dma_max_mapping_size() and then init each sdev request queue max hw sectors at dma_opt_mapping_size(). But it seems that you want to know why not have the request queue max sectors at dma_opt_mapping_size().

The answer is related to the meaning of dma_opt_mapping_size(). If we get any mappings which exceed this size then it can have a big DMA mapping performance hit. So I set max hw sectors at this 'opt' mapping size to ensure that we get no mappings which exceed this size. Indeed, I think max sectors is 128KB today for my host, which would be the same as the dma_opt_mapping_size() value with an IOMMU enabled. And I find that only a small percentage of requests exceed this 128KB size, but it still has a big performance impact.

Acked-by: Martin K. Petersen

Thanks, John

___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 0/5] DMA mapping changes for SCSI core
On 30/06/2022 13:08, John Garry wrote:

Hi Christoph, Can you please consider picking up this series? A few things to note beforehand:
- I changed to only apply the mapping limit to SAS hosts in this version. I would need a fresh ack from Martin for those SCSI parts, but wanted to make sure you were ok with it.
- Damien had some doubt on updating the shost max_sectors as opposed to the per-request queue default, but I think he's ok with it - see patch 4/5

Thanks, John

As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit. This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping.

Only the SCSI SAS transport code is modified to use this limit. For now I did not want to touch other hosts as I have a concern that this change could cause a performance regression.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Changes since v4:
- tweak libata and other patch titles
- Add Robin's tag (thanks!)
- Clarify description of new DMA mapping API
Re: [PATCH RESEND v5 1/5] iommu: Refactor iommu_group_store_type()
On 06/07/2022 13:00, Will Deacon wrote: On Mon, Apr 04, 2022 at 07:27:10PM +0800, John Garry wrote: Function iommu_group_store_type() supports changing the default domain of an IOMMU group. Many conditions need to be satisfied and steps taken for this action to be successful. Satisfying these conditions and steps will be required for setting other IOMMU group attributes, so factor into a common part and a part specific to update the IOMMU group attribute. No functional change intended. Some code comments are tidied up also. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 96 --- 1 file changed, 62 insertions(+), 34 deletions(-) Acked-by: Will Deacon Thanks, but currently I have no plans to progress this series, in favour of this https://lore.kernel.org/linux-iommu/1656590892-42307-1-git-send-email-john.ga...@huawei.com/T/#me0e806913050c95f6e6ba2c7f7d96d51ce191204 cheers
Re: [PATCH] iommu/iova: change IOVA_MAG_SIZE to 127 to save memory
On 01/07/2022 04:56, Feng Tang wrote:

inclination. ok, what you are saying sounds reasonable. I just remember that when we analyzed the longterm aging issue we concluded that the FQ size and its relation to the magazine size was a factor, and this change makes me a little worried about new issues. Better the devil you know and all that... Anyway, if I get some time I might do some testing to see if this change has any influence.

Another thought is whether we even need to store the size in the iova_magazine. Mags in the depot are always full. As such, we only need worry about mags loaded in the cpu rcache and their sizes, so maybe we could have something like this:

struct iova_magazine {
-       unsigned long size;
        unsigned long pfns[IOVA_MAG_SIZE];
};
@@ -631,6 +630,8 @@ struct iova_cpu_rcache {
        spinlock_t lock;
        struct iova_magazine *loaded;
        struct iova_magazine *prev;
+       int loaded_size;
+       int prev_size;
};

I haven't tried to implement it though.. I have very little knowledge of iova, so you can choose what's the better solution. I just wanted to raise the problem and will be happy to see it solved:)

I quickly tested your patch for performance and saw no noticeable difference, which is no surprise. But I'll defer to Robin if he thinks that your patch is a better solution - I would guess that he does. For me personally I would prefer that this value was not changed, as I mentioned before.

thanks, John
Re: [PATCH v5 4/5] scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal limit
On 01/07/2022 00:49, Damien Le Moal wrote:

+	if (dma_dev) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}

Hi Damien,

> Hmm... shost->max_sectors becomes the max_hw_sectors limit for the block
> dev. So using dma_max_mapping_size(dma_dev) for that limit makes sense.
> Shouldn't dma_opt_mapping_size(dma_dev) be used to limit only the default
> "soft" limit (queue max_sectors limit) instead of the hard limit ?

Sure, it would be sensible to use dma_opt_mapping_size() to limit the default queue max sectors limit, while dma_max_mapping_size() limits the host max sectors. But I didn't see in practice how limiting the shost max sectors to dma_opt_mapping_size() makes a difference:
- block queue max_hw_sectors_kb file is read-only, so we cannot change the queue max sectors from there
- And no SAS driver actually tries to modify upwards from the default. I do note the USB storage driver as an example of a SCSI driver which does (modify from shost max sectors): see scsiglue.c::slave_configure()

Finally, there is no common method to limit the default request queue max sectors for those SAS drivers - I would need to add this limit in each of their slave_configure callbacks, and I didn't think that it's worth it.

Thanks, John
Re: [PATCH v5 3/5] scsi: core: Cap shost max_sectors according to DMA limits only once
On 01/07/2022 00:41, Damien Le Moal wrote:

 	shost->dma_dev = dma_dev;
 
+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_max_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}

Nit: you could remove the curly brackets... But it being a multi-line statement, having them is OK too I think.

tglx seems to think that they are ok, and I generally agree (now): https://lore.kernel.org/linux-arm-kernel/877djwdorz@nanos.tec.linutronix.de/

AFAICT coding-style.rst is ok with them in this scenario too.

Cheers, John
[PATCH v5 5/5] ata: libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value from how it is earlier set in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which includes host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
Acked-by: Damien Le Moal
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 86dbb1cdfabd..24a43d540d9f 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1060,6 +1060,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
 		dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
 	/* configure max sectors */
+	dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
 	blk_queue_max_hw_sectors(q, dev->max_sectors);
 
 	if (dev->class == ATA_DEV_ATAPI) {
-- 
2.35.3
[PATCH v5 4/5] scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal limit
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the request queue max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

Signed-off-by: John Garry
---
 drivers/scsi/scsi_transport_sas.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 12bff64dade6..1b45248748e0 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -225,6 +225,7 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 {
 	struct Scsi_Host *shost = dev_to_shost(dev);
 	struct sas_host_attrs *sas_host = to_sas_host_attrs(shost);
+	struct device *dma_dev = shost->dma_dev;
 
 	INIT_LIST_HEAD(&sas_host->rphy_list);
 	mutex_init(&sas_host->lock);
@@ -236,6 +237,11 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 		dev_printk(KERN_ERR, dev,
 			   "fail to a bsg device %d\n", shost->host_no);
 
+	if (dma_dev) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	return 0;
 }
-- 
2.35.3
[PATCH v5 3/5] scsi: core: Cap shost max_sectors according to DMA limits only once
The shost->max_sectors is repeatedly capped according to the host DMA mapping limit for each sdev in __scsi_init_queue(). This is unnecessary, so set only once when adding the host.

Signed-off-by: John Garry
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 8352f90d997d..d04bd2c7c9f1 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -236,6 +236,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	shost->dma_dev = dma_dev;
 
+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_max_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	error = scsi_mq_setup_tags(shost);
 	if (error)
 		goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6ffc9e4258a8..6ce8acea322a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 		blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
 	}
 
-	if (dev->dma_mask) {
-		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-				dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-	}
 	blk_queue_max_hw_sectors(q, shost->max_sectors);
 	blk_queue_segment_boundary(q, shost->dma_boundary);
 	dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.35.3
[PATCH v5 2/5] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which allows the drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
Acked-by: Robin Murphy
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+	return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
@@ -1479,6 +1484,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.35.3
[PATCH v5 1/5] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mapping involving an IOMMU may be much slower for larger total mapping size. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 Documentation/core-api/dma-api.rst | 14 ++++++++++++++
 include/linux/dma-map-ops.h        |  1 +
 include/linux/dma-mapping.h        |  5 +++++
 kernel/dma/mapping.c               | 12 ++++++++++++
 4 files changed, 32 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..829f20a193ca 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,20 @@ Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and
 others should not be larger than the returned value.
 
+::
+
+	size_t
+	dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device.
+
+Mapping larger buffers may take much longer in certain scenarios. In
+addition, for high-rate short-lived streaming mappings, the upfront time
+spent on the mapping may account for an appreciable part of the total
+request lifetime. As such, if splitting larger requests incurs no
+significant performance penalty, then device drivers are advised to
+limit total DMA streaming mappings length to the returned value.
+
 ::
 
 	bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
 	int (*dma_supported)(struct device *dev, u64 mask);
 	u64 (*get_required_mask)(struct device *dev);
 	size_t (*max_mapping_size)(struct device *dev);
+	size_t (*opt_mapping_size)(void);
 	unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
 	return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+	return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+	size_t size = SIZE_MAX;
+
+	if (ops && ops->opt_mapping_size)
+		size = ops->opt_mapping_size();
+
+	return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.35.3
[PATCH v5 0/5] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit. This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping.

Only the SCSI SAS transport code is modified to use this limit. For now I did not want to touch other hosts as I have a concern that this change could cause a performance regression.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Changes since v4:
- tweak libata and other patch titles
- Add Robin's tag (thanks!)
- Clarify description of new DMA mapping API

Changes since v3:
- Apply max DMA optimal limit to SAS hosts only
  Note: Even though "scsi: core: Cap shost max_sectors only once when adding" is a subset of a previous patch I did not transfer the RB tags
- Rebase on v5.19-rc4

John Garry (5):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA limits only once
  scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal limit
  ata: libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst | 14 ++++++++++++++
 drivers/ata/libata-scsi.c          |  1 +
 drivers/iommu/dma-iommu.c          |  6 ++++++
 drivers/iommu/iova.c               |  5 +++++
 drivers/scsi/hosts.c               |  5 +++++
 drivers/scsi/scsi_lib.c            |  4 ----
 drivers/scsi/scsi_transport_sas.c  |  6 ++++++
 include/linux/dma-map-ops.h        |  1 +
 include/linux/dma-mapping.h        |  5 +++++
 include/linux/iova.h               |  2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 11 files changed, 57 insertions(+), 4 deletions(-)
-- 
2.35.3
Re: [PATCH] iommu/iova: change IOVA_MAG_SIZE to 127 to save memory
[    4.319253] iommu: Adding device :06:00.2 to group 5
[    4.325869] iommu: Adding device :20:01.0 to group 15
[    4.332648] iommu: Adding device :20:02.0 to group 16
[    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
[    4.350251] swapper/0 cpuset=/ mems_allowed=0
[    4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.57.mx64.282 #1
[    4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.9.3 06/25/2019
[    4.355612] Call Trace:
[    4.355612]  dump_stack+0x46/0x5b
[    4.355612]  dump_header+0x6b/0x289
[    4.355612]  out_of_memory+0x470/0x4c0
[    4.355612]  __alloc_pages_nodemask+0x970/0x1030
[    4.355612]  cache_grow_begin+0x7d/0x520
[    4.355612]  fallback_alloc+0x148/0x200
[    4.355612]  kmem_cache_alloc_trace+0xac/0x1f0
[    4.355612]  init_iova_domain+0x112/0x170

Note for Feng Tang: This callchain does not exist anymore since we separated out the rcache init from the IOVA domain init. Indeed, not so much memory is wasted for unused rcaches now. My point really is that it would be nicer to see a modern callchain - but don't read that as me saying that the change in this patch is bad.

[    4.355612]  amd_iommu_domain_alloc+0x138/0x1a0
[    4.355612]  iommu_group_get_for_dev+0xc4/0x1a0
[    4.355612]  amd_iommu_add_device+0x13a/0x610
[    4.355612]  add_iommu_group+0x20/0x30
[    4.355612]  bus_for_each_dev+0x76/0xc0
[    4.355612]  bus_set_iommu+0xb6/0xf0
[    4.355612]  amd_iommu_init_api+0x112/0x132
[    4.355612]  state_next+0xfb1/0x1165
[    4.355612]  amd_iommu_init+0x1f/0x67
[    4.355612]  pci_iommu_init+0x16/0x3f
...
[    4.670295] Unreclaimable slab info:
...
[    4.857565] kmalloc-2048           59164KB      59164KB

Change IOVA_MAG_SIZE from 128 to 127 to make the size of 'iova_magazine' 1024 bytes so that no memory will be wasted.

[1]. https://lkml.org/lkml/2019/8/12/266

Signed-off-by: Feng Tang
---
 drivers/iommu/iova.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145b..27634ddd9b904 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova);
  * dynamic size tuning described in the paper.
  */
 
-#define IOVA_MAG_SIZE 128
+/*
+ * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to
+ * assure size of 'iova_magzine' to be 1024 bytes, so that no memory

Typo: iova_magazine

+ * will be wasted.
+ */
+#define IOVA_MAG_SIZE 127

I do wonder if we will see some strange new behaviour since IOVA_FQ_SIZE % IOVA_MAG_SIZE != 0 now...

I doubt it - even if a flush queue does happen to be entirely full of equal-sized IOVAs, a CPU's loaded magazines also both being perfectly empty when it comes to dump a full fq seem further unlikely, so in practice I don't see this making any appreciable change to the likelihood of spilling back to the depot or not. In fact the smaller the magazines get, the less time would be spent flushing the depot back to the rbtree, where your interesting workload falls off the cliff and never catches back up with the fq timer, so at some point it might even improve (unless it's also already close to the point where smaller caches would bottleneck allocation)... might be interesting to experiment with a wider range of magazine sizes if you had the time and inclination.

ok, what you are saying sounds reasonable. I just remember that when we analyzed the longterm aging issue we concluded that the FQ size and its relation to the magazine size was a factor, and this change makes me a little worried about new issues. Better the devil you know and all that... Anyway, if I get some time I might do some testing to see if this change has any influence.

Another thought is whether we even need to store the size in the iova_magazine. Mags in the depot are always full. As such, we only need worry about mags loaded in the cpu rcache and their sizes, so maybe we could have something like this:

struct iova_magazine {
-       unsigned long size;
        unsigned long pfns[IOVA_MAG_SIZE];
};
@@ -631,6 +630,8 @@ struct iova_cpu_rcache {
        spinlock_t lock;
        struct iova_magazine *loaded;
        struct iova_magazine *prev;
+       int loaded_size;
+       int prev_size;
};

I haven't tried to implement it though..

Thanks, John
Re: [PATCH] iommu/iova: change IOVA_MAG_SIZE to 127 to save memory
On 30/06/2022 10:02, Robin Murphy wrote: On 2022-06-30 08:33, Feng Tang wrote:

kmalloc will round up the request size to power of 2, and current iova_magazine's size is 1032 (1024+8) bytes, so each instance allocated will get 2048 bytes from kmalloc, causing around 1KB waste. And in some extreme case, the memory wasted can trigger OOM as reported in 2019 on a crash kernel with 256 MB memory [1].

I don't think it really needs pointing out that excessive memory consumption can cause OOM. Especially not in the particularly silly context of a system with only 2MB of RAM per CPU - that's pretty much guaranteed to be doomed one way or another.

[    4.319253] iommu: Adding device :06:00.2 to group 5
[    4.325869] iommu: Adding device :20:01.0 to group 15
[    4.332648] iommu: Adding device :20:02.0 to group 16
[    4.338946] swapper/0 invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=0
[    4.350251] swapper/0 cpuset=/ mems_allowed=0
[    4.354618] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.57.mx64.282 #1
[    4.355612] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.9.3 06/25/2019
[    4.355612] Call Trace:
[    4.355612]  dump_stack+0x46/0x5b
[    4.355612]  dump_header+0x6b/0x289
[    4.355612]  out_of_memory+0x470/0x4c0
[    4.355612]  __alloc_pages_nodemask+0x970/0x1030
[    4.355612]  cache_grow_begin+0x7d/0x520
[    4.355612]  fallback_alloc+0x148/0x200
[    4.355612]  kmem_cache_alloc_trace+0xac/0x1f0
[    4.355612]  init_iova_domain+0x112/0x170
[    4.355612]  amd_iommu_domain_alloc+0x138/0x1a0
[    4.355612]  iommu_group_get_for_dev+0xc4/0x1a0
[    4.355612]  amd_iommu_add_device+0x13a/0x610
[    4.355612]  add_iommu_group+0x20/0x30
[    4.355612]  bus_for_each_dev+0x76/0xc0
[    4.355612]  bus_set_iommu+0xb6/0xf0
[    4.355612]  amd_iommu_init_api+0x112/0x132
[    4.355612]  state_next+0xfb1/0x1165
[    4.355612]  amd_iommu_init+0x1f/0x67
[    4.355612]  pci_iommu_init+0x16/0x3f
...
[    4.670295] Unreclaimable slab info:
...
[    4.857565] kmalloc-2048           59164KB      59164KB

Change IOVA_MAG_SIZE from 128 to 127 to make the size of 'iova_magazine' 1024 bytes so that no memory will be wasted.

[1]. https://lkml.org/lkml/2019/8/12/266

Signed-off-by: Feng Tang
---
 drivers/iommu/iova.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145b..27634ddd9b904 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -614,7 +614,12 @@ EXPORT_SYMBOL_GPL(reserve_iova);
  * dynamic size tuning described in the paper.
  */
 
-#define IOVA_MAG_SIZE 128
+/*
+ * As kmalloc's buffer size is fixed to power of 2, 127 is chosen to
+ * assure size of 'iova_magzine' to be 1024 bytes, so that no memory

Typo: iova_magazine

+ * will be wasted.
+ */
+#define IOVA_MAG_SIZE 127

I do wonder if we will see some strange new behaviour since IOVA_FQ_SIZE % IOVA_MAG_SIZE != 0 now... The change itself seems perfectly reasonable, though.

Acked-by: Robin Murphy
Re: [PATCH v4 1/5] dma-mapping: Add dma_opt_mapping_size()
On 28/06/2022 12:27, John Garry via iommu wrote: On 28/06/2022 12:23, Robin Murphy wrote:

+
+	size_t
+	dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.

Nit: I'm not sure "advised" is necessarily the right thing to say in general - that's only really true for a caller who cares about throughput of churning through short-lived mappings more than anything else, and doesn't take a significant hit overall from splitting up larger requests. I do think it's good to clarify the exact context of "optimal" here, but I'd prefer to be objectively clear that it's for workloads where the up-front mapping overhead dominates.

I'm going to go with something like this:

size_t
dma_opt_mapping_size(struct device *dev);

Returns the maximum optimal size of a mapping for the device.

Mapping larger buffers may take much longer in certain scenarios. In addition, for high-rate short-lived streaming mappings the upfront time spent on the mapping may account for an appreciable part of the total request lifetime. As such, if splitting larger requests incurs no significant performance penalty, then device drivers are advised to limit total DMA streaming mappings length to the returned value.

Let me know if you would like it further amended.

Cheers, John
Re: [PATCH v4 5/5] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
On 29/06/2022 06:58, Damien Le Moal wrote: On 6/29/22 14:40, Christoph Hellwig wrote: On Tue, Jun 28, 2022 at 12:33:58PM +0100, John Garry wrote: Well Christoph originally offered to take this series via the dma-mapping tree. @Christoph, is that still ok with you? If so, would you rather I send this libata patch separately? The offer still stands, and I don't really care where the libata patch is routed. Just tell me what you prefer. Cheers. If it is 100% independent from the other patches, I can take it. Otherwise, feel free to take it ! I'll just keep them all together - it's easier in case I need to change anything. Thanks!
Re: [PATCH v4 5/5] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
On 28/06/2022 10:14, Damien Le Moal wrote: BTW, this patch has no real dependency on the rest of the series, so could be taken separately if you prefer. Sure, you can send it separately. Adding it through the scsi tree is fine too. Well Christoph originally offered to take this series via the dma-mapping tree. @Christoph, is that still ok with you? If so, would you rather I send this libata patch separately? Thanks, John
Re: [PATCH v4 1/5] dma-mapping: Add dma_opt_mapping_size()
On 28/06/2022 12:23, Robin Murphy wrote:

+
+	size_t
+	dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.

Nit: I'm not sure "advised" is necessarily the right thing to say in general - that's only really true for a caller who cares about throughput of churning through short-lived mappings more than anything else, and doesn't take a significant hit overall from splitting up larger requests. I do think it's good to clarify the exact context of "optimal" here, but I'd prefer to be objectively clear that it's for workloads where the up-front mapping overhead dominates.

Ok, sure, I can make that clear.

Thanks, John
Re: [PATCH v4 5/5] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
On 28/06/2022 00:24, Damien Le Moal wrote: On 6/28/22 00:25, John Garry wrote: ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value from how it is earlier set in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which includes host DMA mapping limits. Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit. Signed-off-by: John Garry Acked-by: Damien Le Moal Nit: please change the patch title to "ata: libata-scsi: Cap ..." ok, but it's going to be an even longer title :) BTW, this patch has no real dependency on the rest of the series, so could be taken separately if you prefer. Thanks, John
[PATCH v4 5/5] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value, overriding what was earlier set in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which include host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
Acked-by: Damien Le Moal
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 86dbb1cdfabd..24a43d540d9f 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1060,6 +1060,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
 		dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
 	/* configure max sectors */
+	dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
 	blk_queue_max_hw_sectors(q, dev->max_sectors);
 
 	if (dev->class == ATA_DEV_ATAPI) {
-- 
2.35.3
[PATCH v4 4/5] scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal mapping limit
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the shost max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

Signed-off-by: John Garry
---
 drivers/scsi/scsi_transport_sas.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 12bff64dade6..1b45248748e0 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -225,6 +225,7 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 {
 	struct Scsi_Host *shost = dev_to_shost(dev);
 	struct sas_host_attrs *sas_host = to_sas_host_attrs(shost);
+	struct device *dma_dev = shost->dma_dev;
 
 	INIT_LIST_HEAD(&sas_host->rphy_list);
 	mutex_init(&sas_host->lock);
@@ -236,6 +237,11 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 		dev_printk(KERN_ERR, dev, "fail to a bsg device %d\n",
 			   shost->host_no);
 
+	if (dma_dev) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	return 0;
 }
-- 
2.35.3
[PATCH v4 2/5] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+	return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
@@ -1479,6 +1484,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.35.3
[PATCH v4 3/5] scsi: core: Cap shost max_sectors according to DMA mapping limits only once
The shost->max_sectors is repeatedly capped according to the host DMA mapping limit for each sdev in __scsi_init_queue(). This is unnecessary, so cap it only once, when adding the host.

Signed-off-by: John Garry
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 8352f90d997d..d04bd2c7c9f1 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -236,6 +236,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	shost->dma_dev = dma_dev;
 
+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_max_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	error = scsi_mq_setup_tags(shost);
 	if (error)
 		goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6ffc9e4258a8..6ce8acea322a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 		blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
 	}
 
-	if (dev->dma_mask) {
-		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-				dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-	}
 	blk_queue_max_hw_sectors(q, shost->max_sectors);
 	blk_queue_segment_boundary(q, shost->dma_boundary);
 	dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.35.3
[PATCH v4 1/5] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mappings involving an IOMMU may be much slower for larger total mapping sizes. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 Documentation/core-api/dma-api.rst |  9 +++++++++
 include/linux/dma-map-ops.h        |  1 +
 include/linux/dma-mapping.h        |  5 +++++
 kernel/dma/mapping.c               | 12 ++++++++++++
 4 files changed, 27 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..b3cd9763d28b 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,15 @@ Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and
 others should not be larger than the returned value.
 
+::
+
+	size_t
+	dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.
+
 ::
 
 	bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
 	int (*dma_supported)(struct device *dev, u64 mask);
 	u64 (*get_required_mask)(struct device *dev);
 	size_t (*max_mapping_size)(struct device *dev);
+	size_t (*opt_mapping_size)(void);
 	unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
 	return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+	return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+	const struct dma_map_ops *ops = get_dma_ops(dev);
+	size_t size = SIZE_MAX;
+
+	if (ops && ops->opt_mapping_size)
+		size = ops->opt_mapping_size();
+
+	return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
 	const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.35.3
[PATCH v4 0/5] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping.

The SCSI SAS transport code is modified only to use this limit. For now I did not want to touch other hosts as I have a concern that this change could cause a performance regression.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

Changes since v3:
- Apply max DMA optimal limit to SAS hosts only
  Note: Even though "scsi: core: Cap shost max_sectors only once when adding" is a subset of a previous patch I did not transfer the RB tags
- Rebase on v5.19-rc4

Changes since v2:
- Rebase on v5.19-rc1
- Add Damien's tag to 2/4 (thanks)

Changes since v1:
- Relocate scsi_add_host_with_dma() dma_dev check (Reported by Dan)
- Add tags from Damien and Martin (thanks)
  note: I only added Martin's tag to the SCSI patch

John Garry (5):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA mapping limits only once
  scsi: scsi_transport_sas: Cap shost max_sectors according to DMA optimal mapping limit
  libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst |  9 +++++++++
 drivers/ata/libata-scsi.c          |  1 +
 drivers/iommu/dma-iommu.c          |  6 ++++++
 drivers/iommu/iova.c               |  5 +++++
 drivers/scsi/hosts.c               |  5 +++++
 drivers/scsi/scsi_lib.c            |  4 ----
 drivers/scsi/scsi_transport_sas.c  |  6 ++++++
 include/linux/dma-map-ops.h        |  1 +
 include/linux/dma-mapping.h        |  5 +++++
 include/linux/iova.h               |  2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 11 files changed, 52 insertions(+), 4 deletions(-)

-- 
2.35.3
Re: [PATCH v3 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
On 14/06/2022 14:12, John Garry wrote:

On 06/06/2022 10:30, John Garry wrote:

Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Can I please get some sort of ack from the IOMMU people on this one?

Another request for an ack please.

Thanks,
John

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+	return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
@@ -1479,6 +1484,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 10/06/2022 16:37, John Garry via iommu wrote:

On 6/9/22 10:54, John Garry wrote:

ok, but do you have a system where the UFS host controller is behind an IOMMU? I had the impression that UFS controllers would be mostly found in embedded systems and that IOMMUs are not as common there.

Modern phones have an IOMMU. Below one can find an example from a Pixel 6 phone. The UFS storage controller is not controlled by the IOMMU as far as I can see, but I wouldn't be surprised if the security team would ask us one day to enable the IOMMU for the UFS controller.

OK, then unfortunately it seems that you have no method to test. I might be able to test USB MSC, but I am not even sure I can get DMA mappings whose length exceeds the IOVA rcache limit there.

I was able to do some testing on USB MSC for an XHCI controller. The result is that limiting the max HW sectors there does not affect performance in normal conditions. However if I hack the USB driver and fiddle with request queue settings then it can:
- lift max_sectors limit in usb_stor_host_template 120KB -> 256KB
- lift request queue read_ahead_kb 128KB -> 256KB

In this scenario I can get 42.5MB/s read throughput, as opposed to 39.5MB/s in normal conditions. Since .can_queue=1 for that host it would not fall foul of some issues I experience in IOVA allocator performance, so limiting max_sectors would not be required for that reason.

So this is an artificial test, but it may be worth considering only applying this DMA mapping optimal max_sectors limit to SAS controllers, which I know can benefit. Christoph, any opinion?

thanks,
John
Re: [PATCH v3 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
On 06/06/2022 10:30, John Garry wrote:

Add the IOMMU callback for DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Can I please get some sort of ack from the IOMMU people on this one?

Thanks,
John

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
 	return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+	return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
 	.alloc			= iommu_dma_alloc,
 	.free			= iommu_dma_free,
@@ -1479,6 +1484,7 @@ static const struct dma_map_ops iommu_dma_ops = {
 	.map_resource		= iommu_dma_map_resource,
 	.unmap_resource		= iommu_dma_unmap_resource,
 	.get_merge_boundary	= iommu_dma_get_merge_boundary,
+	.opt_mapping_size	= iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 09/06/2022 21:34, Bart Van Assche wrote:

On 6/9/22 10:54, John Garry wrote:

ok, but do you have a system where the UFS host controller is behind an IOMMU? I had the impression that UFS controllers would be mostly found in embedded systems and that IOMMUs are not as common there.

Modern phones have an IOMMU. Below one can find an example from a Pixel 6 phone. The UFS storage controller is not controlled by the IOMMU as far as I can see, but I wouldn't be surprised if the security team would ask us one day to enable the IOMMU for the UFS controller.

OK, then unfortunately it seems that you have no method to test. I might be able to test USB MSC, but I am not even sure I can get DMA mappings whose length exceeds the IOVA rcache limit there.

Thanks,
John
Re: [PATCH v2] iommu/dma: Add config for PCI SAC address trick
On 09/06/2022 16:12, Robin Murphy wrote:

For devices stuck behind a conventional PCI bus, saving extra cycles at 33MHz is probably fairly significant. However since native PCI Express is now the norm for high-performance devices, the optimisation to always prefer 32-bit addresses for the sake of avoiding DAC is starting to look rather anachronistic. Technically 32-bit addresses do have shorter TLPs on PCIe, but unless the device is saturating its link bandwidth with small transfers it seems unlikely that the difference is appreciable.

What definitely is appreciable, however, is that the IOVA allocator doesn't behave all that well once the 32-bit space starts getting full. As DMA working sets get bigger, this optimisation increasingly backfires and adds considerable overhead to the dma_map path for use-cases like high-bandwidth networking. We've increasingly bandaged the allocator in attempts to mitigate this, but it remains fundamentally at odds with other valid requirements to try as hard as possible to satisfy a request within the given limit; what we really need is to just avoid this odd notion of a speculative allocation when it isn't beneficial anyway.

Unfortunately that's where things get awkward... Having been present on x86 for 15 years or so now, it turns out there are systems which fail to properly define the upper limit of usable IOVA space for certain devices and this trick was the only thing letting them work OK. I had a similar ulterior motive for a couple of early arm64 systems when originally adding it to iommu-dma, but those really should be fixed with proper firmware bindings by now. Let's be brave and default it to off in the hope that CI systems and developers will find and fix those bugs, but expect that desktop-focused distro configs are likely to want to turn it back on for maximum compatibility.
Signed-off-by: Robin Murphy

FWIW,
Reviewed-by: John Garry

If we're not enabling by default for x86 then doesn't Joerg have some XHCI issue which we would now need to quirk? I don't remember which device exactly. Or, alternatively, simply ask him to enable this new config.

---
v2: Tweak wording to clarify that it's not really an optimisation in general, remove "default X86".

 drivers/iommu/Kconfig     | 26 ++++++++++++++++++++++++++
 drivers/iommu/dma-iommu.c |  2 +-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index c79a0df090c0..5a225b48dd00 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -144,6 +144,32 @@ config IOMMU_DMA
 	select IRQ_MSI_IOMMU
 	select NEED_SG_DMA_LENGTH
 
+config IOMMU_DMA_PCI_SAC
+	bool "Enable 64-bit legacy PCI optimisation by default"
+	depends on IOMMU_DMA
+	help
+	  Enable by default an IOMMU optimisation for 64-bit legacy PCI devices,
+	  wherein the DMA API layer will always first try to allocate a 32-bit
+	  DMA address suitable for a single address cycle, before falling back
+	  to allocating from the device's full usable address range. If your
+	  system has 64-bit legacy PCI devices in 32-bit slots where using dual
+	  address cycles reduces DMA throughput significantly, this may be
+	  beneficial to overall performance.
+
+	  If you have a modern PCI Express based system, this feature mostly just
+	  represents extra overhead in the allocation path for no practical
+	  benefit, and it should usually be preferable to say "n" here.
+
+	  However, beware that this feature has also historically papered over
+	  bugs where the IOMMU address width and/or device DMA mask is not set
+	  correctly. If device DMA problems and IOMMU faults start occurring
+	  after disabling this option, it is almost certainly indicative of a
+	  latent driver or firmware/BIOS bug, which would previously have only
+	  manifested with several gigabytes worth of concurrent DMA mappings.
+
+	  If this option is not set, the feature can still be re-enabled at
+	  boot time with the "iommu.forcedac=0" command-line argument.
+
 # Shared Virtual Addressing
 config IOMMU_SVA
 	bool
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9f9d9ba7f376 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -67,7 +67,7 @@ struct iommu_dma_cookie {
 };
 
 static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
-bool iommu_dma_forcedac __read_mostly;
+bool iommu_dma_forcedac __read_mostly = !IS_ENABLED(CONFIG_IOMMU_DMA_PCI_SAC);
 
 static int __init iommu_dma_forcedac_setup(char *str)
 {
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 09/06/2022 18:18, Bart Van Assche wrote:

SCSI host bus adapters that support 64-bit DMA may support much larger transfer sizes than 128 KiB.

Indeed, and that is my problem today, as my storage controller is generating DMA mapping lengths which exceed 128K and they slow everything down. If you say that SRP enjoys best performance with larger transfers then can you please test this with an IOMMU enabled (iommu group type DMA or DMA-FQ)?

Hmm ... what exactly do you want me to test? Do you perhaps want me to measure how much performance drops with an IOMMU enabled?

Yes, I would like to know of any performance change with an IOMMU enabled, and then with an IOMMU enabled and including my series.

I don't have access anymore to the SRP setup I referred to in my previous email. But I do have access to devices that boot from UFS storage. For these devices we need to transfer 2 MiB per request to achieve full bandwidth.

ok, but do you have a system where the UFS host controller is behind an IOMMU? I had the impression that UFS controllers would be mostly found in embedded systems and that IOMMUs are not as common there.

Thanks,
John
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 08/06/2022 22:07, Bart Van Assche wrote:

On 6/8/22 10:50, John Garry wrote:

Please note that this limit only applies if we have an IOMMU enabled for the scsi host dma device. Otherwise we are limited by dma direct or swiotlb max mapping size, as before.

SCSI host bus adapters that support 64-bit DMA may support much larger transfer sizes than 128 KiB.

Indeed, and that is my problem today, as my storage controller is generating DMA mapping lengths which exceed 128K and they slow everything down. If you say that SRP enjoys best performance with larger transfers then can you please test this with an IOMMU enabled (iommu group type DMA or DMA-FQ)?

Thanks,
John
Re: [PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 08/06/2022 18:33, Bart Van Assche wrote:

On 6/6/22 02:30, John Garry wrote:

+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}

Since IOVA_RANGE_CACHE_MAX_SIZE = 6 this limits max_sectors to 2**6 * PAGE_SIZE or 256 KiB if the page size is 4 KiB.

It's actually 128K for 4K page size, as any IOVA size is rounded up to a power-of-2 when testing if we may cache it, which means anything >128K would round up to 256K and cannot be cached.

I think that's too small. Some (SRP) storage arrays require much larger transfers to achieve optimal performance.

Have you tried to achieve this optimal performance with an IOMMU enabled? Please note that this limit only applies if we have an IOMMU enabled for the scsi host dma device. Otherwise we are limited by dma direct or swiotlb max mapping size, as before.

Thanks,
John
Re: [PATCH v3 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
On 08/06/2022 18:26, Bart Van Assche wrote:

On 6/6/22 02:30, John Garry via iommu wrote:

+unsigned long iova_rcache_range(void)
+{
+	return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}

My understanding is that iova cache entries may be smaller than IOVA_RANGE_CACHE_MAX_SIZE and hence that even if code that uses the DMA mapping API respects this limit, a cache miss can still happen.

Sure, a cache miss may still happen - however once we have stressed the system for a while the rcaches fill up and don't miss often, or often enough to be noticeable compared to not having cached IOVAs at all.

Thanks,
John
Re: [PATCH v3 0/4] DMA mapping changes for SCSI core
On 07/06/2022 23:43, Bart Van Assche wrote:

On 6/6/22 02:30, John Garry wrote:

As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit. This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping. Robin didn't like using dma_max_mapping_size() for this [1]. The SCSI core code is modified to use this limit. I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

Note: Christoph has previously kindly offered to take this series via the dma-mapping tree, so I think that we just need an ack from the IOMMU guys now.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

Regarding [0], that patch reverts commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"). Reading the description of that patch, it seems to me that the iova allocator can be improved. Shouldn't the iova allocator be improved such that we don't need this patch series? There are algorithms that handle fragmentation much better than the current iova allocator algorithm, e.g. the https://en.wikipedia.org/wiki/Buddy_memory_allocation algorithm.

Regardless of whether the IOVA allocator can be improved - which it probably can be - this series is still useful. That is due to the IOVA rcache - a cache of pre-allocated IOVAs which can be quickly used in the DMA mapping. The rcache contains IOVAs up to a certain fixed size. In this series we limit the DMA mapping length to the rcache size upper limit to always bypass the allocator (when we have a cached IOVA available) - see alloc_iova_fast().

Even if the IOVA allocator were greatly optimised for speed, there would still be an overhead in the alloc and free for those larger IOVAs which would outweigh the advantage of having larger DMA mappings. But is there even an advantage in very large streaming DMA mappings? Maybe for iotlb efficiency. But some say it's better to have the DMA engine start processing the data ASAP and not wait for larger lists to be built.

Thanks,
John
[PATCH v3 4/4] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value, overriding the value set earlier in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which include host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
Acked-by: Damien Le Moal
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 42cecf95a4e5..8b4b318f378d 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1060,6 +1060,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
                 dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
         /* configure max sectors */
+        dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
         blk_queue_max_hw_sectors(q, dev->max_sectors);
 
         if (dev->class == ATA_DEV_ATAPI) {
-- 
2.26.2
[PATCH v3 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the request_queue max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

In addition, the shost->max_sectors is repeatedly set for each sdev in __scsi_init_queue(). This is unnecessary, so set it once when adding the host.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
Reviewed-by: Martin K. Petersen
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 8352f90d997d..ea1a207634d1 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -236,6 +236,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
         shost->dma_dev = dma_dev;
 
+        if (dma_dev->dma_mask) {
+                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+                                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+        }
+
         error = scsi_mq_setup_tags(shost);
         if (error)
                 goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6ffc9e4258a8..6ce8acea322a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
                 blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
         }
 
-        if (dev->dma_mask) {
-                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-                                dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-        }
         blk_queue_max_hw_sectors(q, shost->max_sectors);
         blk_queue_segment_boundary(q, shost->dma_boundary);
         dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.26.2
[PATCH v3 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for the DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f90251572a5d..9e1586447ee8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1459,6 +1459,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
         return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+        return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
         .alloc = iommu_dma_alloc,
         .free = iommu_dma_free,
@@ -1479,6 +1484,7 @@
         .map_resource = iommu_dma_map_resource,
         .unmap_resource = iommu_dma_unmap_resource,
         .get_merge_boundary = iommu_dma_get_merge_boundary,
+        .opt_mapping_size = iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+        return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
         struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.26.2
[PATCH v3 1/4] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mapping involving an IOMMU may be much slower for larger total mapping sizes. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 Documentation/core-api/dma-api.rst | 9 +++++++++
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 kernel/dma/mapping.c               | 12 ++++++++++++
 4 files changed, 27 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..b3cd9763d28b 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,15 @@
 Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and others
 should not be larger than the returned value.
 
+::
+
+        size_t
+        dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.
+
 ::
 
         bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
         int (*dma_supported)(struct device *dev, u64 mask);
         u64 (*get_required_mask)(struct device *dev);
         size_t (*max_mapping_size)(struct device *dev);
+        size_t (*opt_mapping_size)(void);
         unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
         return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+        return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
         return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+        const struct dma_map_ops *ops = get_dma_ops(dev);
+        size_t size = SIZE_MAX;
+
+        if (ops && ops->opt_mapping_size)
+                size = ops->opt_mapping_size();
+
+        return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
         const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.26.2
[PATCH v3 0/4] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping. Robin didn't like using dma_max_mapping_size() for this [1].

The SCSI core code is modified to use this limit.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

Note: Christoph has previously kindly offered to take this series via the dma-mapping tree, so I think that we just need an ack from the IOMMU guys now.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

Changes since v2:
- Rebase on v5.19-rc1
- Add Damien's tag to 2/4 (thanks)

Changes since v1:
- Relocate scsi_add_host_with_dma() dma_dev check (Reported by Dan)
- Add tags from Damien and Martin (thanks)
- note: I only added Martin's tag to the SCSI patch

John Garry (4):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
  libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst | 9 +++++++++
 drivers/ata/libata-scsi.c          | 1 +
 drivers/iommu/dma-iommu.c          | 6 ++++++
 drivers/iommu/iova.c               | 5 +++++
 drivers/scsi/hosts.c               | 5 +++++
 drivers/scsi/scsi_lib.c            | 4 ----
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 include/linux/iova.h               | 2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 10 files changed, 46 insertions(+), 4 deletions(-)

-- 
2.26.2
[PATCH v2 4/4] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value, overriding the value set earlier in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which include host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
Acked-by: Damien Le Moal
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 06c9d90238d9..25fe89791641 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1036,6 +1036,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
                 dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
         /* configure max sectors */
+        dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
         blk_queue_max_hw_sectors(q, dev->max_sectors);
 
         if (dev->class == ATA_DEV_ATAPI) {
-- 
2.26.2
[PATCH v2 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the request_queue max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

In addition, the shost->max_sectors is repeatedly set for each sdev in __scsi_init_queue(). This is unnecessary, so set it once when adding the host.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
Reviewed-by: Martin K. Petersen
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..9563c0ac567a 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -240,6 +240,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
         shost->dma_dev = dma_dev;
 
+        if (dma_dev->dma_mask) {
+                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+                                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+        }
+
         /*
          * Increase usage count temporarily here so that calling
          * scsi_autopm_put_host() will trigger runtime idle if there is
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..2d43bb8799bd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
                 blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
         }
 
-        if (dev->dma_mask) {
-                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-                                dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-        }
         blk_queue_max_hw_sectors(q, shost->max_sectors);
         blk_queue_segment_boundary(q, shost->dma_boundary);
         dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.26.2
[PATCH v2 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for the DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 09f6e1c0f9c0..f619e41b9172 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1442,6 +1442,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
         return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+        return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
         .alloc = iommu_dma_alloc,
         .free = iommu_dma_free,
@@ -1462,6 +1467,7 @@
         .map_resource = iommu_dma_map_resource,
         .unmap_resource = iommu_dma_unmap_resource,
         .get_merge_boundary = iommu_dma_get_merge_boundary,
+        .opt_mapping_size = iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+        return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
         struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.26.2
[PATCH v2 0/4] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping. Robin didn't like using dma_max_mapping_size() for this [1].

The SCSI core code is modified to use this limit.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

Changes since v1:
- Relocate scsi_add_host_with_dma() dma_dev check (Reported by Dan)
- Add tags from Damien and Martin (thanks)
- note: I only added Martin's tag to the SCSI patch

John Garry (4):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
  libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst | 9 +++++++++
 drivers/ata/libata-scsi.c          | 1 +
 drivers/iommu/dma-iommu.c          | 6 ++++++
 drivers/iommu/iova.c               | 5 +++++
 drivers/scsi/hosts.c               | 5 +++++
 drivers/scsi/scsi_lib.c            | 4 ----
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 include/linux/iova.h               | 2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 10 files changed, 46 insertions(+), 4 deletions(-)

-- 
2.26.2
[PATCH v2 1/4] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mapping involving an IOMMU may be much slower for larger total mapping sizes. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance.

Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it.

Signed-off-by: John Garry
Reviewed-by: Damien Le Moal
---
 Documentation/core-api/dma-api.rst | 9 +++++++++
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 kernel/dma/mapping.c               | 12 ++++++++++++
 4 files changed, 27 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst
index 6d6d0edd2d27..b3cd9763d28b 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -204,6 +204,15 @@
 Returns the maximum size of a mapping for the device. The size parameter
 of the mapping functions like dma_map_single(), dma_map_page() and others
 should not be larger than the returned value.
 
+::
+
+        size_t
+        dma_opt_mapping_size(struct device *dev);
+
+Returns the maximum optimal size of a mapping for the device. Mapping large
+buffers may take longer so device drivers are advised to limit total DMA
+streaming mappings length to the returned value.
+
 ::
 
         bool
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d5b06b3a4a6..98ceba6fa848 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -69,6 +69,7 @@ struct dma_map_ops {
         int (*dma_supported)(struct device *dev, u64 mask);
         u64 (*get_required_mask)(struct device *dev);
         size_t (*max_mapping_size)(struct device *dev);
+        size_t (*opt_mapping_size)(void);
         unsigned long (*get_merge_boundary)(struct device *dev);
 };
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index dca2b1355bb1..fe3849434b2a 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask);
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 u64 dma_get_required_mask(struct device *dev);
 size_t dma_max_mapping_size(struct device *dev);
+size_t dma_opt_mapping_size(struct device *dev);
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
 unsigned long dma_get_merge_boundary(struct device *dev);
 struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size,
@@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev)
 {
         return 0;
 }
+static inline size_t dma_opt_mapping_size(struct device *dev)
+{
+        return 0;
+}
 static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
         return false;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index db7244291b74..1bfe11b1edb6 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dma_max_mapping_size);
 
+size_t dma_opt_mapping_size(struct device *dev)
+{
+        const struct dma_map_ops *ops = get_dma_ops(dev);
+        size_t size = SIZE_MAX;
+
+        if (ops && ops->opt_mapping_size)
+                size = ops->opt_mapping_size();
+
+        return min(dma_max_mapping_size(dev), size);
+}
+EXPORT_SYMBOL_GPL(dma_opt_mapping_size);
+
 bool dma_need_sync(struct device *dev, dma_addr_t dma_addr)
 {
         const struct dma_map_ops *ops = get_dma_ops(dev);
-- 
2.26.2
Re: [PATCH 0/4] DMA mapping changes for SCSI core
On 22/05/2022 23:22, Damien Le Moal wrote:
On 2022/05/22 22:13, Christoph Hellwig wrote:
The whole series looks fine to me. I'll happily queue it up in the dma-mapping tree if the SCSI and ATA maintainers are ok with that.

Fine with me. I sent an acked-by for the libata bit.

Thanks. I'm going to have to post a v2, and I figure that with the timing I'll have to wait for v5.20 now.

Cheers,
John
Re: [PATCH 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 23/05/2022 12:08, Dan Carpenter wrote:

Thanks for the report

50b6cb3516365c Dexuan Cui        2021-10-07  224	/* Use min_t(int, ...) in case shost->can_queue exceeds SHRT_MAX */
50b6cb3516365c Dexuan Cui        2021-10-07  225	shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
ea2f0f77538c50 John Garry        2021-05-19  226				   shost->can_queue);
ea2f0f77538c50 John Garry        2021-05-19  227
2ad7ba6ca08593 John Garry        2022-05-20 @228	if (dma_dev->dma_mask) {
							    ^

I knew that we fixed up dma_dev to be non-NULL, but I thought it was earlier in this function...

The patch adds a new unchecked dereference

2ad7ba6ca08593 John Garry        2022-05-20  229		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
2ad7ba6ca08593 John Garry        2022-05-20  230				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
2ad7ba6ca08593 John Garry        2022-05-20  231	}
2ad7ba6ca08593 John Garry        2022-05-20  232
0a6ac4ee7c2109 Christoph Hellwig 2017-01-03  233	error = scsi_init_sense_cache(shost);
0a6ac4ee7c2109 Christoph Hellwig 2017-01-03  234	if (error)
0a6ac4ee7c2109 Christoph Hellwig 2017-01-03  235		goto fail;
0a6ac4ee7c2109 Christoph Hellwig 2017-01-03  236
d285203cf647d7 Christoph Hellwig 2014-01-17  237	error = scsi_mq_setup_tags(shost);
542bd1377a9630 James Bottomley   2008-04-21  238	if (error)
542bd1377a9630 James Bottomley   2008-04-21  239		goto fail;
d285203cf647d7 Christoph Hellwig 2014-01-17  240
^1da177e4c3f41 Linus Torvalds    2005-04-16  241	if (!shost->shost_gendev.parent)
^1da177e4c3f41 Linus Torvalds    2005-04-16  242		shost->shost_gendev.parent = dev ? dev : &platform_bus;
3c8d9a957d0ae6 James Bottomley   2012-05-04 @243	if (!dma_dev)

Cheers,
John
Re: [PATCH 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
On 21/05/2022 00:33, Damien Le Moal wrote:

Hi Damien,

+unsigned long iova_rcache_range(void)

Why not a size_t return type?

The IOVA code generally uses unsigned long for size/range while dma-iommu uses size_t as appropriate, so I'm just sticking to that.

+{
+        return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+

Thanks,
John
Re: [PATCH 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
On 21/05/2022 00:30, Damien Le Moal wrote:

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..a3ae6345473b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -225,6 +225,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
         shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
                                    shost->can_queue);

Hi Damien,

+        if (dma_dev->dma_mask) {
+                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+                                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+        }

Nit: you could drop the curly brackets here.

Some people prefer this style - multi-line statements have curly brackets, while single-line statements conform to the official coding style (and don't use brackets). I'll just stick with what we have unless there is a consensus to change.

Thanks,
John

+
         error = scsi_init_sense_cache(shost);
         if (error)
                 goto fail;
[PATCH 4/4] libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors
ATA devices (struct ata_device) have a max_sectors field which is configured internally in libata. This is then used to (re)configure the associated sdev request queue max_sectors value, overriding the value set earlier in __scsi_init_queue(). In __scsi_init_queue() the max_sectors value is set according to shost limits, which include host DMA mapping limits.

Cap the ata_device max_sectors according to shost->max_sectors to respect this shost limit.

Signed-off-by: John Garry
---
 drivers/ata/libata-scsi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 06c9d90238d9..25fe89791641 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1036,6 +1036,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct ata_device *dev)
                 dev->flags |= ATA_DFLAG_NO_UNLOAD;
 
         /* configure max sectors */
+        dev->max_sectors = min(dev->max_sectors, sdev->host->max_sectors);
         blk_queue_max_hw_sectors(q, dev->max_sectors);
 
         if (dev->class == ATA_DEV_ATAPI) {
-- 
2.26.2
[PATCH 3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
Streaming DMA mappings may be considerably slower when mappings go through an IOMMU and the total mapping length is somewhat long. This is because the IOMMU IOVA code allocates and frees an IOVA for each mapping, which may affect performance.

For performance reasons set the request_queue max_sectors from dma_opt_mapping_size(), which knows this mapping limit.

In addition, the shost->max_sectors is repeatedly set for each sdev in __scsi_init_queue(). This is unnecessary, so set it once when adding the host.

Signed-off-by: John Garry
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..a3ae6345473b 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -225,6 +225,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
         shost->cmd_per_lun = min_t(int, shost->cmd_per_lun,
                                    shost->can_queue);
 
+        if (dma_dev->dma_mask) {
+                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+                                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+        }
+
         error = scsi_init_sense_cache(shost);
         if (error)
                 goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..2d43bb8799bd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
                 blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
         }
 
-        if (dev->dma_mask) {
-                shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-                                dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-        }
         blk_queue_max_hw_sectors(q, shost->max_sectors);
         blk_queue_segment_boundary(q, shost->dma_boundary);
         dma_set_seg_boundary(dev, shost->dma_boundary);
-- 
2.26.2
[PATCH 2/4] dma-iommu: Add iommu_dma_opt_mapping_size()
Add the IOMMU callback for the DMA mapping API dma_opt_mapping_size(), which allows drivers to know the optimal mapping limit and thus limit the requested IOVA lengths.

This value is based on the IOVA rcache range limit, as IOVAs allocated above this limit must always be newly allocated, which may be quite slow.

Signed-off-by: John Garry
---
 drivers/iommu/dma-iommu.c | 6 ++++++
 drivers/iommu/iova.c      | 5 +++++
 include/linux/iova.h      | 2 ++
 3 files changed, 13 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 09f6e1c0f9c0..f619e41b9172 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1442,6 +1442,11 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev)
         return (1UL << __ffs(domain->pgsize_bitmap)) - 1;
 }
 
+static size_t iommu_dma_opt_mapping_size(void)
+{
+        return iova_rcache_range();
+}
+
 static const struct dma_map_ops iommu_dma_ops = {
         .alloc = iommu_dma_alloc,
         .free = iommu_dma_free,
@@ -1462,6 +1467,7 @@
         .map_resource = iommu_dma_map_resource,
         .unmap_resource = iommu_dma_unmap_resource,
         .get_merge_boundary = iommu_dma_get_merge_boundary,
+        .opt_mapping_size = iommu_dma_opt_mapping_size,
 };
 
 /*
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index db77aa675145..9f00b58d546e 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad,
 static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
 static void free_iova_rcaches(struct iova_domain *iovad);
 
+unsigned long iova_rcache_range(void)
+{
+        return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
+}
+
 static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
 {
         struct iova_domain *iovad;
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 320a70e40233..c6ba6d95d79c 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova)
 int iova_cache_get(void);
 void iova_cache_put(void);
 
+unsigned long iova_rcache_range(void);
+
 void free_iova(struct iova_domain *iovad, unsigned long pfn);
 void __free_iova(struct iova_domain *iovad, struct iova *iova);
 struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size,
-- 
2.26.2
[PATCH 0/4] DMA mapping changes for SCSI core
As reported in [0], DMA mappings whose size exceeds the IOMMU IOVA caching limit may see a big performance hit.

This series introduces a new DMA mapping API, dma_opt_mapping_size(), so that drivers may know this limit when performance is a factor in the mapping. Robin didn't like using dma_max_mapping_size() for this [1].

An alternative to adding the new API would be to add a "hard_limit" arg to dma_max_mapping_size(). This would mean fixing up all current users, but it would be good to do that anyway as not all users require a hard limit.

The SCSI core code is modified to use this limit.

I also added a patch for libata-scsi as it does not currently honour the shost max_sectors limit.

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/f5b78c9c-312e-70ab-ecbb-f14623a4b...@arm.com/

John Garry (4):
  dma-mapping: Add dma_opt_mapping_size()
  dma-iommu: Add iommu_dma_opt_mapping_size()
  scsi: core: Cap shost max_sectors according to DMA optimum mapping limits
  libata-scsi: Cap ata_device->max_sectors according to shost->max_sectors

 Documentation/core-api/dma-api.rst | 9 +++++++++
 drivers/ata/libata-scsi.c          | 1 +
 drivers/iommu/dma-iommu.c          | 6 ++++++
 drivers/iommu/iova.c               | 5 +++++
 drivers/scsi/hosts.c               | 5 +++++
 drivers/scsi/scsi_lib.c            | 4 ----
 include/linux/dma-map-ops.h        | 1 +
 include/linux/dma-mapping.h        | 5 +++++
 include/linux/iova.h               | 2 ++
 kernel/dma/mapping.c               | 12 ++++++++++++
 10 files changed, 46 insertions(+), 4 deletions(-)

-- 
2.26.2
[PATCH 1/4] dma-mapping: Add dma_opt_mapping_size()
Streaming DMA mappings involving an IOMMU may be much slower for larger total mapping sizes. This is because every IOMMU DMA mapping requires an IOVA to be allocated and freed. IOVA sizes above a certain limit are not cached, which can have a big impact on DMA mapping performance. Provide an API for device drivers to know this "optimal" limit, such that they may try to produce mappings which don't exceed it. Signed-off-by: John Garry --- Documentation/core-api/dma-api.rst | 9 + include/linux/dma-map-ops.h| 1 + include/linux/dma-mapping.h| 5 + kernel/dma/mapping.c | 12 4 files changed, 27 insertions(+) diff --git a/Documentation/core-api/dma-api.rst b/Documentation/core-api/dma-api.rst index 6d6d0edd2d27..b3cd9763d28b 100644 --- a/Documentation/core-api/dma-api.rst +++ b/Documentation/core-api/dma-api.rst @@ -204,6 +204,15 @@ Returns the maximum size of a mapping for the device. The size parameter of the mapping functions like dma_map_single(), dma_map_page() and others should not be larger than the returned value. +:: + + size_t + dma_opt_mapping_size(struct device *dev); + +Returns the maximum optimal size of a mapping for the device. Mapping large +buffers may take longer so device drivers are advised to limit total DMA +streaming mappings length to the returned value. 
+ :: bool diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h index 0d5b06b3a4a6..98ceba6fa848 100644 --- a/include/linux/dma-map-ops.h +++ b/include/linux/dma-map-ops.h @@ -69,6 +69,7 @@ struct dma_map_ops { int (*dma_supported)(struct device *dev, u64 mask); u64 (*get_required_mask)(struct device *dev); size_t (*max_mapping_size)(struct device *dev); + size_t (*opt_mapping_size)(void); unsigned long (*get_merge_boundary)(struct device *dev); }; diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index dca2b1355bb1..fe3849434b2a 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -144,6 +144,7 @@ int dma_set_mask(struct device *dev, u64 mask); int dma_set_coherent_mask(struct device *dev, u64 mask); u64 dma_get_required_mask(struct device *dev); size_t dma_max_mapping_size(struct device *dev); +size_t dma_opt_mapping_size(struct device *dev); bool dma_need_sync(struct device *dev, dma_addr_t dma_addr); unsigned long dma_get_merge_boundary(struct device *dev); struct sg_table *dma_alloc_noncontiguous(struct device *dev, size_t size, @@ -266,6 +267,10 @@ static inline size_t dma_max_mapping_size(struct device *dev) { return 0; } +static inline size_t dma_opt_mapping_size(struct device *dev) +{ + return 0; +} static inline bool dma_need_sync(struct device *dev, dma_addr_t dma_addr) { return false; diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c index db7244291b74..1bfe11b1edb6 100644 --- a/kernel/dma/mapping.c +++ b/kernel/dma/mapping.c @@ -773,6 +773,18 @@ size_t dma_max_mapping_size(struct device *dev) } EXPORT_SYMBOL_GPL(dma_max_mapping_size); +size_t dma_opt_mapping_size(struct device *dev) +{ + const struct dma_map_ops *ops = get_dma_ops(dev); + size_t size = SIZE_MAX; + + if (ops && ops->opt_mapping_size) + size = ops->opt_mapping_size(); + + return min(dma_max_mapping_size(dev), size); +} +EXPORT_SYMBOL_GPL(dma_opt_mapping_size); + bool dma_need_sync(struct device *dev, dma_addr_t 
dma_addr) { const struct dma_map_ops *ops = get_dma_ops(dev); -- 2.26.2
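The fallback logic of the new helper in kernel/dma/mapping.c is simple enough to model in userspace: if the dma_map_ops provides opt_mapping_size(), the result is clamped by the hard max_mapping_size(); otherwise the hard limit alone is returned. The struct and function names below are simplified models, not the kernel's real types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of struct dma_map_ops with just the two callbacks */
struct dma_map_ops_model {
	size_t (*max_mapping_size)(void);
	size_t (*opt_mapping_size)(void);
};

static size_t dma_max_mapping_size_model(const struct dma_map_ops_model *ops)
{
	/* no hard-limit callback means "no limit" */
	return (ops && ops->max_mapping_size) ? ops->max_mapping_size() : SIZE_MAX;
}

/* Mirrors dma_opt_mapping_size(): min(hard limit, optional opt limit) */
static size_t dma_opt_mapping_size_model(const struct dma_map_ops_model *ops)
{
	size_t size = SIZE_MAX;
	size_t max;

	if (ops && ops->opt_mapping_size)
		size = ops->opt_mapping_size();

	max = dma_max_mapping_size_model(ops);
	return size < max ? size : max;
}

static size_t opt_128k(void)
{
	return 128 * 1024;	/* assumed IOVA rcache ceiling */
}
```

Note the deliberate asymmetry: a missing opt_mapping_size() degrades gracefully to the hard limit, so callers can use the new API unconditionally.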
Re: [PATCH] iommu/dma: Add config for PCI SAC address trick
On 18/05/2022 18:36, Robin Murphy wrote: For devices stuck behind a conventional PCI bus, saving extra cycles at 33MHz is probably fairly significant. However since native PCI Express is now the norm for high-performance devices, the optimisation to always prefer 32-bit addresses for the sake of avoiding DAC is starting to look rather anachronistic. Technically 32-bit addresses do have shorter TLPs on PCIe, but unless the device is saturating its link bandwidth with small transfers it seems unlikely that the difference is appreciable. What definitely is appreciable, however, is that the IOVA allocator doesn't behave all that well once the 32-bit space starts getting full. As DMA working sets get bigger, this optimisation increasingly backfires and adds considerable overhead to the dma_map path for use-cases like high-bandwidth networking. We've increasingly bandaged the allocator in attempts to mitigate this, but it remains fundamentally at odds with other valid requirements to try as hard as possible to satisfy a request within the given limit; what we really need is to just avoid this odd notion of a speculative allocation when it isn't beneficial anyway. Unfortunately that's where things get awkward... Having been present on x86 for 15 years or so now, it turns out there are systems which fail to properly define the upper limit of usable IOVA space for certain devices and this trick was the only thing letting them work OK. I had a similar ulterior motive for a couple of early arm64 systems when originally adding it to iommu-dma, but those really should now be fixed with proper firmware bindings, and other arm64 users really need it out of the way, so let's just leave it default-on for x86. 
Signed-off-by: Robin Murphy --- drivers/iommu/Kconfig | 24 drivers/iommu/dma-iommu.c | 2 +- It might be worth printing this default value always and not just for when it is set from commandline, like what we do for default domain type and IOTLB invalidation policy 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index c79a0df090c0..bf9b295f1c89 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -144,6 +144,30 @@ config IOMMU_DMA select IRQ_MSI_IOMMU select NEED_SG_DMA_LENGTH +config IOMMU_DMA_PCI_SAC_OPT + bool "Enable 64-bit legacy PCI optimisation by default" + depends on IOMMU_DMA + default X86 Do we have a strategy for if and when issues start popping up on other architectures? Is it to simply tell them to just turn this flag on (and also fix your platform)? + help + Enable by default an IOMMU optimisation for 64-bit legacy PCI devices, + wherein the DMA API layer will always first try to allocate a 32-bit + DMA address suitable for a single address cycle, before falling back + to allocating from the full usable address range. If your system has + 64-bit legacy PCI devices in 32-bit slots where using dual address + cycles reduces DMA throughput significantly, this optimisation may be + beneficial to overall performance. + + If you have a modern PCI Express based system, it should usually be + safe to say "n" here and avoid the potential extra allocation overhead. + However, beware that this optimisation has also historically papered + over bugs where the IOMMU address range above 32 bits is not fully + usable. If device DMA problems and/or IOMMU faults start occurring + with IOMMU translation enabled after disabling this option, it is + likely a sign of a latent driver or firmware/BIOS bug. + + If this option is not set, the optimisation can be enabled at + boot time with the "iommu.forcedac=0" command-line argument. 
+ Thanks, John
Re: [RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()
On 17/05/2022 13:02, Robin Murphy wrote: Indeed, sorry but NAK for this being nonsense. As I've said at least once before, if the unnecessary SAC address allocation attempt slows down your workload, make it not do that in the first place. If you don't like the existing command-line parameter then fine, > there are plenty of other options, it just needs to be done in a way that doesn't break x86 systems with dodgy firmware, as my first attempt turned out to. Sorry, but I am not interested in this. It was discussed in Jan last year without any viable solution. Er, OK, if you're not interested in solving that problem I don't see why you'd bring it up, but hey ho. *I* still think it's important, so I guess I'll revive my old patch with a CONFIG_X86 bodge and have another go at some point. Let me rephrase, I would be happy to help fix that problem if we really can get it fixed, however for my problem it's better to try to get the SCSI driver to stop requesting uncached IOVAs foremost. Anyway we still have the long-term IOVA aging issue, and requesting non-cached IOVAs is involved in that. So I would rather keep the SCSI driver to requesting cached IOVAs all the time. I did try to do it the other way around - configuring the IOVA caching range according to the drivers requirement but that got nowhere. Note that this is still not a final solution as it's not always viable to ask a user to unbind + bind the driver. FWIW I thought that all looked OK, it just kept getting drowned out by more critical things in my inbox so I hoped someone else might comment. If it turns out that I've become the de-facto IOVA maintainer in everyone else's minds now and they're all waiting for my word then fair enough, I just need to know and reset my expectations accordingly. Joerg? It would be great to see an improvement here... Furthermore, if a particular SCSI driver doesn't benefit from mappings larger than 256KB, then that driver is also free to limit its own mapping size. 
There are other folks out there with use-cases for mapping *gigabytes* at once; you don't get to cripple the API and say that that's suddenly not allowed just because it happens to make your thing go faster, that's absurd. I'd say less catastrophically slow, not faster. So how to inform the SCSI driver of this caching limit then so that it may limit the SGL length? Driver-specific mechanism; block-layer-specific mechanism; redefine this whole API to something like dma_opt_mapping_size(), as a limit above which mappings might become less efficient or start to fail (callback to my thoughts on [1] as well, I suppose); many options. ok, fine. Just not imposing a ridiculously low *maximum* on everyone wherein mapping calls "should not be larger than the returned value" when that's clearly bollocks. I agree that this change is in violation as the documentation clearly implies a hard limit. However, FWIW, from looking at users of dma_max_mapping_size(), they seem to use it in a similar way to the SCSI/block layer core, i.e. use this value to limit the max SGL total len per IO command. And not as a method to learn max DMA consistent mappings size for ring buffers, etc. Anyway I'll look at dma_opt_mapping_size() but I am not sure how keen Christoph will be on this... it is strange to introduce that API due to a peculiarity of the IOVA allocator. [1] https://lore.kernel.org/linux-iommu/20220510142109.38-1-ltyker...@gmail.com/ Thanks, John
Re: [RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()
On 17/05/2022 11:40, Robin Murphy wrote: On 2022-05-16 14:06, John Garry wrote: For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. Add the IOMMU callback for DMA mapping API dma_max_mapping_size(), which allows the drivers to know the mapping limit and thus limit the requested IOVA lengths. This resolves the performance issue originally reported in [0] for a SCSI HBA driver which was regularly mapping SGLs which required IOVAs in excess of the IOVA caching limit. In this case the block layer limits the max sectors per request - as configured in __scsi_init_queue() - which will limit the total SGL length the driver tries to map and in turn limits IOVA lengths requested. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- Sending as an RFC as iommu_dma_max_mapping_size() is a soft limit, and not a hard limit which I expect is the semantics of dma_map_ops.max_mapping_size Indeed, sorry but NAK for this being nonsense. As I've said at least once before, if the unnecessary SAC address allocation attempt slows down your workload, make it not do that in the first place. If you don't like the existing command-line parameter then fine, > there are plenty of other options, it just needs to be done in a way that doesn't break x86 systems with dodgy firmware, as my first attempt turned out to. Sorry, but I am not interested in this. It was discussed in Jan last year without any viable solution. Anyway we still have the long-term IOVA aging issue, and requesting non-cached IOVAs is involved in that. So I would rather keep the SCSI driver to requesting cached IOVAs all the time. I did try to do it the other way around - configuring the IOVA caching range according to the drivers requirement but that got nowhere. 
Furthermore, if a particular SCSI driver doesn't benefit from mappings larger than 256KB, then that driver is also free to limit its own mapping size. There are other folks out there with use-cases for mapping *gigabytes* at once; you don't get to cripple the API and say that that's suddenly not allowed just because it happens to make your thing go faster, that's absurd. I'd say less catastrophically slow, not faster. So how to inform the SCSI driver of this caching limit then so that it may limit the SGL length? Thanks, John
Re: [RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()
On 17/05/2022 09:38, Christoph Hellwig wrote: On Mon, May 16, 2022 at 09:06:01PM +0800, John Garry wrote: For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. Add the IOMMU callback for DMA mapping API dma_max_mapping_size(), which allows the drivers to know the mapping limit and thus limit the requested IOVA lengths. This resolves the performance issue originally reported in [0] for a SCSI HBA driver which was regularly mapping SGLs which required IOVAs in excess of the IOVA caching limit. In this case the block layer limits the max sectors per request - as configured in __scsi_init_queue() - which will limit the total SGL length the driver tries to map and in turn limits IOVA lengths requested. BTW, on a separate topic, I noticed that even with this change my ATA devices have max_hw_sectors_kb of 32767, as opposed to 128 for SAS devices. It seems that libata-scsi - specifically ata_scsi_dev_config() - doesn't honour the shost max_sectors limit. I guess that is not intentional. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- Sending as an RFC as iommu_dma_max_mapping_size() is a soft limit, and not a hard limit which I expect is the semantics of dma_map_ops.max_mapping_size diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 09f6e1c0f9c0..e2d5205cde37 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1442,6 +1442,21 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev) return (1UL << __ffs(domain->pgsize_bitmap)) - 1; } + if (!domain) + return 0; + + cookie = domain->iova_cookie; + if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) + return 0; Can these conditions even be true here? I don't think so. Paranoia on my part. 
+static inline unsigned long iova_rcache_range(void) +{ + return 0; +} Given that IOMMU_DMA selects IOMMU_IOVA there is no need for this stub. hmmm.. ok. Policy was to stub everything but I think that it has changed. Otherwise this looks sensible to me. Great, thanks.
Re: [PATCH v8 2/8] hwtracing: hisi_ptt: Add trace function support for HiSilicon PCIe Tune and Trace device
On 17/05/2022 09:09, Yicong Yang wrote: + target = cpumask_any(cpumask_of_node(dev_to_node(&hisi_ptt->pdev->dev))); + if (target < nr_cpumask_bits) { the comment for cpumask_any() hints to check against nr_cpu_ids - any specific reason to check against nr_cpumask_bits? here should be: if (target >= nr_cpumask_bits) { will fix this up. I am still not sure that using nr_cpumask_bits is correct.
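The point under review is the failure convention: cpumask_any() returns a bit index when a CPU is found and a value >= the mask width when none is, which is why the caller should test for failure against the advertised CPU count rather than treating any in-range value as success. A toy userspace model of that convention (names here are illustrative, not the kernel API):

```c
#include <assert.h>

/*
 * Toy model of cpumask_any(): return the index of the first set bit,
 * or nbits when the mask is empty. Callers must treat any return
 * value >= nbits as "no CPU found" -- mirroring the kernel pattern of
 * checking against nr_cpu_ids rather than assuming success.
 */
static unsigned int mask_any(unsigned long mask, unsigned int nbits)
{
	unsigned int i;

	for (i = 0; i < nbits; i++) {
		if (mask & (1UL << i))
			return i;
	}

	return nbits;	/* empty mask: sentinel value, not a valid CPU */
}
```

So the corrected check in the patch under review tests the >= case and bails out, instead of silently using an out-of-range index.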
Re: [PATCH v8 4/8] perf arm: Refactor event list iteration in auxtrace_record__init()
On 16/05/2022 13:52, Yicong Yang wrote: As requested before, please mention "perf tool" in the commit subject From: Qi Liu Use find_pmu_for_event() to simplify logic in auxtrace_record__init(). Signed-off-by: Qi Liu Signed-off-by: Yicong Yang --- tools/perf/arch/arm/util/auxtrace.c | 53 ++--- 1 file changed, 34 insertions(+), 19 deletions(-) diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c index 5fc6a2a3dbc5..384c7cfda0fd 100644 --- a/tools/perf/arch/arm/util/auxtrace.c +++ b/tools/perf/arch/arm/util/auxtrace.c @@ -50,16 +50,32 @@ static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) return arm_spe_pmus; } +static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus, + int pmu_nr, struct evsel *evsel) +{ + int i; + + if (!pmus) + return NULL; + + for (i = 0; i < pmu_nr; i++) { + if (evsel->core.attr.type == pmus[i]->type) + return pmus[i]; + } + + return NULL; +} + struct auxtrace_record *auxtrace_record__init(struct evlist *evlist, int *err) { - struct perf_pmu *cs_etm_pmu; + struct perf_pmu *cs_etm_pmu = NULL; + struct perf_pmu **arm_spe_pmus = NULL; struct evsel *evsel; - bool found_etm = false; + struct perf_pmu *found_etm = NULL; struct perf_pmu *found_spe = NULL; - struct perf_pmu **arm_spe_pmus = NULL; + int auxtrace_event_cnt = 0; int nr_spes = 0; - int i = 0; if (!evlist) return NULL; @@ -68,24 +84,23 @@ struct auxtrace_record arm_spe_pmus = find_all_arm_spe_pmus(&nr_spes, err); evlist__for_each_entry(evlist, evsel) { - if (cs_etm_pmu && - evsel->core.attr.type == cs_etm_pmu->type) - found_etm = true; - - if (!nr_spes || found_spe) - continue; - - for (i = 0; i < nr_spes; i++) { - if (evsel->core.attr.type == arm_spe_pmus[i]->type) { - found_spe = arm_spe_pmus[i]; - break; - } - } + if (cs_etm_pmu && !found_etm) + found_etm = find_pmu_for_event(&cs_etm_pmu, 1, evsel); + + if (arm_spe_pmus && !found_spe) + found_spe = find_pmu_for_event(arm_spe_pmus, nr_spes, evsel); } + free(arm_spe_pmus); - if 
(found_etm && found_spe) { - pr_err("Concurrent ARM Coresight ETM and SPE operation not currently supported\n"); + if (found_etm) + auxtrace_event_cnt++; + + if (found_spe) + auxtrace_event_cnt++; + + if (auxtrace_event_cnt > 1) { + pr_err("Concurrent AUX trace operation not currently supported\n"); *err = -EOPNOTSUPP; return NULL; }
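The refactor above factors the per-event PMU lookup into one helper. A self-contained userspace sketch of that helper, with the perf structures reduced to hypothetical minimal stand-ins, shows the intended behavior:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for perf's structures (illustrative field subset) */
struct perf_pmu {
	unsigned int type;
};

struct evsel {
	unsigned int attr_type;	/* models evsel->core.attr.type */
};

/*
 * Mirrors the refactored find_pmu_for_event(): scan an array of PMUs
 * for one whose type matches the event's attribute type; NULL if the
 * array is absent or nothing matches.
 */
static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus,
					   int pmu_nr, struct evsel *evsel)
{
	int i;

	if (!pmus)
		return NULL;

	for (i = 0; i < pmu_nr; i++) {
		if (evsel->attr_type == pmus[i]->type)
			return pmus[i];
	}

	return NULL;
}
```

With this shape, the ETM case is just a one-element array (hence the `&cs_etm_pmu, 1` call in the patch), and the SPE case passes the discovered PMU list, removing the open-coded inner loop.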
Re: [PATCH v8 3/8] hwtracing: hisi_ptt: Add tune function support for HiSilicon PCIe Tune and Trace device
On 16/05/2022 13:52, Yicong Yang wrote: Add tune function for the HiSilicon Tune and Trace device. The interface of tune is exposed through sysfs attributes of PTT PMU device. Signed-off-by: Yicong Yang Reviewed-by: Jonathan Cameron Apart from a comment on preferential style: Reviewed-by: John Garry --- drivers/hwtracing/ptt/hisi_ptt.c | 157 +++ drivers/hwtracing/ptt/hisi_ptt.h | 23 + 2 files changed, 180 insertions(+) diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c index ef25ce98f664..c3fdb9bfb1b4 100644 --- a/drivers/hwtracing/ptt/hisi_ptt.c +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -25,6 +25,161 @@ /* Dynamic CPU hotplug state used by PTT */ static enum cpuhp_state hisi_ptt_pmu_online; +static bool hisi_ptt_wait_tuning_finish(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout(hisi_ptt->iobase + HISI_PTT_TUNING_INT_STAT, + val, !(val & HISI_PTT_TUNING_INT_STAT_MASK), + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TUNE_TIMEOUT_US); +} + +static int hisi_ptt_tune_data_get(struct hisi_ptt *hisi_ptt, + u32 event, u16 *data) this only has 1x caller so may inline it +{ + u32 reg; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + reg &= ~(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB); + reg |= FIELD_PREP(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB, + event); + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + + /* Write all 1 to indicates it's the read process */ + writel(~0U, hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + + if (!hisi_ptt_wait_tuning_finish(hisi_ptt)) + return -ETIMEDOUT; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + reg &= HISI_PTT_TUNING_DATA_VAL_MASK; + *data = FIELD_GET(HISI_PTT_TUNING_DATA_VAL_MASK, reg); + + return 0; +} + +static int hisi_ptt_tune_data_set(struct hisi_ptt *hisi_ptt, + u32 event, u16 data) again only 1x caller +{ + u32 reg; + + reg = readl(hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + reg &= ~(HISI_PTT_TUNING_CTRL_CODE | 
HISI_PTT_TUNING_CTRL_SUB); + reg |= FIELD_PREP(HISI_PTT_TUNING_CTRL_CODE | HISI_PTT_TUNING_CTRL_SUB, + event); + writel(reg, hisi_ptt->iobase + HISI_PTT_TUNING_CTRL); + + writel(FIELD_PREP(HISI_PTT_TUNING_DATA_VAL_MASK, data), + hisi_ptt->iobase + HISI_PTT_TUNING_DATA); + + if (!hisi_ptt_wait_tuning_finish(hisi_ptt)) + return -ETIMEDOUT; + + return 0; +} + +static ssize_t hisi_ptt_tune_attr_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct dev_ext_attribute *ext_attr; + struct hisi_ptt_tune_desc *desc; + int ret; + u16 val; + + ext_attr = container_of(attr, struct dev_ext_attribute, attr); + desc = ext_attr->var; + + mutex_lock(&hisi_ptt->tune_lock); + ret = hisi_ptt_tune_data_get(hisi_ptt, desc->event_code, &val); + mutex_unlock(&hisi_ptt->tune_lock); + + if (ret) + return ret; + + return sysfs_emit(buf, "%u\n", val); +} + +static ssize_t hisi_ptt_tune_attr_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct dev_ext_attribute *ext_attr; + struct hisi_ptt_tune_desc *desc; + int ret; + u16 val; + + ext_attr = container_of(attr, struct dev_ext_attribute, attr); + desc = ext_attr->var; + + if (kstrtou16(buf, 10, &val)) + return -EINVAL; + + mutex_lock(&hisi_ptt->tune_lock); + ret = hisi_ptt_tune_data_set(hisi_ptt, desc->event_code, val); + mutex_unlock(&hisi_ptt->tune_lock); + + if (ret) + return ret; + + return count; +} + +#define HISI_PTT_TUNE_ATTR(_name, _val, _show, _store) \ + static struct hisi_ptt_tune_desc _name##_desc = { \ + .name = #_name, \ + .event_code = _val, \ + }; \ + static struct dev_ext_attribute hisi_ptt_##_name##_attr = { \ + .attr = __ATTR(_name, 0600, _show, _store), \ + .var=
Re: [PATCH v8 2/8] hwtracing: hisi_ptt: Add trace function support for HiSilicon PCIe Tune and Trace device
On 16/05/2022 13:52, Yicong Yang wrote: HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic and trace the TLP headers. Add the driver for the device to enable the trace function. Register PMU device of PTT trace, then users can use trace through perf command. The driver makes use of perf AUX trace function and support the following events to configure the trace: - filter: select Root port or Endpoint to trace - type: select the type of traced TLP headers - direction: select the direction of traced TLP headers - format: select the data format of the traced TLP headers This patch initially add a basic driver of PTT trace. Initially add basic trace support. Signed-off-by: Yicong Yang Generally this looks ok, apart from nitpicking below, so, FWIW: Reviewed-by: John Garry --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 12 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 964 +++ drivers/hwtracing/ptt/hisi_ptt.h | 178 ++ 6 files changed, 1159 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index 020780b6b4d2..662d50599467 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source 
"drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..6d46a09ffeb9 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS + help + HiSilicon PCIe Tune and Trace device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..ef25ce98f664 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,964 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +/* Dynamic CPU hotplug state used by PTT */ +static enum cpuhp_state hisi_ptt_pmu_online; + +static u16 hisi_ptt_get_filter_val(u16 devid, bool is_port) +{ + if (is_port) + return BIT(HISI_PCIE_CORE_PORT_ID(devid & 0xff)); + + return devid; +} + +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, + val, val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TRACE_TIMEOUT_US); +} + +static void hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, + val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, + HISI_PTT_RESET_TIMEOUT_US); +} +
[RFC PATCH] dma-iommu: Add iommu_dma_max_mapping_size()
For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. Add the IOMMU callback for DMA mapping API dma_max_mapping_size(), which allows the drivers to know the mapping limit and thus limit the requested IOVA lengths. This resolves the performance issue originally reported in [0] for a SCSI HBA driver which was regularly mapping SGLs which required IOVAs in excess of the IOVA caching limit. In this case the block layer limits the max sectors per request - as configured in __scsi_init_queue() - which will limit the total SGL length the driver tries to map and in turn limits IOVA lengths requested. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- Sending as an RFC as iommu_dma_max_mapping_size() is a soft limit, and not a hard limit which I expect is the semantics of dma_map_ops.max_mapping_size diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 09f6e1c0f9c0..e2d5205cde37 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1442,6 +1442,21 @@ static unsigned long iommu_dma_get_merge_boundary(struct device *dev) return (1UL << __ffs(domain->pgsize_bitmap)) - 1; } +static size_t iommu_dma_max_mapping_size(struct device *dev) +{ + struct iommu_domain *domain = iommu_get_domain_for_dev(dev); + struct iommu_dma_cookie *cookie; + + if (!domain) + return 0; + + cookie = domain->iova_cookie; + if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) + return 0; + + return iova_rcache_range(); +} + static const struct dma_map_ops iommu_dma_ops = { .alloc = iommu_dma_alloc, .free = iommu_dma_free, @@ -1462,6 +1477,7 @@ static const struct dma_map_ops iommu_dma_ops = { .map_resource = iommu_dma_map_resource, .unmap_resource = iommu_dma_unmap_resource, .get_merge_boundary = iommu_dma_get_merge_boundary, + .max_mapping_size = iommu_dma_max_mapping_size, 
}; /* diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index db77aa675145..9f00b58d546e 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -26,6 +26,11 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad, static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); static void free_iova_rcaches(struct iova_domain *iovad); +unsigned long iova_rcache_range(void) +{ + return PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1); +} + static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node) { struct iova_domain *iovad; diff --git a/include/linux/iova.h b/include/linux/iova.h index 320a70e40233..ae3e18d77e6c 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -79,6 +79,8 @@ static inline unsigned long iova_pfn(struct iova_domain *iovad, dma_addr_t iova) int iova_cache_get(void); void iova_cache_put(void); +unsigned long iova_rcache_range(void); + void free_iova(struct iova_domain *iovad, unsigned long pfn); void __free_iova(struct iova_domain *iovad, struct iova *iova); struct iova *alloc_iova(struct iova_domain *iovad, unsigned long size, @@ -105,6 +107,11 @@ static inline void iova_cache_put(void) { } +static inline unsigned long iova_rcache_range(void) +{ + return 0; +} + static inline void free_iova(struct iova_domain *iovad, unsigned long pfn) { } -- 2.26.2
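The limit this RFC exposes is easy to compute by hand. Assuming 4 KiB pages and an IOVA_RANGE_CACHE_MAX_SIZE of 6 (the in-tree value at the time, but worth double-checking against the kernel you are reading), the rcache ceiling works out to 128 KiB, the same figure quoted elsewhere in the thread. A small userspace model:

```c
#include <assert.h>

/* Assumed constants -- verify against drivers/iommu/iova.c for your tree */
#define PAGE_SIZE_MODEL			4096UL
#define IOVA_RANGE_CACHE_MAX_SIZE	6

/*
 * Model of iova_rcache_range() from the patch: the largest IOVA
 * allocation still served from the rcaches. Orders 0..5 are cached,
 * so the ceiling is PAGE_SIZE << 5 = 128 KiB with 4 KiB pages.
 */
static unsigned long iova_rcache_range_model(void)
{
	return PAGE_SIZE_MODEL << (IOVA_RANGE_CACHE_MAX_SIZE - 1);
}
```

Any streaming mapping larger than this falls back to the slow-path IOVA allocator, which is the performance cliff the series is working around.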
Re: [PATCH v7 1/7] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 07/04/2022 13:58, Yicong Yang wrote: The DMA operations of HiSilicon PTT device can only work properly with identical mappings. So add a quirk for the device to force the domain I'm not sure if you meant to write "identity mappings". as passthrough. Signed-off-by: Yicong Yang FWIW, Reviewed-by: John Garry --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 627a3ed5ee8f..5ec15ae2a9b1 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2839,6 +2839,21 @@ static int arm_smmu_dev_disable_feature(struct device *dev, } } +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ +(pdev)->device == 0xa12e) + +static int arm_smmu_def_domain_type(struct device *dev) +{ + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + + if (IS_HISI_PTT_DEVICE(pdev)) + return IOMMU_DOMAIN_IDENTITY; + } + + return 0; +} + static struct iommu_ops arm_smmu_ops = { .capable= arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -2856,6 +2871,7 @@ static struct iommu_ops arm_smmu_ops = { .sva_unbind = arm_smmu_sva_unbind, .sva_get_pasid = arm_smmu_sva_get_pasid, .page_response = arm_smmu_page_response, + .def_domain_type= arm_smmu_def_domain_type, .pgsize_bitmap = -1UL, /* Restricted during device attach */ .owner = THIS_MODULE, .default_domain_ops = &(const struct iommu_domain_ops) {
Re: [PATCH v7 5/7] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver
On 12/04/2022 08:41, Yicong Yang wrote: + hisi_ptt_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_ptts)); + if (!hisi_ptt_pmus) { + pr_err("hisi_ptt alloc failed\n"); + *err = -ENOMEM; using PTR_ERR seems better, if possible ok will change to that. *err = -ENOMEM is used here to keep consistency with what spe does. Ah, I see that we are constrained by the interface of auxtrace_record__init() to pass err as a pointer, so I suppose the code in this patch is ok to fit into that.
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
+static int hisi_ptt_alloc_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + hisi_ptt->trace_ctrl.buf_index = 0; + + /* If the trace buffer has already been allocated, zero it. */ I am not sure why this is not called from the probe The buffer allocation is done when necessary, as the driver probes the device at boot but the user may never use it. In that case it's a waste of memory if we allocate the buffers in probe. Currently we'll allocate 16M of memory for 4 buffers. But that's just not how we do things. We set up the driver fully in the probe. If the user cannot really afford the memory then he/she should not load the driver. In addition, this driver would be used in a machine which will have gigabytes of memory, so I think that the memory mentioned here is relatively insignificant. So this function is called every time before we start tracing. The first time it will allocate the DMA buffers, and on subsequent calls it just zeroes the buffers to clear the data from the last run. + if (ctrl->trace_buf) { + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) + memset(ctrl->trace_buf[i].addr, 0, HISI_PTT_TRACE_BUF_SIZE); + return 0; + } + + ctrl->trace_buf = devm_kcalloc(dev, HISI_PTT_TRACE_BUF_CNT, + sizeof(struct hisi_ptt_dma_buffer), GFP_KERNEL); sizeof(*ctrl->trace_buf) may be better ok. 
+ if (!ctrl->trace_buf) + return -ENOMEM; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; ++i) { + ctrl->trace_buf[i].addr = dmam_alloc_coherent(dev, HISI_PTT_TRACE_BUF_SIZE, + &ctrl->trace_buf[i].dma, + GFP_KERNEL); + if (!ctrl->trace_buf[i].addr) { + hisi_ptt_free_trace_buf(hisi_ptt); + return -ENOMEM; + } + } + + return 0; +} + +static void hisi_ptt_trace_end(struct hisi_ptt *hisi_ptt) +{ + writel(0, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + hisi_ptt->trace_ctrl.started = false; +} + +static int hisi_ptt_trace_start(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + u32 val; + int i; + + /* Check device idle before start trace */ + if (!hisi_ptt_wait_trace_hw_idle(hisi_ptt)) { + pci_err(hisi_ptt->pdev, "Failed to start trace, the device is still busy\n"); + return -EBUSY; + } + + ctrl->started = true; + + /* Reset the DMA before start tracing */ + val = readl(hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + val |= HISI_PTT_TRACE_CTRL_RST; + writel(val, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + + hisi_ptt_wait_dma_reset_done(hisi_ptt); + + val = readl(hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + val &= ~HISI_PTT_TRACE_CTRL_RST; + writel(val, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + + /* Clear the interrupt status */ + writel(HISI_PTT_TRACE_INT_STAT_MASK, hisi_ptt->iobase + HISI_PTT_TRACE_INT_STAT); + writel(0, hisi_ptt->iobase + HISI_PTT_TRACE_INT_MASK); + + /* Configure the trace DMA buffer */ I am not sure why this sort of thing is done outside probing ... + + val = FIELD_GET(HISI_PTT_PMU_DIRECTION_MASK, event->attr.config); + ret = hisi_ptt_trace_valid_config_onehot(val, hisi_ptt_trace_available_direction, how about put all those arrays in hisi_ptt_trace_valid_config_onehot() and pass some flag to say which array you want to use? Or something like that. 
Passing the arrays in this fashion is messy. Since there are 3 configs (type, direction, format) with different available ranges and setting methods (onehot, non-onehot), moving the arrays into the validity-checking function means we need to recognize the config types (passed by the caller, but the function would also need to know the available value array) and the checking method together. That may make the code more complex than now: first picking the right array and judging which checking method applies to it, and second doing the check. Currently it's designed to decouple the checking method from the available value array. The hisi_ptt_trace_valid_config{_onehot}() won't care about which array to use since the caller takes responsibility for this. So perhaps the current approach is simple and clear enough. A couple of points: - hisi_ptt_trace_valid_config_type() only has one caller, so we can make it dedicated to that caller - there is not much code in hisi_ptt_trace_valid_config_onehot(), so it's ok to duplicate it if that makes the overall code look better So I think dedicated functions make the code simpler and easier to follow and maintain: static int hisi_ptt_trace_valid_config_dir(u32 val) { int i; /* * The supported value of the direction parameter. See hisi_ptt.rst * documentation for more details. */ static const u32 hisi_ptt_trace_available_direction[] = { 0,
Re: [PATCH v7 5/7] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver
On 07/04/2022 13:58, Yicong Yang wrote: From: Qi Liu 'perf record' and 'perf report --dump-raw-trace' supported in this patch. Example usage: Output will contain raw PTT data and its textual representation, such as: 0 0 0x5810 [0x30]: PERF_RECORD_AUXTRACE size: 0x40 offset: 0 ref: 0xa5d50c725 idx: 0 tid: -1 cpu: 0 . . ... HISI PTT data: size 4194304 bytes . : 00 00 00 00 Prefix . 0004: 08 20 00 60 Header DW0 . 0008: ff 02 00 01 Header DW1 . 000c: 20 08 00 00 Header DW2 . 0010: 10 e7 44 ab Header DW3 . 0014: 2a a8 1e 01 Time . 0020: 00 00 00 00 Prefix . 0024: 01 00 00 60 Header DW0 . 0028: 0f 1e 00 01 Header DW1 . 002c: 04 00 00 00 Header DW2 . 0030: 40 00 81 02 Header DW3 . 0034: ee 02 00 00 Time Signed-off-by: Qi Liu Signed-off-by: Yicong Yang --- tools/perf/arch/arm/util/auxtrace.c | 76 +- tools/perf/arch/arm/util/pmu.c| 3 + tools/perf/arch/arm64/util/Build | 2 +- tools/perf/arch/arm64/util/hisi_ptt.c | 195 tools/perf/util/Build | 2 + tools/perf/util/auxtrace.c| 4 + tools/perf/util/auxtrace.h| 1 + tools/perf/util/hisi-ptt-decoder/Build| 1 + .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.c | 170 ++ .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.h | 28 +++ tools/perf/util/hisi_ptt.c| 218 ++ tools/perf/util/hisi_ptt.h| 28 +++ 12 files changed, 724 insertions(+), 4 deletions(-) create mode 100644 tools/perf/arch/arm64/util/hisi_ptt.c create mode 100644 tools/perf/util/hisi-ptt-decoder/Build create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.c create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.h create mode 100644 tools/perf/util/hisi_ptt.c create mode 100644 tools/perf/util/hisi_ptt.h diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c index 5fc6a2a3dbc5..393f5757c039 100644 --- a/tools/perf/arch/arm/util/auxtrace.c +++ b/tools/perf/arch/arm/util/auxtrace.c @@ -4,9 +4,11 @@ * Author: Mathieu Poirier */ +#include #include #include #include +#include #include "../../../util/auxtrace.h" #include 
"../../../util/debug.h" @@ -14,6 +16,7 @@ #include "../../../util/pmu.h" #include "cs-etm.h" #include "arm-spe.h" +#include "hisi_ptt.h" static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) { @@ -50,6 +53,58 @@ static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) return arm_spe_pmus; } +static struct perf_pmu **find_all_hisi_ptt_pmus(int *nr_ptts, int *err) +{ + const char *sysfs = sysfs__mountpoint(); + struct perf_pmu **hisi_ptt_pmus = NULL; + struct dirent *dent; + char path[PATH_MAX]; + DIR *dir = NULL; + int idx = 0; + + snprintf(path, PATH_MAX, "%s" EVENT_SOURCE_DEVICE_PATH, sysfs); + dir = opendir(path); + if (!dir) { + pr_err("can't read directory '%s'\n", EVENT_SOURCE_DEVICE_PATH); + *err = -EINVAL; + goto out; + } + + while ((dent = readdir(dir))) { + if (strstr(dent->d_name, HISI_PTT_PMU_NAME)) + (*nr_ptts)++; + } + + if (!(*nr_ptts)) + goto out; + + hisi_ptt_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_ptts)); + if (!hisi_ptt_pmus) { + pr_err("hisi_ptt alloc failed\n"); + *err = -ENOMEM; using PTR_ERR seems better, if possible + goto out; + } + + rewinddir(dir); + while ((dent = readdir(dir))) { + if (strstr(dent->d_name, HISI_PTT_PMU_NAME) && idx < (*nr_ptts)) { + hisi_ptt_pmus[idx] = perf_pmu__find(dent->d_name); + if (hisi_ptt_pmus[idx]) { + pr_debug2("%s %d: hisi_ptt_pmu %d type %d name %s\n", do you really need this? + __func__, __LINE__, idx, + hisi_ptt_pmus[idx]->type, + hisi_ptt_pmus[idx]->name); + idx++; + } + + } + } + +out: + closedir(dir); + return hisi_ptt_pmus; +} + struct auxtrace_record *auxtrace_record__init(struct evlist *evlist, int *err) { @@ -57,8 +112,12 @@ s
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 07/04/2022 13:58, Yicong Yang wrote: HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. Register PMU device of PTT trace, then users can use trace through perf command. The driver makes use of perf AUX trace and support following events to "The driver makes use of perf AUX trace function and supports the following events to .." configure the trace: - filter: select Root port or Endpoint to trace - type: select the type of traced TLP headers - direction: select the direction of traced TLP headers - format: select the data format of the traced TLP headers This patch adds the driver part of PTT trace. The perf command support of PTT trace is added in the following patch. I would not mention "following patch" and the like. Just have "initially add a basic driver" Signed-off-by: Yicong Yang Reviewed-by: Jonathan Cameron --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 12 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 874 +++ drivers/hwtracing/ptt/hisi_ptt.h | 166 ++ 6 files changed, 1057 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index 020780b6b4d2..662d50599467 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig 
index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..8902a6f27563 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS + help + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..242b41870380 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,874 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, + val, val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TRACE_TIMEOUT_US); +} + +static bool hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, + val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, +
Re: [PATCH RESEND v5 4/5] iommu: Allow max opt DMA len be set for a group via sysfs
On 07/04/2022 09:21, Leizhen (ThunderTown) wrote: On 2022/4/4 19:27, John Garry wrote: Add support to allow the maximum optimised DMA len to be set for an IOMMU group via sysfs. This is much the same as the method to change the default domain type for a group. Signed-off-by: John Garry --- .../ABI/testing/sysfs-kernel-iommu_groups | 16 + drivers/iommu/iommu.c | 59 ++- include/linux/iommu.h | 6 ++ 3 files changed, 79 insertions(+), 2 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups index b15af6a5bc08..ed6f72794f6c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups +++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups @@ -63,3 +63,19 @@ Description: /sys/kernel/iommu_groups//type shows the type of default system could lead to catastrophic effects (the users might need to reboot the machine to get it to normal state). So, it's expected that the users understand what they're doing. + +What: /sys/kernel/iommu_groups//max_opt_dma_size +Date: Feb 2022 +KernelVersion: v5.18 +Contact: iommu@lists.linux-foundation.org +Description: /sys/kernel/iommu_groups//max_opt_dma_size shows the + max optimised DMA size for the default IOMMU domain associated + with the group. + Each IOMMU domain has an IOVA domain. The IOVA domain caches + IOVAs up to a certain size as a performance optimisation. + This sysfs file allows the range of the IOVA domain caching to be + set, such that larger than default IOVAs may be cached. + A value of 0 means that the default caching range is chosen. + A privileged user could request the kernel to change the range + by writing to this file. For this to happen, the same rules + and procedure apply as in changing the default domain type. 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 10bb10c2a210..7c7258f19bed 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -48,6 +48,7 @@ struct iommu_group { struct iommu_domain *default_domain; struct iommu_domain *domain; struct list_head entry; + size_t max_opt_dma_size; }; struct group_device { @@ -89,6 +90,9 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static ssize_t iommu_group_store_max_opt_dma_size(struct iommu_group *group, + const char *buf, + size_t count); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name = \ @@ -571,6 +575,12 @@ static ssize_t iommu_group_show_type(struct iommu_group *group, return strlen(type); } +static ssize_t iommu_group_show_max_opt_dma_size(struct iommu_group *group, +char *buf) +{ + return sprintf(buf, "%zu\n", group->max_opt_dma_size); +} + static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL); static IOMMU_GROUP_ATTR(reserved_regions, 0444, @@ -579,6 +589,9 @@ static IOMMU_GROUP_ATTR(reserved_regions, 0444, static IOMMU_GROUP_ATTR(type, 0644, iommu_group_show_type, iommu_group_store_type); +static IOMMU_GROUP_ATTR(max_opt_dma_size, 0644, iommu_group_show_max_opt_dma_size, + iommu_group_store_max_opt_dma_size); + static void iommu_group_release(struct kobject *kobj) { struct iommu_group *group = to_iommu_group(kobj); @@ -665,6 +678,10 @@ struct iommu_group *iommu_group_alloc(void) if (ret) return ERR_PTR(ret); + ret = iommu_group_create_file(group, &iommu_group_attr_max_opt_dma_size); + if (ret) + return ERR_PTR(ret); + pr_debug("Allocated group %d\n", group->id); return group; @@ -2087,6 +2104,11 @@ struct iommu_domain *iommu_get_dma_domain(struct device *dev) return dev->iommu_group->default_domain; } +size_t 
iommu_group_get_max_opt_dma_size(struct iommu_group *group) +{ + return group->max_opt_dma_size; +} + /* * IOMMU groups are really the natural working unit of the IOMMU, but * the IOMMU API works on domains and devices. Bridge that gap by @@ -2871,12 +2893,14 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * @prev_dev: The device in the group (this is used to make sure that the device * hasn't changed after the caller has called this functio
Re: [PATCH RESEND v5 5/5] iova: Add iova_len argument to iova_domain_init_rcaches()
On 07/04/2022 09:27, Leizhen (ThunderTown) wrote: Thanks for having a look On 2022/4/4 19:27, John Garry wrote: Add max opt argument to iova_domain_init_rcaches(), and use it to set the rcaches range. Also fix up all users to set this value (at 0, meaning use default), including a wrapper for that, iova_domain_init_rcaches_default(). For dma-iommu.c we derive the iova_len argument from the IOMMU group max opt DMA size. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c| 15 ++- drivers/iommu/iova.c | 19 --- drivers/vdpa/vdpa_user/iova_domain.c | 4 ++-- include/linux/iova.h | 3 ++- 4 files changed, 34 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 42ca42ff1b5d..19f35624611c 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + size_t max_opt_dma_size; + unsigned long iova_len = 0; int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) @@ -560,7 +562,18 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); - ret = iova_domain_init_rcaches(iovad); + + max_opt_dma_size = iommu_group_get_max_opt_dma_size(dev->iommu_group); + if (max_opt_dma_size) { + unsigned long shift = __ffs(1UL << order); + + iova_len = roundup_pow_of_two(max_opt_dma_size); + iova_len >>= shift; + if (!iova_len) + iova_len = 1; How about move "iovad->rcache_max_size = iova_len_to_rcache_max(iova_len);" here? So that, iova_domain_init_rcaches() can remain the same. And iova_domain_init_rcaches_default() does not need to be added. I see your idea. I will say that I would rather not add iova_domain_init_rcaches_default(). 
But personally I think it's better to setup all rcache stuff only once and inside iova_domain_init_rcaches(), as it is today. In addition, it doesn't look reasonable to expose iova_len_to_rcache_max(). But maybe it's ok. Other opinion would be welcome... Thanks, John + } + + ret = iova_domain_init_rcaches(iovad, iova_len); if (ret) return ret; diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 5c22b9187b79..d65e79e132ee 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -706,12 +706,20 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -int iova_domain_init_rcaches(struct iova_domain *iovad) +static unsigned long iova_len_to_rcache_max(unsigned long iova_len) +{ + return order_base_2(iova_len) + 1; +} + +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len) { unsigned int cpu; int i, ret; - iovad->rcache_max_size = 6; /* Arbitrarily high default */ + if (iova_len) + iovad->rcache_max_size = iova_len_to_rcache_max(iova_len); + else + iovad->rcache_max_size = 6; /* Arbitrarily high default */ iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), @@ -755,7 +763,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) free_iova_rcaches(iovad); return ret; } -EXPORT_SYMBOL_GPL(iova_domain_init_rcaches); + +int iova_domain_init_rcaches_default(struct iova_domain *iovad) +{ + return iova_domain_init_rcaches(iovad, 0); +} +EXPORT_SYMBOL_GPL(iova_domain_init_rcaches_default); /* * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 6daa3978d290..3a2acef98a4a 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -514,12 +514,12 @@ vduse_domain_create(unsigned long iova_limit, size_t bounce_size) spin_lock_init(&domain->iotlb_lock); init_iova_domain(&domain->stream_iovad, PAGE_SIZE, 
IOVA_START_PFN); - ret = iova_domain_init_rcaches(&domain->stream_iovad); + ret = iova_domain_init_rcaches_default(&domain->stream_iovad); if (ret) goto err_iovad_stream; init_iova_domain(&domain->consistent_iovad, PAGE_SIZE, bounce_pfns); - ret = iova_domain_init_rcaches(&domain->consistent_iovad); + ret = iova_domain_init_rcaches_default(&domain->consistent_io
[PATCH RESEND v5 5/5] iova: Add iova_len argument to iova_domain_init_rcaches()
Add max opt argument to iova_domain_init_rcaches(), and use it to set the rcaches range. Also fix up all users to set this value (at 0, meaning use default), including a wrapper for that, iova_domain_init_rcaches_default(). For dma-iommu.c we derive the iova_len argument from the IOMMU group max opt DMA size. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c| 15 ++- drivers/iommu/iova.c | 19 --- drivers/vdpa/vdpa_user/iova_domain.c | 4 ++-- include/linux/iova.h | 3 ++- 4 files changed, 34 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 42ca42ff1b5d..19f35624611c 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + size_t max_opt_dma_size; + unsigned long iova_len = 0; int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) @@ -560,7 +562,18 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); - ret = iova_domain_init_rcaches(iovad); + + max_opt_dma_size = iommu_group_get_max_opt_dma_size(dev->iommu_group); + if (max_opt_dma_size) { + unsigned long shift = __ffs(1UL << order); + + iova_len = roundup_pow_of_two(max_opt_dma_size); + iova_len >>= shift; + if (!iova_len) + iova_len = 1; + } + + ret = iova_domain_init_rcaches(iovad, iova_len); if (ret) return ret; diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 5c22b9187b79..d65e79e132ee 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -706,12 +706,20 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -int iova_domain_init_rcaches(struct iova_domain *iovad) +static unsigned long iova_len_to_rcache_max(unsigned long iova_len) +{ + return order_base_2(iova_len) 
+ 1; +} + +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len) { unsigned int cpu; int i, ret; - iovad->rcache_max_size = 6; /* Arbitrarily high default */ + if (iova_len) + iovad->rcache_max_size = iova_len_to_rcache_max(iova_len); + else + iovad->rcache_max_size = 6; /* Arbitrarily high default */ iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), @@ -755,7 +763,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) free_iova_rcaches(iovad); return ret; } -EXPORT_SYMBOL_GPL(iova_domain_init_rcaches); + +int iova_domain_init_rcaches_default(struct iova_domain *iovad) +{ + return iova_domain_init_rcaches(iovad, 0); +} +EXPORT_SYMBOL_GPL(iova_domain_init_rcaches_default); /* * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 6daa3978d290..3a2acef98a4a 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -514,12 +514,12 @@ vduse_domain_create(unsigned long iova_limit, size_t bounce_size) spin_lock_init(&domain->iotlb_lock); init_iova_domain(&domain->stream_iovad, PAGE_SIZE, IOVA_START_PFN); - ret = iova_domain_init_rcaches(&domain->stream_iovad); + ret = iova_domain_init_rcaches_default(&domain->stream_iovad); if (ret) goto err_iovad_stream; init_iova_domain(&domain->consistent_iovad, PAGE_SIZE, bounce_pfns); - ret = iova_domain_init_rcaches(&domain->consistent_iovad); + ret = iova_domain_init_rcaches_default(&domain->consistent_iovad); if (ret) goto err_iovad_consistent; diff --git a/include/linux/iova.h b/include/linux/iova.h index 02f7222fa85a..56281434ce0c 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -95,7 +95,8 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo, unsigned long pfn_hi); void init_iova_domain(struct iova_domain *iovad, unsigned long granule, unsigned long start_pfn); -int 
iova_domain_init_rcaches(struct iova_domain *iovad); +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len); +int iova_domain_init_rcaches_default(struct iova_domain *iovad); struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn); void put_iov
[PATCH RESEND v5 4/5] iommu: Allow max opt DMA len be set for a group via sysfs
Add support to allow the maximum optimised DMA len to be set for an IOMMU group via sysfs. This is much the same as the method to change the default domain type for a group. Signed-off-by: John Garry --- .../ABI/testing/sysfs-kernel-iommu_groups | 16 + drivers/iommu/iommu.c | 59 ++- include/linux/iommu.h | 6 ++ 3 files changed, 79 insertions(+), 2 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups index b15af6a5bc08..ed6f72794f6c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups +++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups @@ -63,3 +63,19 @@ Description: /sys/kernel/iommu_groups//type shows the type of default system could lead to catastrophic effects (the users might need to reboot the machine to get it to normal state). So, it's expected that the users understand what they're doing. + +What: /sys/kernel/iommu_groups//max_opt_dma_size +Date: Feb 2022 +KernelVersion: v5.18 +Contact: iommu@lists.linux-foundation.org +Description: /sys/kernel/iommu_groups//max_opt_dma_size shows the + max optimised DMA size for the default IOMMU domain associated + with the group. + Each IOMMU domain has an IOVA domain. The IOVA domain caches + IOVAs up to a certain size as a performance optimisation. + This sysfs file allows the range of the IOVA domain caching to be + set, such that larger than default IOVAs may be cached. + A value of 0 means that the default caching range is chosen. + A privileged user could request the kernel to change the range + by writing to this file. For this to happen, the same rules + and procedure apply as in changing the default domain type. 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 10bb10c2a210..7c7258f19bed 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -48,6 +48,7 @@ struct iommu_group { struct iommu_domain *default_domain; struct iommu_domain *domain; struct list_head entry; + size_t max_opt_dma_size; }; struct group_device { @@ -89,6 +90,9 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static ssize_t iommu_group_store_max_opt_dma_size(struct iommu_group *group, + const char *buf, + size_t count); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name =\ @@ -571,6 +575,12 @@ static ssize_t iommu_group_show_type(struct iommu_group *group, return strlen(type); } +static ssize_t iommu_group_show_max_opt_dma_size(struct iommu_group *group, +char *buf) +{ + return sprintf(buf, "%zu\n", group->max_opt_dma_size); +} + static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL); static IOMMU_GROUP_ATTR(reserved_regions, 0444, @@ -579,6 +589,9 @@ static IOMMU_GROUP_ATTR(reserved_regions, 0444, static IOMMU_GROUP_ATTR(type, 0644, iommu_group_show_type, iommu_group_store_type); +static IOMMU_GROUP_ATTR(max_opt_dma_size, 0644, iommu_group_show_max_opt_dma_size, + iommu_group_store_max_opt_dma_size); + static void iommu_group_release(struct kobject *kobj) { struct iommu_group *group = to_iommu_group(kobj); @@ -665,6 +678,10 @@ struct iommu_group *iommu_group_alloc(void) if (ret) return ERR_PTR(ret); + ret = iommu_group_create_file(group, &iommu_group_attr_max_opt_dma_size); + if (ret) + return ERR_PTR(ret); + pr_debug("Allocated group %d\n", group->id); return group; @@ -2087,6 +2104,11 @@ struct iommu_domain *iommu_get_dma_domain(struct device *dev) return dev->iommu_group->default_domain; } +size_t 
iommu_group_get_max_opt_dma_size(struct iommu_group *group) +{ + return group->max_opt_dma_size; +} + /* * IOMMU groups are really the natural working unit of the IOMMU, but * the IOMMU API works on domains and devices. Bridge that gap by @@ -2871,12 +2893,14 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * @prev_dev: The device in the group (this is used to make sure that the device * hasn't changed after the caller has called this function) * @type: The type of the new default domain that gets associated with the group + * @max_opt_dma_
[PATCH RESEND v5 3/5] iommu: Allow iommu_change_dev_def_domain() realloc same default domain type
Allow iommu_change_dev_def_domain() to create a new default domain, keeping the same as current. Also remove comment about the function purpose, which will become stale. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 49 ++- include/linux/iommu.h | 1 + 2 files changed, 26 insertions(+), 24 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 0dd766030baf..10bb10c2a210 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2863,6 +2863,7 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle) } EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); + /* * Changes the default domain of an iommu group that has *only* one device * @@ -2873,10 +2874,6 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * * Returns 0 on success and error code on failure * - * Note: - * 1. Presently, this function is called only when user requests to change the - *group's default domain type through /sys/kernel/iommu_groups//type - *Please take a closer look if intended to use for other purposes. */ static int iommu_change_dev_def_domain(struct iommu_group *group, struct device *prev_dev, int type) @@ -2929,28 +2926,32 @@ static int iommu_change_dev_def_domain(struct iommu_group *group, goto out; } - dev_def_dom = iommu_get_def_domain_type(dev); - if (!type) { + if (type == __IOMMU_DOMAIN_SAME) { + type = prev_dom->type; + } else { + dev_def_dom = iommu_get_def_domain_type(dev); + if (!type) { + /* +* If the user hasn't requested any specific type of domain and +* if the device supports both the domains, then default to the +* domain the device was booted with +*/ + type = dev_def_dom ? 
: iommu_def_domain_type; + } else if (dev_def_dom && type != dev_def_dom) { + dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n", + iommu_domain_type_str(type)); + ret = -EINVAL; + goto out; + } + /* -* If the user hasn't requested any specific type of domain and -* if the device supports both the domains, then default to the -* domain the device was booted with +* Switch to a new domain only if the requested domain type is different +* from the existing default domain type */ - type = dev_def_dom ? : iommu_def_domain_type; - } else if (dev_def_dom && type != dev_def_dom) { - dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n", - iommu_domain_type_str(type)); - ret = -EINVAL; - goto out; - } - - /* -* Switch to a new domain only if the requested domain type is different -* from the existing default domain type -*/ - if (prev_dom->type == type) { - ret = 0; - goto out; + if (prev_dom->type == type) { + ret = 0; + goto out; + } } /* We can bring up a flush queue without tearing down the domain */ diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 9208eca4b0d1..b141cf71c7af 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -63,6 +63,7 @@ struct iommu_domain_geometry { implementation */ #define __IOMMU_DOMAIN_PT (1U << 2) /* Domain is identity mapped */ #define __IOMMU_DOMAIN_DMA_FQ (1U << 3) /* DMA-API uses flush queue*/ +#define __IOMMU_DOMAIN_SAME(1U << 4) /* Keep same type (internal) */ /* * This are the possible domain-types -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH RESEND v5 2/5] iova: Allow rcache range upper limit to be flexible
Some low-level drivers may request DMA mappings whose IOVA length exceeds that of the current rcache upper limit. This means that allocations for those IOVAs will never be cached, and always must be allocated and freed from the RB tree per DMA mapping cycle. This has a significant effect on performance, more so since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. As a first step towards allowing the rcache range upper limit be configured, hold this value in the IOVA rcache structure, and allocate the rcaches separately. Delete macro IOVA_RANGE_CACHE_MAX_SIZE in case it's reused by mistake. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- drivers/iommu/iova.c | 20 ++-- include/linux/iova.h | 3 +++ 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index db77aa675145..5c22b9187b79 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,8 +15,6 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR~0UL -#define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size (in pages) */ - static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); @@ -443,7 +441,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size, * rounding up anything cacheable to make sure that can't happen. The * order of the unadjusted size will still match upon freeing. 
*/ - if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) + if (size < (1 << (iovad->rcache_max_size - 1))) size = roundup_pow_of_two(size); iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1); @@ -713,13 +711,15 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, ret; - iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE, + iovad->rcache_max_size = 6; /* Arbitrarily high default */ + + iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), GFP_KERNEL); if (!iovad->rcaches) return -ENOMEM; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { struct iova_cpu_rcache *cpu_rcache; struct iova_rcache *rcache; @@ -816,7 +816,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE) + if (log_size >= iovad->rcache_max_size) return false; return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn); @@ -872,7 +872,7 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE || !iovad->rcaches) + if (log_size >= iovad->rcache_max_size || !iovad->rcaches) return 0; return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size); @@ -888,7 +888,7 @@ static void free_iova_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; if (!rcache->cpu_rcaches) break; @@ -916,7 +916,7 @@ static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad) unsigned long flags; int i; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu); spin_lock_irqsave(&cpu_rcache->lock, flags); @@ -935,7 
+935,7 @@ static void free_global_cached_iovas(struct iova_domain *iovad) unsigned long flags; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; spin_lock_irqsave(&rcache->lock, flags); for (j = 0; j < rcache->depot_size; ++j) { diff --git a/include/linux/iova.h b/include/linux/iova.h index 320a70e40233..02f7222fa85a 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -38,6 +38,9 @@ struct iova_domain { struct iova_rcache *rcaches; struct hlist_node
[PATCH RESEND v5 0/5] iommu: Allow IOVA rcache range be configured
For streaming DMA mappings involving an IOMMU and whose IOVA length regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This may be much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1].

This series allows the IOVA rcache range to be configured, so that we may cache all IOVAs per domain, thus improving performance.

A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range:

/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated the same way as the IOMMU group default domain type, i.e. the only device in the group must be unbound first.

The inspiration here comes from the block layer request queue sysfs "optimal_io_size" file, in /sys/block/sdX/queue/optimal_io_size

Some old figures* for the storage scenario (when increasing the IOVA rcache range to cover all DMA mapping sizes from the LLD):

v5.13-rc1 baseline: 1200K IOPS
With series:        1800K IOPS

All of the above are for IOMMU strict mode. Non-strict mode gives ~1800K IOPS in all scenarios.
Based on v5.18-rc1

* I lost my high data throughput test setup

Differences to v4: https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/
- Major rebase
- Change the "Refactor iommu_group_store_type()" patch to not use a callback, and use an op type enum instead
- I didn't pick up Will's Ack as it has changed so much
- Use a domain feature flag to keep the same default group type
- Add a wrapper for the default IOVA rcache range
- Combine the last 2x patches

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/

John Garry (5):
  iommu: Refactor iommu_group_store_type()
  iova: Allow rcache range upper limit to be flexible
  iommu: Allow iommu_change_dev_def_domain() realloc same default domain type
  iommu: Allow max opt DMA len be set for a group via sysfs
  iova: Add iova_len argument to iova_domain_init_rcaches()

 .../ABI/testing/sysfs-kernel-iommu_groups |  16 ++
 drivers/iommu/dma-iommu.c                 |  15 +-
 drivers/iommu/iommu.c                     | 202 +-
 drivers/iommu/iova.c                      |  37 ++--
 drivers/vdpa/vdpa_user/iova_domain.c      |   4 +-
 include/linux/iommu.h                     |   7 +
 include/linux/iova.h                      |   6 +-
 7 files changed, 212 insertions(+), 75 deletions(-)
-- 
2.26.2
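Based on the cover letter's description (the file is updated the same way as the default domain type, with the group's only device unbound first), the expected usage flow would look roughly like the below. The device address, driver name, and group number are examples only, and the exact semantics should be checked against the ABI document added in patch 4/5:

```shell
# Unbind the only device in the IOMMU group from its driver
echo 0000:74:02.0 > /sys/bus/pci/drivers/hisi_sas_v3_hw/unbind

# Set the max optimised DMA size for the group (e.g. 256 KiB),
# which indirectly sizes the IOVA rcache range
echo 262144 > /sys/kernel/iommu_groups/X/max_opt_dma_size

# Rebind; the default domain is reallocated with resized rcaches
echo 0000:74:02.0 > /sys/bus/pci/drivers/hisi_sas_v3_hw/bind
```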
[PATCH RESEND v5 1/5] iommu: Refactor iommu_group_store_type()
Function iommu_group_store_type() supports changing the default domain of an IOMMU group. Many conditions need to be satisfied and steps taken for this action to be successful. Satisfying these conditions and steps will be required for setting other IOMMU group attributes, so factor into a common part and a part specific to update the IOMMU group attribute. No functional change intended. Some code comments are tidied up also. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 96 --- 1 file changed, 62 insertions(+), 34 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index f2c45b85b9fc..0dd766030baf 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -3000,21 +3000,57 @@ static int iommu_change_dev_def_domain(struct iommu_group *group, return ret; } +enum iommu_group_op { + CHANGE_GROUP_TYPE, +}; + +static int __iommu_group_store_type(const char *buf, struct iommu_group *group, + struct device *dev) +{ + int type; + + if (sysfs_streq(buf, "identity")) + type = IOMMU_DOMAIN_IDENTITY; + else if (sysfs_streq(buf, "DMA")) + type = IOMMU_DOMAIN_DMA; + else if (sysfs_streq(buf, "DMA-FQ")) + type = IOMMU_DOMAIN_DMA_FQ; + else if (sysfs_streq(buf, "auto")) + type = 0; + else + return -EINVAL; + + /* +* Check if the only device in the group still has a driver bound or +* we're transistioning from DMA -> DMA-FQ +*/ + if (device_is_bound(dev) && !(type == IOMMU_DOMAIN_DMA_FQ && + group->default_domain->type == IOMMU_DOMAIN_DMA)) { + pr_err_ratelimited("Device is still bound to driver\n"); + return -EINVAL; + } + + return iommu_change_dev_def_domain(group, dev, type); +} + /* * Changing the default domain through sysfs requires the users to unbind the * drivers from the devices in the iommu group, except for a DMA -> DMA-FQ - * transition. Return failure if this isn't met. + * transition. Changing or any other IOMMU group attribute still requires the + * user to unbind the drivers from the devices in the iommu group. 
Return + * failure if these conditions are not met. * * We need to consider the race between this and the device release path. * device_lock(dev) is used here to guarantee that the device release path * will not be entered at the same time. */ -static ssize_t iommu_group_store_type(struct iommu_group *group, - const char *buf, size_t count) +static ssize_t iommu_group_store_common(struct iommu_group *group, + enum iommu_group_op op, + const char *buf, size_t count) { struct group_device *grp_dev; struct device *dev; - int ret, req_type; + int ret; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; @@ -3022,27 +3058,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group, if (WARN_ON(!group)) return -EINVAL; - if (sysfs_streq(buf, "identity")) - req_type = IOMMU_DOMAIN_IDENTITY; - else if (sysfs_streq(buf, "DMA")) - req_type = IOMMU_DOMAIN_DMA; - else if (sysfs_streq(buf, "DMA-FQ")) - req_type = IOMMU_DOMAIN_DMA_FQ; - else if (sysfs_streq(buf, "auto")) - req_type = 0; - else - return -EINVAL; - /* * Lock/Unlock the group mutex here before device lock to -* 1. Make sure that the iommu group has only one device (this is a +* 1. Make sure that the IOMMU group has only one device (this is a *prerequisite for step 2) * 2. Get struct *dev which is needed to lock device */ mutex_lock(&group->mutex); if (iommu_group_device_count(group) != 1) { mutex_unlock(&group->mutex); - pr_err_ratelimited("Cannot change default domain: Group has more than one device\n"); + pr_err_ratelimited("Cannot change IOMMU group default domain attribute: Group has more than one device\n"); return -EINVAL; } @@ -3054,16 +3079,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group, /* * Don't hold the group mutex because taking group mutex first and then * the device lock could potentially cause a deadlock as below. Assume -* two threads T1 and T2. 
T1 is trying to change default domain of an -* iommu group and T2 is trying to hot unplug a device or release [1] VF -* of a PCIe device which is in th
Re: [PATCH v5 2/8] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
+ endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..8902a6f27563 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS + help + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..935dc9b44a54 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,330 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, + val, val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TRACE_TIMEOUT_US); +} + +static bool hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, + val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, + HISI_PTT_RESET_TIMEOUT_US); +} + +static void hisi_ptt_free_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + if (!ctrl->trace_buf) + return; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) { + if (ctrl->trace_buf[i].addr) + dmam_free_coherent(dev, HISI_PTT_TRACE_BUF_SIZE, + ctrl->trace_buf[i].addr, + ctrl->trace_buf[i].dma); + } + + devm_kfree(dev, ctrl->trace_buf); + ctrl->trace_buf = NULL; +} + +static int hisi_ptt_alloc_trace_buf(struct hisi_ptt *hisi_ptt) no caller +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + hisi_ptt->trace_ctrl.buf_index = 0; + + /* If the trace buffer has already been allocated, zero it. 
*/ + if (ctrl->trace_buf) { + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) + memset(ctrl->trace_buf[i].addr, 0, HISI_PTT_TRACE_BUF_SIZE); + return 0; + } + + ctrl->trace_buf = devm_kcalloc(dev, HISI_PTT_TRACE_BUF_CNT, + sizeof(struct hisi_ptt_dma_buffer), GFP_KERNEL); + if (!ctrl->trace_buf) + return -ENOMEM; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; ++i) { + ctrl->trace_buf[i].addr = dmam_alloc_coherent(dev, HISI_PTT_TRACE_BUF_SIZE, + &ctrl->trace_buf[i].dma, +GFP_KERNEL); + if (!ctrl->trace_buf[i].addr) { + hisi_ptt_free_trace_buf(hisi_ptt); + return -ENOMEM; + } + } + + return 0; +} + +static void hisi_ptt_trace_end(struct hisi_ptt *hisi_ptt) +{ + writel(0, hisi_ptt->iobase + HISI_PTT_TRACE_CTRL); + hisi_ptt->trace_ctrl.started = false; +} + +static int hisi_ptt_trace_start(struct hisi_ptt *hisi_ptt) again this function has no caller, so I assume a warn is generated if we only apply up to this patch (when compiling) please only add code per-patch which is actually referenced +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + u32 val; +
Re: [PATCH v5 1/8] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 08/03/2022 08:49, Yicong Yang wrote: The DMA of HiSilicon PTT device can only work with identical mapping. nit: I'd have "DMA operations of the HiSilicon PTT device can only work properly with identity mappings". So add a quirk for the device to force the domain passthrough. ".. domain as passthrough." Signed-off-by: Yicong Yang --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6dc6d8b6b368..6f67a2b1dd27 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2838,6 +2838,21 @@ static int arm_smmu_dev_disable_feature(struct device *dev, } } +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ +(pdev)->device == 0xa12e) + +static int arm_smmu_def_domain_type(struct device *dev) +{ + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + + if (IS_HISI_PTT_DEVICE(pdev)) + return IOMMU_DOMAIN_IDENTITY; + } + + return 0; +} + static struct iommu_ops arm_smmu_ops = { .capable= arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -2863,6 +2878,7 @@ static struct iommu_ops arm_smmu_ops = { .sva_unbind = arm_smmu_sva_unbind, .sva_get_pasid = arm_smmu_sva_get_pasid, .page_response = arm_smmu_page_response, + .def_domain_type= arm_smmu_def_domain_type, .pgsize_bitmap = -1UL, /* Restricted during device attach */ .owner = THIS_MODULE, }; ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/iova: Free all CPU rcache for retry when iova alloc failure
On 04/03/2022 04:46, yf.wang--- via iommu wrote:

* MEDIATEK Confidentiality Notice The information contained in this e-mail message (including any attachments) may be confidential, proprietary, privileged, or otherwise exempt from disclosure under applicable laws. It is intended to be

Can you please stop sending patches with this?
Re: [PATCH v5 0/5] iommu: Allow IOVA rcache range be configured
On 14/02/2022 17:29, John Garry wrote: Hi guys, And a friendly reminder on this series also. Cheers, john For streaming DMA mappings involving an IOMMU and whose IOVA len regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This may be much more pronounced from commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1]. This series allows the IOVA rcache range be configured, so that we may cache all IOVAs per domain, thus improving performance. A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range: /sys/kernel/iommu_groups/X/max_opt_dma_size This file is updated same as how the IOMMU group default domain type is updated, i.e. must unbind the only device in the group first. The inspiration here comes from block layer request queue sysfs "optimal_io_size" file, in /sys/block/sdX/queue/optimal_io_size Some old figures* for storage scenario (when increasing IOVA rcache range to cover all DMA mapping sizes from the LLD): v5.13-rc1 baseline: 1200K IOPS With series:1800K IOPS All above are for IOMMU strict mode. Non-strict mode gives ~1800K IOPS in all scenarios. 
Based on v5.17-rc4 + [2] * I lost my high data throughout test setup Differences to v4: https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/ - Major rebase - Change the "Refactor iommu_group_store_type()" to not use a callback and an op type enum instead - I didn't pick up Will's Ack as it has changed so much - Use a domain feature flag to keep same default group type - Add wrapper for default IOVA rcache range - Combine last 2x patches [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ [1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/ [2] https://lore.kernel.org/linux-iommu/20220203063345-mutt-send-email-...@kernel.org/T/#m5b2b59576d35cad544314470f32e5f40ac5d1fe9 John Garry (5): iommu: Refactor iommu_group_store_type() iova: Allow rcache range upper limit to be flexible iommu: Allow iommu_change_dev_def_domain() realloc same default domain type iommu: Allow max opt DMA len be set for a group via sysfs iova: Add iova_len argument to iova_domain_init_rcaches() .../ABI/testing/sysfs-kernel-iommu_groups | 16 ++ drivers/iommu/dma-iommu.c | 15 +- drivers/iommu/iommu.c | 202 +- drivers/iommu/iova.c | 37 ++-- drivers/vdpa/vdpa_user/iova_domain.c | 4 +- include/linux/iommu.h | 7 + include/linux/iova.h | 6 +- 7 files changed, 212 insertions(+), 75 deletions(-) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iova: Remove forward declarations
On 18/02/2022 16:28, John Garry wrote: Hi guys, A friendly reminder on this one. Cheers, john Now that the FQ code has been moved to dma-iommu.c and also the rcache- related structures have been brought into iova.c, let's rearrange the code to remove all the forward declarations. The general order is as follows: - RB tree code - iova management - magazine helpers - rcache code and "fast" APIs - iova domain public APIs Rearrange prototypes in iova.h to follow the same general group ordering. A couple of pre-existing checkpatch warnings are also remedied: A suspect indentation is also corrected: WARNING: suspect code indent for conditional statements (16, 32) #374: FILE: drivers/iommu/iova.c:194: + } else if (overlap) + break; WARNING: Block comments should align the * on each line #1038: FILE: drivers/iommu/iova.c:787: + * fails too and the flush_rcache flag is set then the rcache will be flushed. +*/ No functional change intended. Signed-off-by: John Garry diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 7e9c3a97c040..d543131025b3 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -17,75 +17,40 @@ #define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (in pages) */ -static bool iova_rcache_insert(struct iova_domain *iovad, - unsigned long pfn, - unsigned long size); -static unsigned long iova_rcache_get(struct iova_domain *iovad, -unsigned long size, -unsigned long limit_pfn); -static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); -static void free_iova_rcaches(struct iova_domain *iovad); +/* + * Magazine caches for IOVA ranges. For an introduction to magazines, + * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab + * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams. + * For simplicity, we use a static magazine size and don't implement the + * dynamic size tuning described in the paper. 
+ */ -static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node) -{ - struct iova_domain *iovad; +#define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ - iovad = hlist_entry_safe(node, struct iova_domain, cpuhp_dead); +struct iova_magazine { + unsigned long size; + unsigned long pfns[IOVA_MAG_SIZE]; +}; - free_cpu_cached_iovas(cpu, iovad); - return 0; -} +struct iova_cpu_rcache { + spinlock_t lock; + struct iova_magazine *loaded; + struct iova_magazine *prev; +}; -static void free_global_cached_iovas(struct iova_domain *iovad); +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; static struct iova *to_iova(struct rb_node *node) { return rb_entry(node, struct iova, node); } -void -init_iova_domain(struct iova_domain *iovad, unsigned long granule, - unsigned long start_pfn) -{ - /* -* IOVA granularity will normally be equal to the smallest -* supported IOMMU page size; both *must* be capable of -* representing individual CPU pages exactly. 
-*/ - BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule)); - - spin_lock_init(&iovad->iova_rbtree_lock); - iovad->rbroot = RB_ROOT; - iovad->cached_node = &iovad->anchor.node; - iovad->cached32_node = &iovad->anchor.node; - iovad->granule = granule; - iovad->start_pfn = start_pfn; - iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad)); - iovad->max32_alloc_size = iovad->dma_32bit_pfn; - iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; - rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); - rb_insert_color(&iovad->anchor.node, &iovad->rbroot); -} -EXPORT_SYMBOL_GPL(init_iova_domain); - -static struct rb_node * -__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn) -{ - if (limit_pfn <= iovad->dma_32bit_pfn) - return iovad->cached32_node; - - return iovad->cached_node; -} - -static void -__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new) -{ - if (new->pfn_hi < iovad->dma_32bit_pfn) - iovad->cached32_node = &new->node; - else - iovad->cached_node = &new->node; -} - static void __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) { @@ -104,43 +69,6 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) iovad->cached_node = rb_next(&free->node); }
Re: [PATCH v4 2/8] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 24/02/2022 03:53, Yicong Yang wrote:
On 2022/2/22 19:06, John Garry wrote:
On 21/02/2022 08:43, Yicong Yang wrote:

HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex integrated Endpoint (RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. This patch adds the basic trace function, including the device's probe and initialization, functions for trace buffer allocation and trace enable/disable, and registering an interrupt handler to simply respond to the DMA events. The user interface of trace will be added in the following patch.

Fill commit message lines up to 75 characters

Hi John, Thanks for the comments. The commit message is within 75 characters. I checked again and checkpatch didn't warn for this.

I mean to fill the lines up as much as possible, up to 75 char max, if not already done so. I am not sure if you are doing this already, but it looks like you were not. Checkpatch will not warn about a commit message like this :)

Thanks, john
Re: [PATCH v4 3/8] hisi_ptt: Register PMU device for PTT trace
+ static irqreturn_t hisi_ptt_irq(int irq, void *context) { struct hisi_ptt *hisi_ptt = context; @@ -169,7 +233,7 @@ static irqreturn_t hisi_ptt_irq(int irq, void *context) if (!(status & HISI_PTT_TRACE_INT_STAT_MASK)) return IRQ_NONE; - return IRQ_HANDLED; + return IRQ_WAKE_THREAD; } static void hisi_ptt_irq_free_vectors(void *pdev) @@ -192,8 +256,10 @@ static int hisi_ptt_register_irq(struct hisi_ptt *hisi_ptt) if (ret < 0) return ret; - ret = devm_request_irq(&pdev->dev, pci_irq_vector(pdev, HISI_PTT_TRACE_DMA_IRQ), - hisi_ptt_irq, 0, DRV_NAME, hisi_ptt); + ret = devm_request_threaded_irq(&pdev->dev, why add code in patch 2/8 and then immediately change 3/8? + pci_irq_vector(pdev, HISI_PTT_TRACE_DMA_IRQ), + hisi_ptt_irq, hisi_ptt_isr, 0, + DRV_NAME, hisi_ptt); if (ret) { pci_err(pdev, "failed to request irq %d, ret = %d.\n", pci_irq_vector(pdev, HISI_PTT_TRACE_DMA_IRQ), ret); @@ -270,6 +336,429 @@ static void hisi_ptt_init_ctrls(struct hisi_ptt *hisi_ptt) hisi_ptt->trace_ctrl.default_cpu = cpumask_first(cpumask_of_node(dev_to_node(&pdev->dev))); } +#define HISI_PTT_PMU_FILTER_IS_PORT BIT(19) +#define HISI_PTT_PMU_FILTER_VAL_MASK GENMASK(15, 0) +#define HISI_PTT_PMU_DIRECTION_MASKGENMASK(23, 20) +#define HISI_PTT_PMU_TYPE_MASK GENMASK(31, 24) +#define HISI_PTT_PMU_FORMAT_MASK GENMASK(35, 32) + +static ssize_t available_root_port_filters_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct hisi_ptt_filter_desc *filter; + int pos = 0; + + if (list_empty(&hisi_ptt->port_filters)) + return sysfs_emit(buf, "\n"); + + mutex_lock(&hisi_ptt->mutex); + list_for_each_entry(filter, &hisi_ptt->port_filters, list) + pos += sysfs_emit_at(buf, pos, "%s 0x%05lx\n", +pci_name(filter->pdev), +hisi_ptt_get_filter_val(filter->pdev) | +HISI_PTT_PMU_FILTER_IS_PORT); + + mutex_unlock(&hisi_ptt->mutex); + return pos; +} +static DEVICE_ATTR_ADMIN_RO(available_root_port_filters); + +static 
ssize_t available_requester_filters_show(struct device *dev, + struct device_attribute *attr, + char *buf) +{ + struct hisi_ptt *hisi_ptt = to_hisi_ptt(dev_get_drvdata(dev)); + struct hisi_ptt_filter_desc *filter; + int pos = 0; + + if (list_empty(&hisi_ptt->port_filters)) is this supposed to be req_filters? And is it safe to access without locking? + return sysfs_emit(buf, "\n"); + + mutex_lock(&hisi_ptt->mutex); + list_for_each_entry(filter, &hisi_ptt->req_filters, list) + pos += sysfs_emit_at(buf, pos, "%s 0x%05x\n", +pci_name(filter->pdev), +hisi_ptt_get_filter_val(filter->pdev)); + + mutex_unlock(&hisi_ptt->mutex); + return pos; +} +static DEVICE_ATTR_ADMIN_RO(available_requester_filters); + +PMU_FORMAT_ATTR(filter,"config:0-19"); +PMU_FORMAT_ATTR(direction, "config:20-23"); +PMU_FORMAT_ATTR(type, "config:24-31"); +PMU_FORMAT_ATTR(format,"config:32-35"); + +static struct attribute *hisi_ptt_pmu_format_attrs[] = { + &format_attr_filter.attr, + &format_attr_direction.attr, + &format_attr_type.attr, + &format_attr_format.attr, + NULL +}; + +static struct attribute_group hisi_ptt_pmu_format_group = { + .name = "format", + .attrs = hisi_ptt_pmu_format_attrs, +}; + +static struct attribute *hisi_ptt_pmu_filter_attrs[] = { + &dev_attr_available_root_port_filters.attr, + &dev_attr_available_requester_filters.attr, + NULL +}; + +static struct attribute_group hisi_ptt_pmu_filter_group = { + .attrs = hisi_ptt_pmu_filter_attrs, +}; + +static const struct attribute_group *hisi_ptt_pmu_groups[] = { + &hisi_ptt_pmu_format_group, + &hisi_ptt_pmu_filter_group, + NULL +}; + +/* + * The supported value of the direction parameter. See hisi_ptt.rst + * documentation for more details. + */ +static u32 hisi_ptt_trace_available_direction[] = { + 0, + 1, + 2, + 3, this seems a very odd array. And I assume it is const as it is modified - can this be non-global and tied to the device context? +}; + +/* Different types can be set simultaneously */ +static u32 hisi_ptt_trace_available_type
Re: [PATCH v4 2/8] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 21/02/2022 08:43, Yicong Yang wrote: HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. This patch adds basic function of trace, including the device's probe and initialization, functions for trace buffer allocation and trace enable/disable, register an interrupt handler to simply response to the DMA events. The user interface of trace will be added in the following patch. Fill commit message lines upto 75 characters Signed-off-by: Yicong Yang --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 11 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 370 +++ drivers/hwtracing/ptt/hisi_ptt.h | 149 + 6 files changed, 535 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index a110338c860c..ab3411e4eba5 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..41fa83921a07 --- 
/dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 && PCI && HAS_DMA && HAS_IOMEM why no compile test support? + help + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..a5b4f09ccd1e --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,370 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static int hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, +val, val & HISI_PTT_TRACE_IDLE, +HISI_PTT_WAIT_POLL_INTERVAL_US, +HISI_PTT_WAIT_TIMEOUT_US); +} + +static int hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_WR_STS, +val, !val, HISI_PTT_RESET_POLL_INTERVAL_US, +HISI_PTT_RESET_TIMEOUT_US); +} + +static void hisi_ptt_free_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + int i; + + if (!ctrl->trace_buf) + return; + + for (i = 0; i < HISI_PTT_TRACE_BUF_CNT; i++) it's good practice to use {} for if-else or similar in the loop + if (ctr
[PATCH] iova: Remove forward declarations
Now that the FQ code has been moved to dma-iommu.c and also the rcache- related structures have been brought into iova.c, let's rearrange the code to remove all the forward declarations. The general order is as follows: - RB tree code - iova management - magazine helpers - rcache code and "fast" APIs - iova domain public APIs Rearrange prototypes in iova.h to follow the same general group ordering. A couple of pre-existing checkpatch warnings are also remedied: A suspect indentation is also corrected: WARNING: suspect code indent for conditional statements (16, 32) #374: FILE: drivers/iommu/iova.c:194: + } else if (overlap) + break; WARNING: Block comments should align the * on each line #1038: FILE: drivers/iommu/iova.c:787: + * fails too and the flush_rcache flag is set then the rcache will be flushed. +*/ No functional change intended. Signed-off-by: John Garry diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 7e9c3a97c040..d543131025b3 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -17,75 +17,40 @@ #define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size (in pages) */ -static bool iova_rcache_insert(struct iova_domain *iovad, - unsigned long pfn, - unsigned long size); -static unsigned long iova_rcache_get(struct iova_domain *iovad, -unsigned long size, -unsigned long limit_pfn); -static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); -static void free_iova_rcaches(struct iova_domain *iovad); +/* + * Magazine caches for IOVA ranges. For an introduction to magazines, + * see the USENIX 2001 paper "Magazines and Vmem: Extending the Slab + * Allocator to Many CPUs and Arbitrary Resources" by Bonwick and Adams. + * For simplicity, we use a static magazine size and don't implement the + * dynamic size tuning described in the paper. 
+ */ -static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node) -{ - struct iova_domain *iovad; +#define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ - iovad = hlist_entry_safe(node, struct iova_domain, cpuhp_dead); +struct iova_magazine { + unsigned long size; + unsigned long pfns[IOVA_MAG_SIZE]; +}; - free_cpu_cached_iovas(cpu, iovad); - return 0; -} +struct iova_cpu_rcache { + spinlock_t lock; + struct iova_magazine *loaded; + struct iova_magazine *prev; +}; -static void free_global_cached_iovas(struct iova_domain *iovad); +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; static struct iova *to_iova(struct rb_node *node) { return rb_entry(node, struct iova, node); } -void -init_iova_domain(struct iova_domain *iovad, unsigned long granule, - unsigned long start_pfn) -{ - /* -* IOVA granularity will normally be equal to the smallest -* supported IOMMU page size; both *must* be capable of -* representing individual CPU pages exactly. 
-*/ - BUG_ON((granule > PAGE_SIZE) || !is_power_of_2(granule)); - - spin_lock_init(&iovad->iova_rbtree_lock); - iovad->rbroot = RB_ROOT; - iovad->cached_node = &iovad->anchor.node; - iovad->cached32_node = &iovad->anchor.node; - iovad->granule = granule; - iovad->start_pfn = start_pfn; - iovad->dma_32bit_pfn = 1UL << (32 - iova_shift(iovad)); - iovad->max32_alloc_size = iovad->dma_32bit_pfn; - iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; - rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); - rb_insert_color(&iovad->anchor.node, &iovad->rbroot); -} -EXPORT_SYMBOL_GPL(init_iova_domain); - -static struct rb_node * -__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn) -{ - if (limit_pfn <= iovad->dma_32bit_pfn) - return iovad->cached32_node; - - return iovad->cached_node; -} - -static void -__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new) -{ - if (new->pfn_hi < iovad->dma_32bit_pfn) - iovad->cached32_node = &new->node; - else - iovad->cached_node = &new->node; -} - static void __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) { @@ -104,43 +69,6 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) iovad->cached_node = rb_next(&free->node); } -static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned long limit_pfn) -{ - s
[PATCH v5 5/5] iova: Add iova_len argument to iova_domain_init_rcaches()
Add max opt argument to iova_domain_init_rcaches(), and use it to set the rcaches range. Also fix up all users to set this value (at 0, meaning use default), including a wrapper for that, iova_domain_init_rcaches_default(). For dma-iommu.c we derive the iova_len argument from the IOMMU group max opt DMA size. Signed-off-by: John Garry --- drivers/iommu/dma-iommu.c| 15 ++- drivers/iommu/iova.c | 19 --- drivers/vdpa/vdpa_user/iova_domain.c | 4 ++-- include/linux/iova.h | 3 ++- 4 files changed, 34 insertions(+), 7 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 908272f0153e..7afa2226c6bd 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,8 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + size_t max_opt_dma_size; + unsigned long iova_len = 0; int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) @@ -560,7 +562,18 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); - ret = iova_domain_init_rcaches(iovad); + + max_opt_dma_size = iommu_group_get_max_opt_dma_size(dev->iommu_group); + if (max_opt_dma_size) { + unsigned long shift = __ffs(1UL << order); + + iova_len = roundup_pow_of_two(max_opt_dma_size); + iova_len >>= shift; + if (!iova_len) + iova_len = 1; + } + + ret = iova_domain_init_rcaches(iovad, iova_len); if (ret) return ret; diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index f6eb93c737cb..f05aae044e25 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -705,12 +705,20 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -int iova_domain_init_rcaches(struct iova_domain *iovad) +static unsigned long iova_len_to_rcache_max(unsigned long iova_len) +{ + return order_base_2(iova_len) 
+ 1; +} + +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len) { unsigned int cpu; int i, ret; - iovad->rcache_max_size = 6; /* Arbitrarily high default */ + if (iova_len) + iovad->rcache_max_size = iova_len_to_rcache_max(iova_len); + else + iovad->rcache_max_size = 6; /* Arbitrarily high default */ iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), @@ -754,7 +762,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) free_iova_rcaches(iovad); return ret; } -EXPORT_SYMBOL_GPL(iova_domain_init_rcaches); + +int iova_domain_init_rcaches_default(struct iova_domain *iovad) +{ + return iova_domain_init_rcaches(iovad, 0); +} +EXPORT_SYMBOL_GPL(iova_domain_init_rcaches_default); /* * Try inserting IOVA range starting with 'iova_pfn' into 'rcache', and diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c index 22f7d43f8a68..74471bc463ec 100644 --- a/drivers/vdpa/vdpa_user/iova_domain.c +++ b/drivers/vdpa/vdpa_user/iova_domain.c @@ -514,12 +514,12 @@ vduse_domain_create(unsigned long iova_limit, size_t bounce_size) spin_lock_init(&domain->iotlb_lock); init_iova_domain(&domain->stream_iovad, PAGE_SIZE, IOVA_START_PFN); - ret = iova_domain_init_rcaches(&domain->stream_iovad); + ret = iova_domain_init_rcaches_default(&domain->stream_iovad); if (ret) goto err_iovad_stream; init_iova_domain(&domain->consistent_iovad, PAGE_SIZE, bounce_pfns); - ret = iova_domain_init_rcaches(&domain->consistent_iovad); + ret = iova_domain_init_rcaches_default(&domain->consistent_iovad); if (ret) goto err_iovad_consistent; diff --git a/include/linux/iova.h b/include/linux/iova.h index 02f7222fa85a..56281434ce0c 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -95,7 +95,8 @@ struct iova *reserve_iova(struct iova_domain *iovad, unsigned long pfn_lo, unsigned long pfn_hi); void init_iova_domain(struct iova_domain *iovad, unsigned long granule, unsigned long start_pfn); -int 
iova_domain_init_rcaches(struct iova_domain *iovad); +int iova_domain_init_rcaches(struct iova_domain *iovad, unsigned long iova_len); +int iova_domain_init_rcaches_default(struct iova_domain *iovad); struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn); void put_iov
[PATCH v5 4/5] iommu: Allow max opt DMA len be set for a group via sysfs
Add support to allow the maximum optimised DMA len be set for an IOMMU group via sysfs. This is much the same with the method to change the default domain type for a group. Signed-off-by: John Garry --- .../ABI/testing/sysfs-kernel-iommu_groups | 16 + drivers/iommu/iommu.c | 59 ++- include/linux/iommu.h | 6 ++ 3 files changed, 79 insertions(+), 2 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-kernel-iommu_groups b/Documentation/ABI/testing/sysfs-kernel-iommu_groups index b15af6a5bc08..ed6f72794f6c 100644 --- a/Documentation/ABI/testing/sysfs-kernel-iommu_groups +++ b/Documentation/ABI/testing/sysfs-kernel-iommu_groups @@ -63,3 +63,19 @@ Description: /sys/kernel/iommu_groups//type shows the type of default system could lead to catastrophic effects (the users might need to reboot the machine to get it to normal state). So, it's expected that the users understand what they're doing. + +What: /sys/kernel/iommu_groups//max_opt_dma_size +Date: Feb 2022 +KernelVersion: v5.18 +Contact: iommu@lists.linux-foundation.org +Description: /sys/kernel/iommu_groups//max_opt_dma_size shows the + max optimised DMA size for the default IOMMU domain associated + with the group. + Each IOMMU domain has an IOVA domain. The IOVA domain caches + IOVAs upto a certain size as a performance optimisation. + This sysfs file allows the range of the IOVA domain caching be + set, such that larger than default IOVAs may be cached. + A value of 0 means that the default caching range is chosen. + A privileged user could request the kernel the change the range + by writing to this file. For this to happen, the same rules + and procedure applies as in changing the default domain type. 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index df9ffd76c184..79f5cbea5c95 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -48,6 +48,7 @@ struct iommu_group { struct iommu_domain *default_domain; struct iommu_domain *domain; struct list_head entry; + size_t max_opt_dma_size; }; struct group_device { @@ -89,6 +90,9 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group, static struct iommu_group *iommu_group_get_for_dev(struct device *dev); static ssize_t iommu_group_store_type(struct iommu_group *group, const char *buf, size_t count); +static ssize_t iommu_group_store_max_opt_dma_size(struct iommu_group *group, + const char *buf, + size_t count); #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \ struct iommu_group_attribute iommu_group_attr_##_name =\ @@ -570,6 +574,12 @@ static ssize_t iommu_group_show_type(struct iommu_group *group, return strlen(type); } +static ssize_t iommu_group_show_max_opt_dma_size(struct iommu_group *group, +char *buf) +{ + return sprintf(buf, "%zu\n", group->max_opt_dma_size); +} + static IOMMU_GROUP_ATTR(name, S_IRUGO, iommu_group_show_name, NULL); static IOMMU_GROUP_ATTR(reserved_regions, 0444, @@ -578,6 +588,9 @@ static IOMMU_GROUP_ATTR(reserved_regions, 0444, static IOMMU_GROUP_ATTR(type, 0644, iommu_group_show_type, iommu_group_store_type); +static IOMMU_GROUP_ATTR(max_opt_dma_size, 0644, iommu_group_show_max_opt_dma_size, + iommu_group_store_max_opt_dma_size); + static void iommu_group_release(struct kobject *kobj) { struct iommu_group *group = to_iommu_group(kobj); @@ -664,6 +677,10 @@ struct iommu_group *iommu_group_alloc(void) if (ret) return ERR_PTR(ret); + ret = iommu_group_create_file(group, &iommu_group_attr_max_opt_dma_size); + if (ret) + return ERR_PTR(ret); + pr_debug("Allocated group %d\n", group->id); return group; @@ -2302,6 +2319,11 @@ struct iommu_domain *iommu_get_dma_domain(struct device *dev) return dev->iommu_group->default_domain; } +size_t 
iommu_group_get_max_opt_dma_size(struct iommu_group *group) +{ + return group->max_opt_dma_size; +} + /* * IOMMU groups are really the natural working unit of the IOMMU, but * the IOMMU API works on domains and devices. Bridge that gap by @@ -3132,12 +3154,14 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * @prev_dev: The device in the group (this is used to make sure that the device * hasn't changed after the caller has called this function) * @type: The type of the new default domain that gets associated with the group + * @max_opt_dma_
(Note: the remainder of this patch was truncated by the archive; the hunks above cover the sysfs attribute plumbing and the iommu_group_get_max_opt_dma_size() accessor.)
[PATCH v5 3/5] iommu: Allow iommu_change_dev_def_domain() realloc same default domain type
Allow iommu_change_dev_def_domain() to create a new default domain, keeping the same as current. Also remove comment about the function purpose, which will become stale. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 49 ++- include/linux/iommu.h | 1 + 2 files changed, 26 insertions(+), 24 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 5e7ed969b870..df9ffd76c184 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -3124,6 +3124,7 @@ u32 iommu_sva_get_pasid(struct iommu_sva *handle) } EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); + /* * Changes the default domain of an iommu group that has *only* one device * @@ -3134,10 +3135,6 @@ EXPORT_SYMBOL_GPL(iommu_sva_get_pasid); * * Returns 0 on success and error code on failure * - * Note: - * 1. Presently, this function is called only when user requests to change the - *group's default domain type through /sys/kernel/iommu_groups//type - *Please take a closer look if intended to use for other purposes. */ static int iommu_change_dev_def_domain(struct iommu_group *group, struct device *prev_dev, int type) @@ -3190,28 +3187,32 @@ static int iommu_change_dev_def_domain(struct iommu_group *group, goto out; } - dev_def_dom = iommu_get_def_domain_type(dev); - if (!type) { + if (type == __IOMMU_DOMAIN_SAME) { + type = prev_dom->type; + } else { + dev_def_dom = iommu_get_def_domain_type(dev); + if (!type) { + /* +* If the user hasn't requested any specific type of domain and +* if the device supports both the domains, then default to the +* domain the device was booted with +*/ + type = dev_def_dom ? 
: iommu_def_domain_type; + } else if (dev_def_dom && type != dev_def_dom) { + dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n", + iommu_domain_type_str(type)); + ret = -EINVAL; + goto out; + } + /* -* If the user hasn't requested any specific type of domain and -* if the device supports both the domains, then default to the -* domain the device was booted with +* Switch to a new domain only if the requested domain type is different +* from the existing default domain type */ - type = dev_def_dom ? : iommu_def_domain_type; - } else if (dev_def_dom && type != dev_def_dom) { - dev_err_ratelimited(prev_dev, "Device cannot be in %s domain\n", - iommu_domain_type_str(type)); - ret = -EINVAL; - goto out; - } - - /* -* Switch to a new domain only if the requested domain type is different -* from the existing default domain type -*/ - if (prev_dom->type == type) { - ret = 0; - goto out; + if (prev_dom->type == type) { + ret = 0; + goto out; + } } /* We can bring up a flush queue without tearing down the domain */ diff --git a/include/linux/iommu.h b/include/linux/iommu.h index de0c57a567c8..d242fccc7c2d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -62,6 +62,7 @@ struct iommu_domain_geometry { implementation */ #define __IOMMU_DOMAIN_PT (1U << 2) /* Domain is identity mapped */ #define __IOMMU_DOMAIN_DMA_FQ (1U << 3) /* DMA-API uses flush queue*/ +#define __IOMMU_DOMAIN_SAME(1U << 4) /* Keep same type (internal) */ /* * This are the possible domain-types -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v5 2/5] iova: Allow rcache range upper limit to be flexible
Some low-level drivers may request DMA mappings whose IOVA length exceeds that of the current rcache upper limit. This means that allocations for those IOVAs will never be cached, and always must be allocated and freed from the RB tree per DMA mapping cycle. This has a significant effect on performance, more so since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. As a first step towards allowing the rcache range upper limit be configured, hold this value in the IOVA rcache structure, and allocate the rcaches separately. Delete macro IOVA_RANGE_CACHE_MAX_SIZE in case it's reused by mistake. [0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/ Signed-off-by: John Garry --- drivers/iommu/iova.c | 20 ++-- include/linux/iova.h | 3 +++ 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index 7e9c3a97c040..f6eb93c737cb 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,8 +15,6 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR~0UL -#define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size (in pages) */ - static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); @@ -442,7 +440,7 @@ alloc_iova_fast(struct iova_domain *iovad, unsigned long size, * rounding up anything cacheable to make sure that can't happen. The * order of the unadjusted size will still match upon freeing. 
*/ - if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1))) + if (size < (1 << (iovad->rcache_max_size - 1))) size = roundup_pow_of_two(size); iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1); @@ -712,13 +710,15 @@ int iova_domain_init_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, ret; - iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE, + iovad->rcache_max_size = 6; /* Arbitrarily high default */ + + iovad->rcaches = kcalloc(iovad->rcache_max_size, sizeof(struct iova_rcache), GFP_KERNEL); if (!iovad->rcaches) return -ENOMEM; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { struct iova_cpu_rcache *cpu_rcache; struct iova_rcache *rcache; @@ -815,7 +815,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE) + if (log_size >= iovad->rcache_max_size) return false; return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn); @@ -871,7 +871,7 @@ static unsigned long iova_rcache_get(struct iova_domain *iovad, { unsigned int log_size = order_base_2(size); - if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE || !iovad->rcaches) + if (log_size >= iovad->rcache_max_size || !iovad->rcaches) return 0; return __iova_rcache_get(&iovad->rcaches[log_size], limit_pfn - size); @@ -887,7 +887,7 @@ static void free_iova_rcaches(struct iova_domain *iovad) unsigned int cpu; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; if (!rcache->cpu_rcaches) break; @@ -915,7 +915,7 @@ static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad) unsigned long flags; int i; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu); spin_lock_irqsave(&cpu_rcache->lock, flags); @@ -934,7 
+934,7 @@ static void free_global_cached_iovas(struct iova_domain *iovad) unsigned long flags; int i, j; - for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) { + for (i = 0; i < iovad->rcache_max_size; ++i) { rcache = &iovad->rcaches[i]; spin_lock_irqsave(&rcache->lock, flags); for (j = 0; j < rcache->depot_size; ++j) { diff --git a/include/linux/iova.h b/include/linux/iova.h index 320a70e40233..02f7222fa85a 100644 --- a/include/linux/iova.h +++ b/include/linux/iova.h @@ -38,6 +38,9 @@ struct iova_domain { struct iova_rcache *rcaches; struct hlist_node
[PATCH v5 1/5] iommu: Refactor iommu_group_store_type()
Function iommu_group_store_type() supports changing the default domain of an IOMMU group. Many conditions need to be satisfied and steps taken for this action to be successful. Satisfying these conditions and steps will be required for setting other IOMMU group attributes, so factor into a common part and a part specific to update the IOMMU group attribute. No functional change intended. Some code comments are tidied up also. Signed-off-by: John Garry --- drivers/iommu/iommu.c | 96 --- 1 file changed, 62 insertions(+), 34 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 107dcf5938d6..5e7ed969b870 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -3261,21 +3261,57 @@ static int iommu_change_dev_def_domain(struct iommu_group *group, return ret; } +enum iommu_group_op { + CHANGE_GROUP_TYPE, +}; + +static int __iommu_group_store_type(const char *buf, struct iommu_group *group, + struct device *dev) +{ + int type; + + if (sysfs_streq(buf, "identity")) + type = IOMMU_DOMAIN_IDENTITY; + else if (sysfs_streq(buf, "DMA")) + type = IOMMU_DOMAIN_DMA; + else if (sysfs_streq(buf, "DMA-FQ")) + type = IOMMU_DOMAIN_DMA_FQ; + else if (sysfs_streq(buf, "auto")) + type = 0; + else + return -EINVAL; + + /* +* Check if the only device in the group still has a driver bound or +* we're transistioning from DMA -> DMA-FQ +*/ + if (device_is_bound(dev) && !(type == IOMMU_DOMAIN_DMA_FQ && + group->default_domain->type == IOMMU_DOMAIN_DMA)) { + pr_err_ratelimited("Device is still bound to driver\n"); + return -EINVAL; + } + + return iommu_change_dev_def_domain(group, dev, type); +} + /* * Changing the default domain through sysfs requires the users to unbind the * drivers from the devices in the iommu group, except for a DMA -> DMA-FQ - * transition. Return failure if this isn't met. + * transition. Changing or any other IOMMU group attribute still requires the + * user to unbind the drivers from the devices in the iommu group. 
Return + * failure if these conditions are not met. * * We need to consider the race between this and the device release path. * device_lock(dev) is used here to guarantee that the device release path * will not be entered at the same time. */ -static ssize_t iommu_group_store_type(struct iommu_group *group, - const char *buf, size_t count) +static ssize_t iommu_group_store_common(struct iommu_group *group, + enum iommu_group_op op, + const char *buf, size_t count) { struct group_device *grp_dev; struct device *dev; - int ret, req_type; + int ret; if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO)) return -EACCES; @@ -3283,27 +3319,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group, if (WARN_ON(!group)) return -EINVAL; - if (sysfs_streq(buf, "identity")) - req_type = IOMMU_DOMAIN_IDENTITY; - else if (sysfs_streq(buf, "DMA")) - req_type = IOMMU_DOMAIN_DMA; - else if (sysfs_streq(buf, "DMA-FQ")) - req_type = IOMMU_DOMAIN_DMA_FQ; - else if (sysfs_streq(buf, "auto")) - req_type = 0; - else - return -EINVAL; - /* * Lock/Unlock the group mutex here before device lock to -* 1. Make sure that the iommu group has only one device (this is a +* 1. Make sure that the IOMMU group has only one device (this is a *prerequisite for step 2) * 2. Get struct *dev which is needed to lock device */ mutex_lock(&group->mutex); if (iommu_group_device_count(group) != 1) { mutex_unlock(&group->mutex); - pr_err_ratelimited("Cannot change default domain: Group has more than one device\n"); + pr_err_ratelimited("Cannot change IOMMU group default domain attribute: Group has more than one device\n"); return -EINVAL; } @@ -3315,16 +3340,16 @@ static ssize_t iommu_group_store_type(struct iommu_group *group, /* * Don't hold the group mutex because taking group mutex first and then * the device lock could potentially cause a deadlock as below. Assume -* two threads T1 and T2. 
T1 is trying to change default domain of an -* iommu group and T2 is trying to hot unplug a device or release [1] VF -* of a PCIe device which is in th
[PATCH v5 0/5] iommu: Allow IOVA rcache range be configured
For streaming DMA mappings involving an IOMMU, where the IOVA length regularly exceeds the IOVA rcache upper limit (meaning that they are not cached), performance can be reduced. This may be much more pronounced since commit 4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search fails"), as discussed at [0]. IOVAs which cannot be cached are highly involved in the IOVA ageing issue, as discussed at [1].

This series allows the IOVA rcache range to be configured, so that we may cache all IOVAs per domain, thus improving performance.

A new IOMMU group sysfs file is added - max_opt_dma_size - which is used indirectly to configure the IOVA rcache range:
/sys/kernel/iommu_groups/X/max_opt_dma_size

This file is updated in the same way as the IOMMU group default domain type, i.e. the only device in the group must be unbound first. The inspiration here comes from the block layer request queue sysfs "optimal_io_size" file, in /sys/block/sdX/queue/optimal_io_size

Some old figures* for a storage scenario (when increasing the IOVA rcache range to cover all DMA mapping sizes from the LLD):
v5.13-rc1 baseline: 1200K IOPS
With series:        1800K IOPS

All of the above are for IOMMU strict mode. Non-strict mode gives ~1800K IOPS in all scenarios.
Based on v5.17-rc4 + [2]

* I lost my high data throughput test setup

Differences to v4: https://lore.kernel.org/linux-iommu/1626259003-201303-1-git-send-email-john.ga...@huawei.com/
- Major rebase
- Change "Refactor iommu_group_store_type()" to not use a callback, using an op type enum instead
- I didn't pick up Will's Ack as it has changed so much
- Use a domain feature flag to keep the same default group type
- Add a wrapper for the default IOVA rcache range
- Combine the last 2x patches

[0] https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/
[1] https://lore.kernel.org/linux-iommu/1607538189-237944-1-git-send-email-john.ga...@huawei.com/
[2] https://lore.kernel.org/linux-iommu/20220203063345-mutt-send-email-...@kernel.org/T/#m5b2b59576d35cad544314470f32e5f40ac5d1fe9

John Garry (5):
  iommu: Refactor iommu_group_store_type()
  iova: Allow rcache range upper limit to be flexible
  iommu: Allow iommu_change_dev_def_domain() realloc same default domain type
  iommu: Allow max opt DMA len be set for a group via sysfs
  iova: Add iova_len argument to iova_domain_init_rcaches()

 .../ABI/testing/sysfs-kernel-iommu_groups | 16 ++
 drivers/iommu/dma-iommu.c | 15 +-
 drivers/iommu/iommu.c | 202 +-
 drivers/iommu/iova.c | 37 ++--
 drivers/vdpa/vdpa_user/iova_domain.c | 4 +-
 include/linux/iommu.h | 7 +
 include/linux/iova.h | 6 +-
 7 files changed, 212 insertions(+), 75 deletions(-)

--
2.26.2
Re: [PATCH v2] iommu/iova: Separate out rcache init
On 03/02/2022 11:34, Michael S. Tsirkin wrote: On Thu, Feb 03, 2022 at 05:59:20PM +0800, John Garry wrote: Currently the rcache structures are allocated for all IOVA domains, even if they do not use "fast" alloc+free interface. This is wasteful of memory. In addition, fails in init_iova_rcaches() are not handled safely, which is less than ideal. Make "fast" users call a separate rcache init explicitly, which includes error checking. Signed-off-by: John Garry virtio things: Acked-by: Michael S. Tsirkin Cheers Hi Robin, Can you kindly give this your blessing if you are happy with it? Thanks! --- Differences to v1: - Drop stubs for iova_domain_init_rcaches() and iova_domain_free_rcaches() - Use put_iova_domain() in vdpa code diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d85d54f2b549..b22034975301 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) return -EINVAL; @@ -559,6 +560,9 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); + ret = iova_domain_init_rcaches(iovad); + if (ret) + return ret; /* If the FQ fails we can simply fall back to strict mode */ if (domain->type == IOMMU_DOMAIN_DMA_FQ && iommu_dma_init_fq(domain)) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b28c9435b898..7e9c3a97c040 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,13 +15,14 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR ~0UL +#define IOVA_RANGE_CACHE_MAX_SIZE 6 /* log of max cached IOVA range size (in pages) */ + static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); 
static unsigned long iova_rcache_get(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn); -static void init_iova_rcaches(struct iova_domain *iovad); static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad); static void free_iova_rcaches(struct iova_domain *iovad); @@ -64,8 +65,6 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); rb_insert_color(&iovad->anchor.node, &iovad->rbroot); - cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, &iovad->cpuhp_dead); - init_iova_rcaches(iovad); } EXPORT_SYMBOL_GPL(init_iova_domain); @@ -488,6 +487,13 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size) } EXPORT_SYMBOL_GPL(free_iova_fast); +static void iova_domain_free_rcaches(struct iova_domain *iovad) +{ + cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, + &iovad->cpuhp_dead); + free_iova_rcaches(iovad); +} + /** * put_iova_domain - destroys the iova domain * @iovad: - iova domain in question. 
@@ -497,9 +503,9 @@ void put_iova_domain(struct iova_domain *iovad) { struct iova *iova, *tmp; - cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, - &iovad->cpuhp_dead); - free_iova_rcaches(iovad); + if (iovad->rcaches) + iova_domain_free_rcaches(iovad); + rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node) free_iova_mem(iova); } @@ -608,6 +614,7 @@ EXPORT_SYMBOL_GPL(reserve_iova); */ #define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ struct iova_magazine { unsigned long size; @@ -620,6 +627,13 @@ struct iova_cpu_rcache { struct iova_magazine *prev; }; +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; + static struct iova_magazine *iova_magazine_alloc(gfp_t flags) { return kzalloc(sizeof(struct iova_magazine), flags); @@ -693,28 +707,54 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -static void init_iova_rcaches(struct iova_domain *iovad) +int iova_domain_init_rcaches(struct iova_domain *iovad) { -
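The shape of the change above — moving cache setup out of the unconditional init path and behind an explicit, error-checked opt-in — can be sketched outside the kernel roughly as follows. The toy_* names and structures are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Illustrative stand-in for the kernel structures; not the real API. */
struct toy_rcache {
	int depth;
};

struct toy_domain {
	struct toy_rcache *rcaches;	/* stays NULL unless a "fast" user opts in */
};

/* Base init can no longer fail on cache allocation, mirroring the
 * patched init_iova_domain(). */
static void toy_domain_init(struct toy_domain *d)
{
	d->rcaches = NULL;
}

/* Explicit opt-in with error checking, in the spirit of
 * iova_domain_init_rcaches(). */
static int toy_domain_init_rcaches(struct toy_domain *d)
{
	d->rcaches = calloc(1, sizeof(*d->rcaches));
	if (!d->rcaches)
		return -ENOMEM;
	return 0;
}

/* Teardown frees the caches only if they were actually set up, as the
 * patched put_iova_domain() does with its "if (iovad->rcaches)" check. */
static void toy_domain_destroy(struct toy_domain *d)
{
	free(d->rcaches);
	d->rcaches = NULL;
}
```

The point of the split is that non-"fast" users pay no memory for the caches, and teardown stays safe whether or not the opt-in ever ran.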
Re: [PATCH v3 8/8] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 08/02/2022 11:21, Yicong Yang wrote: This patch should be earlier in the series, before the PTT driver, and the comment on hisi_ptt_check_iommu_mapping() should mention what is going on here. OK, I'll reorder the series and modify the comments of hisi_ptt_check_iommu_mapping() like: /* * The DMA of PTT trace can only use direct mapping, due to some * hardware restriction. Check whether there is an iommu or the * policy of the iommu domain is passthrough, otherwise the trace * cannot work. IOMMU, capitalize acronyms * * The PTT device is supposed to be behind the arm smmu v3, which * should have passthrough the device by a quirk. Otherwise user * should manually set the iommu domain type to identity through * sysfs. Sorry, but I don't really understand your meaning here. I did not think that if we have a default domain then we can change via sysfs to anything else. */ Signed-off-by: Yicong Yang --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6dc6d8b6b368..6f67a2b1dd27 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2838,6 +2838,21 @@ static int arm_smmu_dev_disable_feature(struct device *dev, } } +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ + (pdev)->device == 0xa12e) I assume that not all revisions will require this check, right? So if you are very confident that the next revision will be fixed then I would add a check for this current broken revision. For current revisions it's necessary. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 8/8] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 24/01/2022 13:11, Yicong Yang wrote: The DMA of the HiSilicon PTT device can only work with identity mapping. So add a quirk for the device to force the domain passthrough. This patch should be earlier in the series, before the PTT driver, and the comment on hisi_ptt_check_iommu_mapping() should mention what is going on here. Signed-off-by: Yicong Yang --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 1 file changed, 16 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6dc6d8b6b368..6f67a2b1dd27 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2838,6 +2838,21 @@ static int arm_smmu_dev_disable_feature(struct device *dev, } } +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI && \ + (pdev)->device == 0xa12e) I assume that not all revisions will require this check, right? + +static int arm_smmu_def_domain_type(struct device *dev) +{ + if (dev_is_pci(dev)) { + struct pci_dev *pdev = to_pci_dev(dev); + + if (IS_HISI_PTT_DEVICE(pdev)) + return IOMMU_DOMAIN_IDENTITY; + } + + return 0; +} + static struct iommu_ops arm_smmu_ops = { .capable= arm_smmu_capable, .domain_alloc = arm_smmu_domain_alloc, @@ -2863,6 +2878,7 @@ static struct iommu_ops arm_smmu_ops = { .sva_unbind = arm_smmu_sva_unbind, .sva_get_pasid = arm_smmu_sva_get_pasid, .page_response = arm_smmu_page_response, + .def_domain_type= arm_smmu_def_domain_type, .pgsize_bitmap = -1UL, /* Restricted during device attach */ .owner = THIS_MODULE, };
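The mechanism in the quirk above — keying an IOMMU domain-type decision off a PCI vendor/device ID pair — boils down to a simple match function. A standalone sketch (the toy_* names are stand-ins for the kernel types; 0x19e5 is the Huawei PCI vendor ID, and the 0xa12e device ID comes from the patch; as the review notes, a real quirk might also need to check the device revision once fixed silicon exists):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for struct pci_dev; only the ID fields matter here. */
struct toy_pci_dev {
	uint16_t vendor;
	uint16_t device;
};

#define TOY_VENDOR_ID_HUAWEI	0x19e5	/* PCI_VENDOR_ID_HUAWEI */
#define TOY_HISI_PTT_DEVID	0xa12e	/* device ID from the patch */

/* Mirrors the .def_domain_type contract: 0 means "no preference",
 * a nonzero value forces that domain type for the device. */
enum toy_domain_type { TOY_DOMAIN_ANY = 0, TOY_DOMAIN_IDENTITY = 1 };

static int toy_def_domain_type(const struct toy_pci_dev *pdev)
{
	if (pdev->vendor == TOY_VENDOR_ID_HUAWEI &&
	    pdev->device == TOY_HISI_PTT_DEVID)
		return TOY_DOMAIN_IDENTITY;
	return TOY_DOMAIN_ANY;	/* let the IOMMU core pick the default */
}
```

Returning 0 for every other device is what keeps the quirk narrowly scoped: only the PTT endpoint is forced into identity mapping.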
Re: [PATCH v3 1/8] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 24/01/2022 13:11, Yicong Yang wrote: HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex integrated Endpoint (RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. This patch adds the basic trace function, including the device's probe and initialization, functions for trace buffer allocation and trace enable/disable, and registers an interrupt handler to simply respond to the DMA events. The user interface of trace will be added in the following patch. Signed-off-by: Yicong Yang --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 11 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 398 +++ drivers/hwtracing/ptt/hisi_ptt.h | 159 6 files changed, 573 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index a110338c860c..ab3411e4eba5 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..911ee977103c 100644 --- a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..4f4f2459ac47 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0
+1,11 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 && PCI && HAS_DMA && HAS_IOMEM + help + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP + device, and it provides support for PCIe traffic tuning and + tracing TLP headers to the memory. + + This driver can also be built as a module. If so, the module + will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..6d0a0ca5c0a9 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,398 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd.
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static int hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return readl_poll_timeout(hisi_ptt->iobase + HISI_PTT_TRACE_STS, val, + val & HISI_PTT_TRACE_IDLE, + HISI_PTT_WAIT_POLL_INTERVAL_US, + HISI_PTT_WAIT_TIMEOUT_US); +} + +static void hisi_ptt_free_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + struct hisi_ptt_dma_buffer *buffer, *tbuffer; + + list_for_each_entry_safe(buffer, tbuffer, &ctrl->trace_buf, list) { + list_del(&buffer->list); + dma_free_coherent(dev, buffer->size, buffer->addr, + buffer->dma); + kfree(buffer); + } +} + +static int hisi_ptt_alloc_trace_buf(struct hisi_ptt *hisi_ptt) +{ + struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; + struct device *dev = &hisi_ptt->pdev->dev; + struct hisi_ptt_dma_buffer *buffer; + int i, ret; + + hisi_ptt->trace_ctrl.buf_index = 0;
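The allocation path quoted above pairs a list-building allocator with a free routine that walks the list (hisi_ptt_alloc_trace_buf() / hisi_ptt_free_trace_buf()); the important property is that a mid-loop failure unwinds everything already allocated. A userspace sketch of that pattern, with plain calloc standing in for dma_alloc_coherent() and toy_* names that are purely illustrative:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_buffer {
	void *addr;
	size_t size;
	struct toy_buffer *next;
};

/* Free every buffer on the list, mirroring the walk in
 * hisi_ptt_free_trace_buf(). */
static void toy_free_bufs(struct toy_buffer **head)
{
	struct toy_buffer *b, *tmp;

	for (b = *head; b; b = tmp) {
		tmp = b->next;
		free(b->addr);
		free(b);
	}
	*head = NULL;
}

/* Allocate n buffers onto the list; on any failure, unwind fully so
 * the caller never sees a half-built list. */
static int toy_alloc_bufs(struct toy_buffer **head, int n, size_t size)
{
	int i;

	*head = NULL;
	for (i = 0; i < n; i++) {
		struct toy_buffer *b = malloc(sizeof(*b));

		if (!b)
			goto err;
		b->addr = calloc(1, size);
		if (!b->addr) {
			free(b);
			goto err;
		}
		b->size = size;
		b->next = *head;
		*head = b;
	}
	return 0;
err:
	toy_free_bufs(head);
	return -ENOMEM;
}
```

Because the error path reuses the normal free routine, there is one place to get the teardown right, which is what makes this idiom robust in drivers.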
Re: [PATCH v2] iommu/core: Remove comment reference to iommu_dev_has_feature
On 07/02/2022 03:23, Akeem G Abodunrin wrote: The iommu_dev_has_feature() API was removed by commit 262948f8ba573 ("iommu: Delete iommu_dev_has_feature()") - so this patch removes the comment about that API to avoid any confusion. Signed-off-by: Akeem G Abodunrin Cc: Lu Baolu Reviewed-by: Christoph Hellwig Reviewed-by: John Garry BTW, it looks like we can get rid of iommu_ops.dev_has_feat also. It does not seem to be called, even though the arm-smmu-v3 driver does provide a callback. --- include/linux/iommu.h | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index de0c57a567c8..bea054f2bd4d 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -153,8 +153,7 @@ struct iommu_resv_region { * supported, this feature must be enabled before and * disabled after %IOMMU_DEV_FEAT_SVA. * - * Device drivers query whether a feature is supported using - * iommu_dev_has_feature(), and enable it using iommu_dev_enable_feature(). + * Device drivers enable the feature via iommu_dev_enable_feature(). */ enum iommu_dev_features { IOMMU_DEV_FEAT_AUX,
Re: [PATCH v1 04/10] iommu/vt-d: Remove iova_cache_get/put()
On 07/02/2022 06:41, Lu Baolu wrote: diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 583ec0fa4ac1..e8d58654361c 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -3348,9 +3348,6 @@ static inline int iommu_devinfo_cache_init(void) static int __init iommu_init_mempool(void) { int ret; - ret = iova_cache_get(); - if (ret) - return ret; ret = iommu_domain_cache_init(); if (ret) @@ -3362,7 +3359,6 @@ static int __init iommu_init_mempool(void) kmem_cache_destroy(iommu_domain_cache); domain_error: Nit: is this label still really required? Only failures in iommu_domain_cache_init() jump to it, and that can return directly now. Thanks, John - iova_cache_put();
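The nit above is about the standard kernel error-unwinding idiom: a goto label only earns its keep if something already set up needs unwinding when the label is reached. A minimal sketch of the shape being suggested (stage names are made up for illustration; fail_b is a test knob, not part of the pattern):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static int a_initialized;
static bool fail_b;	/* test knob: force the second stage to fail */

static int stage_a_init(void)  { a_initialized = 1; return 0; }
static void stage_a_destroy(void) { a_initialized = 0; }
static int stage_b_init(void)  { return fail_b ? -ENOMEM : 0; }

/* With the first allocation removed from the function (as iova_cache_get()
 * was in the patch), a failure before anything is set up can simply
 * return; only a later failure needs a label to unwind prior stages. */
static int toy_init(void)
{
	int ret;

	ret = stage_a_init();
	if (ret)
		return ret;	/* nothing to unwind yet: no label needed */

	ret = stage_b_init();
	if (ret)
		goto out_destroy_a;	/* unwind stage A only */

	return 0;

out_destroy_a:
	stage_a_destroy();
	return ret;
}
```

When a stage is deleted, any label whose only remaining jumpers have nothing to unwind can collapse into a direct return, which is exactly the cleanup the review suggests.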
[PATCH v2] iommu/iova: Separate out rcache init
Currently the rcache structures are allocated for all IOVA domains, even if they do not use "fast" alloc+free interface. This is wasteful of memory. In addition, fails in init_iova_rcaches() are not handled safely, which is less than ideal. Make "fast" users call a separate rcache init explicitly, which includes error checking. Signed-off-by: John Garry --- Differences to v1: - Drop stubs for iova_domain_init_rcaches() and iova_domain_free_rcaches() - Use put_iova_domain() in vdpa code diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index d85d54f2b549..b22034975301 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -525,6 +525,7 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, struct iommu_dma_cookie *cookie = domain->iova_cookie; unsigned long order, base_pfn; struct iova_domain *iovad; + int ret; if (!cookie || cookie->type != IOMMU_DMA_IOVA_COOKIE) return -EINVAL; @@ -559,6 +560,9 @@ static int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, } init_iova_domain(iovad, 1UL << order, base_pfn); + ret = iova_domain_init_rcaches(iovad); + if (ret) + return ret; /* If the FQ fails we can simply fall back to strict mode */ if (domain->type == IOMMU_DOMAIN_DMA_FQ && iommu_dma_init_fq(domain)) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index b28c9435b898..7e9c3a97c040 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -15,13 +15,14 @@ /* The anchor node sits above the top of the usable address space */ #define IOVA_ANCHOR~0UL +#define IOVA_RANGE_CACHE_MAX_SIZE 6/* log of max cached IOVA range size (in pages) */ + static bool iova_rcache_insert(struct iova_domain *iovad, unsigned long pfn, unsigned long size); static unsigned long iova_rcache_get(struct iova_domain *iovad, unsigned long size, unsigned long limit_pfn); -static void init_iova_rcaches(struct iova_domain *iovad); static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain 
*iovad); static void free_iova_rcaches(struct iova_domain *iovad); @@ -64,8 +65,6 @@ init_iova_domain(struct iova_domain *iovad, unsigned long granule, iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR; rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node); rb_insert_color(&iovad->anchor.node, &iovad->rbroot); - cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, &iovad->cpuhp_dead); - init_iova_rcaches(iovad); } EXPORT_SYMBOL_GPL(init_iova_domain); @@ -488,6 +487,13 @@ free_iova_fast(struct iova_domain *iovad, unsigned long pfn, unsigned long size) } EXPORT_SYMBOL_GPL(free_iova_fast); +static void iova_domain_free_rcaches(struct iova_domain *iovad) +{ + cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, + &iovad->cpuhp_dead); + free_iova_rcaches(iovad); +} + /** * put_iova_domain - destroys the iova domain * @iovad: - iova domain in question. @@ -497,9 +503,9 @@ void put_iova_domain(struct iova_domain *iovad) { struct iova *iova, *tmp; - cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, - &iovad->cpuhp_dead); - free_iova_rcaches(iovad); + if (iovad->rcaches) + iova_domain_free_rcaches(iovad); + rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node) free_iova_mem(iova); } @@ -608,6 +614,7 @@ EXPORT_SYMBOL_GPL(reserve_iova); */ #define IOVA_MAG_SIZE 128 +#define MAX_GLOBAL_MAGS 32 /* magazines per bin */ struct iova_magazine { unsigned long size; @@ -620,6 +627,13 @@ struct iova_cpu_rcache { struct iova_magazine *prev; }; +struct iova_rcache { + spinlock_t lock; + unsigned long depot_size; + struct iova_magazine *depot[MAX_GLOBAL_MAGS]; + struct iova_cpu_rcache __percpu *cpu_rcaches; +}; + static struct iova_magazine *iova_magazine_alloc(gfp_t flags) { return kzalloc(sizeof(struct iova_magazine), flags); @@ -693,28 +707,54 @@ static void iova_magazine_push(struct iova_magazine *mag, unsigned long pfn) mag->pfns[mag->size++] = pfn; } -static void init_iova_rcaches(struct iova_domain *iovad) +int 
iova_domain_init_rcaches(struct iova_domain *iovad) { - struct iova_cpu_rcache *cpu_rcache; - struct iova_rcache *rcache; unsigned int cpu; - int i; + int i, ret; + + iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE, +sizeof(struct iova_rcache), +
Re: [PATCH] iommu/amd: Fix loop timeout issue in iommu_ga_log_enable()
On 31/01/2022 16:17, Joerg Roedel wrote: From: Joerg Roedel The polling loop for the register change in iommu_ga_log_enable() needs to have a udelay() in it. Otherwise the CPU might be faster than the IOMMU hardware and wrongly trigger the WARN_ON() further down the code stream. Fixes: 8bda0cfbdc1a ("iommu/amd: Detect and initialize guest vAPIC log") Signed-off-by: Joerg Roedel --- drivers/iommu/amd/init.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c index dc338acf3338..d2e09d53851f 100644 --- a/drivers/iommu/amd/init.c +++ b/drivers/iommu/amd/init.c @@ -834,6 +834,7 @@ static int iommu_ga_log_enable(struct amd_iommu *iommu) status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET); if (status & (MMIO_STATUS_GALOG_RUN_MASK)) break; + udelay(1); Maybe readl_relaxed_poll_timeout_atomic() could be used here instead. Thanks, John } if (WARN_ON(i >= LOOP_TIMEOUT))
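The pattern under discussion — polling a status register until a bit comes up, with a bounded number of attempts and a delay between reads — is what readl_poll_timeout() and its variants encapsulate. A userspace sketch of the same loop, where a countdown stands in for the hardware eventually setting MMIO_STATUS_GALOG_RUN_MASK (names are illustrative, not the kernel API):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Fake "register": reads as 0 a few times, then reports the RUN bit,
 * standing in for the IOMMU hardware finishing its state change. */
static int reads_until_ready;

static uint32_t toy_readl(void)
{
	if (reads_until_ready > 0) {
		reads_until_ready--;
		return 0;
	}
	return 0x100;	/* stand-in for the RUN bit */
}

/* Bounded polling loop in the readl_poll_timeout() style: success as
 * soon as the condition holds, -ETIMEDOUT once the budget is spent.
 * In kernel code a udelay() would sit between attempts. */
static int toy_poll_timeout(uint32_t (*readval)(void), uint32_t mask,
			    int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		if (readval() & mask)
			return 0;
		/* udelay(1); -- elided in this userspace sketch */
	}
	return -ETIMEDOUT;
}
```

Folding the read, condition, delay, and timeout into one helper is what makes the suggested readl_relaxed_poll_timeout_atomic() attractive over an open-coded loop: the delay can no longer be forgotten.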
Re: [PATCH] iommu/iova: Separate out rcache init
On 26/01/2022 17:00, Robin Murphy wrote: As above, I vote for just forward-declaring the free routine in iova.c and keeping it entirely private. BTW, speaking of forward declarations, it's possible to remove all the forward declarations in iova.c now that the FQ code is gone - but with a good bit of rearranging. However, I am not sure how much people care about that or whether the code layout is sane...