Re: [PATCH 00/13] Rework IOMMU API to allow for batching of invalidation
On 15/08/2019 14:55, Will Deacon wrote:
> On Thu, Aug 15, 2019 at 12:19:58PM +0100, John Garry wrote:
> > On 14/08/2019 18:56, Will Deacon wrote:
> > > If you'd like to play with the patches, then I've also pushed them here:
> > >
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap
> > >
> > > but they should behave as a no-op on their own.
> >
> > As anticipated, my storage testing scenarios roughly give parity
> > throughput and CPU loading before and after this series.
> >
> > > Patches to convert the Arm SMMUv3 driver to the new API are here:
> > >
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq
> >
> > I quickly tested this again and now I see a performance lift:
> >
> >                     before (5.3-rc1)   after
> > D05 8x SAS disks    907K IOPS          970K IOPS
> > D05 1x NVMe         450K IOPS          466K IOPS
> > D06 1x NVMe         467K IOPS          466K IOPS
> >
> > The CPU loading seems to track throughput, so nothing much to say there.
> >
> > Note: From 5.2 testing, I was seeing >900K IOPS from that NVMe disk for
> > !IOMMU.
> >
> > Cheers,
> > John
>
> For interest, how do things look if you pass iommu.strict=0? That might
> give some indication about how much the invalidation is still hurting us.

So I tested iommu/cmdq for NVMe only, and I see:

            !SMMU        5.3-rc4 strict/!strict   cmdq strict/!strict
D05 NVMe    750K IOPS    456K/540K IOPS           466K/537K
D06 NVMe    750K IOPS    456K/740K IOPS           466K/745K

I don't know why the D06 iommu.strict performance is about the same as
D05's, while !strict is so much better. The D06 SMMU implementation is
supposed to be generally much better than that of D05, so I would have
thought that its strict performance would also be better.

> > BTW, what were your thoughts on changing
> > arm_smmu_atc_inv_domain()->arm_smmu_atc_inv_master() to batching? It
> > seems suitable, but looks untouched. Were you waiting for a resolution
> > to the performance issue which Leizhen reported?
>
> In principle, I'm supportive of such a change, but I'm not currently able
> to test any ATS stuff so somebody else would need to write the patch.
> Jean-Philippe is on holiday at the moment, but I'd be happy to review
> something from you if you send it out.

Unfortunately I don't have anything ATS-enabled either. Not many do, it
seems.

Cheers,
John

> Will
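On the ATC batching question raised above, the potential gain is the same one
the cmdq work targets elsewhere: queue the per-device ATC invalidation
commands back to back and pay for a single sync, rather than waiting once per
device. The sketch below is purely conceptual; the types and functions
(struct cmd, queue_cmd, queue_sync) are hypothetical stand-ins rather than
the arm-smmu-v3 driver internals, and printfs stand in for real command
submission.

#include <stdio.h>

/* Hypothetical stand-ins for a command queue and an ATC invalidation cmd. */
struct cmd { int devid; };

static void queue_cmd(struct cmd c) { printf("CMD_ATC_INV dev=%d\n", c.devid); }
static void queue_sync(void)        { printf("CMD_SYNC (wait for completion)\n"); }

/* Unbatched: one sync per device, so N devices cost N waits. */
static void inv_domain_unbatched(const int *devids, int n)
{
	for (int i = 0; i < n; i++) {
		queue_cmd((struct cmd){ .devid = devids[i] });
		queue_sync();
	}
}

/* Batched: queue every device's invalidation, then wait once. */
static void inv_domain_batched(const int *devids, int n)
{
	for (int i = 0; i < n; i++)
		queue_cmd((struct cmd){ .devid = devids[i] });
	queue_sync();
}

int main(void)
{
	int devids[] = { 1, 2, 3 };

	puts("unbatched:");
	inv_domain_unbatched(devids, 3);
	puts("batched:");
	inv_domain_batched(devids, 3);
	return 0;
}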
Re: [PATCH 00/13] Rework IOMMU API to allow for batching of invalidation
On Thu, Aug 15, 2019 at 12:19:58PM +0100, John Garry wrote:
> On 14/08/2019 18:56, Will Deacon wrote:
> > If you'd like to play with the patches, then I've also pushed them here:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap
> >
> > but they should behave as a no-op on their own.
>
> As anticipated, my storage testing scenarios roughly give parity throughput
> and CPU loading before and after this series.
>
> > Patches to convert the Arm SMMUv3 driver to the new API are here:
> >
> >   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq
>
> I quickly tested this again and now I see a performance lift:
>
>                     before (5.3-rc1)   after
> D05 8x SAS disks    907K IOPS          970K IOPS
> D05 1x NVMe         450K IOPS          466K IOPS
> D06 1x NVMe         467K IOPS          466K IOPS
>
> The CPU loading seems to track throughput, so nothing much to say there.
>
> Note: From 5.2 testing, I was seeing >900K IOPS from that NVMe disk for
> !IOMMU.
>
> Cheers,
> John

For interest, how do things look if you pass iommu.strict=0? That might give
some indication about how much the invalidation is still hurting us.

> BTW, what were your thoughts on changing
> arm_smmu_atc_inv_domain()->arm_smmu_atc_inv_master() to batching? It seems
> suitable, but looks untouched. Were you waiting for a resolution to the
> performance issue which Leizhen reported?

In principle, I'm supportive of such a change, but I'm not currently able to
test any ATS stuff so somebody else would need to write the patch.
Jean-Philippe is on holiday at the moment, but I'd be happy to review
something from you if you send it out.

Will
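For readers following along, iommu.strict= selects between synchronous and
deferred ("lazy") TLB invalidation on unmap. The sketch below is only a
conceptual model of that difference, not the dma-iommu flush-queue code; the
function names, queue size and printfs are made up for illustration.

#include <stdio.h>

#define FQ_SIZE 4	/* flush after this many deferred unmaps (arbitrary) */

/* strict (iommu.strict=1): invalidate and wait on every unmap */
static void unmap_strict(unsigned long iova)
{
	printf("unmap 0x%lx\n", iova);
	printf("  TLB invalidate + sync\n");
	/* the IOVA may be reused immediately */
}

/* lazy (iommu.strict=0): defer the invalidation and batch it */
static unsigned long fq[FQ_SIZE];
static int fq_len;

static void fq_flush(void)
{
	if (!fq_len)
		return;
	/* one invalidation + sync covers the whole batch */
	printf("  TLB invalidate + sync (%d deferred IOVAs, first 0x%lx)\n",
	       fq_len, fq[0]);
	fq_len = 0;	/* only now can the queued IOVAs be reused */
}

static void unmap_lazy(unsigned long iova)
{
	printf("unmap 0x%lx (invalidation deferred)\n", iova);
	fq[fq_len++] = iova;
	if (fq_len == FQ_SIZE)
		fq_flush();
}

int main(void)
{
	puts("strict:");
	for (unsigned long i = 0; i < 4; i++)
		unmap_strict(0x1000 * (i + 1));

	puts("lazy:");
	for (unsigned long i = 0; i < 4; i++)
		unmap_lazy(0x1000 * (i + 1));
	fq_flush();	/* in the kernel, a timer also drains the queue */
	return 0;
}

With iommu.strict=0 the invalidation cost is amortised over a batch of
unmaps, so comparing the two modes gives a rough upper bound on how much the
invalidation path is still costing.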
Re: [PATCH 00/13] Rework IOMMU API to allow for batching of invalidation
On 14/08/2019 18:56, Will Deacon wrote:
> Hi everybody,
>
> These are the core IOMMU changes that I have posted previously as part of
> my ongoing effort to reduce the lock contention of the SMMUv3 command
> queue. I thought it would be better to split this out as a separate
> series, since I think it's ready to go and all the driver conversions
> mean that it's quite a pain for me to maintain out of tree!
>
> The idea of the patch series is to allow TLB invalidation to be batched
> up into a new 'struct iommu_iotlb_gather' structure, which tracks the
> properties of the virtual address range being invalidated so that it can
> be deferred until the driver's ->iotlb_sync() function is called. This
> allows for more efficient invalidation on hardware that can submit
> multiple invalidations in one go.
>
> The previous series was included in:
>
>   https://lkml.kernel.org/r/20190711171927.28803-1-w...@kernel.org
>
> The only real change since then is incorporating the newly merged
> virtio-iommu driver.
>
> If you'd like to play with the patches, then I've also pushed them here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap
>
> but they should behave as a no-op on their own.

Hi Will,

As anticipated, my storage testing scenarios roughly give parity throughput
and CPU loading before and after this series.

> Patches to convert the Arm SMMUv3 driver to the new API are here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq

I quickly tested this again and now I see a performance lift:

                    before (5.3-rc1)   after
D05 8x SAS disks    907K IOPS          970K IOPS
D05 1x NVMe         450K IOPS          466K IOPS
D06 1x NVMe         467K IOPS          466K IOPS

The CPU loading seems to track throughput, so nothing much to say there.

Note: From 5.2 testing, I was seeing >900K IOPS from that NVMe disk for
!IOMMU.

BTW, what were your thoughts on changing
arm_smmu_atc_inv_domain()->arm_smmu_atc_inv_master() to batching? It seems
suitable, but looks untouched. Were you waiting for a resolution to the
performance issue which Leizhen reported?
Thanks,
John
[PATCH 00/13] Rework IOMMU API to allow for batching of invalidation
Hi everybody,

These are the core IOMMU changes that I have posted previously as part of
my ongoing effort to reduce the lock contention of the SMMUv3 command
queue. I thought it would be better to split this out as a separate series,
since I think it's ready to go and all the driver conversions mean that
it's quite a pain for me to maintain out of tree!

The idea of the patch series is to allow TLB invalidation to be batched up
into a new 'struct iommu_iotlb_gather' structure, which tracks the
properties of the virtual address range being invalidated so that it can
be deferred until the driver's ->iotlb_sync() function is called. This
allows for more efficient invalidation on hardware that can submit
multiple invalidations in one go.

The previous series was included in:

  https://lkml.kernel.org/r/20190711171927.28803-1-w...@kernel.org

The only real change since then is incorporating the newly merged
virtio-iommu driver.

If you'd like to play with the patches, then I've also pushed them here:

  https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/unmap

but they should behave as a no-op on their own. Patches to convert the
Arm SMMUv3 driver to the new API are here:

  https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq

Cheers,

Will

--->8

Cc: Jean-Philippe Brucker
Cc: Robin Murphy
Cc: Jayachandran Chandrasekharan Nair
Cc: Jan Glauber
Cc: Jon Masters
Cc: Eric Auger
Cc: Zhen Lei
Cc: Jonathan Cameron
Cc: Vijay Kilary
Cc: Joerg Roedel
Cc: John Garry
Cc: Alex Williamson
Cc: Marek Szyprowski
Cc: David Woodhouse

Will Deacon (13):
  iommu: Remove empty iommu_tlb_range_add() callback from iommu_ops
  iommu/io-pgtable-arm: Remove redundant call to io_pgtable_tlb_sync()
  iommu/io-pgtable: Rename iommu_gather_ops to iommu_flush_ops
  iommu: Introduce struct iommu_iotlb_gather for batching TLB flushes
  iommu: Introduce iommu_iotlb_gather_add_page()
  iommu: Pass struct iommu_iotlb_gather to ->unmap() and ->iotlb_sync()
  iommu/io-pgtable: Introduce tlb_flush_walk() and tlb_flush_leaf()
  iommu/io-pgtable: Hook up ->tlb_flush_walk() and ->tlb_flush_leaf() in drivers
  iommu/io-pgtable-arm: Call ->tlb_flush_walk() and ->tlb_flush_leaf()
  iommu/io-pgtable: Replace ->tlb_add_flush() with ->tlb_add_page()
  iommu/io-pgtable: Remove unused ->tlb_sync() callback
  iommu/io-pgtable: Pass struct iommu_iotlb_gather to ->unmap()
  iommu/io-pgtable: Pass struct iommu_iotlb_gather to ->tlb_add_page()

 drivers/gpu/drm/panfrost/panfrost_mmu.c |  24 +---
 drivers/iommu/amd_iommu.c               |  11 ++--
 drivers/iommu/arm-smmu-v3.c             |  52 +++-
 drivers/iommu/arm-smmu.c                | 103
 drivers/iommu/dma-iommu.c               |   9 ++-
 drivers/iommu/exynos-iommu.c            |   3 +-
 drivers/iommu/intel-iommu.c             |   3 +-
 drivers/iommu/io-pgtable-arm-v7s.c      |  57 +-
 drivers/iommu/io-pgtable-arm.c          |  48 ---
 drivers/iommu/iommu.c                   |  24
 drivers/iommu/ipmmu-vmsa.c              |  28 +
 drivers/iommu/msm_iommu.c               |  42 +
 drivers/iommu/mtk_iommu.c               |  45 +++---
 drivers/iommu/mtk_iommu_v1.c            |   3 +-
 drivers/iommu/omap-iommu.c              |   2 +-
 drivers/iommu/qcom_iommu.c              |  44 +++---
 drivers/iommu/rockchip-iommu.c          |   2 +-
 drivers/iommu/s390-iommu.c              |   3 +-
 drivers/iommu/tegra-gart.c              |  12 +++-
 drivers/iommu/tegra-smmu.c              |   2 +-
 drivers/iommu/virtio-iommu.c            |   5 +-
 drivers/vfio/vfio_iommu_type1.c         |  27 +
 include/linux/io-pgtable.h              |  57 --
 include/linux/iommu.h                   |  92 +---
 24 files changed, 483 insertions(+), 215 deletions(-)

-- 
2.11.0
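To make the batching idea in the cover letter concrete, here is a minimal,
self-contained C sketch of the gather-then-sync pattern. It is not the code
from the series: the structure layout, the merging policy and the helper
names (gather_init, gather_add_page, iotlb_sync) are simplified stand-ins for
illustration, with a printf taking the place of the driver's real
invalidation.

#include <stddef.h>
#include <stdio.h>

/* Accumulates the range touched by a batch of unmaps (simplified sketch). */
struct iommu_iotlb_gather {
	unsigned long start;
	unsigned long end;	/* inclusive end of the gathered range */
	size_t pgsize;		/* page size of the gathered entries */
};

static void gather_init(struct iommu_iotlb_gather *g)
{
	g->start = ~0UL;
	g->end = 0;
	g->pgsize = 0;
}

/* Stand-in for the driver's ->iotlb_sync(): one invalidation for the batch. */
static void iotlb_sync(struct iommu_iotlb_gather *g)
{
	if (!g->pgsize)
		return;		/* nothing gathered */
	printf("invalidate [0x%lx, 0x%lx] at granule %zu\n",
	       g->start, g->end, g->pgsize);
	gather_init(g);
}

/*
 * Record one unmapped page. If it is disjoint from (or a different size
 * than) what has been gathered so far, flush first so the range stays
 * contiguous; otherwise just grow the range.
 */
static void gather_add_page(struct iommu_iotlb_gather *g,
			    unsigned long iova, size_t size)
{
	unsigned long start = iova, end = iova + size - 1;

	if (g->pgsize && (g->pgsize != size ||
			  end + 1 < g->start || start > g->end + 1))
		iotlb_sync(g);

	g->pgsize = size;
	if (start < g->start)
		g->start = start;
	if (end > g->end)
		g->end = end;
}

int main(void)
{
	struct iommu_iotlb_gather g;

	gather_init(&g);
	/* Three contiguous 4K pages: gathered into a single invalidation. */
	gather_add_page(&g, 0x10000, 0x1000);
	gather_add_page(&g, 0x11000, 0x1000);
	gather_add_page(&g, 0x12000, 0x1000);
	/* A disjoint page forces a flush of the previous range first. */
	gather_add_page(&g, 0x80000, 0x1000);
	iotlb_sync(&g);		/* caller syncs once at the end of the unmap */
	return 0;
}

The point of the pattern is that contiguous unmaps collapse into a single
invalidation, and a sync is only issued when the gathered range would
otherwise become discontiguous or when the unmap operation completes.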