> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of John Garry
> Sent: Saturday, August 22, 2020 1:54 AM
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Linuxarm
> <[email protected]>; [email protected]; John Garry
> <[email protected]>
> Subject: [PATCH v2 0/2] iommu/arm-smmu-v3: Improve cmdq lock efficiency
>
> As mentioned in [0], the CPU may consume many cycles processing
> arm_smmu_cmdq_issue_cmdlist(). One issue we find is the cmpxchg() loop to
> get space on the queue takes a lot of time once we start getting many CPUs
> contending - from experiment, for 64 CPUs contending the cmdq, success rate
> is ~ 1 in 12, which is poor, but not totally awful.
>
> This series removes that cmpxchg() and replaces with an atomic_add, same as
> how the actual cmdq deals with maintaining the prod pointer.
>
> For my NVMe test with 3x NVMe SSDs, I'm getting a ~24% throughput
> increase:
> Before: 1250K IOPs
> After: 1550K IOPs
>
> I also have a test harness to check the rate of DMA map+unmaps we can
> achieve:
>
> CPU count    8       16      32      64
> Before:      282K    115K    36K     11K
> After:       302K    193K    80K     30K
>
> (unit is map+unmaps per CPU per second)

I have seen a performance improvement on an hns3 network device when
sending UDP with 1-32 threads:

Threads number            1         4          8          16         32
Before patch (TX Mbps)    7636.05   16444.36   21694.48   25746.40   25295.93
After patch (TX Mbps)     7711.60   16478.98   26561.06   32628.75   33764.56

As you can see, for 8, 16 and 32 threads, network TX throughput improves
significantly (by roughly 22%, 27% and 33% respectively). For 1 and 4
threads, TX throughput is almost the same before and after the patch.
This makes sense, as the patch is mainly aimed at decreasing lock
contention.
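
To illustrate why contention matters here, below is a minimal user-space
sketch using C11 atomics that contrasts the two ways of reserving queue
slots. It is not the driver code: struct toy_cmdq, reserve_cmpxchg() and
reserve_fetch_add() are hypothetical names, and the sketch ignores
queue-full checks and index wrapping, which the real driver has to handle.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a command queue's producer index. */
struct toy_cmdq {
	_Atomic uint32_t prod;
};

/*
 * Scheme A: compare-and-swap loop.
 * Every CPU that loses the race must retry, so the number of wasted
 * attempts grows with the number of contending CPUs.
 * Returns the first reserved slot; *retries counts failed attempts.
 */
static uint32_t reserve_cmpxchg(struct toy_cmdq *q, uint32_t n,
				unsigned int *retries)
{
	uint32_t old = atomic_load_explicit(&q->prod, memory_order_relaxed);

	/* On failure, 'old' is refreshed with the current value of prod. */
	while (!atomic_compare_exchange_weak_explicit(&q->prod, &old, old + n,
						      memory_order_relaxed,
						      memory_order_relaxed)) {
		if (retries)
			(*retries)++;
	}
	return old;
}

/*
 * Scheme B: unconditional fetch-and-add.
 * Each caller gets a distinct range in a single atomic operation,
 * with no retry loop. Returns the first reserved slot.
 */
static uint32_t reserve_fetch_add(struct toy_cmdq *q, uint32_t n)
{
	return atomic_fetch_add_explicit(&q->prod, n, memory_order_relaxed);
}
```

Under these assumptions, both helpers hand out non-overlapping slot
ranges; the difference is only that scheme B cannot fail and retry, which
is where I would expect the gain at higher thread counts to come from.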
>
> [0]
> https://lore.kernel.org/linux-iommu/B926444035E5E2439431908E3842AFD2
> [email protected]/T/#ma02e301c38c3e94b7725e
> 685757c27e39c7cbde3
>
> Differences to v1:
> - Simplify by dropping patch to always issue a CMD_SYNC
> - Use 64b atomic add, keeping prod in a separate 32b field
>
> John Garry (2):
>   iommu/arm-smmu-v3: Calculate max commands per batch
>   iommu/arm-smmu-v3: Remove cmpxchg() in
>     arm_smmu_cmdq_issue_cmdlist()
>
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 166
> ++++++++++++++------
>  1 file changed, 114 insertions(+), 52 deletions(-)
>
> --
> 2.26.2

Thanks
Barry

