On Fri, Mar 20, 2020 at 10:41:44AM +0000, John Garry wrote:
> On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
> > On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> > > Hi Will,
> > >
> > > > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > > > And for the overall system, we have:
> > > > >
> > > > >   PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> > > > > --------------------------------------------------------------------------------------------------------------------------
> > > > >
> > > > >     27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> > > > >     11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > > > >      6.35%  [kernel]  [k] _raw_spin_unlock_irq
> > > > >      2.65%  [kernel]  [k] get_user_pages_fast
> > > > >      2.03%  [kernel]  [k] __slab_free
> > > > >      1.55%  [kernel]  [k] tick_nohz_idle_exit
> > > > >      1.47%  [kernel]  [k] arm_lpae_map
> > > > >      1.39%  [kernel]  [k] __fget
> > > > >      1.14%  [kernel]  [k] __lock_text_start
> > > > >      1.09%  [kernel]  [k] _raw_spin_lock
> > > > >      1.08%  [kernel]  [k] bio_release_pages.part.42
> > > > >      1.03%  [kernel]  [k] __sbitmap_get_word
> > > > >      0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> > > > >      0.91%  [kernel]  [k] fput_many
> > > > >      0.88%  [kernel]  [k] __arm_lpae_map
> > > > >
> > > > > One thing to note is that we still spend an appreciable amount of
> > > > > time in arm_smmu_atc_inv_domain(), which is disappointing when
> > > > > considering it should effectively be a noop.
> > > > >
> > > > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the
> > > > > testing our batch size is 1, so we're not seeing the real benefit
> > > > > of the batching. I can't help but think that we could improve this
> > > > > code to try to combine CMD_SYNCs for small batches.
> > > > >
> > > > > Anyway, let me know your thoughts or any questions. I'll have a
> > > > > look for other possible bottlenecks if I get a chance.
> > > >
> > > > Did you ever get any more information on this? I don't have any
> > > > SMMUv3 hardware any more, so I can't really dig into this myself.
> > >
> > > I'm only getting back to look at this now, as SMMU performance is a
> > > bit of a hot topic again for us.
> > >
> > > So one thing we are doing which looks to help performance is this
> > > series from Marc:
> > >
> > > https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
> > >
> > > That is just spreading the per-CPU load for NVMe interrupt handling
> > > (where the DMA unmapping is happening), so I'd say it's really just
> > > side-stepping any SMMU issue.
> > >
> > > Going back to the SMMU, I wanted to run eBPF and perf annotate to
> > > help profile this, but was having no luck getting them to work
> > > properly. I'll look at this again now.
> >
> > Could you also try with the upcoming ATS changes currently in Will's
> > tree? They won't improve your numbers, but it'd be good to check that
> > they don't make things worse.
>
> I can do when I get a chance.
>
> > I've run a bunch of netperf instances on multiple cores and collected
> > SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
> > consistently.
> >
> >   - 6.07% arm_smmu_iotlb_sync
> >      - 5.74% arm_smmu_tlb_inv_range
> >           5.09% arm_smmu_cmdq_issue_cmdlist
> >           0.28% __pi_memset
> >           0.08% __pi_memcpy
> >           0.08% arm_smmu_atc_inv_domain.constprop.37
> >           0.07% arm_smmu_cmdq_build_cmd
> >           0.01% arm_smmu_cmdq_batch_add
> >        0.31% __pi_memset
> >
> > So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync()
> > when ATS is not used. According to the annotations, the load from the
> > atomic_read() that checks whether the domain uses ATS accounts for 77%
> > of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so
> > I'm not sure there is much room for optimization there.
>
> Well, I did originally suggest using RCU protection to scan the list of
> devices, instead of reading an atomic and checking for a non-zero value.
> But that would be an optimisation for ATS as well, and there were no ATS
> devices at the time (to verify performance).
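For reference, the check being profiled here and the RCU-based scan John
suggests have roughly the following shapes. This is a simplified sketch,
not the actual drivers/iommu/arm-smmu-v3.c code: the "_sketch" types are
invented, the struct layouts are abbreviated, and command building is
stubbed out.

#include <linux/atomic.h>
#include <linux/rculist.h>
#include <linux/spinlock.h>

struct arm_smmu_master_sketch {
	struct list_head	domain_head;
	bool			ats_enabled;
};

struct arm_smmu_domain_sketch {
	atomic_t		nr_ats_masters;	/* bumped when ATS is enabled */
	struct list_head	devices;	/* every master in the domain */
	spinlock_t		devices_lock;
};

/* Current shape: a single atomic load decides the no-ATS fast path. */
static int atc_inv_domain_atomic(struct arm_smmu_domain_sketch *d)
{
	if (!atomic_read(&d->nr_ats_masters))
		return 0;	/* no ATS masters, nothing to invalidate */

	/* ... take devices_lock, build and submit ATC invalidations ... */
	return 0;
}

/* John's suggestion, roughly: an RCU-protected scan of the masters. */
static int atc_inv_domain_rcu(struct arm_smmu_domain_sketch *d)
{
	struct arm_smmu_master_sketch *master;

	rcu_read_lock();
	list_for_each_entry_rcu(master, &d->devices, domain_head) {
		if (master->ats_enabled) {
			/* ... queue an ATC invalidation for this master ... */
		}
	}
	rcu_read_unlock();
	return 0;
}

The trade-off discussed below is that the atomic_read() costs a single
load (and, per the annotations above, the occasional cache miss) on the
no-ATS fast path, while the RCU walk would only start to pay off once ATS
masters actually exist.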
Heh, I have yet to get my hands on an ATS device. Currently I can't
evaluate ATS performance, but I agree that using RCU to scan the list
should give better results when ATS is in use.

When ATS isn't in use, however, I suspect that reading nr_ats_masters is
more efficient than taking the RCU read lock and walking an "ats_devices"
list (since the smmu_domain->devices list also serves context descriptor
invalidation, even when ATS isn't in use). I'll run some tests, though, to
see whether I can micro-optimize this case, but I don't expect noticeable
improvements.

Thanks,
Jean
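Going back to the CMD_SYNC batching point from John's January mail quoted
above: the coalescing idea can be modelled with a standalone toy like the
one below. All names here are hypothetical and this is not the driver's
code (the real batching helpers, such as the arm_smmu_cmdq_batch_add()
visible in the profile, operate on the hardware command queue); the sketch
only illustrates why N invalidations followed by a single CMD_SYNC beat N
invalidation+sync pairs when syncs are expensive.

#include <stddef.h>
#include <stdio.h>

#define BATCH_MAX 64

struct inv_cmd {
	unsigned long	iova;
	size_t		size;
};

struct cmd_batch {
	struct inv_cmd	cmds[BATCH_MAX];
	int		num;
};

static void issue_cmd(const struct inv_cmd *cmd)
{
	printf("CMD_TLBI iova=%#lx size=%zu\n", cmd->iova, cmd->size);
}

static void issue_sync(void)
{
	/* Stands in for the expensive, ordering CMD_SYNC on the real queue. */
	printf("CMD_SYNC\n");
}

static void batch_flush(struct cmd_batch *b)
{
	for (int i = 0; i < b->num; i++)
		issue_cmd(&b->cmds[i]);
	if (b->num)
		issue_sync();	/* one sync per batch, not one per command */
	b->num = 0;
}

static void batch_add(struct cmd_batch *b, unsigned long iova, size_t size)
{
	if (b->num == BATCH_MAX)
		batch_flush(b);
	b->cmds[b->num++] = (struct inv_cmd){ .iova = iova, .size = size };
}

int main(void)
{
	struct cmd_batch b = { .num = 0 };

	/* Three small unmaps end up sharing a single CMD_SYNC. */
	batch_add(&b, 0x1000, 0x1000);
	batch_add(&b, 0x5000, 0x2000);
	batch_add(&b, 0x9000, 0x1000);
	batch_flush(&b);
	return 0;
}

With a batch size of 1, as in John's NVMe test, a single submitter has
nothing to coalesce; one reading of his suggestion is to let concurrent
small batches from different CPUs share one CMD_SYNC, which this
single-threaded toy does not attempt to show.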