On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> Hi Will,
> 
> > 
> > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > And for the overall system, we have:
> > > 
> > >    PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> > > --------------------------------------------------------------------------------------------------------------------------
> > > 
> > >      27.43%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
> > >      11.71%  [kernel]          [k] _raw_spin_unlock_irqrestore
> > >       6.35%  [kernel]          [k] _raw_spin_unlock_irq
> > >       2.65%  [kernel]          [k] get_user_pages_fast
> > >       2.03%  [kernel]          [k] __slab_free
> > >       1.55%  [kernel]          [k] tick_nohz_idle_exit
> > >       1.47%  [kernel]          [k] arm_lpae_map
> > >       1.39%  [kernel]          [k] __fget
> > >       1.14%  [kernel]          [k] __lock_text_start
> > >       1.09%  [kernel]          [k] _raw_spin_lock
> > >       1.08%  [kernel]          [k] bio_release_pages.part.42
> > >       1.03%  [kernel]          [k] __sbitmap_get_word
> > >       0.97%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.42
> > >       0.91%  [kernel]          [k] fput_many
> > >       0.88%  [kernel]          [k] __arm_lpae_map
> > > 
> > > One thing to note is that we still spend an appreciable amount of time
> > > in arm_smmu_atc_inv_domain(), which is disappointing considering it
> > > should effectively be a no-op.
> > > 
> > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during this
> > > testing our batch size is 1, so we're not seeing the real benefit of
> > > the batching. I can't help but think that we could improve this code
> > > to combine CMD_SYNCs for small batches.
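> > > 
> > > Roughly what I mean (made-up names and a much-simplified queue, not
> > > the actual cmdq code): a completed CMD_SYNC at queue position p
> > > guarantees everything up to p has been consumed, so a small batch can
> > > skip posting its own sync whenever a concurrent batch's sync already
> > > covers it:
> > > 
> > > #include <stdatomic.h>
> > > 
> > > struct cmdq {
> > > 	atomic_long prod;      /* next free slot in the queue */
> > > 	atomic_long synced_to; /* highest slot covered by a CMD_SYNC */
> > > };
> > > 
> > > static void issue_batch(struct cmdq *q, long n_cmds)
> > > {
> > > 	/* Reserve slots; last is the index of our final command. */
> > > 	long last = atomic_fetch_add(&q->prod, n_cmds) + n_cmds - 1;
> > > 
> > > 	/* ... write the n_cmds commands into their slots ... */
> > > 
> > > 	if (atomic_load(&q->synced_to) >= last)
> > > 		return; /* a concurrent CMD_SYNC already covered us */
> > > 
> > > 	/* ... post our own CMD_SYNC and wait for completion ... */
> > > 
> > > 	long seen = atomic_load(&q->synced_to);
> > > 	while (seen < last &&
> > > 	       !atomic_compare_exchange_weak(&q->synced_to, &seen, last))
> > > 		; /* advance synced_to monotonically */
> > > }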
> > > 
> > > Anyway, let me know your thoughts or any questions. I'll have a look
> > > for other possible bottlenecks if I get a chance.
> > 
> > Did you ever get any more information on this? I don't have any SMMUv3
> > hardware any more, so I can't really dig into this myself.
> 
> I'm only getting back to look at this now, as SMMU performance is a bit of a
> hot topic again for us.
> 
> So one thing we are doing that looks to help performance is this series
> from Marc:
> 
> https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
> 
> That series just spreads the per-CPU load for NVMe interrupt handling
> (where the DMA unmapping happens), so I'd say it really just side-steps
> any SMMU issue.
> 
> Going back to the SMMU, I wanted to run eBPF and perf annotate to help
> profile this, but was having no luck getting them to work properly. I'll
> look at this again now.

Could you also try the upcoming ATS changes currently in Will's tree?
They won't improve your numbers, but it'd be good to check that they
don't make things worse.

I've run a bunch of netperf instances on multiple cores and collected
SMMU usage (on a TaiShan 2280). I'm getting the following breakdown
pretty consistently.

- 6.07% arm_smmu_iotlb_sync
   - 5.74% arm_smmu_tlb_inv_range
        5.09% arm_smmu_cmdq_issue_cmdlist
        0.28% __pi_memset
        0.08% __pi_memcpy
        0.08% arm_smmu_atc_inv_domain.constprop.37
        0.07% arm_smmu_cmdq_build_cmd
        0.01% arm_smmu_cmdq_batch_add
     0.31% __pi_memset

So arm_smmu_atc_inv_domain() accounts for about 1.4% of
arm_smmu_iotlb_sync() when ATS is not used. According to the
annotations, the load from the atomic_read() that checks whether the
domain uses ATS is 77% of the samples in arm_smmu_atc_inv_domain()
(265 of 345 samples), so I'm not sure there is much room for
optimization there.
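
For reference, the fast path in question is essentially just the
following (a simplified sketch modelled on the ATS rework, with
illustrative names, not the exact driver code):

#include <stdatomic.h>

/* The domain keeps a count of ATS-enabled endpoints; invalidation
 * returns immediately when it is zero. */
struct smmu_domain {
	atomic_int nr_ats_masters;
};

static int atc_inv_domain(struct smmu_domain *d)
{
	/* This load is where the samples land: even with ATS unused,
	 * every unmap fetches the (often cache-cold) counter before
	 * taking the early return. */
	if (!atomic_load(&d->nr_ats_masters))
		return 0;

	/* ... build and issue ATC invalidation commands ... */
	return 0;
}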

Thanks,
Jean