On Fri, Mar 20, 2020 at 10:41:44AM +0000, John Garry wrote:
> On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
> > On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> > > Hi Will,
> > >
> > > > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > > > And for the overall system, we have:
> > > > >
> > > > >   PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> > > > > --------------------------------------------------------------------------------------------------------------------------
> > > > >
> > > > >     27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> > > > >     11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > > > >      6.35%  [kernel]  [k] _raw_spin_unlock_irq
> > > > >      2.65%  [kernel]  [k] get_user_pages_fast
> > > > >      2.03%  [kernel]  [k] __slab_free
> > > > >      1.55%  [kernel]  [k] tick_nohz_idle_exit
> > > > >      1.47%  [kernel]  [k] arm_lpae_map
> > > > >      1.39%  [kernel]  [k] __fget
> > > > >      1.14%  [kernel]  [k] __lock_text_start
> > > > >      1.09%  [kernel]  [k] _raw_spin_lock
> > > > >      1.08%  [kernel]  [k] bio_release_pages.part.42
> > > > >      1.03%  [kernel]  [k] __sbitmap_get_word
> > > > >      0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> > > > >      0.91%  [kernel]  [k] fput_many
> > > > >      0.88%  [kernel]  [k] __arm_lpae_map
> > > > >
> > > > > One thing to note is that we still spend an appreciable amount of
> > > > > time in arm_smmu_atc_inv_domain(), which is disappointing when
> > > > > considering it should effectively be a noop.
> > > > >
> > > > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the
> > > > > testing our batch size is 1, so we're not seeing the real benefit
> > > > > of the batching. I can't help but think that we could improve this
> > > > > code to try to combine CMD_SYNCs for small batches.
> > > > >
> > > > > Anyway, let me know your thoughts or any questions. I'll have a
> > > > > look for other possible bottlenecks if I get a chance.
> > > >
> > > > Did you ever get any more information on this? I don't have any
> > > > SMMUv3 hardware any more, so I can't really dig into this myself.
> > >
> > > I'm only getting back to look at this now, as SMMU performance is a
> > > bit of a hot topic again for us.
> > >
> > > So one thing we are doing which looks to help performance is this
> > > series from Marc:
> > >
> > > https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
> > >
> > > That is just spreading the per-CPU load for NVMe interrupt handling
> > > (where the DMA unmapping is happening), so I'd say it's really just
> > > side-stepping any SMMU issue.
> > >
> > > Going back to the SMMU, I wanted to run eBPF and perf annotate to
> > > help profile this, but was having no luck getting them to work
> > > properly. I'll look at this again now.
> >
> > Could you also try with the upcoming ATS changes currently in Will's
> > tree? They won't improve your numbers, but it'd be good to check that
> > they don't make things worse.
>
> I can do when I get a chance.
>
> > I've run a bunch of netperf instances on multiple cores and collected
> > SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
> > consistently.
> >
> >   - 6.07% arm_smmu_iotlb_sync
> >      - 5.74% arm_smmu_tlb_inv_range
> >           5.09% arm_smmu_cmdq_issue_cmdlist
> >           0.28% __pi_memset
> >           0.08% __pi_memcpy
> >           0.08% arm_smmu_atc_inv_domain.constprop.37
> >           0.07% arm_smmu_cmdq_build_cmd
> >           0.01% arm_smmu_cmdq_batch_add
> >        0.31% __pi_memset
> >
> > So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync()
> > when ATS is not used. According to the annotations, the load from the
> > atomic_read() that checks whether the domain uses ATS accounts for 77%
> > of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so
> > I'm not sure there is much room for optimization there.
>
> Well, I did originally suggest using RCU protection to scan the list of
> devices, instead of reading an atomic and checking for a non-zero value.
> But that would be an optimisation for ATS as well, and there were no ATS
> devices at the time (to verify performance).
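For reference, the check being profiled here and the RCU-based scan John
suggests have roughly the following shapes. This is a simplified sketch,
not the actual drivers/iommu/arm-smmu-v3.c code: the "_sketch" types are
invented, the struct layouts are abbreviated, and command building is
stubbed out.

#include <linux/atomic.h>
#include <linux/rculist.h>
#include <linux/spinlock.h>

struct arm_smmu_master_sketch {
	struct list_head	domain_head;
	bool			ats_enabled;
};

struct arm_smmu_domain_sketch {
	atomic_t		nr_ats_masters;	/* bumped when ATS is enabled */
	struct list_head	devices;	/* every master in the domain */
	spinlock_t		devices_lock;
};

/* Current shape: a single atomic load decides the no-ATS fast path. */
static int atc_inv_domain_atomic(struct arm_smmu_domain_sketch *d)
{
	if (!atomic_read(&d->nr_ats_masters))
		return 0;	/* no ATS masters, nothing to invalidate */

	/* ... take devices_lock, build and submit ATC invalidations ... */
	return 0;
}

/* John's suggestion, roughly: an RCU-protected scan of the masters. */
static int atc_inv_domain_rcu(struct arm_smmu_domain_sketch *d)
{
	struct arm_smmu_master_sketch *master;

	rcu_read_lock();
	list_for_each_entry_rcu(master, &d->devices, domain_head) {
		if (master->ats_enabled) {
			/* ... queue an ATC invalidation for this master ... */
		}
	}
	rcu_read_unlock();
	return 0;
}

The trade-off discussed below is that the atomic_read() costs a single
load (and, per the annotations above, the occasional cache miss) on the
no-ATS fast path, while the RCU walk would only start to pay off once ATS
masters actually exist.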
Heh, I have yet to get my hands on an ATS device. Currently I can't
evaluate ATS performance, but I agree that using RCU to scan the list
should give better results when ATS is in use.

When ATS isn't in use, however, I suspect that reading nr_ats_masters is
more efficient than taking the RCU read lock and walking an "ats_devices"
list (since the smmu_domain->devices list also serves context descriptor
invalidation, even when ATS isn't in use). I'll run some tests, though, to
see whether I can micro-optimize this case, but I don't expect noticeable
improvements.

Thanks,
Jean
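Going back to the CMD_SYNC batching point from John's January mail quoted
above: the coalescing idea can be modelled with a standalone toy like the
one below. All names here are hypothetical and this is not the driver's
code (the real batching helpers, such as the arm_smmu_cmdq_batch_add()
visible in the profile, operate on the hardware command queue); the sketch
only illustrates why N invalidations followed by a single CMD_SYNC beat N
invalidation+sync pairs when syncs are expensive.

#include <stddef.h>
#include <stdio.h>

#define BATCH_MAX 64

struct inv_cmd {
	unsigned long	iova;
	size_t		size;
};

struct cmd_batch {
	struct inv_cmd	cmds[BATCH_MAX];
	int		num;
};

static void issue_cmd(const struct inv_cmd *cmd)
{
	printf("CMD_TLBI iova=%#lx size=%zu\n", cmd->iova, cmd->size);
}

static void issue_sync(void)
{
	/* Stands in for the expensive, ordering CMD_SYNC on the real queue. */
	printf("CMD_SYNC\n");
}

static void batch_flush(struct cmd_batch *b)
{
	for (int i = 0; i < b->num; i++)
		issue_cmd(&b->cmds[i]);
	if (b->num)
		issue_sync();	/* one sync per batch, not one per command */
	b->num = 0;
}

static void batch_add(struct cmd_batch *b, unsigned long iova, size_t size)
{
	if (b->num == BATCH_MAX)
		batch_flush(b);
	b->cmds[b->num++] = (struct inv_cmd){ .iova = iova, .size = size };
}

int main(void)
{
	struct cmd_batch b = { .num = 0 };

	/* Three small unmaps end up sharing a single CMD_SYNC. */
	batch_add(&b, 0x1000, 0x1000);
	batch_add(&b, 0x5000, 0x2000);
	batch_add(&b, 0x9000, 0x1000);
	batch_flush(&b);
	return 0;
}

With a batch size of 1, as in John's NVMe test, a single submitter has
nothing to coalesce; one reading of his suggestion is to let concurrent
small batches from different CPUs share one CMD_SYNC, which this
single-threaded toy does not attempt to show.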