Hi Will,


> On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
>> And for the overall system, we have:
>>
>>    PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
>> 0/40116 [4000Hz cycles],  (all, 96 CPUs)
>> --------------------------------------------------------------------------------------------------------------------------
>>
>>      27.43%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
>>      11.71%  [kernel]          [k] _raw_spin_unlock_irqrestore
>>       6.35%  [kernel]          [k] _raw_spin_unlock_irq
>>       2.65%  [kernel]          [k] get_user_pages_fast
>>       2.03%  [kernel]          [k] __slab_free
>>       1.55%  [kernel]          [k] tick_nohz_idle_exit
>>       1.47%  [kernel]          [k] arm_lpae_map
>>       1.39%  [kernel]          [k] __fget
>>       1.14%  [kernel]          [k] __lock_text_start
>>       1.09%  [kernel]          [k] _raw_spin_lock
>>       1.08%  [kernel]          [k] bio_release_pages.part.42
>>       1.03%  [kernel]          [k] __sbitmap_get_word
>>       0.97%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.42
>>       0.91%  [kernel]          [k] fput_many
>>       0.88%  [kernel]          [k] __arm_lpae_map
>>
>> One thing to note is that we still spend an appreciable amount of time in
>> arm_smmu_atc_inv_domain(), which is disappointing considering it should
>> effectively be a noop.
>>
>> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
>> batch size is 1, so we're not seeing the real benefit of the batching. I
>> can't help but think that we could improve this code to try to combine CMD
>> SYNCs for small batches.
>>
>> Anyway, let me know your thoughts or any questions. I'll have a look for
>> other possible bottlenecks if I get a chance.
>
> Did you ever get any more information on this? I don't have any SMMUv3
> hardware any more, so I can't really dig into this myself.

I'm only getting back to looking at this now, as SMMU performance is a bit of a hot topic for us again.

So one thing we are doing which does seem to help performance is this series from Marc:

https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d

That series just spreads the per-CPU load for NVMe interrupt handling (which is where the DMA unmapping happens), so I'd say it's really just side-stepping any SMMU issue.
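
For a rough sanity check of that spreading (illustrative commands only, assuming mpstat from the sysstat package is available), it's enough to watch where the nvme completion interrupts land and how the per-CPU irq/softirq time moves while the test runs:

   # Which CPUs are taking the nvme queue interrupts?
   grep nvme /proc/interrupts

   # Per-CPU irq/softirq breakdown, sampled every 2 seconds
   mpstat -P ALL 2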

Going back to the SMMU, I wanted to run eBPF and perf annotate to help profile this, but was having no luck getting them to work properly. I'll look at this again now.
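
For reference, the sort of thing I have in mind is roughly the following (a sketch only; the perf and bpftrace invocations are illustrative and assume kernel debug symbols are installed and that kprobes can attach to the arm_smmu_cmdq_issue_cmdlist symbol seen in the profile above):

   # System-wide call-graph capture while the workload runs, then
   # annotate the hot SMMU function from the earlier profile:
   perf record -a -g -- sleep 10
   perf annotate --stdio arm_smmu_cmdq_issue_cmdlist

   # Rough per-call latency histogram for the same function via kprobes:
   bpftrace -e '
   kprobe:arm_smmu_cmdq_issue_cmdlist { @start[tid] = nsecs; }
   kretprobe:arm_smmu_cmdq_issue_cmdlist /@start[tid]/ {
           @us = hist((nsecs - @start[tid]) / 1000);
           delete(@start[tid]);
   }'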

Cheers,
John