On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
Hi Will,


On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
And for the overall system, we have:

    PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
0/40116 [4000Hz cycles],  (all, 96 CPUs)
--------------------------------------------------------------------------------------------------------------------------

      27.43%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
      11.71%  [kernel]          [k] _raw_spin_unlock_irqrestore
       6.35%  [kernel]          [k] _raw_spin_unlock_irq
       2.65%  [kernel]          [k] get_user_pages_fast
       2.03%  [kernel]          [k] __slab_free
       1.55%  [kernel]          [k] tick_nohz_idle_exit
       1.47%  [kernel]          [k] arm_lpae_map
       1.39%  [kernel]          [k] __fget
       1.14%  [kernel]          [k] __lock_text_start
       1.09%  [kernel]          [k] _raw_spin_lock
       1.08%  [kernel]          [k] bio_release_pages.part.42
       1.03%  [kernel]          [k] __sbitmap_get_word
       0.97%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.42
       0.91%  [kernel]          [k] fput_many
       0.88%  [kernel]          [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in
arm_smmu_atc_inv_domain(), which is disappointing considering it should
effectively be a no-op when ATS is not in use.
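
For reference, the guard that I'd expect to make it a no-op looks roughly
like this (a simplified sketch from memory of arm-smmu-v3.c, field names
approximate, not the exact code):

/*
 * Simplified sketch: with no ATS-enabled masters in the domain,
 * nr_ats_masters is zero and we should return before building or
 * issuing any ATC invalidation commands.
 */
static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
				   int ssid, unsigned long iova, size_t size)
{
	if (!atomic_read(&smmu_domain->nr_ats_masters))
		return 0;	/* expected fast path without ATS */

	/* ... build ATC_INV commands and issue them to each ATS master ... */
	return 0;
}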

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during this testing our
batch size is 1, so we're not seeing the real benefit of the batching. I
can't help but think that we could improve this code to combine CMD_SYNCs
for small batches, something like the sketch below.
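
To illustrate the sort of thing I mean (purely a sketch with invented
names, and glossing over the completion guarantees the unmap path needs):
defer the CMD_SYNC until enough commands from small batches have
accumulated, or a caller explicitly requires completion, so that one sync
covers several batches:

/*
 * Illustrative only, not a patch: all names here are invented. The idea
 * is that one CMD_SYNC (and the wait for its consumption) covers several
 * small batches, instead of each batch paying for its own.
 */
#define CMDQ_SYNC_THRESHOLD	16

struct cmdq_pending {
	int nr_cmds;	/* commands issued since the last CMD_SYNC */
};

static void cmdq_issue_batch(struct cmdq_pending *pend, int n, bool force_sync)
{
	/* ... insert the n commands into the command queue as today ... */
	pend->nr_cmds += n;

	/*
	 * Only append a CMD_SYNC (and wait for it) when the caller insists
	 * or enough work has built up since the last sync.
	 */
	if (force_sync || pend->nr_cmds >= CMDQ_SYNC_THRESHOLD) {
		/* ... append CMD_SYNC and wait for completion ... */
		pend->nr_cmds = 0;
	}
}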

Anyway, let me know your thoughts or any questions. I'll have a look for
other possible bottlenecks if I get a chance.

Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.

I'm only getting back to look at this now, as SMMU performance is a bit of a
hot topic again for us.

So one thing we are doing which looks to help performance is this series
from Marc:

https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d

That is just spreading the per-CPU load for NVMe interrupt handling
(where the DMA unmapping is happening), so I'd say it really just
side-steps any SMMU issue.

Going back to the SMMU, I wanted to run eBPF and perf annotate to help
profile this, but was having no luck getting them to work properly. I'll
look at this again now.

Could you also try with the upcoming ATS changes currently in Will's tree?
They won't improve your numbers, but it'd be good to check that they don't
make things worse.

I can do when I get a chance.


I've run a bunch of netperf instances on multiple cores while collecting
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.

- 6.07% arm_smmu_iotlb_sync
    - 5.74% arm_smmu_tlb_inv_range
         5.09% arm_smmu_cmdq_issue_cmdlist
         0.28% __pi_memset
         0.08% __pi_memcpy
         0.08% arm_smmu_atc_inv_domain.constprop.37
         0.07% arm_smmu_cmdq_build_cmd
         0.01% arm_smmu_cmdq_batch_add
      0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync()
when ATS is not used. According to the annotations, the load from the
atomic_read() that checks whether the domain uses ATS accounts for 77% of
the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not
sure there is much room for optimization there.

Well, I did originally suggest using RCU protection to scan the list of
devices, instead of reading an atomic and checking for a non-zero value.
But that would be an optimisation for ATS also, and there were no ATS
devices at the time to verify the performance.
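
For the record, what I was thinking of was along these lines (only a
sketch, assuming the domain's masters sit on an RCU-protected list; list
and field names are approximate):

/*
 * Sketch only: walk the domain's device list under RCU and skip masters
 * without ATS enabled, instead of relying on a separate nr_ats_masters
 * atomic as the fast-path check. This assumes the list is updated with
 * the _rcu list helpers; names approximate.
 */
static int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
				   int ssid, unsigned long iova, size_t size)
{
	struct arm_smmu_master *master;

	rcu_read_lock();
	list_for_each_entry_rcu(master, &smmu_domain->devices, domain_head) {
		if (!master->ats_enabled)
			continue;
		/* ... build and issue ATC_INV for this master's stream IDs ... */
	}
	rcu_read_unlock();

	/* ... issue a CMD_SYNC if anything was sent ... */
	return 0;
}

Without any ATS-enabled masters that walk degenerates to an empty (or
near-empty) list traversal, so it would only really matter once ATS
devices exist to measure.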

Cheers,
John
