On 20/03/2020 10:41, John Garry wrote:
+ Barry, Alexandru
> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
> --------------------------------------------------------------------------------------------------------------------------
>   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
>   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    6.35%  [kernel]  [k] _raw_spin_unlock_irq
>    2.65%  [kernel]  [k] get_user_pages_fast
>    2.03%  [kernel]  [k] __slab_free
>    1.55%  [kernel]  [k] tick_nohz_idle_exit
>    1.47%  [kernel]  [k] arm_lpae_map
>    1.39%  [kernel]  [k] __fget
>    1.14%  [kernel]  [k] __lock_text_start
>    1.09%  [kernel]  [k] _raw_spin_lock
>    1.08%  [kernel]  [k] bio_release_pages.part.42
>    1.03%  [kernel]  [k] __sbitmap_get_word
>    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
>    0.91%  [kernel]  [k] fput_many
>    0.88%  [kernel]  [k] __arm_lpae_map
Hi Will, Robin,
I'm just getting around to look at this topic again. Here's the current
picture for my NVMe test:
perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
Overhead Shared Object Symbol
75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
3.28% [kernel] [k] arm_smmu_tlb_inv_range
2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
2.35% [kernel] [k] _raw_spin_unlock_irqrestore
1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
1.20% [kernel] [k] aio_complete_rw
0.96% [kernel] [k] enqueue_task_fair
0.93% [kernel] [k] gic_handle_irq
0.86% [kernel] [k] _raw_spin_lock_irqsave
0.72% [kernel] [k] put_reqs_available
0.72% [kernel] [k] sbitmap_queue_clear
* only certain CPUs run the dma unmap for my scenario, cpu0 being one of
them.
My colleague Barry has similar findings for some other scenarios.
So we tried the latest perf NMI support wip patches, and noticed a few
hotspots (see
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
and
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
when running some NVMe traffic:
- the initial cmpxchg to get a place in the queue
  - as more CPUs get involved, we start failing at an exponential rate

  arm_smmu_cmdq_issue_cmdlist():
     0.00 : ffff8000107a3500:  cas  x4, x2, [x27]
    26.52 : ffff8000107a3504:  mov  x0, x4

- the queue locking
- polling for CMD_SYNC completion
Some ideas to optimise:

a. Initial cmpxchg

This cmpxchg could be considered unfair. In addition, with all the contention on arm_smmu_cmdq.q, that cacheline is constantly pinged around the system.

Maybe we can implement something similar to the idea of queued/ticket spinlocks: after the initial cmpxchg fails, each CPU spins on its own copy of arm_smmu_cmdq.q, is released by its leader, and in turn releases the subsequent followers.
b. Drop the queue-full checking in certain circumstances

If the queue can never theoretically fill, then stop checking for queue full (or similar). This should also help problem a.: the less time spent between cmpxchg attempts, the lower the chance of failing (since we check the available queue space between cmpxchg attempts).

So if cmdq depth > nr_available_cpus * (max batch size + 1), AND we always issue a CMD_SYNC for a batch (regardless of whether one was requested), then we should never fill the queue (I think).
c. Don't do queue locking in certain circumstances
If we implement (and support) b. and support MSI polling, then I don't
think that this is required.
d. More minor ideas: move forward as soon as the "owner" stops gathering, to reduce the time spent advancing prod and hopefully the CMD_SYNC polling time; and use a smaller word size for the valid-bitmap operations, since 32b atomic operations may be more efficient overall than 64b ones (from my observation, the valid range checked is mostly < 16 bits).
Let me know your thoughts or any other ideas.
Thanks,
John
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu