On 20/03/2020 10:41, John Garry wrote:
+ Barry, Alexandru
> PerfTop: 85864 irqs/sec kernel:89.6% exact: 0.0% lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
> --------------------------------------------------------------------------------------------------------------------------
>   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
>   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    6.35%  [kernel]  [k] _raw_spin_unlock_irq
>    2.65%  [kernel]  [k] get_user_pages_fast
>    2.03%  [kernel]  [k] __slab_free
>    1.55%  [kernel]  [k] tick_nohz_idle_exit
>    1.47%  [kernel]  [k] arm_lpae_map
>    1.39%  [kernel]  [k] __fget
>    1.14%  [kernel]  [k] __lock_text_start
>    1.09%  [kernel]  [k] _raw_spin_lock
>    1.08%  [kernel]  [k] bio_release_pages.part.42
>    1.03%  [kernel]  [k] __sbitmap_get_word
>    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
>    0.91%  [kernel]  [k] fput_many
>    0.88%  [kernel]  [k] __arm_lpae_map
Hi Will, Robin,
I'm just getting around to look at this topic again. Here's the current
picture for my NVMe test:
perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
Overhead Shared Object Symbol
75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
3.28% [kernel] [k] arm_smmu_tlb_inv_range
2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
2.35% [kernel] [k] _raw_spin_unlock_irqrestore
1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
1.20% [kernel] [k] aio_complete_rw
0.96% [kernel] [k] enqueue_task_fair
0.93% [kernel] [k] gic_handle_irq
0.86% [kernel] [k] _raw_spin_lock_irqsave
0.72% [kernel] [k] put_reqs_available
0.72% [kernel] [k] sbitmap_queue_clear
* only certain CPUs run the dma unmap for my scenario, cpu0 being one of
them.
My colleague Barry has similar findings for some other scenarios.
So we tried the latest perf NMI support wip patches, and noticed a few
hotspots (see
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
and
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
when running some NVMe traffic:
- the initial cmpxchg to get a place in the queue
  - as more CPUs get involved, we start failing at an exponential rate

  arm_smmu_cmdq_issue_cmdlist():
     0.00 : ffff8000107a3500:  cas  x4, x2, [x27]
    26.52 : ffff8000107a3504:  mov  x0, x4

- the queue locking
- polling for CMD_SYNC completion
Some ideas to optimise:

a. Initial cmpxchg

This cmpxchg could be considered unfair. In addition, with all the contention on arm_smmu_cmdq.q, that cacheline is constantly pinged around the system.

Maybe we can implement something similar to the idea of queued/ticket spinlocks: after the initial cmpxchg fails, each CPU spins on its own copy of arm_smmu_cmdq.q, is released by its leader, and in turn releases the subsequent followers.
b. Drop the queue-full checking in certain circumstances

If the queue can never theoretically fill, then stop checking for queue full (or similar). This should also help problem a.: the less time spent between cmpxchg attempts, the lower the chance of failing (since we check the available queue space between cmpxchg attempts).

So if cmdq depth > nr_available_cpus * (max batch size + 1), AND we always issue a CMD_SYNC for a batch (regardless of whether one was requested), then we should never fill the queue (I think).
c. Don't do queue locking in certain circumstances
If we implement (and support) b. and support MSI polling, then I don't
think that this is required.
d. More minor ideas: move forward as soon as the "owner" stops gathering, to reduce the time spent advancing prod and hopefully the CMD_SYNC polling time; and use a smaller word size for the valid-bitmap operations, since 32b atomic operations may be more efficient overall than 64b ones (from my observation, the valid range checked is mostly < 16 bits).
Let me know your thoughts or any other ideas.
Thanks,
John
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu