> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
>
> On 20/03/2020 10:41, John Garry wrote:
> > + Barry, Alexandru
>
>>>>>   PerfTop:  85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434  drop: 0/40116  [4000Hz cycles],  (all, 96 CPUs)
>>>>> --------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>>     27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
>>>>>     11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
>>>>>      6.35%  [kernel]  [k] _raw_spin_unlock_irq
>>>>>      2.65%  [kernel]  [k] get_user_pages_fast
>>>>>      2.03%  [kernel]  [k] __slab_free
>>>>>      1.55%  [kernel]  [k] tick_nohz_idle_exit
>>>>>      1.47%  [kernel]  [k] arm_lpae_map
>>>>>      1.39%  [kernel]  [k] __fget
>>>>>      1.14%  [kernel]  [k] __lock_text_start
>>>>>      1.09%  [kernel]  [k] _raw_spin_lock
>>>>>      1.08%  [kernel]  [k] bio_release_pages.part.42
>>>>>      1.03%  [kernel]  [k] __sbitmap_get_word
>>>>>      0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
>>>>>      0.91%  [kernel]  [k] fput_many
>>>>>      0.88%  [kernel]  [k] __arm_lpae_map
>>>>>
>
> Hi Will, Robin,
>
> I'm just getting around to looking at this topic again. Here's the
> current picture for my NVMe test:
>
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead  Shared Object  Symbol
>   75.91%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
>    3.28%  [kernel]  [k] arm_smmu_tlb_inv_range
>    2.42%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.49
>    2.35%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    1.32%  [kernel]  [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
>    1.20%  [kernel]  [k] aio_complete_rw
>    0.96%  [kernel]  [k] enqueue_task_fair
>    0.93%  [kernel]  [k] gic_handle_irq
>    0.86%  [kernel]  [k] _raw_spin_lock_irqsave
>    0.72%  [kernel]  [k] put_reqs_available
>    0.72%  [kernel]  [k] sbitmap_queue_clear
>
> * Only certain CPUs run the DMA unmap for my scenario, cpu0 being one
>   of them.
>
> My colleague Barry has similar findings for some other scenarios.
I wrote a test module and used the parameter "ways" to control how busy the
SMMU is, so that I could compare the latency under different degrees of
contention:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/dma-mapping.h>
#include <linux/jiffies.h>

static int ways = 16;
module_param(ways, int, S_IRUGO);

static int seconds = 120;
module_param(seconds, int, S_IRUGO);

/* platform helper which returns the ZIP accelerator device */
extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, &ways, "map_test-%d", i);
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
	}

	return 0;
}

static void __exit test_exit(void)
{
	/* the test threads exit on their own after "seconds" elapse;
	 * only unload the module after that */
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

With ways=1, the SMMU is quite free with only one user, but
arm_smmu_cmdq_issue_cmdlist() still spends more than 60% of its time in
arm_smmu_cmdq_poll_until_sync(). It seems the SMMU reports the completion
of CMD_SYNC quite slowly.

When I increased "ways", the contention grew rapidly. With ways=16, more
than 40% of the time is spent on:

	cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)

and with ways=64, more than 60% of the time is spent there.

Here is a table of the whole dma_unmap latency, the time spent in
arm_smmu_cmdq_issue_cmdlist(), and the time spent waiting for CMD_SYNC,
for different values of "ways" (all in ns):

		whole unmap	arm_smmu_cmdq_issue_cmdlist()	wait CMD_SYNC
	ways=1	1956		1328				883
	ways=16	8891		7474				4000
	ways=32	22043		19519				6879
	ways=64	60842		55895				16746
	ways=96	101880		93649				24429

As you can see, even with ways=1 we still need ~2us to unmap:
arm_smmu_cmdq_issue_cmdlist() takes 60% of the dma_unmap time, and
CMD_SYNC takes more than 60% of the arm_smmu_cmdq_issue_cmdlist() time.
And when the SMMU is very busy, dma_unmap latency can become very large
due to contention, more than 100us.
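The shape of this contention is easy to reproduce outside the kernel. The
following is a hypothetical user-space sketch (names like "contend.c" and
"queue_val" are mine, not from the driver): a C11 atomic counter stands in
for cmdq->q.llq.val and the thread count stands in for "ways". The failure
count grows sharply with the thread count, matching the trend in the table
above.

/*
 * contend.c: user-space sketch of the prod-update retry loop.
 * Hypothetical demo code, not the driver.
 *
 * Build: gcc -O2 -pthread contend.c -o contend
 * Run:   ./contend 16
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000000

static _Atomic unsigned long queue_val;	/* stands in for cmdq->q.llq.val */
static _Atomic unsigned long failures;

static void *worker(void *unused)
{
	unsigned long fails = 0;

	for (int i = 0; i < ITERS; i++) {
		unsigned long old = atomic_load_explicit(&queue_val,
							 memory_order_relaxed);

		/*
		 * Retry until we win the race to advance the shared value;
		 * every failed attempt is another trip around the loop plus
		 * a cacheline bounce. (The weak form may also fail
		 * spuriously; close enough for a demo.)
		 */
		while (!atomic_compare_exchange_weak_explicit(&queue_val,
				&old, old + 1,
				memory_order_relaxed, memory_order_relaxed))
			fails++;
	}

	atomic_fetch_add(&failures, fails);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 16;
	pthread_t tids[nthreads];

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	printf("%d threads: %lu failed cmpxchg attempts for %lu successes\n",
	       nthreads, atomic_load(&failures),
	       (unsigned long)nthreads * ITERS);
	return 0;
}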
Thanks
Barry

> So we tried the latest perf NMI support wip patches, and noticed a few
> hotspots (see
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
> and
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
> when running some NVMe traffic:
>
> - the initial cmpxchg to get a place in the queue
>   - when more CPUs get involved, we start failing at an exponential rate:
>
>      0.00 :  ffff8000107a3500:  cas  x4, x2, [x27]
>     26.52 :  ffff8000107a3504:  mov  x0, x4
>
>   in arm_smmu_cmdq_issue_cmdlist()
>
> - the queue locking
> - polling cmd_sync
>
> Some ideas to optimise:
>
> a. initial cmpxchg
>
> So this cmpxchg could be considered unfair. In addition, with all the
> contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
> around the system.
>
> Maybe we can implement something similar to the idea of queued/ticketed
> spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the
> initial cmpxchg fails; it is released by its leader, and in turn releases
> subsequent followers (see the sketch at the end of this mail).
>
> b. Drop the queue-full checking in certain circumstances
>
> If we cannot theoretically fill the queue, then stop checking for the
> queue being full, or similar. This should also help with the problem in
> a.: the less time spent between cmpxchg attempts, the lower the chance of
> failing (as we check queue available space between cmpxchg attempts).
>
> So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we always
> issue a cmd_sync for a batch (regardless of whether requested), then we
> should never fill the queue (I think).
>
> c. Don't do queue locking in certain circumstances
>
> If we implement (and support) b. and support MSI polling, then I don't
> think that this is required.
>
> d. More minor ideas: move forward when the "owner" stops gathering, to
> reduce the time spent advancing prod and hopefully the cmd_sync polling
> time; and use a smaller word size for the valid-bitmap operations, since
> 32-bit atomic operations may be overall more efficient than 64-bit ones
> (the valid range to check is mostly < 16 bits from my observation).
>
> Let me know your thoughts or any other ideas.
>
> Thanks,
> John
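For reference, the "spin on your own copy, released by the leader" handover
in idea a. is essentially the shape of the classic MCS queued lock (the
kernel's queued spinlocks build on the same idea). Below is a minimal,
hypothetical user-space sketch with C11 atomics; the real thing would use
the kernel's atomics and the actual arm_smmu_cmdq.q fields, so this only
shows the structure being suggested, not a proposal for the driver.

/*
 * mcs.c: minimal MCS-style queued lock, the shape of idea a. above.
 *
 * Usage:
 *	struct mcs_node node;
 *	mcs_lock(&node);
 *	... critical section, e.g. advance the cmdq prod ...
 *	mcs_unlock(&node);
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;
};

static _Atomic(struct mcs_node *) tail;

static void mcs_lock(struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, true, memory_order_relaxed);

	/* The only contended atomic: a single swap to join the queue. */
	prev = atomic_exchange_explicit(&tail, node, memory_order_acq_rel);
	if (!prev)
		return;	/* queue was empty: we are the leader */

	/* Link in behind our predecessor... */
	atomic_store_explicit(&prev->next, node, memory_order_release);

	/* ...then spin only on our own copy, released by our leader. */
	while (atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

static void mcs_unlock(struct mcs_node *node)
{
	struct mcs_node *next =
		atomic_load_explicit(&node->next, memory_order_acquire);

	if (!next) {
		struct mcs_node *expected = node;

		/* No successor visible: try to mark the queue empty. */
		if (atomic_compare_exchange_strong_explicit(&tail, &expected,
				NULL, memory_order_acq_rel,
				memory_order_acquire))
			return;

		/* A successor is mid-enqueue; wait for it to link in. */
		while (!(next = atomic_load_explicit(&node->next,
						     memory_order_acquire)))
			;
	}

	/* Release the follower: hand the cacheline straight to it. */
	atomic_store_explicit(&next->locked, false, memory_order_release);
}

The point of the structure is that the only contended atomic is the single
exchange on the shared tail; after that, each waiter spins on a cacheline
it owns, and unlock hands that line directly to the next waiter instead of
letting every CPU ping it around, which is exactly the unfairness and
cacheline bouncing described in a.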