RE: arm-smmu-v3 high cpu usage for NVMe
> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
>
> On 20/03/2020 10:41, John Garry wrote:
> > + Barry, Alexandru
>
> >>>>>    PerfTop: 85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
> >>>>> --
> >>>>>
> >>>>>    27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> >>>>>    11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >>>>>     6.35%  [kernel]  [k] _raw_spin_unlock_irq
> >>>>>     2.65%  [kernel]  [k] get_user_pages_fast
> >>>>>     2.03%  [kernel]  [k] __slab_free
> >>>>>     1.55%  [kernel]  [k] tick_nohz_idle_exit
> >>>>>     1.47%  [kernel]  [k] arm_lpae_map
> >>>>>     1.39%  [kernel]  [k] __fget
> >>>>>     1.14%  [kernel]  [k] __lock_text_start
> >>>>>     1.09%  [kernel]  [k] _raw_spin_lock
> >>>>>     1.08%  [kernel]  [k] bio_release_pages.part.42
> >>>>>     1.03%  [kernel]  [k] __sbitmap_get_word
> >>>>>     0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> >>>>>     0.91%  [kernel]  [k] fput_many
> >>>>>     0.88%  [kernel]  [k] __arm_lpae_map
>
> Hi Will, Robin,
>
> I'm just getting around to look at this topic again. Here's the current picture for my NVMe test:
>
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead  Shared Object  Symbol
>   75.91%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
>    3.28%  [kernel]  [k] arm_smmu_tlb_inv_range
>    2.42%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.49
>    2.35%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    1.32%  [kernel]  [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
>    1.20%  [kernel]  [k] aio_complete_rw
>    0.96%  [kernel]  [k] enqueue_task_fair
>    0.93%  [kernel]  [k] gic_handle_irq
>    0.86%  [kernel]  [k] _raw_spin_lock_irqsave
>    0.72%  [kernel]  [k] put_reqs_available
>    0.72%  [kernel]  [k] sbitmap_queue_clear
>
> * only certain CPUs run the dma unmap for my scenario, cpu0 being one of them.
>
> Colleague Barry has similar findings for some other scenarios.

I wrote a test module which uses the parameter "ways" to simulate how busy the SMMU is, and compared the latency under different degrees of contention:

#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/module.h>
#include <linux/slab.h>

static int ways = 16;
module_param(ways, int, S_IRUGO);

static int seconds = 120;
module_param(seconds, int, S_IRUGO);

extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, NULL, "map_test-%d", i);
		if (IS_ERR(tsk))
			return PTR_ERR(tsk);
	}

	return 0;
}

module_init(test_init);
MODULE_LICENSE("GPL");

As the number of contending "ways" grows, an increasing share of the time is spent in:

	cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)

and when ways=64, more than 60% of the time is spent on that cmpxchg.

Here is a table of the whole dma_unmap, arm_smmu_cmdq_issue_cmdlist() and CMD_SYNC wait durations for different ways:

         whole unmap(ns)  arm_smmu_cmdq_issue_cmdlist()(ns)  wait CMD_SYNC(ns)
ways=1   1956             1328                               883
ways=16  8891             7474                               4000
ways=32  22043            19519
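To make the contention pattern concrete, here is a minimal, self-contained userspace analogue of that shared-word CAS (illustrative only - the names, the packed prod/cons layout and the thread count are assumptions, not the driver's code). It shows how failed-CAS retries grow as more "ways" pile onto one word:

/* cas_contend.c - N threads reserve queue space by CAS on one shared
 * prod/cons word, loosely mimicking the reservation step of
 * arm_smmu_cmdq_issue_cmdlist(). Build: cc -O2 -pthread cas_contend.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define ITERS 1000000
enum { WAYS = 16 };                     /* mirrors the "ways" parameter */

static _Atomic uint64_t llq;            /* stand-in for cmdq->q.llq.val */

static void *reserver(void *arg)
{
	unsigned long *failures = arg;

	for (int i = 0; i < ITERS; i++) {
		uint64_t old = atomic_load_explicit(&llq, memory_order_relaxed);

		/* bump "prod" in the high half, as if reserving one slot */
		while (!atomic_compare_exchange_weak_explicit(&llq, &old,
				old + (1ULL << 32),
				memory_order_relaxed, memory_order_relaxed))
			(*failures)++;  /* wasted RMW on a hot cacheline */
	}
	return NULL;
}

int main(void)
{
	pthread_t t[WAYS];
	unsigned long fails[WAYS] = { 0 };

	for (int i = 0; i < WAYS; i++)
		pthread_create(&t[i], NULL, reserver, &fails[i]);
	for (int i = 0; i < WAYS; i++) {
		pthread_join(t[i], NULL);
		printf("thread %2d: %lu CAS failures\n", i, fails[i]);
	}
	return 0;
}

Every failed compare-exchange is another exclusive acquisition of the cacheline holding the shared word, which is consistent with the cmpxchg share rising with ways in the table above.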
Re: arm-smmu-v3 high cpu usage for NVMe
On 20/03/2020 10:41, John Garry wrote:

+ Barry, Alexandru

   PerfTop: 85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
--

    27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
    11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
     6.35%  [kernel]  [k] _raw_spin_unlock_irq
     2.65%  [kernel]  [k] get_user_pages_fast
     2.03%  [kernel]  [k] __slab_free
     1.55%  [kernel]  [k] tick_nohz_idle_exit
     1.47%  [kernel]  [k] arm_lpae_map
     1.39%  [kernel]  [k] __fget
     1.14%  [kernel]  [k] __lock_text_start
     1.09%  [kernel]  [k] _raw_spin_lock
     1.08%  [kernel]  [k] bio_release_pages.part.42
     1.03%  [kernel]  [k] __sbitmap_get_word
     0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
     0.91%  [kernel]  [k] fput_many
     0.88%  [kernel]  [k] __arm_lpae_map

Hi Will, Robin,

I'm just getting around to look at this topic again. Here's the current picture for my NVMe test:

perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
Overhead  Shared Object  Symbol
  75.91%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
   3.28%  [kernel]  [k] arm_smmu_tlb_inv_range
   2.42%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.49
   2.35%  [kernel]  [k] _raw_spin_unlock_irqrestore
   1.32%  [kernel]  [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
   1.20%  [kernel]  [k] aio_complete_rw
   0.96%  [kernel]  [k] enqueue_task_fair
   0.93%  [kernel]  [k] gic_handle_irq
   0.86%  [kernel]  [k] _raw_spin_lock_irqsave
   0.72%  [kernel]  [k] put_reqs_available
   0.72%  [kernel]  [k] sbitmap_queue_clear

* only certain CPUs run the dma unmap for my scenario, cpu0 being one of them.

Colleague Barry has similar findings for some other scenarios.

So we tried the latest perf NMI support wip patches, and noticed a few hotspots (see
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
and
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
when running some NVMe traffic:

- initial cmpxchg to get a place in the queue
  - when more CPUs get involved, we start failing at an exponential rate

   0.00 :   8000107a3500:   cas     x4, x2, [x27]
  26.52 :   8000107a3504:   mov     x0, x4
        : arm_smmu_cmdq_issue_cmdlist():

- the queue locking
- polling cmd_sync

Some ideas to optimise:

a. initial cmpxchg

So this cmpxchg could be considered unfair. In addition, with all the contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged around the system. Maybe we can implement something similar to the idea of queued/ticketed spinlocks: make a CPU spin on its own copy of arm_smmu_cmdq.q after the initial cmpxchg fails, to be released by its leader, and in turn releasing subsequent followers (a rough sketch follows this mail).

b. Drop the queue-full checking in certain circumstances

If we cannot theoretically fill the queue, then stop the checking for queue full or similar. This should also help the current problem of a.: the less time between cmpxchg attempts, the less chance of failing (as we check queue available space between cmpxchg attempts). So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we always issue a cmd_sync for a batch (regardless of whether requested), then we should never fill the queue (I think).

c. Don't do queue locking in certain circumstances

If we implement (and support) b. and support MSI polling, then I don't think that this is required.
d. More minor ideas are to move forward when the "owner" stops gathering, to reduce the time spent advancing the prod pointer and hopefully the cmd_sync polling time; and to use a smaller word size for the valid-bitmap operations - maybe 32b atomic operations are overall more efficient than 64b ones, as the valid range checked is mostly < 16 bits from my observation.

Let me know your thoughts or any other ideas.

Thanks,
John
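To flesh out idea a.: an MCS-style handoff gives each waiter a private node to spin on, paying one atomic exchange to join the queue instead of an unbounded CAS retry against the shared word. A rough, userspace-only sketch under those assumptions (lock/unlock primitives only - not a driver patch, and integration with the prod update is elided):

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct qnode {
	_Atomic(struct qnode *) next;
	_Atomic bool locked;
};

static _Atomic(struct qnode *) tail;	/* NULL when the queue is empty */

static void q_enter(struct qnode *me)
{
	struct qnode *prev;

	atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&me->locked, true, memory_order_relaxed);

	/* one RMW to join the queue, instead of CAS-retrying a shared word */
	prev = atomic_exchange_explicit(&tail, me, memory_order_acq_rel);
	if (!prev)
		return;				/* queue was empty: we are the leader */

	atomic_store_explicit(&prev->next, me, memory_order_release);
	while (atomic_load_explicit(&me->locked, memory_order_acquire))
		;				/* spin on our own cacheline only */
}

static void q_release(struct qnode *me)
{
	struct qnode *next = atomic_load_explicit(&me->next, memory_order_acquire);

	if (!next) {
		struct qnode *expected = me;

		/* nobody queued behind us: try to close the queue */
		if (atomic_compare_exchange_strong_explicit(&tail, &expected,
				NULL, memory_order_acq_rel, memory_order_acquire))
			return;
		/* a successor is mid-enqueue; wait for its link to appear */
		while (!(next = atomic_load_explicit(&me->next, memory_order_acquire)))
			;
	}
	atomic_store_explicit(&next->locked, false, memory_order_release);
}

The leader proceeds immediately and each follower is released by its predecessor, which is roughly the leader/follower handoff described in a. above.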
Re: arm-smmu-v3 high cpu usage for NVMe
On 02/04/2020 13:10, John Garry wrote:

On 18/03/2020 20:53, Will Deacon wrote:

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look if I get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself.

Hi Will,

JFYI, I added some debug in arm_smmu_cmdq_issue_cmdlist() to get some idea of what is going on. Perf annotate did not tell much.

I tested NVMe performance with and without Marc's patchset to spread LPIs for managed interrupts.

Average duration of arm_smmu_cmdq_issue_cmdlist(), mainline [all results are approximations]:

  owner:     6ms
  non-owner: 4ms

mainline + LPI spreading patchset:

  owner:     25ms
  non-owner: 22ms

For this, a list would be a TLBI + CMD_SYNC.

Please note that the LPI spreading patchset is still giving a circa 25% NVMe throughput increase. What happens there is that we get many more CPUs involved, which creates more inter-CPU contention. But the performance increase comes from just alleviating pressure on those overloaded CPUs.

I also notice that with the LPI spreading patchset, on average a CPU is an "owner" in arm_smmu_cmdq_issue_cmdlist() 1 time in 8, as opposed to 1 in 3 for mainline. This means that we're just creating longer chains of lists to be published.

But I found that for a non-owner, the average MSI cmd_sync polling time is 12ms with the LPI spreading patchset. As such, it seems to be really taking approx (12 * 2 / (8 - 1) =) ~3ms to consume a single list. This seems consistent with my finding that an owner polls consumption for 3ms also. Without the LPI spreading patchset, polling time is approx 2ms and 3ms for owner and non-owner, respectively.

As an experiment, I did try to hack the code to use a spinlock again for protecting the command queue, instead of the current solution - and always saw a performance drop there. To be expected. But maybe we can try to not use a spinlock, yet still serialise production+consumption, to alleviate the long polling periods.

Let me know your thoughts.

Cheers,
John
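Spelling out that estimate, assuming a non-owner joins a chain of 8 lists at a random position, so on average it waits behind half of the other 7:

	12ms = t * (8 - 1) / 2   =>   t = 12 * 2 / (8 - 1) = ~3.4ms per list

which is where the ~3ms per-list figure comes from, and why it lines up with the owner polling consumption for ~3ms as well.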
Re: arm-smmu-v3 high cpu usage for NVMe
FWIW I believe it is still on the plan for someone here to dust off the PMU pNMI patches at some point.

Cool. Well I can try to experiment with what Julien had at v4 for now.

JFYI, I have done some more perf record capturing, and updated the "annotate" and "report" output here:
https://raw.githubusercontent.com/hisilicon/kernel-dev/679eca1008b1d11b42e1b5fa8a205266c240d1e1/ann.txt
and .../report

This capture is just for cpu0, since NVMe irq handling + dma unmapping will occur on specific CPUs, cpu0 being one of them.

The reports look somewhat sane. So we no longer have ~99% of time attributed to re-enabling interrupts; now that's like:

    3.14 :   80001071eae0:   ldr     w0, [x29, #108]
         :         int ret = 0;
    0.00 :   80001071eae4:   mov     w24, #0x0    // #0
         :         if (sync) {
    0.00 :   80001071eae8:   cbnz    w0, 80001071eb44
         : arch_local_irq_restore():
         : asm volatile(ALTERNATIVE(
    0.00 :   80001071eaec:   msr     daif, x21
         : arch_static_branch():
    0.25 :   80001071eaf0:   nop
         : arm_smmu_cmdq_issue_cmdlist():
         :         }
         :     }
         :
         :     local_irq_restore(flags);
         :     return ret;
         : }

One observation (if these reports are to be believed) is that we may spend a lot of time in the CAS loop, trying to get a place in the queue initially:

         : __CMPXCHG_CASE(w,  ,     ,  32,   )
         : __CMPXCHG_CASE(x,  ,     ,  64,   )
    0.00 :   80001071e828:   mov     x0, x27
    0.00 :   80001071e82c:   mov     x4, x1
    0.00 :   80001071e830:   cas     x4, x2, [x27]
   28.61 :   80001071e834:   mov     x0, x4
         : arm_smmu_cmdq_issue_cmdlist():
         :         if (old == llq.val)
    0.00 :   80001071e838:   ldr     x1, [x23]

John
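For reference, captures in that shape can be produced with something along these lines (illustrative invocations - the exact flags used aren't recorded in the thread):

perf record -C 0 -e cycles:ppp -- sleep 30
perf report --stdio
perf annotate --stdio arm_smmu_cmdq_issue_cmdlist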
Re: arm-smmu-v3 high cpu usage for NVMe
On 24/03/2020 12:07, Robin Murphy wrote:

On 2020-03-24 11:55 am, John Garry wrote:

On 24/03/2020 10:43, Marc Zyngier wrote:

On Tue, 24 Mar 2020 09:18:10 +0000, John Garry wrote:

On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,

Time to enable pseudo-NMIs in the PMUv3 driver...

Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.

So those patches still apply cleanly (apart from the kvm patch, which I can skip, I suppose) and build, so I can try this, I figure. Is there anything else which I should ensure or know about, apart from enabling CONFIG_ARM64_PSEUDO_NMI?

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05 has it set to 0, preventing me from being able to use the feature (hint, nudge... ;-).

Yeah, apparently it's set on our D06CS board, but I just need to double-check the FW version with our FW guy.

Hopefully you saw the help for CONFIG_ARM64_PSEUDO_NMI already, but since it's not been called out: this high-priority configuration for interrupts needs to be explicitly enabled by setting the kernel parameter "irqchip.gicv3_pseudo_nmi" to 1.

Yeah, I saw that by chance somewhere else previously.

FWIW I believe it is still on the plan for someone here to dust off the PMU pNMI patches at some point.

Cool. Well I can try to experiment with what Julien had at v4 for now.

Cheers,
John
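Gathering the prerequisites mentioned above in one place (layout illustrative):

CONFIG_ARM64_PSEUDO_NMI=y	# Kconfig option
irqchip.gicv3_pseudo_nmi=1	# kernel command-line parameter
SCR_EL3.FIQ == 1		# must be set by firmware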
Re: arm-smmu-v3 high cpu usage for NVMe
On 2020-03-24 11:55 am, John Garry wrote:

On 24/03/2020 10:43, Marc Zyngier wrote:

On Tue, 24 Mar 2020 09:18:10 +0000, John Garry wrote:

On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,

Time to enable pseudo-NMIs in the PMUv3 driver...

Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.

So those patches still apply cleanly (apart from the kvm patch, which I can skip, I suppose) and build, so I can try this, I figure. Is there anything else which I should ensure or know about, apart from enabling CONFIG_ARM64_PSEUDO_NMI?

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05 has it set to 0, preventing me from being able to use the feature (hint, nudge... ;-).

Yeah, apparently it's set on our D06CS board, but I just need to double-check the FW version with our FW guy.

Hopefully you saw the help for CONFIG_ARM64_PSEUDO_NMI already, but since it's not been called out: this high-priority configuration for interrupts needs to be explicitly enabled by setting the kernel parameter "irqchip.gicv3_pseudo_nmi" to 1.

FWIW I believe it is still on the plan for someone here to dust off the PMU pNMI patches at some point.

Robin.
Re: arm-smmu-v3 high cpu usage for NVMe
On 24/03/2020 10:43, Marc Zyngier wrote:

On Tue, 24 Mar 2020 09:18:10 +0000, John Garry wrote:

On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,

Time to enable pseudo-NMIs in the PMUv3 driver...

Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.

So those patches still apply cleanly (apart from the kvm patch, which I can skip, I suppose) and build, so I can try this, I figure. Is there anything else which I should ensure or know about, apart from enabling CONFIG_ARM64_PSEUDO_NMI?

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05 has it set to 0, preventing me from being able to use the feature (hint, nudge... ;-).

Yeah, apparently it's set on our D06CS board, but I just need to double-check the FW version with our FW guy.

As for D05, there has not been a FW update there in quite a long time, and there are no plans for one. Sorry.

Cheers,
John
Re: arm-smmu-v3 high cpu usage for NVMe
On Tue, 24 Mar 2020 09:18:10 +0000
John Garry wrote:

> On 23/03/2020 09:16, Marc Zyngier wrote:
> > + Julien, Mark
>
> Hi Marc,
>
> >>> Time to enable pseudo-NMIs in the PMUv3 driver...
> >>
> >> Do you know if there is any plan for this?
> >
> > There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.
>
> So those patches still apply cleanly (apart from the kvm patch, which I can skip, I suppose) and build, so I can try this I figure. Is there anything else which I should ensure or know about? Apart from enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05 has it set to 0, preventing me from being able to use the feature (hint, nudge... ;-).

M.

--
Jazz is not dead. It just smells funny...
Re: arm-smmu-v3 high cpu usage for NVMe
On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,

Time to enable pseudo-NMIs in the PMUv3 driver...

Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.

So those patches still apply cleanly (apart from the kvm patch, which I can skip, I suppose) and build, so I can try this, I figure. Is there anything else which I should ensure or know about, apart from enabling CONFIG_ARM64_PSEUDO_NMI?

A quickly taken perf annotate and report is at the tip here:
https://github.com/hisilicon/kernel-dev/commits/private-topic-nvme-5.6-profiling

In the meantime, maybe I can do some trickery by putting the local_irq_restore() in a separate function, outside arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function.

Scratch that :)

I don't see how you can improve the profiling without compromising the locking in this case...

Cheers,
John

[1] https://patchwork.kernel.org/cover/11047407/
Re: arm-smmu-v3 high cpu usage for NVMe
On 2020-03-23 09:03, John Garry wrote:

On 20/03/2020 16:33, Marc Zyngier wrote:

JFYI, I've been playing with "perf annotate" today and it's giving strange results for my NVMe testing. So "report" looks somewhat sane, apart from a worryingly high % for arm_smmu_cmdq_issue_cmdlist():

  55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
   9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
   1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
   1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
   1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
   1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction consumes ~99% of the load for the enqueue function:

         : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
         :     if (sync) {
    0.00 :   80001071c948:   ldr     w0, [x29, #108]
         :     int ret = 0;
    0.00 :   80001071c94c:   mov     w24, #0x0    // #0
         :     if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990
         : arch_local_irq_restore():
    0.00 :   80001071c954:   msr     daif, x21
         : arm_smmu_cmdq_issue_cmdlist():
         :     }
         : }
         :
         : local_irq_restore(flags);
         : return ret;
         : }
   99.51 :   80001071c958:   adrp    x0, 800011909000

Hi Marc,

This is likely the side effect of the re-enabling of interrupts (msr daif, x21) on the previous instruction, which causes the perf interrupt to fire right after.

ok, makes sense.

Time to enable pseudo-NMIs in the PMUv3 driver...

Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need reviving.

In the meantime, maybe I can do some trickery by putting the local_irq_restore() in a separate function, outside arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function.

I don't see how you can improve the profiling without compromising the locking in this case...

Thanks,

M.

[1] https://patchwork.kernel.org/cover/11047407/

--
Jazz is not dead. It just smells funny...
Re: arm-smmu-v3 high cpu usage for NVMe
On 20/03/2020 16:33, Marc Zyngier wrote:

JFYI, I've been playing with "perf annotate" today and it's giving strange results for my NVMe testing. So "report" looks somewhat sane, apart from a worryingly high % for arm_smmu_cmdq_issue_cmdlist():

  55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
   9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
   1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
   1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
   1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
   1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction consumes ~99% of the load for the enqueue function:

         : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
         :     if (sync) {
    0.00 :   80001071c948:   ldr     w0, [x29, #108]
         :     int ret = 0;
    0.00 :   80001071c94c:   mov     w24, #0x0    // #0
         :     if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990
         : arch_local_irq_restore():
    0.00 :   80001071c954:   msr     daif, x21
         : arm_smmu_cmdq_issue_cmdlist():
         :     }
         : }
         :
         : local_irq_restore(flags);
         : return ret;
         : }
   99.51 :   80001071c958:   adrp    x0, 800011909000

Hi Marc,

This is likely the side effect of the re-enabling of interrupts (msr daif, x21) on the previous instruction, which causes the perf interrupt to fire right after.

ok, makes sense.

Time to enable pseudo-NMIs in the PMUv3 driver...

Do you know if there is any plan for this?

In the meantime, maybe I can do some trickery by putting the local_irq_restore() in a separate function, outside arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function.

Cheers,
John
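What that trickery might look like, with a hypothetical helper name (profiling-only restructuring; moving the restore out of line is functionally equivalent, the deferred PMU sample just gets attributed to the helper instead):

/* hypothetical, for profiling only: the sample raised on re-enabling
 * interrupts lands here rather than in arm_smmu_cmdq_issue_cmdlist() */
static noinline void cmdq_profiled_irq_restore(unsigned long flags)
{
	local_irq_restore(flags);
}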
Re: arm-smmu-v3 high cpu usage for NVMe
Hi John,

On 2020-03-20 16:20, John Garry wrote:

I've run a bunch of netperf instances on multiple cores and collected SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty consistently:

   - 6.07% arm_smmu_iotlb_sync
      - 5.74% arm_smmu_tlb_inv_range
           5.09% arm_smmu_cmdq_issue_cmdlist
           0.28% __pi_memset
           0.08% __pi_memcpy
           0.08% arm_smmu_atc_inv_domain.constprop.37
           0.07% arm_smmu_cmdq_build_cmd
           0.01% arm_smmu_cmdq_batch_add
        0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(), when ATS is not used. According to the annotations, the load from the atomic_read(), which checks whether the domain uses ATS, is 77% of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure there is much room for optimization there.

Well I did originally suggest using RCU protection to scan the list of devices, instead of reading an atomic and checking for a non-zero value. But that would be an optimisation for ATS also, and there were no ATS devices at the time (to verify performance).

Heh, I have yet to get my hands on one. Currently I can't evaluate ATS performance, but I agree that using RCU to scan the list should get better results when using ATS. When ATS isn't in use however, I suspect reading nr_ats_masters should be more efficient than taking the RCU lock + reading an "ats_devices" list (since the smmu_domain->devices list also serves context descriptor invalidation, even when ATS isn't in use). I'll run some tests however, to see if I can micro-optimize this case, but I don't expect noticeable improvements.

ok, cheers. I, too, would not expect a significant improvement there.

JFYI, I've been playing with "perf annotate" today and it's giving strange results for my NVMe testing. So "report" looks somewhat sane, apart from a worryingly high % for arm_smmu_cmdq_issue_cmdlist():

  55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
   9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
   1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
   1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
   1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
   1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction consumes ~99% of the load for the enqueue function:

         : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
         :     if (sync) {
    0.00 :   80001071c948:   ldr     w0, [x29, #108]
         :     int ret = 0;
    0.00 :   80001071c94c:   mov     w24, #0x0    // #0
         :     if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990
         : arch_local_irq_restore():
    0.00 :   80001071c954:   msr     daif, x21
         : arm_smmu_cmdq_issue_cmdlist():
         :     }
         : }
         :
         : local_irq_restore(flags);
         : return ret;
         : }
   99.51 :   80001071c958:   adrp    x0, 800011909000

This is likely the side effect of the re-enabling of interrupts (msr daif, x21) on the previous instruction, which causes the perf interrupt to fire right after.

Time to enable pseudo-NMIs in the PMUv3 driver...

M.

--
Jazz is not dead. It just smells funny...
Re: arm-smmu-v3 high cpu usage for NVMe
I've run a bunch of netperf instances on multiple cores and collected SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty consistently:

   - 6.07% arm_smmu_iotlb_sync
      - 5.74% arm_smmu_tlb_inv_range
           5.09% arm_smmu_cmdq_issue_cmdlist
           0.28% __pi_memset
           0.08% __pi_memcpy
           0.08% arm_smmu_atc_inv_domain.constprop.37
           0.07% arm_smmu_cmdq_build_cmd
           0.01% arm_smmu_cmdq_batch_add
        0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(), when ATS is not used. According to the annotations, the load from the atomic_read(), which checks whether the domain uses ATS, is 77% of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure there is much room for optimization there.

Well I did originally suggest using RCU protection to scan the list of devices, instead of reading an atomic and checking for a non-zero value. But that would be an optimisation for ATS also, and there were no ATS devices at the time (to verify performance).

Heh, I have yet to get my hands on one. Currently I can't evaluate ATS performance, but I agree that using RCU to scan the list should get better results when using ATS. When ATS isn't in use however, I suspect reading nr_ats_masters should be more efficient than taking the RCU lock + reading an "ats_devices" list (since the smmu_domain->devices list also serves context descriptor invalidation, even when ATS isn't in use). I'll run some tests however, to see if I can micro-optimize this case, but I don't expect noticeable improvements.

ok, cheers. I, too, would not expect a significant improvement there.

JFYI, I've been playing with "perf annotate" today and it's giving strange results for my NVMe testing. So "report" looks somewhat sane, apart from a worryingly high % for arm_smmu_cmdq_issue_cmdlist():

  55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
   9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
   2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
   1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
   1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
   1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
   1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction consumes ~99% of the load for the enqueue function:

         : /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
         :     if (sync) {
    0.00 :   80001071c948:   ldr     w0, [x29, #108]
         :     int ret = 0;
    0.00 :   80001071c94c:   mov     w24, #0x0    // #0
         :     if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990
         : arch_local_irq_restore():
    0.00 :   80001071c954:   msr     daif, x21
         : arm_smmu_cmdq_issue_cmdlist():
         :     }
         : }
         :
         : local_irq_restore(flags);
         : return ret;
         : }
   99.51 :   80001071c958:   adrp    x0, 800011909000
    0.00 :   80001071c95c:   add     x21, x0, #0x908
    0.02 :   80001071c960:   ldr     x2, [x29, #488]
    0.14 :   80001071c964:   ldr     x1, [x21]
    0.00 :   80001071c968:   eor     x1, x2, x1
    0.00 :   80001071c96c:   mov     w0, w24

But there may be a hint that we're consuming a lot of time polling for CMD_SYNC consumption.

The files are available here:
https://raw.githubusercontent.com/hisilicon/kernel-dev/private-topic-nvme-5.6-profiling/ann.txt, report

Or maybe I'm just not using the tool properly ...

Cheers,
John
Re: arm-smmu-v3 high cpu usage for NVMe
On Fri, Mar 20, 2020 at 10:41:44AM +0000, John Garry wrote:
> On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
> > On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> > > Hi Will,
> > >
> > > > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > > > And for the overall system, we have:
> > > > >
> > > > >   PerfTop: 85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
> > > > > --
> > > > >
> > > > >   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> > > > >   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > > > >    6.35%  [kernel]  [k] _raw_spin_unlock_irq
> > > > >    2.65%  [kernel]  [k] get_user_pages_fast
> > > > >    2.03%  [kernel]  [k] __slab_free
> > > > >    1.55%  [kernel]  [k] tick_nohz_idle_exit
> > > > >    1.47%  [kernel]  [k] arm_lpae_map
> > > > >    1.39%  [kernel]  [k] __fget
> > > > >    1.14%  [kernel]  [k] __lock_text_start
> > > > >    1.09%  [kernel]  [k] _raw_spin_lock
> > > > >    1.08%  [kernel]  [k] bio_release_pages.part.42
> > > > >    1.03%  [kernel]  [k] __sbitmap_get_word
> > > > >    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> > > > >    0.91%  [kernel]  [k] fput_many
> > > > >    0.88%  [kernel]  [k] __arm_lpae_map
> > > > >
> > > > > One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing when considering it should effectively be a noop.
> > > > >
> > > > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD SYNCs for small batches.
> > > > >
> > > > > Anyway, let me know your thoughts or any questions. I'll have a look if I get a chance for other possible bottlenecks.
> > > >
> > > > Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself.
> > >
> > > I'm only getting back to look at this now, as SMMU performance is a bit of a hot topic again for us.
> > >
> > > So one thing we are doing which looks to help performance is this series from Marc:
> > >
> > > https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
> > >
> > > So that is just spreading the per-CPU load for NVMe interrupt handling (where the DMA unmapping is happening), so I'd say just side-stepping any SMMU issue really.
> > >
> > > Going back to the SMMU, I wanted to run eBPF and perf annotate to help profile this, but was having no luck getting them to work properly. I'll look at this again now.
> >
> > Could you also try with the upcoming ATS changes currently in Will's tree? They won't improve your numbers but it'd be good to check that they don't make things worse.
>
> I can do when I get a chance.
>
> > I've run a bunch of netperf instances on multiple cores and collected SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty consistently:
> >
> > - 6.07% arm_smmu_iotlb_sync
> >    - 5.74% arm_smmu_tlb_inv_range
> >         5.09% arm_smmu_cmdq_issue_cmdlist
> >         0.28% __pi_memset
> >         0.08% __pi_memcpy
> >         0.08% arm_smmu_atc_inv_domain.constprop.37
> >         0.07% arm_smmu_cmdq_build_cmd
> >         0.01% arm_smmu_cmdq_batch_add
> >      0.31% __pi_memset
> >
> > So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(), when ATS is not used. According to the annotations, the load from the atomic_read(), which checks whether the domain uses ATS, is 77% of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure there is much room for optimization there.
>
> Well I did originally suggest using RCU protection to scan the list of devices, instead of reading an atomic and checking for a non-zero value. But that would be an optimisation for ATS also, and there were no ATS devices at the time (to verify performance).

Heh, I have yet to get my hands on one. Currently I can't evaluate ATS performance, but I agree that using RCU to scan the list should get better results when using ATS. When ATS isn't in use however, I suspect reading nr_ats_masters should be more efficient than taking the RCU lock + reading an "ats_devices" list (since the smmu_domain->devices list also serves context descriptor invalidation, even when ATS isn't in use). I'll run some tests however, to see if I can micro-optimize this case, but I don't expect noticeable improvements.

Thanks,
Jean
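As a sketch of the two options being weighed (paraphrased - neither fragment is the actual driver code, and "ats_devices" plus the invalidate helper are hypothetical names):

/* current-style fast path: one atomic read, no list walk when ATS is off */
if (!atomic_read(&smmu_domain->nr_ats_masters))
	return 0;

/* RCU alternative: walk a hypothetical ATS-only list - cheaper scanning
 * when ATS is in use, but a lock + walk to pay for when it is not */
rcu_read_lock();
list_for_each_entry_rcu(master, &smmu_domain->ats_devices, domain_head)
	atc_invalidate_master(master);		/* hypothetical helper */
rcu_read_unlock();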
Re: arm-smmu-v3 high cpu usage for NVMe
On 19/03/2020 18:43, Jean-Philippe Brucker wrote:

On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:

Hi Will,

On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:

And for the overall system, we have:

   PerfTop: 85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
--

    27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
    11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
     6.35%  [kernel]  [k] _raw_spin_unlock_irq
     2.65%  [kernel]  [k] get_user_pages_fast
     2.03%  [kernel]  [k] __slab_free
     1.55%  [kernel]  [k] tick_nohz_idle_exit
     1.47%  [kernel]  [k] arm_lpae_map
     1.39%  [kernel]  [k] __fget
     1.14%  [kernel]  [k] __lock_text_start
     1.09%  [kernel]  [k] _raw_spin_lock
     1.08%  [kernel]  [k] bio_release_pages.part.42
     1.03%  [kernel]  [k] __sbitmap_get_word
     0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
     0.91%  [kernel]  [k] fput_many
     0.88%  [kernel]  [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing when considering it should effectively be a noop.

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look if I get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself.

I'm only getting back to look at this now, as SMMU performance is a bit of a hot topic again for us.

So one thing we are doing which looks to help performance is this series from Marc:

https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d

So that is just spreading the per-CPU load for NVMe interrupt handling (where the DMA unmapping is happening), so I'd say just side-stepping any SMMU issue really.

Going back to the SMMU, I wanted to run eBPF and perf annotate to help profile this, but was having no luck getting them to work properly. I'll look at this again now.

Could you also try with the upcoming ATS changes currently in Will's tree? They won't improve your numbers but it'd be good to check that they don't make things worse.

I can do when I get a chance.

I've run a bunch of netperf instances on multiple cores and collected SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty consistently:

   - 6.07% arm_smmu_iotlb_sync
      - 5.74% arm_smmu_tlb_inv_range
           5.09% arm_smmu_cmdq_issue_cmdlist
           0.28% __pi_memset
           0.08% __pi_memcpy
           0.08% arm_smmu_atc_inv_domain.constprop.37
           0.07% arm_smmu_cmdq_build_cmd
           0.01% arm_smmu_cmdq_batch_add
        0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(), when ATS is not used. According to the annotations, the load from the atomic_read(), which checks whether the domain uses ATS, is 77% of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure there is much room for optimization there.

Well I did originally suggest using RCU protection to scan the list of devices, instead of reading an atomic and checking for a non-zero value. But that would be an optimisation for ATS also, and there were no ATS devices at the time (to verify performance).

Cheers,
John
Re: arm-smmu-v3 high cpu usage for NVMe
On Thu, Mar 19, 2020 at 12:54:59PM +0000, John Garry wrote:
> Hi Will,
>
> > On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> > > And for the overall system, we have:
> > >
> > >   PerfTop: 85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
> > > --
> > >
> > >   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> > >   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > >    6.35%  [kernel]  [k] _raw_spin_unlock_irq
> > >    2.65%  [kernel]  [k] get_user_pages_fast
> > >    2.03%  [kernel]  [k] __slab_free
> > >    1.55%  [kernel]  [k] tick_nohz_idle_exit
> > >    1.47%  [kernel]  [k] arm_lpae_map
> > >    1.39%  [kernel]  [k] __fget
> > >    1.14%  [kernel]  [k] __lock_text_start
> > >    1.09%  [kernel]  [k] _raw_spin_lock
> > >    1.08%  [kernel]  [k] bio_release_pages.part.42
> > >    1.03%  [kernel]  [k] __sbitmap_get_word
> > >    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> > >    0.91%  [kernel]  [k] fput_many
> > >    0.88%  [kernel]  [k] __arm_lpae_map
> > >
> > > One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing when considering it should effectively be a noop.
> > >
> > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD SYNCs for small batches.
> > >
> > > Anyway, let me know your thoughts or any questions. I'll have a look if I get a chance for other possible bottlenecks.
> >
> > Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself.
>
> I'm only getting back to look at this now, as SMMU performance is a bit of a hot topic again for us.
>
> So one thing we are doing which looks to help performance is this series from Marc:
>
> https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
>
> So that is just spreading the per-CPU load for NVMe interrupt handling (where the DMA unmapping is happening), so I'd say just side-stepping any SMMU issue really.
>
> Going back to the SMMU, I wanted to run eBPF and perf annotate to help profile this, but was having no luck getting them to work properly. I'll look at this again now.

Could you also try with the upcoming ATS changes currently in Will's tree? They won't improve your numbers but it'd be good to check that they don't make things worse.

I've run a bunch of netperf instances on multiple cores and collected SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty consistently:

   - 6.07% arm_smmu_iotlb_sync
      - 5.74% arm_smmu_tlb_inv_range
           5.09% arm_smmu_cmdq_issue_cmdlist
           0.28% __pi_memset
           0.08% __pi_memcpy
           0.08% arm_smmu_atc_inv_domain.constprop.37
           0.07% arm_smmu_cmdq_build_cmd
           0.01% arm_smmu_cmdq_batch_add
        0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(), when ATS is not used. According to the annotations, the load from the atomic_read(), which checks whether the domain uses ATS, is 77% of the samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure there is much room for optimization there.

Thanks,
Jean
Re: arm-smmu-v3 high cpu usage for NVMe
Hi Will,

On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:

And for the overall system, we have:

   PerfTop: 85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
--

    27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
    11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
     6.35%  [kernel]  [k] _raw_spin_unlock_irq
     2.65%  [kernel]  [k] get_user_pages_fast
     2.03%  [kernel]  [k] __slab_free
     1.55%  [kernel]  [k] tick_nohz_idle_exit
     1.47%  [kernel]  [k] arm_lpae_map
     1.39%  [kernel]  [k] __fget
     1.14%  [kernel]  [k] __lock_text_start
     1.09%  [kernel]  [k] _raw_spin_lock
     1.08%  [kernel]  [k] bio_release_pages.part.42
     1.03%  [kernel]  [k] __sbitmap_get_word
     0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
     0.91%  [kernel]  [k] fput_many
     0.88%  [kernel]  [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing when considering it should effectively be a noop.

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look if I get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself.

I'm only getting back to look at this now, as SMMU performance is a bit of a hot topic again for us.

So one thing we are doing which looks to help performance is this series from Marc:

https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d

So that is just spreading the per-CPU load for NVMe interrupt handling (where the DMA unmapping is happening), so I'd say just side-stepping any SMMU issue really.

Going back to the SMMU, I wanted to run eBPF and perf annotate to help profile this, but was having no luck getting them to work properly. I'll look at this again now.

Cheers,
John
Re: arm-smmu-v3 high cpu usage for NVMe
Hi John,

On Thu, Jan 02, 2020 at 05:44:39PM +0000, John Garry wrote:
> And for the overall system, we have:
>
>   PerfTop: 85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434 drop: 0/40116 [4000Hz cycles], (all, 96 CPUs)
> --
>
>   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
>   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    6.35%  [kernel]  [k] _raw_spin_unlock_irq
>    2.65%  [kernel]  [k] get_user_pages_fast
>    2.03%  [kernel]  [k] __slab_free
>    1.55%  [kernel]  [k] tick_nohz_idle_exit
>    1.47%  [kernel]  [k] arm_lpae_map
>    1.39%  [kernel]  [k] __fget
>    1.14%  [kernel]  [k] __lock_text_start
>    1.09%  [kernel]  [k] _raw_spin_lock
>    1.08%  [kernel]  [k] bio_release_pages.part.42
>    1.03%  [kernel]  [k] __sbitmap_get_word
>    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
>    0.91%  [kernel]  [k] fput_many
>    0.88%  [kernel]  [k] __arm_lpae_map
>
> One thing to note is that we still spend an appreciable amount of time in arm_smmu_atc_inv_domain(), which is disappointing when considering it should effectively be a noop.
>
> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our batch size is 1, so we're not seeing the real benefit of the batching. I can't help but think that we could improve this code to try to combine CMD SYNCs for small batches.
>
> Anyway, let me know your thoughts or any questions. I'll have a look if I get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3 hardware any more, so I can't really dig into this myself.

Will