> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
>
> On 20/03/2020 10:41, John Garry wrote:
> > + Barry, Alexandru
>
>>>>>   PerfTop:  85864 irqs/sec  kernel:89.6%  exact: 0.0%  lost: 0/34434  drop: 0/40116  [4000Hz cycles],  (all, 96 CPUs)
>>>>> --------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>>     27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
>>>>>     11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
>>>>>      6.35%  [kernel]  [k] _raw_spin_unlock_irq
>>>>>      2.65%  [kernel]  [k] get_user_pages_fast
>>>>>      2.03%  [kernel]  [k] __slab_free
>>>>>      1.55%  [kernel]  [k] tick_nohz_idle_exit
>>>>>      1.47%  [kernel]  [k] arm_lpae_map
>>>>>      1.39%  [kernel]  [k] __fget
>>>>>      1.14%  [kernel]  [k] __lock_text_start
>>>>>      1.09%  [kernel]  [k] _raw_spin_lock
>>>>>      1.08%  [kernel]  [k] bio_release_pages.part.42
>>>>>      1.03%  [kernel]  [k] __sbitmap_get_word
>>>>>      0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
>>>>>      0.91%  [kernel]  [k] fput_many
>>>>>      0.88%  [kernel]  [k] __arm_lpae_map
>>>>>
>
> Hi Will, Robin,
>
> I'm just getting around to looking at this topic again. Here's the
> current picture for my NVMe test:
>
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead  Shared Object  Symbol
>   75.91%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
>    3.28%  [kernel]  [k] arm_smmu_tlb_inv_range
>    2.42%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.49
>    2.35%  [kernel]  [k] _raw_spin_unlock_irqrestore
>    1.32%  [kernel]  [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
>    1.20%  [kernel]  [k] aio_complete_rw
>    0.96%  [kernel]  [k] enqueue_task_fair
>    0.93%  [kernel]  [k] gic_handle_irq
>    0.86%  [kernel]  [k] _raw_spin_lock_irqsave
>    0.72%  [kernel]  [k] put_reqs_available
>    0.72%  [kernel]  [k] sbitmap_queue_clear
>
> * Only certain CPUs run the DMA unmap for my scenario, cpu0 being one
>   of them.
>
> My colleague Barry has similar findings for some other scenarios.
I wrote a test module and used the parameter "ways" to control how busy the
SMMU is, so that I could compare the latency under different degrees of
contention:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/dma-mapping.h>
#include <linux/jiffies.h>

static int ways = 16;
module_param(ways, int, S_IRUGO);

static int seconds = 120;
module_param(seconds, int, S_IRUGO);

/* platform helper which returns the ZIP accelerator device */
extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, &ways, "map_test-%d", i);
		if (IS_ERR(tsk))
			printk(KERN_ERR "create test thread failed\n");
	}

	return 0;
}

static void __exit test_exit(void)
{
	/* the test threads exit on their own after "seconds" elapse;
	 * only unload the module after that */
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

With ways=1, the SMMU is quite free with only one user, but
arm_smmu_cmdq_issue_cmdlist() still spends more than 60% of its time in
arm_smmu_cmdq_poll_until_sync(). It seems the SMMU reports the completion
of CMD_SYNC quite slowly.

When I increased "ways", the contention grew rapidly. With ways=16, more
than 40% of the time is spent on:

	cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)

and with ways=64, more than 60% of the time is spent there.

Here is a table of the whole dma_unmap latency, the time spent in
arm_smmu_cmdq_issue_cmdlist(), and the time spent waiting for CMD_SYNC,
for different values of "ways" (all in ns):

		whole unmap	arm_smmu_cmdq_issue_cmdlist()	wait CMD_SYNC
	ways=1	1956		1328				883
	ways=16	8891		7474				4000
	ways=32	22043		19519				6879
	ways=64	60842		55895				16746
	ways=96	101880		93649				24429

As you can see, even with ways=1 we still need ~2us to unmap:
arm_smmu_cmdq_issue_cmdlist() takes 60% of the dma_unmap time, and
CMD_SYNC takes more than 60% of the arm_smmu_cmdq_issue_cmdlist() time.
And when the SMMU is very busy, dma_unmap latency can become very large
due to contention, more than 100us.
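The shape of this contention is easy to reproduce outside the kernel. The
following is a hypothetical user-space sketch (names like "contend.c" and
"queue_val" are mine, not from the driver): a C11 atomic counter stands in
for cmdq->q.llq.val and the thread count stands in for "ways". The failure
count grows sharply with the thread count, matching the trend in the table
above.

/*
 * contend.c: user-space sketch of the prod-update retry loop.
 * Hypothetical demo code, not the driver.
 *
 * Build: gcc -O2 -pthread contend.c -o contend
 * Run:   ./contend 16
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000000

static _Atomic unsigned long queue_val;	/* stands in for cmdq->q.llq.val */
static _Atomic unsigned long failures;

static void *worker(void *unused)
{
	unsigned long fails = 0;

	for (int i = 0; i < ITERS; i++) {
		unsigned long old = atomic_load_explicit(&queue_val,
							 memory_order_relaxed);

		/*
		 * Retry until we win the race to advance the shared value;
		 * every failed attempt is another trip around the loop plus
		 * a cacheline bounce. (The weak form may also fail
		 * spuriously; close enough for a demo.)
		 */
		while (!atomic_compare_exchange_weak_explicit(&queue_val,
				&old, old + 1,
				memory_order_relaxed, memory_order_relaxed))
			fails++;
	}

	atomic_fetch_add(&failures, fails);
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 16;
	pthread_t tids[nthreads];

	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	printf("%d threads: %lu failed cmpxchg attempts for %lu successes\n",
	       nthreads, atomic_load(&failures),
	       (unsigned long)nthreads * ITERS);
	return 0;
}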
Thanks
Barry

> So we tried the latest perf NMI support wip patches, and noticed a few
> hotspots (see
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate
> and
> https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt)
> when running some NVMe traffic:
>
> - the initial cmpxchg to get a place in the queue
>   - when more CPUs get involved, we start failing at an exponential rate:
>
>      0.00 :  ffff8000107a3500:  cas  x4, x2, [x27]
>     26.52 :  ffff8000107a3504:  mov  x0, x4
>
>   in arm_smmu_cmdq_issue_cmdlist()
>
> - the queue locking
> - polling cmd_sync
>
> Some ideas to optimise:
>
> a. initial cmpxchg
>
> So this cmpxchg could be considered unfair. In addition, with all the
> contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
> around the system.
>
> Maybe we can implement something similar to the idea of queued/ticketed
> spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the
> initial cmpxchg fails; it is released by its leader, and in turn releases
> subsequent followers (see the sketch at the end of this mail).
>
> b. Drop the queue-full checking in certain circumstances
>
> If we cannot theoretically fill the queue, then stop checking for the
> queue being full, or similar. This should also help with the problem in
> a.: the less time spent between cmpxchg attempts, the lower the chance of
> failing (as we check queue available space between cmpxchg attempts).
>
> So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we always
> issue a cmd_sync for a batch (regardless of whether requested), then we
> should never fill the queue (I think).
>
> c. Don't do queue locking in certain circumstances
>
> If we implement (and support) b. and support MSI polling, then I don't
> think that this is required.
>
> d. More minor ideas: move forward when the "owner" stops gathering, to
> reduce the time spent advancing prod and hopefully the cmd_sync polling
> time; and use a smaller word size for the valid-bitmap operations, since
> 32-bit atomic operations may be overall more efficient than 64-bit ones
> (the valid range to check is mostly < 16 bits from my observation).
>
> Let me know your thoughts or any other ideas.
>
> Thanks,
> John
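For reference, the "spin on your own copy, released by the leader" handover
in idea a. is essentially the shape of the classic MCS queued lock (the
kernel's queued spinlocks build on the same idea). Below is a minimal,
hypothetical user-space sketch with C11 atomics; the real thing would use
the kernel's atomics and the actual arm_smmu_cmdq.q fields, so this only
shows the structure being suggested, not a proposal for the driver.

/*
 * mcs.c: minimal MCS-style queued lock, the shape of idea a. above.
 *
 * Usage:
 *	struct mcs_node node;
 *	mcs_lock(&node);
 *	... critical section, e.g. advance the cmdq prod ...
 *	mcs_unlock(&node);
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;
};

static _Atomic(struct mcs_node *) tail;

static void mcs_lock(struct mcs_node *node)
{
	struct mcs_node *prev;

	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, true, memory_order_relaxed);

	/* The only contended atomic: a single swap to join the queue. */
	prev = atomic_exchange_explicit(&tail, node, memory_order_acq_rel);
	if (!prev)
		return;	/* queue was empty: we are the leader */

	/* Link in behind our predecessor... */
	atomic_store_explicit(&prev->next, node, memory_order_release);

	/* ...then spin only on our own copy, released by our leader. */
	while (atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

static void mcs_unlock(struct mcs_node *node)
{
	struct mcs_node *next =
		atomic_load_explicit(&node->next, memory_order_acquire);

	if (!next) {
		struct mcs_node *expected = node;

		/* No successor visible: try to mark the queue empty. */
		if (atomic_compare_exchange_strong_explicit(&tail, &expected,
				NULL, memory_order_acq_rel,
				memory_order_acquire))
			return;

		/* A successor is mid-enqueue; wait for it to link in. */
		while (!(next = atomic_load_explicit(&node->next,
						     memory_order_acquire)))
			;
	}

	/* Release the follower: hand the cacheline straight to it. */
	atomic_store_explicit(&next->locked, false, memory_order_release);
}

The point of the structure is that the only contended atomic is the single
exchange on the shared tail; after that, each waiter spins on a cacheline
it owns, and unlock hands that line directly to the next waiter instead of
letting every CPU ping it around, which is exactly the unfairness and
cacheline bouncing described in a.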