RE: arm-smmu-v3 high cpu usage for NVMe

2020-05-24 Thread Song Bao Hua (Barry Song)
> Subject: Re: arm-smmu-v3 high cpu usage for NVMe
> 
> On 20/03/2020 10:41, John Garry wrote:
> 
> + Barry, Alexandru
> 
> >>>>>     PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
> >>>>>                0/40116 [4000Hz cycles],  (all, 96 CPUs)
> >>>>> --
> >>>>>
> >>>>>   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> >>>>>   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> >>>>>    6.35%  [kernel]  [k] _raw_spin_unlock_irq
> >>>>>    2.65%  [kernel]  [k] get_user_pages_fast
> >>>>>    2.03%  [kernel]  [k] __slab_free
> >>>>>    1.55%  [kernel]  [k] tick_nohz_idle_exit
> >>>>>    1.47%  [kernel]  [k] arm_lpae_map
> >>>>>    1.39%  [kernel]  [k] __fget
> >>>>>    1.14%  [kernel]  [k] __lock_text_start
> >>>>>    1.09%  [kernel]  [k] _raw_spin_lock
> >>>>>    1.08%  [kernel]  [k] bio_release_pages.part.42
> >>>>>    1.03%  [kernel]  [k] __sbitmap_get_word
> >>>>>    0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> >>>>>    0.91%  [kernel]  [k] fput_many
> >>>>>    0.88%  [kernel]  [k] __arm_lpae_map
> >>>>>
> 
> Hi Will, Robin,
> 
> I'm just getting around to looking at this topic again. Here's the current
> picture for my NVMe test:
> 
> perf top -C 0 *
> Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
> Overhead Shared Object Symbol
> 75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
> 3.28% [kernel] [k] arm_smmu_tlb_inv_range
> 2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
> 2.35% [kernel] [k] _raw_spin_unlock_irqrestore
> 1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
> 1.20% [kernel] [k] aio_complete_rw
> 0.96% [kernel] [k] enqueue_task_fair
> 0.93% [kernel] [k] gic_handle_irq
> 0.86% [kernel] [k] _raw_spin_lock_irqsave
> 0.72% [kernel] [k] put_reqs_available
> 0.72% [kernel] [k] sbitmap_queue_clear
> 
> * only certain CPUs run the dma unmap for my scenario, cpu0 being one of
> them.
> 
> Colleague Barry has similar findings for some other scenarios.

I wrote a test module and used the parameter "ways" to simulate how busy the
SMMU is, and compared the latency under different degrees of contention:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/jiffies.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>

static int ways = 16;
module_param(ways, int, S_IRUGO);

static int seconds = 120;
module_param(seconds, int, S_IRUGO);

extern struct device *get_zip_dev(void);

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	dma_addr_t dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);

	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
}

static noinline void test_memcpy(void *out, void *in, int size)
{
	memcpy(out, in, size);
}

static int testthread(void *data)
{
	unsigned long stop = jiffies + seconds * HZ;
	struct device *dev = get_zip_dev();
	char *input, *output;

	input = kzalloc(4096, GFP_KERNEL);
	if (!input)
		return -ENOMEM;

	output = kzalloc(4096, GFP_KERNEL);
	if (!output) {
		kfree(input);
		return -ENOMEM;
	}

	while (time_before(jiffies, stop)) {
		test_mapsingle(dev, input, 4096);
		test_memcpy(output, input, 4096);
	}

	kfree(output);
	kfree(input);

	return 0;
}

static int __init test_init(void)
{
	struct task_struct *tsk;
	int i;

	/* spawn "ways" threads hammering dma_map/unmap concurrently */
	for (i = 0; i < ways; i++) {
		tsk = kthread_run(testthread, NULL, "map_test-%d", i);
		if (IS_ERR(tsk))
			pr_err("create test thread failed\n");
	}

	return 0;
}

As ways is increased, a growing share of the time in
arm_smmu_cmdq_issue_cmdlist() is spent spinning on:
cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val)
When ways=64, more than 60% of the time is spent there.

Here is a table for dma_unmap, arm_smmu_cmdq_issue_cmdlist() and CMD_SYNC with
different ways:

         whole unmap(ns)  arm_smmu_cmdq_issue_cmdlist()(ns)  wait CMD_SYNC(ns)
Ways=1        1956                     1328                        883
Ways=16       8891                     7474                       4000
Ways=32      22043                    19519
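The module as posted does not show how the per-call latency was collected. One
way to collect it (a sketch only, not the actual instrumentation behind the
table above; the counters here are made up) is to wrap the map/unmap calls
with ktime:

#include <linux/ktime.h>

/* rough totals; not synchronised across the test threads */
static u64 map_ns, unmap_ns, calls;

static noinline void test_mapsingle(struct device *dev, void *buf, int size)
{
	ktime_t t0, t1, t2;
	dma_addr_t dma_addr;

	t0 = ktime_get();
	dma_addr = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
	t1 = ktime_get();
	dma_unmap_single(dev, dma_addr, size, DMA_TO_DEVICE);
	t2 = ktime_get();

	map_ns += ktime_to_ns(ktime_sub(t1, t0));
	unmap_ns += ktime_to_ns(ktime_sub(t2, t1));
	calls++;
}

Dividing unmap_ns by calls at module exit would give the kind of per-unmap
average shown in the table.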

Re: arm-smmu-v3 high cpu usage for NVMe

2020-05-22 Thread John Garry

On 20/03/2020 10:41, John Garry wrote:

+ Barry, Alexandru

    PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
               0/40116 [4000Hz cycles],  (all, 96 CPUs)
--



  27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
  11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.35%  [kernel]  [k] _raw_spin_unlock_irq
   2.65%  [kernel]  [k] get_user_pages_fast
   2.03%  [kernel]  [k] __slab_free
   1.55%  [kernel]  [k] tick_nohz_idle_exit
   1.47%  [kernel]  [k] arm_lpae_map
   1.39%  [kernel]  [k] __fget
   1.14%  [kernel]  [k] __lock_text_start
   1.09%  [kernel]  [k] _raw_spin_lock
   1.08%  [kernel]  [k] bio_release_pages.part.42
   1.03%  [kernel]  [k] __sbitmap_get_word
   0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
   0.91%  [kernel]  [k] fput_many
   0.88%  [kernel]  [k] __arm_lpae_map



Hi Will, Robin,

I'm just getting around to looking at this topic again. Here's the current
picture for my NVMe test:


perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
Overhead Shared Object Symbol
75.91% [kernel] [k] arm_smmu_cmdq_issue_cmdlist
3.28% [kernel] [k] arm_smmu_tlb_inv_range
2.42% [kernel] [k] arm_smmu_atc_inv_domain.constprop.49
2.35% [kernel] [k] _raw_spin_unlock_irqrestore
1.32% [kernel] [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
1.20% [kernel] [k] aio_complete_rw
0.96% [kernel] [k] enqueue_task_fair
0.93% [kernel] [k] gic_handle_irq
0.86% [kernel] [k] _raw_spin_lock_irqsave
0.72% [kernel] [k] put_reqs_available
0.72% [kernel] [k] sbitmap_queue_clear

* only certain CPUs run the dma unmap for my scenario, cpu0 being one of 
them.


Colleague Barry has similar findings for some other scenarios.

So we tried the latest perf NMI support wip patches, and noticed a few 
hotspots (see 
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate 
and 
https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt) 
when running some NVMe traffic:


- initial cmpxchg to get a place in the queue
  - when more CPUs get involved, we start failing at an exponential rate
    (a paraphrased sketch of this allocation loop follows this list)

     0.00 :   8000107a3500:   cas     x4, x2, [x27]
    26.52 :   8000107a3504:   mov     x0, x4
          :   arm_smmu_cmdq_issue_cmdlist():

- the queue locking
- polling cmd_sync
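For reference, the allocation step in question is the lock-free prod update at
the top of arm_smmu_cmdq_issue_cmdlist(). Roughly (paraphrased from the
v5.6-era driver; helper names and details may differ between kernel versions):

	/* 1. Allocate some space in the queue */
	llq.val = READ_ONCE(cmdq->q.llq.val);
	do {
		u64 old;

		/* wait for space for n commands plus an optional CMD_SYNC */
		while (!queue_has_space(&llq, n + sync))
			arm_smmu_cmdq_poll_until_not_full(smmu, &llq);

		head.cons = llq.cons;
		head.prod = queue_inc_prod_n(&llq, n + sync) |
			    CMDQ_PROD_OWNED_FLAG;

		/* this is the CAS (cas x4, x2, [x27]) showing up above */
		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
		if (old == llq.val)
			break;

		llq.val = old;
	} while (1);

Every failing CPU rereads cmdq->q.llq.val and retries, which is what keeps
that cacheline bouncing under load.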

Some ideas to optimise:

a. initial cmpxchg
So this cmpxchg could be considered unfair. In addition, with all the
contention on arm_smmu_cmdq.q, that cacheline would be constantly pinged
around the system.
Maybe we can implement something similar to the idea of queued/ticketed
spinlocks, making a CPU spin on its own copy of arm_smmu_cmdq.q after the
initial cmpxchg fails, released by its leader, and in turn releasing
subsequent followers (see the sketch below).
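A rough sketch of the shape of that idea, in the style of an MCS/queued lock
(none of these names exist in the driver; the real change would have to fold
the prod update into the handoff, and this only illustrates the
spin-on-own-cacheline part):

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/percpu.h>
#include <linux/processor.h>

struct cmdq_alloc_node {
	struct cmdq_alloc_node *next;
	int locked;			/* set by the predecessor on handoff */
};

static DEFINE_PER_CPU(struct cmdq_alloc_node, cmdq_alloc_nodes);

/* called with interrupts disabled, as in the issue path */
static void cmdq_alloc_lock(struct cmdq_alloc_node **tail)
{
	struct cmdq_alloc_node *node = this_cpu_ptr(&cmdq_alloc_nodes);
	struct cmdq_alloc_node *prev;

	node->next = NULL;
	node->locked = 0;

	/* join the queue: one atomic on the shared tail, then spin locally */
	prev = xchg(tail, node);
	if (prev) {
		WRITE_ONCE(prev->next, node);
		smp_cond_load_acquire(&node->locked, VAL);
	}
}

static void cmdq_alloc_unlock(struct cmdq_alloc_node **tail)
{
	struct cmdq_alloc_node *node = this_cpu_ptr(&cmdq_alloc_nodes);
	struct cmdq_alloc_node *next = READ_ONCE(node->next);

	if (!next) {
		/* no successor visible: try to mark the queue empty */
		if (cmpxchg_release(tail, node, NULL) == node)
			return;
		/* a successor is midway through queueing; wait for it */
		while (!(next = READ_ONCE(node->next)))
			cpu_relax();
	}
	/* release the follower, which releases its own follower in turn */
	smp_store_release(&next->locked, 1);
}

Each waiter spins on its own node->locked rather than on cmdq->q.llq.val, so
the contended cacheline stops ping-ponging between all the CPUs.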


b. Drop the queue_full checking in certain circumstances
If we cannot theoretically fill the queue, then stop checking for queue full
or similar. This should also help with the problem in a., since the less time
spent between cmpxchg attempts, the lower the chance of failing (we check the
available queue space between cmpxchg attempts).


So if cmdq depth > nr_available_cpus * (max batch size + 1) AND we 
always issue a cmd_sync for a batch (regardless of whether requested), 
then we should never fill (I think).
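To put numbers on that condition (example figures only: 96 CPUs as in the
PerfTop header above, a 64-command batch plus one CMD_SYNC; the real batch
limit and queue depth depend on the platform and kernel):

	/*
	 * Worst case commands in flight:
	 *   96 CPUs * (64 + 1) = 6240 entries,
	 * so any command queue deeper than that could skip the
	 * queue-full check under this scheme.
	 */
	static inline bool cmdq_can_never_fill(unsigned int q_entries,
					       unsigned int nr_cpus,
					       unsigned int max_batch)
	{
		return q_entries > nr_cpus * (max_batch + 1);
	}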


c. Don't do queue locking in certain circumstances
If we implement (and support) b. and support MSI polling, then I don't 
think that this is required.


d. More minor ideas are to move forward when the "owner" stops gathering, to
reduce the time spent advancing the prod and hopefully the cmd_sync polling
time; and to use a smaller word size for the valid bitmap operations - maybe
32b atomic operations are overall more efficient than 64b, since the valid
range checked is mostly < 16 bits from my observation.


Let me know your thoughts or any other ideas.

Thanks,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-04-06 Thread John Garry

On 02/04/2020 13:10, John Garry wrote:

On 18/03/2020 20:53, Will Deacon wrote:
As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
batch size is 1, so we're not seeing the real benefit of the batching. I
can't help but think that we could improve this code to try to combine CMD
SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look if I
get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.





Hi Will,

JFYI, I added some debug in arm_smmu_cmdq_issue_cmdlist() to get some 
idea of what is going on. Perf annotate did not tell much.


I tested NVMe performance with and without Marc's patchset to spread 
LPIs for managed interrupts.


Average duration of arm_smmu_cmdq_issue_cmdlist() on mainline [all results
are approximations]:

owner: 6ms
non-owner: 4ms

mainline + LPI spreading patchset:
owner: 25ms
non-owner: 22ms

For this, a list would be a TLBI + CMD_SYNC.

Please note that the LPI spreading patchset is still giving circa 25% 
NVMe throughput increase. What happens there would be that we get many 
more cpus involved, which creates more inter-cpu contention. But the 
performance increase comes from just alleviating pressure on those 
overloaded cpus.


I also notice that with the LPI spreading patchset, on average a cpu is an
"owner" in arm_smmu_cmdq_issue_cmdlist() 1 time in 8, as opposed to 1 in 3
for mainline. This means that we're just creating longer chains of lists to
be published.


But I found that for a non-owner, the average MSI cmd_sync polling time is
12ms with the LPI spreading patchset. As such, it really seems to be taking
approx (12*2/8-1=) ~3ms to consume a single list. This seems consistent with
my finding that an owner polls consumption for 3ms also. Without the LPI
spreading patchset, polling time is approx 2 and 3ms for owner and non-owner,
respectively.


As an experiment, I did try to hack the code to use a spinlock again for
protecting the command queue, instead of the current solution - and always
saw a performance drop there. To be expected. But maybe we can try not to use
a spinlock, yet still serialise production+consumption to alleviate the long
polling periods.


Let me know your thoughts.

Cheers,
John



Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-25 Thread John Garry




FWIW I believe it is still on the plan for someone here to dust off
the PMU pNMI patches at some point.


Cool. Well I can try to experiment with what Julien had at v4 for now.



JFYI, I have done some more perf record capturing, and updated the 
"annotate" and "report" output here 
https://raw.githubusercontent.com/hisilicon/kernel-dev/679eca1008b1d11b42e1b5fa8a205266c240d1e1/ann.txt 
and .../report


This capture is just for cpu0, since NVMe irq handling+dma unmapping 
will occur on specific CPUs, cpu0 being one of them.


The reports look somewhat sane. So we no longer have ~99% of time 
attributed to re-enabling interrupts, now that's like:



    3.14 :   80001071eae0:   ldr     w0, [x29, #108]
         :                  int ret = 0;
    0.00 :   80001071eae4:   mov     w24, #0x0                // #0
         :                  if (sync) {
    0.00 :   80001071eae8:   cbnz    w0, 80001071eb44
         :                  arch_local_irq_restore():
         :                  asm volatile(ALTERNATIVE(
    0.00 :   80001071eaec:   msr     daif, x21
         :                  arch_static_branch():
    0.25 :   80001071eaf0:   nop
         :                  arm_smmu_cmdq_issue_cmdlist():
         :                  }
         :                  }
         :
         :                  local_irq_restore(flags);
         :                  return ret;
         :                  }
One observation (if these reports are to be believed) is that we may 
spend a lot of time in the CAS loop, trying to get a place in the queue 
initially:


         :                  __CMPXCHG_CASE(w,  , , 32,   )
         :                  __CMPXCHG_CASE(x,  , , 64,   )
    0.00 :   80001071e828:   mov     x0, x27
    0.00 :   80001071e82c:   mov     x4, x1
    0.00 :   80001071e830:   cas     x4, x2, [x27]
   28.61 :   80001071e834:   mov     x0, x4
         :                  arm_smmu_cmdq_issue_cmdlist():
         :                  if (old == llq.val)
    0.00 :   80001071e838:   ldr     x1, [x23]

John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-24 Thread John Garry

On 24/03/2020 12:07, Robin Murphy wrote:

On 2020-03-24 11:55 am, John Garry wrote:

On 24/03/2020 10:43, Marc Zyngier wrote:

On Tue, 24 Mar 2020 09:18:10 +
John Garry  wrote:


On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,


Time to enable pseudo-NMIs in the PMUv3 driver...


Do you know if there is any plan for this?
There was. Julien Thierry has a bunch of patches for that [1], but they need
reviving.


So those patches still apply cleanly (apart from the kvm patch, which
I can skip, I suppose) and build, so I can try this I figure. Is
there anything else which I should ensure or know about? Apart from
enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
has it set to 0, preventing me from being able to use the feature
(hint, nudge...;-).


Yeah, apparently it's set on our D06CS board, but I just need to 
double check the FW version with our FW guy.


Hopefully you saw the help for CONFIG_ARM64_PSEUDO_NMI already, but
since it's not been called out:


   This high priority configuration for interrupts needs to be
   explicitly enabled by setting the kernel parameter
   "irqchip.gicv3_pseudo_nmi" to 1.


Yeah, I saw that by chance somewhere else previously.



FWIW I believe it is still on the plan for someone here to dust off the
PMU pNMI patches at some point.


Cool. Well I can try to experiment with what Julien had at v4 for now.

Cheers,
John

Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-24 Thread Robin Murphy

On 2020-03-24 11:55 am, John Garry wrote:

On 24/03/2020 10:43, Marc Zyngier wrote:

On Tue, 24 Mar 2020 09:18:10 +
John Garry  wrote:


On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,


Time to enable pseudo-NMIs in the PMUv3 driver...


Do you know if there is any plan for this?
There was. Julien Thierry has a bunch of patches for that [1], but they need
reviving.


So those patches still apply cleanly (apart from the kvm patch, which
I can skip, I suppose) and build, so I can try this I figure. Is
there anything else which I should ensure or know about? Apart from
enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
has it set to 0, preventing me from being able to use the feature
(hint, nudge...;-).


Yeah, apparently it's set on our D06CS board, but I just need to double 
check the FW version with our FW guy.


Hopefully you saw the help for CONFIG_ARM64_PSEUDO_NMI already, but
since it's not been called out:


  This high priority configuration for interrupts needs to be
  explicitly enabled by setting the kernel parameter
  "irqchip.gicv3_pseudo_nmi" to 1.

FWIW I believe it is still on the plan for someone here to dust off the
PMU pNMI patches at some point.


Robin.

Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-24 Thread John Garry

On 24/03/2020 10:43, Marc Zyngier wrote:

On Tue, 24 Mar 2020 09:18:10 +
John Garry  wrote:


On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,


Time to enable pseudo-NMIs in the PMUv3 driver...


Do you know if there is any plan for this?

There was. Julien Thierry has a bunch of patches for that [1], but they need
reviving.


So those patches still apply cleanly (apart from the kvm patch, which
I can skip, I suppose) and build, so I can try this I figure. Is
there anything else which I should ensure or know about? Apart from
enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
has it set to 0, preventing me from being able to use the feature
(hint, nudge...;-).


Yeah, apparently it's set on our D06CS board, but I just need to double 
check the FW version with our FW guy.


As for D05, there has not been a FW update there in quite a long time 
and no plans for it. Sorry.


Cheers,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-24 Thread Marc Zyngier
On Tue, 24 Mar 2020 09:18:10 +
John Garry  wrote:

> On 23/03/2020 09:16, Marc Zyngier wrote:
> 
> + Julien, Mark
> 
> Hi Marc,
> 
> >>> Time to enable pseudo-NMIs in the PMUv3 driver...
> >>>
> >>
> >> Do you know if there is any plan for this?
> > 
> > There was. Julien Thierry has a bunch of patches for that [1], but they
> > need reviving.
> > 
> 
> So those patches still apply cleanly (apart from the kvm patch, which
> I can skip, I suppose) and build, so I can try this I figure. Is
> there anything else which I should ensure or know about? Apart from
> enabling CONFIG_ARM64_PSEUDO_NMI.

You need to make sure that your firmware sets SCR_EL3.FIQ to 1. My D05
has it set to 0, preventing me from being able to use the feature
(hint, nudge... ;-).

M.
-- 
Jazz is not dead. It just smells funny...


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-24 Thread John Garry

On 23/03/2020 09:16, Marc Zyngier wrote:

+ Julien, Mark

Hi Marc,


Time to enable pseudo-NMIs in the PMUv3 driver...



Do you know if there is any plan for this?


There was. Julien Thierry has a bunch of patches for that [1], but they need
reviving.



So those patches still apply cleanly (apart from the kvm patch, which I 
can skip, I suppose) and build, so I can try this I figure. Is there 
anything else which I should ensure or know about? Apart from enabling
CONFIG_ARM64_PSEUDO_NMI.


A quickly taken perf annotate and report is at the tip here: 
https://github.com/hisilicon/kernel-dev/commits/private-topic-nvme-5.6-profiling 





In the meantime, maybe I can do some trickery by putting the
local_irq_restore() in a separate function, outside
arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same
function.




Scratch that :)


I don't see how you can improve the profiling without compromising
the locking in this case...



Cheers,
John

[1] https://patchwork.kernel.org/cover/11047407/


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-23 Thread Marc Zyngier

On 2020-03-23 09:03, John Garry wrote:

On 20/03/2020 16:33, Marc Zyngier wrote:

JFYI, I've been playing with "perf annotate" today and it's giving
strange results for my NVMe testing. So "report" looks somewhat sane,
if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():


    55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
     9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
     1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
     1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
     1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
     1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction
consumes ~99% of the load for the enqueue function:

 :  /* 5. If we are inserting a CMD_SYNC,
we must wait for it to complete */
 :  if (sync) {
    0.00 :   80001071c948:   ldr w0, [x29, #108]
 :  int ret = 0;
    0.00 :   80001071c94c:   mov w24, #0x0  // #0
 :  if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990

 :  arch_local_irq_restore():
    0.00 :   80001071c954:   msr daif, x21
 :  arm_smmu_cmdq_issue_cmdlist():
 :  }
 :  }
 :
 :  local_irq_restore(flags);
 :  return ret;
 :  }
   99.51 :   80001071c958:   adrp    x0, 800011909000





Hi Marc,

This is likely the side effect of the re-enabling of interrupts (msr 
daif, x21)
on the previous instruction which causes the perf interrupt to fire 
right after.


ok, makes sense.



Time to enable pseudo-NMIs in the PMUv3 driver...



Do you know if there is any plan for this?


There was. Julien Thierry has a bunch of patches for that [1], but they 
needs

reviving.



In the meantime, maybe I can do some trickery by putting the
local_irq_restore() in a separate function, outside
arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same
function.


I don't see how you can improve the profiling without compromising
the locking in this case...

Thanks,

M.

[1] https://patchwork.kernel.org/cover/11047407/
--
Jazz is not dead. It just smells funny...

Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-23 Thread John Garry

On 20/03/2020 16:33, Marc Zyngier wrote:

JFYI, I've been playing with "perf annotate" today and it's giving
strange results for my NVMe testing. So "report" looks somewhat sane,
if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():


    55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
     9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
     1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
     1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
     1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
     1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction
consumes ~99% of the load for the enqueue function:

 :  /* 5. If we are inserting a CMD_SYNC,
we must wait for it to complete */
 :  if (sync) {
    0.00 :   80001071c948:   ldr w0, [x29, #108]
 :  int ret = 0;
    0.00 :   80001071c94c:   mov w24, #0x0  // #0
 :  if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990

 :  arch_local_irq_restore():
    0.00 :   80001071c954:   msr daif, x21
 :  arm_smmu_cmdq_issue_cmdlist():
 :  }
 :  }
 :
 :  local_irq_restore(flags);
 :  return ret;
 :  }
   99.51 :   80001071c958:   adrp    x0, 800011909000





Hi Marc,

This is likely the side effect of the re-enabling of interrupts (msr 
daif, x21)
on the previous instruction which causes the perf interrupt to fire 
right after.


ok, makes sense.



Time to enable pseudo-NMIs in the PMUv3 driver...



Do you know if there is any plan for this?

In the meantime, maybe I can do some trickery by putting the 
local_irq_restore() in a separate function, outside 
arm_smmu_cmdq_issue_cmdlist(), to get a fair profile for that same function.
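For what it's worth, the "trickery" could be as small as a noinline helper, so
that the samples which skid past the msr daif land on a separate symbol rather
than being attributed to arm_smmu_cmdq_issue_cmdlist() itself (a sketch only;
this helper does not exist in the driver):

static noinline void arm_smmu_cmdq_irq_restore(unsigned long flags)
{
	local_irq_restore(flags);
}

The issue path would then call arm_smmu_cmdq_irq_restore(flags) instead of
local_irq_restore(flags) at the end of the function.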


Cheers,
John

Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-20 Thread Marc Zyngier

Hi John,

On 2020-03-20 16:20, John Garry wrote:




I've run a bunch of netperf instances on multiple cores and collected
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.

- 6.07% arm_smmu_iotlb_sync
 - 5.74% arm_smmu_tlb_inv_range
  5.09% arm_smmu_cmdq_issue_cmdlist
  0.28% __pi_memset
  0.08% __pi_memcpy
  0.08% arm_smmu_atc_inv_domain.constprop.37
  0.07% arm_smmu_cmdq_build_cmd
  0.01% arm_smmu_cmdq_batch_add
   0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
when ATS is not used. According to the annotations, the load from the
atomic_read(), that checks whether the domain uses ATS, is 77% of the
samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
there is much room for optimization there.


Well I did originally suggest using RCU protection to scan the list of
devices, instead of reading an atomic and checking for a non-zero value. But
that would be an optimisation for ATS also, and there were no ATS devices at
the time (to verify performance).


Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
performance, but I agree that using RCU to scan the list should get better
results when using ATS.

When ATS isn't in use however, I suspect reading nr_ats_masters should be
more efficient than taking the RCU lock + reading an "ats_devices" list
(since the smmu_domain->devices list also serves context descriptor
invalidation, even when ATS isn't in use). I'll run some tests however, to
see if I can micro-optimize this case, but I don't expect noticeable
improvements.


ok, cheers. I, too, would not expect a significant improvement there.

JFYI, I've been playing with "perf annotate" today and it's giving
strange results for my NVMe testing. So "report" looks somewhat sane,
if not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():


 55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
  9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
  1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
  1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
  1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
  1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction
consumes ~99% of the load for the enqueue function:

         :              /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
         :              if (sync) {
    0.00 :   80001071c948:   ldr     w0, [x29, #108]
         :              int ret = 0;
    0.00 :   80001071c94c:   mov     w24, #0x0                // #0
         :              if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990
         :              arch_local_irq_restore():
    0.00 :   80001071c954:   msr     daif, x21
         :              arm_smmu_cmdq_issue_cmdlist():
         :              }
         :              }
         :
         :              local_irq_restore(flags);
         :              return ret;
         :              }
   99.51 :   80001071c958:   adrp    x0, 800011909000



This is likely the side effect of the re-enabling of interrupts (msr 
daif, x21)
on the previous instruction which causes the perf interrupt to fire 
right after.


Time to enable pseudo-NMIs in the PMUv3 driver...

 M.
--
Jazz is not dead. It just smells funny...


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-20 Thread John Garry







I've run a bunch of netperf instances on multiple cores and collecting
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.

- 6.07% arm_smmu_iotlb_sync
 - 5.74% arm_smmu_tlb_inv_range
  5.09% arm_smmu_cmdq_issue_cmdlist
  0.28% __pi_memset
  0.08% __pi_memcpy
  0.08% arm_smmu_atc_inv_domain.constprop.37
  0.07% arm_smmu_cmdq_build_cmd
  0.01% arm_smmu_cmdq_batch_add
   0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
when ATS is not used. According to the annotations, the load from the
atomic_read(), that checks whether the domain uses ATS, is 77% of the
samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
there is much room for optimization there.


Well I did originally suggest using RCU protection to scan the list of
devices, instead of reading an atomic and checking for a non-zero value. But
that would be an optimisation for ATS also, and there were no ATS devices at
the time (to verify performance).


Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
performance, but I agree that using RCU to scan the list should get better
results when using ATS.

When ATS isn't in use however, I suspect reading nr_ats_masters should be
more efficient than taking the RCU lock + reading an "ats_devices" list
(since the smmu_domain->devices list also serves context descriptor
invalidation, even when ATS isn't in use). I'll run some tests however, to
see if I can micro-optimize this case, but I don't expect noticeable
improvements.


ok, cheers. I, too, would not expect a significant improvement there.

JFYI, I've been playing with "perf annotate" today and it's giving
strange results for my NVMe testing. So "report" looks somewhat sane, if 
not a worryingly high % for arm_smmu_cmdq_issue_cmdlist():



 55.39%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_cmdq_issue_cmdlist
  9.74%  irq/342-nvme0q1  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
  2.02%  irq/342-nvme0q1  [kernel.kallsyms]  [k] nvme_irq
  1.86%  irq/342-nvme0q1  [kernel.kallsyms]  [k] fput_many
  1.73%  irq/342-nvme0q1  [kernel.kallsyms]  [k] arm_smmu_atc_inv_domain.constprop.42
  1.67%  irq/342-nvme0q1  [kernel.kallsyms]  [k] __arm_lpae_unmap
  1.49%  irq/342-nvme0q1  [kernel.kallsyms]  [k] aio_complete_rw

But "annotate" consistently tells me that a specific instruction 
consumes ~99% of the load for the enqueue function:


         :              /* 5. If we are inserting a CMD_SYNC, we must wait for it to complete */
         :              if (sync) {
    0.00 :   80001071c948:   ldr     w0, [x29, #108]
         :              int ret = 0;
    0.00 :   80001071c94c:   mov     w24, #0x0                // #0
         :              if (sync) {
    0.00 :   80001071c950:   cbnz    w0, 80001071c990
         :              arch_local_irq_restore():
    0.00 :   80001071c954:   msr     daif, x21
         :              arm_smmu_cmdq_issue_cmdlist():
         :              }
         :              }
         :
         :              local_irq_restore(flags);
         :              return ret;
         :              }
   99.51 :   80001071c958:   adrp    x0, 800011909000
    0.00 :   80001071c95c:   add     x21, x0, #0x908
    0.02 :   80001071c960:   ldr     x2, [x29, #488]
    0.14 :   80001071c964:   ldr     x1, [x21]
    0.00 :   80001071c968:   eor     x1, x2, x1
    0.00 :   80001071c96c:   mov     w0, w24


But there may be a hint that we're consuming a lot of time in polling for
CMD_SYNC consumption.


The files are available here:

https://raw.githubusercontent.com/hisilicon/kernel-dev/private-topic-nvme-5.6-profiling/ann.txt, 
report


Or maybe I'm just not using the tool properly ...

Cheers,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-20 Thread Jean-Philippe Brucker
On Fri, Mar 20, 2020 at 10:41:44AM +, John Garry wrote:
> On 19/03/2020 18:43, Jean-Philippe Brucker wrote:
> > On Thu, Mar 19, 2020 at 12:54:59PM +, John Garry wrote:
> > > Hi Will,
> > > 
> > > > 
> > > > On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote:
> > > > > And for the overall system, we have:
> > > > > 
> > > > > PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 
> > > > > 0/34434 drop:
> > > > > 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> > > > > --
> > > > > 
> > > > >   27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> > > > >   11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > > > >6.35%  [kernel]  [k] _raw_spin_unlock_irq
> > > > >2.65%  [kernel]  [k] get_user_pages_fast
> > > > >2.03%  [kernel]  [k] __slab_free
> > > > >1.55%  [kernel]  [k] tick_nohz_idle_exit
> > > > >1.47%  [kernel]  [k] arm_lpae_map
> > > > >1.39%  [kernel]  [k] __fget
> > > > >1.14%  [kernel]  [k] __lock_text_start
> > > > >1.09%  [kernel]  [k] _raw_spin_lock
> > > > >1.08%  [kernel]  [k] bio_release_pages.part.42
> > > > >1.03%  [kernel]  [k] __sbitmap_get_word
> > > > >0.97%  [kernel]  [k] 
> > > > > arm_smmu_atc_inv_domain.constprop.42
> > > > >0.91%  [kernel]  [k] fput_many
> > > > >0.88%  [kernel]  [k] __arm_lpae_map
> > > > > 
> > > > > One thing to note is that we still spend an appreciable amount of 
> > > > > time in
> > > > > arm_smmu_atc_inv_domain(), which is disappointing when considering it 
> > > > > should
> > > > > effectively be a noop.
> > > > > 
> > > > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the 
> > > > > testing our
> > > > > batch size is 1, so we're not seeing the real benefit of the 
> > > > > batching. I
> > > > > can't help but think that we could improve this code to try to 
> > > > > combine CMD
> > > > > SYNCs for small batches.
> > > > > 
> > > > > Anyway, let me know your thoughts or any questions. I'll have a look
> > > > > if I get a chance for other possible bottlenecks.
> > > > 
> > > > Did you ever get any more information on this? I don't have any SMMUv3
> > > > hardware any more, so I can't really dig into this myself.
> > > 
> > > I'm only getting back to look at this now, as SMMU performance is a bit 
> > > of a
> > > hot topic again for us.
> > > 
> > > So one thing we are doing which looks to help performance is this series
> > > from Marc:
> > > 
> > > https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
> > > 
> > > So that is just spreading the per-CPU load for NVMe interrupt handling
> > > (where the DMA unmapping is happening), so I'd say just side-stepping any
> > > SMMU issue really.
> > > 
> > > Going back to the SMMU, I wanted to run eBPF and perf annotate to help
> > > profile this, but was having no luck getting them to work properly. I'll
> > > look at this again now.
> > 
> > Could you also try with the upcoming ATS change currently in Will's tree?
> > They won't improve your numbers but it'd be good to check that they don't
> > make things worse.
> 
> I can do when I get a chance.
> 
> > 
> > I've run a bunch of netperf instances on multiple cores and collecting
> > SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
> > consistently.
> > 
> > - 6.07% arm_smmu_iotlb_sync
> > - 5.74% arm_smmu_tlb_inv_range
> >  5.09% arm_smmu_cmdq_issue_cmdlist
> >  0.28% __pi_memset
> >  0.08% __pi_memcpy
> >  0.08% arm_smmu_atc_inv_domain.constprop.37
> >  0.07% arm_smmu_cmdq_build_cmd
> >  0.01% arm_smmu_cmdq_batch_add
> >   0.31% __pi_memset
> > 
> > So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
> > when ATS is not used. According to the annotations, the load from the
> > atomic_read(), that checks whether the domain uses ATS, is 77% of the
> > samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
> > there is much room for optimization there.
> 
> Well I did originally suggest using RCU protection to scan the list of
> devices, instead of reading an atomic and checking for non-zero value. But
> that would be an optimisation for ATS also, and there were no ATS devices at
> the time (to verify performance).

Heh, I have yet to get my hands on one. Currently I can't evaluate ATS
performance, but I agree that using RCU to scan the list should get better
results when using ATS.

When ATS isn't in use however, I suspect reading nr_ats_masters should be
more efficient than taking the RCU lock + reading an "ats_devices" list
(since the smmu_domain->devices list also serves context descriptor
invalidation, even when ATS isn't in use). I'll run some tests however, to
see if I can micro-optimize this case, but I don't expect noticeable
improvements.

Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-20 Thread John Garry

On 19/03/2020 18:43, Jean-Philippe Brucker wrote:

On Thu, Mar 19, 2020 at 12:54:59PM +, John Garry wrote:

Hi Will,



On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote:

And for the overall system, we have:

PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
0/40116 [4000Hz cycles],  (all, 96 CPUs)
--

  27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
  11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.35%  [kernel]  [k] _raw_spin_unlock_irq
   2.65%  [kernel]  [k] get_user_pages_fast
   2.03%  [kernel]  [k] __slab_free
   1.55%  [kernel]  [k] tick_nohz_idle_exit
   1.47%  [kernel]  [k] arm_lpae_map
   1.39%  [kernel]  [k] __fget
   1.14%  [kernel]  [k] __lock_text_start
   1.09%  [kernel]  [k] _raw_spin_lock
   1.08%  [kernel]  [k] bio_release_pages.part.42
   1.03%  [kernel]  [k] __sbitmap_get_word
   0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
   0.91%  [kernel]  [k] fput_many
   0.88%  [kernel]  [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in
arm_smmu_atc_inv_domain(), which is disappointing when considering it should
effectively be a noop.

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
batch size is 1, so we're not seeing the real benefit of the batching. I
can't help but think that we could improve this code to try to combine CMD
SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look if I
get a chance for other possible bottlenecks.


Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.


I'm only getting back to look at this now, as SMMU performance is a bit of a
hot topic again for us.

So one thing we are doing which looks to help performance is this series
from Marc:

https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d

So that is just spreading the per-CPU load for NVMe interrupt handling
(where the DMA unmapping is happening), so I'd say just side-stepping any
SMMU issue really.

Going back to the SMMU, I wanted to run eBPF and perf annotate to help
profile this, but was having no luck getting them to work properly. I'll
look at this again now.


Could you also try with the upcoming ATS change currently in Will's tree?
They won't improve your numbers but it'd be good to check that they don't
make things worse.


I can do when I get a chance.



I've run a bunch of netperf instances on multiple cores and collecting
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.

- 6.07% arm_smmu_iotlb_sync
- 5.74% arm_smmu_tlb_inv_range
 5.09% arm_smmu_cmdq_issue_cmdlist
 0.28% __pi_memset
 0.08% __pi_memcpy
 0.08% arm_smmu_atc_inv_domain.constprop.37
 0.07% arm_smmu_cmdq_build_cmd
 0.01% arm_smmu_cmdq_batch_add
  0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
when ATS is not used. According to the annotations, the load from the
atomic_read(), that checks whether the domain uses ATS, is 77% of the
samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
there is much room for optimization there.


Well I did originally suggest using RCU protection to scan the list of
devices, instead of reading an atomic and checking for a non-zero value.
But that would be an optimisation for ATS also, and there were no ATS
devices at the time (to verify performance).


Cheers,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-19 Thread Jean-Philippe Brucker
On Thu, Mar 19, 2020 at 12:54:59PM +, John Garry wrote:
> Hi Will,
> 
> > 
> > On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote:
> > > And for the overall system, we have:
> > > 
> > >PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 
> > > drop:
> > > 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> > > --
> > > 
> > >  27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> > >  11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
> > >   6.35%  [kernel]  [k] _raw_spin_unlock_irq
> > >   2.65%  [kernel]  [k] get_user_pages_fast
> > >   2.03%  [kernel]  [k] __slab_free
> > >   1.55%  [kernel]  [k] tick_nohz_idle_exit
> > >   1.47%  [kernel]  [k] arm_lpae_map
> > >   1.39%  [kernel]  [k] __fget
> > >   1.14%  [kernel]  [k] __lock_text_start
> > >   1.09%  [kernel]  [k] _raw_spin_lock
> > >   1.08%  [kernel]  [k] bio_release_pages.part.42
> > >   1.03%  [kernel]  [k] __sbitmap_get_word
> > >   0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
> > >   0.91%  [kernel]  [k] fput_many
> > >   0.88%  [kernel]  [k] __arm_lpae_map
> > > 
> > > One thing to note is that we still spend an appreciable amount of time in
> > > arm_smmu_atc_inv_domain(), which is disappointing when considering it 
> > > should
> > > effectively be a noop.
> > > 
> > > As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing 
> > > our
> > > batch size is 1, so we're not seeing the real benefit of the batching. I
> > > can't help but think that we could improve this code to try to combine CMD
> > > SYNCs for small batches.
> > > 
> > > Anyway, let me know your thoughts or any questions. I'll have a look if I
> > > get a chance for other possible bottlenecks.
> > 
> > Did you ever get any more information on this? I don't have any SMMUv3
> > hardware any more, so I can't really dig into this myself.
> 
> I'm only getting back to look at this now, as SMMU performance is a bit of a
> hot topic again for us.
> 
> So one thing we are doing which looks to help performance is this series
> from Marc:
> 
> https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d
> 
> So that is just spreading the per-CPU load for NVMe interrupt handling
> (where the DMA unmapping is happening), so I'd say just side-stepping any
> SMMU issue really.
> 
> Going back to the SMMU, I wanted to run eBPF and perf annotate to help
> profile this, but was having no luck getting them to work properly. I'll
> look at this again now.

Could you also try with the upcoming ATS change currently in Will's tree?
They won't improve your numbers but it'd be good to check that they don't
make things worse.

I've run a bunch of netperf instances on multiple cores and collecting
SMMU usage (on TaiShan 2280). I'm getting the following ratio pretty
consistently.

- 6.07% arm_smmu_iotlb_sync
   - 5.74% arm_smmu_tlb_inv_range
5.09% arm_smmu_cmdq_issue_cmdlist
0.28% __pi_memset
0.08% __pi_memcpy
0.08% arm_smmu_atc_inv_domain.constprop.37
0.07% arm_smmu_cmdq_build_cmd
0.01% arm_smmu_cmdq_batch_add
 0.31% __pi_memset

So arm_smmu_atc_inv_domain() takes about 1.4% of arm_smmu_iotlb_sync(),
when ATS is not used. According to the annotations, the load from the
atomic_read(), that checks whether the domain uses ATS, is 77% of the
samples in arm_smmu_atc_inv_domain() (265 of 345 samples), so I'm not sure
there is much room for optimization there.
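For illustration, the two approaches being compared look roughly like this (a
self-contained sketch with made-up names, not the driver's code; "counter"
stands in for nr_ats_masters and "ats_list" for the proposed RCU-protected
list of ATS masters):

#include <linux/atomic.h>
#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>

struct demo_domain {
	atomic_t		counter;	/* like nr_ats_masters */
	struct list_head	ats_list;	/* like the proposed "ats_devices" */
};

struct demo_master {
	struct list_head	ats_node;
};

/* current fast path: a single atomic load, the line eating 77% of samples */
static bool demo_needs_atc_inv(struct demo_domain *d)
{
	return atomic_read(&d->counter) != 0;
}

/* RCU alternative: walk a list that stays empty while ATS is unused */
static void demo_atc_inv_rcu(struct demo_domain *d,
			     void (*inv)(struct demo_master *))
{
	struct demo_master *m;

	rcu_read_lock();
	list_for_each_entry_rcu(m, &d->ats_list, ats_node)
		inv(m);
	rcu_read_unlock();
}

With no ATS masters attached, both come down to reading one field of the
domain, which is why neither variant is expected to make a noticeable
difference in this case.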

Thanks,
Jean


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-19 Thread John Garry

Hi Will,



On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote:

And for the overall system, we have:

   PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
0/40116 [4000Hz cycles],  (all, 96 CPUs)
--

 27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
 11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
  6.35%  [kernel]  [k] _raw_spin_unlock_irq
  2.65%  [kernel]  [k] get_user_pages_fast
  2.03%  [kernel]  [k] __slab_free
  1.55%  [kernel]  [k] tick_nohz_idle_exit
  1.47%  [kernel]  [k] arm_lpae_map
  1.39%  [kernel]  [k] __fget
  1.14%  [kernel]  [k] __lock_text_start
  1.09%  [kernel]  [k] _raw_spin_lock
  1.08%  [kernel]  [k] bio_release_pages.part.42
  1.03%  [kernel]  [k] __sbitmap_get_word
  0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
  0.91%  [kernel]  [k] fput_many
  0.88%  [kernel]  [k] __arm_lpae_map

One thing to note is that we still spend an appreciable amount of time in
arm_smmu_atc_inv_domain(), which is disappointing when considering it should
effectively be a noop.

As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
batch size is 1, so we're not seeing the real benefit of the batching. I
can't help but think that we could improve this code to try to combine CMD
SYNCs for small batches.

Anyway, let me know your thoughts or any questions. I'll have a look if I
get a chance for other possible bottlenecks.


Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.


I'm only getting back to look at this now, as SMMU performance is a bit 
of a hot topic again for us.


So one thing we are doing which looks to help performance is this series 
from Marc:


https://lore.kernel.org/lkml/9171c554-50d2-142b-96ae-1357952fc...@huawei.com/T/#mee5562d1efd6aaeb8d2682bdb6807fe7b5d7f56d

So that is just spreading the per-CPU load for NVMe interrupt handling 
(where the DMA unmapping is happening), so I'd say just side-stepping 
any SMMU issue really.


Going back to the SMMU, I wanted to run eBPF and perf annotate to help
profile this, but was having no luck getting them to work properly. I'll 
look at this again now.


Cheers,
John


Re: arm-smmu-v3 high cpu usage for NVMe

2020-03-18 Thread Will Deacon
Hi John,

On Thu, Jan 02, 2020 at 05:44:39PM +, John Garry wrote:
> And for the overall system, we have:
> 
>   PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
> 0/40116 [4000Hz cycles],  (all, 96 CPUs)
> --
> 
> 27.43%  [kernel]  [k] arm_smmu_cmdq_issue_cmdlist
> 11.71%  [kernel]  [k] _raw_spin_unlock_irqrestore
>  6.35%  [kernel]  [k] _raw_spin_unlock_irq
>  2.65%  [kernel]  [k] get_user_pages_fast
>  2.03%  [kernel]  [k] __slab_free
>  1.55%  [kernel]  [k] tick_nohz_idle_exit
>  1.47%  [kernel]  [k] arm_lpae_map
>  1.39%  [kernel]  [k] __fget
>  1.14%  [kernel]  [k] __lock_text_start
>  1.09%  [kernel]  [k] _raw_spin_lock
>  1.08%  [kernel]  [k] bio_release_pages.part.42
>  1.03%  [kernel]  [k] __sbitmap_get_word
>  0.97%  [kernel]  [k] arm_smmu_atc_inv_domain.constprop.42
>  0.91%  [kernel]  [k] fput_many
>  0.88%  [kernel]  [k] __arm_lpae_map
> 
> One thing to note is that we still spend an appreciable amount of time in
> arm_smmu_atc_inv_domain(), which is disappointing when considering it should
> effectively be a noop.
> 
> As for arm_smmu_cmdq_issue_cmdlist(), I do note that during the testing our
> batch size is 1, so we're not seeing the real benefit of the batching. I
> can't help but think that we could improve this code to try to combine CMD
> SYNCs for small batches.
> 
> Anyway, let me know your thoughts or any questions. I'll have a look if I
> get a chance for other possible bottlenecks.

Did you ever get any more information on this? I don't have any SMMUv3
hardware any more, so I can't really dig into this myself.

Will