On 20/03/2020 10:41, John Garry wrote:

+ Barry, Alexandru

    PerfTop:   85864 irqs/sec  kernel:89.6%  exact:  0.0% lost: 0/34434 drop:
0/40116 [4000Hz cycles],  (all, 96 CPUs)
--------------------------------------------------------------------------------------------------------------------------

      27.43%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
      11.71%  [kernel]          [k] _raw_spin_unlock_irqrestore
       6.35%  [kernel]          [k] _raw_spin_unlock_irq
       2.65%  [kernel]          [k] get_user_pages_fast
       2.03%  [kernel]          [k] __slab_free
       1.55%  [kernel]          [k] tick_nohz_idle_exit
       1.47%  [kernel]          [k] arm_lpae_map
       1.39%  [kernel]          [k] __fget
       1.14%  [kernel]          [k] __lock_text_start
       1.09%  [kernel]          [k] _raw_spin_lock
       1.08%  [kernel]          [k] bio_release_pages.part.42
       1.03%  [kernel]          [k] __sbitmap_get_word
       0.97%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.42
       0.91%  [kernel]          [k] fput_many
       0.88%  [kernel]          [k] __arm_lpae_map


Hi Will, Robin,

I'm just getting around to looking at this topic again. Here's the current picture for my NVMe test:

perf top -C 0 *
Samples: 808 of event 'cycles:ppp', Event count (approx.): 469909024
  Overhead  Shared Object     Symbol
    75.91%  [kernel]          [k] arm_smmu_cmdq_issue_cmdlist
     3.28%  [kernel]          [k] arm_smmu_tlb_inv_range
     2.42%  [kernel]          [k] arm_smmu_atc_inv_domain.constprop.49
     2.35%  [kernel]          [k] _raw_spin_unlock_irqrestore
     1.32%  [kernel]          [k] __arm_smmu_cmdq_poll_set_valid_map.isra.41
     1.20%  [kernel]          [k] aio_complete_rw
     0.96%  [kernel]          [k] enqueue_task_fair
     0.93%  [kernel]          [k] gic_handle_irq
     0.86%  [kernel]          [k] _raw_spin_lock_irqsave
     0.72%  [kernel]          [k] put_reqs_available
     0.72%  [kernel]          [k] sbitmap_queue_clear

* Only certain CPUs run the DMA unmap in my scenario, cpu0 being one of them.

My colleague Barry has similar findings for some other scenarios.

So we tried the latest perf NMI support WIP patches and noticed a few hotspots (see https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/perf%20annotate and https://raw.githubusercontent.com/hisilicon/kernel-dev/fee69c8ca3784b9dd3912703cfcd4985a00f6bbb/report.txt) when running some NVMe traffic:

- the initial cmpxchg to get a place in the queue (see the sketch after this list)
        - when more CPUs get involved, we start failing at an exponential rate
    0.00 :        ffff8000107a3500:       cas     x4, x2, [x27]
   26.52 :        ffff8000107a3504:       mov     x0, x4
         : arm_smmu_cmdq_issue_cmdlist():

- the queue locking
- polling cmd_sync
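
For reference, the hot cas above is the producer-slot allocation loop at the top of arm_smmu_cmdq_issue_cmdlist(); roughly this (paraphrased and trimmed from my reading of arm-smmu-v3.c, so details may differ between kernel versions):

	/* Paraphrased/trimmed from arm_smmu_cmdq_issue_cmdlist() */
	llq.val = READ_ONCE(cmdq->q.llq.val);
	do {
		u64 old;

		while (!queue_has_space(&llq, n + sync)) {
			/* ... poll until not full, re-reading cons ... */
		}

		head.cons = llq.cons;
		head.prod = queue_inc_prod_n(&llq, n + sync) |
					     CMDQ_PROD_OWNED_FLAG;

		/*
		 * Every CPU which loses the race retries here, so the
		 * cacheline holding cmdq->q.llq bounces between all
		 * submitters.
		 */
		old = cmpxchg_relaxed(&cmdq->q.llq.val, llq.val, head.val);
		if (old == llq.val)
			break;

		llq.val = old;
	} while (1);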

Some ideas to optimise:

a. Initial cmpxchg
This cmpxchg could be considered unfair. In addition, with all the contention on arm_smmu_cmdq.q, that cacheline is constantly pinged around the system. Maybe we can implement something similar to queued/ticketed spinlocks: after the initial cmpxchg fails, a CPU spins on its own copy of arm_smmu_cmdq.q, is released by its leader, and in turn releases the subsequent followers.
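
A very rough sketch of the kind of thing I mean (this is not a patch; struct cmdq_wait_node, cmdq->wait_tail and the function names are all invented): a CPU which loses the initial cmpxchg queues itself on a per-CPU node and spins on its own cacheline until the leader hands it an up-to-date llq snapshot, MCS-style:

	/*
	 * Sketch only: MCS-style handoff so a CPU which fails the initial
	 * cmpxchg spins on its own node rather than hammering cmdq->q.llq.
	 */
	struct cmdq_wait_node {
		struct cmdq_wait_node	*next;
		u64			llq_val;	/* snapshot handed over */
		int			done;
	};

	static DEFINE_PER_CPU(struct cmdq_wait_node, cmdq_wait_node);

	/* Called with interrupts disabled, as in the current submission path */
	static u64 cmdq_wait_for_slot(struct arm_smmu_cmdq *cmdq)
	{
		struct cmdq_wait_node *node = this_cpu_ptr(&cmdq_wait_node);
		struct cmdq_wait_node *prev;

		node->next = NULL;
		node->done = 0;

		/* a single xchg on the tail to join the waiters' queue */
		prev = xchg(&cmdq->wait_tail, node);
		if (!prev)
			return READ_ONCE(cmdq->q.llq.val);	/* we are the leader */

		WRITE_ONCE(prev->next, node);

		/* spin on our own cacheline instead of cmdq->q.llq */
		while (!smp_load_acquire(&node->done))
			cpu_relax();

		return node->llq_val;
	}

	/* Hand the updated llq value to the next waiter, if any */
	static void cmdq_release_next(struct arm_smmu_cmdq *cmdq,
				      struct cmdq_wait_node *node, u64 llq_val)
	{
		struct cmdq_wait_node *next = READ_ONCE(node->next);

		if (!next) {
			/* no-one queued behind us; try to clear the tail */
			if (cmpxchg_release(&cmdq->wait_tail, node, NULL) == node)
				return;
			/* someone is in the middle of linking in; wait */
			while (!(next = READ_ONCE(node->next)))
				cpu_relax();
		}

		next->llq_val = llq_val;
		smp_store_release(&next->done, 1);
	}

Each released CPU would claim its own slots by advancing the snapshot locally and passing the result to the next node, so only the last CPU in the chain needs to write the shared cmdq->q.llq back.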

b. Drop the queue_full checking in certain circumstances
If we cannot theoretically fill the queue, then stop checking for the queue being full (or similar). This should also help with the problem in a.: the less time spent between cmpxchg attempts, the lower the chance of failing (since we check for available queue space between attempts).

So if the cmdq depth > nr_available_cpus * (max batch size + 1) AND we always issue a cmd_sync for a batch (regardless of whether one was requested), then we should never fill the queue (I think).
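
For example, something along these lines decided once at probe time (a sketch; the field and macro names are from my reading of the driver and may not be exactly right):

	/*
	 * Sketch: decide once whether the command queue can ever fill,
	 * given each CPU has at most one batch plus its CMD_SYNC in
	 * flight at any time.
	 */
	static bool arm_smmu_cmdq_cannot_fill(struct arm_smmu_device *smmu)
	{
		u32 cmdq_depth = 1 << smmu->cmdq.q.llq.max_n_shift;

		/* max batch size + 1 for the CMD_SYNC issued per batch */
		return cmdq_depth > num_possible_cpus() * (CMDQ_BATCH_ENTRIES + 1);
	}

If that returns true, the queue_has_space() check and the poll-until-not-full path in arm_smmu_cmdq_issue_cmdlist() could be skipped.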

c. Don't do queue locking in certain circumstances
If we implement b. and also support MSI polling, then I don't think this would be required.

d. More minor ideas: move forward as soon as the "owner" stops gathering, to reduce the time spent advancing prod and hopefully the cmd_sync polling time; and use a smaller word size for the valid bitmap operations - 32b atomic operations may be more efficient overall than 64b, since from my observation the valid range checked is mostly < 16 bits.
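
To illustrate the last point, if the valid map were stored as 32-bit words, toggling a sub-word range might look roughly like this (illustration only; the helper name is invented, and a real version would still need to loop over words when a range crosses a boundary, as the current 64-bit code does):

	/*
	 * Illustration only: toggle the valid bits for [sbit, ebit) when
	 * the range fits inside a single 32-bit word of the valid map.
	 */
	static void cmdq_toggle_valid_range32(atomic_t *valid_map, u32 sbit, u32 ebit)
	{
		u32 word = sbit / 32;
		u32 mask = GENMASK((ebit - 1) % 32, sbit % 32);

		atomic_xor(mask, &valid_map[word]);
	}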

Let me know your thoughts or any other ideas.

Thanks,
John
