On 2020-06-28 18:16, Björn Töpel wrote:

On 2020-06-27 09:04, Christoph Hellwig wrote:
On Sat, Jun 27, 2020 at 01:00:19AM +0200, Daniel Borkmann wrote:
Given there is at most a ~5-week window in which this removal could
still be applied, could we come up with a fix / proposal first that
moves this into the DMA mapping core? If there is something that can
be agreed upon by all parties, then we could avoid re-adding the 9%
slowdown. :/

I'd rather turn it upside down - this abuse of the internals blocks work
that has basically just missed the previous window and I'm not going
to wait weeks to sort out the API misuse.  But we can add optimizations
back later if we find a sane way.


I'm not super excited about the performance loss, but I do get
Christoph's frustration that gutting the DMA API makes it harder for
the DMA people to get work done. Let's try to solve this properly,
using proper DMA APIs.


That being said I really can't see how this would make so much of a
difference.  What architecture and what dma_ops are you using for
those measurements?  What is the workload?


The 9% is for an AF_XDP (a fast raw Ethernet socket; think AF_PACKET,
but faster) benchmark: receive a packet from the NIC and drop it. The
DMA syncs stand out in perf top:

   28.63%  [kernel]                   [k] i40e_clean_rx_irq_zc
   17.12%  [kernel]                   [k] xp_alloc
    8.80%  [kernel]                   [k] __xsk_rcv_zc
    7.69%  [kernel]                   [k] xdp_do_redirect
    5.35%  bpf_prog_992d9ddc835e5629  [k] bpf_prog_992d9ddc835e5629
    4.77%  [kernel]                   [k] xsk_rcv.part.0
    4.07%  [kernel]                   [k] __xsk_map_redirect
    3.80%  [kernel]                   [k] dma_direct_sync_single_for_cpu
    3.03%  [kernel]                   [k] dma_direct_sync_single_for_device
    2.76%  [kernel]                   [k] i40e_alloc_rx_buffers_zc
    1.83%  [kernel]                   [k] xsk_flush
...

For this benchmark the dma_ops are NULL (dma_is_direct() == true), and
the main issue is that SWIOTLB is now unconditionally enabled [1] for
x86, so for each sync we have to check whether the address is an
SWIOTLB buffer via is_swiotlb_buffer(), which involves some costly
indirection.
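
Roughly, the direct sync path boils down to the sketch below (a
simplified illustration, not the exact kernel code); the
is_swiotlb_buffer() check is the part that gets paid per packet:

/* Simplified sketch of the dma-direct sync path, not the exact
 * kernel code.
 */
void dma_direct_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
                                    size_t size,
                                    enum dma_data_direction dir)
{
        phys_addr_t paddr = dma_to_phys(dev, addr);

        /* No-op on cache-coherent x86. */
        if (!dev_is_dma_coherent(dev))
                arch_sync_dma_for_cpu(paddr, size, dir);

        /* The per-packet cost: with SWIOTLB always set up on x86, every
         * sync classifies the address against the bounce-buffer range.
         */
        if (unlikely(is_swiotlb_buffer(paddr)))
                swiotlb_tbl_sync_single(dev, paddr, size, dir,
                                        SYNC_FOR_CPU);
}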

That was pretty much what my hack avoided. Instead we did all the checks
upfront, since AF_XDP has long-term DMA mappings, and just set a flag
for that.
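
The shape of it, as a rough illustration only (the struct layout and
helper name below are made up, this is not the code that was removed):

/* Hypothetical sketch of the "decide once, branch per packet" idea. */
struct buf_pool {
        struct device *dev;
        bool dma_need_sync;     /* computed once when the umem is mapped */
};

/* Hot path: a single predictable branch instead of classifying the
 * address on every packet.
 */
static void pool_dma_sync_for_cpu(struct buf_pool *pool, dma_addr_t dma,
                                  size_t size)
{
        if (pool->dma_need_sync)
                dma_sync_single_for_cpu(pool->dev, dma, size,
                                        DMA_FROM_DEVICE);
}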

Avoiding the whole "is this address swiotlb?" check in
dma_direct_sync_single_for_{cpu,device}() per packet
would help a lot.

I'm pretty sure that's one of the things we hope to achieve with the generic bypass flag :)

Somewhat related to the DMA API: it would have performance benefits for
AF_XDP if the DMA range of the mapped memory were linear, e.g. by
letting the IOMMU map it contiguously. I've started hacking on this a
little bit, but it would be nice if such an API were part of the
mapping core.

Input: array of pages
Output: array of dma addrs (and obviously dev, flags and such)

For non-IOMMU: len(array of pages) == len(array of dma addrs)
For best-case IOMMU: len(array of dma addrs) == 1 (large linear space)
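
Purely as a hypothetical sketch of such an interface (nothing like this
exists in the mapping core today):

/* Hypothetical interface, only to make the idea concrete.
 *
 * Maps @nr_pages pages for @dev and fills @dma_addrs. Returns the number
 * of addresses written: nr_pages without an IOMMU (one per page), and
 * possibly as few as 1 when the IOMMU can cover all pages with a single
 * large, linear DMA range.
 */
int dma_map_pages_batch(struct device *dev, struct page **pages,
                        unsigned int nr_pages, enum dma_data_direction dir,
                        unsigned long attrs, dma_addr_t *dma_addrs);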

But that's for later. :-)

FWIW you will typically get that behaviour from IOMMU-based implementations of dma_map_sg() right now, although it's not strictly guaranteed. If you can weather some additional setup cost of calling sg_alloc_table_from_pages() plus walking the list after mapping to test whether you did get a contiguous result, you could start taking advantage of it as some of the dma-buf code in DRM and v4l2 does already (although those cases actually treat it as a strict dependency rather than an optimisation).
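
Roughly along these lines, as a sketch with error handling trimmed (the
single-segment test at the end is only a sufficient condition, not
something dma_map_sg() guarantees):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Map an array of pages through the scatterlist API and report whether
 * the IOMMU happened to hand back one contiguous DMA range.
 */
static int map_pages_maybe_contig(struct device *dev, struct page **pages,
                                  unsigned int nr_pages,
                                  struct sg_table *sgt)
{
        int nents;

        if (sg_alloc_table_from_pages(sgt, pages, nr_pages, 0,
                                      (unsigned long)nr_pages * PAGE_SIZE,
                                      GFP_KERNEL))
                return -ENOMEM;

        nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents,
                           DMA_BIDIRECTIONAL);
        if (!nents) {
                sg_free_table(sgt);
                return -EIO;
        }

        /* A single mapped segment means one linear IOVA range covering
         * all pages; otherwise the caller needs a per-page fallback.
         */
        if (nents == 1)
                pr_debug("contiguous DMA range at %pad\n",
                         &sg_dma_address(sgt->sgl));

        return nents;
}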

I'm inclined to agree that if we're going to see more of these cases, a new API call that did formally guarantee a DMA-contiguous mapping (either via IOMMU or bounce buffering) or failure might indeed be handy.

Robin.
