Re: Optimizing kernel compilation / alignments for network performance

Rafał Miłecki Tue, 10 May 2022 03:40:24 -0700

On 6.05.2022 14:42, Andrew Lunn wrote:

I just took a quick look at the driver. It allocates and maps rx buffers that 
can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724.
This seems rather excessive, especially since most people are going to use a 
MTU of 1500.
My proposal would be to add support for making rx buffer size dependent on MTU, 
reallocating the ring on MTU changes.
This should significantly reduce the time spent on flushing caches.


Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac:
configure MTU and add support for frames beyond 8192 byte size"):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03

It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps).

I do all my testing with
#define BGMAC_RX_MAX_FRAME_SIZE                 1536


That helps show that cache operations are part of your bottleneck.

Taking a quick look at the driver. On the receive side:

                        /* Unmap buffer to make it accessible to the CPU */
                         dma_unmap_single(dma_dev, dma_addr,
                                          BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE);

Here is data is mapped read for the CPU to use it.

                        /* Get info from the header */
                         len = le16_to_cpu(rx->len);
                         flags = le16_to_cpu(rx->flags);

                         /* Check for poison and drop or pass the packet */
                         if (len == 0xdead && flags == 0xbeef) {
                                 netdev_err(bgmac->net_dev, "Found poisoned packet 
at slot %d, DMA issue!\n",
                                            ring->start);
                                 put_page(virt_to_head_page(buf));
                                 bgmac->net_dev->stats.rx_errors++;
                                 break;
                         }

                         if (len > BGMAC_RX_ALLOC_SIZE) {
                                 netdev_err(bgmac->net_dev, "Found oversized packet 
at slot %d, DMA issue!\n",
                                            ring->start);
                                 put_page(virt_to_head_page(buf));
                                 bgmac->net_dev->stats.rx_length_errors++;
                                 bgmac->net_dev->stats.rx_errors++;
                                 break;
                         }

                         /* Omit CRC. */
                         len -= ETH_FCS_LEN;

                         skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE);
                         if (unlikely(!skb)) {
                                 netdev_err(bgmac->net_dev, "build_skb 
failed\n");
                                 put_page(virt_to_head_page(buf));
                                 bgmac->net_dev->stats.rx_errors++;
                                 break;
                         }
                         skb_put(skb, BGMAC_RX_FRAME_OFFSET +
                                 BGMAC_RX_BUF_OFFSET + len);
                         skb_pull(skb, BGMAC_RX_FRAME_OFFSET +
                                  BGMAC_RX_BUF_OFFSET);

                         skb_checksum_none_assert(skb);
                         skb->protocol = eth_type_trans(skb, bgmac->net_dev);

and this is the first access of the actual data. You can make the
cache actually work for you, rather than against you, to adding a call to

        prefetch(buf);

just after the dma_unmap_single(). That will start getting the frame
header from DRAM into cache, so hopefully it is available by the time
eth_type_trans() is called and you don't have a cache miss.



I don't think that analysis is correct.

Please take a look at following lines:
struct bgmac_rx_header *rx = slot->buf + BGMAC_RX_BUF_OFFSET;
void *buf = slot->buf;

The first we do after dma_unmap_single() call is rx->len read. That
actually points to DMA data. There is nothing we could keep CPU busy
with while preteching data.

FWIW I tried adding prefetch(buf); anyway. I didn't change NAT speed by
a single 1 Mb/s. Speed was exactly the same as without prefetch() call.

_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel

Re: Optimizing kernel compilation / alignments for network performance

Reply via email to