> > I just took a quick look at the driver. It allocates and maps rx buffers > > that can cover a packet size of BGMAC_RX_MAX_FRAME_SIZE = 9724. > > This seems rather excessive, especially since most people are going to use > > a MTU of 1500. > > My proposal would be to add support for making rx buffer size dependent on > > MTU, reallocating the ring on MTU changes. > > This should significantly reduce the time spent on flushing caches. > > Oh, that's important too, it was changed by commit 8c7da63978f1 ("bgmac: > configure MTU and add support for frames beyond 8192 byte size"): > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c7da63978f1672eb4037bbca6e7eac73f908f03 > > It lowered NAT speed with bgmac by 60% (362 Mbps → 140 Mbps). > > I do all my testing with > #define BGMAC_RX_MAX_FRAME_SIZE 1536
That helps show that cache operations are part of your bottleneck. Taking a quick look at the driver. On the receive side: /* Unmap buffer to make it accessible to the CPU */ dma_unmap_single(dma_dev, dma_addr, BGMAC_RX_BUF_SIZE, DMA_FROM_DEVICE); Here is data is mapped read for the CPU to use it. /* Get info from the header */ len = le16_to_cpu(rx->len); flags = le16_to_cpu(rx->flags); /* Check for poison and drop or pass the packet */ if (len == 0xdead && flags == 0xbeef) { netdev_err(bgmac->net_dev, "Found poisoned packet at slot %d, DMA issue!\n", ring->start); put_page(virt_to_head_page(buf)); bgmac->net_dev->stats.rx_errors++; break; } if (len > BGMAC_RX_ALLOC_SIZE) { netdev_err(bgmac->net_dev, "Found oversized packet at slot %d, DMA issue!\n", ring->start); put_page(virt_to_head_page(buf)); bgmac->net_dev->stats.rx_length_errors++; bgmac->net_dev->stats.rx_errors++; break; } /* Omit CRC. */ len -= ETH_FCS_LEN; skb = build_skb(buf, BGMAC_RX_ALLOC_SIZE); if (unlikely(!skb)) { netdev_err(bgmac->net_dev, "build_skb failed\n"); put_page(virt_to_head_page(buf)); bgmac->net_dev->stats.rx_errors++; break; } skb_put(skb, BGMAC_RX_FRAME_OFFSET + BGMAC_RX_BUF_OFFSET + len); skb_pull(skb, BGMAC_RX_FRAME_OFFSET + BGMAC_RX_BUF_OFFSET); skb_checksum_none_assert(skb); skb->protocol = eth_type_trans(skb, bgmac->net_dev); and this is the first access of the actual data. You can make the cache actually work for you, rather than against you, to adding a call to prefetch(buf); just after the dma_unmap_single(). That will start getting the frame header from DRAM into cache, so hopefully it is available by the time eth_type_trans() is called and you don't have a cache miss. Andrew _______________________________________________ openwrt-devel mailing list openwrt-devel@lists.openwrt.org https://lists.openwrt.org/mailman/listinfo/openwrt-devel