On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> >> Use of order-3 pages is problematic in some cases.
> >>
> >> This patch might add three kinds of regression:
> >>
> >> 1) a CPU performance regression, but we will add page recycling
> >> later and performance should be back.
> >>
> >> 2) TCP receiver could grow its receive window slightly slower,
> >> because the skb->len/skb->truesize ratio will decrease.
> >> This is mostly OK; we prefer being conservative so as not to risk
> >> OOM, and we can eventually tune TCP better in the future.
> >> This is consistent with other drivers using 2048 bytes per ethernet
> >> frame.
> >>
> >> 3) Because we allocate one page per RX slot, we consume more
> >> memory for the ring buffers. XDP already had this constraint anyway.
> >>
> >> Signed-off-by: Eric Dumazet <eduma...@google.com>
> >> ---
> >
> > Note that we could also use a different strategy.
> >
> > Assume RX rings of 4096 entries/slots.
> >
> > With this patch, mlx4 gets the strategy used by Alexander in the
> > Intel drivers:
> >
> > Each RX slot has an allocated page, and uses half of it, flipping to
> > the other half every time the slot is used.
> >
> > So a ring buffer of 4096 slots allocates 4096 pages.
> >
> > When we receive a packet train for the same flow, GRO builds an skb
> > with ~45 page frags, all from different pages.
> >
> > The put_page() done from skb_release_data() touches ~45 different
> > struct page cache lines, and shows a high cost. (Compared to the
> > order-3 pages used today by mlx4, this adds extra cache line misses
> > and stalls for the consumer.)
> >
> > If we instead use the two halves of one page on consecutive RX
> > slots, we might cook skbs with the same number of MSS (45), but
> > half the number of cache lines touched by put_page(), so we should
> > speed up the consumer.
>
> So there is a problem that is being overlooked here. That is the cost
> of the DMA map/unmap calls. The problem is that many PowerPC systems
> have an IOMMU that you have to work around, and that IOMMU comes at a
> heavy cost for every map/unmap call. So unless you are saying you want
> to set up a hybrid between the mlx5 approach and this one, where we
> have a page cache that these all fall back into, you will take a heavy
> cost for having to map and unmap pages.
>
> The whole reason why I implemented the Intel page reuse approach the
> way I did was to try to mitigate the IOMMU issue; it wasn't so much to
> reduce allocator/freeing expense. Basically, the allocator scales; the
> IOMMU does not. So any solution would require making certain that we
> can leave the pages pinned for DMA to avoid having to take the global
> locks involved in accessing the IOMMU.
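For reference, the per-slot flip that both this patch and the Intel
drivers rely on looks roughly like this (a minimal sketch with
hypothetical names, not the actual mlx4 or ixgbe code); the important
property is that the page is DMA-mapped once and the mapping survives
every reuse:

#include <linux/mm.h>
#include <linux/types.h>

/* Hypothetical per-slot state, not the actual mlx4 structures: the
 * page is DMA-mapped once at ring setup and the mapping is kept for
 * the lifetime of the slot.
 */
struct rx_frag {
	struct page	*page;
	dma_addr_t	dma;		/* mapping survives every reuse */
	u32		page_offset;	/* 0 or PAGE_SIZE / 2 */
};

/* Called right after the slot's current half has been attached to an
 * skb (the skb inherits the ring's page reference).  If the consumer
 * of the other half has already dropped its reference, flip to that
 * half and take a fresh reference for the ring; either way the page
 * is never unmapped here.
 */
static bool rx_frag_recycle(struct rx_frag *frag)
{
	/* another skb still holds the other half: caller must
	 * allocate (and map) a fresh page for this slot
	 */
	if (page_count(frag->page) != 1)
		return false;

	frag->page_offset ^= PAGE_SIZE / 2;
	page_ref_inc(frag->page);
	return true;
}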
I do not see any difference: we keep pages mapped the same way.

mlx4_en_complete_rx_desc() will still use:

dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
                              frag_size, priv->dma_dir);

for every single MSS we receive. This won't change.
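Concretely: whether a slot flips within its own page (as in the sketch
above) or shares a page with a neighboring slot, the per-fragment DMA
cost remains one sync in each direction, with no map/unmap in the hot
path. A sketch under the same assumptions, with hypothetical helper
names:

#include <linux/dma-mapping.h>

/* Hand a received half-page to the stack: a sync, not an unmap.
 * (The real code passes priv->dma_dir; DMA_FROM_DEVICE is assumed
 * here for simplicity.)
 */
static void rx_frag_sync_for_cpu(struct device *dev, dma_addr_t dma,
				 u32 page_offset, unsigned int frag_size)
{
	dma_sync_single_range_for_cpu(dev, dma, page_offset,
				      frag_size, DMA_FROM_DEVICE);
}

/* Give a recycled half back to the NIC: again only a sync. */
static void rx_frag_sync_for_device(struct device *dev, dma_addr_t dma,
				    u32 page_offset, unsigned int frag_size)
{
	dma_sync_single_range_for_device(dev, dma, page_offset,
					 frag_size, DMA_FROM_DEVICE);
}

The map/unmap traffic, and hence the IOMMU pressure, is the same in
both layouts; only the number of struct page cache lines touched by
the consumer changes.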