On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> >> Use of order-3 pages is problematic in some cases.
> >>
> >> This patch might add three kinds of regression :
> >>
> >> 1) a CPU performance regression, but we will add later page
> >> recycling and performance should be back.
> >>
> >> 2) TCP receiver could grow its receive window slightly slower,
> >>    because skb->len/skb->truesize ratio will decrease.
> >>    This is mostly ok, we prefer being conservative to not risk OOM,
> >>    and eventually tune TCP better in the future.
> >>    This is consistent with other drivers using 2048 per ethernet frame.
> >>
> >> 3) Because we allocate one page per RX slot, we consume more
> >>    memory for the ring buffers. XDP already had this constraint anyway.
> >>
> >> Signed-off-by: Eric Dumazet <eduma...@google.com>
> >> ---
> >
> > Note that we also could use a different strategy.
> >
> > Assume RX rings of 4096 entries/slots.
> >
> > With this patch, mlx4 gets the strategy used by Alexander in Intel
> > drivers :
> >
> > Each RX slot has an allocated page, and uses half of it, flipping to the
> > other half every time the slot is used.
> >
> > So a ring buffer of 4096 slots allocates 4096 pages.
> >
> > When we receive a packet train for the same flow, GRO builds an skb with
> > ~45 page frags, all from different pages.
> >
> > The put_page() done from skb_release_data() touches ~45 different struct
> > page cache lines, and show a high cost. (compared to the order-3 used
> > today by mlx4, this adds extra cache line misses and stalls for the
> > consumer)
> >
> > If we instead try to use the two halves of one page on consecutive RX
> > slots, we might instead cook skb with the same number of MSS (45), but
> > half the number of cache lines for put_page(), so we should speed up the
> > consumer.
> 
> So there is a problem that is being overlooked here.  That is the cost
> of the DMA map/unmap calls.  The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call.  So unless you are saying you wan to
> setup a hybrid between the mlx5 and this approach where we have a page
> cache that these all fall back into you will take a heavy cost for
> having to map and unmap pages.
> 
> The whole reason why I implemented the Intel page reuse approach the
> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
> resolve allocator/freeing expense.  Basically the allocator scales,
> the IOMMU does not.  So any solution would require making certain that
> we can leave the pages pinned in the DMA to avoid having to take the
> global locks involved in accessing the IOMMU.


I do not see any difference for the fact that we keep pages mapped the
same way.

mlx4_en_complete_rx_desc() will still use the :

dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
                              frag_size, priv->dma_dir);

for every single MSS we receive.

This wont change.


Reply via email to