On Thu, Feb 29, 2024 at 05:19:44PM +0100, Julien Panis wrote:
> On 2/27/24 00:18, Andrew Lunn wrote:
> > > +static struct sk_buff *am65_cpsw_alloc_skb(struct net_device *ndev, 
> > > unsigned int len)
> > > +{
> > > + struct page *page;
> > > + struct sk_buff *skb;
> > > +
> > > + page = dev_alloc_pages(0);
> > You are likely to get better performance if you use the page_pool.
> > 
> > When FEC added XDP support, the first set of changes was to make use
> > of page_pool. That improved the drivers performance. Then XDP was
> > added on top. Maybe you can follow that pattern.
> > 
> >        Andrew
> 
> Hello Andrew,
> 
> Thanks for this suggestion. I've been working on it over the last days.
> I encountered issues and I begin to wonder if that's a good option for
> this driver. Let me explain...

I'm not a page pool expert, so hopefully those that are will jump in
and help.

> This device has 3 ports:
> - Port0 is the host port (internal to the subsystem and referred as CPPI
> in the driver).
> - Port1 and 2 are the external ethernet ports.
> Each port's RX FIFO has 1 queue.
> 
> As mentioned in page_pool documentation:
> https://docs.kernel.org/networking/page_pool.html
> "The number of pools created MUST match the number of hardware
> queues unless hardware restrictions make that impossible. This would
> otherwise beat the purpose of page pool, which is allocate pages fast
> from cache without locking."

My guess is, this last bit is the important part. Locking. Do ports 1
and port 2 rx and tx run in parallel on different CPUs? Hence do you
need locking?

> So, for this driver I should use 2 page pools (for port1 and 2) if possible.

Maybe, maybe not. It is not really the number of front panel interface
which matters here. It is the number of entities which need buffers.

> But, if I I replace any alloc_skb() with page_pool_alloc() in the original
> driver, I will have to create only 1 page_pool.
> This is visible in am65_cpsw_nuss_common_open(), which starts with:
> "if (common->usage_count) return 0;" to ensure that subsequent code
> will be executed only once even if the 2 interfaces are ndo_open'd.
> IOW, skbs will be allocated for only 1 RX channel. I checked that behavior,
> and that's the way it works.

> This is because the host port (CPPI) has only 1 RX queue. This single
> queue is used to get all the packets, from both Ports and 2 (port ID is
> stored in each descriptor).

So you have one entity which needs buffers. CPPI. It seems like Port1
and Port2 do not need buffers? So to me, you need one page pool.

> So, because of this constraint, I ended up working on the "single
> page pool" option.
> 
> Questions:
> 1) Is the behavior described above usual for ethernet switch devices ?

This might be the first time page pool has been used by a switch? I
would check the melanox and sparx5 driver and see if they use page
pool.

> 2) Given that I can't use a page pool for each HW queue, is it worth
> using the page pool memory model ?

> 3) Currently I use 2 xdp_rxq (one for port1 and another one for port2).
> If an XDP program is attached to port1, my page pool dma parameter
> will need to be DMA_BIDIRECTIONAL (because of XDP_TX). As a result,
> even if there is no XDP program attached to port2, the setting for page
> pool will need to be DMA_BIDIRECTIONAL instead of DMA_FROM_DEVICE.
> In such situation, isn't it a problem for port2 ?
> 4) Because of 1) and 2), will page pool performance really be better for
> this driver ?

You probably need to explain the TX architecture a bit. How are
packets passed to the hardware? Is it again via a single CPPI entity?
Or are there two entities?

DMA_BIDIRECTIONAL and DMA_FROM_DEVICE is about cache flushing and
invalidation. Before you pass a buffer to the hardware for it to
receive into, you need to invalidate the cache. That means when the
hardware gives the buffer back with a packet in it, there is a cache
miss and it fetches the new data from memory. If that packet gets
turned into an XDP_TX, you need to flush the cache to force any
changes out of the cache and into memory. The DMA from memory then
gets the up to date packet contents.

My guess would be, always using DMA_BIDIRECTIONAL is fine, so long as
your cache operations are correct. Turn on DMA_API_DEBUG and make sure
it is happy.

     Andrew

Reply via email to