On Fri, 20 Mar 2026 11:37:36 -0700 Dipayaan Roy wrote:
> On Sat, Mar 14, 2026 at 12:50:53PM -0700, Jakub Kicinski wrote:
> > On Tue, 10 Mar 2026 21:00:49 -0700 Dipayaan Roy wrote:  
> > > On certain systems configured with 4K PAGE_SIZE, utilizing page_pool
> > > fragments for RX buffers results in a significant throughput regression.
> > > Profiling reveals that this regression correlates with high overhead in
> > > the fragment allocation and reference counting paths on these specific
> > > platforms, rendering the multi-buffer-per-page strategy
> > > counterproductive.
> > 
> > Can you say more? We could technically take two references on the page
> > right away if MTU is small and avoid some of the cost.
> 
> There is a 15-20% shortfall in achieving line rate for MANA (180+ Gbps)
> on a particular ARM64 SKU. The issue is specific to this processor SKU and
> is not seen on other ARM64 SKUs (e.g., GB200) or x86 SKUs. Critically, the
> regression only manifests beyond 16 TCP connections, which strongly suggests
> it is only seen when there is high contention and traffic.
> 
>   no. of     | rx buf backed       | rx buf backed
>  connections | with page fragments | with full page
> -------------+---------------------+---------------
>            4 |         139 Gbps    |     138 Gbps
>            8 |         140 Gbps    |     162 Gbps
>           16 |         186 Gbps    |     186 Gbps

These results look a bit odd, 4 and 16 streams have the same perf,
while all other cases indeed show a delta. What I was hoping for was
a more precise attribution of the performance issue. Like perf top
showing that it's indeed the atomic ops on the refcount that stall.

>           32 |         136 Gbps    |     183 Gbps
>           48 |         159 Gbps    |     185 Gbps
>           64 |         165 Gbps    |     184 Gbps
>          128 |         170 Gbps    |     180 Gbps
>  
> HW team is still working to RCA this hw behaviour.
> 
> Regarding "We could technically take two references on the page right
> away", are you suggesting moving the page reference counting logic into
> the driver instead of relying on page pool?

Yes, either that or adjust the page pool APIs. 
page_pool_alloc_frag_netmem() currently sets the refcount to BIAS
which it then has to subtract later. So we get:

  set(BIAS)
  .. driver allocates chunks ..
  sub(BIAS_MAX - pool->frag_users)

Instead of using BIAS we could make the page pool guess that the caller
will keep asking for the same frame size. So initially take
(PAGE_SIZE/size) references.
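
To make the difference concrete, here is a rough userspace sketch, not
the actual page pool code. Everything prefixed sketch_/SKETCH_ is made up
for illustration, the BIAS value is a stand-in, and it only counts the
atomic read-modify-write ops on the refcount. It makes no claim about
where the stall on that SKU really is.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_PAGE_SIZE 4096L
#define SKETCH_BIAS_MAX  (1L << 20) /* stand-in value, not the kernel's */

struct sketch_page {
	atomic_long refs; /* models the page's fragment refcount */
	int rmw_ops;      /* atomic read-modify-write ops issued on refs */
};

/* Count how many size-byte chunks the driver carves out of one page. */
static long sketch_carve(long size)
{
	long users = 0;

	assert(size > 0 && size <= SKETCH_PAGE_SIZE);
	for (long off = 0; off + size <= SKETCH_PAGE_SIZE; off += size)
		users++;
	return users;
}

/* Today: set(BIAS), hand out chunks, then sub(BIAS_MAX - frag_users). */
static long sketch_bias(struct sketch_page *p, long size)
{
	long frag_users;

	p->rmw_ops = 0;
	atomic_store(&p->refs, SKETCH_BIAS_MAX);
	frag_users = sketch_carve(size);
	atomic_fetch_sub(&p->refs, SKETCH_BIAS_MAX - frag_users);
	p->rmw_ops++;
	return atomic_load(&p->refs);
}

/* Idea: start at PAGE_SIZE/guess, only correct when the guess was off. */
static long sketch_guess(struct sketch_page *p, long size, long guess)
{
	long initial, frag_users;

	assert(guess > 0 && guess <= SKETCH_PAGE_SIZE);
	initial = SKETCH_PAGE_SIZE / guess;
	p->rmw_ops = 0;
	atomic_store(&p->refs, initial);
	frag_users = sketch_carve(size);
	if (frag_users != initial) {
		atomic_fetch_sub(&p->refs, initial - frag_users);
		p->rmw_ops++;
	}
	return atomic_load(&p->refs);
}
```

With 2048-byte buffers on a 4K page both variants end up holding 2
references, but sketch_bias() always pays one atomic subtract per page
while sketch_guess() pays none as long as the guessed size matches what
the caller keeps asking for.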

> > The driver doesn't seem to set skb->truesize accordingly after this
> > change. So you're lying to the stack about how much memory each packet
> > consumes. This is a blocker for the change.
> >   
> ACK. I will send out a separate patch with a Fixes tag to fix the skb
> truesize.
> 
> > > To mitigate this, bypass the page_pool fragment path and force a single RX
> > > packet per page allocation when all the following conditions are met:
> > >   1. The system is configured with a 4K PAGE_SIZE.
> > >   2. A processor-specific quirk is detected via SMBIOS Type 4 data.  
> > 
> > I don't think we want the kernel to be in the business of matching
> > on platform names and providing optimal config by default.
> > This sort of logic needs to live in user space or the hypervisor
> > (which can then pass a single bit to the driver to enable the behavior)
> >   
> As per our internal discussion the hypervisor cannot provide the CPU
> version info (in VM as well as in bare-metal offerings).

Why? I suppose it's much more effort for you but it's much more effort
for the community to carry the workaround. So..

> On handling it from the user side, are you suggesting introducing a new
> ethtool private flag and having udev rules set that flag so the driver
> switches to full-page RX buffers? Given the wide range of distros we
> support, this might be harder to maintain/backport.
> 
> Also, the DMI parsing design was influenced by other wireless
> drivers such as /wireless/ath/ath10k/core.c. If this approach is not
> acceptable for the MANA driver then we will have to take an alternate
> route based on the discussion right above it.

Plenty of ugly hacks in the kernel, it's no excuse.
