On Fri, Apr 24, 2026 at 01:05:24PM -0700, David Wei wrote:
> On 2026-04-23 05:48, Dipayaan Roy wrote:
> > On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
> > > On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
> > > > I still see roughly a 5% overhead from the atomic refcount operation
> > > > itself, but on that platform there is no throughput drop when using
> > > > page fragments versus full-page mode.
> > > 
> > > That seems to contradict your claim that it's a problem with a specific
> > > platform.. Since we're in the merge window I asked David Wei to try to
> > > experiment with disabling page fragmentation on the ARM64 platforms we
> > > have at Meta. If it repros we should use the generic rx-buf-len
> > > ringparam because more NICs may want to implement this strategy.
> > 
> > Hi Jakub,
> > 
> > Thanks. I think I was not precise enough in my previous reply.
> > 
> > What I meant is that the atomic refcount cost itself does not appear to
> > be unique to the affected platform. I see a similar ~5% overhead on
> > another ARM64 platform (different vendor) as well. However, on that
> > platform there is no throughput delta between fragment mode and
> > full-page mode; both reach line rate.
> > 
> > On the affected platform, fragment mode shows an additional ~15%
> > throughput drop versus full-page mode. So the current data suggests that
> > the atomic overhead is common, but the throughput regression is not
> > explained by that overhead alone and likely depends on an additional
> > platform-specific factor.
> > 
> > Separately, the hardware team collected PCIe traces on the affected
> > platform and reported stalls in the fragment-mode case that are not seen
> > in full-page mode. They are still investigating the root cause, but
> > their current hypothesis is that this is related to that platform’s
> > PCIe/root-port microarchitecture rather than to page_pool refcounting
> > alone.
> > 
> > That said, I agree the right direction depends on whether this
> > reproduces on other ARM64 platforms. If David is able to reproduce the
> > same behavior, then using the generic rx-buf-len ringparam sounds like
> > the better direction.
> > 
> > Please let me know what David finds, and I can rework the patch
> > accordingly.
> 
> I ran a test on Grace, 4 KB pages, 72 cores, 1 NUMA node.
> 
> Broadcom NIC, bnxt driver, 50 Gbps bandwidth. Hacked it up to either
> give me 1 or 2 frags per page. No agg ring, no HDS, no HW GRO.
> 
> Use 1 combined queue only for the server. Affinitized its net rx softirq
> to run on core 4.
> 
> Ran iperf3 server, taskset onto cpu cores 32-47. The iperf3 client is
> running on a host w/ same hw in the same region. Using 32 queues, no
> softirq affinities. The idea is to hammer page->pp_ref_count from
> different cores.
> 
> * 1 frag/page  -> 32.3 Gbps
> * 2 frags/page -> 36.0 Gbps
> 
> Comparing perf, for 2 frags/page the cost of skb_release_data() hitting
> pp_ref_count goes up, as expected. Is this what you see? When you say
> there's a +5% overhead, what function?
> 
> Overall tput is higher with multiple frags. That's to be expected w/
> page pool.

Hi David,

Thanks for running this. Your results are consistent with mine.

I have tested this on two ARM64 platforms from different vendors,
running ntttcp and iperf3 with a 4K base page size.
On both platforms I see a ~5% overhead in fragment mode, split between
napi_pp_put_page (~3.9%) and page_pool_alloc_frag_netmem (~1.9%),
both stalling on the LSE ldaddal atomic that maintains pp_ref_count.
That matches your observation. However, only one of the two platforms
shows a ~15% drop in throughput in fragment mode vs full-page mode.
The other platform in fact performs slightly better in fragment mode
than in full-page mode (similar to what you saw).

So the atomic refcount overhead appears to be common across ARM64
platforms, but by itself it does not cause a throughput regression.
The regression seems specific to the one platform for which we want
the full-page workaround. In addition, the HW team has identified
PCIe stalls in fragment mode that are absent in full-page mode, and
their investigation points to a suspected microarchitectural issue
in the PCIe root port. IMO there is no issue with page_pool itself.

Given that:
 - Grace shows fragments are faster (your data)
 - A second ARM64 platform shows no regression (my data)
 - Only the affected platform shows a throughput drop
 - The HW team suspects a platform-specific PCIe issue, and our
   experimental data likewise points to the drop being confined to
   that platform

I believe this remains a platform-specific workaround rather than
a generic issue. Would a private flag still be acceptable for this
case?


> 
> There are some 200 Gbps NICs but they're mlx5 so I'd have to redo the
> driver hack. Are you going to re-implement this change with rx-buf-len
> instead of a private flag? If so, I won't spend more time running this
> test.
> 
I can go either way depending on what Jakub prefers.

Hi Jakub,
with this new data from David, is it convincing enough to keep a
mana-driver-specific private flag, which user space can set via a
udev rule after detecting the underlying platform? If not, I will
send the next version using the generic rx-buf-len approach instead.
> > 
> > 
> > Regards
> > Dipayaan Roy


Thanks and Regards
Dipayaan Roy
