On 2026-04-23 05:48, Dipayaan Roy wrote:
On Thu, Apr 16, 2026 at 08:31:46AM -0700, Jakub Kicinski wrote:
On Tue, 14 Apr 2026 09:00:56 -0700 Dipayaan Roy wrote:
I still see roughly a 5% overhead from the atomic refcount operation
itself, but on that platform there is no throughput drop when using
page fragments versus full-page mode.

That seems to contradict your claim that it's a problem with a specific
platform. Since we're in the merge window I asked David Wei to try to
experiment with disabling page fragmentation on the ARM64 platforms we
have at Meta. If it repros we should use the generic rx-buf-len
ringparam because more NICs may want to implement this strategy.

Hi Jakub,

Thanks. I think I was not precise enough in my previous reply.

What I meant is that the atomic refcount cost itself does not appear to
be unique to the affected platform. I see a similar ~5% overhead on
another ARM64 platform (different vendor) as well. However, on that
platform there is no throughput delta between fragment mode and
full-page mode; both reach line rate.

On the affected platform, fragment mode shows an additional ~15%
throughput drop versus full-page mode. So the current data suggests that
the atomic overhead is common, but the throughput regression is not
explained by that overhead alone and likely depends on an additional
platform-specific factor.

Separately, the hardware team collected PCIe traces on the affected
platform and reported stalls in the fragment-mode case that are not seen
in full-page mode. They are still investigating the root cause, but
their current hypothesis is that this is related to that platform’s
PCIe/root-port microarchitecture rather than to page_pool refcounting
alone.

That said, I agree the right direction depends on whether this
reproduces on other ARM64 platforms. If David is able to reproduce the
same behavior, then using the generic rx-buf-len ringparam sounds like
the better direction.

Please let me know what David finds, and I can rework the patch
accordingly.
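
For reference, if the generic rx-buf-len ringparam route is taken, the
knob is exercised through ethtool's ring configuration. A sketch,
assuming a driver that implements the parameter (interface name and
buffer size below are placeholders, not from this thread):

```shell
# Show current ring parameters; rx-buf-len is listed where supported
ethtool -g eth0

# Request a smaller RX buffer length instead of full pages
# (hypothetical value; the driver clamps/rejects unsupported sizes)
ethtool -G eth0 rx-buf-len 4096
```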

Hi Dipayaan. Can you please share more details on your testing setup?

* What are you using as the test client/server? iperf3 or something
  else?
* What do you mean specifically by "5% overhead from the atomic refcount
  operation"? Some specific function?
* What are you using to measure? perf?
* How many queues, what is the napi softirq affinity?
* How many NUMA nodes? Does the problem only appear when crossing?

Thanks,
David
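
[Editorial note: the questions above are usually answered with a setup
along the following lines; this is a generic sketch, not Dipayaan's
actual configuration. Server address, stream count, and durations are
placeholders.]

```shell
# Server side: plain iperf3 listener
iperf3 -s

# Client side: several parallel streams to saturate the link
iperf3 -c <server-ip> -P 8 -t 60

# On the receiver during the run: system-wide profile with call graphs,
# then look for atomic refcount hotspots (e.g. page_pool / page frag
# reference counting) in the sorted symbol report
perf record -a -g -- sleep 30
perf report --sort symbol
```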



Regards
Dipayaan Roy
