On 2/21/25 04:12, Masakazu Asama wrote:
We have observed a very rare issue in Intel E810 environments where SNMP-retrieved TX/RX counter values are sometimes nearly twice the actual values.

Upon investigation, we identified a problem in the process that updates the transmit and receive ring statistics in the ice driver. This issue occurs when the counter update process is executed simultaneously on different CPU cores.

I have attached a patch to fix this issue.

This patch is intended for Linux kernel 5.15 on Ubuntu 22.04, as my environment is Ubuntu 22.04.

In my test environment, applying this patch prevents the issue from occurring.

The function ice_update_vsi_ring_stats takes a pointer to a struct ice_vsi as an argument. This structure is allocated on the heap and shared across all CPU cores. The function resets the counter values to zero and then accumulates the values from each ring of the NIC.

However, since struct ice_vsi is shared across all CPU cores, the following race condition can occur when ice_update_vsi_ring_stats is executed simultaneously on different CPUs:

1. Multiple CPU cores reset the counter values in struct ice_vsi to zero at the same time.

2. Each CPU core independently increments the counter values.

As a result, the counter values may be updated to a higher-than-actual value.
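
The interleaving described above can be replayed deterministically in user space. The following is only a minimal sketch: struct ring, struct vsi_stats and the three functions are hypothetical stand-ins, not the real ice driver types or code, and the two "CPUs" are simulated by calling the steps in the bad order from a single thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real ice driver structures. */
struct ring { uint64_t pkts; };
struct vsi_stats { uint64_t tx_packets; };

/* Step 1 of the racy update: reset the shared total to zero. */
static void racy_reset(struct vsi_stats *shared)
{
	shared->tx_packets = 0;
}

/* Step 2 of the racy update: add every ring into the shared total. */
static void racy_accumulate(struct vsi_stats *shared,
			    const struct ring *rings, size_t n)
{
	for (size_t i = 0; i < n; i++)
		shared->tx_packets += rings[i].pkts;
}

/*
 * Replay one bad interleaving of two updaters, A and B:
 * both reset first, then both accumulate into the same shared field.
 */
static uint64_t replay_bad_interleaving(const struct ring *rings, size_t n)
{
	struct vsi_stats shared = { 0 };

	racy_reset(&shared);                /* CPU A resets */
	racy_reset(&shared);                /* CPU B resets */
	racy_accumulate(&shared, rings, n); /* CPU A adds all rings */
	racy_accumulate(&shared, rings, n); /* CPU B adds them again */

	return shared.tx_packets;
}
```

With rings holding 100, 200 and 300 packets, this interleaving yields 1200 instead of 600, which matches the "nearly twice the actual values" symptom in the report.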

We had observed other problems caused by the very same shared data; they
were already fixed in kernel 5.16 via
commit 1a0f25a52e08 ("ice: safer stats processing").
Sadly, it was not backported to 5.15.

From your proposed patch I could tell that the fix is not present on
your Ubuntu kernel.

The first step is to check if the linked patch fixes the issue at hand,
could you please give it a try?


The attached patch modifies the implementation to store the counter values on the stack, initialize them to zero, increment them with the values from each ring, and finally update struct ice_vsi. By avoiding the use of shared data for intermediate calculations, this fix prevents the issue.
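
The approach described in this paragraph can be illustrated roughly as follows. This is a user-space sketch under stated assumptions, not the attached patch and not the upstream commit: struct ring, struct vsi_stats and both functions are hypothetical names, and real kernel code would still need the driver's own synchronization for reading the rings and for the final store.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real ice driver structures. */
struct ring { uint64_t pkts; };
struct vsi_stats { uint64_t tx_packets; };

/* Sum all rings into a local variable that no other CPU can see. */
static uint64_t sum_rings_local(const struct ring *rings, size_t n)
{
	uint64_t total = 0; /* lives on this caller's stack */

	for (size_t i = 0; i < n; i++)
		total += rings[i].pkts;

	return total;
}

/*
 * Publish the finished total in one assignment. The shared field never
 * holds a zeroed or half-accumulated value, so two concurrent callers
 * can at worst overwrite each other with complete totals.
 */
static void update_vsi_stats(struct vsi_stats *shared,
			     const struct ring *rings, size_t n)
{
	shared->tx_packets = sum_rings_local(rings, n);
}
```

Calling update_vsi_stats() twice in a row on the same rings leaves the shared counter at the true sum, because no caller ever adds into the shared field.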

In my environment, multiple Intel E810 NICs are bonded together.

I use Zabbix to graph the RX/TX counters of the bonding interface. However, due to the way bonding ignores decreases in the counters of slave interfaces, this issue makes the statistics completely unreliable.

Graphs generated from the slave interfaces may appear normal because, even if the counter temporarily increases, it is corrected in the next observation.
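
Why a transient spike on a slave becomes permanent on the bonding interface can be shown with a simplified model. The sketch below assumes the aggregate only ever adds positive deltas between consecutive slave readings, as described above; fold_sample() and fold_series() are hypothetical helpers, not the bonding driver's code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified model of an aggregate that ignores counter decreases:
 * only a positive delta between two readings is added.
 */
static void fold_sample(uint64_t *aggregate, uint64_t *prev, uint64_t cur)
{
	if (cur > *prev)
		*aggregate += cur - *prev;
	*prev = cur; /* the next delta is taken against the latest reading */
}

/* Fold a whole series of slave readings, starting from zero. */
static uint64_t fold_series(const uint64_t *samples, size_t n)
{
	uint64_t aggregate = 0;
	uint64_t prev = 0;

	for (size_t i = 0; i < n; i++)
		fold_sample(&aggregate, &prev, samples[i]);

	return aggregate;
}
```

For slave readings 1000, 2100, 1100, 1150, where 2100 is a doubled reading, the slave itself ends at the correct 1150, but the aggregate ends at 2150: the spike was added and the later correction was ignored.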

When I reported this issue to the Ubuntu bug tracking system, I was told to get it merged upstream first.

I would like this issue to be fixed, but what should I do to get it accepted?

Any advice would be greatly appreciated.

You hit the correct mailing list for the upstream process.

The process is a bit different depending on whether we will need to just
backport Jesse's patch or parts of yours. For backports, you will reach
out to [email protected]

One more question before adding more patches: does the issue reproduce
with the current kernel (6.13, or even better, net-next:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git )?
