We have observed a very rare issue in Intel E810 environments where
SNMP-retrieved TX/RX counter values are sometimes nearly twice the actual
values.

Upon investigation, we identified a problem in the process that updates the
transmit and receive ring statistics in the ice driver. This issue occurs
when the counter update process is executed simultaneously on different CPU
cores.

I have attached a patch that fixes this issue.

The patch is against the Linux 5.15 kernel shipped with Ubuntu 22.04,
which is what my environment runs.

In my test environment, the issue no longer occurs with the patch applied.

The function ice_update_vsi_ring_stats takes a pointer to a struct ice_vsi
as an argument. This structure is allocated on the heap and shared across
all CPU cores. The function resets the counter values to zero and then
accumulates the values from each ring of the NIC.

However, since struct ice_vsi is shared across all CPU cores, the following
race condition can occur when ice_update_vsi_ring_stats is executed
simultaneously on different CPUs:

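1. Multiple CPU cores reset the counter values in struct ice_vsi to zero at
the same time.

2. Each CPU core then adds the per-ring values to the same shared counters,
so every ring is counted once per core.

As a result, the counter values can end up higher than the actual values.
With two cores racing, the total is close to twice the real one, which
matches what we see over SNMP.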
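
To make the interleaving concrete, here is a minimal userspace sketch that
replays the two steps above in the unlucky order, sequentially and without
real threads. The struct and function names are hypothetical stand-ins, not
the driver's actual definitions, and the real code tracks many more
counters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct ring_stats { uint64_t pkts; };
struct vsi_stats  { uint64_t tx_pkts; };	/* shared by all CPUs */

/* Step 1 of an update: clear the shared total. */
static void reset_total(struct vsi_stats *vsi)
{
	vsi->tx_pkts = 0;
}

/* Step 2 of an update: add every ring's counter to the shared total. */
static void accumulate_total(struct vsi_stats *vsi,
			     const struct ring_stats *rings, size_t n)
{
	for (size_t i = 0; i < n; i++)
		vsi->tx_pkts += rings[i].pkts;
}

/*
 * Replays one bad interleaving of two updaters, "CPU A" and "CPU B":
 * both reset first, then both accumulate into the same total.
 */
static uint64_t racy_interleaving(const struct ring_stats *rings, size_t n)
{
	struct vsi_stats vsi = { 0 };

	reset_total(&vsi);			/* CPU A resets */
	reset_total(&vsi);			/* CPU B resets */
	accumulate_total(&vsi, rings, n);	/* CPU A adds all rings */
	accumulate_total(&vsi, rings, n);	/* CPU B adds them again */

	return vsi.tx_pkts;
}
```

With three rings holding 10, 20 and 30 packets, the real total is 60, but
this interleaving reports 120.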

The attached patch modifies the implementation to store the counter values
on the stack, initialize them to zero, increment them with the values from
each ring, and finally update struct ice_vsi. By avoiding the use of shared
data for intermediate calculations, this fix prevents the issue.
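
The idea can be sketched in userspace as follows. This is only an
illustration of the approach described above, not the attached patch
itself. The struct and function names are hypothetical, and the sketch
leaves out the synchronization the kernel uses when reading per-ring
counters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct ring_counters { uint64_t pkts; uint64_t bytes; };
struct vsi_totals    { uint64_t tx_pkts; uint64_t tx_bytes; };

/*
 * Sum the rings into a stack-local copy and only then write the
 * result to the shared structure. Concurrent callers never see or
 * modify each other's partial sums.
 */
static void update_totals_local(struct vsi_totals *vsi,
				const struct ring_counters *rings, size_t n)
{
	struct vsi_totals sum = { 0 };	/* private to this caller */

	for (size_t i = 0; i < n; i++) {
		sum.tx_pkts  += rings[i].pkts;
		sum.tx_bytes += rings[i].bytes;
	}

	/* Publish the finished totals. */
	vsi->tx_pkts  = sum.tx_pkts;
	vsi->tx_bytes = sum.tx_bytes;
}
```

Because each caller writes a complete sum, two overlapping calls both store
the same correct totals instead of adding on top of each other.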

In my environment, multiple Intel E810 NICs are bonded together.

I use Zabbix to graph the RX/TX counters of the bonding interface. However,
because bonding ignores decreases in the counters of slave interfaces, each
inflated reading stays in the bond's totals permanently, which makes the
statistics completely unreliable.

Graphs generated from the slave interfaces may appear normal because, even
if the counter temporarily increases, it is corrected in the next
observation.
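
The difference between the two graphs can be illustrated with a small
sketch. It assumes a simplified model in which the aggregate only ever adds
positive differences between consecutive slave readings. The helper names
are hypothetical and this is not the bonding driver's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fold one new slave reading into the
 * aggregate. A reading lower than the previous one is ignored.
 */
static uint64_t fold_counter(uint64_t total, uint64_t prev, uint64_t cur)
{
	if (cur > prev)
		total += cur - prev;
	return total;
}

/* Fold a whole series of slave readings, starting from zero. */
static uint64_t fold_series(const uint64_t *samples, size_t n)
{
	uint64_t total = 0;
	uint64_t prev = 0;

	for (size_t i = 0; i < n; i++) {
		total = fold_counter(total, prev, samples[i]);
		prev = samples[i];
	}
	return total;
}
```

For slave readings of 100, 300, 200 and 250, where 300 is a doubled reading
of a real 150, the slave itself ends at the correct 250, while the folded
total ends at 350 and never recovers.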

When I reported this issue to the Ubuntu bug tracking system, I was told to
get it merged upstream first.

I would like this issue to be fixed. What should I do to get the patch
accepted upstream?

Any advice would be greatly appreciated.

Attachment: ice_update_vsi_ring_stats.patch
Description: Binary data
