Hi there, I broke some network cards again. This time I noticed continuous RX 
packet drops with an Intel E810-XXV.

When such a card temporarily (just for a few seconds) receives a large flood of
packets and the kernel cannot keep up with processing them, the following
appears in the kernel log:

kernel: ice 0000:c7:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002b address=0x4000180000 flags=0x0020]
kernel: ice 0000:c7:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002b address=0x4000180000 flags=0x0020]
kernel: workqueue: ice_rx_dim_work [ice] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
kernel: workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
kernel: workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
kernel: workqueue: ice_rx_dim_work [ice] hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
kernel: workqueue: psi_avgs_work hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
kernel: ice 0000:c7:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002b address=0x4000180000 flags=0x0020]
kernel: workqueue: drm_fb_helper_damage_work hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
kernel: workqueue: ice_rx_dim_work [ice] hogged CPU for >10000us 7 times, consider switching to WQ_UNBOUND
kernel: workqueue: psi_avgs_work hogged CPU for >10000us 5 times, consider switching to WQ_UNBOUND
kernel: ice 0000:c7:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002b address=0x4000180000 flags=0x0020]
...
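
To tally the IO_PAGE_FAULT events from a saved kernel log, a minimal Python
sketch could look like the following. The regular expression is derived only
from the log lines quoted above, and the function name count_faults is made up
for illustration. Other kernel versions may format the event differently.

```python
import re
from collections import Counter

# Pattern modeled on the AMD-Vi lines quoted above (an assumption about the
# format, not taken from kernel documentation).
FAULT_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=(?P<domain>0x[0-9a-f]+)\s+"
    r"address=(?P<address>0x[0-9a-f]+)\s+"
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def count_faults(log_text):
    """Count IO_PAGE_FAULT events per (PCI device, faulting address)."""
    # Mail clients often wrap long log lines; collapsing all whitespace
    # lets a wrapped event still match the pattern.
    flat = " ".join(log_text.split())
    counts = Counter()
    for m in FAULT_RE.finditer(flat):
        counts[(m.group("bdf"), int(m.group("address"), 16))] += 1
    return counts
```

Feeding it the output of journalctl -k or dmesg (as text) gives one counter
entry per device and address, which makes it easy to see whether the faults
always hit the same address, as they do in the excerpt above.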

After that, the NIC seems to be in a persistently broken state and continues to
drop a few percent of the received packets, even at low data rates. When
reducing the incoming packet rate to just 10,000 pps, I still see over 500 pps
(more than 5%) being dropped. After reinitializing the NIC (e.g. by changing
the channel count using ethtool), the problem goes away and the NIC is rock
solid again, until the next packet flood.
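
One possible way to script such a reinitialization is sketched below in
Python. It is an assumption, not the exact invocation used here: it changes the
combined channel count with "ethtool -L" and then restores it. The helper names
and the choice of the temporary value are hypothetical, and by default the
commands are only printed, not executed.

```python
import subprocess


def reinit_commands(dev, combined, temporary=None):
    """Build two `ethtool -L` calls: change the combined count, then restore it."""
    if temporary is None:
        # Any different valid value should do; one less (or one more) is arbitrary.
        temporary = combined - 1 if combined > 1 else combined + 1
    return [
        ["ethtool", "-L", dev, "combined", str(temporary)],
        ["ethtool", "-L", dev, "combined", str(combined)],
    ]


def reinit(dev, combined, dry_run=True):
    """Print the commands (dry run) or execute them (requires root)."""
    cmds = reinit_commands(dev, combined)
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds
```

For example, reinit("eth0", 256) prints the two commands for the default
combined count listed in the "ethtool -l" output further down. Note that this
briefly interrupts traffic on the interface.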

We have reproduced this with:
  Linux 6.8.0-88-generic (Ubuntu 24.04)
  Linux 6.14.0-36-generic (Ubuntu 24.04 HWE)
  Linux 6.18.0-061800-generic (Ubuntu Mainline PPA)

CPU: AMD EPYC 9825 144-Core Processor (288 threads)

lspci | grep Ethernet
  c7:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
  c7:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)

ethtool -i eth0
  driver: ice
  version: 6.18.0-061800-generic
  firmware-version: 4.90 0x80020ef6 1.3863.0
  expansion-rom-version: 
  bus-info: 0000:c7:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: yes

ethtool -l eth0
  Channel parameters for eth0:
  Pre-set maximums:
  RX:           288
  TX:           288
  Other:        1
  Combined:     288
  Current hardware settings:
  RX:           0
  TX:           32
  Other:        1
  Combined:     256
These are the defaults after boot.

ethtool -S eth0 | grep rx_dropped
  rx_dropped: 7206525
  rx_dropped.nic: 0
ethtool -S eth1 | grep rx_dropped
  rx_dropped: 6889634
  rx_dropped.nic: 0

How to reproduce:

1. Use another host to flood the host with the E810 NIC with 64-byte UDP
packets. I used trafgen for that and made sure that the source ports are
randomized so that RSS spreads the load over all channels. The packet rate must
be high enough to overload the packet processing on the receiving host.
In my case, 4 Mpps was already enough to make the errors show up in the kernel
log and trigger the permanent packet loss, but the required packet rate may
depend on how CPU-intensive the processing of each packet is. Dropping packets
early (e.g. using iptables) makes reproducing harder.

2. Monitor the rx_dropped counter and the kernel log. After a few seconds, the
warnings/errors shown above should show up in the kernel log.

3. Stop the traffic generator and re-run it with a much lower packet rate, e.g.
10,000 pps. A good part of these packets is now dropped, even though the kernel
could easily keep up with this low packet rate.
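
The monitoring in steps 2 and 3 can be sketched in Python as follows. This is
a rough illustration, not a finished tool: it assumes "ethtool -S" prints
"name: value" lines as in the output above, and all function names are made up
for this example.

```python
import re
import subprocess
import time


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output ("name: value" lines) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\S+):\s+(-?\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats


def drop_rate(before, after, interval_s, counter="rx_dropped"):
    """Drops per second between two counter samples."""
    return (after[counter] - before[counter]) / interval_s


def drop_fraction(drops_per_s, offered_pps):
    """Share of the offered packet rate that is being dropped."""
    return drops_per_s / offered_pps


def sample(dev):
    """Read the current counters from a live interface."""
    out = subprocess.run(["ethtool", "-S", dev], check=True,
                         capture_output=True, text=True).stdout
    return parse_ethtool_stats(out)


def watch(dev, offered_pps, interval_s=1.0):
    """Print the drop rate once per interval until interrupted."""
    prev = sample(dev)
    while True:
        time.sleep(interval_s)
        cur = sample(dev)
        rate = drop_rate(prev, cur, interval_s)
        print(f"{dev}: {rate:.0f} drops/s "
              f"({100 * drop_fraction(rate, offered_pps):.1f}% of offered)")
        prev = cur
```

With the numbers from this report, 500 drops/s at an offered rate of 10,000 pps
corresponds to a drop fraction of 0.05, i.e. 5%. The offered rate has to be
taken from the traffic generator, since the receiver cannot know it.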

In my case the two ports of the E810 NIC were part of a bond, but I don't think
this is required to reproduce the issue.

Please let me know if there is more information I could provide.

Thanks,
Marcus
