Hi netdev,

After a Fedora CoreOS upgrade at the beginning of February, we are seeing
kernel crashes on multiple nodes with identical hardware, all of which
point to an issue in the ixgbe driver.

The affected nodes reboot following a kernel NULL pointer dereference
immediately after an ixgbe "Malicious event on VF" message.

Feb 27 21:34:34 d8-r11-c8-n3 kernel: ixgbe 0000:21:00.0: Malicious event on VF 3 tx:80000 rx:0
Feb 27 21:34:34 d8-r11-c8-n3 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000304

Different node in another chassis (identical hardware):

Feb 13 10:42:49 d8-r11-c9-n2 kernel: ixgbe 0000:21:00.0: Malicious event on VF 12 tx:80000 rx:0
Feb 13 10:42:49 d8-r11-c9-n2 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000b2c

This has occurred on at least five separate nodes since our FCOS upgrade
maintenance on February 3. After reboot, nodes return to normal operation
until the next occurrence. On each of these systems the journal ends at the
BUG line, so no backtrace has been captured yet. I will increase the panic
delay and adjust the crash-capture settings, and will follow up if we catch
a more meaningful trace before the reboot triggers.
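
For tracking how often this happens per node, a minimal sketch along these
lines could tally the MDD messages from saved journal output. It assumes
one event per line in the format shown in the excerpts above; the function
names are illustrative only, and the tx/rx values are kept as raw strings
rather than interpreted.

```python
import re
from collections import Counter

# Pattern modelled on the two journal excerpts above (assumed format:
# "<Mon DD HH:MM:SS> <host> kernel: ixgbe <pci>: Malicious event on VF <n> tx:<v> rx:<v>").
MDD_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"ixgbe\s+(?P<pci>\S+):\s+Malicious event on\s+VF\s+(?P<vf>\d+)\s+"
    r"tx:(?P<tx>[0-9a-fA-F]+)\s+rx:(?P<rx>[0-9a-fA-F]+)"
)


def parse_mdd_events(lines):
    """Return one dict per 'Malicious event on VF' line; other lines are skipped."""
    events = []
    for line in lines:
        match = MDD_RE.search(line.strip())
        if match:
            event = match.groupdict()
            event["vf"] = int(event["vf"])  # VF index is printed in decimal
            events.append(event)
    return events


def tally_by_host_and_vf(events):
    """Count events per (host, VF index) pair."""
    return Counter((e["host"], e["vf"]) for e in events)
```

Feeding it the lines of a saved journal file would give a per-node, per-VF
count that can be compared before and after any mitigation.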

Here's some relevant info pertaining to our environment:

- Linux kernel: 6.17.7-300.fc43.x86_64
- OS: Fedora CoreOS 43.20251110.3.1

Hardware Info:

- Gigabyte MZ62-HD0 nodes (H262-Z62 chassis) -- happening across multiple
nodes in multiple chassis in the stack (i.e., not isolated to a single
chassis)
- CPU is AMD EPYC 7302
- The NIC causing issues is: Intel X550 (rev 01) dual-port 10GBASE-T
- Both ports are bonded (802.3ad) to redundant leaf switches

Driver/mod Info:

- driver: ixgbe
- version: 6.17.7-300.fc43.x86_64
- firmware-version: 0x80000c67, 1.1276.0

From the modinfo:
- filename:       /lib/modules/6.17.7-300.fc43.x86_64/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko.xz
- description:    Intel(R) 10 Gigabit PCI Express Network Driver
- rhelversion:    10.99

The X550 adapter is SR-IOV capable, but no VFs are configured:

/sys/class/net/enp33s0f0/device/sriov_numvfs = 0

/sys/module/ixgbe/parameters/ has only allow_unsupported_sfp = N

Also, no VF PCI devices appear in lspci output.
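
To confirm the same on every node and every port, the check can be
scripted. This is only a sketch: it reads the standard sysfs attribute
sriov_numvfs for each interface that exposes it, and the function name and
the sysfs_net parameter are my own (the parameter exists so the function
can be pointed at a test directory).

```python
from pathlib import Path


def sriov_vf_counts(sysfs_net="/sys/class/net"):
    """Map interface name -> configured VF count for SR-IOV capable devices.

    Interfaces without a device/sriov_numvfs attribute (e.g. lo, bonds)
    are left out of the result.
    """
    counts = {}
    for iface_dir in sorted(Path(sysfs_net).iterdir()):
        numvfs = iface_dir / "device" / "sriov_numvfs"
        if numvfs.is_file():
            counts[iface_dir.name] = int(numvfs.read_text().strip())
    return counts
```

On an affected node, sriov_vf_counts() would be expected to report 0 for
both X550 ports if no VFs are configured.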

While checking the private flags, I noticed one named mdd-disable-vf. I can
try turning it on after sending this to see whether it helps as a
mitigation, but because the crashes are intermittent it will take some time
to know whether the change restores stability. Current flags:

Private flags for enp33s0f0:
- legacy-rx     : off
- vf-ipsec      : off
- mdd-disable-vf: off
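
If we do try the flag, something like the following sketch could apply it
consistently across nodes. It relies on the ethtool options
--show-priv-flags and --set-priv-flags; the helper names are hypothetical,
the parser assumes the "name : on|off" layout shown above, and whether
enabling mdd-disable-vf actually avoids the crash is untested.

```python
import subprocess


def parse_priv_flags(text):
    """Parse 'ethtool --show-priv-flags' style output into {flag: bool}."""
    flags = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The header line ("Private flags for <iface>:") has no on/off value.
        if sep and value in ("on", "off"):
            # Tolerate a leading "- " as used in the listing in this mail.
            flags[name.strip().lstrip("- ").strip()] = value == "on"
    return flags


def set_priv_flag_cmd(iface, flag, enabled):
    """Build the ethtool argv that switches one private flag on or off."""
    return ["ethtool", "--set-priv-flags", iface, flag, "on" if enabled else "off"]


def ensure_priv_flag(iface, flag, enabled=True):
    """Set the flag only if it is not already in the wanted state (needs root)."""
    shown = subprocess.run(
        ["ethtool", "--show-priv-flags", iface],
        check=True, capture_output=True, text=True,
    ).stdout
    if parse_priv_flags(shown).get(flag) != enabled:
        subprocess.run(set_priv_flag_cmd(iface, flag, enabled), check=True)
```

For example, ensure_priv_flag("enp33s0f0", "mdd-disable-vf") would run
"ethtool --set-priv-flags enp33s0f0 mdd-disable-vf on" only when the flag
is currently off. The setting does not persist across reboots, so it would
have to be reapplied at boot.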

Is this a known issue in recent kernels affecting ixgbe/X550 devices, where
MDD events are triggered even though no SR-IOV VFs are configured? I could
not find anything recent in my searches, so I wanted to report the behavior
and ask whether there is anything I might be missing.

Thanks for your time,

Melissa
