I used Claude Opus 4.6 to develop a stress-test suite whose primary
goal is to break VF stability. The suite focuses on aggressive edge
cases, in particular cyclic VF migration between network namespaces
while VLAN filtering is active, a sequence known to trigger state
machine regressions. The following output shows the failures on an
unpatched iavf driver (prior to the 'fix VLAN filter state machine
races' patch):

# echo 8 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs
# ./tools/testing/selftests/drivers/net/iavf_vlan_state.sh
================================================
  iavf VLAN state machine test suite
================================================
  VF1:  enp65s0f0v0 (0000:41:01.0) -> iavf-t1-6502
  VF2:  enp65s0f0v1 (0000:41:01.1) -> iavf-t2-6502
  PF:   enp65s0f0np0 (0000:41:00.0)
  MAX:  8 user VLANs per VF
================================================
  PASS  state: basic add/remove
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
  FAIL  state: 8 VLANs add/remove  (only 7 created)
  PASS  state: VLAN persists across down/up
  PASS  state: 5 VLANs persist across down/up
  PASS  state: rapid add/del same VLAN x100
  PASS  state: add during remove (REMOVING race)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
  PASS  state: bulk 8 add then remove
  PASS  state: 20x rapid down/up with VLAN
  PASS  state: add VLAN while down
  PASS  state: remove VLAN while down
  PASS  state: down -> remove -> up
  PASS  state: add VLANs while down, verify all after up
  PASS  state: double add same VLAN (idempotent)
  PASS  state: double remove same VLAN
  PASS  state: interleaved add/remove different VIDs
  PASS  state: remove+re-add loop x50
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
  FAIL  state: stress 8 VLANs (fill to max)  (expected 8, got 7)
  PASS  state: VLAN VID 1 (common edge case)
  PASS  state: VLAN VID 4094 (max)
  PASS  state: concurrent VLAN adds (4 parallel)
  PASS  state: concurrent VLAN deletes (4 parallel)
  PASS  state: add/del storm (200 ops, 5 VIDs)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
  FAIL  state: over-limit VLAN rejected, existing survive  (fill: expected 8, got 7)
  PASS  reset: VLANs recover after VF PCI FLR
  PASS  reset: 5 VLANs recover after VF PCI FLR
  PASS  reset: rapid VF resets x5 with VLANs
  PASS  reset: VLANs survive PF link flap
  PASS  reset: 5 VLANs survive PF link flap
  PASS  reset: VLANs survive 3x PF link flap
  PASS  reset: VLANs survive PF PCI FLR
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
  FAIL  reset: all 8 VLANs recover after VF FLR  (VLAN 107 gone)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
  FAIL  reset: all 8 VLANs survive PF link flap  (VLAN 107 gone)
RTNETLINK answers: Input/output error
Cannot find device "enp65s0f0v0.107"
Cannot find device "enp65s0f0v0.107"
  FAIL  reset: all 8 VLANs survive PF PCI FLR  (VLAN 107 gone)
  PASS  reset: FLR during VLAN add/del (race)
  PASS  reset: VF driver unbind/bind cycle
  PASS  ping: basic VLAN traffic
  PASS  ping: 5 VLANs simultaneously
  PASS  ping: survives VF down/up
  PASS  ping: survives 10x rapid VF flap
  PASS  ping: survives VF PCI FLR
  PASS  ping: survives PF link flap
  PASS  ping: survives PF PCI FLR
  PASS  ping: stable while adding/removing other VLANs
  PASS  ping: all 3 VLANs work after down/up
  PASS  ping: parallel VLAN churn from both VFs
  PASS  ping: VLANs work after rapid add/del churn
  PASS  ping: VLANs survive repeated NS move cycle
  PASS  ping: all VLANs survive PF link flap
  PASS  ping: VLAN isolation (no cross-VLAN leakage)
  PASS  ping: traffic works with spoofchk enabled
  PASS  ping: port VLAN (PF-assigned pvid)
  PASS  dmesg: no call traces / BUGs / stalls

================================================
  PASS 46  |  FAIL 6  |  SKIP 0  |  TOTAL 52
================================================
  RESULT: FAIL  -- check dmesg


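All six failures share one symptom: with the limit at 8 user VLANs per
VF, adding VLAN 107 fails with "RTNETLINK answers: Input/output error",
so only 7 of the 8 VLANs are created. The underlying cause is a
breakdown in VLAN filter state synchronization between the VF and the
PF, which prevents the driver from maintaining a consistent hardware
state during rapid configuration cycles.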
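
To make the failing pattern concrete, here is a minimal sketch of a
fill-to-max check. It is not taken from the test suite: the helper name
fill_vlans, the IP_CMD override, and the starting VID of 100 are
assumptions for illustration (the log only shows that VLAN 107 is the
one that goes missing).

```shell
# Hypothetical helper, NOT part of iavf_vlan_state.sh.
# IP_CMD may be overridden (e.g. IP_CMD=true) for a dry run without root.
IP_CMD=${IP_CMD:-ip}

# fill_vlans DEV BASE_VID COUNT
# Tries to create COUNT VLAN sub-interfaces on DEV, starting at BASE_VID,
# and prints how many of the "ip link add" calls succeeded.
fill_vlans() (
    dev=$1
    vid=$2
    last=$(( $2 + $3 ))
    ok=0
    while [ "$vid" -lt "$last" ]; do
        if $IP_CMD link add link "$dev" name "$dev.$vid" \
                type vlan id "$vid" >/dev/null 2>&1; then
            ok=$((ok + 1))
        fi
        vid=$((vid + 1))
    done
    echo "$ok"
)
```

Run as root against a VF, e.g. "fill_vlans enp65s0f0v0 100 8". Judging
by the log above, an unpatched driver would be expected to report 7
rather than 8, but that depends on the VIDs the real suite uses.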

...................

Patched kernel:

# echo 8 > /sys/class/net/enp65s0f0np0/device/sriov_numvfs
# ./tools/testing/selftests/drivers/net/iavf_vlan_state.sh
================================================
  iavf VLAN state machine test suite
================================================
  VF1:  enp65s0f0v0 (0000:41:01.0) -> iavf-t1-6573
  VF2:  enp65s0f0v1 (0000:41:01.1) -> iavf-t2-6573
  PF:   enp65s0f0np0 (0000:41:00.0)
  MAX:  8 user VLANs per VF
================================================
  PASS  state: basic add/remove
  PASS  state: 8 VLANs add/remove
  PASS  state: VLAN persists across down/up
  PASS  state: 5 VLANs persist across down/up
  PASS  state: rapid add/del same VLAN x100
  PASS  state: add during remove (REMOVING race)
  PASS  state: bulk 8 add then remove
  PASS  state: 20x rapid down/up with VLAN
  PASS  state: add VLAN while down
  PASS  state: remove VLAN while down
  PASS  state: down -> remove -> up
  PASS  state: add VLANs while down, verify all after up
  PASS  state: double add same VLAN (idempotent)
  PASS  state: double remove same VLAN
  PASS  state: interleaved add/remove different VIDs
  PASS  state: remove+re-add loop x50
  PASS  state: stress 8 VLANs (fill to max)
  PASS  state: VLAN VID 1 (common edge case)
  PASS  state: VLAN VID 4094 (max)
  PASS  state: concurrent VLAN adds (4 parallel)
  PASS  state: concurrent VLAN deletes (4 parallel)
  PASS  state: add/del storm (200 ops, 5 VIDs)
  PASS  state: over-limit VLAN rejected, existing survive
  PASS  reset: VLANs recover after VF PCI FLR
  PASS  reset: 5 VLANs recover after VF PCI FLR
  PASS  reset: rapid VF resets x5 with VLANs
  PASS  reset: VLANs survive PF link flap
  PASS  reset: 5 VLANs survive PF link flap
  PASS  reset: VLANs survive 3x PF link flap
  PASS  reset: VLANs survive PF PCI FLR
  PASS  reset: all 8 VLANs recover after VF FLR
  PASS  reset: all 8 VLANs survive PF link flap
  PASS  reset: all 8 VLANs survive PF PCI FLR
  PASS  reset: FLR during VLAN add/del (race)
  PASS  reset: VF driver unbind/bind cycle
  PASS  ping: basic VLAN traffic
  PASS  ping: 5 VLANs simultaneously
  PASS  ping: survives VF down/up
  PASS  ping: survives 10x rapid VF flap
  PASS  ping: survives VF PCI FLR
  PASS  ping: survives PF link flap
  PASS  ping: survives PF PCI FLR
  PASS  ping: stable while adding/removing other VLANs
  PASS  ping: all 3 VLANs work after down/up
  PASS  ping: parallel VLAN churn from both VFs
  PASS  ping: VLANs work after rapid add/del churn
  PASS  ping: VLANs survive repeated NS move cycle
  PASS  ping: all VLANs survive PF link flap
  PASS  ping: VLAN isolation (no cross-VLAN leakage)
  PASS  ping: traffic works with spoofchk enabled
  PASS  ping: port VLAN (PF-assigned pvid)
  PASS  dmesg: no call traces / BUGs / stalls

================================================
  PASS 52  |  FAIL 0  |  SKIP 0  |  TOTAL 52
================================================
  RESULT: OK

Additionally, interface up/down with active VLAN filtering is
significantly faster. The previous bottleneck was a synchronous VLAN
filtering cycle (VF -> PF -> HW -> PF -> VF) that used the AdminQ for
per-VLAN updates and introduced substantial latency.
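
A rough way to compare up/down latency before and after the patch is
to time a few link flaps with VLANs configured. The sketch below is an
assumption-laden illustration, not how the numbers behind this mail
were obtained: time_flap and IP_CMD are hypothetical names, and
"date +%s%N" requires GNU date.

```shell
# Hypothetical helper, NOT part of iavf_vlan_state.sh.
# IP_CMD may be overridden (e.g. IP_CMD=true) for a dry run without root.
IP_CMD=${IP_CMD:-ip}

# time_flap DEV CYCLES
# Runs CYCLES down/up cycles on DEV and prints the elapsed wall-clock
# time in milliseconds. This only measures how long the "ip" commands
# take to return; any work the driver completes asynchronously
# afterwards is not captured.
time_flap() (
    dev=$1
    cycles=$2
    i=0
    start=$(date +%s%N)    # nanoseconds since the epoch (GNU date)
    while [ "$i" -lt "$cycles" ]; do
        $IP_CMD link set dev "$dev" down
        $IP_CMD link set dev "$dev" up
        i=$((i + 1))
    done
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
)
```

For example, "time_flap enp65s0f0v0 20" with 8 VLANs configured on the
VF, run once on each kernel, gives two numbers to compare.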

Test suite:

https://github.com/torvalds/linux/commit/5c60850c33da80a1c2497fb6bc31f956316197a9

Regards,

Petr

