I've re-executed the test plan on a Mellanox ConnectX-6 Dx (MT2892), as
there seems to be an issue with this hardware generation (see bug
#2020409, comment #11 and following). I can indeed reproduce that
failure: the devices are not set to "switchdev" mode and VF-LAG is
not activated:
ubuntu@romano:~$ sudo lshw -c network -businfo
Bus info          Device      Class    Description
============================================================
pci@0000:21:00.0  ens13f0np0  network  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:21:00.1  ens13f1np1  network  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:61:00.0  ens7f0      network  MT2892 Family [ConnectX-6 Dx]
pci@0000:61:00.1  ens7f1      network  MT2892 Family [ConnectX-6 Dx]
ubuntu@romano:~$ sudo devlink dev eswitch show pci/0000:61:00.0
kernel answers: Operation not supported
ubuntu@romano:~$ sudo devlink dev eswitch show pci/0000:61:00.1
kernel answers: Operation not supported
ubuntu@romano:~$ sudo apt-get install --install-recommends linux-generic-hwe-22.04
# reboot
ubuntu@romano:~$ uname -a
Linux romano 6.8.0-52-generic #53~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jan 15 19:18:46 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@romano:~$ sudo apt install -t jammy-proposed netplan.io
ubuntu@romano:~$ apt list *netplan*
Listing... Done
libnetplan-dev/jammy-proposed 0.107.1-3ubuntu0.22.04.2 amd64
libnetplan0/jammy-proposed,now 0.107.1-3ubuntu0.22.04.2 amd64 [installed,automatic]
netplan-generator/jammy-proposed,now 0.107.1-3ubuntu0.22.04.2 amd64 [installed,automatic]
netplan.io/jammy-proposed,now 0.107.1-3ubuntu0.22.04.2 amd64 [installed]
python3-netplan/jammy-proposed,now 0.107.1-3ubuntu0.22.04.2 amd64 [installed,automatic]
ubuntu@romano:~$ sudo cat /sys/kernel/debug/mlx5/0000:61:00.0/lag/state
disabled
ubuntu@romano:~$ sudo cat /sys/kernel/debug/mlx5/0000:61:00.1/lag/state
disabled
ubuntu@romano:~$ sudo netplan get
network:
  version: 2
  ethernets:
    ens13f0np0:
      match:
        macaddress: "84:16:0c:3d:63:ce"
      addresses:
      - "10.241.7.26/24"
      nameservers:
        addresses:
        - 10.239.8.12
        - 10.239.8.13
        - 10.239.8.11
        - 10.176.2.4
        - 10.176.2.2
        - 10.176.2.3
        search:
        - maas
        - dh1-j8-1.tor3-sqa-shared-maas.solutionsqa
        - dh1-j8-2.tor3-sqa-shared-maas.solutionsqa
        - dh1-j9-1.tor3-sqa-shared-maas.solutionsqa
        - dh1-j9-2.tor3-sqa-shared-maas.solutionsqa
      gateway4: 10.241.7.1
      set-name: "ens13f0np0"
      mtu: 1500
    ens13f1np1:
      match:
        macaddress: "84:16:0c:3d:63:cf"
      set-name: "ens13f1np1"
      mtu: 1500
    ens7f0:
      match:
        macaddress: "b8:3f:d2:2d:68:7e"
      optional: true
      set-name: "ens7f0"
      mtu: 1500
      virtual-function-count: 8
      embedded-switch-mode: "switchdev"
      delay-virtual-functions-rebind: true
    ens7f1:
      match:
        macaddress: "b8:3f:d2:2d:68:7f"
      set-name: "ens7f1"
      mtu: 1500
      virtual-function-count: 8
      embedded-switch-mode: "switchdev"
      delay-virtual-functions-rebind: true
  bonds:
    bond0:
      interfaces:
      - ens7f0
      - ens7f1
      parameters:
        mode: "active-backup"
# reboot
## FAILURE
ubuntu@romano:~$ sudo lshw -c network -businfo
Bus info          Device      Class    Description
============================================================
pci@0000:21:00.0  ens13f0np0  network  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:21:00.1  ens13f1np1  network  BCM57416 NetXtreme-E Dual-Media 10G RDMA Ethernet Controller
pci@0000:61:00.0  ens7f0      network  MT2892 Family [ConnectX-6 Dx]
pci@0000:61:00.1  ens7f1      network  MT2892 Family [ConnectX-6 Dx]
pci@0000:61:00.2  ens7f0v0    network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.3  ens7f0v1    network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.4  ens7f0v2    network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.5  ens7f0v3    network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.6  ens7f0v4    network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:00.7  ens7f0v5    network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.0  ens7f0v6    network  ConnectX Family mlx5Gen Virtual Function
pci@0000:61:01.1  ens7f0v7    network  ConnectX Family mlx5Gen Virtual Function
ubuntu@romano:~$ sudo cat /sys/kernel/debug/mlx5/0000:61:00.0/lag/state
disabled
ubuntu@romano:~$ sudo cat /sys/kernel/debug/mlx5/0000:61:00.1/lag/state
disabled
ubuntu@romano:~$ sudo devlink dev eswitch show pci/0000:61:00.0
pci/0000:61:00.0: mode legacy inline-mode none encap-mode basic
ubuntu@romano:~$ sudo devlink dev eswitch show pci/0000:61:00.1
pci/0000:61:00.1: mode legacy inline-mode none encap-mode basic
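For scripted verification, the `devlink dev eswitch show` lines above can be
split into key/value pairs so the mode can be compared against "switchdev".
This is only a minimal sketch: it assumes the output keeps the
"<device>: key value key value ..." layout shown here, and
`parse_eswitch_show` is a hypothetical helper name, not part of devlink or
Netplan.

```python
def parse_eswitch_show(line):
    """Split one 'devlink dev eswitch show' output line into (device, attrs).

    Assumes the layout '<device>: key value key value ...' as seen in the
    transcript; any other layout is not handled.
    """
    device, _, rest = line.strip().partition(": ")
    tokens = rest.split()
    # Pair up alternating tokens: mode/legacy, inline-mode/none, ...
    attrs = dict(zip(tokens[0::2], tokens[1::2]))
    return device, attrs


def is_switchdev(line):
    """Return True if the parsed line reports eswitch mode 'switchdev'."""
    _, attrs = parse_eswitch_show(line)
    return attrs.get("mode") == "switchdev"
```

Applied to the two lines above, `is_switchdev` would return False for both
devices, which matches the failure reported here.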
** Tags added: block-proposed-jammy
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1988018
Title:
[SRU][mlx5] Intermittent VF-LAG activation failure
Status in linux package in Ubuntu:
Fix Committed
Status in netplan.io package in Ubuntu:
Fix Released
Status in linux source package in Jammy:
Confirmed
Status in netplan.io source package in Jammy:
Fix Committed
Status in linux source package in Kinetic:
Won't Fix
Status in netplan.io source package in Kinetic:
Won't Fix
Status in linux source package in Mantic:
Won't Fix
Status in netplan.io source package in Mantic:
Won't Fix
Status in linux source package in Noble:
Fix Committed
Status in netplan.io source package in Noble:
Fix Released
Bug description:
[ Impact ]
Due to limitations in how Netplan handles SR-IOV devices, the VF-LAG
feature found on Mellanox NICs couldn't be used. Certain configuration steps
must happen in a very specific order, and Netplan fails to perform the setup
correctly.
Netplan must wait until the backend finishes adding interfaces to the bond
and the Mellanox driver reports the VF-LAG feature as "active" before binding
VFs to the driver.
See also https://bugs.launchpad.net/netplan/+bug/2083008
This problem is fixed by introducing proper ordering in the configuration
process and monitoring the driver state until it reports as ready (or the
wait times out).
This fix is available on Ubuntu 24.04.
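The "monitor the driver state until ready or timed out" step can be pictured
as a simple polling loop. The sketch below is illustrative only and is not
the code from the Netplan fix: the function name, the default timeout and
the poll interval are assumptions, and the only detail taken from this
report is that the state file reads "active" once VF-LAG is ready.

```python
import time
from pathlib import Path


def wait_for_lag_active(state_files, timeout=30.0, interval=0.5):
    """Poll each LAG state file until all read "active" or the timeout expires.

    state_files: paths such as /sys/kernel/debug/mlx5/<pci_addr>/lag/state.
    Returns True if every file reported "active" in time, False otherwise.
    The timeout and interval defaults are arbitrary illustration values.
    """
    deadline = time.monotonic() + timeout
    while True:
        states = []
        for path in state_files:
            try:
                states.append(Path(path).read_text().strip())
            except OSError:
                # File missing or unreadable: treat as "not ready yet".
                states.append("unavailable")
        if all(state == "active" for state in states):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Only after such a wait returns True would the VFs be rebound to the driver;
on a False result the caller has to decide whether to rebind anyway or
report an error.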
[ Test Plan ]
To reproduce the problem addressed by this SRU, one needs access to
specialized hardware (SR-IOV-capable Mellanox NICs).
The fix for the problem described above was already verified on Ubuntu 22.04
and solved the problem (more details at
https://bugs.launchpad.net/netplan/+bug/2083008).
We will work with Canonical's OpenStack team to verify the fix.
* detailed instructions on how to reproduce the bug
A configuration file that looks like the one below can be used
to test the fix.
After booting the system with this configuration, the Mellanox driver
should report the LAG state as "active" for all the devices.
It can be checked in the debugfs file:
/sys/kernel/debug/mlx5/{pci_addr}/lag/state
network:
  version: 2
  ethernets:
    ens4f0np0:
      virtual-function-count: 16
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true
    ens4f1np1:
      virtual-function-count: 16
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true
  bonds:
    bond0:
      interfaces:
      - ens4f0np0
      - ens4f1np1
      parameters:
        mode: active-backup
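The configuration names interfaces, while the debugfs state file is keyed by
PCI address. One way to bridge the two when checking the result is to follow
the /sys/class/net/<iface>/device symlink, whose target directory is named
after the PCI address. The helper names below are hypothetical and the
snippet is only a sketch for manual verification, not part of Netplan.

```python
import os


def pci_addr_for_interface(iface, sysfs_net="/sys/class/net"):
    """Return the PCI address behind a network interface.

    Follows the 'device' symlink of the interface in sysfs and uses the
    name of the directory it points to (for example 0000:61:00.0).
    """
    device_link = os.path.join(sysfs_net, iface, "device")
    if not os.path.exists(device_link):
        # Virtual interfaces such as bonds have no backing PCI device.
        raise FileNotFoundError(f"no PCI device link for interface {iface!r}")
    return os.path.basename(os.path.realpath(device_link))


def lag_state_file(iface, sysfs_net="/sys/class/net",
                   debugfs_mlx5="/sys/kernel/debug/mlx5"):
    """Build the path of the mlx5 LAG state file for an interface."""
    pci_addr = pci_addr_for_interface(iface, sysfs_net)
    return os.path.join(debugfs_mlx5, pci_addr, "lag", "state")
```

For example, `lag_state_file("ens4f0np0")` yields the path to read (as root,
since debugfs is normally root-only) for the first bond member.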
[ Where problems could occur ]
These changes should affect only SR-IOV related scenarios.
Undetected problems could cause Netplan to fail to configure the device
and Virtual Functions wouldn't be created anymore.
[ Other Info ]
Related work:
https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
https://github.com/canonical/netplan/pull/439
A PPA for Ubuntu 22.04 can be found here
https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sru
---- Original bug description ----
During system initialization there is a specific sequence that must be
followed to enable the use of hardware offload and VF-LAG.
Intermittently one may see that VF-LAG initialization fails:
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
Make sure all VFs are unbound prior to VF LAG activation or deactivation
This is caused by rebinding the driver before the VF-LAG is ready.
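To spot this failure automatically, the kernel log can be scanned for the
"Failed to activate VF LAG" message. The following is a rough sketch that
assumes each kernel message arrives on a single line in the format seen in
the log above; the regular expression and function name are illustrative
and not taken from any existing tool.

```python
import re

# Matches e.g. "mlx5_core 0000:08:00.0: ... Failed to activate VF LAG"
# and captures the PCI address of the affected device.
VF_LAG_FAILURE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-fA-F:.]+): .*Failed to activate VF LAG"
)


def devices_with_vf_lag_failure(log_lines):
    """Return the PCI addresses that logged a VF LAG activation failure.

    log_lines: iterable of kernel log lines, one message per line.
    Addresses are returned in order of first appearance, without duplicates.
    """
    failed = []
    for line in log_lines:
        match = VF_LAG_FAILURE.search(line)
        if match and match.group("pci") not in failed:
            failed.append(match.group("pci"))
    return failed
```

The lines could come, for instance, from the output of `dmesg` split on
newlines.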
A debugfs knob has recently been added to the driver [0], and we should
monitor it before attempting to rebind the driver:
$ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state
The kernel feature is available in the upcoming Kinetic 5.19 kernel
and we should probably backport it to the Jammy 5.15 kernel.
[0]: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900