** Also affects: charm-ovn-chassis
   Importance: Undecided
       Status: New
https://bugs.launchpad.net/bugs/2009594

Title:
  Mlx5 kworker blocked Kernel 5.19 (Jammy HWE)

Status in charm-ovn-chassis:
  New
Status in linux package in Ubuntu:
  Confirmed

Bug description:
  This is seen in particular with:

  * Charmed OpenStack with Jammy Yoga
  * 5.19.0-35-generic (linux-generic-hwe-22.04/jammy-updates)
  * a Mellanox ConnectX-6 card using the mlx5_core module
  * SR-IOV with VF-LAG for OVN hardware offloading

  During boot the servers quickly reach a very high load (around
  75-100), with every process relying on network communication through
  the Mellanox card either stuck or extremely slow. The kernel logs
  report kworkers blocked for more than 120 seconds.

  The number of SR-IOV VFs configured, both in the firmware and in the
  kernel, correlates strongly with the likelihood of hitting this bug:
  enabling more VFs greatly increases the risk. The hang does not occur
  on every boot, but with 32 VFs on each PF it occurs about 40% of the
  time. A cold reboot is required to recover the server.

  Looking at a quick sample of the trace, this seems to involve the
  mlx5 driver in the kernel directly:

  Mar 07 05:24:56 nova-1 kernel: INFO: task kworker/0:1:19 blocked for more than 120 seconds.
  Mar 07 05:24:56 nova-1 kernel: Tainted: P OE 5.19.0-35-generic #36~22.04.1-Ubuntu
  Mar 07 05:24:56 nova-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Mar 07 05:24:56 nova-1 kernel: task:kworker/0:1 state:D stack: 0 pid: 19 ppid: 2 flags:0x00004000
  Mar 07 05:24:56 nova-1 kernel: Workqueue: events work_for_cpu_fn
  Mar 07 05:24:56 nova-1 kernel: Call Trace:
  Mar 07 05:24:56 nova-1 kernel: <TASK>
  Mar 07 05:24:56 nova-1 kernel: __schedule+0x257/0x5d0
  Mar 07 05:24:56 nova-1 kernel: schedule+0x68/0x110
  Mar 07 05:24:56 nova-1 kernel: schedule_preempt_disabled+0x15/0x30
  Mar 07 05:24:56 nova-1 kernel: __mutex_lock.constprop.0+0x4f1/0x750
  Mar 07 05:24:56 nova-1 kernel: __mutex_lock_slowpath+0x13/0x20
  Mar 07 05:24:56 nova-1 kernel: mutex_lock+0x3e/0x50
  Mar 07 05:24:56 nova-1 kernel: mlx5_register_device+0x1c/0xb0 [mlx5_core]
  Mar 07 05:24:56 nova-1 kernel: mlx5_init_one+0xe4/0x110 [mlx5_core]
  Mar 07 05:24:56 nova-1 kernel: probe_one+0xcb/0x120 [mlx5_core]
  Mar 07 05:24:56 nova-1 kernel: local_pci_probe+0x4b/0x90
  Mar 07 05:24:56 nova-1 kernel: work_for_cpu_fn+0x1a/0x30
  Mar 07 05:24:56 nova-1 kernel: process_one_work+0x21f/0x400
  Mar 07 05:24:56 nova-1 kernel: worker_thread+0x200/0x3f0
  Mar 07 05:24:56 nova-1 kernel: ? rescuer_thread+0x3a0/0x3a0
  Mar 07 05:24:56 nova-1 kernel: kthread+0xee/0x120
  Mar 07 05:24:56 nova-1 kernel: ? kthread_complete_and_exit+0x20/0x20
  Mar 07 05:24:56 nova-1 kernel: ret_from_fork+0x22/0x30
  Mar 07 05:24:56 nova-1 kernel: </TASK>
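
  For reference, the preconditions can be set up as follows. This is a
  minimal sketch using the kernel's standard sysfs SR-IOV interface;
  the PF interface names (enp59s0f0, enp59s0f1) are examples, not the
  actual names from this deployment:

  # Enable 32 VFs on each PF via sysfs (PF names are illustrative;
  # substitute the local Mellanox PF interface names).
  echo 32 > /sys/class/net/enp59s0f0/device/sriov_numvfs
  echo 32 > /sys/class/net/enp59s0f1/device/sriov_numvfs

  # After the next boot, check whether kworkers got stuck again:
  dmesg | grep "blocked for more than 120 seconds"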