Somewhat similar behavior is seen on a recent 5.2 kernel without the fix (again with the options to ignore hardware errors/faults):

Aug 19 17:15:15 HOSTNAME kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Aug 19 17:15:15 HOSTNAME kernel: perf interrupt took too long (3222 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Aug 19 17:15:15 HOSTNAME kernel: TCP: request_sock_TCP: Possible SYN flooding on port 9300. Sending cookies. Check SNMP counters.
Aug 19 17:15:15 HOSTNAME kernel: Do you have a strange power saving mode enabled?
Aug 19 17:15:15 HOSTNAME kernel: Dazed and confused, but trying to continue
...
Aug 19 17:15:21 HOSTNAME kernel: NETDEV WATCHDOG: eno1 (bnx2x): transmit queue 0 timed out
...
Aug 19 17:15:21 HOSTNAME kernel: bnx2x: [bnx2x_sp_rtnl_task:10229(eno1)]Indicating link is down due to Tx-timeout
Aug 19 17:15:21 HOSTNAME kernel: bond0: link status down for interface eno1, disabling it in 200 ms
Aug 19 17:15:21 HOSTNAME kernel: bnx2x: [bnx2x_stats_comp:205(eno1)]timeout waiting for stats finished
Aug 19 17:15:21 HOSTNAME kernel: bnx2x: [bnx2x_stats_comp:205(eno1)]timeout waiting for stats finished
Aug 19 17:15:23 HOSTNAME kernel: bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[0]: txdata->tx_pkt_prod(4) != txdata->tx_pkt_cons(2)
Aug 19 17:15:25 HOSTNAME kernel: bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[8]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
...
Aug 19 17:17:14 HOSTNAME kernel: bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 0
Aug 19 17:17:14 HOSTNAME kernel: bnx2x: [bnx2x_del_all_macs:8499(eno1)]Failed to delete MACs: -16
Aug 19 17:17:14 HOSTNAME kernel: bnx2x: [bnx2x_chip_cleanup:9319(eno1)]Failed to schedule DEL commands for UC MACs list: -16
Aug 19 17:17:24 HOSTNAME kernel: bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 9
Aug 19 17:17:34 HOSTNAME kernel: bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 2
Aug 19 17:17:34 HOSTNAME kernel: bnx2x: [bnx2x_func_stop:9078(eno1)]FUNC_STOP ramrod failed. Running a dry transaction
Aug 19 17:17:35 HOSTNAME kernel: bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
Aug 19 17:17:35 HOSTNAME kernel: bnx2x: [bnx2x_write_dmae:598(eno1)]DMAE returned failure -1
Aug 19 17:17:35 HOSTNAME kernel: bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
...

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840789

Title:
  bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Released

Bug description:
  [Impact]

   * The bnx2x driver may cause hardware faults (leading to panic/reboot) and other behaviors such as transmit timeouts, after commit 3968d38917eb ("bnx2x: Fix Multi-Cos.") is introduced.

   * This issue has been observed by a user shortly after starting docker & kubelet, with adapters:
     - Broadcom NetXtreme II BCM57800 [14e4:168a] from Dell [1028:1f5c]
     - Broadcom NetXtreme II BCM57840 [14e4:16a1] from Dell [1028:1f79]

   * If options to ignore hardware faults are used (erst_disable=1 hest_disable=1 ghes.disable=1) the system doesn't panic/reboot and continues on to timeout on adapter stats, then transmit timeouts, spewing some adapter firmware dumps, but the network interface is non-functional.

   * The issue only happened when LLDP is enabled on the network switches, and crashdump shows the bnx2x driver is stuck/waits for firmware to complete the stop traffic command in LLDP handling. Workaround used is to disable LLDP in the network switches/ports.

   * Analysis of the driver and firmware dumps didn't help significantly towards finding the root cause.

   * Upstream/mainline recently reverted the patch, due to similar problem reports, while looking for the root cause/proper fix.

  [Test Case]

   * No reproducible test case found outside the user's systems/cluster, where it is enough to start docker & kubelet & wait.

   * The user verified test kernels for Xenial and Bionic - the problem does not happen; build-tested on Disco.

  [Regression Potential]

   * Users who significantly use/apply the non-default traffic class (tc) / class of service (cos) might possibly see performance changes (if any at all) in such applications, however that's unclear now.

   * This is a recent revert upstream (v5.3-rc'ish), so there's a chance things might change in this area.

   * Nonetheless, the patch is authored by the driver vendor, and made its way into stable kernels (e.g., v5.2.8 which made Eoan/19.10 recently).

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1840789/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
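
For anyone triaging other hosts for the same symptoms, here is a minimal sketch of a log scanner. It is not part of the bug report or of any existing tool: the function name count_symptoms and the symptom labels are hypothetical, and the patterns are simply derived from the bnx2x messages quoted in the log above, so they may need adjusting for other kernel versions.

```python
import re

# Patterns derived from the kernel messages quoted in this report.
# The labels and the helper name are hypothetical, for illustration only.
SYMPTOMS = {
    "tx_timeout": re.compile(
        r"NETDEV WATCHDOG: \S+ \(bnx2x\): transmit queue \d+ timed out"),
    "stats_timeout": re.compile(
        r"bnx2x: \[bnx2x_stats_comp:\d+\([^)]+\)\]timeout waiting for stats finished"),
    "state_timeout": re.compile(
        r"bnx2x: \[bnx2x_state_wait:\d+\([^)]+\)\]timeout waiting for state \d+"),
    "dmae_timeout": re.compile(
        r"bnx2x: \[bnx2x_issue_dmae_with_comp:\d+\([^)]+\)\]DMAE timeout!"),
}


def count_symptoms(lines):
    """Count how many log lines match each bnx2x symptom pattern."""
    counts = {name: 0 for name in SYMPTOMS}
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The function accepts any iterable of strings, so it could be fed an open log file or sys.stdin while piping in kernel log output. A non-zero count only indicates that similar messages are present; it does not confirm the same root cause.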

