Similar behavior occurs on a recent 5.2 kernel without the fix
(again booted with the options to ignore hardware errors/faults):

Aug 19 17:15:15 HOSTNAME kernel: Uhhuh. NMI received for unknown reason 21 on CPU 0.
Aug 19 17:15:15 HOSTNAME kernel: perf interrupt took too long (3222 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
Aug 19 17:15:15 HOSTNAME kernel: TCP: request_sock_TCP: Possible SYN flooding on port 9300. Sending cookies.  Check SNMP counters.
Aug 19 17:15:15 HOSTNAME kernel: Do you have a strange power saving mode enabled?
Aug 19 17:15:15 HOSTNAME kernel: Dazed and confused, but trying to continue
...
Aug 19 17:15:21 HOSTNAME kernel: NETDEV WATCHDOG: eno1 (bnx2x): transmit queue 0 timed out
...
Aug 19 17:15:21 HOSTNAME kernel: bnx2x: [bnx2x_sp_rtnl_task:10229(eno1)]Indicating link is down due to Tx-timeout
Aug 19 17:15:21 HOSTNAME kernel: bond0: link status down for interface eno1, disabling it in 200 ms
Aug 19 17:15:21 HOSTNAME kernel: bnx2x: [bnx2x_stats_comp:205(eno1)]timeout waiting for stats finished
Aug 19 17:15:21 HOSTNAME kernel: bnx2x: [bnx2x_stats_comp:205(eno1)]timeout waiting for stats finished
Aug 19 17:15:23 HOSTNAME kernel: bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[0]: txdata->tx_pkt_prod(4) != txdata->tx_pkt_cons(2)
Aug 19 17:15:25 HOSTNAME kernel: bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[8]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
...
Aug 19 17:17:14 HOSTNAME kernel: bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 0
Aug 19 17:17:14 HOSTNAME kernel: bnx2x: [bnx2x_del_all_macs:8499(eno1)]Failed to delete MACs: -16
Aug 19 17:17:14 HOSTNAME kernel: bnx2x: [bnx2x_chip_cleanup:9319(eno1)]Failed to schedule DEL commands for UC MACs list: -16
Aug 19 17:17:24 HOSTNAME kernel: bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 9
Aug 19 17:17:34 HOSTNAME kernel: bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 2
Aug 19 17:17:34 HOSTNAME kernel: bnx2x: [bnx2x_func_stop:9078(eno1)]FUNC_STOP ramrod failed. Running a dry transaction
Aug 19 17:17:35 HOSTNAME kernel: bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
Aug 19 17:17:35 HOSTNAME kernel: bnx2x: [bnx2x_write_dmae:598(eno1)]DMAE returned failure -1
Aug 19 17:17:35 HOSTNAME kernel: bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
...
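
For triaging a log like the one above, it can help to tally which bnx2x
code paths are reporting errors. The following is only a rough sketch:
the function names rejoin() and bnx2x_sites() are made up for
illustration, and the regular expressions assume the syslog timestamp
and the "[function:line(interface)]" prefix look exactly as they do in
this excerpt. It also tolerates records that a mail client has
hard-wrapped onto a second or third line.

```python
import re
from collections import Counter

# Assumption: records start with a syslog timestamp such as "Aug 19 17:15:21".
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} ")
# Assumption: bnx2x messages carry a "[function:line(interface)]" prefix.
BNX2X = re.compile(
    r"bnx2x: \[(?P<func>\w+):(?P<line>\d+)\((?P<iface>[^)]+)\)\](?P<msg>.*)"
)


def rejoin(lines):
    """Merge wrapped continuation lines back onto their log record."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line or line == "...":
            continue  # skip blank lines and elision markers
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            # No timestamp: treat as the continuation of the previous record.
            records[-1] = records[-1] + " " + line
    return records


def bnx2x_sites(records):
    """Count bnx2x messages per (function, interface) pair."""
    counts = Counter()
    for rec in records:
        m = BNX2X.search(rec)
        if m:
            counts[(m["func"], m["iface"])] += 1
    return counts


sample = [
    "Aug 19 17:15:21 HOSTNAME kernel: bnx2x: ",
    "[bnx2x_sp_rtnl_task:10229(eno1)]Indicating link is down due to Tx-timeout",
    "...",
    "Aug 19 17:15:21 HOSTNAME kernel: bnx2x: [bnx2x_stats_comp:205(eno1)]timeout ",
    "waiting for stats finished",
    "Aug 19 17:15:21 HOSTNAME kernel: bnx2x: [bnx2x_stats_comp:205(eno1)]timeout ",
    "waiting for stats finished",
]
for site, n in bnx2x_sites(rejoin(sample)).most_common():
    print(n, site)
```

Feeding it the full excerpt would show at a glance which driver
functions time out and on which interface; it says nothing about the
root cause.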

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1840789

Title:
  bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Released

Bug description:
  [Impact]

   * The bnx2x driver may cause hardware faults (leading to
     panic/reboot) and other misbehavior such as transmit
     timeouts since commit 3968d38917eb ("bnx2x: Fix Multi-Cos.")
     was introduced.

   * This issue has been observed by a user shortly
     after starting docker & kubelet, with adapters:
     - Broadcom NetXtreme II BCM57800 [14e4:168a] from Dell [1028:1f5c]
     - Broadcom NetXtreme II BCM57840 [14e4:16a1] from Dell [1028:1f79]

   * If options to ignore hardware faults are used
     (erst_disable=1 hest_disable=1 ghes.disable=1),
     the system doesn't panic/reboot; it goes on to
     time out on adapter stats, then hits transmit
     timeouts and spews some adapter firmware dumps,
     but the network interface is non-functional.

   * The issue only happens when LLDP is enabled
     on the network switches, and the crashdump shows
     the bnx2x driver stuck waiting for the firmware
     to complete the stop traffic command in LLDP
     handling. The workaround used is to disable LLDP
     on the network switches/ports.

   * Analysis of the driver and firmware dumps
     didn't help significantly towards finding
     the root cause.

   * Upstream/mainline recently reverted the
     patch, due to similar problem reports, while
     looking for the root cause/proper fix.
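
When reproducing the non-panicking variant described above, it may be
worth confirming that the running kernel was actually booted with the
three options. A minimal sketch, assuming a Linux system that exposes
/proc/cmdline; the option strings are copied as written in this report,
and the helper names missing_options() and read_cmdline() are
hypothetical:

```python
# Options quoted in the bug description; spelling copied from the report.
IGNORE_FAULT_OPTIONS = ("erst_disable=1", "hest_disable=1", "ghes.disable=1")


def missing_options(cmdline, wanted=IGNORE_FAULT_OPTIONS):
    """Return the wanted options that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [opt for opt in wanted if opt not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as fh:
        return fh.read()


# Example on a literal string, so it runs on any system:
example = "BOOT_IMAGE=/vmlinuz ro erst_disable=1 ghes.disable=1"
print(missing_options(example))
```

On the affected host one would call missing_options(read_cmdline())
instead of passing a literal string.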

  [Test Case]

   * No reproducible test case found outside
     the user's systems/cluster, where it is
     enough to start docker & kubelet & wait.

   * The user verified test kernels for Xenial
     and Bionic: the problem does not happen with
     them. Disco was build-tested only.

  [Regression Potential]

   * Users who make significant use of the non-default
     traffic class (tc) / class of service (cos) might
     see performance changes (if any at all) in such
     applications; however, that's unclear now.

   * This is a recent revert upstream (v5.3-rc'ish),
     so there's a chance things might change in this area.

   * Nonetheless, the revert patch is authored by the
     driver vendor, and has made its way into stable kernels
     (e.g., v5.2.8, which made Eoan/19.10 recently).
