For documentation purposes: on a recent Xenial/4.4 kernel, the
following kernel error log is seen when booting with the options
that ignore the hardware error/fault (which otherwise
panics/reboots the system).

[  113.658876] bnx2x: [bnx2x_stats_comp:205(eno1)]timeout waiting for stats finished
[  123.648066] bnx2x: [bnx2x_state_wait:310(eno1)]timeout waiting for state 6
[  123.730345] bnx2x: [bnx2x_dcbx_stop_hw_tx:443(eno1)]Unable to hold traffic for HW configuration
[  123.834443] bnx2x: [bnx2x_dcbx_stop_hw_tx:444(eno1)]driver assert
[  123.907439] bnx2x: [bnx2x_panic_dump:919(eno1)]begin crash dump -----------------
...
[  123.907662] bnx2x 0000:19:00.0 eno1: bc 7.14.11
[  123.907666] begin fw dump (mark 0x3c65c8)
[  123.908033] end of fw dump
[  123.908048] bnx2x: [bnx2x_mc_assert:751(eno1)]Chip Revision: everest3, FW Version: 7_12_30
[  123.908049] bnx2x: [bnx2x_panic_dump:1182(eno1)]end crash dump -----------------
[  128.701944] bnx2x: [bnx2x_func_state_change:6306(eno1)]timeout waiting for previous ramrod completion
[  128.701946] bnx2x: [bnx2x_dcbx_resume_hw_tx:469(eno1)]Unable to resume traffic after HW configuration
[  128.701946] bnx2x: [bnx2x_dcbx_resume_hw_tx:470(eno1)]driver assert
[  128.701948] bnx2x: [bnx2x_panic_dump:919(eno1)]begin crash dump -----------------
...
[  128.702170] bnx2x 0000:19:00.0 eno1: bc 7.14.11
[  128.702173] begin fw dump (mark 0x3c65c8)
[  128.702542] end of fw dump
[  128.702557] bnx2x: [bnx2x_mc_assert:751(eno1)]Chip Revision: everest3, FW Version: 7_12_30
[  128.702558] bnx2x: [bnx2x_panic_dump:1182(eno1)]end crash dump -----------------
[  128.702565] bnx2x: [bnx2x_sp_rtnl_task:10229(eno1)]Indicating link is down due to Tx-timeout
[  130.704628] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[0]: txdata->tx_pkt_prod(4) != txdata->tx_pkt_cons(3)
[  132.706968] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[8]: txdata->tx_pkt_prod(445) != txdata->tx_pkt_cons(443)
[  134.710090] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[16]: txdata->tx_pkt_prod(29) != txdata->tx_pkt_cons(25)
...
[  202.648543] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[7]: txdata->tx_pkt_prod(25) != txdata->tx_pkt_cons(24)
[  204.792441] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[23]: txdata->tx_pkt_prod(51) != txdata->tx_pkt_cons(46)
[  204.940151] bnx2x: [bnx2x_del_all_macs:8499(eno1)]Failed to delete MACs: -5
[  205.023453] bnx2x: [bnx2x_chip_cleanup:9319(eno1)]Failed to schedule DEL commands for UC MACs list: -5
[  206.351810] bnx2x: [bnx2x_func_stop:9078(eno1)]FUNC_STOP ramrod failed. Running a dry transaction
[  206.778590] bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
[  206.856735] bnx2x: [bnx2x_write_dmae:598(eno1)]DMAE returned failure -1
[  207.134674] bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
[  207.212785] bnx2x: [bnx2x_write_dmae:598(eno1)]DMAE returned failure -1
[  207.490725] bnx2x: [bnx2x_issue_dmae_with_comp:550(eno1)]DMAE timeout!
...
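
When reading a log like the one above, it can help to summarize how many
packets each TX queue still had outstanding at the time of the
bnx2x_clean_tx_queue timeouts. The following is a minimal sketch, not part
of the driver or of any Ubuntu tooling: the function name
pending_tx_per_queue is hypothetical, and the 16-bit wraparound of the
producer/consumer counters is an assumption made for illustration.

```python
import re

# Matches e.g. "queue[8]: txdata->tx_pkt_prod(445) != txdata->tx_pkt_cons(443)"
TX_STALL = re.compile(
    r"queue\[(\d+)\]:\s*txdata->tx_pkt_prod\((\d+)\)"
    r"\s*!=\s*txdata->tx_pkt_cons\((\d+)\)"
)


def pending_tx_per_queue(log_text):
    """Return {queue index: packets produced but not yet consumed}.

    Hypothetical helper; if a queue is reported more than once, the
    last report wins.
    """
    # Mail clients wrap long dmesg lines, so collapse all whitespace first;
    # a message split across two lines then still matches.
    flat = " ".join(log_text.split())
    pending = {}
    for queue, prod, cons in TX_STALL.findall(flat):
        # Assumption: the counters are 16-bit and may wrap around.
        pending[int(queue)] = (int(prod) - int(cons)) & 0xFFFF
    return pending


if __name__ == "__main__":
    # Two messages copied from the log above, one of them still wrapped.
    sample = """
[  130.704628] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for
queue[0]: txdata->tx_pkt_prod(4) != txdata->tx_pkt_cons(3)
[  132.706968] bnx2x: [bnx2x_clean_tx_queue:1204(eno1)]timeout waiting for queue[8]: txdata->tx_pkt_prod(445) != txdata->tx_pkt_cons(443)
"""
    for queue, count in sorted(pending_tx_per_queue(sample).items()):
        print(f"queue[{queue}]: {count} packet(s) pending")
```

Feeding it a saved dmesg capture (for instance the text read from a file)
gives one entry per queue that timed out.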

https://bugs.launchpad.net/bugs/1840789

Title:
  bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Bionic:
  In Progress
Status in linux source package in Disco:
  In Progress
Status in linux source package in Eoan:
  Fix Released

Bug description:
  [Impact]

   * The bnx2x driver may cause hardware faults (leading to
     panic/reboot) and other misbehavior, such as transmit
     timeouts, since commit 3968d38917eb ("bnx2x: Fix
     Multi-Cos.") was introduced.

   * This issue has been observed by a user shortly
     after starting docker & kubelet, with adapters:
     - Broadcom NetXtreme II BCM57800 [14e4:168a] from Dell [1028:1f5c]
     - Broadcom NetXtreme II BCM57840 [14e4:16a1] from Dell [1028:1f79]

   * If options to ignore hardware faults are used
     (erst_disable=1 hest_disable=1 ghes.disable=1),
     the system doesn't panic/reboot; instead it hits
     timeouts on adapter stats, then transmit timeouts,
     and emits adapter firmware dumps, leaving the
     network interface non-functional.

   * The issue happens only when LLDP is enabled
     on the network switches, and the crashdump shows
     the bnx2x driver stuck waiting for the firmware
     to complete the stop traffic command in LLDP
     handling. The workaround used is to disable LLDP
     on the network switches/ports.

   * Analysis of the driver and firmware dumps
     didn't help significantly towards finding
     the root cause.

   * Upstream/mainline recently reverted the patch
     due to similar problem reports, while still
     looking for the root cause/proper fix.
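
To confirm that a test system was actually booted with the three
fault-ignoring options listed above, the running kernel command line can
be checked. This is only a sketch: the helper name missing_fault_options
is hypothetical, and it assumes each option appears as its own
whitespace-separated token in /proc/cmdline.

```python
# The three options named in the [Impact] section above.
FAULT_IGNORE_OPTIONS = ("erst_disable=1", "hest_disable=1", "ghes.disable=1")


def missing_fault_options(cmdline):
    """Return the options from FAULT_IGNORE_OPTIONS absent from cmdline.

    Hypothetical helper; compares whole tokens, so "erst_disable=10"
    does not count as "erst_disable=1".
    """
    tokens = cmdline.split()
    return [opt for opt in FAULT_IGNORE_OPTIONS if opt not in tokens]


if __name__ == "__main__":
    try:
        # /proc/cmdline holds the command line the running kernel booted with.
        with open("/proc/cmdline") as f:
            missing = missing_fault_options(f.read())
    except OSError as err:
        print(f"cannot read /proc/cmdline: {err}")
    else:
        if missing:
            print("not set:", " ".join(missing))
        else:
            print("all three fault-ignoring options are set")
```

These options only keep the system from panicking so that the driver
messages can be collected; as noted above, the interface itself remains
non-functional.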

  [Test Case]

   * No reproducible test case found outside
     the user's systems/cluster, where it is
     enough to start docker & kubelet & wait.

   * The user verified test kernels for Xenial
     and Bionic (the problem does not happen);
     the Disco kernel was build-tested.

  [Regression Potential]

   * Users who make significant use of non-default
     traffic classes (tc) / classes of service (cos)
     might see performance changes (if any at all)
     in such applications; this is unclear for now.

   * This is a recent revert upstream (v5.3-rc'ish),
     so there is a chance things might change in this area.

   * Nonetheless, the patch is authored by the driver
     vendor, and made its way into stable kernels
     (e.g., v5.2.8 which made Eoan/19.10 recently).
