Hi,
On 03/18, Lei Yang wrote:
> On Tue, Mar 18, 2025 at 10:15 AM Jason Wang wrote:
> >
> > On Tue, Mar 18, 2025 at 9:55 AM Lei Yang wrote:
> > >
> > > Hi Jonah
> > >
> > > I tested this series with the vhost_vdpa device based on mellanox
> > > ConnectX-6 DX nic and hit the host kernel crash. This problem can be
> > > easier to reproduce under the hotplug/unplug device scenario.
> > > For the core dump messages please review the attachment.
> > > FW version:
> > > # flint -d 0000:0d:00.0 q |grep Version
> > > FW Version: 22.44.1036
> > > Product Version: 22.44.1036
> >
> > The trace looks more like a mlx5e driver bug other than vDPA?
> >
> > [ 3256.256707] Call Trace:
> > [ 3256.256708] <IRQ>
> > [ 3256.256709] ? show_trace_log_lvl+0x1c4/0x2df
> > [ 3256.256714] ? show_trace_log_lvl+0x1c4/0x2df
> > [ 3256.256715] ? __build_skb+0x4a/0x60
> > [ 3256.256719] ? __die_body.cold+0x8/0xd
> > [ 3256.256720] ? die_addr+0x39/0x60
> > [ 3256.256725] ? exc_general_protection+0x1ec/0x420
> > [ 3256.256729] ? asm_exc_general_protection+0x22/0x30
> > [ 3256.256736] ? __build_skb_around+0x8c/0xf0
> > [ 3256.256738] __build_skb+0x4a/0x60
> > [ 3256.256740] build_skb+0x11/0xa0
> > [ 3256.256743] mlx5e_skb_from_cqe_mpwrq_linear+0x156/0x280 [mlx5_core]
> > [ 3256.256872] mlx5e_handle_rx_cqe_mpwrq_rep+0xcb/0x1e0 [mlx5_core]
> > [ 3256.256964] mlx5e_rx_cq_process_basic_cqe_comp+0x39f/0x3c0 [mlx5_core]
> > [ 3256.257053] mlx5e_poll_rx_cq+0x3a/0xc0 [mlx5_core]
> > [ 3256.257139] mlx5e_napi_poll+0xe2/0x710 [mlx5_core]
> > [ 3256.257226] __napi_poll+0x29/0x170
> > [ 3256.257229] net_rx_action+0x29c/0x370
> > [ 3256.257231] handle_softirqs+0xce/0x270
> > [ 3256.257236] __irq_exit_rcu+0xa3/0xc0
> > [ 3256.257238] common_interrupt+0x80/0xa0
> >

The logs indicate that the mlx5_vdpa device was already in a bad FW
state long before this crash (health errors at ~445s, crash at ~3256s):

[ 445.937186] mlx5_core 0000:0d:00.0: poll_health:801:(pid 0): device's health
compromised - reached miss count
[ 445.937212] mlx5_core 0000:0d:00.0: print_health_info:431:(pid 0): Health
issue observed, firmware internal error, severity(3) ERROR:
[ 445.937221] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0):
assert_var[0] 0x0521945b
[ 445.937228] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0):
assert_var[1] 0x00000000
[ 445.937234] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0):
assert_var[2] 0x00000000
[ 445.937240] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0):
assert_var[3] 0x00000000
[ 445.937247] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0):
assert_var[4] 0x00000000
[ 445.937253] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0):
assert_var[5] 0x00000000
[ 445.937259] mlx5_core 0000:0d:00.0: print_health_info:438:(pid 0):
assert_exit_ptr 0x21492f38
[ 445.937265] mlx5_core 0000:0d:00.0: print_health_info:439:(pid 0):
assert_callra 0x2102d5f0
[ 445.937280] mlx5_core 0000:0d:00.0: print_health_info:440:(pid 0): fw_ver
22.44.1036
[ 445.937286] mlx5_core 0000:0d:00.0: print_health_info:442:(pid 0): time
1742220438
[ 445.937294] mlx5_core 0000:0d:00.0: print_health_info:443:(pid 0): hw_id
0x00000212
[ 445.937296] mlx5_core 0000:0d:00.0: print_health_info:444:(pid 0): rfr 0
[ 445.937297] mlx5_core 0000:0d:00.0: print_health_info:445:(pid 0): severity
3 (ERROR)
[ 445.937303] mlx5_core 0000:0d:00.0: print_health_info:446:(pid 0):
irisc_index 3
[ 445.937314] mlx5_core 0000:0d:00.0: print_health_info:447:(pid 0): synd 0x1:
firmware internal error
[ 445.937320] mlx5_core 0000:0d:00.0: print_health_info:449:(pid 0): ext_synd
0x8f7a
[ 445.937327] mlx5_core 0000:0d:00.0: print_health_info:450:(pid 0): raw
fw_ver 0x162c040c
[ 446.257192] mlx5_core 0000:0d:00.2: poll_health:801:(pid 0): device's health
compromised - reached miss count
[ 446.513190] mlx5_core 0000:0d:00.3: poll_health:801:(pid 0): device's health
compromised - reached miss count
[ 446.577190] mlx5_core 0000:0d:00.4: poll_health:801:(pid 0): device's health
compromised - reached miss count
[ 447.473192] mlx5_core 0000:0d:00.1: poll_health:801:(pid 0): device's health
compromised - reached miss count
[ 447.473215] mlx5_core 0000:0d:00.1: print_health_info:431:(pid 0): Health
issue observed, firmware internal error, severity(3) ERROR:
[ 447.473221] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0):
assert_var[0] 0x0521945b
[ 447.473228] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0):
assert_var[1] 0x00000000
[ 447.473234] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0):
assert_var[2] 0x00000000
[ 447.473240] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0):
assert_var[3] 0x00000000
[ 447.473246] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0):
assert_var[4] 0x00000000
[ 447.473252] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0):
assert_var[5] 0x00000000
[ 447.473259] mlx5_core 0000:0d:00.1: print_health_info:438:(pid 0):
assert_exit_ptr 0x21492f38
[ 447.473265] mlx5_core 0000:0d:00.1: print_health_info:439:(pid 0):
assert_callra 0x2102d5f0
[ 447.473279] mlx5_core 0000:0d:00.1: print_health_info:440:(pid 0): fw_ver
22.44.1036
[ 447.473286] mlx5_core 0000:0d:00.1: print_health_info:442:(pid 0): time
1742220438
[ 447.473292] mlx5_core 0000:0d:00.1: print_health_info:443:(pid 0): hw_id
0x00000212
[ 447.473293] mlx5_core 0000:0d:00.1: print_health_info:444:(pid 0): rfr 0
[ 447.473295] mlx5_core 0000:0d:00.1: print_health_info:445:(pid 0): severity
3 (ERROR)
[ 447.473300] mlx5_core 0000:0d:00.1: print_health_info:446:(pid 0):
irisc_index 3
[ 447.473311] mlx5_core 0000:0d:00.1: print_health_info:447:(pid 0): synd 0x1:
firmware internal error
[ 447.473317] mlx5_core 0000:0d:00.1: print_health_info:449:(pid 0): ext_synd
0x8f7a
[ 447.473323] mlx5_core 0000:0d:00.1: print_health_info:450:(pid 0): raw
fw_ver 0x162c040c
[ 447.729198] mlx5_core 0000:0d:00.5: poll_health:801:(pid 0): device's health
compromised - reached miss count

This is related to a ring translation error on the FW side.

Si-Wei has some relevant fixes in the latest kernel [0][1], and there is
an upcoming fix [2] which is pending merge. These might help. If they
don't, then there is likely something off with the mapping.

[0] 35025963326e ("vdpa/mlx5: Fix suboptimal range on iotlb iteration")
[1] 29ce8b8a4fa7 ("vdpa/mlx5: Fix PA offset with unaligned starting iotlb map")
[2] a6097e0a54a5 ("vdpa/mlx5: Fix oversized null mkey longer than 32bit")
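
To double-check the ordering on another dmesg capture, here is a minimal
Python sketch (not part of any existing tool) that compares the first
health report against the first crash trace. The helper names are made up
for illustration, and the match strings ("poll_health", "Call Trace") are
simply taken from the log in this thread, so adjust them as needed.

```python
import re

# Matches a dmesg-style line such as "[  445.937186] message".
TS_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def first_timestamp(lines, needle):
    """Return the earliest timestamp (in seconds) of a line containing needle."""
    hits = []
    for line in lines:
        # Drop email quote markers such as "> > " before matching.
        m = TS_LINE.match(line.lstrip("> ").rstrip())
        if m and needle in m.group(2):
            hits.append(float(m.group(1)))
    return min(hits) if hits else None


def health_error_precedes(lines, health_needle="poll_health",
                          crash_needle="Call Trace"):
    """True if the first health report is older than the first crash trace.

    Returns None when either marker is missing from the input.
    """
    health = first_timestamp(lines, health_needle)
    crash = first_timestamp(lines, crash_needle)
    if health is None or crash is None:
        return None
    return health < crash


# Two lines copied from this thread; wrapped continuation lines carry no
# timestamp and are simply skipped.
sample = [
    "> > [ 3256.256707] Call Trace:",
    "[  445.937186] mlx5_core 0000:0d:00.0: poll_health:801:(pid 0): device's health",
    "compromised - reached miss count",
]
print(health_error_precedes(sample))
```

It only looks at timestamps, so it says nothing about causality. It just
confirms that the FW was already unhealthy when the trace was hit.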

Thanks,
Dragos