Hi,

On 03/18, Lei Yang wrote:
> On Tue, Mar 18, 2025 at 10:15 AM Jason Wang <[email protected]> wrote:
> >
> > On Tue, Mar 18, 2025 at 9:55 AM Lei Yang <[email protected]> wrote:
> > >
> > > Hi Jonah
> > >
> > > I tested this series with the vhost_vdpa device based on mellanox
> > > ConnectX-6 DX nic and hit the host kernel crash. This problem can be
> > > easier to reproduce under the hotplug/unplug device scenario.
> > > For the core dump messages please review the attachment.
> > > FW version:
> > > #  flint -d 0000:0d:00.0 q |grep Version
> > > FW Version:            22.44.1036
> > > Product Version:       22.44.1036
> >
> > The trace looks more like a mlx5e driver bug other than vDPA?
> >
> > [ 3256.256707] Call Trace:
> > [ 3256.256708]  <IRQ>
> > [ 3256.256709]  ? show_trace_log_lvl+0x1c4/0x2df
> > [ 3256.256714]  ? show_trace_log_lvl+0x1c4/0x2df
> > [ 3256.256715]  ? __build_skb+0x4a/0x60
> > [ 3256.256719]  ? __die_body.cold+0x8/0xd
> > [ 3256.256720]  ? die_addr+0x39/0x60
> > [ 3256.256725]  ? exc_general_protection+0x1ec/0x420
> > [ 3256.256729]  ? asm_exc_general_protection+0x22/0x30
> > [ 3256.256736]  ? __build_skb_around+0x8c/0xf0
> > [ 3256.256738]  __build_skb+0x4a/0x60
> > [ 3256.256740]  build_skb+0x11/0xa0
> > [ 3256.256743]  mlx5e_skb_from_cqe_mpwrq_linear+0x156/0x280 [mlx5_core]
> > [ 3256.256872]  mlx5e_handle_rx_cqe_mpwrq_rep+0xcb/0x1e0 [mlx5_core]
> > [ 3256.256964]  mlx5e_rx_cq_process_basic_cqe_comp+0x39f/0x3c0 [mlx5_core]
> > [ 3256.257053]  mlx5e_poll_rx_cq+0x3a/0xc0 [mlx5_core]
> > [ 3256.257139]  mlx5e_napi_poll+0xe2/0x710 [mlx5_core]
> > [ 3256.257226]  __napi_poll+0x29/0x170
> > [ 3256.257229]  net_rx_action+0x29c/0x370
> > [ 3256.257231]  handle_softirqs+0xce/0x270
> > [ 3256.257236]  __irq_exit_rcu+0xa3/0xc0
> > [ 3256.257238]  common_interrupt+0x80/0xa0
> >
The logs indicate that the mlx5_vdpa device is already in a bad FW state
before this crash:

[  445.937186] mlx5_core 0000:0d:00.0: poll_health:801:(pid 0): device's health compromised - reached miss count
[  445.937212] mlx5_core 0000:0d:00.0: print_health_info:431:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[  445.937221] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0): assert_var[0] 0x0521945b
[  445.937228] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0): assert_var[1] 0x00000000
[  445.937234] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0): assert_var[2] 0x00000000
[  445.937240] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0): assert_var[3] 0x00000000
[  445.937247] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0): assert_var[4] 0x00000000
[  445.937253] mlx5_core 0000:0d:00.0: print_health_info:435:(pid 0): assert_var[5] 0x00000000
[  445.937259] mlx5_core 0000:0d:00.0: print_health_info:438:(pid 0): assert_exit_ptr 0x21492f38
[  445.937265] mlx5_core 0000:0d:00.0: print_health_info:439:(pid 0): assert_callra 0x2102d5f0
[  445.937280] mlx5_core 0000:0d:00.0: print_health_info:440:(pid 0): fw_ver 22.44.1036
[  445.937286] mlx5_core 0000:0d:00.0: print_health_info:442:(pid 0): time 1742220438
[  445.937294] mlx5_core 0000:0d:00.0: print_health_info:443:(pid 0): hw_id 0x00000212
[  445.937296] mlx5_core 0000:0d:00.0: print_health_info:444:(pid 0): rfr 0
[  445.937297] mlx5_core 0000:0d:00.0: print_health_info:445:(pid 0): severity 3 (ERROR)
[  445.937303] mlx5_core 0000:0d:00.0: print_health_info:446:(pid 0): irisc_index 3
[  445.937314] mlx5_core 0000:0d:00.0: print_health_info:447:(pid 0): synd 0x1: firmware internal error
[  445.937320] mlx5_core 0000:0d:00.0: print_health_info:449:(pid 0): ext_synd 0x8f7a
[  445.937327] mlx5_core 0000:0d:00.0: print_health_info:450:(pid 0): raw fw_ver 0x162c040c
[  446.257192] mlx5_core 0000:0d:00.2: poll_health:801:(pid 0): device's health compromised - reached miss count
[  446.513190] mlx5_core 0000:0d:00.3: poll_health:801:(pid 0): device's health compromised - reached miss count
[  446.577190] mlx5_core 0000:0d:00.4: poll_health:801:(pid 0): device's health compromised - reached miss count
[  447.473192] mlx5_core 0000:0d:00.1: poll_health:801:(pid 0): device's health compromised - reached miss count
[  447.473215] mlx5_core 0000:0d:00.1: print_health_info:431:(pid 0): Health issue observed, firmware internal error, severity(3) ERROR:
[  447.473221] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0): assert_var[0] 0x0521945b
[  447.473228] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0): assert_var[1] 0x00000000
[  447.473234] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0): assert_var[2] 0x00000000
[  447.473240] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0): assert_var[3] 0x00000000
[  447.473246] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0): assert_var[4] 0x00000000
[  447.473252] mlx5_core 0000:0d:00.1: print_health_info:435:(pid 0): assert_var[5] 0x00000000
[  447.473259] mlx5_core 0000:0d:00.1: print_health_info:438:(pid 0): assert_exit_ptr 0x21492f38
[  447.473265] mlx5_core 0000:0d:00.1: print_health_info:439:(pid 0): assert_callra 0x2102d5f0
[  447.473279] mlx5_core 0000:0d:00.1: print_health_info:440:(pid 0): fw_ver 22.44.1036
[  447.473286] mlx5_core 0000:0d:00.1: print_health_info:442:(pid 0): time 1742220438
[  447.473292] mlx5_core 0000:0d:00.1: print_health_info:443:(pid 0): hw_id 0x00000212
[  447.473293] mlx5_core 0000:0d:00.1: print_health_info:444:(pid 0): rfr 0
[  447.473295] mlx5_core 0000:0d:00.1: print_health_info:445:(pid 0): severity 3 (ERROR)
[  447.473300] mlx5_core 0000:0d:00.1: print_health_info:446:(pid 0): irisc_index 3
[  447.473311] mlx5_core 0000:0d:00.1: print_health_info:447:(pid 0): synd 0x1: firmware internal error
[  447.473317] mlx5_core 0000:0d:00.1: print_health_info:449:(pid 0): ext_synd 0x8f7a
[  447.473323] mlx5_core 0000:0d:00.1: print_health_info:450:(pid 0): raw fw_ver 0x162c040c
[  447.729198] mlx5_core 0000:0d:00.5: poll_health:801:(pid 0): device's health compromised - reached miss count
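As a side note, the FW health state can also be queried at runtime via
devlink, which may make it easier to catch the failing state during the
hotplug/unplug loop. A minimal sketch, assuming iproute2's devlink is
installed and reusing the PCI address from the logs above (the reporter
names can vary by kernel version):

```shell
#!/bin/sh
# PCI address taken from the logs above; adjust for the PF under test.
DEV=pci/0000:0d:00.0

if command -v devlink >/dev/null 2>&1; then
    # "fw_fatal" is the usual mlx5 fatal-error health reporter; the
    # name may differ on older kernels, so fall back gracefully.
    devlink health show "$DEV" reporter fw_fatal 2>/dev/null \
        || echo "reporter fw_fatal not available for $DEV"
else
    echo "devlink not installed on this host"
fi
```

`devlink health dump show "$DEV" reporter fw_fatal` would additionally
print the last recorded error context, if one was captured.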

This is related to a ring translation error on the FW side.

Si-Wei has some relevant fixes in the latest kernel [0][1], and there is
an upcoming fix [2] pending merge. These might help; otherwise,
something may be off with the mapping.

[0] 35025963326e ("vdpa/mlx5: Fix suboptimal range on iotlb iteration")
[1] 29ce8b8a4fa7 ("vdpa/mlx5: Fix PA offset with unaligned starting iotlb map")
[2] a6097e0a54a5 ("vdpa/mlx5: Fix oversized null mkey longer than 32bit")

Thanks,
Dragos
 
