On Fri, 2020-06-12 at 15:09 +0200, Niklas Schnelle wrote: > Hello Parav, Hello Saeed, > > our CI system for IBM Z Linux found a hang[0] when hot unplugging a > ConnectX-4 Lx VF from a z/VM guest > in Linus' current tree and added during the merge window. > Sadly it didn't happen all the time which sent me on the wrong path > for two full git bisects. > > Anyway, I've now tracked this down to the following commit which when > reverted > fixes the issue: > > 41798df9bfca ("net/mlx5: Drain wq first during PCI device removal") > > Looking at the diff I'd say the likely culprit is that before > the commit the order of calls was: > > mlx5_unregister_device(dev) > mlx5_drain_health_wq(dev) > > But with the commit it becomes > > mlx5_drain_health_wq(dev) > mlx5_unregister_device(dev) > > So without really knowing anything about these functions I would > guess that with the device still registered the drained > queue does not remain empty as new entries are added. > Does that sound plausible to you? >
I don't think it is related, maybe this is similar to some issues addressed lately by Shay's patches: https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-2-sae...@mellanox.com/ https://patchwork.ozlabs.org/project/netdev/patch/20200611224708.235014-3-sae...@mellanox.com/ net/mlx5: drain health workqueue in case of driver load error net/mlx5: Fix fatal error handling during device load > Best regards, > Niklas Schnelle > > [0] dmesg output: > [ 36.447442] mlx5_core 0000:00:00.0: poll_health:694:(pid 0): Fatal > error 1 detected > [ 36.447450] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[0] 0xffffffff > [ 36.447453] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[1] 0xffffffff > [ 36.447455] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[2] 0xffffffff > [ 36.447458] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[3] 0xffffffff > [ 36.447461] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[4] 0xffffffff > [ 36.447463] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0): > assert_exit_ptr 0xffffffff > [ 36.447467] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0): > assert_callra 0xffffffff > [ 36.447471] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): > fw_ver 65535.65535.65535 > [ 36.447475] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0): > hw_id 0xffffffff > [ 36.447478] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0): > irisc_index 255 > [ 36.447492] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): > synd 0xff: unrecognized error > [ 36.447621] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): > ext_synd 0xffff > [ 36.447624] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0): > raw fw_ver 0xffffffff > [ 36.447885] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, > anc=0, erc=0, rsid=0 > [ 36.447897] zpci: 0000:00:00.0: Event 0x303 reconfigured PCI > function 0x514 > [ 47.099220] mlx5_core 0000:00:00.0: poll_health:709:(pid 0): > device's health compromised - reached miss count > [ 47.099228] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[0] 0xffffffff > [ 47.099231] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[1] 0xffffffff > [ 47.099234] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[2] 0xffffffff > [ 47.099236] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[3] 0xffffffff > [ 47.099239] mlx5_core 0000:00:00.0: print_health_info:372:(pid 0): > assert_var[4] 0xffffffff > [ 47.099241] mlx5_core 0000:00:00.0: print_health_info:375:(pid 0): > assert_exit_ptr 0xffffffff > [ 47.099245] mlx5_core 0000:00:00.0: print_health_info:377:(pid 0): > assert_callra 0xffffffff > [ 47.099249] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): > fw_ver 65535.65535.65535 > [ 47.099253] mlx5_core 0000:00:00.0: print_health_info:381:(pid 0): > hw_id 0xffffffff > [ 47.099256] mlx5_core 0000:00:00.0: print_health_info:382:(pid 0): > irisc_index 255 > [ 47.099327] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): > synd 0xff: unrecognized error > [ 47.099329] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): > ext_synd 0xffff > [ 47.099330] mlx5_core 0000:00:00.0: print_health_info:387:(pid 0): > raw fw_ver 0xffffffff > [ 100.539106] mlx5_core 0000:00:00.0: wait_func:991:(pid 121): > 2RST_QP(0x50a) timeout. Will cause a leak of a command resource > [ 100.539118] infiniband mlx5_0: destroy_qp_common:2525:(pid 121): > mlx5_ib: modify QP 0x00072c to RESET failed > [ 141.499325] mlx5_core 0000:00:00.0: wait_func:991:(pid 32): > QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command > resource > [ 161.978957] mlx5_core 0000:00:00.0: wait_func:991:(pid 121): > DESTROY_QP(0x501) timeout. Will cause a leak of a command resource Shay's patches also came to avoid such command timeouts.