Hi, Vineeth. The patch in question is already included in the HWE kernel for Focal and in the LTS kernel for Jammy. Both are 5.15, FWIW, and the fix carries a different commit id there:
f0f894f0f636 net/mlx5: Fix handling of entry refcount when command is not issued to FW

The Focal LTS kernel is the only one that needs the backport. Let us know how testing goes at your end.

BR, pprincipeza

https://bugs.launchpad.net/bugs/2019011

Title:
  [UBUNTU 20.04] [HPS] Kernel panic with "refcount_t: underflow" in mlx5 driver

Status in linux package in Ubuntu:
  New
Status in linux source package in Focal:
  In Progress

Bug description:
  ---Problem Description---
  Kernel panic with "refcount_t: underflow" in the kernel log

  Contact Information = rijo...@ibm.com, vineeth.vija...@ibm.com

  ---uname output---
  5.4.0-128-generic

  Machine Type = s390x

  ---System Hang---
  Kernel panic and stack trace as below

  ---Debugger---
  A debugger is not configured

  Stack trace output:
  [Sat Apr 8 17:52:21 UTC 2023] Call Trace:
  [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a5939a286>] refcount_warn_saturate+0xce/0x140)
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f861e>] cmd_ent_put+0xe6/0xf8 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9b6a>] mlx5_cmd_comp_handler+0x102/0x4f0 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805f9f8a>] cmd_comp_notifier+0x32/0x48 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fe4fc>] mlx5_eq_async_int+0x13c/0x200 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf0c6>] notifier_call_chain+0x4e/0xa0
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecf17e>] atomic_notifier_call_chain+0x2e/0x40
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061318e>] mlx5_irq_int_handler+0x2e/0x48 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e960ce>] zpci_floating_irq_handler+0xe6/0x1b8
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a594f54a6>] do_airq_interrupt+0x96/0x130
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1455a>] __handle_irq_event_percpu+0x6a/0x250
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f14770>] handle_irq_event_percpu+0x30/0x78
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f1a0c8>] handle_percpu_irq+0x68/0xa0
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58f134d2>] generic_handle_irq+0x3a/0x60
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e30e42>] do_IRQ+0x7a/0xb0
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a408>] io_int_handler+0x12c/0x294
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e2752e>] enabled_wait+0x46/0xd8
  [Sat Apr 8 17:52:21 UTC 2023] ([<0000002a58e2752e>] enabled_wait+0x46/0xd8)
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e278aa>] arch_cpu_idle+0x2a/0x40
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee1536>] do_idle+0xee/0x1b0
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ee17a6>] cpu_startup_entry+0x36/0x40
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3ab38>] smp_init_secondary+0xc8/0xe8
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58e3a770>] smp_start_secondary+0x88/0x90
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10
  [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address:
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5939a286>] refcount_warn_saturate+0xce/0x140
  [Sat Apr 8 17:52:21 UTC 2023] ---[ end trace 6ec6f9c6f666ca2d ]---
  [Sat Apr 8 17:52:21 UTC 2023] specification exception: 0006 ilc:3 [#1] SMP
  [Sat Apr 8 17:52:21 UTC 2023] Modules linked in: sysdigcloud_probe(OE) vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache ebtable_broute binfmt_misc nbd veth xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_mangle ip6table_mangle ip6table_nat xt_mark sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo bonding s390_trng
  [Sat Apr 8 17:52:21 UTC 2023] vfio_ccw chsc_sch vfio_mdev mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe]
  [Sat Apr 8 17:52:21 UTC 2023] CPU: 12 PID: 83893 Comm: kworker/u400:91 Kdump: loaded Tainted: G W OE 5.4.0-128-generic #144~18.04.1-Ubuntu
  [Sat Apr 8 17:52:21 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR)
  [Sat Apr 8 17:52:21 UTC 2023] Workqueue: mlx5e mlx5e_update_stats_work [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] Krnl PSW : 0404d00180000000 0000002a58ec51d8 (queue_work_on+0x30/0x70)
  [Sat Apr 8 17:52:21 UTC 2023] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
  [Sat Apr 8 17:52:21 UTC 2023] Krnl GPRS: 1d721b7c57e8d7f5 0000000000000001 0000000000000200 0000006222a0e800
  [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000000000000000 0000000000000000 000003e016d23d08
  [Sat Apr 8 17:52:21 UTC 2023] 0000005a8e94a4e1 0000006287800120 0000003b8dbbd740 0700003b8dbbd740
  [Sat Apr 8 17:52:21 UTC 2023] 00000062690c6600 000003ff8069c808 000003e016d23ae0 000003e016d23aa8
  [Sat Apr 8 17:52:21 UTC 2023] Krnl Code: 0000002a58ec51c6: f0a0a7190001  srp  1817(11,%r10),1,0
                                           0000002a58ec51cc: e3b0f0a00004  lg   %r11,160(%r15)
                                          #0000002a58ec51d2: eb11400000e6  laog %r1,%r1,0(%r4)
                                          >0000002a58ec51d8: 07e0          bcr  14,%r0
                                           0000002a58ec51da: a7110001      tmll %r1,1
                                           0000002a58ec51de: a7840016      brc  8,0000002a58ec520a
                                           0000002a58ec51e2: a7280000      lhi  %r2,0
                                           0000002a58ec51e6: a7b20300      tmhh %r11,768
  [Sat Apr 8 17:52:21 UTC 2023] Call Trace:
  [Sat Apr 8 17:52:21 UTC 2023] ([<000003e016d23ae0>] 0x3e016d23ae0)
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fab0a>] cmd_exec+0x44a/0xab0 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805fb2b0>] mlx5_cmd_exec+0x40/0x70 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80657cb0>] mlx5_eswitch_get_vport_stats+0xb0/0x2a0 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff80644602>] mlx5e_rep_update_hw_counters+0x52/0xb8 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff8061f1ec>] mlx5e_update_stats_work+0x44/0x58 [mlx5_core]
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec56f4>] process_one_work+0x274/0x4d0
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ec5998>] worker_thread+0x48/0x560
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a58ecd014>] kthread+0x144/0x160
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a094>] ret_from_fork+0x28/0x30
  [Sat Apr 8 17:52:21 UTC 2023] [<0000002a5972a09c>] kernel_thread_starter+0x0/0x10
  [Sat Apr 8 17:52:21 UTC 2023] Last Breaking-Event-Address:
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060
  [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops

  Oops output:
  [Sat Apr 8 17:52:21 UTC 2023] [<000003ff805ee060>] 0x3ff805ee060
  [Sat Apr 8 17:52:21 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops
  ------------

  [Michael]
  I had a look into the dump from wdc3-qz1-sr2-rk086-s05:

  crash> sys
  The system was up and running since:
    UPTIME: 282 days, 02:16:10

  There are a lot of martian source messages again, like:
  [Sun Apr 16 11:09:28 UTC 2023] IPv4: martian source 11.44.203.141 from 11.21.133.2, on dev ipsec0
  [Sun Apr 16 11:09:28 UTC 2023] ll header: 00000000: ff ff ff ff ff ff fe ff 0b 15 85 02 08 06
  I hope that we get them suppressed soon.

  Then, at the following time, a first issue can be observed: an NFS timeout
  [Sun Apr 16 11:09:39 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out
  The reason could be a) the server, b) the network, or c) the local network adapter.

  Then, about 1:05 hours later, the first mlx5-related issues are reported:
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.1 if0200023AF58D: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:07.0 if02000440845F: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.2 p0v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.3 p0v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:00.6 p0v4: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.2 p1v0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5
  [Sun Apr 16 12:15:50 UTC 2023] mlx5_core 0001:00:08.3 p1v1: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -5

  Then, about 15 minutes later, the NFS code performs a panic_on_oops:
  [Sun Apr 16 12:32:34 UTC 2023] nfs: server ccistorwdc0751-sec-fz.service.softlayer.com not responding, timed out
  [Sun Apr 16 12:34:10 UTC 2023] Unable to handle kernel pointer dereference in virtual kernel address space
  [Sun Apr 16 12:34:10 UTC 2023] Failing address: 0000809f00008000 TEID: 0000809f00008803
  [Sun Apr 16 12:34:10 UTC 2023] Fault in home space mode while using kernel ASCE.
  [Sun Apr 16 12:34:10 UTC 2023] AS:00000047431f4007 R3:0000000000000024
  [Sun Apr 16 12:34:10 UTC 2023] Oops: 0038 ilc:3 [#1] SMP
  [Sun Apr 16 12:34:10 UTC 2023] Modules linked in: sysdigcloud_probe(OE) binfmt_misc nbd vhost_net vhost macvtap macvlan tap rpcsec_gss_krb5 auth_rpcgss nfsv3 nfs_acl nfs lockd grace fscache xt_statistic ipt_REJECT nf_reject_ipv4 ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs ip6table_mangle ip6table_nat ebt_redirect ebt_ip ebtable_broute sunrpc lcs ctcm fsm zfcp scsi_transport_fc dasd_fba_mod dasd_eckd_mod dasd_mod nf_log_ipv6 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp xt_multiport xt_set ip_set_hash_net ip_set_hash_ip ip_set tcp_diag inet_diag xt_comment xt_nat act_gact iptable_mangle xt_mark veth sch_multiq act_mirred act_pedit act_tunnel_key cls_flower act_police cls_u32 vxlan ip6_udp_tunnel udp_tunnel dummy sch_ingress mlx5_ib ib_uverbs ib_core mlx5_core tls mlxfw ptp pps_core xt_MASQUERADE iptable_nat xt_addrtype xt_conntrack br_netfilter bridge stp llc aufs ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter xfrm_user xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 af_key xfrm_algo
  [Sun Apr 16 12:34:10 UTC 2023] s390_trng vfio_ccw vfio_mdev chsc_sch mdev vfio_iommu_type1 eadm_sch vfio sch_fq_codel ip_tables x_tables overlay raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 linear nf_tables nf_nat nf_conntrack_netlink nfnetlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c qeth_l3 qeth_l2 pkey zcrypt crc32_vx_s390 ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common qeth ccwgroup qdio scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [last unloaded: sysdigcloud_probe]
  [Sun Apr 16 12:34:10 UTC 2023] CPU: 4 PID: 32942 Comm: kubelet Kdump: loaded Tainted: G W OE 5.4.0-110-generic #124~18.04.1+hf334332v20220521b1-Ubuntu
  [Sun Apr 16 12:34:10 UTC 2023] Hardware name: IBM 8562 GT2 A00 (LPAR)
  [Sun Apr 16 12:34:10 UTC 2023] Krnl PSW : 0704f00180000000 000003ff8076304a (call_bind+0x3a/0xf8 [sunrpc])
  [Sun Apr 16 12:34:10 UTC 2023] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
  [Sun Apr 16 12:34:10 UTC 2023] Krnl GPRS: 00000000000001dc 0000005d16d22400 00000041b9826500 000003e008637ad8
  [Sun Apr 16 12:34:10 UTC 2023] 000003ff807794d6 0000004742e35898 0000000000000000 00000041b9826537
  [Sun Apr 16 12:34:10 UTC 2023] 000003ff807ae63c 000003ff80763010 0000809f0000809f 00000041b9826500
  [Sun Apr 16 12:34:10 UTC 2023] 00000015a0c80000 000003ff807a1d80 000003e008637a80 000003e008637a48
  [Sun Apr 16 12:34:10 UTC 2023] Krnl Code: 000003ff8076303a: a7840041      brc  8,000003ff807630bc
                                            000003ff8076303e: e31020c00004  lg   %r1,192(%r2)
                                           #000003ff80763044: e3a010000004  lg   %r10,0(%r1)
                                           >000003ff8076304a: e310a4070090  llgc %r1,1031(%r10)
                                            000003ff80763050: a7110010      tmll %r1,16
                                            000003ff80763054: a7740025      brc  7,000003ff8076309e
                                            000003ff80763058: c418ffffe7d8  lgrl %r1,000003ff80760008
                                            000003ff8076305e: 91021003      tm   3(%r1),2
  [Sun Apr 16 12:34:10 UTC 2023] Call Trace:
  [Sun Apr 16 12:34:10 UTC 2023] ([<0000000000000000>] 0x0)
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779454>] __rpc_execute+0x8c/0x488 [sunrpc]
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779df2>] rpc_execute+0x8a/0x128 [sunrpc]
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766d62>] rpc_run_task+0x132/0x180 [sunrpc]
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80766e00>] rpc_call_sync+0x50/0xa0 [sunrpc]
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80360e40>] nfs3_rpc_wrapper.constprop.12+0x48/0xe0 [nfsv3]
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80361c5e>] nfs3_proc_getattr+0x6e/0xc8 [nfsv3]
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaeaa8>] __nfs_revalidate_inode+0x158/0x3b0 [nfs]
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80aaef9c>] nfs_getattr+0x1bc/0x388 [nfs]
  [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161032>] vfs_statx+0xaa/0xf8
  [Sun Apr 16 12:34:10 UTC 2023] [<0000004742161798>] __do_sys_newstat+0x38/0x60
  [Sun Apr 16 12:34:10 UTC 2023] [<000000474277e802>] system_call+0x2a6/0x2c8
  [Sun Apr 16 12:34:10 UTC 2023] Last Breaking-Event-Address:
  [Sun Apr 16 12:34:10 UTC 2023] [<000003ff80779452>] __rpc_execute+0x8a/0x488 [sunrpc]
  [Sun Apr 16 12:34:10 UTC 2023] Kernel panic - not syncing: Fatal exception: panic_on_oops

  The network interfaces p0 and p1 are missing:

  crash> net | grep -P "p0 |p1 "
  5b726fa000 macvtap0

  It looks like the p0/p1 network interfaces have been lost, but no recovery was attempted: there are no related recovery messages from the mlx5 kernel module. The kernel finally dumps in the area of the NFS/RPC code.

  That would be the related upstream commit:
  aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW

  ----
  [Niklas]
  I agree, that commit does sound like it could be the fix for exactly this issue. I checked the kernel tree at the tag Ubuntu-5.4.0-128.144, and it does not appear to have this fix. If I read things correctly, this is again an issue that may occur during a recovery, when the PCI device is isolated and thus doesn't respond. So it likely won't help with not losing the interface, but it does sound like it could solve the kernel crash/refcount warning.

  ====================================================================================================
  Summary: This patch (aaf2e65cac7f) appears to be missing in 20.04 and could be the reason for the crash. We would like to backport it to 20.04, 20.04 HWE, 22.04, and 22.04 HWE.

  aaf2e65cac7f net/mlx5: Fix handling of entry refcount when command is not issued to FW
  https://lore.kernel.org/netdev/20221122022559.89459-6-sa...@kernel.org/
  ====================================================================================================
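For anyone following along, below is a minimal userspace C sketch of the bug pattern that the fix's subject line and the first trace suggest: a reference on the command entry is taken only when the command actually reaches the firmware, while a forced completion (run during recovery, when the device is isolated and unresponsive) drops a reference unconditionally, so the count underflows. This is an illustrative sketch only, not the actual mlx5 driver code; cmd_entry, ent_get, ent_put, submit_buggy, and submit_fixed are hypothetical stand-ins for the driver's command entry and its cmd_ent_get()/cmd_ent_put() helpers.

/*
 * Illustrative sketch of an asymmetric refcount leading to
 * "refcount_t: underflow"; NOT the actual mlx5 driver code.
 */
#include <stdbool.h>
#include <stdio.h>

struct cmd_entry {
	int refcount; /* stands in for the kernel's refcount_t */
};

static void ent_get(struct cmd_entry *ent)
{
	ent->refcount++;
}

static void ent_put(struct cmd_entry *ent)
{
	if (--ent->refcount < 0)
		puts("refcount_t: underflow; use-after-free."); /* the observed warning */
	/* in a real driver, reaching zero would free the entry */
}

/*
 * Buggy pattern: the reference meant for the completion handler is
 * only taken when the command actually reaches the firmware.
 */
static void submit_buggy(struct cmd_entry *ent, bool fw_reachable)
{
	if (!fw_reachable)
		return; /* bails out WITHOUT taking a reference */
	ent_get(ent); /* reference owned by the completion handler */
}

/*
 * Fixed pattern: get/put are symmetric for all flows, so a completion
 * never drops a reference that was never taken.
 */
static void submit_fixed(struct cmd_entry *ent, bool fw_reachable)
{
	ent_get(ent); /* always pair the get with a completion put */
	if (!fw_reachable)
		ent_put(ent); /* complete locally, dropping that reference */
}

int main(void)
{
	struct cmd_entry buggy = { .refcount = 1 }; /* submitter's reference */
	struct cmd_entry fixed = { .refcount = 1 };

	/* Device isolated during recovery: command is never issued to FW. */
	submit_buggy(&buggy, false);
	ent_put(&buggy); /* forced completion puts anyway -> 0 */
	ent_put(&buggy); /* submitter's final put -> -1: underflow, as in the trace */

	submit_fixed(&fixed, false); /* get then local put -> balanced */
	ent_put(&fixed);             /* submitter's final put -> 0, freed cleanly */
	return 0;
}

Making the get/put symmetric regardless of whether the command is issued is what the subject of aaf2e65cac7f describes, and it would explain the cmd_ent_put() frame under refcount_warn_saturate() at the top of the first trace.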