** Description changed:

- * Explain the bug(s)
  We've found a regression in release Ubuntu-bluefield-5.15.0-1046.48, where the kernel crashes due to a NULL pointer dereference.

  * Regression: Yes
  Last working version: Ubuntu-bluefield-5.15.0-1045.47

- There is a huge difference between the two
- % git lo Ubuntu-bluefield-5.15.0-1045.47..Ubuntu-bluefield-5.15.0-1046.48 | wc -l
- 1434
+ There is a huge difference between the two
  % git lo Ubuntu-bluefield-5.15.0-1045.47..Ubuntu-bluefield-5.15.0-1046.48 net/ | wc -l
- 240
+ 240
  % git lo Ubuntu-bluefield-5.15.0-1045.47..Ubuntu-bluefield-5.15.0-1046.48 net/netfilter | wc -l
- 30
+ 30
  % git lo -Gflow_offload Ubuntu-bluefield-5.15.0-1045.47..Ubuntu-bluefield-5.15.0-1046.48 net/netfilter
  94679d661b0a net/sched: act_ct: Fix promotion of offloaded unreplied tuple
  fd360d6e23da netfilter: flowtable: cache info of last offload
  2821d27406fd netfilter: flowtable: allow unidirectional rules

  I'd guess the bug is related to one of the 3 commits above
- * How to test
+
+ * compare with last release Ubuntu-bluefield-5.15.0-1042.44
+
+ % git lo Ubuntu-bluefield-5.15.0-1042.44
+ f857e8400551 (tag: Ubuntu-bluefield-5.15.0-1042.44) UBUNTU: Ubuntu-bluefield-5.15.0-1042.44
+ ...
+ 5136d51e6602 genetlink: fix single op policy dump when do is present
+ fbb97233eb24 net: openvswitch: add missing .resv_start_op
+ ...
+ 18b4e928e9ed net/sched: flower: Add lock protection when remove filter handle
+ ...
+ 7a92e980a3ab genetlink: allow families to use split ops directly
+ * How to test
+
+ * how to reproduce
  Basic test flow:
  DPU --- DPU
  1. Create several VFs on each DPU, assign IP
  2. Create LAG (bond0)
  3.
  create OVS bridge on DPU, assign bond0, PF rep, and VFs to bridge
- # ovs-vsctl add-br ovs0_602
- # ip link set dev bond0 up
- # ovs-vsctl add-port ovs0_602 bond0
- # ovs-vsctl add-port ovs0_602 pf0hpf
- # ovs-vsctl add-port ovs0_602 pf0vf0
- # ovs-vsctl add-port ovs0_602 pf0vf1
- # ovs-vsctl add-port ovs0_602 pf0vf2
- # ovs-vsctl add-port ovs0_602 pf0vf3
+ # ovs-vsctl add-br ovs0_602
+ # ip link set dev bond0 up
+ # ovs-vsctl add-port ovs0_602 bond0
+ # ovs-vsctl add-port ovs0_602 pf0hpf
+ # ovs-vsctl add-port ovs0_602 pf0vf0
+ # ovs-vsctl add-port ovs0_602 pf0vf1
+ # ovs-vsctl add-port ovs0_602 pf0vf2
+ # ovs-vsctl add-port ovs0_602 pf0vf3
- # ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
- # systemctl restart openvswitch-switch
+ # ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
+ # systemctl restart openvswitch-switch

  4. create the following CT rules:
  # ovs-ofctl del-flows ovs0_602
  # ovs-ofctl add-flow ovs0_602 "table=0,icmp,action=normal"
  # ovs-ofctl add-flow ovs0_602 "table=0,arp,action=normal"
  # ovs-ofctl add-flow ovs0_602 "table=0,ip,ct_state=-trk,action=ct(table=1)"
  # ovs-ofctl add-flow ovs0_602 "table=1,priority=1,ip,ct_state=+trk+new,action=ct(commit),normal"
  # ovs-ofctl add-flow ovs0_602 "table=1,priority=1,ip,ct_state=+trk+est,action=normal"

  Ping PFs and VFs to verify connectivity before starting to send traffic

- * Kernel crash log
  [ 1948.480916] Bluefield ct offload: add wq coremask 80, del wq coremask 40
  [ 1948.979246] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000b40
  [ 1948.988048] Mem abort info:
  [ 1948.990829] ESR = 0x0000000096000004
  [ 1948.994569] EC = 0x25: DABT (current EL), IL = 32 bits
  [ 1949.020420] SET = 0, FnV = 0
  [ 1949.023464] EA = 0, S1PTW = 0
  [ 1949.026594] FSC = 0x04: level 0 translation fault
  [ 1949.042381] Data abort info:
  [ 1949.045252] ISV = 0, ISS = 0x00000004
  [ 1949.055747] CM = 0, WnR = 0
  [ 1949.058709] user pgtable: 4k pages, 48-bit VAs, pgdp=00000001194ad000
  [ 1949.074503] [0000000000000b40] pgd=0000000000000000, p4d=0000000000000000
  [ 1949.081290] Internal error: Oops: 0000000096000004 [#1] SMP
  [ 1949.099070] Modules linked in: act_ct(E) nf_flow_table(E) act_skbedit(E) act_mirred(E) cls_matchall(E) act_gact(E) cls_flower(E) sch_ingress(E) nfnetlink_cttimeout(E) bonding(E) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) mlxdevm(OE) ib_uverbs(OE) psample(E) tls(E) ib_core(OE) ipmb_host(E) sbsa_gwdt(E) openvswitch(E) nsh(E) nf_conncount(E) ipmi_ssif(E) xfrm_interface(E) xfrm6_tunnel(E) tunnel6(E) tunnel4(E) xfrm_user(E) xfrm_algo(E) nvme_fabrics(OE) tpm_ftpm_tee(E) mst_pciconf(OE) ipmi_devintf(E) ipmi_msghandler(E) ipmb_dev_int(E) 8021q(E) garp(E) stp(E) mrp(E) llc(E) overlay(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nft_counter(E) xt_tcpmss(E) xt_NFLOG(E) nfnetlink_log(E) xt_recent(E) xt_hashlimit(E) xt_state(E) xt_conntrack(E) xt_mark(E) xt_comment(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) nft_compat(E) sunrpc(E) binfmt_misc(E) nf_tables(E) nfnetlink(E) nls_iso8859_1(E) optee(E) uio_pdrv_genirq(E) uio(E) tee(E)
  [ 1949.099135] mlxbf_pmc(E) mlxbf_pka(E) sch_fq_codel(E) dm_multipath(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) drm(E) ip_tables(E) x_tables(E) virtio_net(E) net_failover(E) failover(E) nvme(OE)
  gpio_mlxbf3(E) crct10dif_ce(E) ghash_ce(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) vitesse(E) nvme_core(OE) mlx_compat(OE) sdhci_of_dwcmshc(E) sdhci_pltfm(E) sdhci(E) i2c_mlxbf(E) mlxbf_gige(E) mlxbf_bootctl(E) pinctrl_mlxbf3(E) mlxbf_tmfifo(E) pwr_mlxbf(E) autofs4(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) [last unloaded: ib_core]
  [ 1949.234245] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G OE 5.15.0-1046-bluefield #48-Ubuntu
  [ 1949.243706] Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.5.2.13183 Jun 17 2024
  [ 1949.256548] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [ 1949.263492] pc : flow_offload_queue_work+0x28/0xb0 [nf_flow_table]
  [ 1949.269661] lr : nf_flow_offload_add+0x24/0x30 [nf_flow_table]
  [ 1949.275478] sp : ffff8000080737d0
  [ 1949.278776] x29: ffff8000080737d0 x28: 0000000000000000 x27: ffffc941a7cbccc0
  [ 1949.285894] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000001
  [ 1949.293012] x23: 0000000000000000 x22: ffffc941a7726000 x21: ffff0000861ece50
  [ 1949.300129] x20: ffff0000861ece40 x19: ffff000095cb5000 x18: 0000000000000000
  [ 1949.307246] x17: ffff36c6350db000 x16: ffffc941a527e890 x15: 0000000000000000
  [ 1949.314363] x14: ffffc941a776d0b0 x13: 0000000000000000 x12: 0000000000000032
  [ 1949.321480] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc9417f432674
  [ 1949.328598] x8 : ffff0000cb1ab580 x7 : 0000000000000000 x6 : 000000000000003f
  [ 1949.335715] x5 : 0000000000000040 x4 : ffff8000080737d0 x3 : 0000000fffffffe0
  [ 1949.342832] x2 : ffff0000cb1ab528 x1 : 0000000000000000 x0 : 0000000000000000
  [ 1949.349950] Call trace:
  [ 1949.352382] flow_offload_queue_work+0x28/0xb0 [nf_flow_table]
  [ 1949.358199] nf_flow_offload_add+0x24/0x30 [nf_flow_table]
  [ 1949.363668] flow_offload_add+0x138/0x1e0 [nf_flow_table]
  [ 1949.369051] tcf_ct_flow_table_add+0x110/0x160 [act_ct]
  [ 1949.374262] tcf_ct_act+0x924/0xb6c [act_ct]
  [ 1949.378516] tcf_action_exec+0xb4/0x1f0
  [ 1949.382342] __tcf_classify+0xd8/0x220
  [ 1949.386077] tcf_classify+0xa0/0x240
  [ 1949.389637] sch_handle_ingress.constprop.0+0xd4/0x23c
  [ 1949.394760] __netif_receive_skb_core.constprop.0+0x494/0x8d0
  [ 1949.400489] __netif_receive_skb_list_core+0xf0/0x214
  [ 1949.405524] netif_receive_skb_list_internal+0x198/0x2ac
  [ 1949.410819] napi_complete_done+0x70/0x1ec
  [ 1949.414899] mlx5e_napi_poll+0x15c/0x5ec [mlx5_core]
  [ 1949.419940] __napi_poll+0x40/0x230
  [ 1949.423416] net_rx_action+0x178/0x360
  [ 1949.427150] __do_softirq+0x15c/0x410
  [ 1949.430798] irq_exit+0xa0/0xe0
  [ 1949.433925] handle_domain_irq+0x6c/0xa0
  [ 1949.437834] gic_handle_irq+0xec/0x1b0
  [ 1949.441567] call_on_irq_stack+0x20/0x2c
  [ 1949.445474] do_interrupt_handler+0x5c/0x70
  [ 1949.449642] el1_interrupt+0x30/0x50
  [ 1949.453202] el1h_64_irq_handler+0x18/0x2c
  [ 1949.457282] el1h_64_irq+0x7c/0x80
  [ 1949.460668] arch_cpu_idle+0x18/0x3c
  [ 1949.464228] default_idle_call+0x44/0x150
  [ 1949.468223] cpuidle_idle_call+0x174/0x200
  [ 1949.472304] do_idle+0xac/0x100
  [ 1949.475430] cpu_startup_entry+0x30/0x70
  [ 1949.479336] secondary_start_kernel+0xfc/0x190
  [ 1949.483765] __secondary_switched+0x90/0x94
  [ 1949.487935] Code: f9400c01 910003e4 b9401000 f940a021 (f945a023)
  [ 1949.494012] ---[ end trace d597d62fb2400054 ]---
  [ 1950.101574] Kernel panic - not syncing: Oops: Fatal exception in interrupt
  [ 1950.108438] SMP: stopping secondary CPUs
  [ 1950.112382] Kernel Offset: 0x49419ced0000 from 0xffff800008000000
  [ 1950.118458] PHYS_OFFSET: 0x80000000
  [ 1950.121930] CPU features: 0x0,000005c1,a3332e5a
  [ 1950.126446] Memory Limit: none
  [ 1950.736220] Rebooting in 10 seconds..
  Nvidia BlueField-3 rev1 BL1 V1.0
--
https://bugs.launchpad.net/bugs/2071715

Title:
  LAG with CT causes DPU Kernel Panic