** Description changed:

- 
  * Explain the bug(s)
  
  We've found a regression in release Ubuntu-bluefield-5.15.0-1046.48, where
  the kernel crashes due to a NULL pointer dereference.
  
  * Regression: Yes
  Last working version: Ubuntu-bluefield-5.15.0-1045.47
  
- There are huge difference between the two 
- % git lo Ubuntu-bluefield-5.15.0-1045.47..Ubuntu-bluefield-5.15.0-1046.48 | wc -l
-     1434
+ There is a huge difference between the two releases:
  % git lo Ubuntu-bluefield-5.15.0-1045.47..Ubuntu-bluefield-5.15.0-1046.48 net/ | wc -l
-      240
+      240
  % git lo Ubuntu-bluefield-5.15.0-1045.47..Ubuntu-bluefield-5.15.0-1046.48 net/netfilter | wc -l
-       30
+       30
  
  % git lo -Gflow_offload Ubuntu-bluefield-5.15.0-1045.47..Ubuntu-bluefield-5.15.0-1046.48 net/netfilter
  94679d661b0a net/sched: act_ct: Fix promotion of offloaded unreplied tuple
  fd360d6e23da netfilter: flowtable: cache info of last offload
  2821d27406fd netfilter: flowtable: allow unidirectional rules
  
  I'd guess the bug is related to one of the three commits above.
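
One way to narrow this down further is a path-limited bisect between the two tags. The following is only a sketch: the helper names count_commits and start_bisect are made up for illustration, plain "git log --oneline" stands in for the "git lo" alias (assumed to be equivalent), and the choice of paths (net/netfilter plus net/sched/act_ct.c) is an assumption based on the modules in the call trace.

```shell
# Tags taken from the report above.
GOOD=Ubuntu-bluefield-5.15.0-1045.47
BAD=Ubuntu-bluefield-5.15.0-1046.48

# Count the commits in good..bad that touch the given paths.
# (Hypothetical helper, not part of git.)
count_commits() {
    good=$1; bad=$2; shift 2
    git log --oneline "$good..$bad" -- "$@" | wc -l
}

# Start a bisect that only considers commits touching the given paths.
# (Hypothetical helper wrapping "git bisect start".)
start_bisect() {
    good=$1; bad=$2; shift 2
    git bisect start "$bad" "$good" -- "$@"
}

# Typical session, run from the kernel tree:
#   count_commits "$GOOD" "$BAD" net/netfilter net/sched/act_ct.c
#   start_bisect  "$GOOD" "$BAD" net/netfilter net/sched/act_ct.c
#   ... build, boot the DPU, run the reproduction steps ...
#   git bisect good    # or: git bisect bad
#   git bisect reset   # when finished
```

With about 30 commits under net/netfilter, a bisect of this kind should need roughly five build-and-boot cycles.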
  
- * How to test
+ 
+ * compare with last release Ubuntu-bluefield-5.15.0-1042.44
+ 
+ % git lo Ubuntu-bluefield-5.15.0-1042.44
+ f857e8400551 (tag: Ubuntu-bluefield-5.15.0-1042.44) UBUNTU: Ubuntu-bluefield-5.15.0-1042.44
+ ...
+ 5136d51e6602 genetlink: fix single op policy dump when do is present
+ fbb97233eb24 net: openvswitch: add missing .resv_start_op
+ ...
+ 18b4e928e9ed net/sched: flower: Add lock protection when remove filter handle
+ ...
+ 7a92e980a3ab genetlink: allow families to use split ops directly
+ 
+ * How to test
+ 
+ * how to reproduce
  
  Basic test flow: DPU --- DPU
  
  1. Create several VFs on each DPU, assign IP
  2. Create LAG (bond0)
  3. create OVS bridge on DPU, assign bond0, PF rep, and VFs to bridge
- # ovs-vsctl add-br ovs0_602 
- # ip link set dev bond0 up 
- # ovs-vsctl add-port ovs0_602 bond0 
- # ovs-vsctl add-port ovs0_602 pf0hpf 
- # ovs-vsctl add-port ovs0_602 pf0vf0 
- # ovs-vsctl add-port ovs0_602 pf0vf1 
- # ovs-vsctl add-port ovs0_602 pf0vf2 
- # ovs-vsctl add-port ovs0_602 pf0vf3 
+ # ovs-vsctl add-br ovs0_602
+ # ip link set dev bond0 up
+ # ovs-vsctl add-port ovs0_602 bond0
+ # ovs-vsctl add-port ovs0_602 pf0hpf
+ # ovs-vsctl add-port ovs0_602 pf0vf0
+ # ovs-vsctl add-port ovs0_602 pf0vf1
+ # ovs-vsctl add-port ovs0_602 pf0vf2
+ # ovs-vsctl add-port ovs0_602 pf0vf3
  
- # ovs-vsctl set Open_vSwitch . other_config:hw-offload=true 
- # systemctl restart openvswitch-switch 
+ # ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
+ # systemctl restart openvswitch-switch
  
  4. create the following CT rules:
  # ovs-ofctl del-flows ovs0_602
  # ovs-ofctl add-flow ovs0_602 "table=0,icmp,action=normal"
  # ovs-ofctl add-flow ovs0_602 "table=0,arp,action=normal"
  # ovs-ofctl add-flow ovs0_602 "table=0,ip,ct_state=-trk,action=ct(table=1)"
  # ovs-ofctl add-flow ovs0_602 "table=1,priority=1,ip,ct_state=+trk+new,action=ct(commit),normal"
  # ovs-ofctl add-flow ovs0_602 "table=1,priority=1,ip,ct_state=+trk+est,action=normal"
  Ping PFs and VFs to verify connectivity before starting to send traffic
- 
  
  * Kernel crash log
  
  [ 1948.480916] Bluefield ct offload: add wq coremask 80, del wq coremask 40
  [ 1948.979246] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000b40
  [ 1948.988048] Mem abort info:
  [ 1948.990829]   ESR = 0x0000000096000004
  [ 1948.994569]   EC = 0x25: DABT (current EL), IL = 32 bits
  [ 1949.020420]   SET = 0, FnV = 0
  [ 1949.023464]   EA = 0, S1PTW = 0
  [ 1949.026594]   FSC = 0x04: level 0 translation fault
  [ 1949.042381] Data abort info:
  [ 1949.045252]   ISV = 0, ISS = 0x00000004
  [ 1949.055747]   CM = 0, WnR = 0
  [ 1949.058709] user pgtable: 4k pages, 48-bit VAs, pgdp=00000001194ad000
  [ 1949.074503] [0000000000000b40] pgd=0000000000000000, p4d=0000000000000000
  [ 1949.081290] Internal error: Oops: 0000000096000004 [#1] SMP
  [ 1949.099070] Modules linked in: act_ct(E) nf_flow_table(E) act_skbedit(E) 
act_mirred(E) cls_matchall(E) act_gact(E) cls_flower(E) sch_ingress(E) 
nfnetlink_cttimeout(E) bonding(E) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) 
ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) 
mlxdevm(OE) ib_uverbs(OE) psample(E) tls(E) ib_core(OE) ipmb_host(E) 
sbsa_gwdt(E) openvswitch(E) nsh(E) nf_conncount(E) ipmi_ssif(E) 
xfrm_interface(E) xfrm6_tunnel(E) tunnel6(E) tunnel4(E) xfrm_user(E) 
xfrm_algo(E) nvme_fabrics(OE) tpm_ftpm_tee(E) mst_pciconf(OE) ipmi_devintf(E) 
ipmi_msghandler(E) ipmb_dev_int(E) 8021q(E) garp(E) stp(E) mrp(E) llc(E) 
overlay(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nft_counter(E) 
xt_tcpmss(E) xt_NFLOG(E) nfnetlink_log(E) xt_recent(E) xt_hashlimit(E) 
xt_state(E) xt_conntrack(E) xt_mark(E) xt_comment(E) ipt_REJECT(E) 
nf_reject_ipv4(E) xt_tcpudp(E) nft_compat(E) sunrpc(E) binfmt_misc(E) 
nf_tables(E) nfnetlink(E) nls_iso8859_1(E) optee(E) uio_pdrv_genirq(E) uio(E) 
tee(E)
  [ 1949.099135]  mlxbf_pmc(E) mlxbf_pka(E) sch_fq_codel(E) dm_multipath(E) 
scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) drm(E) ip_tables(E) x_tables(E) 
virtio_net(E) net_failover(E) failover(E) nvme(OE) gpio_mlxbf3(E) 
crct10dif_ce(E) ghash_ce(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) vitesse(E) 
nvme_core(OE) mlx_compat(OE) sdhci_of_dwcmshc(E) sdhci_pltfm(E) sdhci(E) 
i2c_mlxbf(E) mlxbf_gige(E) mlxbf_bootctl(E) pinctrl_mlxbf3(E) mlxbf_tmfifo(E) 
pwr_mlxbf(E) autofs4(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) 
[last unloaded: ib_core]
  [ 1949.234245] CPU: 14 PID: 0 Comm: swapper/14 Tainted: G           OE     5.15.0-1046-bluefield #48-Ubuntu
  [ 1949.243706] Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.5.2.13183 Jun 17 2024
  [ 1949.256548] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [ 1949.263492] pc : flow_offload_queue_work+0x28/0xb0 [nf_flow_table]
  [ 1949.269661] lr : nf_flow_offload_add+0x24/0x30 [nf_flow_table]
  [ 1949.275478] sp : ffff8000080737d0
  [ 1949.278776] x29: ffff8000080737d0 x28: 0000000000000000 x27: ffffc941a7cbccc0
  [ 1949.285894] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000001
  [ 1949.293012] x23: 0000000000000000 x22: ffffc941a7726000 x21: ffff0000861ece50
  [ 1949.300129] x20: ffff0000861ece40 x19: ffff000095cb5000 x18: 0000000000000000
  [ 1949.307246] x17: ffff36c6350db000 x16: ffffc941a527e890 x15: 0000000000000000
  [ 1949.314363] x14: ffffc941a776d0b0 x13: 0000000000000000 x12: 0000000000000032
  [ 1949.321480] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc9417f432674
  [ 1949.328598] x8 : ffff0000cb1ab580 x7 : 0000000000000000 x6 : 000000000000003f
  [ 1949.335715] x5 : 0000000000000040 x4 : ffff8000080737d0 x3 : 0000000fffffffe0
  [ 1949.342832] x2 : ffff0000cb1ab528 x1 : 0000000000000000 x0 : 0000000000000000
  [ 1949.349950] Call trace:
  [ 1949.352382]  flow_offload_queue_work+0x28/0xb0 [nf_flow_table]
  [ 1949.358199]  nf_flow_offload_add+0x24/0x30 [nf_flow_table]
  [ 1949.363668]  flow_offload_add+0x138/0x1e0 [nf_flow_table]
  [ 1949.369051]  tcf_ct_flow_table_add+0x110/0x160 [act_ct]
  [ 1949.374262]  tcf_ct_act+0x924/0xb6c [act_ct]
  [ 1949.378516]  tcf_action_exec+0xb4/0x1f0
  [ 1949.382342]  __tcf_classify+0xd8/0x220
  [ 1949.386077]  tcf_classify+0xa0/0x240
  [ 1949.389637]  sch_handle_ingress.constprop.0+0xd4/0x23c
  [ 1949.394760]  __netif_receive_skb_core.constprop.0+0x494/0x8d0
  [ 1949.400489]  __netif_receive_skb_list_core+0xf0/0x214
  [ 1949.405524]  netif_receive_skb_list_internal+0x198/0x2ac
  [ 1949.410819]  napi_complete_done+0x70/0x1ec
  [ 1949.414899]  mlx5e_napi_poll+0x15c/0x5ec [mlx5_core]
  [ 1949.419940]  __napi_poll+0x40/0x230
  [ 1949.423416]  net_rx_action+0x178/0x360
  [ 1949.427150]  __do_softirq+0x15c/0x410
  [ 1949.430798]  irq_exit+0xa0/0xe0
  [ 1949.433925]  handle_domain_irq+0x6c/0xa0
  [ 1949.437834]  gic_handle_irq+0xec/0x1b0
  [ 1949.441567]  call_on_irq_stack+0x20/0x2c
  [ 1949.445474]  do_interrupt_handler+0x5c/0x70
  [ 1949.449642]  el1_interrupt+0x30/0x50
  [ 1949.453202]  el1h_64_irq_handler+0x18/0x2c
  [ 1949.457282]  el1h_64_irq+0x7c/0x80
  [ 1949.460668]  arch_cpu_idle+0x18/0x3c
  [ 1949.464228]  default_idle_call+0x44/0x150
  [ 1949.468223]  cpuidle_idle_call+0x174/0x200
  [ 1949.472304]  do_idle+0xac/0x100
  [ 1949.475430]  cpu_startup_entry+0x30/0x70
  [ 1949.479336]  secondary_start_kernel+0xfc/0x190
  [ 1949.483765]  __secondary_switched+0x90/0x94
  [ 1949.487935] Code: f9400c01 910003e4 b9401000 f940a021 (f945a023)
  [ 1949.494012] ---[ end trace d597d62fb2400054 ]---
  [ 1950.101574] Kernel panic - not syncing: Oops: Fatal exception in interrupt
  [ 1950.108438] SMP: stopping secondary CPUs
  [ 1950.112382] Kernel Offset: 0x49419ced0000 from 0xffff800008000000
  [ 1950.118458] PHYS_OFFSET: 0x80000000
  [ 1950.121930] CPU features: 0x0,000005c1,a3332e5a
  [ 1950.126446] Memory Limit: none
  [ 1950.736220] Rebooting in 10 seconds..
  Nvidia BlueField-3 rev1 BL1 V1.0
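
As a cross-check of the oops, the instruction words on the "Code:" line can be decoded by hand; the word in parentheses is the one that faulted. The sketch below is a minimal decoder for one AArch64 encoding only (64-bit LDR with unsigned immediate offset). The function name decode_ldr64 is made up for illustration, and the script says nothing about which structure fields are involved.

```shell
# Decode an AArch64 "LDR Xt, [Xn, #imm]" word (64-bit, unsigned offset).
# Hypothetical helper; it rejects every other encoding.
decode_ldr64() {
    insn=$(( $1 ))
    # Bits 31..22 are 0b1111100101 (0x3e5) for this encoding.
    if [ $(( insn >> 22 )) -ne $(( 0x3e5 )) ]; then
        echo "not a 64-bit LDR (unsigned offset): $1" >&2
        return 1
    fi
    imm12=$(( (insn >> 10) & 0xfff ))   # offset, scaled by 8 bytes
    rn=$(( (insn >> 5) & 0x1f ))        # base register
    rt=$(( insn & 0x1f ))               # destination register
    if [ "$rn" -eq 31 ]; then base=sp; else base="x$rn"; fi
    printf 'ldr x%d, [%s, #0x%x]\n' "$rt" "$base" $(( imm12 * 8 ))
}

# The three 64-bit loads from the "Code:" line, in program order:
decode_ldr64 0xf9400c01   # ldr x1, [x0, #0x18]
decode_ldr64 0xf940a021   # ldr x1, [x1, #0x140]
decode_ldr64 0xf945a023   # ldr x3, [x1, #0xb40]  (faulting instruction)
```

The faulting load reads from x1 + 0xb40. The register dump shows x1 = 0, which matches the reported fault address 0000000000000b40, so the pointer fetched by the preceding load (offset 0x140) appears to have been NULL when flow_offload_queue_work ran.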

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2071715

Title:
  LAG with CT causes DPU Kernel Panic

