Hello,
In a high-traffic scenario, we modified the bond-rebalance-interval
setting of an OVS-DPDK bond port and observed that OVS-DPDK
reported USERSPACE_INVALID_PORT_DROP errors.
Our analysis shows that executing the command
ovs-vsctl set port dpdk_tun_port other_config:bond-rebalance-interval=1000
triggers the following sequence, which ultimately leads to the
USERSPACE_INVALID_PORT_DROP errors:
1. Execution of memset(bond->hash, 0, hash_len);
Call stack:
#0 bond_entry_reset (bond=0x4c64bc0) at ofproto/bond.c:1852
#1 0x0000000001a2a238 in bond_reconfigure (bond=0x4c64bc0, s=0x7fff6d1dec10) at
ofproto/bond.c:514
#2 0x0000000001a4e253 in bundle_set (ofproto_=0x4c21110, aux=0x4c39d90,
s=0x7fff6d1deb90) at ofproto/ofproto-dpif.c:3484
#3 0x0000000001a31b27 in ofproto_bundle_register (ofproto=0x4c21110,
aux=0x4c39d90, s=0x7fff6d1deb90) at ofproto/ofproto.c:1430
#4 0x0000000001a1c80e in port_configure (port=0x4c39d90) at
vswitchd/bridge.c:1384
#5 0x0000000001a1b7b3 in bridge_reconfigure (ovs_cfg=0x4bb37c0) at
vswitchd/bridge.c:1005
#6 0x0000000001a223e7 in bridge_run () at vswitchd/bridge.c:3423
#7 0x0000000001a27b9e in main (argc=11, argv=0x7fff6d1def38) at
vswitchd/ovs-vswitchd.c:129
2. Execution of member_map[i] = OFPP_NONE
Call stack:
#0 bond_add_lb_output_buckets (bond=0x37220f0) at ofproto/bond.c:2135
#1 0x0000000001a29b4f in update_recirc_rules__ (bond=0x37220f0) at
ofproto/bond.c:356
#2 0x0000000001a29ebe in update_recirc_rules (bond=0x37220f0) at
ofproto/bond.c:426
#3 0x0000000001a2a262 in bond_reconfigure (bond=0x37220f0, s=0x7fffffffe230) at
ofproto/bond.c:520
#4 0x0000000001a4e292 in bundle_set (ofproto_=0x366afa0, aux=0x3713290,
s=0x7fffffffe1b0) at ofproto/ofproto-dpif.c:3484
#5 0x0000000001a31b66 in ofproto_bundle_register (ofproto=0x366afa0,
aux=0x3713290, s=0x7fffffffe1b0) at ofproto/ofproto.c:1430
#6 0x0000000001a1c80e in port_configure (port=0x3713290) at
vswitchd/bridge.c:1384
#7 0x0000000001a1b7b3 in bridge_reconfigure (ovs_cfg=0x3660180) at
vswitchd/bridge.c:1005
#8 0x0000000001a223b7 in bridge_run () at vswitchd/bridge.c:3422
#9 0x0000000001a27b92 in main (argc=1, argv=0x7fffffffe558)
3. PMD thread, while sending packets, finds port_no=0xffffffff
Call stack:
#0 dp_execute_output_action (pmd=0x7fff68731010, packets_=0x7fff53ff8f50,
should_steal=true, port_no=4294967295)
at lib/dpif-netdev.c:9273
#1 0x0000000001acaf6d in dp_execute_lb_output_action (pmd=0x7fff68731010,
packets_=0x7fff53ff9ca0, should_steal=true,
bond=1) at lib/dpif-netdev.c:9350
#2 0x0000000001acb0b6 in dp_execute_cb (aux_=0x7fff53ff9b30,
packets_=0x7fff53ff9ca0, a=0x7fff4800f074, should_steal=true)
at lib/dpif-netdev.c:9379
#3 0x0000000001b526b5 in odp_execute_actions (dp=0x7fff53ff9b30,
batch=0x7fff53ff9ca0, steal=true,
actions=0x7fff4800f074, actions_len=8, dp_execute_action=0x1acafc0
<dp_execute_cb>) at lib/odp-execute.c:1016
#4 0x0000000001acbc8e in dp_netdev_execute_actions (pmd=0x7fff68731010,
packets=0x7fff53ff9ca0, should_steal=true,
flow=0x7fff4800ea70, actions=0x7fff4800f074, actions_len=8) at
lib/dpif-netdev.c:9698
#5 0x0000000001ac8133 in packet_batch_per_flow_execute (batch=0x7fff53ff9c90,
pmd=0x7fff68731010)
at lib/dpif-netdev.c:8338
#6 0x0000000001aca3ad in dp_netdev_input__ (pmd=0x7fff68731010,
packets=0x7fff53ffbdf0, md_is_valid=false, port_no=4)
at lib/dpif-netdev.c:9055
#7 0x0000000001aca3ff in dp_netdev_input (pmd=0x7fff68731010,
packets=0x7fff53ffbdf0, port_no=4) at lib/dpif-netdev.c:9064
#8 0x0000000001ac0da2 in dp_netdev_process_rxq_port (pmd=0x7fff68731010,
rxq=0x3720220, port_no=4)
at lib/dpif-netdev.c:5690
#9 0x0000000001ac566a in pmd_thread_main (f_=0x7fff68731010) at
lib/dpif-netdev.c:7334
#10 0x0000000001bc4b1b in ovsthread_wrapper (aux_=0x3711920) at
lib/ovs-thread.c:422
#11 0x00007ffff76f4802 in start_thread () from /lib64/libc.so.6
#12 0x00007ffff7694314 in clone () from /lib64/libc.so.6
The root cause appears to be a race between the main thread and the PMD
thread on pmd->tx_bonds: while the main thread is reconfiguring the bond,
the PMD thread can temporarily resolve the egress port to 0xffffffff
(an invalid value).
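To make the sequence in steps 1 to 3 easier to follow, here is a minimal,
single-threaded C model of it. This is only a sketch under assumptions, not
the actual OVS code: struct model_bucket, model_hash, model_tx_map, N_BUCKETS,
PORT_NONE and all model_* functions are hypothetical names, and the real data
structures in ofproto/bond.c and lib/dpif-netdev.c are more involved.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_BUCKETS 256          /* Bucket count assumed for this model. */
#define PORT_NONE 0xffffffffu  /* Stand-in for the invalid port value. */

/* Hypothetical, simplified stand-in for one bond hash entry. */
struct model_bucket {
    int has_member;            /* 0 after a reset, 1 once a member is set. */
    uint32_t port;             /* Egress port of the assigned member. */
};

static struct model_bucket model_hash[N_BUCKETS];  /* Models bond->hash. */
static uint32_t model_tx_map[N_BUCKETS];           /* Map read by the PMD. */

/* Assign members to all buckets, alternating between two ports. */
void model_assign(uint32_t port_a, uint32_t port_b)
{
    assert(port_a != PORT_NONE && port_b != PORT_NONE);
    for (int i = 0; i < N_BUCKETS; i++) {
        model_hash[i].has_member = 1;
        model_hash[i].port = (i % 2 == 0) ? port_a : port_b;
    }
}

/* Step 1: the reset wipes every bucket's member assignment. */
void model_entry_reset(void)
{
    memset(model_hash, 0, sizeof model_hash);
}

/* Step 2: publish the per-bucket map; empty buckets get PORT_NONE. */
void model_publish_map(void)
{
    for (int i = 0; i < N_BUCKETS; i++) {
        model_tx_map[i] = model_hash[i].has_member ? model_hash[i].port
                                                   : PORT_NONE;
    }
}

/* Step 3: the datapath side picks the egress port for a flow hash. */
uint32_t model_pmd_lookup(uint32_t flow_hash)
{
    return model_tx_map[flow_hash % N_BUCKETS];
}
```

In this model, calling model_entry_reset() followed by model_publish_map()
makes every model_pmd_lookup() return PORT_NONE until model_assign() and
model_publish_map() run again, which mirrors the window in which the PMD
thread sees port_no=0xffffffff.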
What solution does the community recommend for this problem?
We are running OVS 2.17.5 (LTS).
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss