On Fri, Jun 5, 2026 at 10:16 AM Eelco Chaudron <[email protected]> wrote:
>
> On 2 Jun 2026, at 22:50, [email protected] wrote:
>
> > From: Numan Siddique <[email protected]>
> >
> > Hello,
> >
> > Below is a side-by-side trace of the same OVN-driven datapath pipeline,
> > In our prod deployment we are seeing intermittent offload issues.  All
> > the datapath flows of a chain are getting offloaded except the last one.
> > It is installed in the kernel dp because of which it kills the performance.
> > If I run the command - ovs-appctl dpctl/del-flows, the problematic
> > flow gets offloaded.
> >
> > The issue again can be reproduced if we run a script to the delete
> > the dp flows in a loop with a sleep of 5 seconds.  Generally the issue
> > gets surfaced after 5-6 del flows.
> >
> > Below are the datapath flows when the issue is seen
> >
> > The traffic is destined to the public IP(s) (there are 2 public ips in
> > our setup) of the VM and enters the compute node PF and via the br-ex
> > to the OVN pipeline.
> >
> > -------------------------------------------------------------------------------
> > 1. BEFORE FLUSH
> > -------------------------------------------------------------------------------
> >
> > Two upstream branches merge into the stranded chain `0x314610`.  Each
> > branch is exactly two TC-offloaded stages followed by the umbrella that
> > is stuck in dp:ovs.
> >
> > +------------------------------+   +------------------------------+
> > | recirc_id(0)    BRANCH A     |   | recirc_id(0)    BRANCH B     |
> > | ufid:b81b9ab4                |   | ufid:29ac3fbf                |
> > | in_port=enp210s0f0np0 (PF)   |   | in_port=enp210s0f0np0 (PF)   |
> > | eth_type=0x8100, VLAN 120    |   | eth_type=0x8100, VLAN 26     |
> > | eth(src=b0:cf:0e:b1:31:ff,   |   | eth(src=b0:cf:0e:b1:31:ff,   |
> > |     dst=ae:ad:c9:2a:9d:0f)   |   |     dst=ae:ad:c9:2a:9d:0f)   |
> > | ipv4(dst=AA.BB.CC.DD,        |   | ipv4(dst=XX.YY.ZZ.AA,        |
> > |      src=32.0.0.0/224.0.0.0, |   |      src=8.0.0.0/248.0.0.0,  |
> > |      ttl=119)                |   |      ttl=62)                 |
> > | ct_state(0/0x2b)             |   | ct_state(0/0x2b)             |
> > | ct_mark(0/0x2)               |   | ct_mark(0/0x2)               |
> > |                              |   |                              |
> > | actions:                     |   | actions:                     |
> > |   pop_vlan,                  |   |   pop_vlan,                  |
> > |   ct(zone=6,nat),            |   |   ct(zone=5,nat),            |
> > |   recirc(0x320213)           |   |   recirc(0x321f17)           |
> > |                              |   |                              |
> > | pkts=565  bytes=44636        |   | pkts=26,867,289              |
> > | used=1.640s                  |   | bytes=241,881,487,034        |
> > | offloaded:yes, dp:tc         |   | used=0.620s                  |
> > +------------------------------+   | offloaded:yes, dp:tc         |
> >                |                   +------------------------------+
> >                | post-DNAT in zone 6              |
> >                v                                  | post-DNAT in zone 5
> > +------------------------------+                  v
> > | recirc_id(0x320213)          |   +------------------------------+
> > | ufid:3413a279                |   | recirc_id(0x321f17)          |
> > | in_port=enp210s0f0np0 (PF)   |   | ufid:9f638cd3                |
> > | ct_state(0x2a/0x3e)          |   | in_port=enp210s0f0np0 (PF)   |
> > | ct_mark(0/0x43)              |   | ct_state(0x2a/0x3e)          |
> > | eth(src=b0:cf:0e:b1:31:ff,   |   | ct_mark(0/0x43)              |
> > |     dst=ae:ad:c9:2a:9d:0f)   |   | eth(src=b0:cf:0e:b1:31:ff,   |
> > | ipv4(src=0.0.0.0/128.0.0.0,  |   |     dst=ae:ad:c9:2a:9d:0f)   |
> > |      dst=172.27.61.7,        |   | ipv4(src=8.0.0.0/248.0.0.0,  |
> > |      proto=6, ttl=119)       |   |      dst=172.27.61.7,        |
> > |                              |   |      proto=6, ttl=62)        |
> > | actions:                     |   |                              |
> > |   ct_clear,                  |   | actions:                     |
> > |   set(eth src=               |   |   ct_clear,                  |
> > |     1a:83:58:7b:a8:ed),      |   |   set(eth src=               |
> > |   set(ipv4 ttl=118),         |   |     1a:83:58:7b:a8:ed),      |
> > |   ct(zone=11,nat),           |   |   set(ipv4 ttl=60),          |
> > |   recirc(0x314610)           |   |   ct(zone=11,nat),           |
> > |                              |   |   recirc(0x314610)           |
> > | pkts=565  bytes=44636        |   |                              |
> > | used=1.640s                  |   | pkts=26,867,289              |
> > | offloaded:yes, dp:tc         |   | bytes=241,881,487,034        |
> > +------------------------------+   | used=0.620s                  |
> >                |                   | offloaded:yes, dp:tc         |
> >                |                   +------------------------------+
> >                |                                  |
> >                +--------------+     +-------------+
> >                               |     |
> >                               v     v
> >              +--------------------------------------+
> >              | recirc_id(0x314610)    STAGE 2       |
> >              | ufid:1ee350bf                        |
> >              | in_port=enp210s0f0np0 (PF)           |
> >              | ct_state(0x2a/0x3f)  <-- mask 0x3f   |
> >              | ct_mark(0/0x41)                      |
> >              | eth(src=*, dst=ae:ad:c9:2a:9d:0f)    |
> >              | ipv4(src=*, dst=172.27.61.7,         |
> >              |      proto=0/0, ttl=0/0)             |
> >              |                                      |
> >              | actions: enp210s0f0_1 (VF)           |
> >              |                                      |
> >              | pkts=41,192,879                      |
> >              | bytes=2,502,536,363,732              |
> >              | used=0.020s, flags=SFPR.             |
> >              |                                      |
> >              | dp:ovs  <-- STRANDED, NOT OFFLOADED  |
> >              +--------------------------------------+
> >
> >
> > -------------------------------------------------------------------------------
> > 2. AFTER FLUSH (ovs-appctl dpctl/del-flows)
> > -------------------------------------------------------------------------------
> >
> > After `ovs-appctl dpctl/del-flows` everything is re-installed in the
> > natural pipeline order, so the chain check passes for every stage.
> > The megaflow masks have not been re-aggregated yet, so we see a
> > "fanned out" pipeline:
> >
> > +----------------+  +----------------+  +----------------+
> > | recirc_id(0)   |  | recirc_id(0)   |  | (parent for    |
> > | BRANCH A       |  | BRANCH B       |  |  chain         |
> > | 5 sub-megaflows|  | 1 megaflow     |  |  0x3229d9 had  |
> > | vlan 120       |  | vlan 26        |  |  aged out at   |
> > | zone 6 NAT     |  | zone 5 NAT     |  |  dump time --  |
> > |                |  |                |  |  the two       |
> > | dst=           |  | dst=           |  |  stage-1       |
> > |   AA.BB.CC.DD  |  |   XX.YY.ZZ.AA  |  |  flows below   |
> > | by src/ttl:    |  | src=8.0.0.0/5  |  |  had pkts=0)   |
> > |  104/5 ttl=56  |  | ttl=62         |  |                |
> > |  32/3 ttl=119  |  |                |  | ufid:1b6d210e  |
> > |  124/7 ttl=234 |  | pkts=14,326,765|  |  -- not        |
> > |  32/3 ttl=122  |  | bytes=128.7 GB |  |  captured      |
> > |  192/3 ttl=243 |  | used=0.660s    |  |  for branch    |
> > |                |  |                |  |  C             |
> > | actions:       |  | actions:       |  |                |
> > |  pop_vlan,     |  |  pop_vlan,     |  |                |
> > |  ct(zone=6,    |  |  ct(zone=5,    |  |                |
> > |     nat),      |  |     nat),      |  |                |
> > |  recirc(       |  |  recirc(       |  |                |
> > |    0x320213)   |  |    0x321f17)   |  |                |
> > | offloaded:yes  |  | offloaded:yes  |  |                |
> > | dp:tc          |  | dp:tc          |  |                |
> > +----------------+  +----------------+  +----------------+
> >         |                   |                   :
> >         v                   v                   v
> > +----------------+  +----------------+  +----------------+
> > | recirc_id      |  | recirc_id      |  | recirc_id      |
> > |  (0x320213)    |  |  (0x321f17)    |  |  (0x3229d9)    |
> > |                |  |                |  |                |
> > | 3 sub-megaflows|  | 1 megaflow     |  | 2 megaflows    |
> > | ct_state(      |  | ct_state(      |  | ct_state(      |
> > |   0x2a/0x3e)   |  |   0x2a/0x3e)   |  |   0x21/0x3f)   |
> > | (+est+rpl+trk) |  | (+est+rpl+trk) |  | (+new+trk)     |
> > |                |  |                |  |                |
> > | ttl 119 -> 118 |  | ttl 62 -> 60   |  | ttl 234 -> 233 |
> > | ttl 56 -> 55   |  |                |  | ttl 243 -> 242 |
> > | ttl 122 -> 121 |  | pkts=14,326,690|  |                |
> > |                |  | bytes=128.7 GB |  | pkts=0 (new    |
> > | pkts=68+9+1=78 |  | used=0.660s    |  |   conn attempts|
> > |                |  |                |  |   in flight)   |
> > | actions:       |  | actions:       |  |                |
> > |  ct_clear,     |  |  ct_clear,     |  | actions:       |
> > |  set(eth src=  |  |  set(eth src=  |  |  (same shape   |
> > |    1a:83:..),  |  |    1a:83:..),  |  |   as branch    |
> > |  set(ipv4 ttl  |  |  set(ipv4 ttl  |  |   A/B stage 1) |
> > |    -1),        |  |    -1),        |  |  recirc(       |
> > |  ct(zone=11,   |  |  ct(zone=11,   |  |    0x314610)   |
> > |     nat),      |  |     nat),      |  |                |
> > |  recirc(       |  |  recirc(       |  | offloaded:yes  |
> > |    0x314610)   |  |    0x314610)   |  | dp:tc          |
> > | offloaded:yes  |  | offloaded:yes  |  |                |
> > | dp:tc          |  | dp:tc          |  |                |
> > +----------------+  +----------------+  +----------------+
> >         |                   |                   |
> >         +--------+          |      +------------+
> >                  |          |      |
> >                  v          v      v
> >        +-----------------------------------------+
> >        | recirc_id(0x314610)    STAGE 2          |
> >        |                                         |
> >        | Three flows now (all offloaded:yes,     |
> >        | dp:tc):                                 |
> >        |                                         |
> >        | 1. ufid:c51ef89d  <-- the umbrella      |
> >        |    ct_state(0x2a/0x3e) <-- mask 0x3e    |
> >        |    ct_mark(0/0x41)                      |
> >        |    eth(src=*, dst=ae:ad:c9:2a:9d:0f)    |
> >        |    ipv4(dst=172.27.61.7)                |
> >        |    actions: enp210s0f0_1 (VF)           |
> >        |    pkts=14,326,720                      |
> >        |    bytes=128,674,265,194                |
> >        |    used=0.660s                          |
> >        |                                         |
> >        | 2. ufid:d6f6c8c3 (DROP, new conn ACL)   |
> >        |    ct_state(0x21/0x3f) (+new+trk)       |
> >        |    eth(src=1a:83:58:7b:a8:ed,           |
> >        |        dst=ae:ad:00:00:00:00/           |
> >        |            ff:ff:00:00:00:00)           |
> >        |    dst=172.27.60.0/23,                  |
> >        |    tcp ports w/ submask                 |
> >        |    actions: drop                        |
> >        |    pkts=0                               |
> >        |                                         |
> >        | 3. ufid:0b52d8bd (DROP, new conn ACL)   |
> >        |    same shape, different tcp submask    |
> >        |    actions: drop                        |
> >        |    pkts=0                               |
> >        +-----------------------------------------+
> >
> >
> > (Note: The above ascii graph is generated by Claude)
> >
> >
> > In the OVS logs we also see the below msg
> > (https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2363)
> >
> > ```
> > 2026-06-01T21:15:33.774Z|10763|netdev_offload_tc(handler18)|DBG|
> > match for chain 3229200 failed due to non-existing goto chain action
> > ```
> >
> > There seems to be a race condition during the ccmap 'used_chains'.
> >
> > As per Claude, the issue seems to be introduced in the commit :
> > `273a4fce951a` - `netdev-offload-tc: Only install recirc flows if the parent is present.`
> > and there is a possibility of a race window in the function netdev_tc_flow_put()
> > between https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2695
> > and https://github.com/openvswitch/ovs/blob/main/lib/dpif-offload-tc-netdev.c#L2730
> >
> > @Eelco @Ilya - Do you have any idea on what could be going on here ?
>
> Hi Numan,
>
> Sorry for the late response, but this message ended up in my
> spam box which I was cleaning up :( Put Ilya also on the TO
> line, maybe it ended up in his spam also.
>
> I'm on PTO on Monday, so will try to take a look at this later
> in the week.
>
> //Eelco
>

Hi Eelco,

Thanks for the reply.

Below are some of the details I found during my investigation.

OVN Logical topology
----
VM1 -> VPC Logical switch (with ACLs) -> Logical Router (NATs configured) -> Public logical switch with localnet port -> br-int <-> patch ports -> br-ex -> Physical NIC

In this case, there is an iperf session between VM1 and an outside iperf server. When the reply traffic from the iperf server to VM1 enters the host via the physical NIC, the packet goes through 3 stages (stage 0, 1 and 2) before it is delivered to the VM, because of the NAT in the logical router pipeline and the ACLs in the logical switch pipeline. These stages use, for example, recirc 0, recirc 4 and recirc 5.

When the datapath flow for recirc 0 is offloaded, recirc 4 is saved in "used_chains" (in tc_netdev_flow_put()), and when the dp flow with recirc 4 is offloaded, recirc 5 is saved in "used_chains".

What I noticed is that some of the packets from the server to the VM have push/ack set in tcp.flags and some have only ack set. ovs-vswitchd generates a different ufid for each of the two cases, but after flow translation the same datapath flow is generated.
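To make the "used_chains" bookkeeping above easier to follow, here is a rough Python model of how I understand it to behave. This is only a sketch and not the OVS code: the class `ChainTracker`, its methods and the flow names are made up for illustration, and the real implementation uses a ccmap inside ovs-vswitchd.

```python
from collections import Counter


class ChainTracker:
    """Toy model of the 'used_chains' bookkeeping (hypothetical, not OVS code)."""

    def __init__(self):
        # chain id -> number of installed flows whose action jumps to it
        self.used_chains = Counter()
        # flow name -> (recirc_id it matches on, chain it jumps to or None)
        self.flows = {}

    def flow_put(self, name, recirc_id, goto_chain=None):
        # recirc 0 is the entry point and never needs a parent.  Any other
        # recirc id is only accepted while some installed flow jumps to it.
        if recirc_id != 0 and self.used_chains[recirc_id] == 0:
            return False  # "non-existing goto chain action"
        self.flows[name] = (recirc_id, goto_chain)
        if goto_chain is not None:
            self.used_chains[goto_chain] += 1
        return True

    def flow_del(self, name):
        _, goto_chain = self.flows.pop(name)
        if goto_chain is not None:
            self.used_chains[goto_chain] -= 1


tracker = ChainTracker()
results = [
    tracker.flow_put("stage0", recirc_id=0, goto_chain=4),
    tracker.flow_put("stage1", recirc_id=4, goto_chain=5),
    tracker.flow_put("stage2", recirc_id=5),
]
print(results)  # [True, True, True]

# Once the stage 1 flow (recirc 4) is gone, chain 5 has no parent any more,
# so a new attempt to install the stage 2 flow is rejected.
tracker.flow_del("stage2")
tracker.flow_del("stage1")
rejected = tracker.flow_put("stage2", recirc_id=5)
print(rejected)  # False
```

In this model the last flow of the chain can only be installed while its parent is present, which is the condition behind the "non-existing goto chain action" debug message.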
If I delete the dp flows while iperf is running (ovs-appctl dpctl/del-flows), a lot of packets get upcalled. One set of reply packets from the server has tcp.flags == push/ack and another set has just tcp.flags == ack.

When the handler thread does the flow translation for packets with [recirc=5, tcp.flags == push/ack], it generates ufid 'A' and offloads the flow if '5' is present in 'used_chains'.

When the handler thread does the flow translation for packets with [recirc=5, tcp.flags == ack], it generates ufid 'B', and when it tries to offload, tc returns EEXIST because both ufid 'A' and 'B' generate the same datapath flow (as the datapath flow doesn't have matches for tcp flags).

And I see the error message 'match for chain 5 failed due to non-existing goto chain action' if the stage 1 flow (with recirc 4) was deleted.

I was able to fix this issue by hacking ovn-northd and adding the below logical flow (priority 65533):

  table=6 (ls_out_acl_eval    ), priority=65533, match=(ct.est && !ct.rel && ct.rpl && ct_mark.blocked == 0 && tcp && (tcp.flags == 24 || tcp.flags == 16)), action=(reg8[21] = ct_label.nf; reg8[16] = 1; next;)
  table=6 (ls_out_acl_eval    ), priority=65532, match=(ct.est && !ct.rel && ct.rpl && ct_mark.blocked == 0), action=(reg8[21] = ct_label.nf; reg8[16] = 1; next;)

Priority 65532 is the existing logical flow which northd adds to allow all the established reply packets.

Since the priority 65533 flow matches on tcp.flags, two distinct datapath flows are now added, and I do not see this race issue even after deleting the dp flows in a loop. (I added a sleep of 8 seconds between the deletes.)

Without the hack in northd, I'm able to reproduce the issue 100% of the time within the first 3 dp flow deletes in a fake-multinode environment.

When the issue is seen, the first 2 dp flows (recirc 0 and 4) are offloaded to tc and the last one is not.
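The ufid 'A' / ufid 'B' collision above can be illustrated with a small toy model in Python. It is only a sketch under my assumptions: `ufid_of`, `tc_match_of` and `ToyTc` are hypothetical names, the SHA-256 hash is just a stand-in for however ovs-vswitchd derives the ufid, and the filter table is a plain dict, not tc. The flag values are the standard TCP bits (PSH = 0x08, ACK = 0x10), which is where 24 and 16 in the logical flow match come from.

```python
import errno
import hashlib

TCP_PSH = 0x08
TCP_ACK = 0x10


def ufid_of(packet_key):
    """Stand-in ufid: a hash over the full key, tcp_flags included."""
    data = repr(sorted(packet_key.items())).encode()
    return hashlib.sha256(data).hexdigest()


def tc_match_of(packet_key):
    """Translated dp flow match: everything except tcp_flags."""
    return tuple(sorted((k, v) for k, v in packet_key.items()
                        if k != "tcp_flags"))


class ToyTc:
    """Toy filter table that refuses a second filter with the same match."""

    def __init__(self):
        self.filters = {}  # match -> ufid that installed it

    def add(self, ufid, match):
        if match in self.filters:
            return errno.EEXIST
        self.filters[match] = ufid
        return 0


psh_ack = {"recirc_id": 5, "ip_dst": "172.27.61.7",
           "tcp_flags": TCP_PSH | TCP_ACK}
ack = {"recirc_id": 5, "ip_dst": "172.27.61.7", "tcp_flags": TCP_ACK}
print(TCP_PSH | TCP_ACK, TCP_ACK)  # 24 16

ufid_a, ufid_b = ufid_of(psh_ack), ufid_of(ack)
tc = ToyTc()
first = tc.add(ufid_a, tc_match_of(psh_ack))
second = tc.add(ufid_b, tc_match_of(ack))
print(ufid_a != ufid_b, first, second == errno.EEXIST)  # True 0 True
```

With the extra priority 65533 logical flow, the translated match would also cover tcp flags, so in this model the two keys would no longer collapse into a single filter.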
When 2 different ufids result in the same dp flow, perhaps tc_netdev_flow_put() should store both ufids in the ufid_to_tc_mappings?

Let me know if you want me to provide a script to reproduce using ovn-fake-multinode.

Thanks
Numan

>
> > Let me know if you need more information.  I'll try to debug further.
> >
> > Thanks
> > Numan
> >
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
