Hi Oscar,

On 5/13/25 1:04 PM, Dumitru Ceara wrote:
> On 5/13/25 11:06 AM, Trọng Đạt Trần wrote:
>> Dear Dumitru,
>>
>
> Hi Oscar,
>
>> Over the past few days I’ve run additional tests to understand the
>> issue better before giving you the details.
>>
>> Thank you for your earlier explanation; it clarified how conntrack and
>> sampling work in the simple |vm1 --- ls --- vm2| topology. However, I
>> believe my original observations still hold in router-related topologies.
>>
>> ------------------------------------------------------------------------
>>
>> Setup Recap
>>
>> *Topology*: vm_a (10.2.1.5) --- ls1 --- router --- ls2 --- vm_b (10.2.3.5)
>>
>> ACLs applied to a shared Port Group (|pg_d559...|):
>>
>>   * *ACL A*: |from-lport|, allow-related IPv4 (sample_est = |2000000|)
>>   * *ACL B*: |to-lport|, allow-related ICMP (sample_est = |1000000|)
>>
>> *Sample configuration*:
>>
>>   * ACL A: direction=from-lport, match="inport == @pg && ip4",
>>     sample_est=2000000
>>   * ACL B: direction=to-lport, match="outport == @pg && ip4 && icmp4",
>>     sample_est=1000000
>>
>> # ovn-nbctl acl-list pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc
>> from-lport 1002 (inport ==
>> @pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc && ip4) allow-related
>> to-lport 1002 (outport ==
>> @pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc && ip4 && ip4.src ==
>> 0.0.0.0/0 && icmp4) allow-related
>>
>> ------------------------------------------------------------------------
>>
>> Expected Behavior (based on your explanation)
>>
>>   * *First ICMP request*: no sample (ct=new).
>>   * *First ICMP reply*:
>>       o One sample from the *ingress pipeline* (sample_est = |1000000|)
>>       o One sample from the *egress pipeline* (sample_est = |2000000|)
>>     → *Total: 2 samples* for the reply --> True
>>
>> ------------------------------------------------------------------------
>>
>> Actual Behavior Observed
>>
>> On the *first ICMP reply*, I see:
>>
>>   * *3 samples total*:
>>       o *2 samples* in the *ingress pipeline*, both with
>>         |obs_point_id=1000000|
>>       o *1 sample* in the egress pipeline, with |obs_point_id=2000000|
>>
>> This results in *duplicated sampling actions for a single logical
>> datapath flow* within the ingress pipeline.
>>
>> Evidence:
>>
>> # ovs-dpctl dump-flows | grep 10.2.1.5
>> recirc_id(0x1d5),in_port(6),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x20020/0xff0031),ct_label(0xf4240000000000000000000000000),eth(src=fa:16:3e:6b:42:8e,dst=fa:16:3e:dd:02:c0),eth_type(0x0800),ipv4(src=10.2.1.5,dst=10.2.3.5,proto=1,ttl=64,frag=no),
>> packets:299, bytes:29302, used:0.376s,
>> actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554437,obs_point_id=1000000,output_port=4294967295)),userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554437,obs_point_id=1000000,output_port=4294967295)),ct_clear,set(eth(src=fa:16:3e:d5:7b:d1,dst=fa:16:3e:f8:af:7d)),set(ipv4(ttl=63)),ct(zone=21),recirc(0x1d6)
>> # recirc_id(0x1d5): two flow_sample(...) actions with the same metadata
>> # (1000000)
>> recirc_id(0x1d6),in_port(6),ct_state(-new+est-rel+rpl-inv+trk),ct_mark(0x20000/0xff0031),ct_label(0x1e8480000000000000000000000000),eth(dst=fa:16:3e:f8:af:7d),eth_type(0x0800),ipv4(dst=10.2.3.5,frag=no),
>> packets:299, bytes:29302, used:0.376s,
>> actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554439,obs_point_id=2000000,output_port=4294967295)),9
>> # plus one flow_sample(...) later in the pipeline with metadata (2000000)
>>
>> Also confirmed via IPFIX stats:
>>
>> # IPFIX before ping
>> sampled pkts: 192758
>> # After a single ping
>> sampled pkts: 192761 → Δ = 3
>>
>> Additional Findings
>>
>>   * The issue *only occurs* when VMs are on *separate logical switches
>>     connected by a router*.
>>   * If both VMs are on the *same logical switch*, IPFIX is correctly
>>     sampled only once per ACL.
>>   * The duplicated sampling occurs *even if ACL A (IPv4) and ACL C
>>     (IPv6) are unrelated*, as long as both have |sample_est| and belong
>>     to the same Port Group.
>>   * The error can be reproduced *even when only vm_a's Port Group has
>>     the sampling ACLs*. vm_b does not require any sampling configuration
>>     for the issue to occur.
>>
>
> Thanks a lot for the follow up!  You're right, this is indeed a bug.
> And that's because we don't clear the packet's ct_state (well, all
> conntrack-related information) when advancing to the egress pipeline of
> a switch when the outport is one connected to a router.
>
> That's due to https://github.com/ovn-org/ovn/commit/d17ece7 where we
> chose to skip ct_clear if the switch has stateful (allow-related) ACLs:
>
> "Also, this patch does not change the behavior for ACLs such as
> allow-related: packets are still sent to conntrack, even for router
> ports. While this does not work if router ports are distributed,
> allow-related ACLs work today on router ports when those ports are
> handled on the same chassis for ingress and egress traffic. This patch
> does not change that behavior."
>
> On a second look, the above reasoning seems wrong.  It doesn't sound OK
> to rely on conntrack state retrieved from a CT zone that's not assigned
> to the logical port we're processing the packet on.
>
> I'm going to think about the right way to fix this issue and come back
> to this thread once it's figured out.
>
It turns out the fix is not necessarily that straightforward.  There are
a few different ways to address this though.  As we (Red Hat) are also
using this feature, I opened a ticket in our internal tracking system so
that we can analyze it in more depth:

https://issues.redhat.com/browse/FDP-1408

However, if the OVN community in general is willing to look at fixing
this bug, that would be great too.

Regards,
Dumitru

> Thanks again for the bug report!
>
> Regards,
> Dumitru
>
>> ------------------------------------------------------------------------
>>
>> Another Reproducible Scenario (Minimal)
>>
>> Port Group A on |vm_a| with:
>>
>>   * ACL A: |from-lport| IP4 (sample_est or not)
>>   * ACL B: |to-lport| ICMP |sample_est=1000000|
>>   * ACL C: |from-lport| IP6 sample_est=2000000
>>
>> Port Group B on |vm_b|:
>>
>>   * No sampling required
>>   * ACL to allow from-lport and to-lport traffic
>>
>> When pinging |vm_a| from |vm_b|, the ICMP reply still results in *two
>> samples with |obs_point_id=1000000|*.
>>
>> ------------------------------------------------------------------------
>>
>> 📌 Key Takeaway
>>
>> I believe this confirms the IPFIX duplication issue is *not due to
>> conntrack behavior*, but rather due to *how multiple ACLs with
>> sample_est on the same Port Group (in different directions) result in
>> two |userspace(flow_sample(...))| actions* in the same flow.
>>
>> ------------------------------------------------------------------------
>>
>> To avoid overloading the email, I’ve included more detailed output
>> and explanations in the attachment.
>>
>> This email uses formatting elements such as icons, headers, and
>> dividers for clarity. If you experience any display issues, please
>> let me know and I’ll avoid using them in future messages.
>>
>> Please tell me if I can run any additional traces. I’m happy to
>> assist further.
>>
>> Best regards,
>>
>> *Oscar*
>>
>> On Fri, May 9, 2025 at 7:16 PM Dumitru Ceara <[email protected]> wrote:
>>
>> On 5/9/25 2:14 PM, Dumitru Ceara wrote:
>> > On 5/9/25 5:38 AM, Trọng Đạt Trần wrote:
>> >> Hi Dumitru,
>> >>
>> >
>> > Hi Oscar,
>> >
>> >> Thank you for pointing that out.
>> >>
>> >> To clarify: the terms “inbound” and “outbound” in my previous message
>> >> were used from the *VM’s perspective*.
>> >>
>> >> Topology:
>> >>
>> >> |vm_a ---- network1 ---- router ---- network2 ---- vm_b|
>> >>
>> >> ACLs:
>> >>
>> >>   * *ACL A*: allow-related VMs to *send* IPv4 traffic
>> >>     (|direction=from-lport|)
>> >>   * *ACL B*: allow-related VMs to *receive* ICMP traffic
>> >>     (|direction=to-lport|)
>> >>
>> >> I’ve attached both the *Northbound and Southbound database dumps* to
>> >> ensure the full context is available.
>> >>
>> >
>> > Thanks for the info, I tried locally with a simplified setup where I
>> > emulate your topology:
>> >
>> > switch c9c171ef-849c-436d-b3f9-73d83b9c4e5d (ls)
>> >     port vm2
>> >         addresses: ["00:00:00:00:00:02"]
>> >     port vm1
>> >         addresses: ["00:00:00:00:00:01"]
>> >
>> > Those two VIFs are in a port group:
>> >
>> > # ovn-nbctl list port_group
>> > _uuid               : 7e7a96b9-e708-4eea-b380-018314f2435c
>> > acls                : [1d0e7b71-ff03-4c78-ace4-2448bf237e11,
>> > 7cb023e9-fee5-4576-a67d-ce1f5d98805b]
>> > external_ids        : {}
>> > name                : pg
>> > ports               : [d991baa6-21b0-4d46-a15d-71b9e8d6708d,
>> > f2c5679c-d891-4d34-8402-8bc2047fba61]
>> >
>> > With two ACLs applied:
>> > # ovn-nbctl acl-list pg
>> > from-lport   100 (inport==@pg && ip4) allow-related
>> >   to-lport   200 (outport==@pg && ip4 && icmp4) allow-related
>> >
>> > Both ACLs have only sampling for established traffic (sample_est) set:
>> > # ovn-nbctl list acl
>> > _uuid               : 1d0e7b71-ff03-4c78-ace4-2448bf237e11
>> > action              : allow-related
>> > direction           : from-lport
>> > match               : "inport==@pg && ip4"
>> > priority            : 100
>> > sample_est          : 23153fae-0a73-4f86-bdf2-137e76647da8
>> > sample_new          : []
>> >
>> > _uuid               : 7cb023e9-fee5-4576-a67d-ce1f5d98805b
>> > action              : allow-related
>> > direction           : to-lport
>> > match               : "outport==@pg && ip4 && icmp4"
>> > priority            : 200
>> > sample_est          : 42391c82-23d2-4f2b-a7b9-88afaa68282c
>> > sample_new          : []
>> >
>> > # ovn-nbctl list sample
>> > _uuid               : 23153fae-0a73-4f86-bdf2-137e76647da8
>> > collectors          : [82540855-dcd4-44e4-8354-e08a972500cd]
>> > metadata            : 2000000
>> >
>> > _uuid               : 42391c82-23d2-4f2b-a7b9-88afaa68282c
>> > collectors          : [82540855-dcd4-44e4-8354-e08a972500cd]
>> > metadata            : 1000000
>> >
>> > Then I send a single ICMP echo packet from vm2 towards vm1.  The ICMP
>> > echo hits both ACLs but, because it's the packet initiating the
>> > session, it doesn't generate a sample (sample_new is not set in the
>> > ACLs).  Instead 2 conntrack entries are created for the ICMP session:
>> >
>> > - one in the CT zone of vm2 - here the from-lport ACL is hit so the
>> > sample_est metadata of the from-lport ACL (2000000) is stored along in
>> > the conntrack state
>> >
>> > - one in the CT zone of vm1 - here the to-lport ACL is hit so the
>> > sample_est metadata of the to-lport ACL (1000000) is stored along in
>> > the conntrack state
>> >
>> > The ICMP echo packet reaches vm1 which replies with ICMP ECHO Reply.
>> >
>> > For the reply the CT zone of vm1 is first checked, we match the
>> > existing conntrack entry (its state moves to "established") and a
>> > sample for the stored metadata, 1000000, is generated.  Then, in the
>> > egress pipeline, the CT zone of vm2 is checked, we match the other
>> > existing conntrack entry (its state also moves to "established") and
>> > a sample for the stored metadata, 2000000, is generated.
>> >
>> > This seems correct to me.  Stats also seem to confirm that:
>> > # ip netns exec vm2 ping 42.42.42.2 -c1
>> > PING 42.42.42.2 (42.42.42.2) 56(84) bytes of data.
>> > 64 bytes from 42.42.42.2: icmp_seq=1 ttl=64 time=1.46 ms
>> >
>> > --- 42.42.42.2 ping statistics ---
>> > 1 packets transmitted, 1 received, 0% packet loss, time 0ms
>> > rtt min/avg/max/mdev = 1.455/1.455/1.455/0.000 ms
>> >
>> > # ovs-ofctl dump-ipfix-flow br-int
>> > NXST_IPFIX_FLOW reply (xid=0x2): 1 ids
>> >   id   2: flows=2, current flows=0, sampled pkts=2, ipv4 ok=2, ipv6
>> > ok=0, tx pkts=11
>> >           pkts errs=0, ipv4 errs=0, ipv6 errs=0, tx errs=11
>> >
>> > But then, when I increase the number of packets things become more
>> > interesting.  ICMP echos also generate samples.  And while that might
>> > seem like a bug, it's not. :)
>> >
>> > When ping sends multiple packets for a single invocation it uses the
>> > same ICMP ID and just increments the ICMP seq, e.g.:
>> >
>> > 14:07:41.986618 00:00:00:00:00:02 > 00:00:00:00:00:01, ethertype IPv4
>> > (0x0800), length 98: (tos 0x0, ttl 64, id 58647, offset 0, flags [DF],
>> > proto ICMP (1), length 84)
>> >     42.42.42.3 > 42.42.42.2: ICMP echo request, id 35717, seq 1,
>> > length 64
>> >
>> > 14:07:42.988077 00:00:00:00:00:02 > 00:00:00:00:00:01, ethertype IPv4
>> > (0x0800), length 98: (tos 0x0, ttl 64, id 59085, offset 0, flags [DF],
>> > proto ICMP (1), length 84)
>> >     42.42.42.3 > 42.42.42.2: ICMP echo request, id 35717, seq 2,
>> > length 64
>> >
>> > But conntrack doesn't use the ICMP ID in the key for the session it
>> > installs:
>>
>> Sorry about the typo, I meant to say "conntrack doesn't use the ICMP SEQ
>> in the key for the session it installs, it only uses the ICMP ID".
>>
>> >
>> > # ovs-appctl dpctl/dump-conntrack | grep 42.42.42
>> > icmp,orig=(src=42.42.42.3,dst=42.42.42.2,id=35628,type=8,code=0),reply=(src=42.42.42.2,dst=42.42.42.3,id=35628,type=0,code=0),zone=4,mark=131104,labels=0xf4240000000000000000000000000
>> > icmp,orig=(src=42.42.42.3,dst=42.42.42.2,id=35628,type=8,code=0),reply=(src=42.42.42.2,dst=42.42.42.3,id=35628,type=0,code=0),zone=6,mark=131072,labels=0x1e8480000000000000000000000000
>> >
>> > So, subsequent ICMP requests will match on these two existing
>> > established entries and (because sample_est is configured) samples
>> > are generated for them too.
>> >
>> > That's also visible in the datapath flows that forward packets in the
>> > "original" direction (ICMP ECHOs in our case):
>> >
>> > # ovs-appctl dpctl/dump-flows | grep sample | grep '\-rpl'
>> > recirc_id(0x29),in_port(3),ct_state(-new+est-rel-rpl-inv+trk),ct_mark(0x20000/0xff0071),ct_label(0x1e8480000000000000000000000000),eth(src=00:00:00:00:00:02,dst=00:00:00:00:00:01),eth_type(0x0800),ipv4(proto=1,frag=no),
>> > packets:8, bytes:784, used:2.342s,
>> > actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554434,obs_point_id=2000000,output_port=4294967295)),ct(commit,zone=6,mark=0x20000/0xff0071,label=0x1e8480000000000000000000000000/0xffffffffffff00000000000000000000,nat(src)),ct(zone=4),recirc(0x2a)
>> >
>> > recirc_id(0x2a),in_port(3),ct_state(-new+est-rel-rpl-inv+trk),ct_mark(0x20020/0xff0071),ct_label(0xf4240000000000000000000000000),eth(src=00:00:00:00:00:02,dst=00:00:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),
>> > packets:8, bytes:784, used:2.342s,
>> > actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554434,obs_point_id=1000000,output_port=4294967295)),ct(commit,zone=4,mark=0x20020/0xff0071,label=0xf4240000000000000000000000000/0xffffffffffff00000000000000000000,nat(src)),1
>> >
>> > So, for a less complicated test, maybe you should try with UDP/TCP
>> > instead.
>> >
>> > I hope that clarifies your doubts.
>> >
>> > Best regards,
>> > Dumitru
>> >
>> >> Best regards,
>> >>
>> >> Oscar
>> >>
>> >> On Thu, May 8, 2025 at 8:11 PM Dumitru Ceara <[email protected]> wrote:
>> >>
>> >> Hi Oscar,
>> >>
>> >> On 5/6/25 12:31 PM, Trọng Đạt Trần wrote:
>> >> > As requested, I’ve attached additional tracing information related
>> >> > to the sampling duplication issue.
>> >> >
>> >> >   * The file |ofproto_trace.log| contains the full output of
>> >> >     |ofproto/trace| commands.
>> >> >   * The archive |ovn-detrace.tar.gz| includes six separate files,
>> >> >     each corresponding to an |ovn-detrace| output for a flow I
>> >> >     believe is involved in the duplicated sampling.
>> >> >
>> >> > Since I’m not fully confident in how to use the |--ct-next| option,
>> >> > I’ve included traces for all six related flows to ensure
>> >> > completeness.
>> >> >
>> >> > Please let me know if you need further details, or if I should
>> >> > re-run any commands with additional options.
>> >> >
>> >>
>> >> This seems fairly easy to reproduce locally for investigation; I
>> >> didn't try yet though.  However, would you mind sharing your OVN NB
>> >> database file (I'm assuming this is a test environment)?
>> >>
>> >> I would like to make sure we don't have any misunderstanding because
>> >> the terms you use below in your ACL description (e.g.,
>> >> "outbound"/"inbound") are not standard terms.  Having the actual ACL
>> >> (and the rest of the NB) contents will make it easier to debug.
>> >>
>> >> Thanks,
>> >> Dumitru
>> >>
>> >> > Best regards,
>> >> >
>> >> > *Oscar*
>> >> >
>> >> > On Tue, May 6, 2025 at 4:15 PM Adrián Moreno <[email protected]> wrote:
>> >> >
>> >> > On Tue, May 06, 2025 at 11:48:07AM +0700, Trọng Đạt Trần wrote:
>> >> > > Dear Adrián,
>> >> > >
>> >> > > Thank you for your response. I’ve applied your suggestion to use
>> >> > > separate sample entries for each ACL. However, I am still seeing
>> >> > > unexpected behavior in the IPFIX output that I’d like to clarify.
>> >> > >
>> >> > > Test Setup (Same as Before)
>> >> > >
>> >> > > vm_a ---- network1 ---- router ---- network2 ---- vm_b
>> >> > >
>> >> > >    - Two ACLs:
>> >> > >       - ACL A: allow-related *outbound* IPv4
>> >> > >       - ACL B: allow-related *inbound* ICMP
>> >> > >    - ACLs applied symmetrically to both VMs.
>> >> > >    - Test traffic: ICMP request from vm_b to vm_a, and reply from
>> >> > >      vm_a to vm_b.
>> >> > >
>> >> > > Key Problem Observed
>> >> > >
>> >> > > When sampling is enabled on *both* ACLs, the IPFIX record for
>> >> > > *flow (3)* (the ICMP reply from vm_a → router) shows *120
>> >> > > packets/min*.
>> >> > >
>> >> > > However:
>> >> > >
>> >> > >    - If *only ACL B* (inbound ICMP) is sampled → (3) = 60
>> >> > >      packets/min
>> >> > >    - If *only ACL A* (outbound IP4) is sampled → (3) not present
>> >> > >    - If both are sampled → (3) = 120 packets/min
>> >> > >
>> >> > > This suggests that *flow (3) is being sampled twice*, even though
>> >> > > it represents a *single logical flow and matches only ACL B*.
>> >> > >
>> >> > > IPFIX Observations
>> >> > >
>> >> > > Flow  Description                     Expected  Actual
>> >> > > (1)   vm_b → router (ICMP request)    60 pkt/m  60 pkt/m
>> >> > > (2)   router → vm_a (ICMP request)    60 pkt/m  60 pkt/m
>> >> > > (3)   vm_a → router (ICMP reply)      60 pkt/m  120 pkt/m ⚠️
>> >> > > (4)   router → vm_b (ICMP reply)      60 pkt/m  60 pkt/m
>> >> >
>> >> > This is not what I'd expect, maybe Dumitru knows?
>> >> >
>> >> > Could you attach ofproto/trace and ovn-detrace outputs from both
>> >> > directions?
>> >> >
>> >> > Thanks.
>> >> > Adrián
>> >> >
>> >>
_______________________________________________
discuss mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
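
For anyone reproducing the symptom discussed in this thread, here is a
minimal sketch of how the duplicated sampling could be spotted without
reading the dumps by eye.  It counts flow_sample(...) actions per
obs_point_id in the actions field of a single datapath flow and reports
any id that appears more than once.  It only assumes the text format of
the datapath flow dumps quoted in this thread; count_samples and
duplicated_points are hypothetical helper names (not part of OVS or
OVN), and ACTIONS is a shortened copy of the recirc_id(0x1d5) action
list, not a complete flow.

```python
import re
from collections import Counter

# Matches one flow_sample(...) action and captures its obs_point_id.
# flow_sample(...) has no nested parentheses in the dumps quoted in this
# thread, so "[^)]*" is enough to stay inside a single action.
_SAMPLE_RE = re.compile(r"flow_sample\([^)]*?obs_point_id=(\d+)[^)]*\)")


def count_samples(actions: str) -> Counter:
    """Return how many flow_sample actions each obs_point_id has."""
    return Counter(int(point) for point in _SAMPLE_RE.findall(actions))


def duplicated_points(actions: str) -> list:
    """Return the obs_point_ids sampled more than once in one flow."""
    return sorted(p for p, n in count_samples(actions).items() if n > 1)


# Shortened action list modelled on the recirc_id(0x1d5) flow in this thread.
ACTIONS = (
    "userspace(pid=4294967295,flow_sample(probability=65535,"
    "collector_set_id=2,obs_domain_id=33554437,obs_point_id=1000000,"
    "output_port=4294967295)),"
    "userspace(pid=4294967295,flow_sample(probability=65535,"
    "collector_set_id=2,obs_domain_id=33554437,obs_point_id=1000000,"
    "output_port=4294967295)),"
    "ct_clear,set(ipv4(ttl=63)),ct(zone=21),recirc(0x1d6)"
)

if __name__ == "__main__":
    print(count_samples(ACTIONS))
    print(duplicated_points(ACTIONS))
```

Running duplicated_points over each line of the datapath flow dump
should list the flows affected by the duplication; a flow that carries
one sample per ACL, like the recirc_id(0x1d6) one, yields an empty list.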
