Hi Dumitru,

Thanks again for confirming how the sampling probability behaves.
Since the probability is applied per flow, the current sample_collector
design is indeed sufficient: we can use metadata to separate ACL sampling
domains between different teams, which addresses my concern about traffic
imbalance.
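
For reference, this is roughly how I plan to model the per-team separation
(just a sketch; UUIDs and names below are placeholders, not the actual
commands from my environment):

    # one shared collector, ~10% sampling
    ovn-nbctl create Sample_Collector id=2 set_id=2 probability=6553

    # one Sample per team, pointing at the same collector but carrying a
    # team-specific metadata (observation point) value
    ovn-nbctl create Sample metadata=100 collectors=<collector-uuid>
    ovn-nbctl create Sample metadata=200 collectors=<collector-uuid>

    # attach each Sample to the corresponding team's ACLs
    ovn-nbctl set ACL <acl-team-a-uuid> sample_est=<sample-team-a-uuid>
    ovn-nbctl set ACL <acl-team-b-uuid> sample_est=<sample-team-b-uuid>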

Regarding psample, I’d like to confirm whether there's a minimum required
kernel version for it to work properly.

From the Q&A section in your OVSCON’24 presentation, I understood that
kernel *6.11* might be required — but I also saw that psample was
introduced upstream as early as *4.11*.

My current setup is:

   - *OVS Library version*: 3.4.0
   - *Linux Kernel*: 5.15.0-134-generic (Ubuntu)

So far, I haven’t observed any psample activity. Is there a specific kernel
option, OVS configuration, or module that must be enabled for psample flows
to be generated?
If this is something I should explore further on my own, I completely
understand — just wanted to double-check before diving deeper.
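
In case it helps, this is roughly what I have checked so far on my node
(a sketch; the log path and the exact feature-probe message are assumptions
on my side and may differ per distribution/build):

    # running kernel and OVS userspace versions
    uname -r
    ovs-vswitchd --version

    # the psample kernel module exists since 4.11, but its presence alone
    # does not mean the openvswitch datapath supports the psample action
    lsmod | grep psample

    # OVS probes datapath features at startup; grepping the log should show
    # whether the psample action was detected (message wording is a guess)
    grep -i psample /var/log/openvswitch/ovs-vswitchd.log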

Thanks again for all the support and insight you've shared.

Best regards,
*Oscar*
On Mon, May 19, 2025 at 3:46 PM Dumitru Ceara <[email protected]> wrote:

> On 5/19/25 9:01 AM, Trọng Đạt Trần wrote:
> > Hi Dumitru,
> >
>
> Hi Oscar,
>
> > I’d like to verify my understanding of how sampling behaves under traffic
> > imbalance, specifically when multiple ACLs use the *same sample_collector*.
> > ------------------------------
> > 🔧 Simplified Scenario
> >
> >    - *ACL_A* (Team A) is configured with:
> >       - sample action → metadata: 100
> >       - uses sample_collector_share
> >    - *ACL_B* (Team B) is configured with:
> >       - sample action → metadata: 200
> >       - uses the *same* sample_collector_share
> >    - sample_collector_share is configured with:
> >       - probability = 6553 (10%)
> >
> > Now assume the following:
> >
> >    - 90 packets match *ACL_A*
> >    - 10 packets match *ACL_B*
> >
> > ------------------------------
> > ❓Question
> >
> > Which of the two behaviors should I expect?
> >
> > *(1)* A total of *10 packets randomly sampled* from the full 100 packets,
> > regardless of metadata (since the sample configurations share the same
> > sample_collector);
> > *or*
> > *(2)* A *proportional sampling* outcome:
> >
> >    - 9 packets sampled from ACL_A (90 × 10%)
> >    - 1 packet sampled from ACL_B (10 × 10%)
> >
> > ------------------------------
> > 📖 Documentation vs. OpenFlow Action
> >
> > The OVN NB schema documentation under Sample_Collector suggests the
> > *first* interpretation:
> >
> > “Probability: Sampling probability for this collector.”
> >
> > However, based on your earlier explanation and the OpenFlow action:
> >
> > flow_sample(probability=65535, collector_set_id=2, obs_domain_id=...,
> > obs_point_id=...)
> >
> > ... I’m inclined to believe the *second* interpretation is correct, since
> > each sample action is independently applied with its own metadata
> > (obs_point_id), even if they point to the same sample_collector.
> > ------------------------------
> >
> > Could you kindly confirm which interpretation is correct?
> > ------------------------------
>
> You're right, the second interpretation is correct.  The probability is
> associated with the OVS sample() actions in the resulting OpenFlow rules.
> So, with your example, the two ACLs will produce two different OpenFlow
> rules with different sample actions, so the probability will be applied
> per flow.
>
> Maybe we should update the OVN documentation to make it clearer.
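>
> As a sketch (the bridge name and the exact OpenFlow tables involved will
> differ per deployment), something like this should show one sample()
> action per sampled ACL, each carrying its own obs_point_id:
>
>   # ovs-ofctl dump-flows br-int | grep -oE 'sample\(probability=[^)]+\)' | sort | uniq -c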
>
> > 📖 Sample Performance
> >
> > Thank you for pointing me to your OVSCON'24 presentation — I had missed it
> > earlier. It was very informative and gave me a much better understanding of
> > the potential performance bottlenecks in the current sampling design.
> >
> > I'll make sure to explore those aspects further in my upcoming tests.
> >
>
> Ack.
>
> > Regarding *psample*, I’d be happy to evaluate its performance when it’s
> > ready or when support becomes stable in OVN environments. It seems like a
> > promising direction to offload sampling and reduce vswitchd overhead.
> >
>
> Yes, that was the goal: reduce latency and reduce load on vswitchd.
>
> The psample support is stable (part of OVS release 3.4.0 - August 2024).
> The only requirement is that the running kernel also supports the
> datapath action.
>
> Regards,
> Dumitru
>
> > Best regards,
> >
> > *Oscar*
> >
> > On Fri, May 16, 2025 at 8:03 PM Dumitru Ceara <[email protected]> wrote:
> >
> >> On 5/16/25 6:07 AM, Trọng Đạt Trần wrote:
> >>> Dear Dumitru,
> >>>
> >>
> >> Hi Oscar,
> >>
> >>> Thank you for confirming the bug — I’m happy to help however I can.
> >>> ------------------------------
> >>> I. Temporary Workaround & Feedback
> >>>
> >>> To work around the IPFIX duplication issue in the meantime, I’ve
> >>> implemented a post-processing filter that divides duplicate samples by
> >>> two.
> >>> The logic relies on two elements:
> >>>
> >>>    1. *Source and destination MAC addresses* to detect reply traffic
> >>>       from VM → router port.
> >>>    2. *Sample metadata* (from the sample entry) to ensure that the
> >>>       match comes from a to-lport ACL.
> >>>
> >>> This combination seems to reliably identify duplicated samples. I've
> >>> tested this across multiple scenarios and it works well so far.
> >>>
> >>> *Do you foresee any edge cases where this workaround might break down
> >>> or behave incorrectly?*
> >>
> >> At first glance this seems OK to me.
> >>
> >>> ------------------------------
> >>> II. Questions Regarding OVN Sampling
> >>>
> >>> 1. *Sample Collector Table Limits*
> >>>
> >>> In my deployment, multiple teams share the network, but generate highly
> >>> imbalanced traffic. For example:
> >>>
> >>>    - Team A sends 90% of total traffic.
> >>>    - Team B sends only 10%.
> >>>
> >>> If I configure a shared sample_collector with probability = 6553 (≈10%),
> >>> there’s a chance Team A may generate most or all samples while Team B’s
> >>> traffic may not be captured at all.
> >>>
> >>
> >> Is traffic from Team A and Team B hitting the same ACLs?  Can't the ACLs
> >> be partitioned (different port groups) per team?  Then you'd be able to
> >> use different Sample.metadata for different teams.
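> >>
> >> For example, something along these lines (just a sketch, port names are
> >> placeholders):
> >>
> >>   # ovn-nbctl pg-add pg_team_a <team_a_lsp1> <team_a_lsp2>
> >>   # ovn-nbctl pg-add pg_team_b <team_b_lsp1>
> >>
> >> with each team's ACLs matching on its own port group and referencing a
> >> Sample record that carries a team-specific metadata value.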
> >>
> >>> Furthermore, the IPFIX table in the OVSDB sets cache_max_flows limits,
> >>> which means Team A and Team B cannot be configured on the same set_id.
> >>>
> >>> To solve this, I configure one sample_collector per team (different
> >>> set_ids), so each has independent sampling:
> >>>
> >>> sample_collector "team_a": id=2, set_id=2
> >>> sample_collector "team_b": id=1, set_id=1
> >>>
> >>> This setup works, but it introduces a potential limitation:
> >>>
> >>>    - Since set_id is limited to 256 values, we can only support up to
> >>>      256 teams (or tenants).
> >>>    - In multi-tenant environments, this ceiling may be too low.
> >>>
> >>> Would it make sense to consider increasing this limit?
> >>
> >> Actually, the set_id shouldn't be limited to 8 bits, it can be any 32-bit
> >> value according to the schema:
> >>
> >> "set_id": {"type": {"key": {
> >>     "type": "integer",
> >>     "minInteger": 1,
> >>     "maxInteger": 4294967295}}},
> >>
> >> As a side thing, now that you mention this, we only use the 8 LSB as
> >> set_id in the flows we generate.  I think that's a bug and we should
> >> fix it.  I posted a patch here:
> >>
> >> https://mail.openvswitch.org/pipermail/ovs-dev/2025-May/423409.html
> >>
> >> However, there is indeed a limit that allows at _most_ 255 unique
> >> Sample_Collector NB records:
> >>
> >> "Sample_Collector": {
> >>     "columns": {
> >>         "id": {"type": {"key": {
> >>             "type": "integer",
> >>             "minInteger": 1,
> >>             "maxInteger": 255}}},
> >>
> >> That's because we need to store the NB Sample_Collector ID in the
> >> conntrack mark of the session we're sampling.  CT mark is a 32bit
> >> value and we use some bits in it for other features:
> >>
> >>     expr_symtab_add_subfield_scoped(symtab, "ct_mark.obs_collector_id",
> >> NULL,
> >>                                     "ct_mark[16..23]", WR_CT_COMMIT);
> >>
> >> Looking at the current code I _think_ we have 8 more bits
> >> available.  However, expanding the ct_mark.obs_collector_id to use
> >> the whole remainder of ct_mark (64K values) seems "risky" because
> >> we don't know beforehand if we'll need more bits for other features
> >> in the future.
> >>
> >> Do you have a suggestion of reasonable maximum limit for the number
> >> of teams (users) in your use case?
> >>
> >>> 2. *Sampling Performance Considerations*
> >>>
> >>> Here is my current understanding — I’d appreciate confirmation or
> >>> corrections:
> >>>
> >>>    - Sampling performance is not heavily dependent on ovn-northd or
> >>>      ovn-controller, since the generation of the sampling flow is
> >>>      insignificant compared to many other features.
> >>>    - In ovs-vswitchd, both memory and CPU usage scale roughly linearly
> >>>      with the number of active OpenFlow rules using sample(...) actions
> >>>      and the rate at which those samples are triggered and exported.
> >>>    - Under high load, performance can be tuned using the
> >>>      cache_active_timeout and cache_max_flows fields in the IPFIX table
> >>>      (e.g., the commands sketched below). These parameters control
> >>>      export frequency and the size of the flow cache, allowing a
> >>>      balance between monitoring fidelity and resource efficiency.
> >>>
> >>> Is this an accurate summary? Or are there other scaling or bottleneck
> >>> factors I should consider?
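> >>>
> >>> For reference, the knobs I have in mind are the IPFIX table columns in
> >>> the OVS database; the values below are only examples and the row
> >>> selection is a sketch:
> >>>
> >>>   # ovs-vsctl list IPFIX
> >>>   # ovs-vsctl set IPFIX <ipfix-row-uuid> cache_active_timeout=30 \
> >>>         cache_max_flows=4096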
> >>
> >> I'm not sure if you're aware, but OVS (with the kernel netlink datapath
> >> and on relatively new kernels) supports a different way of sampling:
> >> psample.
> >>
> >> https://github.com/openvswitch/ovs/commit/1a3bd96
> >>
> >> This avoids sending packets to vswitchd altogether and allows better
> >> sampling performance.
> >>
> >> This might give more insight: a presentation from OVSCON'24 with an
> >> end-to-end solution for sampling network policies (ACLs) with psample in
> >> ovn-kubernetes:
> >>
> >> https://www.youtube.com/watch?v=gLwDsaiUuN4&t=2s
> >>
> >>> 3. *Separate Bug Regarding ACL Tier and Sampling*
> >>>
> >>> I’ve also observed an issue related to sampling and ACL tier
> >>> interactions. Would you prefer I continue in this thread or open a new one?
> >>>
> >>
> >> It might be better to start a new thread.  Thanks again for trying this
> >> new feature out!
> >>
> >>> Happy to follow your preferred workflow.
> >>> ------------------------------
> >>>
> >>> Thanks again for your time and support.
> >>>
> >>> Best regards,
> >>> *Oscar*
> >>>
> >>
> >> Best regards,
> >> Dumitru
> >>
> >>> On Wed, May 14, 2025 at 5:10 PM Dumitru Ceara <[email protected]>
> wrote:
> >>>
> >>>> Hi Oscar,
> >>>>
> >>>> On 5/13/25 1:04 PM, Dumitru Ceara wrote:
> >>>>> On 5/13/25 11:06 AM, Trọng Đạt Trần wrote:
> >>>>>> Dear Dumitru,
> >>>>>>
> >>>>>
> >>>>> Hi Oscar,
> >>>>>
> >>>>>> Over the past few days, I’ve performed additional tests to gain a better
> >>>>>> understanding of the issue before giving you the details.
> >>>>>>
> >>>>>> Thank you for your earlier explanation, it clarified how conntrack and
> >>>>>> sampling work in the simple "|vm1 --- ls --- vm2|" topology. However, I
> >>>>>> believe my original observations still hold in router-related topologies.
> >>>>>>
> >>>>>>
> >> ------------------------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>       Setup Recap
> >>>>>>
> >>>>>> *Topology*: vm_a(10.2.1.5) --- ls1 --- router --- ls2 --- vm_b (10.2.3.5)
> >>>>>>
> >>>>>> ACLs applied to a shared Port Group (|pg_d559...|):
> >>>>>>
> >>>>>>   * *ACL A*: |from-lport| – allow-related IPv4 (sample_est = |2000000|)
> >>>>>>   * *ACL B*: |to-lport| – allow-related ICMP (sample_est = |1000000|)
> >>>>>>
> >>>>>> *Sample configuration*:
> >>>>>>
> >>>>>>   * ACL A: direction=from-lport, match="inport == @pg && ip4",
> >>>>>>     sample_est=2000000
> >>>>>>   * ACL B: direction=to-lport, match="outport == @pg && ip4 &&
> icmp4",
> >>>>>>     sample_est=1000000
> >>>>>>
> >>>>>>     # ovn-nbctl acl-list pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc
> >>>>>>       from-lport  1002 (inport ==
> >>>>>>     @pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc && ip4) allow-related
> >>>>>>       to-lport  1002 (outport ==
> >>>>>>     @pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc && ip4 && ip4.src ==
> >>>>>>     0.0.0.0/0 <http://0.0.0.0/0> && icmp4) allow-related
> >>>>>>
> >>>>>>
> >> ------------------------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>       Expected Behavior (based on your explanation)
> >>>>>>
> >>>>>>   * *First ICMP request*: no sample (ct=new).
> >>>>>>   * *First ICMP reply*:
> >>>>>>       o One sample from *ingress pipeline* (sample_est = |1000000|)
> >>>>>>       o One sample from *egress pipeline* (sample_est = |2000000|)
> >>>>>>         → *Total: 2 samples* for the reply --> True
> >>>>>>
> >>>>>>
> >> ------------------------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>       Actual Behavior Observed
> >>>>>>
> >>>>>> On the *first ICMP reply*, I see:
> >>>>>>
> >>>>>>   * *3 samples total*:
> >>>>>>       o *2 samples* in the *ingress pipeline*, both with
> >>>>>>         |obs_point_id=1000000|
> >>>>>>       o *1 sample* in the egress pipeline, with |obs_point_id=2000000|
> >>>>>>
> >>>>>> This results in *duplicated sampling actions for a single logical
> >>>>>> datapath flow* within the ingress pipeline.
> >>>>>>
> >>>>>> Evidence:
> >>>>>>
> >>>>>> # ovs-dpctl dump-flows | grep 10.2.1.5
> >>>>>> recirc_id(0x1d5),in_port(6),ct_state(-new+est-rel+rpl-
> >>>>>>
> >>>>
> >>
> inv+trk),ct_mark(0x20020/0xff0031),ct_label(0xf4240000000000000000000000000),eth(src=fa:16:3e:6b:42:8e,dst=fa:16:3e:dd:02:c0),eth_type(0x0800),ipv4(src=10.2.1.5,dst=10.2.3.5,proto=1,ttl=64,frag=no),
> >>>> packets:299, bytes:29302, used:0.376s,
> >>>>
> >>
> actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554437,obs_point_id=1000000,output_port=4294967295)),userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554437,obs_point_id=1000000,output_port=4294967295)),ct_clear,set(eth(src=fa:16:3e:d5:7b:d1,dst=fa:16:3e:f8:af:7d)),set(ipv4(ttl=63)),ct(zone=21),recirc(0x1d6)
> >>>>>> |# recirc_id(0x1d5): two flow_sample(...) actions with same metadata
> >>>>>> (1000000)
> >>>>>> recirc_id(0x1d6),in_port(6),ct_state(-new+est-rel+rpl-
> >>>>>>
> >>>>
> >>
> inv+trk),ct_mark(0x20000/0xff0031),ct_label(0x1e8480000000000000000000000000),eth(dst=fa:16:3e:f8:af:7d),eth_type(0x0800),ipv4(dst=10.2.3.5,frag=no),
> >>>> packets:299, bytes:29302, used:0.376s,
> >>>>
> >>
> actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554439,obs_point_id=2000000,output_port=4294967295)),9
> >>>>>> |
> >>>>>> |# plus one flow_sample(...) later in the pipeline with metadata
> >>>> (2000000)|
> >>>>>>
> >>>>>> Also confirmed via IPFIX stats:
> >>>>>>
> >>>>>> # IPFIX before ping
> >>>>>> sampled pkts: 192758
> >>>>>> # After a single ping
> >>>>>> sampled pkts: 192761  → Δ = 3
> >>>>>>
> >>>>>>
> >>>>>>       Additional Findings
> >>>>>>
> >>>>>>   * The issue *only occurs* when VMs are on *separate logical
> >>>>>>     switches connected by a router*.
> >>>>>>   * If both VMs are on the *same logical switch*, IPFIX is correctly
> >>>>>>     sampled only once per ACL.
> >>>>>>   * The duplicated sampling occurs *even if ACL A (IPv4) and ACL C
> >>>>>>     (IPv6) are unrelated*, as long as both have |sample_est| and
> >>>>>>     belong to the same Port Group.
> >>>>>>   * The error can be reproduced *even when only vm_a's Port Group
> >>>>>>     has the sampling ACLs*. vm_b does not require any sampling
> >>>>>>     configuration for the issue to occur.
> >>>>>>
> >>>>>
> >>>>> Thanks a lot for the follow up!  You're right, this is indeed a bug.
> >>>>> And that's because we don't clear the packet's ct_state (well all
> >>>>> conntrack related information) when advancing to the egress pipeline
> of
> >>>>> a switch when the outport is one connected to a router.
> >>>>>
> >>>>> That's due to https://github.com/ovn-org/ovn/commit/d17ece7 where we
> >>>>> chose to skip ct_clear if the switch has stateful (allow-related)
> ACLs:
> >>>>>
> >>>>> "Also, this patch does not change the behavior for ACLs such as
> >>>>> allow-related: packets are still sent to conntrack, even for router
> >>>>> ports. While this does not work if router ports are distributed,
> >>>>> allow-related ACLs work today on router ports when those ports are
> >>>>> handled on the same chassis for ingress and egress traffic. This
> patch
> >>>>> does not change that behavior."
> >>>>>
> >>>>> On a second look, the above reasoning seems wrong.  It doesn't sound
> OK
> >>>>> to rely on conntrack state retrieved from a CT zone that's not
> assigned
> >>>>> to the logical port we're processing the packet on.
> >>>>>
> >>>>> I'm going to think about the right way to fix this issue and come
> back
> >>>>> to this thread once it's figured out.
> >>>>>
> >>>>
> >>>> It turns out the fix is not necessarily that straightforward.  There
> >>>> are a few different ways to address this though.  As we (Red Hat) are
> >>>> also using this feature, I opened a ticket in our internal tracking
> >>>> system so that we analyze it in more depth.
> >>>>
> >>>> https://issues.redhat.com/browse/FDP-1408
> >>>>
> >>>> However, if the OVN community in general is willing to look at fixing
> >>>> this bug that would be great too.
> >>>>
> >>>> Regards,
> >>>> Dumitru
> >>>>
> >>>>> Thanks again for the bug report!
> >>>>>
> >>>>> Regards,
> >>>>> Dumitru
> >>>>>
> >>>>>>
> >> ------------------------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>       Another Reproducible Scenario (Minimal)
> >>>>>>
> >>>>>> Port Group A on |vm_a| with:
> >>>>>>
> >>>>>>   * ACL A: |from-lport| IP4 (sample_est or not)
> >>>>>>   * ACL B: |to-lport| ICMP |sample_est=1000000|
> >>>>>>   * ACL C: |from-lport| IP6 sample_est=2000000
> >>>>>>
> >>>>>> Port Group B on |vm_b|:
> >>>>>>
> >>>>>>   * No sampling required
> >>>>>>   * ACL to allow from-lport and to-lport traffic
> >>>>>>
> >>>>>> When pinging |vm_a| from |vm_b|, the ICMP reply still results in *two
> >>>>>> samples with |obs_point_id=1000000|*.
> >>>>>>
> >>>>>>
> >> ------------------------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>       📌 Key Takeaway
> >>>>>>
> >>>>>> I believe this confirms the IPFIX duplication issue is *not due to
> >>>>>> conntrack behavior*, but rather due to *how multiple ACLs with
> >>>>>> sample_est on the same Port Group (in different directions) result
> >>>>>> in duplicate |userspace(flow_sample(...))| actions* in the same flow.
> >>>>>>
> >>>>>>
> >> ------------------------------------------------------------------------
> >>>>>>
> >>>>>>
> >>>>>>       To avoid overloading the email, I’ve included more detailed
> >>>>>>       output and explanations in the attachment.
> >>>>>>
> >>>>>>
> >>>>>>       This email uses formatting elements such as icons, headers, and
> >>>>>>       dividers for clarity. If you experience any display issues,
> >>>>>>       please let me know and I’ll avoid using them in future messages.
> >>>>>>
> >>>>>>
> >>>>>>       Please tell me if I can run any additional traces. I’m happy to
> >>>>>>       assist further.
> >>>>>>
> >>>>>>
> >>>>>>       Best regards,
> >>>>>>
> >>>>>>
> >>>>>>       *Oscar*
> >>>>>>
> >>>>>> |
> >>>>>>
> >>>>>>
> >>>>>> On Fri, May 9, 2025 at 7:16 PM Dumitru Ceara <[email protected]
> >>>>>> <mailto:[email protected]>> wrote:
> >>>>>>
> >>>>>>     On 5/9/25 2:14 PM, Dumitru Ceara wrote:
> >>>>>>     > On 5/9/25 5:38 AM, Trọng Đạt Trần wrote:
> >>>>>>     >> Hi Dimitru,
> >>>>>>     >>
> >>>>>>     >
> >>>>>>     > Hi Oscar,
> >>>>>>     >
> >>>>>>     >
> >>>>>>     >> Thank you for pointing that out.
> >>>>>>     >>
> >>>>>>     >> To clarify: the terms “inbound” and “outbound” in my previous
> >>>> message
> >>>>>>     >> were used from the *VM’s perspective*.
> >>>>>>     >>
> >>>>>>     >>
> >>>>>>     >>       Topology:
> >>>>>>     >>
> >>>>>>     >> |vm_a ---- network1 ---- router ---- network2 ---- vm_b |
> >>>>>>     >>
> >>>>>>     >>
> >>>>>>     >>       ACLs:
> >>>>>>     >>
> >>>>>>     >>   *
> >>>>>>     >>
> >>>>>>     >>     *ACL A*: allow-related VMs to *send* IPv4 traffic (|
> >>>>>>     direction=from-
> >>>>>>     >>     lport|)
> >>>>>>     >>
> >>>>>>     >>   *
> >>>>>>     >>
> >>>>>>     >>     *ACL B*: allow-related VMs to *receive* ICMP traffic (|
> >>>>>>     direction=to-
> >>>>>>     >>     lport|)
> >>>>>>     >>
> >>>>>>     >> I’ve attached both the *Northbound and Southbound database
> >>>> dumps* to
> >>>>>>     >> ensure the full context is available.
> >>>>>>     >>
> >>>>>>     >
> >>>>>>     > Thanks for the info, I tried locally with a simplified setup
> >>>> where I
> >>>>>>     > emulate your topology:
> >>>>>>     >
> >>>>>>     > switch c9c171ef-849c-436d-b3f9-73d83b9c4e5d (ls)
> >>>>>>     >     port vm2
> >>>>>>     >         addresses: ["00:00:00:00:00:02"]
> >>>>>>     >     port vm1
> >>>>>>     >         addresses: ["00:00:00:00:00:01"]
> >>>>>>     >
> >>>>>>     > Those two VIFs are in a port group:
> >>>>>>     >
> >>>>>>     > # ovn-nbctl list port_group
> >>>>>>     > _uuid               : 7e7a96b9-e708-4eea-b380-018314f2435c
> >>>>>>     > acls                : [1d0e7b71-ff03-4c78-ace4-2448bf237e11,
> >>>>>>     > 7cb023e9-fee5-4576-a67d-ce1f5d98805b]
> >>>>>>     > external_ids        : {}
> >>>>>>     > name                : pg
> >>>>>>     > ports               : [d991baa6-21b0-4d46-a15d-71b9e8d6708d,
> >>>>>>     > f2c5679c-d891-4d34-8402-8bc2047fba61]
> >>>>>>     >
> >>>>>>     > With two ACLs applied:
> >>>>>>     > # ovn-nbctl acl-list pg
> >>>>>>     > from-lport   100 (inport==@pg && ip4) allow-related
> >>>>>>     >   to-lport   200 (outport==@pg && ip4 && icmp4) allow-related
> >>>>>>     >
> >>>>>>     > Both ACLs have only sampling for established traffic
> >> (sample_est)
> >>>> set:
> >>>>>>     > # ovn-nbctl list acl
> >>>>>>     > _uuid               : 1d0e7b71-ff03-4c78-ace4-2448bf237e11
> >>>>>>     > action              : allow-related
> >>>>>>     > direction           : from-lport
> >>>>>>     > match               : "inport==@pg && ip4"
> >>>>>>     > priority            : 100
> >>>>>>     > sample_est          : 23153fae-0a73-4f86-bdf2-137e76647da8
> >>>>>>     > sample_new          : []
> >>>>>>     >
> >>>>>>     > _uuid               : 7cb023e9-fee5-4576-a67d-ce1f5d98805b
> >>>>>>     > action              : allow-related
> >>>>>>     > direction           : to-lport
> >>>>>>     > match               : "outport==@pg && ip4 && icmp4"
> >>>>>>     > priority            : 200
> >>>>>>     > sample_est          : 42391c82-23d2-4f2b-a7b9-88afaa68282c
> >>>>>>     > sample_new          : []
> >>>>>>     >
> >>>>>>     > # ovn-nbctl list sample
> >>>>>>     > _uuid               : 23153fae-0a73-4f86-bdf2-137e76647da8
> >>>>>>     > collectors          : [82540855-dcd4-44e4-8354-e08a972500cd]
> >>>>>>     > metadata            : 2000000
> >>>>>>     >
> >>>>>>     > _uuid               : 42391c82-23d2-4f2b-a7b9-88afaa68282c
> >>>>>>     > collectors          : [82540855-dcd4-44e4-8354-e08a972500cd]
> >>>>>>     > metadata            : 1000000
> >>>>>>     >
> >>>>>>     > Then I send a single ICMP echo packet from vm2 towards vm1.
> The
> >>>> ICMP
> >>>>>>     > echo hits both ACLs but because it's the packet initiating the
> >>>> session
> >>>>>>     > doesn't generate a sample (sample_new is not set in the ACLs).
> >>>>>>     Instead
> >>>>>>     > 2 conntrack entries are created for the ICMP session:
> >>>>>>     >
> >>>>>>     > - one in the CT zone of vm2 - here the from-lport ACL is hit
> so
> >>>> the
> >>>>>>     > sample_est metadata of the from-lport ACL (200000) is stored
> >>>> along in
> >>>>>>     > the conntrack state
> >>>>>>     >
> >>>>>>     > - one in the CT zone of vm1 - here the tolport ACL is hit so
> the
> >>>>>>     > sample_est metadata of the to-lport ACL (100000) is stored
> along
> >>>>>>     in the
> >>>>>>     > conntrack state
> >>>>>>     >
> >>>>>>     > The ICMP echo packet reaches vm1 which replies with ICMP ECHO
> >>>> Reply.
> >>>>>>     >
> >>>>>>     > For the reply the CT zone of vm1 is first checked, we match
> the
> >>>>>>     existing
> >>>>>>     > conntrack entry (its state moves to "established") and a
> sample
> >>>>>>     for the
> >>>>>>     > stored metadata, 100000, is generated.  Then, in the egress
> >>>> pipeline,
> >>>>>>     > the CT zone of vm2 is checked, we match the other existing
> >>>> conntrack
> >>>>>>     > entry (its state also moves to "established") and a sample for
> >> the
> >>>>>>     > stored metadata, 200000, is generated.
> >>>>>>     >
> >>>>>>     > This seems correct to me.  Stats also seem to confirm that:
> >>>>>>     > # ip netns exec vm2 ping 42.42.42.2 -c1
> >>>>>>     > PING 42.42.42.2 (42.42.42.2) 56(84) bytes of data.
> >>>>>>     > 64 bytes from 42.42.42.2 <http://42.42.42.2>: icmp_seq=1
> ttl=64
> >>>>>>     time=1.46 ms
> >>>>>>     >
> >>>>>>     > --- 42.42.42.2 ping statistics ---
> >>>>>>     > 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> >>>>>>     > rtt min/avg/max/mdev = 1.455/1.455/1.455/0.000 ms
> >>>>>>     >
> >>>>>>     > # ovs-ofctl dump-ipfix-flow br-int
> >>>>>>     > NXST_IPFIX_FLOW reply (xid=0x2): 1 ids
> >>>>>>     >   id   2: flows=2, current flows=0, sampled pkts=2, ipv4 ok=2,
> >>>> ipv6
> >>>>>>     > ok=0, tx pkts=11
> >>>>>>     >           pkts errs=0, ipv4 errs=0, ipv6 errs=0, tx errs=11
> >>>>>>     >
> >>>>>>     > But then, when I increase the number of packets things become
> >> more
> >>>>>>     > interesting.  ICMP echos also generate samples.  And while
> that
> >>>> might
> >>>>>>     > seem like a bug, it's not. :)
> >>>>>>     >
> >>>>>>     > When ping sends multiple packets for a single invocation it
> uses
> >>>> the
> >>>>>>     > same ICMP ID and just increments the ICMP seq, e.g.:
> >>>>>>     >
> >>>>>>     > 14:07:41.986618 00:00:00:00:00:02 > 00:00:00:00:00:01,
> ethertype
> >>>> IPv4
> >>>>>>     > (0x0800), length 98: (tos 0x0, ttl 64, id 58647, offset 0,
> flags
> >>>> [DF],
> >>>>>>     > proto ICMP (1), length 84)
> >>>>>>     >     42.42.42.3 > 42.42.42.2 <http://42.42.42.2>: ICMP echo
> >>>>>>     request, id 35717, seq 1, length 64
> >>>>>>     >
> >>>>>>     > 14:07:42.988077 00:00:00:00:00:02 > 00:00:00:00:00:01,
> ethertype
> >>>> IPv4
> >>>>>>     > (0x0800), length 98: (tos 0x0, ttl 64, id 59085, offset 0,
> flags
> >>>> [DF],
> >>>>>>     > proto ICMP (1), length 84)
> >>>>>>     >     42.42.42.3 > 42.42.42.2 <http://42.42.42.2>: ICMP echo
> >>>>>>     request, id 35717, seq 2, length 64
> >>>>>>     >
> >>>>>>     > But conntrack doesn't use the ICMP ID in the key for the
> session
> >>>> it
> >>>>>>     > installs:
> >>>>>>
> >>>>>>     Sorry about the typo, I meant to say "conntrack doesn't use the
> >>>> ICMP SEQ
> >>>>>>     in the key for the session it installs, it only uses the ICMP
> ID".
> >>>>>>
> >>>>>>     >
> >>>>>>     > # ovs-appctl dpctl/dump-conntrack | grep 42.42.42
> >>>>>>     >
> >>>>>>
> >>>>
> >>
> icmp,orig=(src=42.42.42.3,dst=42.42.42.2,id=35628,type=8,code=0),reply=(src=42.42.42.2,dst=42.42.42.3,id=35628,type=0,code=0),zone=4,mark=131104,labels=0xf4240000000000000000000000000
> >>>>>>     >
> >>>>>>
> >>>>
> >>
> icmp,orig=(src=42.42.42.3,dst=42.42.42.2,id=35628,type=8,code=0),reply=(src=42.42.42.2,dst=42.42.42.3,id=35628,type=0,code=0),zone=6,mark=131072,labels=0x1e8480000000000000000000000000
> >>>>>>     >
> >>>>>>     > So, subsequent ICMP requests will match on these two existing
> >>>>>>     > established entries and (because sampling_est) is configured
> >>>>>>     samples are
> >>>>>>     > generated for them too.
> >>>>>>     >
> >>>>>>     > That's also visible in the datapath flows that forward packets
> >> in
> >>>> the
> >>>>>>     > "original" direction (ICMP ECHOs in our case):
> >>>>>>     >
> >>>>>>     > # ovs-appctl dpctl/dump-flows | grep sample | grep '\-rpl'
> >>>>>>     > recirc_id(0x29),in_port(3),ct_state(-new+est-rel-rpl-
> >>>>>>
> >>>>
> >>
> inv+trk),ct_mark(0x20000/0xff0071),ct_label(0x1e8480000000000000000000000000),eth(src=00:00:00:00:00:02,dst=00:00:00:00:00:01),eth_type(0x0800),ipv4(proto=1,frag=no),
> >>>>>>     > packets:8, bytes:784, used:2.342s,
> >>>>>>     >
> >>>>>>
> >>>>
> >>
> actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554434,obs_point_id=2000000,output_port=4294967295)),ct(commit,zone=6,mark=0x20000/0xff0071,label=0x1e8480000000000000000000000000/0xffffffffffff00000000000000000000,nat(src)),ct(zone=4),recirc(0x2a)
> >>>>>>     >
> >>>>>>     > recirc_id(0x2a),in_port(3),ct_state(-new+est-rel-rpl-
> >>>>>>
> >>>>
> >>
> inv+trk),ct_mark(0x20020/0xff0071),ct_label(0xf4240000000000000000000000000),eth(src=00:00:00:00:00:02,dst=00:00:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),
> >>>>>>     > packets:8, bytes:784, used:2.342s,
> >>>>>>     >
> >>>>>>
> >>>>
> >>
> actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554434,obs_point_id=1000000,output_port=4294967295)),ct(commit,zone=4,mark=0x20020/0xff0071,label=0xf4240000000000000000000000000/0xffffffffffff00000000000000000000,nat(src)),1
> >>>>>>     >
> >>>>>>     > So, for a less complicated test, maybe you should try with
> >> UDP/TCP
> >>>>>>     instead.
> >>>>>>     >
> >>>>>>     > I hope that clarifies your doubts.
> >>>>>>     >
> >>>>>>     > Best regards,
> >>>>>>     > Dumitru
> >>>>>>     >
> >>>>>>     >> Best regards,
> >>>>>>     >>
> >>>>>>     >> Oscar
> >>>>>>     >>
> >>>>>>     >>
> >>>>>>     >> On Thu, May 8, 2025 at 8:11 PM Dumitru Ceara <
> >> [email protected]
> >>>>>>     <mailto:[email protected]>
> >>>>>>     >> <mailto:[email protected] <mailto:[email protected]>>>
> wrote:
> >>>>>>     >>
> >>>>>>     >>     Hi Oscar,
> >>>>>>     >>
> >>>>>>     >>     On 5/6/25 12:31 PM, Trọng Đạt Trần wrote:
> >>>>>>     >>     > As requested, I’ve attached additional tracing
> >> information
> >>>>>>     related to
> >>>>>>     >>     > the sampling duplication issue.
> >>>>>>     >>     >
> >>>>>>     >>     >   *
> >>>>>>     >>     >
> >>>>>>     >>     >     The file |ofproto_trace.log| contains the full
> output
> >>>>>>     of |ofproto/
> >>>>>>     >>     >     trace| commands.
> >>>>>>     >>     >
> >>>>>>     >>     >   *
> >>>>>>     >>     >
> >>>>>>     >>     >     The archive |ovn-detrace.tar.gz| includes six
> >> separate
> >>>>>>     files, each
> >>>>>>     >>     >     corresponding to an |ovn-detrace| output for a
> flow I
> >>>>>>     believe is
> >>>>>>     >>     >     involved in the duplicated sampling.
> >>>>>>     >>     >
> >>>>>>     >>     > Since I’m not fully confident in how to use |--ct-next
> >>>>>>     option|, I’ve
> >>>>>>     >>     > included traces for all six related flows to ensure
> >>>>>>     completeness.
> >>>>>>     >>     >
> >>>>>>     >>     > Please let me know if you need further details, or if I
> >>>>>>     should re-run
> >>>>>>     >>     > any commands with additional options.
> >>>>>>     >>     >
> >>>>>>     >>
> >>>>>>     >>     This seems fairly easy to reproduce locally for
> >>>>>>     investigation; I didn't
> >>>>>>     >>     try yet though.  However, would you mind sharing your OVN
> >> NB
> >>>>>>     database
> >>>>>>     >>     file (I'm assuming this is a test environment)?
> >>>>>>     >>
> >>>>>>     >>     I would like to make sure we don't have any
> >> misunderstanding
> >>>>>>     because the
> >>>>>>     >>     terms you use below in your ACL description (e.g.,
> >>>>>>     "outbound"/"inbound")
> >>>>>>     >>     are not standard terms.  Having the actual ACL (and the
> >> rest
> >>>>>>     of the NB)
> >>>>>>     >>     contents will make it easier to debug.
> >>>>>>     >>
> >>>>>>     >>     Thanks,
> >>>>>>     >>     Dumitru
> >>>>>>     >>
> >>>>>>     >>     > Best regards,
> >>>>>>     >>     >
> >>>>>>     >>     > *Oscar*
> >>>>>>     >>     >
> >>>>>>     >>     >
> >>>>>>     >>     > On Tue, May 6, 2025 at 4:15 PM Adrián Moreno
> >>>>>>     <[email protected] <mailto:[email protected]>
> >>>>>>     >>     <mailto:[email protected] <mailto:[email protected]
> >>
> >>>>>>     >>     > <mailto:[email protected] <mailto:
> [email protected]>
> >>>>>>     <mailto:[email protected] <mailto:[email protected]>>>>
> >> wrote:
> >>>>>>     >>     >
> >>>>>>     >>     >     On Tue, May 06, 2025 at 11:48:07AM +0700, Trọng Đạt
> >>>>>>     Trần wrote:
> >>>>>>     >>     >     > Dear Adrián,
> >>>>>>     >>     >     >
> >>>>>>     >>     >     > Thank you for your response. I’ve applied your
> >>>>>>     suggestion to use
> >>>>>>     >>     >     separate
> >>>>>>     >>     >     > sample entries for each ACL. However, I am still
> >>>> seeing
> >>>>>>     >>     unexpected
> >>>>>>     >>     >     behavior
> >>>>>>     >>     >     > in the IPFIX output that I’d like to clarify.
> >>>>>>     >>     >     > Test Setup (Same as Before)
> >>>>>>     >>     >     >
> >>>>>>     >>     >     > vm_a ---- network1 ---- router ---- network2 ----
> >>>> vm_b
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >    -
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >    Two ACLs:
> >>>>>>     >>     >     >    -
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >       ACL A: allow-related *outbound* IPv4
> >>>>>>     >>     >     >       -
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >       ACL B: allow-related *inbound* ICMP
> >>>>>>     >>     >     >       -
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >    ACLs applied symmetrically to both VMs.
> >>>>>>     >>     >     >    -
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >    Test traffic: ICMP request from vm_b to vm_a,
> >> and
> >>>>>>     reply from
> >>>>>>     >>     >     vm_a to vm_b
> >>>>>>     >>     >     >    .
> >>>>>>     >>     >     >
> >>>>>>     >>     >     > Key Problem Observed
> >>>>>>     >>     >     >
> >>>>>>     >>     >     > When sampling is enabled on *both* ACLs, the
> IPFIX
> >>>>>>     record for
> >>>>>>     >>     >     *flow (3)*
> >>>>>>     >>     >     > (the ICMP reply from vm_a → router) shows *120
> >>>>>>     packets/min*.
> >>>>>>     >>     >     >
> >>>>>>     >>     >     > However:
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >    -
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >    If *only ACL B* (inbound ICMP) is sampled →
> (3)
> >> =
> >>>> 60
> >>>>>>     >>     packets/min
> >>>>>>     >>     >     >    -
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >    If *only ACL A* (outbound IP4) is sampled →
> (3)
> >>>>>>     not present
> >>>>>>     >>     >     >    -
> >>>>>>     >>     >     >
> >>>>>>     >>     >     >    If both are sampled → (3) = 120 packets/min
> >>>>>>     >>     >     >
> >>>>>>     >>     >     > This suggests that *flow (3) is being sampled
> >> twice*
> >>>>>>     — even
> >>>>>>     >>     though it
> >>>>>>     >>     >     > represents a *single logical flow and matches
> only
> >>>>>>     ACL B*.
> >>>>>>     >>     >     > IPFIX Observations
> >>>>>>     >>     >     > FlowDescriptionExpectedActual
> >>>>>>     >>     >     > (1) vm_b → router (ICMP request) 60 pkt/m 60
> >>>>>>     >>     >     > (2) router → vm_a (ICMP request) 60 pkt/m 60
> >>>>>>     >>     >     > (3) vm_a → router (ICMP reply) 60 pkt/m 120 ⚠️
> >>>>>>     >>     >     > (4) router → vm_b (ICMP reply) 60 pkt/m 60
> >>>>>>     >>     >
> >>>>>>     >>     >     This is not what I'd expect, maybe Dumitru knows?
> >>>>>>     >>     >
> >>>>>>     >>     >     Could you attach ofproto/trace and ovn-detrce
> outputs
> >>>>>>     from both
> >>>>>>     >>     >     directions?
> >>>>>>     >>     >
> >>>>>>     >>     >     Thanks.
> >>>>>>     >>     >     Adrián
> >>>>>>     >>     >
> >>>>>>     >>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>