On Wed, Sep 7, 2022 at 8:47 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> On 9/6/22 22:03, Han Zhou wrote:
> > On Tue, Sep 6, 2022 at 6:42 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>
> >> On 8/31/22 16:59, Han Zhou wrote:
> >>> On Wed, Aug 31, 2022 at 1:38 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>>>
> >>>> On 8/31/22 02:17, Han Zhou wrote:
> >>>>> On Tue, Aug 30, 2022 at 8:18 AM Dumitru Ceara <dce...@redhat.com> wrote:
> >>>>>>
> >>>>>> Hi Han,
> >>>>>>
> >>>>>> On 8/24/22 08:40, Han Zhou wrote:
> >>>>>>> The ls_in_pre_stateful priority-120 flow that saves dst IP and port to
> >>>>>>> registers is causing a critical dataplane performance impact to
> >>>>>>> short-lived connections, because it unwildcards megaflows with an exact
> >>>>>>> match on dst IP and L4 ports. Any new connection with a different
> >>>>>>> client-side L4 port will encounter a datapath flow miss and upcall to
> >>>>>>> ovs-vswitchd, which makes typical use cases such as HTTP/1.0-based
> >>>>>>> RESTful API calls suffer big performance degradations.
> >>>>>>>
> >>>>>>> These fields (dst IP and port) were saved to registers to solve a
> >>>>>>> problem in the LB hairpin use case when different VIPs share an
> >>>>>>> overlapping backend+port [0]. The change [0] might not have had as wide
> >>>>>>> a performance impact as it does now, because at that time one of the
> >>>>>>> match conditions, "REGBIT_CONNTRACK_NAT == 1", was set only for
> >>>>>>> established and natted traffic, while now the impact is more obvious
> >>>>>>> because REGBIT_CONNTRACK_NAT is set for all IP traffic (if any VIP is
> >>>>>>> configured on the LS) since commit [1], after several other indirectly
> >>>>>>> related optimizations and refactors.
> >>>>>>>
> >>>>>>> This patch fixes the problem by modifying the priority-120 flows in
> >>>>>>> ls_in_pre_stateful. Instead of blindly saving dst IP and L4 port for any
> >>>>>>> traffic with REGBIT_CONNTRACK_NAT == 1, we now save dst IP and L4 port
> >>>>>>> only for traffic matching the LB VIPs, because these are the ones that
> >>>>>>> need to be saved for the hairpin purpose. The existing priority-110
> >>>>>>> flows will match the rest of the traffic just like before but will not
> >>>>>>> save dst IP and L4 port, so any server->client traffic will not
> >>>>>>> unwildcard megaflows with client-side L4 ports.
> >>>>>>>
> >>>>>>
> >>>>>> Just to be 100% sure, the client->server traffic will still generate up
> >>>>>> to V x H flows where V is "the number of VIP:port tuples" and H is "the
> >>>>>> number of potential (dp_)hash values", right?
> >>>>>
> >>>>> Yes, if all the VIPs and backends are being accessed at the same time.
> >>>>> This is expected. For any given pair of client<->server, regardless of
> >>>>> the connections (source port number changes), the packets should match
> >>>>> the same mega-flow and be handled by fastpath (instead of going to
> >>>>> userspace, which is the behavior before this patch).
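> >>>>>
> >>>>> (To put a hypothetical number on it: with, say, 50 VIP:port tuples and
> >>>>> the dp_hash spread over 64 buckets, that is at most 50 x 64 = 3200
> >>>>> client->server megaflows on a node, regardless of how many client
> >>>>> source ports are in use.)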
> >>>>>
> >>>>
> >>>> Thanks for the confirmation!
> >>>>
> >>>>>>
> >>>>>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with shared backends.")
> >>>>>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer traffic.")
> >>>>>>>
> >>>>>>> Signed-off-by: Han Zhou <hz...@ovn.org>
> >>>>>>> ---
> >>>>>>> v1 -> v2: Add the missing changes for ovn-northd.8.xml which I forgot
> >>>>>>>           to commit before sending v1.
> >>>>>>>
> >>>>>>>  northd/northd.c         | 125 +++++++++++++++++++++++++---------------
> >>>>>>>  northd/ovn-northd.8.xml |  14 ++---
> >>>>>>>  tests/ovn-northd.at     |  87 ++++++++++------------------
> >>>>>>>  tests/ovn.at            |  14 ++---
> >>>>>>>  4 files changed, 118 insertions(+), 122 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/northd/northd.c b/northd/northd.c
> >>>>>>> index 7e2681865..860641936 100644
> >>>>>>> --- a/northd/northd.c
> >>>>>>> +++ b/northd/northd.c
> >>>>>>> @@ -273,15 +273,15 @@ enum ovn_stage {
> >>>>>>>   * |    | REGBIT_ACL_HINT_{ALLOW_NEW/ALLOW/DROP/BLOCK} |   |                  |
> >>>>>>>   * |    |     REGBIT_ACL_LABEL                         | X |                  |
> >>>>>>>   * +----+----------------------------------------------+ X |                  |
> >>>>>>> - * | R1 |         ORIG_DIP_IPV4 (>= IN_STATEFUL)       | R |                  |
> >>>>>>> + * | R1 |         ORIG_DIP_IPV4 (>= IN_PRE_STATEFUL)   | R |                  |
> >>>>>>>   * +----+----------------------------------------------+ E |                  |
> >>>>>>> - * | R2 |         ORIG_TP_DPORT (>= IN_STATEFUL)       | G |                  |
> >>>>>>> + * | R2 |         ORIG_TP_DPORT (>= IN_PRE_STATEFUL)   | G |                  |
> >>>>>>>   * +----+----------------------------------------------+ 0 |                  |
> >>>>>>>   * | R3 |                  ACL LABEL                   |   |                  |
> >>>>>>>   * +----+----------------------------------------------+---+------------------+
> >>>>>>>   * | R4 |                   UNUSED                     |   |                  |
> >>>>>>> - * +----+----------------------------------------------+ X |   ORIG_DIP_IPV6  |
> >>>>>>> - * | R5 |                   UNUSED                     | X | (>= IN_STATEFUL) |
> >>>>>>> + * +----+----------------------------------------------+ X | ORIG_DIP_IPV6(>= |
> >>>>>>> + * | R5 |                   UNUSED                     | X | IN_PRE_STATEFUL) |
> >>>>>>>   * +----+----------------------------------------------+ R |                  |
> >>>>>>>   * | R6 |                   UNUSED                     | E |                  |
> >>>>>>>   * +----+----------------------------------------------+ G |                  |
> >>>>>>> @@ -5899,43 +5899,17 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 0, "1", "next;");
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 0, "1", "next;");
> >>>>>>>
> >>>>>>> -    const char *ct_lb_action = features->ct_no_masked_label
> >>>>>>> -                               ? "ct_lb_mark"
> >>>>>>> -                               : "ct_lb";
> >>>>>>> -    const char *lb_protocols[] = {"tcp", "udp", "sctp"};
> >>>>>>> -    struct ds actions = DS_EMPTY_INITIALIZER;
> >>>>>>> -    struct ds match = DS_EMPTY_INITIALIZER;
> >>>>>>> -
> >>>>>>> -    for (size_t i = 0; i < ARRAY_SIZE(lb_protocols); i++) {
> >>>>>>> -        ds_clear(&match);
> >>>>>>> -        ds_clear(&actions);
> >>>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip4 && %s",
> >>>>>>> -                      lb_protocols[i]);
> >>>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV4 " = ip4.dst; "
> >>>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>>>>>> -                      lb_protocols[i], ct_lb_action);
> >>>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>>>>>> -
> >>>>>>> -        ds_clear(&match);
> >>>>>>> -        ds_clear(&actions);
> >>>>>>> -        ds_put_format(&match, REGBIT_CONNTRACK_NAT" == 1 && ip6 && %s",
> >>>>>>> -                      lb_protocols[i]);
> >>>>>>> -        ds_put_format(&actions, REG_ORIG_DIP_IPV6 " = ip6.dst; "
> >>>>>>> -                                REG_ORIG_TP_DPORT " = %s.dst; %s;",
> >>>>>>> -                      lb_protocols[i], ct_lb_action);
> >>>>>>> -        ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>>>> -                      ds_cstr(&match), ds_cstr(&actions));
> >>>>>>> -    }
> >>>>>>> +    /* Note: priority-120 flows are added in build_lb_rules_pre_stateful(). */
> >>>>>>>
> >>>>>>> -    ds_clear(&actions);
> >>>>>>> -    ds_put_format(&actions, "%s;", ct_lb_action);
> >>>>>>> +    const char *ct_lb_action = features->ct_no_masked_label
> >>>>>>> +                               ? "ct_lb_mark;"
> >>>>>>> +                               : "ct_lb;";
> >>>>>>>
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_IN_PRE_STATEFUL, 110,
> >>>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>>>>>
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 110,
> >>>>>>> -                  REGBIT_CONNTRACK_NAT" == 1", ds_cstr(&actions));
> >>>>>>> +                  REGBIT_CONNTRACK_NAT" == 1", ct_lb_action);
> >>>>>>>
> >>>>>>>      /* If REGBIT_CONNTRACK_DEFRAG is set as 1, then the packets should be
> >>>>>>>       * sent to conntrack for tracking and defragmentation. */
> >>>>>>> @@ -5945,8 +5919,6 @@ build_pre_stateful(struct ovn_datapath *od,
> >>>>>>>      ovn_lflow_add(lflows, od, S_SWITCH_OUT_PRE_STATEFUL, 100,
> >>>>>>>                    REGBIT_CONNTRACK_DEFRAG" == 1", "ct_next;");
> >>>>>>>
> >>>>>>> -    ds_destroy(&actions);
> >>>>>>> -    ds_destroy(&match);
> >>>>>>>  }
> >>>>>>>
> >>>>>>>  static void
> >>>>>>> @@ -6841,22 +6813,16 @@ build_qos(struct ovn_datapath *od, struct hmap *lflows) {
> >>>>>>>  }
> >>>>>>>
> >>>>>>>  static void
> >>>>>>> -build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> >>>>>>> -               struct ds *match, struct ds *action,
> >>>>>>> -               const struct shash *meter_groups)
> >>>>>>> +build_lb_rules_pre_stateful(struct hmap *lflows, struct ovn_northd_lb *lb,
> >>>>>>> +                            bool ct_lb_mark, struct ds *match,
> >>>>>>> +                            struct ds *action)
> >>>>>>>  {
> >>>>>>>      for (size_t i = 0; i < lb->n_vips; i++) {
> >>>>>>>          struct ovn_lb_vip *lb_vip = &lb->vips[i];
> >>>>>>> -        struct ovn_northd_lb_vip *lb_vip_nb = &lb->vips_nb[i];
> >>>>>>> -        const char *ip_match = NULL;
> >>>>>>> -
> >>>>>>>          ds_clear(action);
> >>>>>>>          ds_clear(match);
> >>>>>>> +        const char *ip_match = NULL;
> >>>>>>>
> >>>>>>> -        /* Make sure that we clear the REGBIT_CONNTRACK_COMMIT flag. Otherwise
> >>>>>>> -         * the load balanced packet will be committed again in
> >>>>>>> -         * S_SWITCH_IN_STATEFUL. */
> >>>>>>> -        ds_put_format(action, REGBIT_CONNTRACK_COMMIT" = 0; ");
> >>>>>>>          /* Store the original destination IP to be used when generating
> >>>>>>>           * hairpin flows.
> >>>>>>>           */
> >>>>>>> @@ -6887,6 +6853,67 @@ build_lb_rules(struct hmap *lflows, struct ovn_northd_lb *lb, bool ct_lb_mark,
> >>>>>>>              ds_put_format(action, REG_ORIG_TP_DPORT " = %"PRIu16"; ",
> >>>>>>>                            lb_vip->vip_port);
> >>>>>>>          }
> >>>>>>> +        ds_put_format(action, "%s;", ct_lb_mark ? "ct_lb_mark" : "ct_lb");
> >>>>>>> +
> >>>>>>> +        ds_put_format(match, "%s.dst == %s", ip_match, lb_vip->vip_str);
> >>>>>>> +        if (lb_vip->vip_port) {
> >>>>>>> +            ds_put_format(match, " && %s.dst == %d", proto, lb_vip->vip_port);
> >>>>>>> +        }
> >>>>>>> +
> >>>>>>> +        struct ovn_lflow *lflow_ref = NULL;
> >>>>>>> +        uint32_t hash = ovn_logical_flow_hash(
> >>>>>>> +                ovn_stage_get_table(S_SWITCH_IN_PRE_STATEFUL),
> >>>>>>> +                ovn_stage_get_pipeline(S_SWITCH_IN_PRE_STATEFUL), 120,
> >>>>>>> +                ds_cstr(match), ds_cstr(action));
> >>>>>>> +
> >>>>>>> +        for (size_t j = 0; j < lb->n_nb_ls; j++) {
> >>>>>>> +            struct ovn_datapath *od = lb->nb_ls[j];
> >>>>>>> +
> >>>>>>> +            if (!ovn_dp_group_add_with_reference(lflow_ref, od)) {
> >>>>>>> +                lflow_ref = ovn_lflow_add_at_with_hash(
> >>>>>>> +                        lflows, od, S_SWITCH_IN_PRE_STATEFUL, 120,
> >>>>>>> +                        ds_cstr(match), ds_cstr(action),
> >>>>>>> +                        NULL, NULL, &lb->nlb->header_,
> >>>>>>> +                        OVS_SOURCE_LOCATOR, hash);
> >>>>>>> +            }
> >>>>>>> +        }
> >>>>>>> +    }
> >>>>>>
> >>>>>> I see how this can fix the dataplane issue because we're limiting the
> >>>>>> number of dp flows but this now induces a measurable penalty on the
> >>>>>> control plane side because we essentially double the number of logical
> >>>>>> flows that ovn-northd generates for load balancer VIPs
> >>>>>>
> >>>>>> Using a NB database [2] from a 250 node density heavy test [3] we ran
> >>>>>> in-house I see the following:
> >>>>>>
> >>>>>> a. before this patch (main):
> >>>>>> - number of SB lflows:         146233
> >>>>>> - SB size on disk (compacted): 70MB
> >>>>>> - northd poll loop intervals:  8525ms
> >>>>>>
> >>>>>> b. with this patch:
> >>>>>> - number of SB lflows:         163303
> >>>>>> - SB size on disk (compacted): 76MB
> >>>>>> - northd poll loop intervals:  9958ms
> >>>>>>
> >>>>>> [2] https://people.redhat.com/~dceara/250-density-heavy/20220830/ovnnb_db.db.gz
> >>>>>> [3] https://github.com/dceara/ovn-heater/blob/main/test-scenarios/ocp-250-density-heavy.yml
> >>>>>>
> >>>>>> This raises some concerns.  However, I went ahead and also ran a test
> >>>>>> with Ilya's load balancer optimization series [4] (up for review) and I
> >>>>>> got:
> >>>>>>
> >>>>>> c. with this patch + Ilya's series:
> >>>>>> - number of SB lflows:         163303  << Didn't change compared to "b"
> >>>>>> - SB size on disk (compacted): 76MB    << Didn't change compared to "b"
> >>>>>> - northd poll loop intervals:  6437ms  << 25% better than "a"
> >>>>>>
> >>>>>> Then I ran a test with main + Ilya's series.
> >>>>>>
> >>>>>> d. main + Ilya's series:
> >>>>>> - number of SB lflows:         146233  << Didn't change compared to "a"
> >>>>>> - SB size on disk (compacted): 70MB    << Didn't change compared to "a"
> >>>>>> - northd poll loop intervals:  5413ms  << 35% better than "a"
> >>>>>>
> >>>>>> [4] https://patchwork.ozlabs.org/project/ovn/list/?series=315213&state=*
> >>>>>>
> >>>>>> I guess what I'm trying to say is that if we go ahead with the approach
> >>>>>> you proposed, and especially if we want to backport it to the LTS, we
> >>>>>> should consider accepting/backporting other optimizations too (such as
> >>>>>> Ilya's) that would compensate (and more) for the control plane
> >>>>>> performance hit.
> >>>>>>
> >>>>> Thanks for the scale test!
> >>>>> I do expect some cost in the control plane, and your test result is
> >>>>> aligned with my expectation. It is unpleasant to see a ~10% SB size
> >>>>> increase (in the density-heavy and LB/service-heavy test scenario), but
> >>>>> if it is necessary to fix the dataplane performance bug I think it is
> >>>>> worth it. After all, everything the control plane does is to serve the
> >>>>> dataplane, right? The resulting dataplane performance boost is at least
> >>>>> 10x for short-lived connection scenarios.
> >>>>> It's great to see that Ilya's patches help the LB scale, and I have no
> >>>>> objection to backporting them if that is considered necessary for stable
> >>>>> branches, but I'd consider that independent of the current patch.
> >>>>>
> >>>>
> >>>> They are independent.  But the problem is your fix will visibly impact
> >>>> control plane latency (~16% on my setup).  So, if we can avoid some of
> >>>> that hit by considering both patchsets together, I'd vote for that.
> >>>>
> >>> Again, no objection to backporting Ilya's patches, but these are the
> >>> criteria in my mind:
> >>>
> >>> 1. This patch improves dataplane performance (to me it is in fact
> >>> fixing a performance bug) at the cost of the control plane. This is not
> >>> uncommon in the OVN project history - when adding some new features
> >>> sometimes it does add control plane cost. When this happens, whether to
> >>> accept the new feature is based on the value of the feature vs. the cost
> >>> of the control plane - a tradeoff has to be made in many cases. In this
> >>> case, for a 16% control plane latency increase we get a 10x dataplane
> >>> speed up, so I would vote for it.
> >>
> >> Hi Han,
> >>
> >> I'm not arguing that we don't get dataplane speed up.  I'm sure we will
> >> and it should be a decent amount.  But I've been trying to replicate
> >> this 10x boost with an ovn-kubernetes kind deployment and I didn't
> >> really manage to.
> >>
> >> I did this on a 28 core Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz in our
> >> lab.
> >>
> >> a. build an ovnkube image (with and without your patch); essentially
> >> what our ovn-kubernetes CI jobs do:
> >>
> >> https://github.com/ovn-org/ovn/blob/main/.github/workflows/ovn-kubernetes.yml#L39
> >>
> >> b. start a kind cluster with the image we built at step "a".
> >>
> >> c. edit the ovs-node daemonset definition and increase the resource
> >> limits to 16 CPUs and 2GB RSS (the kind defaults are very low: 0.2CPU
> >> and 300MB).  This restarts the ovs-daemon pods, wait for those to come
> >> back up.
> >>
> >> d. start an nginx deployment with 50 replicas: this is essentially 50
> >> backends accepting http requests on port 80.
> >>
> >> e. create a service load balancing http requests received on a port,
> >> e.g. 42424, to port 80 on the pods in the nginx deployment above.
> >>
> >> f. create a clients deployment with 100 replicas: 100 pods running the
> >> following bash script, a hack to simulate 40 new connections with
> >> different ephemeral source ports every 2 seconds:
> >>
> >> i=0
> >> while true; do
> >>   i=$((i+1))
> >>   [ $((i%40)) -eq 0 ] && sleep 2
> >>   nc -z -v $SERVICE_VIP 42424 &
> >> done
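> >>
> >> (For reference, step e above is roughly equivalent to something like
> >> "kubectl expose deployment nginx --name=nginx-svc --port=42424 --target-port=80",
> >> with hypothetical deployment/service names, and each client pod just runs
> >> the loop above against $SERVICE_VIP:42424.)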
> >>
> >> Comparing the runs with and without your patch applied I see improvement:
> >>
> >> 1. without this patch applied:
> >> - 17K, 18K, 27K datapath flows on the 3 nodes of the cluster
> >> - OVS cpu usage on the 3 nodes: 30-50%
> >>
> >> 2. with this patch applied:
> >> - 9K, 9.5K, 9.K datapath flows on the 3 nodes of the cluster
> >> - OVS cpu usage on the 3 nodes: 10-20%
> >>
> >> In both cases the whole system is not overloaded; OVS doesn't reach its
> >> CPU limits.
> >>
> >> I see clear improvement but I was expecting to see more than this.  It
> >> does look like my test doesn't generate enough unique sources per second,
> >> so I was wondering if you could share your test methodology?
> >>
> >
> > Thanks Dumitru for the performance test. I guess if you reduce the number
> > of clients while increasing the connection speed per-client you would see
> > more differences. And I am not sure why you saw 10-20% OVS cpu usage with
> > the patch applied. If all the 100 clients keep sending packets you should
> > have a constant 100xN (N=2 or 3) megaflows instead of 9K (unless there are
> > other problems causing flow misses). BTW, I assume you don't have
> > HW-offload enabled, right? Could you also try sending to a backend directly
> > instead of the VIP and see the difference?
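> >
> > Something like this (just a sketch, with a hypothetical $BACKEND_IP) should
> > keep the megaflow count small and constant while driving a much higher
> > new-connection rate from a single client:
> >
> > while true; do
> >     nc -z -w 1 $BACKEND_IP 80    # or "$SERVICE_VIP 42424" to go through the LB
> > done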
>
> No HWOL enabled on my platform.  I think my setup was still too
> overloaded (too many containers and virtual nodes on the same physical
> machine).
>
> >
> > Here is a brief test I had, with the below patch based on the system test
> > case "load-balancing". It is basically a single client sending requests to
> > one of the LB backends using the "ab - apache benchmark" test.
> > -------------------------------------------------------------------------
> > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> > index 992813614..3adc72da1 100644
> > --- a/tests/system-ovn.at
> > +++ b/tests/system-ovn.at
> > @@ -1430,10 +1430,12 @@ ovn-nbctl lsp-add bar bar3 \
> >  # Config OVN load-balancer with a VIP.
> >  ovn-nbctl lb-add lb1 30.0.0.1 "172.16.1.2,172.16.1.3,172.16.1.4"
> >  ovn-nbctl ls-lb-add foo lb1
> > +ovn-nbctl ls-lb-add bar lb1
> >
> >  # Create another load-balancer with another VIP.
> >  lb2_uuid=`ovn-nbctl create load_balancer name=lb2 vips:30.0.0.3="172.16.1.2,172.16.1.3,172.16.1.4"`
> >  ovn-nbctl ls-lb-add foo lb2
> > +ovn-nbctl ls-lb-add bar lb2
> >
> >  # Config OVN load-balancer with another VIP (this time with ports).
> >  ovn-nbctl set load_balancer $lb2_uuid vips:'"30.0.0.2:8000"'='"172.16.1.2:80,172.16.1.3:80,172.16.1.4:80"'
> > @@ -1503,6 +1505,10 @@ tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(s
> >  tcp,orig=(src=192.168.1.2,dst=30.0.0.3,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
> >  ])
> >
> > +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4 && ip4.dst == {172.16.1.2,172.16.1.3,172.16.1.4} && tcp.dst == 80" allow-related
> > +#check ovn-nbctl --apply-after-lb acl-add foo from-lport 1002 "ip4" allow-related
> > +#check ovn-nbctl --wait=hv sync
> > +
> >  dnl Test load-balancing that includes L4 ports in NAT.
> >  dnl Each server should have at least one connection.
> >  OVS_WAIT_FOR_OUTPUT([
> > @@ -1516,6 +1522,15 @@ tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(s
> >  tcp,orig=(src=192.168.1.2,dst=30.0.0.2,sport=<cleared>,dport=<cleared>),reply=(src=172.16.1.4,dst=192.168.1.2,sport=<cleared>,dport=<cleared>),zone=<cleared>,mark=2,protoinfo=(state=<cleared>)
> >  ])
> >
> > +ip netns exec foo1 ab -n 10000 -c 1 -q 172.16.1.2:80/
> > +
> >  # Configure selection_fields.
> >  ovn-nbctl set load_balancer $lb2_uuid selection_fields="ip_src,ip_dst,tp_src,tp_dst"
> >  OVS_WAIT_UNTIL([
> > -----------------------------------------------------------------------
> >
> > Tests on 24 cores x Intel(R) Core(TM) i9-7920X CPU @ 2.90GHz
> > The test results are:
> >
> > // Before the change
> > Concurrency Level:      1
> > Time taken for tests:   12.913 seconds
> > Complete requests:      10000
> > Failed requests:        0
> > Total transferred:      21450000 bytes
> > HTML transferred:       19590000 bytes
> > Requests per second:    774.41 [#/sec] (mean)
> > Time per request:       1.291 [ms] (mean)
> > Time per request:       1.291 [ms] (mean, across all concurrent requests)
> > Transfer rate:          1622.18 [Kbytes/sec] received
> >
> > OVS CPU: ~280%
> >
> > // After the change
> > Concurrency Level:      1
> > Time taken for tests:   5.629 seconds
> > Complete requests:      10000
> > Failed requests:        0
> > Total transferred:      21450000 bytes
> > HTML transferred:       19590000 bytes
> > Requests per second:    1776.62 [#/sec] (mean)
> > Time per request:       0.563 [ms] (mean)
> > Time per request:       0.563 [ms] (mean, across all concurrent requests)
> > Transfer rate:          3721.54 [Kbytes/sec] received
> >
> > OVS CPU: ~0%
> >
> > We can see that the RPS is 2.5x higher (1776 vs. 774) even with a single
> > thread (-c 1). It is not the 10x I guessed earlier, but keep in mind that
> > the "ab" test does a complete TCP handshake, then request/response, and
> > finally a tear-down, usually 6 packets per connection, so at least half of
> > the packets were still going through the fast path even before the change.
> > But for the "first" packet of each connection, the speed up should be at
> > the 10x level, which we can easily see with a ping test by comparing the
> > RTT of the first packet, which goes through the slow path, against the
> > later packets, which go through the fast path only.
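> >
> > For illustration (a rough sketch, reusing the namespaces from the test
> > above): the first echo request triggers the upcall/slow path while the
> > following ones hit the installed megaflow, so the first RTT stands out:
> >
> > ip netns exec foo1 ping -c 4 172.16.1.2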
>
> I changed the test, on an ovn-kubernetes kind cluster, to netperf
> TCP_CRR (connection + request + response):
>
> https://github.com/jtaleric/k8s-netperf/pull/8
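>
> Roughly, each client runs something like:
>
>   netperf -H $SERVER_IP -t TCP_CRR -l 60
>
> i.e. every transaction opens a new connection, does one request/response,
> and closes it, so each transaction exercises the "first packet" path.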
>
> >
> > And the CPU utilization for OVS is huge before the change, while after the
> > change it is 0. So if we combine more complex scenarios with the system
> > being overloaded, I believe the difference would be even bigger.
> >
>
> I see a big difference in CPU usage too.  It goes up to ~90% CPU (almost
> one whole CPU) just for processing CRR between a client and a server.
> While with your patch applied the CPU usage for OVS is negligible.
>
> The number of DP flows also drops drastically, as expected.
>
> Comparing the number of "transactions" (connect + request + response)
> between a single pair of client + server I see on my machine:
>
> a. without your patch: ~2K TPS
> b. with your patch: ~6K TPS
>
> Probably a coincidence because our platforms and tests are very
> different but I also see 2-3x boost in transactions per second.
>
> >> For the OVN code changes, they look good to me.  I think we should apply
> >> them to the main branch.  For that:
> >>
> >> Acked-by: Dumitru Ceara <dce...@redhat.com>
> >>
> >> For backporting the change to stable branches I'd really like to wait
> >> until we first understand the exact scenario in which the megaflow
> >> unwildcarding created such a big performance regression in ovn-kubernetes.
> >
> > Thank you for all the reviews and tests. I applied it to main and am
> > waiting for your confirmation for backporting.
> > For branch-22.09, I think maybe we should backport it now, before it is
> > released?
> >
>
> In light of your results and based on the findings above I think we
> should accept the backport even if it might cause a small control plane
> degradation.  This is also keeping in mind Ilya's northd patches, which I
> think are a good candidate for stable branches too.  But I'll let Mark
> and Numan, from their maintainer perspective, share their thoughts on
> this matter.
>
> Regards,
> Dumitru

Thanks Dumitru for confirming. In today's OVN meeting I got confirmation
from Mark that he is in favor of backporting it as well, so I just
backported it down to branch-22.03.

Han

>
> > Han
> >
> >>
> >> Thanks,
> >> Dumitru
> >>
> >>>
> >>> 2. For backporting Ilya's optimization, I'd evaluate it using the criteria
> >>> for any optimization backport. Say the patch gives a 10-20% speed up, we
> >>> think that speed up is critical for a branch for some reason, and the
> >>> changes are well contained - then I'd vote for the backport.
> >>>
> >>> Does this make sense?
> >>> @Numan Siddique <num...@ovn.org> @Mark Michelson <mmich...@redhat.com> any
> >>> comments?
> >>>
> >>> Thanks,
> >>> Han
> >>>
> >>>> Given that Ilya's changes seem very contained I don't think this should
> >>>> be a problem.  They just need a proper review first.
> >>>>
> >>>>>> I tried to think of an alternative that wouldn't require doubling the
> >>>>>> number of VIP logical flows but couldn't come up with one.  Maybe you
> >>>>>> have more ideas?
> >>>>>>
> >>>>> This is the approach with the least impact and fewest changes to existing
> >>>>> pipelines that I have thought of so far. The earlier attempt wouldn't
> >>>>> increase the number of flows, but it had to compromise by either failing
> >>>>> to support a corner case (that we are still not sure is really useful at
> >>>>> all) or losing HW offloading capability. I'll still think about more
> >>>>> approaches, but the NAT/LB/ACL pipelines are very complex now and I'd
> >>>>> avoid big changes in case any corner case is broken. I'd spend more time
> >>>>> sorting out different use cases and the documentation before making more
> >>>>> optimizations, but I do think we should fix this performance bug as soon
> >>>>> as we can. (I am also curious why we haven't heard other OVN users
> >>>>> complain about this yet ...)
> >>>>
> >>>> It's a guess, but we probably didn't notice this issue because it should
> >>>> be more visible when there are lots of short-lived connections.
> >>>>
> >>>> Thanks,
> >>>> Dumitru
> >>>>
> >>>
> >>
> >
>