On Tue, Jul 28, 2020 at 8:38 PM Mark Michelson <mmich...@redhat.com> wrote:

> On 7/28/20 9:23 AM, Numan Siddique wrote:
> >
> >
> > On Tue, Jul 28, 2020 at 2:51 AM Mark Michelson <mmich...@redhat.com> wrote:
> >
> >     When traffic arrives over an ECMP route, there is no guarantee that the
> >     reply traffic will egress over the same route. Sometimes, the nature of
> >     the traffic (or the intervening equipment) means that it is important
> >     for reply traffic to go out the same route it came in.
> >
> >     This commit introduces optional ECMP symmetric reply behavior. If
> >     configured, then traffic to or from the ECMP route will be sent to
> >     conntrack. New incoming traffic over the route will have the source MAC
> >     address and incoming port saved in the ct_label. Reply traffic then uses
> >     this saved information to send the packet back out the same way it came
> >     in.
> >
> >     To facilitate this, a new table was added to the ingress logical router
> >     pipeline. The ECMP_STATEFUL table is responsible for committing to
> >     conntrack and setting the ct_label when it detects new incoming traffic
> >     from the route.
> >
> >     Since ingress pipeline logic on the logical router depends on ct state
> >     of a particular hypervisor, this feature is only usable on gateway
> >     routers.
> >
> >     Signed-off-by: Mark Michelson <mmich...@redhat.com>
> >     Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1849683
> >
> >
> > Hi Mark,
> >
> > Thanks for the new version. The first 4 patches in the series LGTM.
> >
> > I have a few comments on this patch:
> >
> > 1. This patch series needs a rebase, as it does not apply cleanly on top
> > of master.
>
> OK thanks, I'll get this fixed.
>
> >
> > 2. I think we should not exclude logical routers with distributed
> >     gateway ports from this feature.
> >     For a logical router with a gw port, I think you can add the same
> >     ecmp symmetric flows but with one extra match -
> >     "inport == cr-<gw-port> && ...."
> >     We do the same in many parts of the code. OpenStack may use this
> >     feature, and OpenStack Neutron doesn't use gateway routers.
>
> I think I need a bit of education here about how this can work. Let me
> explain how I'm viewing this and you can explain if my thinking is wrong.
>
> Let's say that you have a router ro-1 with ECMP routes. You've set this
> up to be a distributed router with a gateway port, and the gateway port
> is bound to chassis-1 (either via ha_chassis_group, gateway_chassis, or
> options:redirect_chassis). You have switch ls-1 connected to ro-1. VMs
> connected to ls-1 are distributed across multiple chassis.
>
> Traffic originates from outside of the logical network and comes in the
> gateway port of ro-1 on chassis-1. The ingress pipeline runs, and
> conntrack saves the source ethernet address and port so we can send
> return traffic out the same port. Next, the egress pipeline of ro-1 runs
> on chassis-1. Then the ingress pipeline of ls-1 runs on chassis-1.
> During this, it is determined that the destination output port is on
> chassis-2. So the packet is tunneled to chassis-2. There, the egress
> pipeline of ls-1 runs and the packet is output to the VM. All is fine at
> this point.
>
> Now, the VM sends reply traffic. The ingress pipeline of ls-1 is run,
> and the output port is the port linking ls-1 to ro-1. Now here's where
> things get a bit hazy for me. Since ro-1 is a distributed router, the
> port binding type for ls-1's port to ro-1 is a "patch" port. So the
> egress pipeline of ls-1 is run on chassis-2. Then the ingress pipeline
> of ro-1 will also run on chassis-2. This is a problem, because the
> conntrack entries for symmetric ECMP reply are on chassis-1. It's not
> until the ingress pipeline of ro-1 is completed that the packet is
> tunneled to chassis-1. Then on chassis-1 the egress pipeline of ro-1
> will run.
>
> When you use a gateway router, the port binding type for ls-1's port to
> ro-1 is "l3gateway". This means that the packet would get tunneled to
> chassis-1 before running the egress pipeline for ls-1. Then, the ingress
> pipeline of ro-1 runs on chassis-1 so everything works.
>
> Have I misunderstood how this works?
>


I made a mistake earlier -- I meant to say "is_chassis_resident("cr-...")".

I think you're right. I need to take a closer look at how that can be
supported, if it can be supported at all.

We can explore this later if there is a requirement to support this feature
on gw router ports.
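
To make the chassis-placement argument above concrete, here is a minimal
sketch of it. This is not OVN code: the function names are hypothetical, and
the model only encodes the two cases described in the quoted text (a "patch"
binding runs the router ingress pipeline on the VM's chassis, an "l3gateway"
binding tunnels to the gateway chassis first).

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical model, not OVN code: given how ls-1's port to ro-1 is
 * bound, return the chassis on which ro-1's ingress pipeline runs for
 * reply traffic sent by a VM. */
const char *
reply_ingress_chassis(const char *binding_type, const char *vm_chassis,
                      const char *gw_chassis)
{
    /* "l3gateway": the packet is tunneled to the gateway chassis before
     * the router ingress pipeline runs. */
    if (!strcmp(binding_type, "l3gateway")) {
        return gw_chassis;
    }
    /* "patch": the router ingress pipeline runs where the VM lives. */
    return vm_chassis;
}

/* The symmetric reply lookup can only succeed if router ingress runs on
 * the chassis that holds the conntrack entry, i.e. the gateway chassis. */
bool
symmetric_reply_sees_ct_entry(const char *binding_type,
                              const char *vm_chassis,
                              const char *gw_chassis)
{
    const char *where = reply_ingress_chassis(binding_type, vm_chassis,
                                              gw_chassis);
    return !strcmp(where, gw_chassis);
}
```

With the VM on chassis-2 and the gateway on chassis-1, the "patch" case
misses the conntrack entry and the "l3gateway" case finds it, which matches
the reasoning in the quoted paragraphs.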

Thanks
Numan


>
> Assuming I haven't...
>
> If the logic for using ECMP symmetric reply on the return traffic could
> be moved to the egress router pipeline, then I understand how it would
> work with a distributed router with gateway port. But I don't see how
> you can do that since the ECMP symmetric reply needs to choose the
> output port. By definition that has to be done in the ingress pipeline.
>
> I guess one option would be to limit the use of ECMP symmetric reply
> traffic to only the gateway port on a distributed router. In this case,
> there would be no need to save the input port in conntrack since there's
> only one possibility. Instead, we would only need to save the nexthop
> MAC address. This way, in the egress pipeline we could override the
> initial ECMP route selection by changing eth.dst.
>
> >
> > 3. In my testing with the logical resources created from system-ovn.at,
> >    I noticed the following:
> >    - The traffic initiated from bob1 to alice1 works as expected. The
> >      newly added logical flows get hit and the ct_label is set as
> >      expected.
> >
> >    - The problem is in the traffic initiated by alice1. For the first
> >      packet from alice1, the select action is executed to choose one
> >      ecmp route (which is expected) and this packet is not committed
> >      to conntrack. For the reply traffic from bob1, the packet gets
> >      committed because of this flow:
> >        table=7 (lr_in_ecmp_stateful), priority=100  , match=(inport ==
> >        "R1_ext" && ip4.dst == 10.0.0.0/24 && (ct.new && !ct.est)),
> >        action=(ct_commit { ct_label.ecmp_reply_eth = eth.src;
> >        ct_label.ecmp_reply_port = 2;}; next;)
> >    - Basically the reverse traffic is treated as new traffic. And from
> >      here on, the packet from alice1 is considered as reply traffic:
> >        table=10(lr_in_ip_routing   ), priority=100  , match=(ct.rpl &&
> >        ct_label.ecmp_reply_port == 2 && ip4.src == 10.0.0.0/24),
> >        action=(ip.ttl--; flags.loopback = 1; eth.src =
> >        00:00:04:01:02:03; reg1 = 20.0.0.1; outport = "R1_ext"; next;)
> >    - I'm not really sure if it's a problem or not. Maybe it's fine. But
> >      is it as expected? I personally don't see any harm with this.
> >
> >    - But I would like to know your comments and maybe Han has some
> >      comments.
>
> Hm, this is a bit hard to fix.
>
> If you don't turn on symmetric replies, then traffic that originates
> from Alice for a connection *should* choose the same outgoing route
> every time since the 4-tuple will be the same throughout the life of the
> connection.
>
> If you turn on symmetric replies, then you still get the same behavior,
> but you're adding in extra conntrack use.
>
> So how do you detect that the traffic coming from Bob to Alice is in
> reply to Alice's traffic and avoid sending it to conntrack? You have to
> use conntrack to detect the direction, right? So in order to avoid using
> conntrack, we have to use conntrack...
>

Agree. I'm fine with the observed behaviour with your patches.
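
For anyone following along, the behaviour from point 3 can be reproduced
with a toy single-connection tracker. This is only a sketch under
simplifying assumptions: every name below is hypothetical, one connection is
tracked, and the states are mutually exclusive (real conntrack sets ct.est
together with ct.rpl).

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy tracker for a single connection; not OVS/OVN code. */
struct toy_conn {
    bool committed;
    uint32_t orig_src;   /* Source IP of the packet that was committed. */
    uint32_t orig_dst;   /* Destination IP of that packet. */
};

enum toy_ct_state { TOY_CT_NEW, TOY_CT_EST, TOY_CT_RPL };

/* Classify a packet against the tracked connection. */
enum toy_ct_state
toy_ct_lookup(const struct toy_conn *c, uint32_t src, uint32_t dst)
{
    if (!c->committed) {
        return TOY_CT_NEW;
    }
    if (src == c->orig_src && dst == c->orig_dst) {
        return TOY_CT_EST;      /* Same direction as the committed packet. */
    }
    if (src == c->orig_dst && dst == c->orig_src) {
        return TOY_CT_RPL;      /* Opposite direction. */
    }
    return TOY_CT_NEW;
}

/* Commit the connection, recording the direction of this packet as the
 * original direction. */
void
toy_ct_commit(struct toy_conn *c, uint32_t src, uint32_t dst)
{
    c->committed = true;
    c->orig_src = src;
    c->orig_dst = dst;
}
```

If alice1's first packet is looked up but never committed, bob1's answer is
still classified as new. Committing on bob1's packet then makes every later
packet from alice1 look like a reply, which is the observed behaviour.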



>
> >
> >   4. The test case "3: ovn -- conntrack fields" is failing with this
> > patch. It's a small thing that you forgot to change, I suppose.
>
> I actually had fixed this locally but then I guess I accidentally
> overwrote the changes and pushed an unfixed version. Sorry about that.
>
> >
> >   5. Since you are adding a new column in Logical_Router_Static_Route, I
> > think the schema version needs to be updated to "5.25.0".
>
> Will do.
>
>
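
As a reference for the subfields this patch adds, here is a small sketch of
how a MAC address and a port key would sit in a 128-bit label at
ct_label[32..79] and ct_label[80..95], the ranges given in the
lib/logical-fields.c hunk. The struct and helper names are hypothetical, and
the label is modelled as two 64-bit halves with bit 0 as the least
significant bit. Note that the tests/ovn.at hunk in the same patch lists
[0..47] and [48..63]; the sketch follows the C hunk.

```c
#include <stdint.h>

/* Hypothetical 128-bit label: lo holds bits 0..63, hi holds bits 64..127. */
struct toy_ct_label {
    uint64_t lo;
    uint64_t hi;
};

/* Store a 48-bit MAC at label bits 32..79 and a 16-bit port key at label
 * bits 80..95.  Bits 0..31 are left clear. */
struct toy_ct_label
toy_label_pack(uint64_t eth, uint16_t port)
{
    struct toy_ct_label l = { 0, 0 };

    eth &= UINT64_C(0xffffffffffff);
    l.lo = (eth & UINT64_C(0xffffffff)) << 32;  /* Label bits 32..63. */
    l.hi = (eth >> 32)                          /* Label bits 64..79. */
           | ((uint64_t) port << 16);           /* Label bits 80..95. */
    return l;
}

/* Read back the 48-bit MAC from label bits 32..79. */
uint64_t
toy_label_eth(struct toy_ct_label l)
{
    return (l.lo >> 32) | ((l.hi & UINT64_C(0xffff)) << 32);
}

/* Read back the 16-bit port key from label bits 80..95. */
uint16_t
toy_label_port(struct toy_ct_label l)
{
    return (uint16_t) (l.hi >> 16);
}
```

With this layout the MAC straddles the two halves, and the low bits of the
label, including the existing ct_label.blocked bit at ct_label[0], are not
touched.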

Thanks
Numan


> >
> > Thanks
> > Numan
> >
> >
> >     ---
> >       lib/logical-fields.c      |   4 +
> >       northd/ovn-northd.8.xml   |  49 ++++++++++---
> >       northd/ovn-northd.c       | 123 +++++++++++++++++++++++++++----
> >       ovn-architecture.7.xml    |   7 +-
> >       ovn-nb.ovsschema          |   5 +-
> >       ovn-nb.xml                |  16 ++++
> >       tests/ovn.at              | 151 ++++++++++++++++++++++++++++++++++----
> >       tests/system-ovn.at       | 144 ++++++++++++++++++++++++++++++++++++
> >       utilities/ovn-nbctl.8.xml |  31 ++++++--
> >       utilities/ovn-nbctl.c     |  18 ++++-
> >       10 files changed, 496 insertions(+), 52 deletions(-)
> >
> >     diff --git a/lib/logical-fields.c b/lib/logical-fields.c
> >     index fde53a47e..15342dded 100644
> >     --- a/lib/logical-fields.c
> >     +++ b/lib/logical-fields.c
> >     @@ -130,6 +130,10 @@ ovn_init_symtab(struct shash *symtab)
> >                                        WR_CT_COMMIT);
> >           expr_symtab_add_subfield_scoped(symtab, "ct_label.blocked",
> NULL,
> >                                           "ct_label[0]", WR_CT_COMMIT);
> >     +    expr_symtab_add_subfield_scoped(symtab,
> >     "ct_label.ecmp_reply_eth", NULL,
> >     +                                    "ct_label[32..79]",
> WR_CT_COMMIT);
> >     +    expr_symtab_add_subfield_scoped(symtab,
> >     "ct_label.ecmp_reply_port", NULL,
> >     +                                    "ct_label[80..95]",
> WR_CT_COMMIT);
> >
> >           expr_symtab_add_field(symtab, "ct_state", MFF_CT_STATE, NULL,
> >     false);
> >
> >     diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> >     index eb2514f15..cf251e02a 100644
> >     --- a/northd/ovn-northd.8.xml
> >     +++ b/northd/ovn-northd.8.xml
> >     @@ -2120,15 +2120,31 @@ icmp6 {
> >           <p>
> >             This is to send packets to connection tracker for tracking
> and
> >             defragmentation.  It contains a priority-0 flow that simply
> >     moves traffic
> >     -      to the next table.  If load balancing rules with virtual IP
> >     addresses
> >     -      (and ports) are configured in <code>OVN_Northbound</code>
> >     database for a
> >     -      Gateway router, a priority-100 flow is added for each
> >     configured virtual
> >     -      IP address <var>VIP</var>. For IPv4 <var>VIPs</var> the flow
> >     matches
> >     -      <code>ip &amp;&amp; ip4.dst == <var>VIP</var></code>.  For
> IPv6
> >     -      <var>VIPs</var>, the flow matches <code>ip &amp;&amp; ip6.dst
> ==
> >     -      <var>VIP</var></code>.  The flow uses the action
> >     <code>ct_next;</code>
> >     -      to send IP packets to the connection tracker for packet
> >     de-fragmentation
> >     -      and tracking before sending it to the next table.
> >     +      to the next table.
> >     +    </p>
> >     +
> >     +    <p>
> >     +      If load balancing rules with virtual IP addresses (and ports)
> are
> >     +      configured in <code>OVN_Northbound</code> database for a
> >     Gateway router,
> >     +      a priority-100 flow is added for each configured virtual IP
> >     address
> >     +      <var>VIP</var>. For IPv4 <var>VIPs</var> the flow matches
> >     <code>ip
> >     +      &amp;&amp; ip4.dst == <var>VIP</var></code>.  For IPv6
> >     <var>VIPs</var>,
> >     +      the flow matches <code>ip &amp;&amp; ip6.dst ==
> >     <var>VIP</var></code>.
> >     +      The flow uses the action <code>ct_next;</code> to send IP
> >     packets to the
> >     +      connection tracker for packet de-fragmentation and tracking
> >     before
> >     +      sending it to the next table.
> >     +    </p>
> >     +
> >     +    <p>
> >     +      If ECMP routes with symmetric reply are configured in the
> >     +      <code>OVN_Northbound</code> database for a gateway router, a
> >     priority-100
> >     +      flow is added for each router port on which symmetric replies
> are
> >     +      configured. The matching logic for these ports essentially
> >     reverses the
> >     +      configured logic of the ECMP route. So for instance, a route
> >     with a
> >     +      destination routing policy will instead match if the source
> >     IP address
> >     +      matches the static route's prefix. The flow uses the action
> >     +      <code>ct_next</code> to send IP packets to the connection
> >     tracker for
> >     +      packet de-fragmentation and tracking before sending it to the
> >     next table.
> >           </p>
> >
> >           <h3>Ingress Table 5: UNSNAT</h3>
> >     @@ -2489,7 +2505,15 @@ output;
> >             table.  This table, instead, is responsible for determine
> >     the ECMP
> >             group id and select a member id within the group based on
> >     5-tuple
> >             hashing.  It stores group id in <code>reg8[0..15]</code> and
> >     member id in
> >     -      <code>reg8[16..31]</code>.
> >     +      <code>reg8[16..31]</code>. This step is skipped if the
> >     traffic going
> >     +      out the ECMP route is reply traffic, and the ECMP route was
> >     configured
> >     +      to use symmetric replies. Instead, the stored
> >     <code>ct_label</code> value
> >     +      is used to choose the destination. The least significant 48
> >     bits of the
> >     +      <code>ct_label</code> tell the destination MAC address to
> >     which the
> >     +      packet should be sent. The next 16 bits tell the logical
> >     router port on
> >     +      which the packet should be sent. These values in the
> >     +      <code>ct_label</code> are set when the initial ingress
> traffic is
> >     +      received over the ECMP route.
> >           </p>
> >
> >           <p>
> >     @@ -2639,6 +2663,11 @@ select(reg8[16..31], <var>MID1</var>,
> >     <var>MID2</var>, ...);
> >             address and <code>reg1</code> as the source protocol
> address).
> >           </p>
> >
> >     +    <p>
> >     +      This processing is skipped for reply traffic being sent out
> >     of an ECMP
> >     +      route if the route was configured to use symmetric replies.
> >     +    </p>
> >     +
> >           <p>
> >             This table contains the following logical flows:
> >           </p>
> >     diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> >     index d10e5ee5d..85f04ccde 100644
> >     --- a/northd/ovn-northd.c
> >     +++ b/northd/ovn-northd.c
> >     @@ -172,16 +172,17 @@ enum ovn_stage {
> >           PIPELINE_STAGE(ROUTER, IN,  DEFRAG,          4,
> >     "lr_in_defrag")       \
> >           PIPELINE_STAGE(ROUTER, IN,  UNSNAT,          5,
> >     "lr_in_unsnat")       \
> >           PIPELINE_STAGE(ROUTER, IN,  DNAT,            6, "lr_in_dnat")
> >             \
> >     -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   7,
> >     "lr_in_nd_ra_options") \
> >     -    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  8,
> >     "lr_in_nd_ra_response") \
> >     -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      9,
> >     "lr_in_ip_routing")   \
> >     -    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 10,
> >     "lr_in_ip_routing_ecmp") \
> >     -    PIPELINE_STAGE(ROUTER, IN,  POLICY,          11,
> >     "lr_in_policy")       \
> >     -    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     12,
> >     "lr_in_arp_resolve")  \
> >     -    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  13,
> >     "lr_in_chk_pkt_len")   \
> >     -    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> >       14,"lr_in_larger_pkts")   \
> >     -    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     15,
> >     "lr_in_gw_redirect")  \
> >     -    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     16,
> >     "lr_in_arp_request")  \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ECMP_STATEFUL,   7,
> >     "lr_in_ecmp_stateful") \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_OPTIONS,   8,
> >     "lr_in_nd_ra_options") \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ND_RA_RESPONSE,  9,
> >     "lr_in_nd_ra_response") \
> >     +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING,      10,
> >     "lr_in_ip_routing")   \
> >     +    PIPELINE_STAGE(ROUTER, IN,  IP_ROUTING_ECMP, 11,
> >     "lr_in_ip_routing_ecmp") \
> >     +    PIPELINE_STAGE(ROUTER, IN,  POLICY,          12,
> >     "lr_in_policy")       \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ARP_RESOLVE,     13,
> >     "lr_in_arp_resolve")  \
> >     +    PIPELINE_STAGE(ROUTER, IN,  CHK_PKT_LEN   ,  14,
> >     "lr_in_chk_pkt_len")   \
> >     +    PIPELINE_STAGE(ROUTER, IN,  LARGER_PKTS,
> >       15,"lr_in_larger_pkts")   \
> >     +    PIPELINE_STAGE(ROUTER, IN,  GW_REDIRECT,     16,
> >     "lr_in_gw_redirect")  \
> >     +    PIPELINE_STAGE(ROUTER, IN,  ARP_REQUEST,     17,
> >     "lr_in_arp_request")  \
> >
> >         \
> >           /* Logical router egress stages. */
> >         \
> >           PIPELINE_STAGE(ROUTER, OUT, UNDNAT,    0, "lr_out_undnat")
> >          \
> >     @@ -7312,6 +7313,7 @@ struct parsed_route {
> >           bool is_src_route;
> >           uint32_t hash;
> >           const struct nbrec_logical_router_static_route *route;
> >     +    bool ecmp_symmetric_reply;
> >       };
> >
> >       static uint32_t
> >     @@ -7373,6 +7375,8 @@ parsed_routes_add(struct ovs_list *routes,
> >                                                        "src-ip"));
> >           pr->hash = route_hash(pr);
> >           pr->route = route;
> >     +    pr->ecmp_symmetric_reply = smap_get_bool(&route->options,
> >     +
> >       "ecmp_symmetric_reply", false);
> >           ovs_list_insert(routes, &pr->list_node);
> >           return pr;
> >       }
> >     @@ -7621,18 +7625,95 @@ find_static_route_outport(struct
> >     ovn_datapath *od, struct hmap *ports,
> >           return true;
> >       }
> >
> >     +static void
> >     +add_ecmp_symmetric_reply_flows(struct hmap *lflows,
> >     +                               struct ovn_datapath *od,
> >     +                               const char *port_ip,
> >     +                               struct ovn_port *out_port,
> >     +                               const struct parsed_route *route,
> >     +                               struct ds *route_match)
> >     +{
> >     +    const struct nbrec_logical_router_static_route *st_route =
> >     route->route;
> >     +    struct ds match = DS_EMPTY_INITIALIZER;
> >     +    struct ds actions = DS_EMPTY_INITIALIZER;
> >     +    struct ds ecmp_reply = DS_EMPTY_INITIALIZER;
> >     +    char *cidr = normalize_v46_prefix(&route->prefix, route->plen);
> >     +
> >     +    /* If symmetric ECMP replies are enabled, then packets that
> >     arrive over
> >     +     * an ECMP route need to go through conntrack.
> >     +     */
> >     +    ds_put_format(&match, "inport == %s && ip%s.%s == %s",
> >     +                  out_port->json_key,
> >     +                  route->prefix.family == AF_INET ? "4" : "6",
> >     +                  route->is_src_route ? "dst" : "src",
> >     +                  cidr);
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
> >     +                            ds_cstr(&match), "ct_next;",
> >     +                            &st_route->header_);
> >     +
> >     +    /* And packets that go out over an ECMP route need conntrack */
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_DEFRAG, 100,
> >     +                            ds_cstr(route_match), "ct_next;",
> >     +                            &st_route->header_);
> >     +
> >     +    /* Save src eth and inport in ct_label for packets that arrive
> over
> >     +     * an ECMP route.
> >     +     *
> >     +     * NOTE: we purposely are not clearing match before this
> >     +     * ds_put_cstr() call. The previous contents are needed.
> >     +     */
> >     +    ds_put_cstr(&match, " && (ct.new && !ct.est)");
> >     +
> >     +    ds_put_format(&actions, "ct_commit { ct_label.ecmp_reply_eth =
> >     eth.src;"
> >     +                  " ct_label.ecmp_reply_port = %" PRId64 ";};
> next;",
> >     +                  out_port->sb->tunnel_key);
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ECMP_STATEFUL,
> 100,
> >     +                            ds_cstr(&match), ds_cstr(&actions),
> >     +                            &st_route->header_);
> >     +
> >     +    /* Bypass ECMP selection if we already have ct_label information
> >     +     * for where to route the packet.
> >     +     */
> >     +    ds_put_format(&ecmp_reply, "ct.rpl && ct_label.ecmp_reply_port
> >     == %"
> >     +                  PRId64, out_port->sb->tunnel_key);
> >     +    ds_clear(&match);
> >     +    ds_put_format(&match, "%s && %s", ds_cstr(&ecmp_reply),
> >     +                  ds_cstr(route_match));
> >     +    ds_clear(&actions);
> >     +    ds_put_format(&actions, "ip.ttl--; flags.loopback = 1; "
> >     +                  "eth.src = %s; %sreg1 = %s; outport = %s; next;",
> >     +                  out_port->lrp_networks.ea_s,
> >     +                  route->prefix.family == AF_INET ? "" : "xx",
> >     +                  port_ip, out_port->json_key);
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_IP_ROUTING, 100,
> >     +                           ds_cstr(&match), ds_cstr(&actions),
> >     +                           &st_route->header_);
> >     +
> >     +    /* Egress reply traffic for symmetric ECMP routes skips router
> >     policies. */
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_POLICY, 65535,
> >     +                            ds_cstr(&ecmp_reply), "next;",
> >     +                            &st_route->header_);
> >     +
> >     +    ds_clear(&actions);
> >     +    ds_put_cstr(&actions, "eth.dst = ct_label.ecmp_reply_eth;
> next;");
> >     +    ovn_lflow_add_with_hint(lflows, od, S_ROUTER_IN_ARP_RESOLVE,
> >     +                            200, ds_cstr(&ecmp_reply),
> >     +                            ds_cstr(&actions), &st_route->header_);
> >     +}
> >     +
> >       static void
> >       build_ecmp_route_flow(struct hmap *lflows, struct ovn_datapath *od,
> >                             struct hmap *ports, struct ecmp_groups_node
> *eg)
> >
> >       {
> >           bool is_ipv4 = (eg->prefix.family == AF_INET);
> >     -    struct ds match = DS_EMPTY_INITIALIZER;
> >           uint16_t priority;
> >     +    struct ecmp_route_list_node *er;
> >     +    struct ds route_match = DS_EMPTY_INITIALIZER;
> >
> >           char *prefix_s = build_route_prefix_s(&eg->prefix, eg->plen);
> >           build_route_match(NULL, prefix_s, eg->plen, eg->is_src_route,
> >     is_ipv4,
> >     -                      &match, &priority);
> >     +                      &route_match, &priority);
> >           free(prefix_s);
> >
> >           struct ds actions = DS_EMPTY_INITIALIZER;
> >     @@ -7640,7 +7721,6 @@ build_ecmp_route_flow(struct hmap *lflows,
> >     struct ovn_datapath *od,
> >                         "; %s = select(", REG_ECMP_GROUP_ID, eg->id,
> >                         REG_ECMP_MEMBER_ID);
> >
> >     -    struct ecmp_route_list_node *er;
> >           bool is_first = true;
> >           LIST_FOR_EACH (er, list_node, &eg->route_list) {
> >               if (is_first) {
> >     @@ -7654,11 +7734,12 @@ build_ecmp_route_flow(struct hmap *lflows,
> >     struct ovn_datapath *od,
> >           ds_put_cstr(&actions, ");");
> >
> >           ovn_lflow_add(lflows, od, S_ROUTER_IN_IP_ROUTING, priority,
> >     -                  ds_cstr(&match), ds_cstr(&actions));
> >     +                  ds_cstr(&route_match), ds_cstr(&actions));
> >
> >           /* Add per member flow */
> >     +    struct ds match = DS_EMPTY_INITIALIZER;
> >     +    struct sset visited_ports = SSET_INITIALIZER(&visited_ports);
> >           LIST_FOR_EACH (er, list_node, &eg->route_list) {
> >     -
> >               const struct parsed_route *route_ = er->route;
> >               const struct nbrec_logical_router_static_route *route =
> >     route_->route;
> >               /* Find the outgoing port. */
> >     @@ -7668,6 +7749,15 @@ build_ecmp_route_flow(struct hmap *lflows,
> >     struct ovn_datapath *od,
> >                                              &out_port)) {
> >                   continue;
> >               }
> >     +        /* Symmetric ECMP reply is only usable on gateway routers.
> >     +         * It is NOT usable on distributed routers with a gateway
> port.
> >     +         */
> >     +        if (smap_get(&od->nbr->options, "chassis") &&
> >     +            route_->ecmp_symmetric_reply && sset_add(&visited_ports,
> >     +
>  out_port->key)) {
> >     +            add_ecmp_symmetric_reply_flows(lflows, od, lrp_addr_s,
> >     out_port,
> >     +                                           route_, &route_match);
> >     +        }
> >               ds_clear(&match);
> >               ds_put_format(&match, REG_ECMP_GROUP_ID" == %"PRIu16" && "
> >                             REG_ECMP_MEMBER_ID" == %"PRIu16,
> >     @@ -7688,7 +7778,9 @@ build_ecmp_route_flow(struct hmap *lflows,
> >     struct ovn_datapath *od,
> >                                       ds_cstr(&match), ds_cstr(&actions),
> >                                       &route->header_);
> >           }
> >     +    sset_destroy(&visited_ports);
> >           ds_destroy(&match);
> >     +    ds_destroy(&route_match);
> >           ds_destroy(&actions);
> >       }
> >
> >     @@ -8972,6 +9064,7 @@ build_lrouter_flows(struct hmap *datapaths,
> >     struct hmap *ports,
> >               ovn_lflow_add(lflows, od, S_ROUTER_IN_DNAT, 0, "1",
> "next;");
> >               ovn_lflow_add(lflows, od, S_ROUTER_OUT_UNDNAT, 0, "1",
> >     "next;");
> >               ovn_lflow_add(lflows, od, S_ROUTER_OUT_EGR_LOOP, 0, "1",
> >     "next;");
> >     +        ovn_lflow_add(lflows, od, S_ROUTER_IN_ECMP_STATEFUL, 0,
> >     "1", "next;");
> >
> >               /* Send the IPv6 NS packets to next table. When
> ovn-controller
> >                * generates IPv6 NS (for the action - nd_ns{}), the
> injected
> >     diff --git a/ovn-architecture.7.xml b/ovn-architecture.7.xml
> >     index 246cebc19..b1a462933 100644
> >     --- a/ovn-architecture.7.xml
> >     +++ b/ovn-architecture.7.xml
> >     @@ -1210,11 +1210,12 @@
> >           <dd>
> >             Fields that denote the connection tracking zones for
> >     routers.  These
> >             values only have local significance and are not meaningful
> >     between
> >     -      chassis.  OVN stores the zone information for DNATting in
> >     Open vSwitch
> >     +      chassis.  OVN stores the zone information for north to south
> >     traffic
> >     +      (for DNATting or ECMP symmetric replies) in Open vSwitch
> >               <!-- Keep the following in sync with MFF_LOG_DNAT_ZONE and
> >               MFF_LOG_SNAT_ZONE in ovn/lib/logical-fields.h. -->
> >     -      extension register number 11 and zone information for SNATing
> in
> >     -      Open vSwitch extension register number 12.
> >     +      extension register number 11 and zone information for south
> >     to north
> >     +      traffic (for SNATing) in Open vSwitch extension register
> >     number 12.
> >           </dd>
> >
> >           <dt>logical flow flags</dt>
> >     diff --git a/ovn-nb.ovsschema b/ovn-nb.ovsschema
> >     index da9af7157..16f7794f2 100644
> >     --- a/ovn-nb.ovsschema
> >     +++ b/ovn-nb.ovsschema
> >     @@ -1,7 +1,7 @@
> >       {
> >           "name": "OVN_Northbound",
> >           "version": "5.24.0",
> >     -    "cksum": "1092394564 25961",
> >     +    "cksum": "679745602 26116",
> >           "tables": {
> >               "NB_Global": {
> >                   "columns": {
> >     @@ -365,6 +365,9 @@
> >                                           "min": 0, "max": 1}},
> >                       "nexthop": {"type": "string"},
> >                       "output_port": {"type": {"key": "string", "min":
> >     0, "max": 1}},
> >     +                "options": {
> >     +                    "type": {"key": "string", "value": "string",
> >     +                             "min": 0, "max": "unlimited"}},
> >                       "external_ids": {
> >                           "type": {"key": "string", "value": "string",
> >                                    "min": 0, "max": "unlimited"}}},
> >     diff --git a/ovn-nb.xml b/ovn-nb.xml
> >     index db5908cd5..5e434d257 100644
> >     --- a/ovn-nb.xml
> >     +++ b/ovn-nb.xml
> >     @@ -2481,6 +2481,22 @@
> >             </column>
> >           </group>
> >
> >     +    <group title="Common options">
> >     +      <column name="options">
> >     +        This column provides general key/value settings. The
> supported
> >     +        options are described individually below.
> >     +      </column>
> >     +
> >     +      <column name="options" key="ecmp_symmetric_reply">
> >     +        If true, then new traffic that arrives over this route will
> >     have
> >     +        its reply traffic bypass ECMP route selection and will be
> >     sent out
> >     +        this route instead. Note that this option overrides any
> >     rules set
> >     +        in the <ref table="Logical_Router_policy" /> table. This
> option
> >     +        only works on gateway routers (routers that have
> >     +        <ref column="options" key="chassis" table="Logical_Router"
> >     /> set).
> >     +      </column>
> >     +    </group>
> >     +
> >         </table>
> >
> >         <table name="Logical_Router_Policy" title="Logical router
> policies">
> >     diff --git a/tests/ovn.at b/tests/ovn.at
> >     index f8dde14c2..c1ab6b85f 100644
> >     --- a/tests/ovn.at
> >     +++ b/tests/ovn.at
> >     @@ -195,6 +195,8 @@ ct.snat = ct_state[6]
> >       ct.trk = ct_state[5]
> >       ct_label = NXM_NX_CT_LABEL
> >       ct_label.blocked = ct_label[0]
> >     +ct_label.ecmp_reply_eth = ct_label[0..47]
> >     +ct_label.ecmp_reply_port = ct_label[48..63]
> >       ct_mark = NXM_NX_CT_MARK
> >       ct_state = NXM_NX_CT_STATE
> >       ]])
> >     @@ -16065,7 +16067,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       # Since the sw0-vir is not claimed by any chassis, eth.dst should
> >     be set to
> >       # zero if the ip4.dst is the virtual ip in the router pipeline.
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     00:00:00:00:00:00; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     00:00:00:00:00:00; next;)
> >       ])
> >
> >       ip_to_hex() {
> >     @@ -16116,7 +16118,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       # There should be an arp resolve flow to resolve the virtual_ip
> >     with the
> >       # sw0-p1's MAC.
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:03; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:03; next;)
> >       ])
> >
> >       # Forcibly clear virtual_parent. ovn-controller should release the
> >     binding
> >     @@ -16157,7 +16159,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       # There should be an arp resolve flow to resolve the virtual_ip
> >     with the
> >       # sw0-p2's MAC.
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:05; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:05; next;)
> >       ])
> >
> >       # send the garp from sw0-p2 (in hv2). hv2 should claim sw0-vir
> >     @@ -16180,7 +16182,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       # There should be an arp resolve flow to resolve the virtual_ip
> >     with the
> >       # sw0-p3's MAC.
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:04; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:04; next;)
> >       ])
> >
> >       # Now send arp reply from sw0-p1. hv1 should claim sw0-vir
> >     @@ -16201,7 +16203,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       > lflows.txt
> >
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:03; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:03; next;)
> >       ])
> >
> >       # Delete hv1-vif1 port. hv1 should release sw0-vir
> >     @@ -16219,7 +16221,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       > lflows.txt
> >
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     00:00:00:00:00:00; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     00:00:00:00:00:00; next;)
> >       ])
> >
> >       # Now send arp reply from sw0-p2. hv2 should claim sw0-vir
> >     @@ -16240,7 +16242,7 @@ ovn-sbctl dump-flows lr0 | grep
> >     lr_in_arp_resolve | grep "reg0 == 10.0.0.10" \
> >       > lflows.txt
> >
> >       AT_CHECK([cat lflows.txt], [0], [dnl
> >     -  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:04; next;)
> >     +  table=13(lr_in_arp_resolve  ), priority=100  , match=(outport ==
> >     "lr0-sw0" && reg0 == 10.0.0.10), action=(eth.dst =
> >     50:54:00:00:00:04; next;)
> >       ])
> >
> >       # Delete sw0-p2 logical port
> >     @@ -20274,22 +20276,22 @@ ovn-nbctl set logical_router_policy $pol5
> >     options:pkt_mark=5
> >       ovn-nbctl --wait=hv sync
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x64->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x3->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x4->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x5->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >     @@ -20380,12 +20382,12 @@ send_ipv4_pkt hv1 hv1-vif1 505400000003
> >     00000000ff01 \
> >           $(ip_to_hex 10 0 0 3) $(ip_to_hex 172 168 0 120)
> >
> >       OVS_WAIT_UNTIL([
> >     -    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x2->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >       AT_CHECK([
> >     -    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=19 | \
> >     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=20 | \
> >           grep "load:0x64->NXM_NX_PKT_MARK" -c)
> >       ])
> >
> >     @@ -20741,3 +20743,126 @@ AT_CHECK([test "$hv2_offlows" =
> >     "$hv2_offlows_mon"])
> >
> >       OVN_CLEANUP([hv1], [hv2])
> >       AT_CLEANUP
> >     +
> >     +AT_SETUP([ovn -- Symmetric ECMP reply flows])
> >     +ovn_start
> >     +
> >     +net_add n1
> >     +sim_add hv1
> >     +as hv1
> >     +ovs-vsctl add-br br-phys
> >     +ovn_attach n1 br-phys 192.168.0.1
> >     +
> >     +sim_add hv2
> >     +as hv2
> >     +ovs-vsctl add-br br-phys
> >     +ovn_attach n1 br-phys 192.168.0.2
> >     +
> >     +# Logical network
> >     +#
> >     +#   ls1 \
> >     +#        \
> >     +#         DR -- join -- GW -- ext
> >     +#        /
> >     +#   ls2 /
> >     +#
> >     +#  ls1 and ls2 are internal switches connected to distributed router
> >     +#  DR. DR is then connected via a join switch to gateway router GW.
> >     +#  GW is then connected to external switch ext. In real life, this
> >     +#  would likely have a localnet port, but for the purposes of this test
> >     +#  it is unnecessary.
> >     +
> >     +ovn-nbctl create Logical_Router name=DR
> >     +gw_uuid=$(ovn-nbctl create Logical_Router name=GW)
> >     +
> >     +ovn-nbctl ls-add ls1
> >     +ovn-nbctl ls-add ls2
> >     +ovn-nbctl ls-add join
> >     +ovn-nbctl ls-add ext
> >     +
> >     +# Connect ls1 to DR
> >     +ovn-nbctl lrp-add DR dr-ls1 00:00:01:01:02:03 10.0.0.1/24
> >     +ovn-nbctl lsp-add ls1 ls1-dr -- set Logical_Switch_Port ls1-dr \
> >     +    type=router options:router-port=dr-ls1
> >     addresses='"00:00:01:01:02:03"'
> >     +
> >     +# Connect ls2 to DR
> >     +ovn-nbctl lrp-add DR dr-ls2 00:00:01:01:02:04 10.0.0.2/24
> >     +ovn-nbctl lsp-add ls2 ls2-dr -- set Logical_Switch_Port ls2-dr \
> >     +    type=router options:router-port=dr-ls2
> >     addresses='"00:00:01:01:02:04"'
> >     +
> >     +# Connect join to DR
> >     +ovn-nbctl lrp-add DR dr-join 00:00:02:01:02:03 20.0.0.1/24
> >     +ovn-nbctl lsp-add join join-dr -- set Logical_Switch_Port join-dr \
> >     +    type=router options:router-port=dr-join
> >     addresses='"00:00:02:01:02:03"'
> >     +
> >     +# Connect join to GW
> >     +ovn-nbctl lrp-add GW gw-join 00:00:02:01:02:04 20.0.0.2/24
> >     +ovn-nbctl lsp-add join join-gw -- set Logical_Switch_Port join-gw \
> >     +    type=router options:router-port=gw-join
> >     addresses='"00:00:02:01:02:04"'
> >     +
> >     +# Connect ext to GW
> >     +ovn-nbctl lrp-add GW gw-ext 00:00:03:01:02:03 172.16.0.1/16
> >     +ovn-nbctl lsp-add ext ext-gw -- set Logical_Switch_Port ext-gw \
> >     +    type=router options:router-port=gw-ext
> >     addresses='"00:00:03:01:02:03"'
> >     +
> >     +ovn-nbctl lr-route-add GW 10.0.0.0/24 20.0.0.1
> >     +ovn-nbctl --policy="src-ip" lr-route-add DR 10.0.0.0/24 20.0.0.2
> >     +
> >     +# Now add some ECMP routes to the GW router.
> >     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.2
> >     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add GW 10.0.0.0/24 172.16.0.3
> >     +
> >     +ovn-nbctl --wait=hv sync
> >     +
> >     +# Ensure ECMP symmetric reply flows are not present on any hypervisor.
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
> >     +    grep "priority=100" | \
> >     +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
> >     +])
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
> >     +    grep "priority=200" | \
> >     +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
> >     +])
> >     +
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
> >     +    grep "priority=100" | \
> >     +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
> >     +])
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
> >     +    grep "priority=200" | \
> >     +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
> >     +])
> >     +
> >     +# Now make GW a gateway router on hv1
> >     +ovn-nbctl set Logical_Router $gw_uuid options:chassis=hv1
> >     +ovn-nbctl --wait=hv sync
> >     +
> >     +# And ensure that ECMP symmetric reply flows are present only on hv1
> >     +AT_CHECK([
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=15 | \
> >     +    grep "priority=100" | \
> >     +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
> >     +])
> >     +AT_CHECK([
> >     +    test 1 -eq $(as hv1 ovs-ofctl dump-flows br-int table=21 | \
> >     +    grep "priority=200" | \
> >     +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
> >     +])
> >     +
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=15 | \
> >     +    grep "priority=100" | \
> >     +    grep "ct(commit,zone=NXM_NX_REG11\\[[0..15\\]],exec(move:NXM_OF_ETH_SRC\\[[\\]]->NXM_NX_CT_LABEL\\[[32..79\\]],load:0x[[0-9]]->NXM_NX_CT_LABEL\\[[80..95\\]]))" -c)
> >     +])
> >     +AT_CHECK([
> >     +    test 0 -eq $(as hv2 ovs-ofctl dump-flows br-int table=21 | \
> >     +    grep "priority=200" | \
> >     +    grep "actions=move:NXM_NX_CT_LABEL\\[[32..79\\]]->NXM_OF_ETH_DST\\[[\\]]" -c)
> >     +])
> >     +
> >     +OVN_CLEANUP([hv1], [hv2])
> >     +AT_CLEANUP
> >     diff --git a/tests/system-ovn.at b/tests/system-ovn.at
> >     index eddc530f9..e239b7394 100644
> >     --- a/tests/system-ovn.at
> >     +++ b/tests/system-ovn.at
> >     @@ -4483,3 +4483,147 @@ OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query
> >     port patch-.*/d
> >       /connection dropped.*/d"])
> >
> >       AT_CLEANUP
> >     +
> >     +AT_SETUP([ovn -- ECMP symmetric reply])
> >     +AT_KEYWORDS([ecmp])
> >     +
> >     +CHECK_CONNTRACK()
> >     +ovn_start
> >     +
> >     +OVS_TRAFFIC_VSWITCHD_START()
> >     +ADD_BR([br-int])
> >     +
> >     +# Set external-ids in br-int needed for ovn-controller
> >     +ovs-vsctl \
> >     +        -- set Open_vSwitch . external-ids:system-id=hv1 \
> >     +        -- set Open_vSwitch .
> >     external-ids:ovn-remote=unix:$ovs_base/ovn-sb/ovn-sb.sock \
> >     +        -- set Open_vSwitch . external-ids:ovn-encap-type=geneve \
> >     +        -- set Open_vSwitch . external-ids:ovn-encap-ip=169.0.0.1 \
> >     +        -- set bridge br-int fail-mode=secure
> >     other-config:disable-in-band=true
> >     +
> >     +# Start ovn-controller
> >     +start_daemon ovn-controller
> >     +
> >     +# Logical network:
> >     +# Alice is connected to gateway router R1. R1 is connected to two
> >     "external"
> >     +# routers, R2 and R3 via an "ext" switch.
> >     +# Bob is connected to both R2 and R3. R1 contains two ECMP routes,
> >     one through R2
> >     +# and one through R3, to Bob.
> >     +#
> >     +#     alice -- R1 -- ext ---- R2
> >     +#                     |         \
> >     +#                     |           bob
> >     +#                     |         /
> >     +#                     + ----- R3
> >     +#
> >     +# For this test, Bob sends request traffic through R2 to Alice. We
> >     want to ensure that
> >     +# all response traffic from Alice is routed through R2 as well.
> >     +
> >     +ovn-nbctl create Logical_Router name=R1 options:chassis=hv1
> >     +ovn-nbctl create Logical_Router name=R2
> >     +ovn-nbctl create Logical_Router name=R3
> >     +
> >     +ovn-nbctl ls-add alice
> >     +ovn-nbctl ls-add bob
> >     +ovn-nbctl ls-add ext
> >     +
> >     +# connect alice to R1
> >     +ovn-nbctl lrp-add R1 alice 00:00:01:01:02:03 10.0.0.1/24
> >     +ovn-nbctl lsp-add alice rp-alice -- set Logical_Switch_Port rp-alice \
> >     +    type=router options:router-port=alice
> >     addresses='"00:00:01:01:02:03"'
> >     +
> >     +# connect bob to R2
> >     +ovn-nbctl lrp-add R2 R2_bob 00:00:02:01:02:03 172.16.0.2/16
> >     +ovn-nbctl lsp-add bob rp2-bob -- set Logical_Switch_Port rp2-bob \
> >     +    type=router options:router-port=R2_bob
> >     addresses='"00:00:02:01:02:03"'
> >     +
> >     +# connect bob to R3
> >     +ovn-nbctl lrp-add R3 R3_bob 00:00:02:01:02:04 172.16.0.3/16
> >     +ovn-nbctl lsp-add bob rp3-bob -- set Logical_Switch_Port rp3-bob \
> >     +    type=router options:router-port=R3_bob
> >     addresses='"00:00:02:01:02:04"'
> >     +
> >     +# Connect R1 to ext
> >     +ovn-nbctl lrp-add R1 R1_ext 00:00:04:01:02:03 20.0.0.1/24
> >     +ovn-nbctl lsp-add ext r1-ext -- set Logical_Switch_Port r1-ext \
> >     +    type=router options:router-port=R1_ext
> >     addresses='"00:00:04:01:02:03"'
> >     +
> >     +# Connect R2 to ext
> >     +ovn-nbctl lrp-add R2 R2_ext 00:00:04:01:02:04 20.0.0.2/24
> >     +ovn-nbctl lsp-add ext r2-ext -- set Logical_Switch_Port r2-ext \
> >     +    type=router options:router-port=R2_ext
> >     addresses='"00:00:04:01:02:04"'
> >     +
> >     +# Connect R3 to ext
> >     +ovn-nbctl lrp-add R3 R3_ext 00:00:04:01:02:05 20.0.0.3/24
> >     +ovn-nbctl lsp-add ext r3-ext -- set Logical_Switch_Port r3-ext \
> >     +    type=router options:router-port=R3_ext
> >     addresses='"00:00:04:01:02:05"'
> >     +
> >     +# Install ECMP routes for alice.
> >     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.2
> >     +ovn-nbctl --ecmp-symmetric-reply --policy="src-ip" lr-route-add R1 10.0.0.0/24 20.0.0.3
> >     +
> >     +# Static Routes
> >     +ovn-nbctl lr-route-add R2 10.0.0.0/24 20.0.0.1
> >     +ovn-nbctl lr-route-add R3 10.0.0.0/24 20.0.0.1
> >     +
> >     +# Logical port 'alice1' in switch 'alice'.
> >     +ADD_NAMESPACES(alice1)
> >     +ADD_VETH(alice1, alice1, br-int, "10.0.0.2/24", "f0:00:00:01:02:04", \
> >     +         "10.0.0.1")
> >     +ovn-nbctl lsp-add alice alice1 \
> >     +-- lsp-set-addresses alice1 "f0:00:00:01:02:04 10.0.0.2"
> >     +
> >     +# Logical port 'bob1' in switch 'bob'.
> >     +ADD_NAMESPACES(bob1)
> >     +ADD_VETH(bob1, bob1, br-int, "172.16.0.1/16", "f0:00:00:01:02:06", \
> >     +         "172.16.0.2")
> >     +ovn-nbctl lsp-add bob bob1 \
> >     +-- lsp-set-addresses bob1 "f0:00:00:01:02:06 172.16.0.1"
> >     +
> >     +# Ensure ovn-controller is caught up
> >     +ovn-nbctl --wait=hv sync
> >     +
> >     +on_exit 'ovs-ofctl dump-flows br-int'
> >     +
> >     +# 'bob1' should be able to ping 'alice1' directly.
> >     +NS_CHECK_EXEC([bob1], [ping -q -c 20 -i 0.3 -w 15 10.0.0.2 |
> >     FORMAT_PING], \
> >     +[0], [dnl
> >     +20 packets transmitted, 20 received, 0% packet loss, time 0ms
> >     +])
> >     +
> >     +# Ensure conntrack entry is present. We should not try to predict
> >     +# the tunnel key for the output port, so we strip it from the labels
> >     +# and just ensure that the known ethernet address is present.
> >     +AT_CHECK([ovs-appctl dpctl/dump-conntrack | FORMAT_CT(172.16.0.1) | \
> >     +sed -e 's/zone=[[0-9]]*/zone=<cleared>/' |
> >     +sed -e 's/labels=0x[[0-9a-f]]*00000401020400000000/labels=0x00000401020400000000/'], [0], [dnl
> >     +icmp,orig=(src=172.16.0.1,dst=10.0.0.2,id=<cleared>,type=8,code=0),reply=(src=10.0.0.2,dst=172.16.0.1,id=<cleared>,type=0,code=0),zone=<cleared>,labels=0x00000401020400000000
> >     +])
> >     +
> >     +# Ensure datapaths show conntrack states as expected
> >     +# Like with conntrack entries, we shouldn't try to predict
> >     +# port binding tunnel keys. So omit them from expected labels.
> >     +AT_CHECK([ovs-appctl dpctl/dump-flows | grep 'ct_state(+new-est-rpl+trk).*ct(.*label=0x.*00000401020400000000/0xffffffffffffffff00000000)' -c], [0], [dnl
> >     +1
> >     +])
> >     +AT_CHECK([ovs-appctl dpctl/dump-flows | grep 'ct_state(-new+est+rpl+trk).*ct_label(0x.*00000401020400000000/0xffffffffffffffff00000000)' -c], [0], [dnl
> >     +1
> >     +])
> >     +
> >     +ovs-ofctl dump-flows br-int
> >     +
> >     +OVS_APP_EXIT_AND_WAIT([ovn-controller])
> >     +
> >     +as ovn-sb
> >     +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >     +
> >     +as ovn-nb
> >     +OVS_APP_EXIT_AND_WAIT([ovsdb-server])
> >     +
> >     +as northd
> >     +OVS_APP_EXIT_AND_WAIT([ovn-northd])
> >     +
> >     +as
> >     +OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
> >     +/connection dropped.*/d"])
> >     +
> >     +AT_CLEANUP
> >     diff --git a/utilities/ovn-nbctl.8.xml b/utilities/ovn-nbctl.8.xml
> >     index de86b70e6..18bf90e08 100644
> >     --- a/utilities/ovn-nbctl.8.xml
> >     +++ b/utilities/ovn-nbctl.8.xml
> >     @@ -658,7 +658,8 @@
> >
> >           <dl>
> >             <dt>[<code>--may-exist</code>]
> >     [<code>--policy</code>=<var>POLICY</var>]
> >     -        [<code>--ecmp</code>] <code>lr-route-add</code>
> >     <var>router</var>
> >     +        [<code>--ecmp</code>] [<code>--ecmp-symmetric-reply</code>]
> >     +        <code>lr-route-add</code> <var>router</var>
> >               <var>prefix</var> <var>nexthop</var> [<var>port</var>]</dt>
> >             <dd>
> >               <p>
> >     @@ -680,15 +681,31 @@
> >                 specified, the default is "dst-ip".
> >               </p>
> >
> >     +        <p>
> >     +          The <code>--ecmp</code> option allows for multiple routes
> >     with the
> >     +          same <var>prefix</var> and <var>POLICY</var> but different
> >     +          <var>nexthop</var> and <var>port</var> to be added.
> >     +        </p>
> >     +
> >     +        <p>
> >     +          The <code>--ecmp-symmetric-reply</code> option makes it
> >     so that
> >     +          traffic that arrives over an ECMP route will have its
> >     reply traffic
> >     +          sent out over that same route. Setting
> >     +          <code>--ecmp-symmetric-reply</code> implies
> >     <code>--ecmp</code> so
> >     +          it is not necessary to set both.
> >     +        </p>
> >     +
> >               <p>
> >                 It is an error if a route with <var>prefix</var> and
> >     -          <var>POLICY</var> already exists, unless
> >     <code>--may-exist</code> or
> >     -          <code>--ecmp</code> is specified.  If
> >     <code>--may-exist</code> is
> >     -          specified but not <code>--ecmp</code>, the existed route
> >     will be
> >     -          updated with the new nexthop and port.  If
> >     <code>--ecmp</code> is
> >     +          <var>POLICY</var> already exists, unless
> >     <code>--may-exist</code>,
> >     +          <code>--ecmp</code>, or
> >     <code>--ecmp-symmetric-reply</code> is
> >     +          specified.  If <code>--may-exist</code> is specified but
> not
> >     +          <code>--ecmp</code> or
> >     <code>--ecmp-symmetric-reply</code>, the
> >     +          existed route will be updated with the new nexthop and
> >     port.  If
> >     +          <code>--ecmp</code> or
> <code>--ecmp-symmetric-reply</code> is
> >                 specified, a new route will be added, regardless of the
> >     existed
> >     -          route, which is useful when adding ECMP routes, i.e.
> >     routes with same
> >     -          <var>POLICY</var> and <var>prefix</var> but different
> >     +          route, which is useful when adding ECMP routes, i.e.
> >     routes with
> >     +          same <var>POLICY</var> and <var>prefix</var> but different
> >                 <var>nexthop</var> and <var>port</var>.
> >               </p>
> >             </dd>
> >     diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
> >     index 0079ad5a6..e6d8dbe63 100644
> >     --- a/utilities/ovn-nbctl.c
> >     +++ b/utilities/ovn-nbctl.c
> >     @@ -687,7 +687,8 @@ Logical router port commands:\n\
> >                                   ('overlay' or 'bridged')\n\
> >       \n\
> >       Route commands:\n\
> >     -  [--policy=POLICY] [--ecmp] lr-route-add ROUTER PREFIX NEXTHOP
> >     [PORT]\n\
> >     +  [--policy=POLICY] [--ecmp] [--ecmp-symmetric-reply] lr-route-add
> >     ROUTER \n\
> >     +                            PREFIX NEXTHOP [PORT]\n\
> >                                   add a route to ROUTER\n\
> >         [--policy=POLICY] lr-route-del ROUTER [PREFIX [NEXTHOP
> [PORT]]]\n\
> >                                   remove routes from ROUTER\n\
> >     @@ -3855,7 +3856,10 @@ nbctl_lr_route_add(struct ctl_context *ctx)
> >           }
> >
> >           bool may_exist = shash_find(&ctx->options, "--may-exist") != NULL;
> >     -    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL;
> >     +    bool ecmp_symmetric_reply = shash_find(&ctx->options,
> >     +                                           "--ecmp-symmetric-reply") != NULL;
> >     +    bool ecmp = shash_find(&ctx->options, "--ecmp") != NULL ||
> >     +                ecmp_symmetric_reply;
> >           if (!ecmp) {
> >               for (int i = 0; i < lr->n_static_routes; i++) {
> >                   const struct nbrec_logical_router_static_route *route
> >     @@ -3920,6 +3924,13 @@ nbctl_lr_route_add(struct ctl_context *ctx)
> >               nbrec_logical_router_static_route_set_policy(route,
> policy);
> >           }
> >
> >     +    if (ecmp_symmetric_reply) {
> >     +        const struct smap options = SMAP_CONST1(&options,
> >     +                                                "ecmp_symmetric_reply",
> >     +                                                "true");
> >     +        nbrec_logical_router_static_route_set_options(route, &options);
> >     +    }
> >     +
> >           nbrec_logical_router_verify_static_routes(lr);
> >           struct nbrec_logical_router_static_route **new_routes
> >               = xmalloc(sizeof *new_routes * (lr->n_static_routes + 1));
> >     @@ -6361,7 +6372,8 @@ static const struct ctl_command_syntax
> >     nbctl_commands[] = {
> >
> >           /* logical router route commands. */
> >           { "lr-route-add", 3, 4, "ROUTER PREFIX NEXTHOP [PORT]", NULL,
> >     -      nbctl_lr_route_add, NULL, "--may-exist,--ecmp,--policy=", RW
> },
> >     +      nbctl_lr_route_add, NULL,
> >     "--may-exist,--ecmp,--ecmp-symmetric-reply,"
> >     +      "--policy=", RW },
> >           { "lr-route-del", 1, 4, "ROUTER [PREFIX [NEXTHOP [PORT]]]",
> NULL,
> >             nbctl_lr_route_del, NULL, "--if-exists,--policy=", RW },
> >           { "lr-route-list", 1, 1, "ROUTER", NULL, nbctl_lr_route_list,
> >     NULL,
> >     --
> >     2.25.4
> >
> >     _______________________________________________
> >     dev mailing list
> >     d...@openvswitch.org
> >     https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
>
>
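For anyone reading the conntrack output in the system test above, here is a
small sketch of how the label can be decoded. It assumes the layout implied by
the flows the tests grep for, with the Ethernet source address in
NXM_NX_CT_LABEL[32..79] and the port key in NXM_NX_CT_LABEL[80..95]. The helper
name decode_ecmp_label and the port key value 2 are hypothetical and only for
illustration; they are not part of the patch.

```python
from typing import Tuple


def decode_ecmp_label(label_hex: str) -> Tuple[str, int]:
    """Split a ct_label hex string into (MAC address, port key).

    Assumed layout (taken from the flows checked in the tests):
      bits 32..79 -> saved Ethernet source address
      bits 80..95 -> saved port key
    """
    label = int(label_hex, 16)
    mac_bits = (label >> 32) & ((1 << 48) - 1)   # 48-bit MAC field
    port_key = (label >> 80) & 0xFFFF            # 16-bit port key field
    # Format the 48-bit value as six colon-separated octets, high byte first.
    mac = ":".join(f"{(mac_bits >> shift) & 0xFF:02x}"
                   for shift in range(40, -8, -8))
    return mac, port_key


if __name__ == "__main__":
    # Label as matched by the system test, with a hypothetical port key of 2.
    print(decode_ecmp_label("0x000200000401020400000000"))
```

With the port key stripped, as the sed expression in the test does, the same
helper still recovers 00:00:04:01:02:04, which is R2's address on the ext
switch.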
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
