On 8/26/25 6:45 AM, Sragdhara Datta Chaudhuri wrote:
> Hi Dumitru,
> 

Hi Sragdhara,

> Thanks for your comments. Most are taken care of in v8. Please see inline for 
> responses.
> 
> Thanks,
> Sragdhara
> 
> From: Dumitru Ceara <dce...@redhat.com>
> Date: Thursday, August 21, 2025 at 7:46 AM
> To: Sragdhara Datta Chaudhuri <sragdha.chau...@nutanix.com>, 
> ovs-dev@openvswitch.org <ovs-dev@openvswitch.org>
> Cc: Numan Siddique <num...@ovn.org>
> Subject: Re: [ovs-dev] [PATCH OVN v6 4/5] controller, tests: Network Function 
> insertion tunneling of cross-host VLAN traffic.
> !-------------------------------------------------------------------|
>   CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> Hi Sragdhara,
> 
> Thanks for the patch!
> 
> On 8/20/25 3:25 AM, Sragdhara Datta Chaudhuri wrote:
>> For overlay subnets, all cross-host traffic exchanges are tunneled. For VLAN
>> subnets, we need to selectively tunnel traffic sent to or coming from the NF
>> ports. Consider a from-lport ACL applied to port p1 on host1. The NF ports 
>> nfp1
>> and nfp2 are on host2. A new option in LSP allows the NF ports to be linked.
>> The “network-function-linked-port” in nfp1 is to be set to nfp2 and vice 
>> versa.
> 
> This option needs to be documented in ovn-nb.xml.
> 
>> The ingress pipeline on host1 sets the outport to nfp1 and the packet is then
>> processed by table 42.
>>
>> On host1
>> --------
>> REMOTE_OUTPUT (table 43):
>> It tunnels traffic destined to all non-local overlay ports to their 
>> associated
>> hosts. The Same rule is now also added for traffic to non-local NF ports. 
>> Thus
>> the packets from p1 get tunneled to host 2.
>>
>> Upon reaching host2
>> -------------------
>> PHY_TO_LOG (table 0):
>> Existing priority 100 rule: for each geneve tunnel interface on the chassis,
>> copy info from header to inport, outport, metadata registers. Now same rule
>> also stores the tun intf id in a register (reg5[16..31]).
>>
>> CHECK_LOOPBACK (table 45)
>> This table has a rule that clears all the registers. The change is to skip 
>> the
>> clearing of reg5[16..31].
>>
>> Logical egress pipeline:
>> ls_out_stateful priority 120: If the outport is NF port, copy reg5[16..31]
>> (table0 had set it) to ct_label.tun_if_id.
>>
>> LOCAL_OUTPUT (table 44)
>> When the packet comes out of the other NF port (nfp2), following two rules 
>> send
>> it back to the host that it originally came from:
>> Priority 110: For each NF port local to this host, following rule processes 
>> the
>> packet through CT of linked port:
>>   match: inport==nfp2 && RECIRC_BIT==0
>>   action: RECIRC_BIT = 1, ct(zone=nfp1’s zone, table=LOCAL), resubmit table 
>> 43
>>
>> Priority 109: For each local {tunnel_id, nf port}, send the recirculated 
>> packet
>> using tun_if_id in ct zone:
>>   match: inport==nfp1 && RECIRC_BIT==1 && && ct_label.tun_if_id==<tun-id>
>>   action: tunnel packet using tun-id
>>
>> Case where NF responds back on nfp1, instead of forwarding to nfp2
>> ------------------------------------------------------------------
>> For example, a SYN packet from p1 got redirected to nfp1. Then the NF, which 
>> is
>> a firewall VM, drops the SYN and sends RST back on port nfp1. In this case,
>> looking up in linked port (nfp2) ct zone will not give anything. The 
>> following
>> rule uses ct.inv to identify such scenario and uses nfp1’s CT zone to send 
>> the
>> packet back. To achieve this, following 2 rules are installed:
>>
>> in_network_function:
>> Priority 100 rule that allows packets incoming from NF type ports, is 
>> enhanced
>> with additional action to store the tun_if_id from ct_label into 
>> reg5[16..31].
>>
>> LOCAL_OUTPUT (table 44)
>> Priority 110 rule: for recirculated packets, if ct (of the linked port) is
>> invalid, use the tun id from MFF_LOG_TUN_OFPORT to tunnel the packet back (as
>> CT zone info has been overwritten in the above 110 priority rule in table 
>> 42).
>>   match: inport==nfp1 && RECIRC_BIT==1 && ct.inv && reg5[16..31]==<tun-id>
>>   action: tunnel packet using tun-id
>>
>> Signed-off-by: Sragdhara Datta Chaudhuri <sragdha.chau...@nutanix.com>
>> Acked-by: Naveen Yerramneni <naveen.yerramn...@nutanix.com>
>> ---
>>  controller/physical.c        | 324 ++++++++++++++++++++++++--
>>  include/ovn/logical-fields.h |   9 +
>>  lib/logical-fields.c         |  10 +
>>  northd/northd.c              |  61 ++++-
>>  tests/ovn-controller.at      |   4 +-
>>  tests/ovn-northd.at          |  32 ++-
>>  tests/ovn.at                 | 434 +++++++++++++++++++++++++++--------
>>  7 files changed, 743 insertions(+), 131 deletions(-)
>>
>> diff --git a/controller/physical.c b/controller/physical.c
>> index c5f74010a..6ebc880b0 100644
>> --- a/controller/physical.c
>> +++ b/controller/physical.c
>> @@ -173,6 +173,8 @@ put_decapsulation(enum mf_field_id mff_ovn_geneve,
>>          put_move(MFF_TUN_ID, 0,  MFF_LOG_DATAPATH, 0, 24, ofpacts);
>>          put_move(mff_ovn_geneve, 16, MFF_LOG_INPORT, 0, 15, ofpacts);
>>          put_move(mff_ovn_geneve, 0, MFF_LOG_OUTPORT, 0, 16, ofpacts);
>> +        put_load(ofp_to_u16(tun->ofport), MFF_LOG_TUN_OFPORT,
>> +                 16, 16, ofpacts);
>>      } else if (tun->type == VXLAN) {
>>          /* Add flows for non-VTEP tunnels. Split VNI into two 12-bit
>>           * sections and use them for datapath and outport IDs. */
>> @@ -385,6 +387,15 @@ match_outport_dp_and_port_keys(struct match *match,
>>      match_set_reg(match, MFF_LOG_OUTPORT - MFF_REG0, port_key);
>>  }
>>
>> +static void
>> +match_inport_dp_and_port_keys(struct match *match,
>> +                              uint32_t dp_key, uint32_t port_key)
>> +{
>> +    match_init_catchall(match);
>> +    match_set_metadata(match, htonll(dp_key));
>> +    match_set_reg(match, MFF_LOG_INPORT - MFF_REG0, port_key);
>> +}
>> +
>>  static struct sbrec_encap *
>>  find_additional_encap_for_chassis(const struct sbrec_port_binding *pb,
>>                                    const struct sbrec_chassis *chassis_rec)
>> @@ -450,7 +461,8 @@ put_remote_port_redirect_overlay(const struct 
>> sbrec_port_binding *binding,
>>                                   uint32_t port_key,
>>                                   struct match *match,
>>                                   struct ofpbuf *ofpacts_p,
>> -                                 struct ovn_desired_flow_table *flow_table)
>> +                                 struct ovn_desired_flow_table *flow_table,
>> +                                 bool allow_hairpin)
>>  {
>>      /* Setup encapsulation */
>>      for (size_t i = 0; i < ctx->n_encap_ips; i++) {
>> @@ -469,6 +481,14 @@ put_remote_port_redirect_overlay(const struct 
>> sbrec_port_binding *binding,
>>                           ofpacts_clone);
>>              }
>>
>> +            /* Clear the MFF_INPORT if the same packet may need to go out 
>> from
>> +             * the same tunnel inport. */
>> +            if (allow_hairpin) {
>> +                put_stack(MFF_IN_PORT, 
>> ofpact_put_STACK_PUSH(ofpacts_clone));
>> +                put_load(ofp_to_u16(OFPP_NONE), MFF_IN_PORT, 0, 16,
>> +                         ofpacts_clone);
>> +            }
>> +
>>              const struct chassis_tunnel *tun;
>>              VECTOR_FOR_EACH (&tuns, tun) {
>>                  put_encapsulation(ctx->mff_ovn_geneve, tun, 
>> binding->datapath,
>> @@ -476,6 +496,11 @@ put_remote_port_redirect_overlay(const struct 
>> sbrec_port_binding *binding,
>>                  ofpact_put_OUTPUT(ofpacts_clone)->port = tun->ofport;
>>              }
>>              put_resubmit(OFTABLE_LOCAL_OUTPUT, ofpacts_clone);
>> +
>> +            if (allow_hairpin) {
>> +                put_stack(MFF_IN_PORT, ofpact_put_STACK_POP(ofpacts_clone));
>> +            }
>> +
>>              ofctrl_add_flow(flow_table, OFTABLE_REMOTE_OUTPUT, 100,
>>                              binding->header_.uuid.parts[0], match,
>>                              ofpacts_clone, &binding->header_.uuid);
>> @@ -485,6 +510,239 @@ put_remote_port_redirect_overlay(const struct 
>> sbrec_port_binding *binding,
>>      }
>>  }
>>
>> +static const struct sbrec_port_binding *
>> +get_binding_network_function_linked_port(
>> +                    struct ovsdb_idl_index *sbrec_port_binding_by_name,
>> +                    const struct sbrec_port_binding *binding)
>> +{
>> +    const char *nf_linked_name = smap_get(&binding->options,
>> +                                          "network-function-linked-port");
>> +    if (!nf_linked_name) {
>> +        return NULL;
>> +    }
>> +    VLOG_DBG("get NF linked port_binding %s:%s",
>> +             binding->logical_port, nf_linked_name);
>> +    const struct sbrec_port_binding *nf_linked_port = lport_lookup_by_name(
>> +        sbrec_port_binding_by_name, nf_linked_name);
>> +    static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
>> +    if (!nf_linked_port) {
>> +        VLOG_INFO("Binding not found for network-function-linked-port"
>> +                  " %s", nf_linked_name);
> 
> This will spam ovn-controller's logs.  We should probably rate limit it.
>  Also, should it be a WARN?
> [Sragdhara] Added rate limit. Even under normal scenarios, this log might be 
> seen transiently. That’s why didn’t make these warning. That causes unit 
> tests to fail. Same for other places.

If needed we can ignore these logs in the test, we do that for other
similar logs.  But VLOG_INFO_RL is OK for me too.

> 
>> +        return NULL;
>> +    }
>> +    if (strcmp(nf_linked_port->type, binding->type)) {
>> +        VLOG_ERR_RL(&rl, "Binding type mismatch between %s and "
>> +                  "network-function-linked-port %s",
>> +                  binding->logical_port,  nf_linked_name);
>> +        return NULL;
>> +    }
>> +    const char *nf_linked_linked_name = smap_get(
>> +        &nf_linked_port->options, "network-function-linked-port");
>> +    if (!nf_linked_linked_name || strcmp(nf_linked_linked_name,
>> +                                         binding->logical_port)) {
>> +        VLOG_INFO("LSP name %s does not match linked_linked_name",
>> +                  binding->logical_port);
> 
> Same comment here.
> 
>> +        return NULL;
>> +    }
>> +
>> +    return nf_linked_port;
>> +}
>> +
>> +static void
>> +send_traffic_by_tunnel(
>> +    const struct sbrec_port_binding *binding,
>> +    struct match *match,
>> +    struct ofpbuf *ofpacts_p,
>> +    uint32_t dp_key,
>> +    uint32_t port_key,
>> +    struct chassis_tunnel *tun,
>> +    enum mf_field_id mff_ovn_geneve,
>> +    struct ovn_desired_flow_table *flow_table)
> 
> The indentation doesn't really align with most of the existing code.  I
> think I'd write it as:
> 
> static void
> send_traffic_by_tunnel(const struct sbrec_port_binding *binding,
>                        struct match *match, struct ofpbuf *ofpacts_p,
>                        uint32_t dp_key, uint32_t port_key,
>                        struct chassis_tunnel *tun,
>                        enum mf_field_id mff_ovn_geneve,
>                        struct ovn_desired_flow_table *flow_table)
> 
>> +{
>> +    match_init_catchall(match);
>> +    ofpbuf_clear(ofpacts_p);
>> +
>> +    match_inport_dp_and_port_keys(match, dp_key, port_key);
>> +    match_set_reg_masked(match, MFF_LOG_FLAGS - MFF_REG0, MLF_RECIRC,
>> +                         MLF_RECIRC);
>> +    ovs_u128 of_tun_ct_label_id_val = {
>> +        .u64.hi = ((uint32_t) ofp_to_u16(tun->ofport)) << 16,
>> +    };
>> +    ovs_u128 of_tun_ct_label_id_mask = {
>> +        .u64.hi = 0x00000000ffff0000,
>> +    };
>> +
>> +    match_set_ct_label_masked(match, of_tun_ct_label_id_val,
>> +                              of_tun_ct_label_id_mask);
>> +
>> +    put_load(binding->datapath->tunnel_key, MFF_TUN_ID, 0, 24, ofpacts_p);
>> +    put_move(MFF_LOG_OUTPORT, 0, mff_ovn_geneve, 0, 32, ofpacts_p);
>> +    put_load(port_key, mff_ovn_geneve, 16, 15, ofpacts_p);
>> +
>> +    ofpact_put_OUTPUT(ofpacts_p)->port = tun->ofport;
>> +    ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 109,
>> +                    binding->header_.uuid.parts[0], match,
>> +                    ofpacts_p, &binding->header_.uuid);
>> +}
>> +
>> +
>> +static void
>> +put_redirect_overlay_to_source(
>> +    const struct sbrec_port_binding *binding,
>> +    int linked_ct,
>> +    const struct hmap *chassis_tunnels,
>> +    enum mf_field_id mff_ovn_geneve,
>> +    struct match *match,
>> +    struct ofpbuf *ofpacts_p,
>> +    struct ovn_desired_flow_table *flow_table)
> 
> Same comment about indentation here too.
> 
>> +{
>> +    uint32_t dp_key = binding->datapath->tunnel_key;
>> +    uint32_t port_key = binding->tunnel_key;
>> +
>> +    /* Say, a network function has ports nf1 and nf2. The source port p1 is 
>> on
>> +     * a different host. The packet redirected from p1 was tunneled to the 
>> NF
>> +     * host. In PHY_TO_LOG table the tunnel interface id is stored in
>> +     * MFF_LOG_TUN_OFPORT. The egress pipeline then commits it into ct_label
>> +     * tun_if_id in nf1's zone (out_stateful priority 120 rule). When the 
>> same
>> +     * packet comes out from nf2, two rules process it:
>> +     * first rule sets recirc bit to 1 and processes the packet through 
>> nf1's
>> +     * ct zone and resubmits to same table. When the recirculated packet 
>> comes
>> +     * back, the second rule (which checks recirc bit == 1) uses the 
>> tun_if_id
>> +     * from ct_label to send the packet back to p1's host.
>> +     */
>> +
>> +
> 
> Nit: one empty line too many.
> 
>> +    /* Table 44 (LOCAL_OUTPUT), priority 110
>> +     * =====================================
>> +     *
>> +     * Each flow matches a logical inport to a nf port and checks if
>> +     * recirc bit is 0 (i.e. packet first time being processed by this 
>> table).
>> +     * The action processes the packet through ct zone of the linked nf port
>> +     * and resubmits to the same table after setting recirc bit to 1.
>> +     * match: inport == svc-port[i] && MLF_RECIRC_BIT = 0
>> +     * action: MLF_RECIRC_BIT = 1, ct(zone=linked-zone[i], table=LOCAL)
>> +     */
>> +    match_init_catchall(match);
>> +    ofpbuf_clear(ofpacts_p);
>> +    match_inport_dp_and_port_keys(match, dp_key, port_key);
>> +    match_set_dl_type(match, htons(ETH_TYPE_IP));
>> +    match_set_reg_masked(match, MFF_LOG_FLAGS - MFF_REG0, 0, MLF_RECIRC);
>> +
>> +    put_load(1, MFF_LOG_FLAGS, MLF_RECIRC_BIT, 1, ofpacts_p);
>> +    put_load(linked_ct, MFF_LOG_CT_ZONE, 0, 16, ofpacts_p);
>> +
>> +    struct ofpact_conntrack *ct = ofpact_put_CT(ofpacts_p);
>> +    ct->recirc_table = OFTABLE_LOCAL_OUTPUT;
>> +    ct->zone_src.field = mf_from_id(MFF_LOG_CT_ZONE);
>> +    ct->zone_src.ofs = 0;
>> +    ct->zone_src.n_bits = 16;
>> +    ct->flags = 0;
>> +    ct->alg = 0;
>> +    ofpact_finish(ofpacts_p, &ct->ofpact);
>> +
>> +    ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 110,
>> +                    binding->header_.uuid.parts[0], match,
>> +                    ofpacts_p, &binding->header_.uuid);
>> +
>> +    /* Table 44 (LOCAL_OUTPUT), priority 110
>> +     * In case NF is sending back a response on the port it received the
>> +     * packet on, instead of forwarding out of the other port (e.g. NF 
>> sending
>> +     * RST to the SYN received), the ct lookup in linked port's zone would
>> +     * fail. Based on ct.inv check the packet is then tunneled back using
>> +     * the tunnel id from this port's zone itself. The above rule has
>> +     * overwritten the zone info by now, so we recover it from the register
>> +     * that was populated by in_network_function stage with the tunnel id.
>> +     * match: inport == svc-port[i] && MLF_RECIRC_BIT = 1
>> +     *        && ct.inv && MFF_LOG_TUN_OFPORT == <tun-id>
>> +     * action: tunnel back using above tun-id
>> +     */
>> +    struct chassis_tunnel *tun;
>> +    HMAP_FOR_EACH (tun, hmap_node, chassis_tunnels) {
>> +        match_init_catchall(match);
>> +        ofpbuf_clear(ofpacts_p);
>> +        match_inport_dp_and_port_keys(match, dp_key, port_key);
>> +        match_set_reg_masked(match, MFF_LOG_FLAGS - MFF_REG0, MLF_RECIRC,
>> +                             MLF_RECIRC);
>> +        match_set_ct_state_masked(match, OVS_CS_F_INVALID, 
>> OVS_CS_F_INVALID);
>> +        match_set_reg_masked(match, MFF_LOG_TUN_OFPORT - MFF_REG0,
>> +                             ((uint32_t) ofp_to_u16(tun->ofport)) << 16,
>> +                             ((uint32_t) 0xffff) << 16);
>> +        put_load(binding->datapath->tunnel_key, MFF_TUN_ID, 0, 24, 
>> ofpacts_p);
>> +        put_move(MFF_LOG_OUTPORT, 0, mff_ovn_geneve, 0, 32, ofpacts_p);
>> +        put_load(port_key, mff_ovn_geneve, 16, 15, ofpacts_p);
>> +
>> +        ofpact_put_OUTPUT(ofpacts_p)->port = tun->ofport;
>> +        ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 110,
>> +                        binding->header_.uuid.parts[0], match,
>> +                        ofpacts_p, &binding->header_.uuid);
>> +    }
>> +
>> +    /* Table 44 (LOCAL_OUTPUT), priority 109
>> +     * =====================================
>> +     *
>> +     * A flow is installed For each {remote tunnel_id, nf port} 
>> combination. It
>> +     * matches the inport with the nf port and the ct_label.tun_if_id with 
>> the
>> +     * tunnel_id. Also checks if the recirc bit is 1 (i.e. packet being
>> +     * processed by this table second time). The action is to send the 
>> packet
>> +     * out using the tunnel interface.
>> +     * match: inport == svc-port[i] && MLF_RECIRC_BIT = 1
>> +     *        && ct_label.tun_if_id == <tun-id>
>> +     * action: tunnel back using tun-id
>> +     */
>> +    HMAP_FOR_EACH (tun, hmap_node, chassis_tunnels) {
>> +        send_traffic_by_tunnel(binding, match, ofpacts_p, dp_key, port_key,
>> +                               tun, mff_ovn_geneve, flow_table);
>> +    }
>> +    ofpbuf_clear(ofpacts_p);
>> +}
>> +
>> +static void
>> +put_redirect_overlay_to_source_from_nf_port(
>> +    const struct sbrec_port_binding *binding,
>> +    struct ovsdb_idl_index *sbrec_port_binding_by_name,
>> +    const struct hmap *chassis_tunnels,
>> +    const struct shash *ct_zones,
>> +    enum mf_field_id mff_ovn_geneve,
>> +    struct match *match,
>> +    struct ofpbuf *ofpacts_p,
>> +    struct ovn_desired_flow_table *flow_table)
> 
> Nit: indentation.
> 
>> +{
>> +    const struct sbrec_port_binding *linked_pb;
>> +    linked_pb = get_binding_network_function_linked_port(
>> +        sbrec_port_binding_by_name, binding);
>> +    if (!linked_pb) {
>> +        VLOG_INFO("Linked port not found for %s", binding->logical_port);
> 
> This needs to be rate limited and maybe a warning.
> 
>> +        return;
>> +    }
>> +    struct zone_ids zone = get_zone_ids(binding, ct_zones);
>> +    if (!zone.ct) {
>> +        VLOG_INFO("Port zone not found for %s", binding->logical_port);
> 
> This needs to be rate limited and maybe a warning
> 
> 
>> +        return;
>> +    }
>> +    struct zone_ids linked_zone = get_zone_ids(linked_pb, ct_zones);
>> +    if (!linked_zone.ct) {
>> +        VLOG_INFO("Linked port zone not found for %s", 
>> binding->logical_port);
> 
> This needs to be rate limited and maybe a warning.
> 
> 
>> +        return;
>> +    }
>> +    VLOG_INFO("Both port zones found for NF port %s", 
>> binding->logical_port);
> 
> Should this be VLOG_DBG instead?
> 
>> +    put_redirect_overlay_to_source(
>> +                                 binding,
>> +                                 linked_zone.ct,
>> +                                 chassis_tunnels,
>> +                                 mff_ovn_geneve,
>> +                                 match,
>> +                                 ofpacts_p,
>> +                                 flow_table);
>> +    put_redirect_overlay_to_source(
>> +                                 linked_pb,
>> +                                 zone.ct,
>> +                                 chassis_tunnels,
>> +                                 mff_ovn_geneve,
>> +                                 match,
>> +                                 ofpacts_p,
>> +                                 flow_table);
>> +}
>> +
>>  static void
>>  put_remote_port_redirect_overlay_ha_remote(
>>      const struct sbrec_port_binding *binding,
>> @@ -973,12 +1231,12 @@ put_local_common_flows(uint32_t dp_key,
>>
>>      uint32_t port_key = pb->tunnel_key;
>>
>> -    /* Table 43, priority 100.
>> +    /* Table 44, priority 100.
>>       * =======================
>>       *
>>       * Implements output to local hypervisor.  Each flow matches a
>>       * logical output port on the local hypervisor, and resubmits to
>> -     * table 44.
>> +     * table 45.
>>       */
>>
>>      ofpbuf_clear(ofpacts_p);
>> @@ -988,13 +1246,13 @@ put_local_common_flows(uint32_t dp_key,
>>
>>      put_zones_ofpacts(zone_ids, ofpacts_p);
>>
>> -    /* Resubmit to table 44. */
>> +    /* Resubmit to table 45. */
>>      put_resubmit(OFTABLE_CHECK_LOOPBACK, ofpacts_p);
>>      ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 100,
>>                      pb->header_.uuid.parts[0], &match, ofpacts_p,
>>                      &pb->header_.uuid);
>>
>> -    /* Table 44, Priority 100.
>> +    /* Table 45, Priority 100.
>>       * =======================
>>       *
>>       * Drop packets whose logical inport and outport are the same
>> @@ -1818,7 +2076,7 @@ consider_port_binding(const struct physical_ctx *ctx,
>>               * go out from the same tunnel inport. */
>>              put_load(ofp_to_u16(OFPP_NONE), MFF_IN_PORT, 0, 16, ofpacts_p);
>>
>> -            /* Resubmit to table 41. */
>> +            /* Resubmit to table 45. */
>>              put_resubmit(OFTABLE_CHECK_LOOPBACK, ofpacts_p);
>>          }
>>
>> @@ -1905,10 +2163,12 @@ consider_port_binding(const struct physical_ctx *ctx,
>>
>>      /* Determine how the port is accessed. */
>>      enum access_type access_type = PORT_LOCAL;
>> +    bool is_nf = (smap_get(&binding->options, 
>> "network-function-linked-port") ?
>> +                           true: false);
>>      if (!ofport) {
>>          /* Enforce tunneling while we clone packets to additional chassis 
>> b/c
>>           * otherwise upstream switch won't flood the packet to both 
>> chassis. */
>> -        if (localnet_port && !binding->additional_chassis) {
>> +        if (localnet_port && !binding->additional_chassis && !is_nf) {
>>              ofport = u16_to_ofp(simap_get(ctx->patch_ofports,
>>                                            localnet_port->logical_port));
>>              if (!ofport) {
>> @@ -2120,7 +2380,7 @@ consider_port_binding(const struct physical_ctx *ctx,
>>              }
>>          }
>>
>> -        /* Table 42, priority 150.
>> +        /* Table 43, priority 150.
>>           * =======================
>>           *
>>           * Handles packets received from ports of type "localport".  These
>> @@ -2138,9 +2398,23 @@ consider_port_binding(const struct physical_ctx *ctx,
>>                              binding->header_.uuid.parts[0], &match,
>>                              ofpacts_p, &binding->header_.uuid);
>>          }
>> +
>> +        /* Packets egressing from network function ports need to be sent to 
>> the
>> +         * source */
>> +        if (is_nf && localnet_port) {
>> +            put_redirect_overlay_to_source_from_nf_port(
>> +                                 binding,
>> +                                 ctx->sbrec_port_binding_by_name,
>> +                                 ctx->chassis_tunnels,
>> +                                 ctx->ct_zones,
>> +                                 ctx->mff_ovn_geneve,
>> +                                 &match,
>> +                                 ofpacts_p,
>> +                                 flow_table);
>> +        }
>>      } else if (access_type == PORT_LOCALNET && !ctx->always_tunnel) {
>>          /* Remote port connected by localnet port */
>> -        /* Table 43, priority 100.
>> +        /* Table 44, priority 100.
>>           * =======================
>>           *
>>           * Implements switching to localnet port. Each flow matches a
>> @@ -2160,7 +2434,7 @@ consider_port_binding(const struct physical_ctx *ctx,
>>
>>          put_load(localnet_port->tunnel_key, MFF_LOG_OUTPORT, 0, 32, 
>> ofpacts_p);
>>
>> -        /* Resubmit to table 43. */
>> +        /* Resubmit to table 44. */
>>          put_resubmit(OFTABLE_LOCAL_OUTPUT, ofpacts_p);
>>          ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 100,
>>                          binding->header_.uuid.parts[0],
>> @@ -2197,7 +2471,8 @@ consider_port_binding(const struct physical_ctx *ctx,
>>              &match, ofpacts_p, ctx->chassis_tunnels, flow_table);
>>      } else {
>>          put_remote_port_redirect_overlay(
>> -            binding, type, ctx, port_key, &match, ofpacts_p, flow_table);
>> +            binding, type, ctx, port_key, &match, ofpacts_p, flow_table,
>> +            is_nf);
>>      }
>>  out:
>>      if (ha_ch_ordered) {
>> @@ -2806,7 +3081,6 @@ physical_run(struct physical_ctx *p_ctx,
>>
>>          ofpbuf_clear(&ofpacts);
>>          put_decapsulation(p_ctx->mff_ovn_geneve, tun, &ofpacts);
>> -
>>          put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts);
>>          ofctrl_add_flow(flow_table, OFTABLE_PHY_TO_LOG, 100, 0, &match,
>>                          &ofpacts, hc_uuid);
>> @@ -2936,12 +3210,12 @@ physical_run(struct physical_ctx *p_ctx,
>>      ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 0, 0, 
>> &match,
>>                      &ofpacts, hc_uuid);
>>
>> -    /* Table 42, priority 150.
>> +    /* Table 43, priority 150.
>>       * =======================
>>       *
>>       * Handles packets received from a VXLAN tunnel which get resubmitted to
>>       * OFTABLE_LOG_INGRESS_PIPELINE due to lack of needed metadata in VXLAN,
>> -     * explicitly skip sending back out any tunnels and resubmit to table 40
>> +     * explicitly skip sending back out any tunnels and resubmit to table 43
>>       * for local delivery, except packets which have MLF_ALLOW_LOOPBACK bit
>>       * set.
>>       */
>> @@ -2949,13 +3223,13 @@ physical_run(struct physical_ctx *p_ctx,
>>      match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0, 
>> MLF_RCV_FROM_RAMP,
>>                           MLF_RCV_FROM_RAMP | MLF_ALLOW_LOOPBACK);
>>
>> -    /* Resubmit to table 40. */
>> +    /* Resubmit to table 44. */
>>      ofpbuf_clear(&ofpacts);
>>      put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts);
>>      ofctrl_add_flow(flow_table, OFTABLE_REMOTE_OUTPUT, 150, 0,
>>                      &match, &ofpacts, hc_uuid);
>>
>> -    /* Table 42, priority 150.
>> +    /* Table 43, priority 150.
>>       * =======================
>>       *
>>       * Packets that should not be sent to other hypervisors.
>> @@ -2963,13 +3237,13 @@ physical_run(struct physical_ctx *p_ctx,
>>      match_init_catchall(&match);
>>      match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0,
>>                           MLF_LOCAL_ONLY, MLF_LOCAL_ONLY);
>> -    /* Resubmit to table 40. */
>> +    /* Resubmit to table 44. */
>>      ofpbuf_clear(&ofpacts);
>>      put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts);
>>      ofctrl_add_flow(flow_table, OFTABLE_REMOTE_OUTPUT, 150, 0,
>>                      &match, &ofpacts, hc_uuid);
>>
>> -    /* Table 42, Priority 0.
>> +    /* Table 43, Priority 0.
>>       * =======================
>>       *
>>       * Resubmit packets that are not directed at tunnels or part of a
>> @@ -2980,24 +3254,30 @@ physical_run(struct physical_ctx *p_ctx,
>>      ofctrl_add_flow(flow_table, OFTABLE_REMOTE_OUTPUT, 0, 0, &match,
>>                      &ofpacts, hc_uuid);
>>
>> -    /* Table 40, priority 0.
>> +    /* Table 44, priority 0.
>>       * ======================
>>       *
>>       * Drop packets that do not match previous flows.
>>       */
>>      add_default_drop_flow(p_ctx, OFTABLE_LOCAL_OUTPUT, flow_table);
>>
>> -    /* Table 41, Priority 0.
>> +    /* Table 45, Priority 0.
>>       * =======================
>>       *
>>       * Resubmit packets that don't output to the ingress port (already 
>> checked
>> -     * in table 40) to the logical egress pipeline, clearing the logical
>> +     * in table 43) to the logical egress pipeline, clearing the logical
>>       * registers (for consistent behavior with packets that get tunneled). 
>> */
>>      match_init_catchall(&match);
>>      ofpbuf_clear(&ofpacts);
>>      for (int i = 0; i < MFF_N_LOG_REGS; i++) {
>> -        put_load(0, MFF_REG0 + i, 0, 32, &ofpacts);
>> +        if ((MFF_REG0 + i) != MFF_LOG_TUN_OFPORT) {
>> +            put_load(0, MFF_REG0 + i, 0, 32, &ofpacts);
>> +        }
>>      }
>> +    /* In MFF_LOG_TUN_OFPORT, the bits 16..31 are used to store geneve
> 
> Why do we restrict this only to geneve?  Wouldn't this work for VXLAN
> too out of the box?
> 
> [Sragdhara]  Yes, we can make it work for VXLAN as well. That will need some 
> more changes. Is it okay to address it in a future patch, after this one is 
> merged?
> 

Sure, perfectly fine to do this as a follow up, let's not forget to add
a TODO.rst item for this though.  I'll leave a comment about that in the
v8 patch so we don't forget.

> 
>> +     * tunnel id of received packets and these need to be carried over to
>> +     * the egress pipeline. The remaining bits can be reset to zero. */
>> +    put_load(0, MFF_LOG_TUN_OFPORT, 0, 16, &ofpacts);
>>      put_resubmit(OFTABLE_LOG_EGRESS_PIPELINE, &ofpacts);
>>      ofctrl_add_flow(flow_table, OFTABLE_CHECK_LOOPBACK, 0, 0, &match,
>>                      &ofpacts, hc_uuid);
>> diff --git a/include/ovn/logical-fields.h b/include/ovn/logical-fields.h
>> index 98e134e0f..47775192d 100644
>> --- a/include/ovn/logical-fields.h
>> +++ b/include/ovn/logical-fields.h
>> @@ -42,6 +42,7 @@ enum ovn_controller_event {
>>                                         * (16..31 of the 32 bits). */
>>  #define MFF_LOG_INPORT     MFF_REG14  /* Logical input port (32 bits). */
>>  #define MFF_LOG_OUTPORT    MFF_REG15  /* Logical output port (32 bits). */
>> +#define MFF_LOG_TUN_OFPORT MFF_REG5   /* 16..31 of the 32 bits */
>>
>>  /* Logical registers.
>>   *
>> @@ -101,6 +102,7 @@ enum mff_log_flags_bits {
>>      MLF_UNSNAT_NEW_BIT = 20,
>>      MLF_UNSNAT_NOT_TRACKED_BIT = 21,
>>      MLF_IGMP_IGMP_SNOOP_INJECT_BIT = 22,
>> +    MLF_RECIRC_BIT = 23,
>>      MLF_NETWORK_ID_START_BIT = 28,
>>      MLF_NETWORK_ID_END_BIT = 31,
>>  };
>> @@ -170,6 +172,9 @@ enum mff_log_flags {
>>      /* Indicate that this is an IGMP packet reinjected by ovn-controller. */
>>      MLF_IGMP_IGMP_SNOOP = (1 << MLF_IGMP_IGMP_SNOOP_INJECT_BIT),
>>
>> +    /* Indicate the packet has been processed by LOCAL table once before. */
>> +    MLF_RECIRC = (1 << MLF_RECIRC_BIT),
>> +
>>      /* Assign network ID to packet to choose correct network for snat when
>>       * lb_force_snat_ip=router_ip. */
>>      MLF_NETWORK_ID = (OVN_MAX_NETWORK_ID << MLF_NETWORK_ID_START_BIT),
>> @@ -237,15 +242,19 @@ const struct ovn_field *ovn_field_from_name(const char 
>> *name);
>>  #define OVN_CT_OBS_STAGE_END_BIT 5
>>  #define OVN_CT_ALLOW_ESTABLISHED_BIT 6
>>  #define OVN_CT_NETWORK_FUNCTION_GROUP_BIT 7
>> +#define OVN_CT_TUN_IF_BIT 8
>>
>>  #define OVN_CT_BLOCKED 1
>>  #define OVN_CT_NATTED  2
>>  #define OVN_CT_LB_SKIP_SNAT 4
>>  #define OVN_CT_LB_FORCE_SNAT 8
>>  #define OVN_CT_NETWORK_FUNCTION_GROUP 128
>> +#define OVN_CT_TUN_IF 256
>>
>>  #define OVN_CT_NETWORK_FUNCTION_GROUP_ID_1ST_BIT 17
>>  #define OVN_CT_NETWORK_FUNCTION_GROUP_ID_END_BIT 24
>> +#define OVN_CT_TUN_IF_1ST_BIT 80
>> +#define OVN_CT_TUN_IF_END_BIT 95
>>
>>  #define OVN_CT_ECMP_ETH_1ST_BIT 32
>>  #define OVN_CT_ECMP_ETH_END_BIT 79
>> diff --git a/lib/logical-fields.c b/lib/logical-fields.c
>> index 681a6dd50..15876a64f 100644
>> --- a/lib/logical-fields.c
>> +++ b/lib/logical-fields.c
>> @@ -228,6 +228,16 @@ ovn_init_symtab(struct shash *symtab)
>>                                      
>> OVN_CT_NETWORK_FUNCTION_GROUP_ID_END_BIT)
>>                                      "]",
>>                                      WR_CT_COMMIT);
>> +    expr_symtab_add_subfield_scoped(symtab, "ct_label.tun_if", NULL,
>> +                                    "ct_label["
>> +                                    OVN_CT_STR(OVN_CT_TUN_IF_BIT)
>> +                                    "]",
>> +                                    WR_CT_COMMIT);
>> +    expr_symtab_add_subfield_scoped(symtab, "ct_label.tun_if_id", NULL,
>> +                                    "ct_label["
>> +                                    OVN_CT_STR(OVN_CT_TUN_IF_1ST_BIT) ".."
>> +                                    OVN_CT_STR(OVN_CT_TUN_IF_END_BIT) "]",
>> +                                    WR_CT_COMMIT);
>>
>>      expr_symtab_add_field(symtab, "ct_state", MFF_CT_STATE, NULL, false);
>>
>> diff --git a/northd/northd.c b/northd/northd.c
>> index 11498217f..382182ca6 100644
>> --- a/northd/northd.c
>> +++ b/northd/northd.c
>> @@ -240,6 +240,10 @@ static const char *reg_ct_state[] = {
>>  #undef CS_STATE
>>  };
>>
>> +/* Register used for storing tunnel openflow interface id, in a Logical 
>> Switch.
>> + * Must match the MFF_LOG_TUN_OFPORT in logical-fields.h */
>> +#define REG_TUN_OFPORT "reg5[16..31]"
> 
> As mentioned in patch 3/5, this will overlap with XXREG1 (REG_LB_IPV6).
> Wouldn't that be a problem?
> 
> Also, please update the "OVS register usage:" table below.
> 
> 
> 
> [Sragdhara] Yes, this will overlap. Moved to reg0 available bits like you 
> suggested. Also, LB is not tested along with NF. Theoretically, there might 
> be problems for NF with from-lport ACL, in combination with LB, due to 
> asymmetric traffic path. We thought we would take up LB case in a separate 
> patch. Have added in the documentation ovn-nb.xml as well as in NEWs 
> mentioning LB.

Ack, thanks for the documentation, I'll check v8 closely.

> 
>> +
>>  /* Register used for temporarily store ECMP eth.src to avoid masked ct_label
>>   * access. It doesn't really occupy registers because the content of the
>>   * register is saved to stack and then restored in the same flow.
>> @@ -17468,6 +17472,48 @@ network_function_get_active(const struct 
>> nbrec_network_function_group *nfg)
>>      return nfg->n_network_function ? nfg->network_function[0] : NULL;
>>  }
>>
>> +/* For packets received on tunnel and egressing towards a network-function 
>> port
>> + * commit the tunnel interface id in CT. This will be utilized when the 
>> packet
>> + * comes out of the other network-function interface of the service VM. The
>> + * packet then will be tunneled back to the source host. */
>> +static void
>> +build_lswitch_stateful_nf(struct ovn_port *op,
>> +                          struct lflow_table *lflows,
>> +                          struct ds *actions, struct ds *match,
>> +                          struct lflow_ref *lflow_ref)
>> +{
>> +    ds_clear(actions);
>> +    ds_clear(match);
>> +
>> +    ds_put_cstr(actions,
>> +                 "ct_commit { "
>> +                    "ct_mark.blocked = 0; "
>> +                    "ct_mark.allow_established = " REGBIT_ACL_PERSIST_ID "; 
>> "
>> +                    "ct_label.acl_id = " REG_ACL_ID "; "
>> +                    "ct_label.tun_if_id = " REG_TUN_OFPORT "; }; next;");
>> +    ds_put_format(match,
>> +                  "outport == %s && " REGBIT_ACL_LABEL" == 0", 
>> op->json_key);
>> +    ovn_lflow_add(lflows, op->od, S_SWITCH_OUT_STATEFUL, 120,
>> +                  ds_cstr(match), ds_cstr(actions), lflow_ref);
>> +
>> +    ds_clear(actions);
>> +    ds_clear(match);
>> +    ds_put_format(match,
>> +                  "outport == %s && " REGBIT_ACL_LABEL" == 1",
>> +                  op->json_key);
>> +    ds_put_cstr(actions,
>> +                 "ct_commit { "
>> +                    "ct_mark.blocked = 0; "
>> +                    "ct_mark.allow_established = " REGBIT_ACL_PERSIST_ID "; 
>> "
>> +                    "ct_label.acl_id = " REG_ACL_ID "; "
>> +                    "ct_mark.obs_stage = " REGBIT_ACL_OBS_STAGE "; "
>> +                    "ct_mark.obs_collector_id = " REG_OBS_COLLECTOR_ID_EST 
>> "; "
>> +                    "ct_label.obs_point_id = " REG_OBS_POINT_ID_EST "; "
>> +                    "ct_label.tun_if_id = " REG_TUN_OFPORT "; }; next;");
> 
> I'm afraid using ct_label for all the Network Function features will
> make it such that traffic is not offloadable (with Nvidia NICs) if both
> ACL Sampling and Network Functions are used.
> 
> 
> [Sragdhara] Was not aware. Thanks for pointing it out. If I understand it 
> right, you are saying that since sampling feature is using ct_mark and NF is 
> using ct_label, these two cannot be used together with HW offload. Any 
> pointers to understand this issue better? Let me know if you have any 
> suggestions Also, is it okay to take up the changes related to HW offload, 
> later in a separate patch?

On some specific hardware (AFAIK on Nvidia NICs that support conntrack
hardware offload) some types of ovs matches are not supported.  Among
those masked ct_label access.  That's why for ACL sampling we had to
make sure we match on the complete contents of the CT label, please see
below:

build_acl_sample_label_match(...)
{
...
    ds_put_format(match, "ct_label.obs_point_id == %"PRIu32" && "
                         "ct_label.obs_unused == 0", point_id);
...
}

From my perspective it's probably fine to address HW offload in the
future.  We should however call this out in the documentation and maybe
add a TODO.rst item for it too.

Regards,
Dumitru

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to