On 8/26/25 6:45 AM, Sragdhara Datta Chaudhuri wrote: > Hi Dumitru, >
Hi Sragdhara, > Thanks for your comments. Most are taken care of in v8. Please see inline for > responses. > > Thanks, > Sragdhara > > From: Dumitru Ceara <dce...@redhat.com> > Date: Thursday, August 21, 2025 at 7:46 AM > To: Sragdhara Datta Chaudhuri <sragdha.chau...@nutanix.com>, > ovs-dev@openvswitch.org <ovs-dev@openvswitch.org> > Cc: Numan Siddique <num...@ovn.org> > Subject: Re: [ovs-dev] [PATCH OVN v6 4/5] controller, tests: Network Function > insertion tunneling of cross-host VLAN traffic. > !-------------------------------------------------------------------| > CAUTION: External Email > > |-------------------------------------------------------------------! > > Hi Sragdhara, > > Thanks for the patch! > > On 8/20/25 3:25 AM, Sragdhara Datta Chaudhuri wrote: >> For overlay subnets, all cross-host traffic exchanges are tunneled. For VLAN >> subnets, we need to selectively tunnel traffic sent to or coming from the NF >> ports. Consider a from-lport ACL applied to port p1 on host1. The NF ports >> nfp1 >> and nfp2 are on host2. A new option in LSP allows the NF ports to be linked. >> The “network-function-linked-port” in nfp1 is to be set to nfp2 and vice >> versa. > > This option needs to be documented in ovn-nb.xml. > >> The ingress pipeline on host1 sets the outport to nfp1 and the packet is then >> processed by table 42. >> >> On host1 >> -------- >> REMOTE_OUTPUT (table 43): >> It tunnels traffic destined to all non-local overlay ports to their >> associated >> hosts. The Same rule is now also added for traffic to non-local NF ports. >> Thus >> the packets from p1 get tunneled to host 2. >> >> Upon reaching host2 >> ------------------- >> PHY_TO_LOG (table 0): >> Existing priority 100 rule: for each geneve tunnel interface on the chassis, >> copy info from header to inport, outport, metadata registers. Now same rule >> also stores the tun intf id in a register (reg5[16..31]). >> >> CHECK_LOOPBACK (table 45) >> This table has a rule that clears all the registers. The change is to skip >> the >> clearing of reg5[16..31]. >> >> Logical egress pipeline: >> ls_out_stateful priority 120: If the outport is NF port, copy reg5[16..31] >> (table0 had set it) to ct_label.tun_if_id. >> >> LOCAL_OUTPUT (table 44) >> When the packet comes out of the other NF port (nfp2), following two rules >> send >> it back to the host that it originally came from: >> Priority 110: For each NF port local to this host, following rule processes >> the >> packet through CT of linked port: >> match: inport==nfp2 && RECIRC_BIT==0 >> action: RECIRC_BIT = 1, ct(zone=nfp1’s zone, table=LOCAL), resubmit table >> 43 >> >> Priority 109: For each local {tunnel_id, nf port}, send the recirculated >> packet >> using tun_if_id in ct zone: >> match: inport==nfp1 && RECIRC_BIT==1 && && ct_label.tun_if_id==<tun-id> >> action: tunnel packet using tun-id >> >> Case where NF responds back on nfp1, instead of forwarding to nfp2 >> ------------------------------------------------------------------ >> For example, a SYN packet from p1 got redirected to nfp1. Then the NF, which >> is >> a firewall VM, drops the SYN and sends RST back on port nfp1. In this case, >> looking up in linked port (nfp2) ct zone will not give anything. The >> following >> rule uses ct.inv to identify such scenario and uses nfp1’s CT zone to send >> the >> packet back. To achieve this, following 2 rules are installed: >> >> in_network_function: >> Priority 100 rule that allows packets incoming from NF type ports, is >> enhanced >> with additional action to store the tun_if_id from ct_label into >> reg5[16..31]. >> >> LOCAL_OUTPUT (table 44) >> Priority 110 rule: for recirculated packets, if ct (of the linked port) is >> invalid, use the tun id from MFF_LOG_TUN_OFPORT to tunnel the packet back (as >> CT zone info has been overwritten in the above 110 priority rule in table >> 42). >> match: inport==nfp1 && RECIRC_BIT==1 && ct.inv && reg5[16..31]==<tun-id> >> action: tunnel packet using tun-id >> >> Signed-off-by: Sragdhara Datta Chaudhuri <sragdha.chau...@nutanix.com> >> Acked-by: Naveen Yerramneni <naveen.yerramn...@nutanix.com> >> --- >> controller/physical.c | 324 ++++++++++++++++++++++++-- >> include/ovn/logical-fields.h | 9 + >> lib/logical-fields.c | 10 + >> northd/northd.c | 61 ++++- >> tests/ovn-controller.at | 4 +- >> tests/ovn-northd.at | 32 ++- >> tests/ovn.at | 434 +++++++++++++++++++++++++++-------- >> 7 files changed, 743 insertions(+), 131 deletions(-) >> >> diff --git a/controller/physical.c b/controller/physical.c >> index c5f74010a..6ebc880b0 100644 >> --- a/controller/physical.c >> +++ b/controller/physical.c >> @@ -173,6 +173,8 @@ put_decapsulation(enum mf_field_id mff_ovn_geneve, >> put_move(MFF_TUN_ID, 0, MFF_LOG_DATAPATH, 0, 24, ofpacts); >> put_move(mff_ovn_geneve, 16, MFF_LOG_INPORT, 0, 15, ofpacts); >> put_move(mff_ovn_geneve, 0, MFF_LOG_OUTPORT, 0, 16, ofpacts); >> + put_load(ofp_to_u16(tun->ofport), MFF_LOG_TUN_OFPORT, >> + 16, 16, ofpacts); >> } else if (tun->type == VXLAN) { >> /* Add flows for non-VTEP tunnels. Split VNI into two 12-bit >> * sections and use them for datapath and outport IDs. */ >> @@ -385,6 +387,15 @@ match_outport_dp_and_port_keys(struct match *match, >> match_set_reg(match, MFF_LOG_OUTPORT - MFF_REG0, port_key); >> } >> >> +static void >> +match_inport_dp_and_port_keys(struct match *match, >> + uint32_t dp_key, uint32_t port_key) >> +{ >> + match_init_catchall(match); >> + match_set_metadata(match, htonll(dp_key)); >> + match_set_reg(match, MFF_LOG_INPORT - MFF_REG0, port_key); >> +} >> + >> static struct sbrec_encap * >> find_additional_encap_for_chassis(const struct sbrec_port_binding *pb, >> const struct sbrec_chassis *chassis_rec) >> @@ -450,7 +461,8 @@ put_remote_port_redirect_overlay(const struct >> sbrec_port_binding *binding, >> uint32_t port_key, >> struct match *match, >> struct ofpbuf *ofpacts_p, >> - struct ovn_desired_flow_table *flow_table) >> + struct ovn_desired_flow_table *flow_table, >> + bool allow_hairpin) >> { >> /* Setup encapsulation */ >> for (size_t i = 0; i < ctx->n_encap_ips; i++) { >> @@ -469,6 +481,14 @@ put_remote_port_redirect_overlay(const struct >> sbrec_port_binding *binding, >> ofpacts_clone); >> } >> >> + /* Clear the MFF_INPORT if the same packet may need to go out >> from >> + * the same tunnel inport. */ >> + if (allow_hairpin) { >> + put_stack(MFF_IN_PORT, >> ofpact_put_STACK_PUSH(ofpacts_clone)); >> + put_load(ofp_to_u16(OFPP_NONE), MFF_IN_PORT, 0, 16, >> + ofpacts_clone); >> + } >> + >> const struct chassis_tunnel *tun; >> VECTOR_FOR_EACH (&tuns, tun) { >> put_encapsulation(ctx->mff_ovn_geneve, tun, >> binding->datapath, >> @@ -476,6 +496,11 @@ put_remote_port_redirect_overlay(const struct >> sbrec_port_binding *binding, >> ofpact_put_OUTPUT(ofpacts_clone)->port = tun->ofport; >> } >> put_resubmit(OFTABLE_LOCAL_OUTPUT, ofpacts_clone); >> + >> + if (allow_hairpin) { >> + put_stack(MFF_IN_PORT, ofpact_put_STACK_POP(ofpacts_clone)); >> + } >> + >> ofctrl_add_flow(flow_table, OFTABLE_REMOTE_OUTPUT, 100, >> binding->header_.uuid.parts[0], match, >> ofpacts_clone, &binding->header_.uuid); >> @@ -485,6 +510,239 @@ put_remote_port_redirect_overlay(const struct >> sbrec_port_binding *binding, >> } >> } >> >> +static const struct sbrec_port_binding * >> +get_binding_network_function_linked_port( >> + struct ovsdb_idl_index *sbrec_port_binding_by_name, >> + const struct sbrec_port_binding *binding) >> +{ >> + const char *nf_linked_name = smap_get(&binding->options, >> + "network-function-linked-port"); >> + if (!nf_linked_name) { >> + return NULL; >> + } >> + VLOG_DBG("get NF linked port_binding %s:%s", >> + binding->logical_port, nf_linked_name); >> + const struct sbrec_port_binding *nf_linked_port = lport_lookup_by_name( >> + sbrec_port_binding_by_name, nf_linked_name); >> + static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1); >> + if (!nf_linked_port) { >> + VLOG_INFO("Binding not found for network-function-linked-port" >> + " %s", nf_linked_name); > > This will spam ovn-controller's logs. We should probably rate limit it. > Also, should it be a WARN? > [Sragdhara] Added rate limit. Even under normal scenarios, this log might be > seen transiently. That’s why didn’t make these warning. That causes unit > tests to fail. Same for other places. If needed we can ignore these logs in the test, we do that for other similar logs. But VLOG_INFO_RL is OK for me too. > >> + return NULL; >> + } >> + if (strcmp(nf_linked_port->type, binding->type)) { >> + VLOG_ERR_RL(&rl, "Binding type mismatch between %s and " >> + "network-function-linked-port %s", >> + binding->logical_port, nf_linked_name); >> + return NULL; >> + } >> + const char *nf_linked_linked_name = smap_get( >> + &nf_linked_port->options, "network-function-linked-port"); >> + if (!nf_linked_linked_name || strcmp(nf_linked_linked_name, >> + binding->logical_port)) { >> + VLOG_INFO("LSP name %s does not match linked_linked_name", >> + binding->logical_port); > > Same comment here. > >> + return NULL; >> + } >> + >> + return nf_linked_port; >> +} >> + >> +static void >> +send_traffic_by_tunnel( >> + const struct sbrec_port_binding *binding, >> + struct match *match, >> + struct ofpbuf *ofpacts_p, >> + uint32_t dp_key, >> + uint32_t port_key, >> + struct chassis_tunnel *tun, >> + enum mf_field_id mff_ovn_geneve, >> + struct ovn_desired_flow_table *flow_table) > > The indentation doesn't really align with most of the existing code. I > think I'd write it as: > > static void > send_traffic_by_tunnel(const struct sbrec_port_binding *binding, > struct match *match, struct ofpbuf *ofpacts_p, > uint32_t dp_key, uint32_t port_key, > struct chassis_tunnel *tun, > enum mf_field_id mff_ovn_geneve, > struct ovn_desired_flow_table *flow_table) > >> +{ >> + match_init_catchall(match); >> + ofpbuf_clear(ofpacts_p); >> + >> + match_inport_dp_and_port_keys(match, dp_key, port_key); >> + match_set_reg_masked(match, MFF_LOG_FLAGS - MFF_REG0, MLF_RECIRC, >> + MLF_RECIRC); >> + ovs_u128 of_tun_ct_label_id_val = { >> + .u64.hi = ((uint32_t) ofp_to_u16(tun->ofport)) << 16, >> + }; >> + ovs_u128 of_tun_ct_label_id_mask = { >> + .u64.hi = 0x00000000ffff0000, >> + }; >> + >> + match_set_ct_label_masked(match, of_tun_ct_label_id_val, >> + of_tun_ct_label_id_mask); >> + >> + put_load(binding->datapath->tunnel_key, MFF_TUN_ID, 0, 24, ofpacts_p); >> + put_move(MFF_LOG_OUTPORT, 0, mff_ovn_geneve, 0, 32, ofpacts_p); >> + put_load(port_key, mff_ovn_geneve, 16, 15, ofpacts_p); >> + >> + ofpact_put_OUTPUT(ofpacts_p)->port = tun->ofport; >> + ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 109, >> + binding->header_.uuid.parts[0], match, >> + ofpacts_p, &binding->header_.uuid); >> +} >> + >> + >> +static void >> +put_redirect_overlay_to_source( >> + const struct sbrec_port_binding *binding, >> + int linked_ct, >> + const struct hmap *chassis_tunnels, >> + enum mf_field_id mff_ovn_geneve, >> + struct match *match, >> + struct ofpbuf *ofpacts_p, >> + struct ovn_desired_flow_table *flow_table) > > Same comment about indentation here too. > >> +{ >> + uint32_t dp_key = binding->datapath->tunnel_key; >> + uint32_t port_key = binding->tunnel_key; >> + >> + /* Say, a network function has ports nf1 and nf2. The source port p1 is >> on >> + * a different host. The packet redirected from p1 was tunneled to the >> NF >> + * host. In PHY_TO_LOG table the tunnel interface id is stored in >> + * MFF_LOG_TUN_OFPORT. The egress pipeline then commits it into ct_label >> + * tun_if_id in nf1's zone (out_stateful priority 120 rule). When the >> same >> + * packet comes out from nf2, two rules process it: >> + * first rule sets recirc bit to 1 and processes the packet through >> nf1's >> + * ct zone and resubmits to same table. When the recirculated packet >> comes >> + * back, the second rule (which checks recirc bit == 1) uses the >> tun_if_id >> + * from ct_label to send the packet back to p1's host. >> + */ >> + >> + > > Nit: one empty line too many. > >> + /* Table 44 (LOCAL_OUTPUT), priority 110 >> + * ===================================== >> + * >> + * Each flow matches a logical inport to a nf port and checks if >> + * recirc bit is 0 (i.e. packet first time being processed by this >> table). >> + * The action processes the packet through ct zone of the linked nf port >> + * and resubmits to the same table after setting recirc bit to 1. >> + * match: inport == svc-port[i] && MLF_RECIRC_BIT = 0 >> + * action: MLF_RECIRC_BIT = 1, ct(zone=linked-zone[i], table=LOCAL) >> + */ >> + match_init_catchall(match); >> + ofpbuf_clear(ofpacts_p); >> + match_inport_dp_and_port_keys(match, dp_key, port_key); >> + match_set_dl_type(match, htons(ETH_TYPE_IP)); >> + match_set_reg_masked(match, MFF_LOG_FLAGS - MFF_REG0, 0, MLF_RECIRC); >> + >> + put_load(1, MFF_LOG_FLAGS, MLF_RECIRC_BIT, 1, ofpacts_p); >> + put_load(linked_ct, MFF_LOG_CT_ZONE, 0, 16, ofpacts_p); >> + >> + struct ofpact_conntrack *ct = ofpact_put_CT(ofpacts_p); >> + ct->recirc_table = OFTABLE_LOCAL_OUTPUT; >> + ct->zone_src.field = mf_from_id(MFF_LOG_CT_ZONE); >> + ct->zone_src.ofs = 0; >> + ct->zone_src.n_bits = 16; >> + ct->flags = 0; >> + ct->alg = 0; >> + ofpact_finish(ofpacts_p, &ct->ofpact); >> + >> + ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 110, >> + binding->header_.uuid.parts[0], match, >> + ofpacts_p, &binding->header_.uuid); >> + >> + /* Table 44 (LOCAL_OUTPUT), priority 110 >> + * In case NF is sending back a response on the port it received the >> + * packet on, instead of forwarding out of the other port (e.g. NF >> sending >> + * RST to the SYN received), the ct lookup in linked port's zone would >> + * fail. Based on ct.inv check the packet is then tunneled back using >> + * the tunnel id from this port's zone itself. The above rule has >> + * overwritten the zone info by now, so we recover it from the register >> + * that was populated by in_network_function stage with the tunnel id. >> + * match: inport == svc-port[i] && MLF_RECIRC_BIT = 1 >> + * && ct.inv && MFF_LOG_TUN_OFPORT == <tun-id> >> + * action: tunnel back using above tun-id >> + */ >> + struct chassis_tunnel *tun; >> + HMAP_FOR_EACH (tun, hmap_node, chassis_tunnels) { >> + match_init_catchall(match); >> + ofpbuf_clear(ofpacts_p); >> + match_inport_dp_and_port_keys(match, dp_key, port_key); >> + match_set_reg_masked(match, MFF_LOG_FLAGS - MFF_REG0, MLF_RECIRC, >> + MLF_RECIRC); >> + match_set_ct_state_masked(match, OVS_CS_F_INVALID, >> OVS_CS_F_INVALID); >> + match_set_reg_masked(match, MFF_LOG_TUN_OFPORT - MFF_REG0, >> + ((uint32_t) ofp_to_u16(tun->ofport)) << 16, >> + ((uint32_t) 0xffff) << 16); >> + put_load(binding->datapath->tunnel_key, MFF_TUN_ID, 0, 24, >> ofpacts_p); >> + put_move(MFF_LOG_OUTPORT, 0, mff_ovn_geneve, 0, 32, ofpacts_p); >> + put_load(port_key, mff_ovn_geneve, 16, 15, ofpacts_p); >> + >> + ofpact_put_OUTPUT(ofpacts_p)->port = tun->ofport; >> + ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 110, >> + binding->header_.uuid.parts[0], match, >> + ofpacts_p, &binding->header_.uuid); >> + } >> + >> + /* Table 44 (LOCAL_OUTPUT), priority 109 >> + * ===================================== >> + * >> + * A flow is installed For each {remote tunnel_id, nf port} >> combination. It >> + * matches the inport with the nf port and the ct_label.tun_if_id with >> the >> + * tunnel_id. Also checks if the recirc bit is 1 (i.e. packet being >> + * processed by this table second time). The action is to send the >> packet >> + * out using the tunnel interface. >> + * match: inport == svc-port[i] && MLF_RECIRC_BIT = 1 >> + * && ct_label.tun_if_id == <tun-id> >> + * action: tunnel back using tun-id >> + */ >> + HMAP_FOR_EACH (tun, hmap_node, chassis_tunnels) { >> + send_traffic_by_tunnel(binding, match, ofpacts_p, dp_key, port_key, >> + tun, mff_ovn_geneve, flow_table); >> + } >> + ofpbuf_clear(ofpacts_p); >> +} >> + >> +static void >> +put_redirect_overlay_to_source_from_nf_port( >> + const struct sbrec_port_binding *binding, >> + struct ovsdb_idl_index *sbrec_port_binding_by_name, >> + const struct hmap *chassis_tunnels, >> + const struct shash *ct_zones, >> + enum mf_field_id mff_ovn_geneve, >> + struct match *match, >> + struct ofpbuf *ofpacts_p, >> + struct ovn_desired_flow_table *flow_table) > > Nit: indentation. > >> +{ >> + const struct sbrec_port_binding *linked_pb; >> + linked_pb = get_binding_network_function_linked_port( >> + sbrec_port_binding_by_name, binding); >> + if (!linked_pb) { >> + VLOG_INFO("Linked port not found for %s", binding->logical_port); > > This needs to be rate limited and maybe a warning. > >> + return; >> + } >> + struct zone_ids zone = get_zone_ids(binding, ct_zones); >> + if (!zone.ct) { >> + VLOG_INFO("Port zone not found for %s", binding->logical_port); > > This needs to be rate limited and maybe a warning > > >> + return; >> + } >> + struct zone_ids linked_zone = get_zone_ids(linked_pb, ct_zones); >> + if (!linked_zone.ct) { >> + VLOG_INFO("Linked port zone not found for %s", >> binding->logical_port); > > This needs to be rate limited and maybe a warning. > > >> + return; >> + } >> + VLOG_INFO("Both port zones found for NF port %s", >> binding->logical_port); > > Should this be VLOG_DBG instead? > >> + put_redirect_overlay_to_source( >> + binding, >> + linked_zone.ct, >> + chassis_tunnels, >> + mff_ovn_geneve, >> + match, >> + ofpacts_p, >> + flow_table); >> + put_redirect_overlay_to_source( >> + linked_pb, >> + zone.ct, >> + chassis_tunnels, >> + mff_ovn_geneve, >> + match, >> + ofpacts_p, >> + flow_table); >> +} >> + >> static void >> put_remote_port_redirect_overlay_ha_remote( >> const struct sbrec_port_binding *binding, >> @@ -973,12 +1231,12 @@ put_local_common_flows(uint32_t dp_key, >> >> uint32_t port_key = pb->tunnel_key; >> >> - /* Table 43, priority 100. >> + /* Table 44, priority 100. >> * ======================= >> * >> * Implements output to local hypervisor. Each flow matches a >> * logical output port on the local hypervisor, and resubmits to >> - * table 44. >> + * table 45. >> */ >> >> ofpbuf_clear(ofpacts_p); >> @@ -988,13 +1246,13 @@ put_local_common_flows(uint32_t dp_key, >> >> put_zones_ofpacts(zone_ids, ofpacts_p); >> >> - /* Resubmit to table 44. */ >> + /* Resubmit to table 45. */ >> put_resubmit(OFTABLE_CHECK_LOOPBACK, ofpacts_p); >> ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 100, >> pb->header_.uuid.parts[0], &match, ofpacts_p, >> &pb->header_.uuid); >> >> - /* Table 44, Priority 100. >> + /* Table 45, Priority 100. >> * ======================= >> * >> * Drop packets whose logical inport and outport are the same >> @@ -1818,7 +2076,7 @@ consider_port_binding(const struct physical_ctx *ctx, >> * go out from the same tunnel inport. */ >> put_load(ofp_to_u16(OFPP_NONE), MFF_IN_PORT, 0, 16, ofpacts_p); >> >> - /* Resubmit to table 41. */ >> + /* Resubmit to table 45. */ >> put_resubmit(OFTABLE_CHECK_LOOPBACK, ofpacts_p); >> } >> >> @@ -1905,10 +2163,12 @@ consider_port_binding(const struct physical_ctx *ctx, >> >> /* Determine how the port is accessed. */ >> enum access_type access_type = PORT_LOCAL; >> + bool is_nf = (smap_get(&binding->options, >> "network-function-linked-port") ? >> + true: false); >> if (!ofport) { >> /* Enforce tunneling while we clone packets to additional chassis >> b/c >> * otherwise upstream switch won't flood the packet to both >> chassis. */ >> - if (localnet_port && !binding->additional_chassis) { >> + if (localnet_port && !binding->additional_chassis && !is_nf) { >> ofport = u16_to_ofp(simap_get(ctx->patch_ofports, >> localnet_port->logical_port)); >> if (!ofport) { >> @@ -2120,7 +2380,7 @@ consider_port_binding(const struct physical_ctx *ctx, >> } >> } >> >> - /* Table 42, priority 150. >> + /* Table 43, priority 150. >> * ======================= >> * >> * Handles packets received from ports of type "localport". These >> @@ -2138,9 +2398,23 @@ consider_port_binding(const struct physical_ctx *ctx, >> binding->header_.uuid.parts[0], &match, >> ofpacts_p, &binding->header_.uuid); >> } >> + >> + /* Packets egressing from network function ports need to be sent to >> the >> + * source */ >> + if (is_nf && localnet_port) { >> + put_redirect_overlay_to_source_from_nf_port( >> + binding, >> + ctx->sbrec_port_binding_by_name, >> + ctx->chassis_tunnels, >> + ctx->ct_zones, >> + ctx->mff_ovn_geneve, >> + &match, >> + ofpacts_p, >> + flow_table); >> + } >> } else if (access_type == PORT_LOCALNET && !ctx->always_tunnel) { >> /* Remote port connected by localnet port */ >> - /* Table 43, priority 100. >> + /* Table 44, priority 100. >> * ======================= >> * >> * Implements switching to localnet port. Each flow matches a >> @@ -2160,7 +2434,7 @@ consider_port_binding(const struct physical_ctx *ctx, >> >> put_load(localnet_port->tunnel_key, MFF_LOG_OUTPORT, 0, 32, >> ofpacts_p); >> >> - /* Resubmit to table 43. */ >> + /* Resubmit to table 44. */ >> put_resubmit(OFTABLE_LOCAL_OUTPUT, ofpacts_p); >> ofctrl_add_flow(flow_table, OFTABLE_LOCAL_OUTPUT, 100, >> binding->header_.uuid.parts[0], >> @@ -2197,7 +2471,8 @@ consider_port_binding(const struct physical_ctx *ctx, >> &match, ofpacts_p, ctx->chassis_tunnels, flow_table); >> } else { >> put_remote_port_redirect_overlay( >> - binding, type, ctx, port_key, &match, ofpacts_p, flow_table); >> + binding, type, ctx, port_key, &match, ofpacts_p, flow_table, >> + is_nf); >> } >> out: >> if (ha_ch_ordered) { >> @@ -2806,7 +3081,6 @@ physical_run(struct physical_ctx *p_ctx, >> >> ofpbuf_clear(&ofpacts); >> put_decapsulation(p_ctx->mff_ovn_geneve, tun, &ofpacts); >> - >> put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts); >> ofctrl_add_flow(flow_table, OFTABLE_PHY_TO_LOG, 100, 0, &match, >> &ofpacts, hc_uuid); >> @@ -2936,12 +3210,12 @@ physical_run(struct physical_ctx *p_ctx, >> ofctrl_add_flow(flow_table, OFTABLE_OUTPUT_LARGE_PKT_PROCESS, 0, 0, >> &match, >> &ofpacts, hc_uuid); >> >> - /* Table 42, priority 150. >> + /* Table 43, priority 150. >> * ======================= >> * >> * Handles packets received from a VXLAN tunnel which get resubmitted to >> * OFTABLE_LOG_INGRESS_PIPELINE due to lack of needed metadata in VXLAN, >> - * explicitly skip sending back out any tunnels and resubmit to table 40 >> + * explicitly skip sending back out any tunnels and resubmit to table 43 >> * for local delivery, except packets which have MLF_ALLOW_LOOPBACK bit >> * set. >> */ >> @@ -2949,13 +3223,13 @@ physical_run(struct physical_ctx *p_ctx, >> match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0, >> MLF_RCV_FROM_RAMP, >> MLF_RCV_FROM_RAMP | MLF_ALLOW_LOOPBACK); >> >> - /* Resubmit to table 40. */ >> + /* Resubmit to table 44. */ >> ofpbuf_clear(&ofpacts); >> put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts); >> ofctrl_add_flow(flow_table, OFTABLE_REMOTE_OUTPUT, 150, 0, >> &match, &ofpacts, hc_uuid); >> >> - /* Table 42, priority 150. >> + /* Table 43, priority 150. >> * ======================= >> * >> * Packets that should not be sent to other hypervisors. >> @@ -2963,13 +3237,13 @@ physical_run(struct physical_ctx *p_ctx, >> match_init_catchall(&match); >> match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0, >> MLF_LOCAL_ONLY, MLF_LOCAL_ONLY); >> - /* Resubmit to table 40. */ >> + /* Resubmit to table 44. */ >> ofpbuf_clear(&ofpacts); >> put_resubmit(OFTABLE_LOCAL_OUTPUT, &ofpacts); >> ofctrl_add_flow(flow_table, OFTABLE_REMOTE_OUTPUT, 150, 0, >> &match, &ofpacts, hc_uuid); >> >> - /* Table 42, Priority 0. >> + /* Table 43, Priority 0. >> * ======================= >> * >> * Resubmit packets that are not directed at tunnels or part of a >> @@ -2980,24 +3254,30 @@ physical_run(struct physical_ctx *p_ctx, >> ofctrl_add_flow(flow_table, OFTABLE_REMOTE_OUTPUT, 0, 0, &match, >> &ofpacts, hc_uuid); >> >> - /* Table 40, priority 0. >> + /* Table 44, priority 0. >> * ====================== >> * >> * Drop packets that do not match previous flows. >> */ >> add_default_drop_flow(p_ctx, OFTABLE_LOCAL_OUTPUT, flow_table); >> >> - /* Table 41, Priority 0. >> + /* Table 45, Priority 0. >> * ======================= >> * >> * Resubmit packets that don't output to the ingress port (already >> checked >> - * in table 40) to the logical egress pipeline, clearing the logical >> + * in table 43) to the logical egress pipeline, clearing the logical >> * registers (for consistent behavior with packets that get tunneled). >> */ >> match_init_catchall(&match); >> ofpbuf_clear(&ofpacts); >> for (int i = 0; i < MFF_N_LOG_REGS; i++) { >> - put_load(0, MFF_REG0 + i, 0, 32, &ofpacts); >> + if ((MFF_REG0 + i) != MFF_LOG_TUN_OFPORT) { >> + put_load(0, MFF_REG0 + i, 0, 32, &ofpacts); >> + } >> } >> + /* In MFF_LOG_TUN_OFPORT, the bits 16..31 are used to store geneve > > Why do we restrict this only to geneve? Wouldn't this work for VXLAN > too out of the box? > > [Sragdhara] Yes, we can make it work for VXLAN as well. That will need some > more changes. Is it okay to address it in a future patch, after this one is > merged? > Sure, perfectly fine to do this as a follow up, let's not forget to add a TODO.rst item for this though. I'll leave a comment about that in the v8 patch so we don't forget. > >> + * tunnel id of received packets and these need to be carried over to >> + * the egress pipeline. The remaining bits can be reset to zero. */ >> + put_load(0, MFF_LOG_TUN_OFPORT, 0, 16, &ofpacts); >> put_resubmit(OFTABLE_LOG_EGRESS_PIPELINE, &ofpacts); >> ofctrl_add_flow(flow_table, OFTABLE_CHECK_LOOPBACK, 0, 0, &match, >> &ofpacts, hc_uuid); >> diff --git a/include/ovn/logical-fields.h b/include/ovn/logical-fields.h >> index 98e134e0f..47775192d 100644 >> --- a/include/ovn/logical-fields.h >> +++ b/include/ovn/logical-fields.h >> @@ -42,6 +42,7 @@ enum ovn_controller_event { >> * (16..31 of the 32 bits). */ >> #define MFF_LOG_INPORT MFF_REG14 /* Logical input port (32 bits). */ >> #define MFF_LOG_OUTPORT MFF_REG15 /* Logical output port (32 bits). */ >> +#define MFF_LOG_TUN_OFPORT MFF_REG5 /* 16..31 of the 32 bits */ >> >> /* Logical registers. >> * >> @@ -101,6 +102,7 @@ enum mff_log_flags_bits { >> MLF_UNSNAT_NEW_BIT = 20, >> MLF_UNSNAT_NOT_TRACKED_BIT = 21, >> MLF_IGMP_IGMP_SNOOP_INJECT_BIT = 22, >> + MLF_RECIRC_BIT = 23, >> MLF_NETWORK_ID_START_BIT = 28, >> MLF_NETWORK_ID_END_BIT = 31, >> }; >> @@ -170,6 +172,9 @@ enum mff_log_flags { >> /* Indicate that this is an IGMP packet reinjected by ovn-controller. */ >> MLF_IGMP_IGMP_SNOOP = (1 << MLF_IGMP_IGMP_SNOOP_INJECT_BIT), >> >> + /* Indicate the packet has been processed by LOCAL table once before. */ >> + MLF_RECIRC = (1 << MLF_RECIRC_BIT), >> + >> /* Assign network ID to packet to choose correct network for snat when >> * lb_force_snat_ip=router_ip. */ >> MLF_NETWORK_ID = (OVN_MAX_NETWORK_ID << MLF_NETWORK_ID_START_BIT), >> @@ -237,15 +242,19 @@ const struct ovn_field *ovn_field_from_name(const char >> *name); >> #define OVN_CT_OBS_STAGE_END_BIT 5 >> #define OVN_CT_ALLOW_ESTABLISHED_BIT 6 >> #define OVN_CT_NETWORK_FUNCTION_GROUP_BIT 7 >> +#define OVN_CT_TUN_IF_BIT 8 >> >> #define OVN_CT_BLOCKED 1 >> #define OVN_CT_NATTED 2 >> #define OVN_CT_LB_SKIP_SNAT 4 >> #define OVN_CT_LB_FORCE_SNAT 8 >> #define OVN_CT_NETWORK_FUNCTION_GROUP 128 >> +#define OVN_CT_TUN_IF 256 >> >> #define OVN_CT_NETWORK_FUNCTION_GROUP_ID_1ST_BIT 17 >> #define OVN_CT_NETWORK_FUNCTION_GROUP_ID_END_BIT 24 >> +#define OVN_CT_TUN_IF_1ST_BIT 80 >> +#define OVN_CT_TUN_IF_END_BIT 95 >> >> #define OVN_CT_ECMP_ETH_1ST_BIT 32 >> #define OVN_CT_ECMP_ETH_END_BIT 79 >> diff --git a/lib/logical-fields.c b/lib/logical-fields.c >> index 681a6dd50..15876a64f 100644 >> --- a/lib/logical-fields.c >> +++ b/lib/logical-fields.c >> @@ -228,6 +228,16 @@ ovn_init_symtab(struct shash *symtab) >> >> OVN_CT_NETWORK_FUNCTION_GROUP_ID_END_BIT) >> "]", >> WR_CT_COMMIT); >> + expr_symtab_add_subfield_scoped(symtab, "ct_label.tun_if", NULL, >> + "ct_label[" >> + OVN_CT_STR(OVN_CT_TUN_IF_BIT) >> + "]", >> + WR_CT_COMMIT); >> + expr_symtab_add_subfield_scoped(symtab, "ct_label.tun_if_id", NULL, >> + "ct_label[" >> + OVN_CT_STR(OVN_CT_TUN_IF_1ST_BIT) ".." >> + OVN_CT_STR(OVN_CT_TUN_IF_END_BIT) "]", >> + WR_CT_COMMIT); >> >> expr_symtab_add_field(symtab, "ct_state", MFF_CT_STATE, NULL, false); >> >> diff --git a/northd/northd.c b/northd/northd.c >> index 11498217f..382182ca6 100644 >> --- a/northd/northd.c >> +++ b/northd/northd.c >> @@ -240,6 +240,10 @@ static const char *reg_ct_state[] = { >> #undef CS_STATE >> }; >> >> +/* Register used for storing tunnel openflow interface id, in a Logical >> Switch. >> + * Must match the MFF_LOG_TUN_OFPORT in logical-fields.h */ >> +#define REG_TUN_OFPORT "reg5[16..31]" > > As mentioned in patch 3/5, this will overlap with XXREG1 (REG_LB_IPV6). > Wouldn't that be a problem? > > Also, please update the "OVS register usage:" table below. > > > > [Sragdhara] Yes, this will overlap. Moved to reg0 available bits like you > suggested. Also, LB is not tested along with NF. Theoretically, there might > be problems for NF with from-lport ACL, in combination with LB, due to > asymmetric traffic path. We thought we would take up LB case in a separate > patch. Have added in the documentation ovn-nb.xml as well as in NEWs > mentioning LB. Ack, thanks for the documentation, I'll check v8 closely. > >> + >> /* Register used for temporarily store ECMP eth.src to avoid masked ct_label >> * access. It doesn't really occupy registers because the content of the >> * register is saved to stack and then restored in the same flow. >> @@ -17468,6 +17472,48 @@ network_function_get_active(const struct >> nbrec_network_function_group *nfg) >> return nfg->n_network_function ? nfg->network_function[0] : NULL; >> } >> >> +/* For packets received on tunnel and egressing towards a network-function >> port >> + * commit the tunnel interface id in CT. This will be utilized when the >> packet >> + * comes out of the other network-function interface of the service VM. The >> + * packet then will be tunneled back to the source host. */ >> +static void >> +build_lswitch_stateful_nf(struct ovn_port *op, >> + struct lflow_table *lflows, >> + struct ds *actions, struct ds *match, >> + struct lflow_ref *lflow_ref) >> +{ >> + ds_clear(actions); >> + ds_clear(match); >> + >> + ds_put_cstr(actions, >> + "ct_commit { " >> + "ct_mark.blocked = 0; " >> + "ct_mark.allow_established = " REGBIT_ACL_PERSIST_ID "; >> " >> + "ct_label.acl_id = " REG_ACL_ID "; " >> + "ct_label.tun_if_id = " REG_TUN_OFPORT "; }; next;"); >> + ds_put_format(match, >> + "outport == %s && " REGBIT_ACL_LABEL" == 0", >> op->json_key); >> + ovn_lflow_add(lflows, op->od, S_SWITCH_OUT_STATEFUL, 120, >> + ds_cstr(match), ds_cstr(actions), lflow_ref); >> + >> + ds_clear(actions); >> + ds_clear(match); >> + ds_put_format(match, >> + "outport == %s && " REGBIT_ACL_LABEL" == 1", >> + op->json_key); >> + ds_put_cstr(actions, >> + "ct_commit { " >> + "ct_mark.blocked = 0; " >> + "ct_mark.allow_established = " REGBIT_ACL_PERSIST_ID "; >> " >> + "ct_label.acl_id = " REG_ACL_ID "; " >> + "ct_mark.obs_stage = " REGBIT_ACL_OBS_STAGE "; " >> + "ct_mark.obs_collector_id = " REG_OBS_COLLECTOR_ID_EST >> "; " >> + "ct_label.obs_point_id = " REG_OBS_POINT_ID_EST "; " >> + "ct_label.tun_if_id = " REG_TUN_OFPORT "; }; next;"); > > I'm afraid using ct_label for all the Network Function features will > make it such that traffic is not offloadable (with Nvidia NICs) if both > ACL Sampling and Network Functions are used. > > > [Sragdhara] Was not aware. Thanks for pointing it out. If I understand it > right, you are saying that since sampling feature is using ct_mark and NF is > using ct_label, these two cannot be used together with HW offload. Any > pointers to understand this issue better? Let me know if you have any > suggestions Also, is it okay to take up the changes related to HW offload, > later in a separate patch? On some specific hardware (AFAIK on Nvidia NICs that support conntrack hardware offload) some types of ovs matches are not supported. Among those masked ct_label access. That's why for ACL sampling we had to make sure we match on the complete contents of the CT label, please see below: build_acl_sample_label_match(...) { ... ds_put_format(match, "ct_label.obs_point_id == %"PRIu32" && " "ct_label.obs_unused == 0", point_id); ... } From my perspective it's probably fine to address HW offload in the future. We should however call this out in the documentation and maybe add a TODO.rst item for it too. Regards, Dumitru _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev