On 10/7/25 8:36 PM, Sragdhara Datta Chaudhuri wrote: > Hi Dumitru, >
Hi Sragdhara, > Thanks so much for your detailed review over multiple iterations. > > Numan, thank you as well for your valuable inputs. > > Regarding the points you mentioned in this email: > > * > Agree that dedicated registers would be good. I picked reg5 as no > other registers were completely free. The lack of registers would > hamper other features in the future as well. > * > We’ll take a look at the system test tcpdump issue and fix it in a > future patch. > * > We noticed the test 622 (Network function packet flow > outbound) failure but couldn’t reproduce it in-house even after > running it in a loop. We’ll look into it. > Thank you! Let me know if you need help reproducing this in CI. Regards, Dumitru > > Thanks, > Sragdhara > > *From: *Dumitru Ceara <[email protected]> > *Date: *Friday, October 3, 2025 at 2:01 AM > *To: *Sragdhara Datta Chaudhuri <[email protected]>, ovs- > [email protected] <[email protected]> > *Cc: *Numan Siddique <[email protected]>, Naveen Yerramneni > <[email protected]>, Karthik Chandrashekar > <[email protected]>, Ilya Maximets <[email protected]> > *Subject: *Re: [ovs-dev] [PATCH OVN v9 0/5] Network Function Insertion. > > On 9/15/25 9:34 PM, Sragdhara Datta Chaudhuri wrote: >> RFC: NETWORK FUNCTION INSERTION IN OVN >> >> 1. Introduction >> ================ >> The objective is to insert a Network Function (NF) in the path of > outbound/inbound traffic from/to a port-group. The use case is to > integrate a 3rd-party service in the path of traffic. An example of such > a service would be a layer-7 firewall. The NF VM will act like a bump in the > wire and should not modify the packet, i.e. the IP header, the MAC > addresses, the VLAN tag and sequence numbers remain unchanged. >> >> Here are some of the highlights: >> - A new entity network-function (NF) has been introduced. 
It contains > a pair of LSPs. The CMS would designate one as “inport” and the other as > “outport”. >> - For high-availability, a network function group (NFG) entity > consists of a group of NFs. Only one NF in a NFG has an active role > based on health monitoring. >> - ACL would accept NFG as a parameter and traffic matching the ACL > would be redirected to the associated active NF’s port. NFG is accepted > for the stateful allow action only. >> - The ACL’s port-group is the point of reference when defining the > role of the NF ports. The “inport” is the port closer to the port-group > and “outport” is the one away from it. For from-lport ACLs, the request > packets would be redirected to the NF “inport” and for to-lport ACLs, > the request packets would be redirected to the NF “outport”. When the same > packet comes out of the other NF port, it gets simply forwarded. >> - Statefulness will be maintained, i.e. the response traffic will also > go through the same pair of NF ports but in reverse order. >> - For the NF ports we need to disable the port security check, fdb > learning and multicast/broadcast forwarding. >> - Health monitoring involves ovn-controller periodically injecting > ICMP probe packets into the NF inport and monitoring the same packet coming > out of the NF outport. >> - If the traffic redirection involves cross-host traffic (e.g. for a > from-lport ACL, if the source VM and NF VM are on different hosts), > packets would be tunneled to and from the NF VM's host. >> - If the port-group to which the ACL is being applied has members > spread across multiple LSs, the CMS needs to create child ports for the NF > ports on each of these LSs. The redirection rules in each LS will use > the child ports on that LS. >> >> 2. NB tables >> ============= >> New NB tables >> —------------ >> Network_Function: Each row contains {inport, outport, health_check} >> Network_Function_Group: Each row contains a list of Network_Function > entities. 
It also contains a unique id (between 1 and 255) and a > reference to the current active NF. >> Network_Function_Health_Check: Each row contains configuration for > probes in options field: {interval, timeout, success_count, failure_count} >> >> "Network_Function_Health_Check": { >> "columns": { >> "name": {"type": "string"}, >> "options": { >> "type": {"key": "string", >> "value": "string", >> "min": 0, >> "max": "unlimited"}}, >> "external_ids": { >> "type": {"key": "string", "value": "string", >> "min": 0, "max": "unlimited"}}}, >> "isRoot": true}, >> "Network_Function": { >> "columns": { >> "name": {"type": "string"}, >> "outport": {"type": {"key": {"type": "uuid", >> "refTable": > "Logical_Switch_Port", >> "refType": "strong"}, >> "min": 1, "max": 1}}, >> "inport": {"type": {"key": {"type": "uuid", >> "refTable": > "Logical_Switch_Port", >> "refType": "strong"}, >> "min": 1, "max": 1}}, >> "health_check": {"type": { >> "key": {"type": "uuid", >> "refTable": "Network_Function_Health_Check", >> "refType": "strong"}, >> "min": 0, "max": 1}}, >> "external_ids": { >> "type": {"key": "string", "value": "string", >> "min": 0, "max": "unlimited"}}}, >> "isRoot": true}, >> "Network_Function_Group": { >> "columns": { >> "name": {"type": "string"}, >> "network_function": {"type": >> {"key": {"type": "uuid", >> "refTable": "Network_Function", >> "refType": "strong"}, >> "min": 0, "max": "unlimited"}}, >> "mode": {"type": {"key": {"type": "string", >> "enum": ["set", ["inline"]]}}}, >> "network_function_active": {"type": >> {"key": {"type": "uuid", >> "refTable": "Network_Function", >> "refType": "strong"}, >> "min": 0, "max": 1}}, >> "id": { >> "type": {"key": {"type": "integer", >> "minInteger": 0, >> "maxInteger": 255}}}, >> "external_ids": { >> "type": {"key": "string", "value": "string", >> "min": 0, "max": "unlimited"}}}, >> "isRoot": true}, >> >> >> Modified NB table >> —---------------- >> ACL: The ACL entity would have a new optional field that is a > reference to 
a Network_Function_Group entity. This field can be present > only for stateful allow ACLs. >> >> "ACL": { >> "columns": { >> "network_function_group": {"type": {"key": {"type": > "uuid", >> "refTable": > "Network_Function_Group", >> "refType": "strong"}, >> "min": 0, >> "max": 1}}, >> >> New options for Logical_Switch_Port >> —---------------------------------- >> receive_multicast=<boolean>: Default true. If set to false, LS will > not forward broadcast/multicast traffic to this port. This is to prevent > looping of such packets. >> >> lsp_learn_mac=<boolean>: Default true. If set to false, fdb learning > will be skipped for packets coming out of this port. Redirected packets > from the NF port would be carrying the originating VM’s MAC in source, > and so learning should not happen. >> >> CMS needs to set both the above options to false for NF ports, in > addition to disabling port security. >> >> nf-linked-port=<lsp-name>: Each NF port needs to have this set to the > other NF port of the pair. >> >> is-nf=<boolean>: Each NF port needs to have this set to true. 
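As an aside (not part of the series): the per-port requirements above can be captured in a tiny validation helper. This is an illustrative sketch only; the helper name and shape are invented, and just the option names come from the proposal.

```python
def validate_nf_port(options, port_security):
    """Return a list of problems with an LSP meant to be used as an NF
    port, per the option requirements described in this proposal.
    Illustrative only; this helper is not part of OVN."""
    problems = []
    if options.get("receive_multicast", "true") != "false":
        problems.append("receive_multicast must be set to false")
    if options.get("lsp_learn_mac", "true") != "false":
        problems.append("lsp_learn_mac must be set to false")
    if not options.get("nf-linked-port"):
        problems.append("nf-linked-port must name the paired NF port")
    if options.get("is-nf") != "true":
        problems.append("is-nf must be set to true")
    if port_security:
        problems.append("port security must be disabled")
    return problems

# A port configured as in the sample configuration below passes:
nfp1_opts = {"receive_multicast": "false", "lsp_learn_mac": "false",
             "nf-linked-port": "nfp2", "is-nf": "true"}
assert validate_nf_port(nfp1_opts, port_security=[]) == []
# A plain VM port with port security enabled does not:
assert validate_nf_port({}, port_security=["50:6b:8d:3e:ed:c4"]) != []
```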
>> >> New NB_global options >> —-------------------- >> svc_monitor_mac_dst: destination MAC of probe packets (svc_monitor_mac > is already there and will be used as source MAC) >> svc_monitor_ip: source IP of probe packets >> svc_monitor_ip_dst: destination IP of probe packets >> >> Sample configuration >> —------------------- >> ovn-nbctl ls-add ls1 >> ovn-nbctl lsp-add ls1 nfp1 >> ovn-nbctl lsp-add ls1 nfp2 >> ovn-nbctl set logical_switch_port nfp1 options:receive_multicast=false > options:lsp_learn_mac=false options:nf-linked-port=nfp2 >> ovn-nbctl set logical_switch_port nfp2 options:receive_multicast=false > options:lsp_learn_mac=false options:nf-linked-port=nfp1 >> ovn-nbctl nf-add nf1 nfp1 nfp2 >> ovn-nbctl nfg-add nfg1 123 inline nf1 >> ovn-nbctl lsp-add ls1 p1 -- lsp-set-addresses p1 "50:6b:8d:3e:ed:c4 > 10.1.1.4" >> ovn-nbctl pg-add pg1 p1 >> ovn-nbctl create Address_Set name=as1 addresses=10.1.1.4 >> ovn-nbctl lsp-add ls1 p2 -- lsp-set-addresses p2 "50:6b:8d:3e:ed:c5 > 10.1.1.5" >> ovn-nbctl create Address_Set name=as2 addresses=10.1.1.5 >> ovn-nbctl acl-add pg1 from-lport 200 'inport==@pg1 && ip4.dst == $as2' > allow-related nfg1 >> ovn-nbctl acl-add pg1 to-lport 100 'outport==@pg1 && ip4.src == $as2' > allow-related nfg1 >> >> 3. SB tables >> ============ >> Service_Monitor: >> This is currently used by Load balancer. New fields are: “type” - to > indicate LB or NF, “mac” - the destination MAC address for monitor > packets, “logical_input_port” - the LSP to which the probe packet would > be sent. Also, “icmp” has been added as a protocol type, used only for NF. 
>> >> "Service_Monitor": { >> "columns": { >> "type": {"type": {"key": { >> "type": "string", >> "enum": ["set", ["load-balancer", "network- > function"]]}}}, >> "mac": {"type": "string"}, >> "protocol": { >> "type": {"key": {"type": "string", >> "enum": ["set", ["tcp", "udp", "icmp"]]}, >> "min": 0, "max": 1}}, >> "logical_input_port": {"type": "string"}, >> >> northd would create one Service_Monitor entity for each NF. The > logical_input_port and logical_port would be populated from the NF > inport and outport fields respectively. The probe packets would be > injected into the logical_input_port and would be monitored out of > logical_port. >> >> 4. Logical Flows >> ================ >> Logical Switch ingress pipeline: >> - in_network_function added after in_stateful. >> - Modifications to in_acl_eval, in_stateful and in_l2_lookup. >> Logical Switch egress pipeline: >> - out_network_function added after out_stateful. >> - Modifications to out_pre_acl, out_acl_eval and out_stateful. >> >> 4.1 from-lport ACL >> ------------------ >> The diagram shows the request path for packets from VM1 port p1, which > is a member of the pg to which the ACL is applied. The response would follow > the reverse path, i.e. the packet would be redirected to nfp2 and come out > of nfp1 and be forwarded to p1. >> Also, p2 does not need to be on the same LS. Only p1, nfp1 and nfp2 > need to be on the same LS. 
>>
>>      -----          -------          -----
>>     | VM1 |        | NF VM |        | VM2 |
>>      -----          -------          -----
>>       |             /\    |           /\
>>       |             |     |           |
>>       \/            |     \/          |
>>   ------------------------------------------------------------
>>   | p1            nfp1  nfp2                             p2   |
>>   |                                                           |
>>   |                    Logical Switch                         |
>>   ------------------------------------------------------------
>>   pg1: [p1]                                   as2: [p2-ip]
>> ovn-nbctl nf-add nf1 nfp1 nfp2
>> ovn-nbctl nfg-add nfg1 123 inline nf1
>> ovn-nbctl acl-add pg1 from-lport 200 'inport==@pg1 && ip4.dst == $as2' > allow-related nfg1
>> >> The request packets from p1 matching a from-lport ACL with NFG are > redirected to nfp1 and the NFG id is committed to the ct label in p1's > zone. When the same packet comes out of nfp2 it gets forwarded the > normal way. >> Response packets have p1's MAC as destination. Ingress processing sets > the outport to p1 and the CT lookup in the egress pipeline (in p1's ct zone) > yields the NFG id, and the packet is injected back into the ingress pipeline after > setting the outport to nfp2. >> >> Below are the changes in detail. >> >> 4.1.1 Request processing >> ------------------------ >> >> in_acl_eval: For from-lport ACLs with NFG, the existing rule's action > has been enhanced to set: >> - reg8[21] = 1: to indicate that the packet has matched a rule with NFG >> - reg0[22..29] = <NFG-unique-id> >> - reg8[22] = <direction> (1: request, 0: response) >> >> table=8 (ls_in_acl_eval), priority=1200 , match=(reg0[7] == 1 && > (inport==@pg1 && ip4.dst == $as2)), action=(reg8[16] = 1; reg0[1] = 1; > reg8[21] = 1; reg8[22] = 1; reg0[22..29] = 123; next;) >> table=8 (ls_in_acl_eval), priority=1200 , match=(reg0[8] == 1 && > (inport==@pg1 && ip4.dst == $as2)), action=(reg8[16] = 1; reg8[21] = 1; > reg8[22] = 1; reg0[22..29] = 123; next;) >> >> in_stateful: Priority 110: set NFG id in CT label if reg8[21] is set. >> - bit 7 (ct_label.network_function_group): Set to 1 to indicate NF > insertion. 
>> - bits 17 to 24 (ct_label.network_function_group_id): Stores the 8-bit > NFG id >> >> table=21(ls_in_stateful ), priority=110 , match=(reg0[1] == 1 > && reg0[13] == 0 && reg8[21] == 1), action=(ct_commit { ct_mark.blocked > = 0; ct_mark.allow_established = reg0[20]; ct_label.acl_id = > reg2[16..31]; ct_label.network_function_group = 1; > ct_label.network_function_group_id = reg0[22..29]; }; next;) >> table=21(ls_in_stateful ), priority=110 , match=(reg0[1] == 1 > && reg0[13] == 1 && reg8[21] == 1), action=(ct_commit { ct_mark.blocked > = 0; ct_mark.allow_established = reg0[20]; ct_mark.obs_stage = > reg8[19..20]; ct_mark.obs_collector_id = reg8[8..15]; > ct_label.obs_point_id = reg9; ct_label.acl_id = reg2[16..31]; > ct_label.network_function_group = 1; ct_label.network_function_group_id > = reg0[22..29]; }; next;) >> table=21(ls_in_stateful ), priority=100 , match=(reg0[1] == 1 > && reg0[13] == 0), action=(ct_commit { ct_mark.blocked = 0; > ct_mark.allow_established = reg0[20]; ct_label.acl_id = reg2[16..31]; > ct_label.network_function_group = 0; ct_label.network_function_group_id > = 0; }; next;) >> table=21(ls_in_stateful ), priority=100 , match=(reg0[1] == 1 > && reg0[13] == 1), action=(ct_commit { ct_mark.blocked = 0; > ct_mark.allow_established = reg0[20]; ct_mark.obs_stage = reg8[19..20]; > ct_mark.obs_collector_id = reg8[8..15]; ct_label.obs_point_id = reg9; > ct_label.acl_id = reg2[16..31]; ct_label.network_function_group = 0; > ct_label.network_function_group_id = 0; }; next;) >> table=21(ls_in_stateful ), priority=0 , match=(1), action=(next;) >> >> >> For non-NFG cases, the existing priority 100 rules will be hit. There, an > additional action has been added to clear the NFG bits in the ct label. >> >> in_network_function: A new stage with priority 99 rules to redirect > packets by setting the outport to the NF “inport” (or its child port) based > on the NFG id set by the prior ACL stage. 
>> Priority 100 rules ensure that when the same packets come out of the > NF ports, they are not redirected again (the setting of reg5 here > relates to the cross-host packet tunneling and will be explained later). >> Priority 1 rule: if reg8[21] is set, but the NF port (or child port) > is not present on this LS, drop packets. >> >> table=22(ls_in_network_function), priority=100 , match=(inport == > "nfp1"), action=(reg5[16..31] = ct_label.tun_if_id; next;) >> table=22(ls_in_network_function), priority=100 , match=(inport == > "nfp2"), action=(reg5[16..31] = ct_label.tun_if_id; next;) >> table=22(ls_in_network_function), priority=100 , match=(reg8[21] == > 1 && eth.mcast), action=(next;) >> table=22(ls_in_network_function), priority=99 , match=(reg8[21] == > 1 && reg8[22] == 1 && reg0[22..29] == 123), action=(outport = "nfp1"; > output;) >> table=22(ls_in_network_function), priority=1 , match=(reg8[21] == > 1), action=(drop;) >> table=22(ls_in_network_function), priority=0 , match=(1), > action=(next;) >> >> >> 4.1.2 Response processing >> ------------------------- >> out_acl_eval: High priority rules that allow response and related > packets to go through have been enhanced to also copy CT label NFG bit > into reg8[21]. >> >> table=6(ls_out_acl_eval), priority=65532, match=(!ct.est && ct.rel > && !ct.new && !ct.inv && ct_mark.blocked == 0), action=(reg8[21] = > ct_label.network_function_group; reg8[16] = 1; ct_commit_nat;) >> table=6(ls_out_acl_eval), priority=65532, match=(ct.est && !ct.rel > && !ct.new && !ct.inv && ct.rpl && ct_mark.blocked == 0), > action=(reg8[21] = ct_label.network_function_group; reg8[16] = 1; next;) >> >> out_network_function: Priority 99 rule matches on the nfg_id in > ct_label and sets the outport to the NF “outport”. It also sets > reg8[23]=1 and injects the packet to ingress pipeline (in_l2_lookup). >> Priority 100 rule forwards all packets to NF ports to the next table. 
>> >> table=11 (ls_out_network_function), priority=100 , match=(outport > == "nfp1"), action=(next;) >> table=11 (ls_out_network_function), priority=100 , match=(outport > == "nfp2"), action=(next;) >> table=11(ls_out_network_function), priority=100 , match=(reg8[21] > == 1 && eth.mcast), action=(next;) >> table=11 (ls_out_network_function), priority=99 , match=(reg8[21] > == 1 && reg8[22] == 0 && ct_label.network_function_group_id == 123), > action=(outport = "nfp2"; reg8[23] = 1; next(pipeline=ingress, table=29);) >> table=11 (ls_out_network_function), priority=1 , match=(reg8[21] > == 1), action=(drop;) >> table=11 (ls_out_network_function), priority=0 , match=(1), > action=(next;) >> >> in_l2_lkup: if reg8[23] == 1 (the packet has come back from egress), > simply forward such packets, as the outport is already set. >> >> table=29(ls_in_l2_lkup), priority=100 , match=(reg8[23] == 1), > action=(output;) >> >> The above set of rules ensures that the response packet is sent to > nfp2. When the same packet comes out of nfp1, the ingress pipeline would > set the outport to p1 and it enters the egress pipeline. >> >> out_pre_acl: If the packet is coming from the NF inport, skip the > egress pipeline up to the out_nf stage, as the packet has already gone > through it and we don't want the same packet to be processed by CT twice. >> table=2 (ls_out_pre_acl ), priority=110 , match=(inport == > "nfp1"), action=(next(pipeline=egress, table=12);) >> >> >> 4.2 to-lport ACL >> ----------------
>>      -----          --------         -----
>>     | VM1 |        | NF VM |        | VM2 |
>>      -----          --------         -----
>>       /\            |     /\          |
>>       |             |     |           |
>>       |             \/    |           \/
>>   -------------------------------------------------------------
>>   | p1            nfp1  nfp2                             p2    |
>>   |                                                            |
>>   |                    Logical Switch                          |
>>   -------------------------------------------------------------
>> ovn-nbctl acl-add pg1 to-lport 100 'outport==@pg1 && ip4.src == $as2' > allow-related nfg1
>> The diagram shows the request traffic path. The response will follow the reverse > path. 
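Before going into the to-lport case (an editorial aside, not part of the series): the register and ct_label fields introduced in 4.1 are reused below, so a small Python model can serve as a reading aid. Bit positions are taken from the flows above; the helper itself is invented.

```python
# Model of the register fields used by NF insertion (reading aid only):
#   reg8[21]     - packet matched an ACL with an NFG
#   reg8[22]     - direction: 1 = request, 0 = response
#   reg8[23]     - packet was re-injected from the egress pipeline
#   reg0[22..29] - the 8-bit NFG unique id
def acl_eval_nfg_action(reg0, reg8, nfg_id, request):
    """Mimic the extra actions added to in_acl_eval/out_acl_eval when a
    packet matches an ACL that references an NFG."""
    reg8 |= 1 << 21                                     # reg8[21] = 1
    reg8 = (reg8 & ~(1 << 22)) | (int(request) << 22)   # reg8[22]
    reg0 = (reg0 & ~(0xff << 22)) | ((nfg_id & 0xff) << 22)
    return reg0, reg8

reg0, reg8 = acl_eval_nfg_action(0, 0, nfg_id=123, request=True)
assert (reg0 >> 22) & 0xff == 123   # NFG id 123, as in the sample flows
assert (reg8 >> 21) & 1 == 1 and (reg8 >> 22) & 1 == 1
```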
>> >> The ingress pipeline sets the outport to p1 based on destination MAC > lookup. The packet enters the egress pipeline. There the to-lport ACL > with NFG gets evaluated and the NFG id gets committed to the CT label. > Then the outport is set to nfp2 and the packet is injected back into > ingress. When the same packet comes out of nfp1, it gets forwarded to p1 > the normal way. >> For the response packet from p1, the ingress pipeline gets the NFG id > from the CT label and accordingly redirects it to nfp1. When it comes out of > nfp2 it is forwarded the normal way. >> >> 4.2.1 Request processing >> ------------------------ >> out_acl_eval: For to-lport ACLs with NFG, the existing rule's action > has been enhanced to set: >> - reg8[21] = 1: to indicate that the packet has matched a rule with NFG >> - reg0[22..29] = <NFG-unique-id> >> - reg8[22] = <direction> (1: request, 0: response) >> >> table=6 (ls_out_acl_eval ), priority=1100 , match=(reg0[7] == 1 > && (outport==@pg1 && ip4.src == $as2)), action=(reg8[16] = 1; reg0[1] = > 1; reg8[21] = 1; reg8[22] = 1; reg0[22..29] = 123; next;) >> table=6 (ls_out_acl_eval ), priority=1100 , match=(reg0[8] == 1 > && (outport==@pg1 && ip4.src == $as2)), action=(reg8[16] = 1; reg0[1] = > 1; reg8[21] = 1; reg8[22] = 1; reg0[22..29] = 123; next;) >> >> >> >> out_stateful: Priority 110: set NFG id in CT label if reg8[21] is set. 
>> >> table=10(ls_out_stateful ), priority=110 , match=(reg0[1] == 1 > && reg0[13] == 0 && reg8[21] == 1), action=(ct_commit { ct_mark.blocked > = 0; ct_mark.allow_established = reg0[20]; ct_label.acl_id = > reg2[16..31]; ct_label.network_function_group = 1; > ct_label.network_function_group_id = reg0[22..29]; }; next;) >> table=10(ls_out_stateful ), priority=110 , match=(reg0[1] == 1 > && reg0[13] == 1 && reg8[21] == 1), action=(ct_commit { ct_mark.blocked > = 0; ct_mark.allow_established = reg0[20]; ct_mark.obs_stage = > reg8[19..20]; ct_mark.obs_collector_id = reg8[8..15]; > ct_label.obs_point_id = reg9; ct_label.acl_id = reg2[16..31]; > ct_label.network_function_group = 1; ct_label.network_function_group_id > = reg0[22..29]; }; next;) >> table=10(ls_out_stateful ), priority=100 , match=(reg0[1] == 1 > && reg0[13] == 0), action=(ct_commit { ct_mark.blocked = 0; > ct_mark.allow_established = reg0[20]; ct_label.acl_id = reg2[16..31]; > ct_label.network_function_group = 0; ct_label.network_function_group_id > = 0; }; next;) >> table=10(ls_out_stateful ), priority=100 , match=(reg0[1] == 1 > && reg0[13] == 1), action=(ct_commit { ct_mark.blocked = 0; > ct_mark.allow_established = reg0[20]; ct_mark.obs_stage = reg8[19..20]; > ct_mark.obs_collector_id = reg8[8..15]; ct_label.obs_point_id = reg9; > ct_label.acl_id = reg2[16..31]; ct_label.network_function_group = 0; > ct_label.network_function_group_id = 0; }; next;) >> table=10(ls_out_stateful ), priority=0 , match=(1), action=(next;) >> >> out_network_function: A new stage that has priority 99 rules to > redirect packets by setting the outport to the NF “outport” (or its child > port) based on the NFG id set by the prior ACL stage, and then injecting > them back into ingress. Priority 100 rules ensure that when the packets are > going to NF ports, they are not redirected again. >> Priority 1 rule: if reg8[21] is set, but the NF port (or child port) > is not present on this LS, drop packets. 
>> >> table=11(ls_out_network_function), priority=100 , match=(outport == > "nfp1"), action=(next;) >> table=11(ls_out_network_function), priority=100 , match=(outport == > "nfp2"), action=(next;) >> table=11(ls_out_network_function), priority=100 , match=(reg8[21] > == 1 && eth.mcast), action=(next;) >> table=11(ls_out_network_function), priority=99 , match=(reg8[21] > == 1 && reg8[22] == 1 && reg0[22..29] == 123), action=(outport = "nfp2"; > reg8[23] = 1; next(pipeline=ingress, table=29);) >> table=11(ls_out_network_function), priority=1 , match=(reg8[21] > == 1), action=(drop;) >> table=11(ls_out_network_function), priority=0 , match=(1), > action=(next;) >> >> >> in_l2_lkup: As described earlier, the priority 100 rule will forward > these packets. >> >> Then the same packet comes out from nfp1 and goes through the ingress > processing where the outport gets set to p1. The egress pipeline > out_pre_acl priority 110 rule described earlier, matches against inport > as nfp1 and directly jumps to the stage after out_network_function. Thus > the packet is not redirected again. >> >> 4.2.2 Response processing >> ------------------------- >> in_acl_eval: High priority rules that allow response and related > packets to go through have been enhanced to also copy CT label NFG bit > into reg8[21]. >> >> table=8(ls_in_acl_eval), priority=65532, match=(!ct.est && ct.rel > && !ct.new && !ct.inv && ct_mark.blocked == 0), action=(reg0[17] = 1; > reg8[21] = ct_label.network_function_group; reg8[16] = 1; ct_commit_nat;) >> table=8 (ls_in_acl_eval), priority=65532, match=(ct.est && !ct.rel > && !ct.new && !ct.inv && ct.rpl && ct_mark.blocked == 0), > action=(reg0[9] = 0; reg0[10] = 0; reg0[17] = 1; reg8[21] = > ct_label.network_function_group; reg8[16] = 1; next;) >> >> in_network_function: Priority 99 rule matches on the nfg_id in > ct_label and sets the outport to the NF “inport”. >> Priority 100 rule forwards all packets to NF ports to the next table. 
>> table=22(ls_in_network_function), priority=99 , match=(reg8[21] == > 1 && reg8[22] == 0 && ct_label.network_function_group_id == 123), > action=(outport = "nfp1"; output;) >> >> >> 5. Cross-host Traffic for VLAN Network >> ====================================== >> For overlay subnets, all cross-host traffic exchanges are tunneled. In > the case of VLAN subnets, there needs to be special handling to > selectively tunnel only the traffic to or from the NF ports. >> Take the example of a from-lport ACL. Packets from p1 to p2 get > redirected to nfp1 on host1. If this packet is simply sent out from > host1, the physical network will directly forward it to host2 where VM2 > is. So, we need to tunnel the redirected packets from host1 to host3. > Now, once the packets come out of nfp2, if host3 sends the packets out, > the physical network would learn p1's MAC coming from host3. So, these > packets need to be tunneled back to host1. From there the packet would > be forwarded to VM2 via the physical network.
>>
>>      -----             -----              --------
>>     | VM2 |           | VM1 |            | NF VM |
>>      -----             -----              --------
>>       /\                |                /\      |
>>       | (7)             | (1)         (3)|       |(4)
>>       |                 \/               |       \/
>>   --------------      --------------   (2)  ---------------
>>   | p2          | (6) | p1          |______\| nfp1   nfp2 |
>>   |             |/____|             |------/|             |
>>   | host2       |\    | host1       |/______| host3       |
>>   |             |     |             |\------|             |
>>   --------------      --------------   (5)  ---------------
>>
>> The above figure shows the request packet path for a from-lport ACL. > The response would follow the same path in the reverse direction. >> >> To achieve this, the following would be done: >> >> On the host where the ACL port group members are present (host1) >> —----------------------------------------------------------- >> REMOTE_OUTPUT (table 42): >> Currently, it tunnels traffic destined to all non-local overlay ports > to their associated hosts. The same rule is now also added for traffic > to non-local NF ports. Thus the packets from p1 get tunneled to host3. 
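The effect of this REMOTE_OUTPUT change can be sketched as follows (an editorial aside, purely an illustrative model, not OVN code; the helper and the port-to-chassis mapping are invented):

```python
def remote_output(outport, local_ports, chassis_of):
    """Model of the table 42 decision: tunnel any packet whose outport
    (an overlay port or, with this change, an NF port) is not local to
    the current chassis. Illustrative only."""
    if outport in local_ports:
        return ("local", outport)
    return ("tunnel", chassis_of[outport])

# On host1, p1 is local but nfp1 lives on host3, so redirected
# packets are tunneled:
assert remote_output("nfp1", {"p1"}, {"nfp1": "host3"}) == ("tunnel", "host3")
# Traffic to a local port is delivered directly:
assert remote_output("p1", {"p1"}, {}) == ("local", "p1")
```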
>> >> On the host with the NF (host3), forward the packet to nfp1 >> —---------------------------------------------- >> Upon reaching host3, the following rules come into play: >> PHY_TO_LOG (table 0): >> Priority 100: Existing rule - for each geneve tunnel interface on the > chassis, it copies info from the header to the inport, outport and metadata registers. > Now the same rule also stores the tunnel interface id in a register (reg5[16..31]). >> >> CHECK_LOOPBACK (table 44) >> This table has a rule that clears all the registers. The change is to > skip the clearing of reg5[16..31]. >> >> Logical egress pipeline: >> >> ls_out_stateful priority 120: If the outport is an NF port, copy > reg5[16..31] (table 0 had set it) to ct_label.tun_if_id. >> >> table=10(ls_out_stateful ), priority=120 , match=(outport == > "nfp1" && reg0[13] == 0), action=(ct_commit { ct_mark.blocked = 0; > ct_mark.allow_established = reg0[20]; ct_label.acl_id = reg2[16..31]; > ct_label.tun_if_id = reg5[16..31]; }; next;) >> table=10(ls_out_stateful ), priority=120 , match=(outport == > "nfp1" && reg0[13] == 1), action=(ct_commit { ct_mark.blocked = 0; > ct_mark.allow_established = reg0[20]; ct_label.acl_id = reg2[16..31]; > ct_mark.obs_stage = reg8[19..20]; ct_mark.obs_collector_id = > reg8[8..15]; ct_label.obs_point_id = reg9; ct_label.tun_if_id = > reg5[16..31]; }; next;) >> >> The above sequence of flows ensures that if a packet is received via > tunnel on host3, with the outport as nfp1, the tunnel interface id is > committed to the ct entry in nfp1's zone. 
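The commit-and-recall behaviour described above can be modelled with a toy conntrack table (editorial aside; everything here, including the zone number, is invented for illustration):

```python
# Toy model of the tun_if_id commit: when a packet destined to an NF
# port arrives over a tunnel, the tunnel interface id is remembered in
# that port's CT zone. Later, traffic leaving the linked NF port can
# recall it and tunnel the packet back to the originating host.
conntrack = {}  # zone -> committed ct_label fields

def commit_tunnel_id(zone, tun_if_id):
    conntrack.setdefault(zone, {})["tun_if_id"] = tun_if_id

def lookup_tunnel_id(zone):
    # 0 means "not tunneled", i.e. the source port and the NF port
    # were on the same host.
    return conntrack.get(zone, {}).get("tun_if_id", 0)

NFP1_ZONE = 7  # invented zone number
commit_tunnel_id(NFP1_ZONE, tun_if_id=3)  # packet arrived via tunnel 3
assert lookup_tunnel_id(NFP1_ZONE) == 3   # nfp2 egress tunnels back via 3
assert lookup_tunnel_id(99) == 0          # same-host case: forward locally
```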
>> >> On the host with the NF (host3), tunnel packets from nfp2 back to host1 >> —-------------------------------------------------------------- >> When the same packet comes out of nfp2 on host3: >> >> LOCAL_OUTPUT (table 43) >> When the packet comes out of the other NF port (nfp2), the following two > rules send it back to the host that it originally came from: >> >> Priority 110: For each NF port local to this host, the following rule > processes the >> packet through the CT of the linked port (for nfp2, it is nfp1): >> match: inport==nfp2 && RECIRC_BIT==0 >> action: RECIRC_BIT = 1, ct(zone=nfp1’s zone, table=LOCAL), resubmit > to table 43 >> >> Priority 109: For each {tunnel_id, nf port} on this host, if the > tun_if_id in ct_label matches the tunnel_id, send the recirculated > packet using the tunnel_id: >> match: inport==nfp1 && RECIRC_BIT==1 && ct_label.tun_if_id==<tun-id> >> action: tunnel packet using tun-id >> >> If p1 and nfp1 happen to be on the same host, the tun_if_id would not > be set and thus none of the priority 109 rules would match. The packet would be > forwarded the usual way, matching the existing priority 100 rules in > LOCAL_OUTPUT. >> >> Special handling of the case where the NF responds back on nfp1, instead > of forwarding the packet out of nfp2: >> For example, a SYN packet from p1 got redirected to nfp1. Then the NF, > which is a firewall VM, drops the SYN and sends a RST back on port nfp1. > In this case, looking up the linked port's (nfp2) ct zone will not yield > anything. The following rule uses ct.inv to identify such scenarios and > uses nfp1’s CT zone to send the packet back. To achieve this, the following > 2 rules are installed: >> >> in_network_function: >> The priority 100 rule that allows packets incoming from NF type ports is > enhanced with an additional action to store the tun_if_id from ct_label > into reg5[16..31]. 
>> table=22(ls_in_network_function), priority=100 , match=(inport == > "nfp1"), action=(reg5[16..31] = ct_label.tun_if_id; next;) >> >> LOCAL_OUTPUT (table 43) >> Priority 110 rule: for recirculated packets, if the ct (of the linked > port) is invalid, use the tunnel id from reg5[16..31] to tunnel the packet > back to host1 (as the CT zone info has been overwritten by the priority 110 > rule above in table 43). >> match: inport==nfp1 && RECIRC_BIT==1 && ct.inv && > MFF_LOG_TUN_OFPORT==<tun-id> >> action: tunnel packet using tun-id >> >> >> 6. NF insertion across logical switches >> ======================================= >> If the port-group where the ACL is being applied has members across > multiple logical switches, there needs to be an NF port pair on each of > these switches. >> The NF VM will have only one inport and one outport. The CMS is > expected to create child ports linked to these ports on each logical > switch where port-group members are present. >> The network-function entity would be configured with the parent ports > only. When the CMS creates the child ports, it does not need to change any > of the NF, NFG or ACL config tables. >> When northd configures the redirection rules for a specific LS, it > will use the parent or child port depending on what it finds on that LS.
>>
>>                            --------
>>                           | NF VM |
>>                            --------
>>                             |    |
>>      -----                nfp1  nfp2                -----
>>     | VM1 |                 |    |                 | VM2 |
>>      -----             --------------               -----
>>        |               |   SVC LS   |                 |
>>      p1|               --------------               p3|
>>   nfp1_ch1 nfp2_ch1                      nfp1_ch2 nfp2_ch2
>>   --------------------              --------------------
>>   |        LS1       |              |        LS2       |
>>   --------------------              --------------------
>>
>> In this example, the CMS created the parent ports for the NF VM on the LS > named SVC LS. The ports are nfp1 and nfp2. 
The CMS configures the NF > using these ports: >> ovn-nbctl nf-add nf1 nfp1 nfp2 >> ovn-nbctl nfg-add nfg1 123 inline nf1 >> ovn-nbctl acl-add pg1 from-lport 200 'inport==@pg1 && ip4.dst == $as2' > allow-related nfg1 >> >> The port group to which the ACL is applied is pg1 and pg1 has two > ports: p1 on LS1 and p3 on LS2. >> The CMS needs to create child ports for the NF ports on LS1 and LS2. > On LS1: nfp1_ch1 and nfp2_ch1. On LS2: nfp1_ch2 and nfp2_ch2. >> >> When northd creates rules on LS1, it would use nfp1_ch1 and nfp2_ch1. >> >> table=22(ls_in_network_function), priority=100 , match=(inport == > "nfp2_ch1"), action=(reg5[16..31] = ct_label.tun_if_id; next;) >> table=22(ls_in_network_function), priority=99 , match=(reg8[21] == > 1 && reg8[22] == 1 && reg0[22..29] == 123), action=(outport = > "nfp1_ch1"; output;) >> >> When northd creates rules on LS2, it would use nfp1_ch2 and nfp2_ch2. >> table=22(ls_in_network_function), priority=100 , match=(inport == > "nfp2_ch2"), action=(reg5[16..31] = ct_label.tun_if_id; next;) >> table=22(ls_in_network_function), priority=99 , match=(reg8[21] == > 1 && reg8[22] == 1 && reg0[22..29] == 123), action=(outport = > "nfp1_ch2"; output;) >> >> >> 7. Health Monitoring >> ==================== >> The LB health monitoring functionality has been extended to support > NFs. Network_Function_Group has a list of Network_Functions, each of > which has a reference to Network_Function_Health_Check that has the > monitoring config. There is a corresponding SB service_monitor > maintaining the online/offline status. When the status changes, northd picks > one of the “online” NFs and sets it in the network_function_active field > of the NFG. The redirection rule in the LS uses the ports from this NF. 
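The status-to-active-NF step can be sketched like this (editorial aside; the "first online NF by name" tie-break is an assumed policy for illustration only, since the cover letter says merely that V9 made the election deterministic):

```python
def pick_active_nf(nf_names, online_status):
    """Pick the active NF for an NFG from health-monitoring status.
    The deterministic tie-break (lowest name) is an invented policy,
    used here only to illustrate the idea of a stable election."""
    online = sorted(n for n in nf_names if online_status.get(n) == "online")
    return online[0] if online else None

# Two online NFs: the choice is stable regardless of input order.
assert pick_active_nf(["nf2", "nf1"],
                      {"nf1": "online", "nf2": "online"}) == "nf1"
# Failover: when nf1 goes offline, nf2 becomes active.
assert pick_active_nf(["nf1", "nf2"],
                      {"nf1": "offline", "nf2": "online"}) == "nf2"
# No online NF: nothing to redirect to.
assert pick_active_nf(["nf1"], {}) is None
```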
>> >> Ovn-controller performs the health monitoring by sending an ICMP echo > request with the source IP and MAC from NB global options “svc_monitor_ip4” > and “svc_monitor_mac”, and the destination IP and MAC from new NB global > options “svc_monitor_ip4_dst” and “svc_monitor_mac_dst”. The sequence > number and id are randomly generated and stored in service_mon. The NF > VM forwards the same packet out of the other port. When it comes out, > ovn-controller matches the sequence number and id with the stored values and > marks the NF online if they match. >> >> V1: >> - First patch. >> >> V2: >> - Rebased code. >> - Added "mode" field in Network_function_group table, with only allowed >> value as "inline". This is for future expansion to include > "mirror" mode. >> - Added a flow in the in_network_function and out_network_function > table to >> skip redirection of multicast traffic. >> >> V3: >> - Rebased code. >> >> V4: >> - Rebased code. >> >> V5: >> - Fixed issue of packet duplication for NF on overlay. Added a check >> in physical/controller.c so that only for the localnet case the new flows >> in table 44 (that tunnel packets back to the source host based on the tunnel >> i/f stored in CT) get installed. >> - Fixed packet drop issue for NF on overlay. This happens when port1 > is on >> host1, and port2 & NF are on host2; packets sent from port2 to port1 >> are redirected to the NF port as intended but it is not reaching host2. >> This is because OVS drops packets that are sent to the same port on which >> they were received (in this case the packet is received on host1 on >> a tunnel interface and being forwarded to the same tunnel). Updated >> the table 43 rule for remote overlay ports to clear in_port to get >> around this. >> - Added unit tests in ovn.at to cover various multi-chassis scenarios >> for both VLAN and overlay cases. >> >> V6: >> - Addressed review comments from Numan Siddique. >> - Updated ovn-northd.8.xml and ovn-nbctl.8.xml. 
>> - Unit test added for health monitoring of network functions.
>> - System test added for the single-chassis case. Multi-chassis ST to be
>>   added later. The multi-chassis cases were already covered in ovn.at.
>>
>> V7:
>> - Rebased.
>> - Fixed unit test failures resulting from the new IC SB service monitor
>>   feature.
>>
>> V8:
>> - Addressed review comments from Dumitru Ceara. Main changes are listed
>>   below.
>> - NFG unique id is no longer generated internally, but is configured by
>>   the CMS.
>> - NFG register changed from reg5[0..7] to reg0[22..29].
>> - IPv6 support for health probes.
>> - Updates to NEWS and various xml files for proper documentation.
>>
>> V9:
>> - Addressed new review comments from Dumitru Ceara. Main changes are as
>>   below.
>> - Changed isRoot from true to false for Network_Function_Health_Check.
>> - Changed the behaviour of --may-exist from no-op to update.
>> - In nbctl commands, replaced "network-function" with "nf" and
>>   "network-function-group" with "nfg".
>> - The change related to skipping the clearing of the tunnel id part of
>>   reg5 is now done only for NF datapaths.
>> - Fixed the intermittent test failures earlier caused by patch 5, by
>>   sorting the svc_macs.
>> - Fixed backward compatibility with respect to the "type" field in
>>   service_monitor.
>> - New system test added for the case without health_check.
>> - Replaced the scoring mechanism for NF election with a deterministic
>>   method.
>> - Moved some test and xml changes to other patches to align with the
>>   changes.
>> - Added a few TODO items. Split the NEWS change across multiple patches.
>>
>
> Hi Sragdhara, Naveen, Karthik, Numan,
>
> Thank you for this new revision and for the reviews. Also thanks for
> bearing with us while trying to get this new feature into OVN! I know
> it's been a challenge.
>
> Overall I didn't see any major issues in the series. There's quite a
> lot of code being added/changed so I'm quite sure we missed some
> things.
However, we have quite a lot of time to address any of those
> in the remainder of the 26.03 development cycle.
>
> I did apply some small changes to the last 3 patches of the series,
> mostly style issues, and then pushed the series to main.
>
> But I also have a few items I'd like to follow up on. It would
> be great if you could set some time aside to look into them. Please
> see below, listed under the corresponding patches.
>
>> Sragdhara Datta Chaudhuri (5):
>>   ovn-nb: Network Function insertion OVN-NB schema changes.
>>   ovn-nbctl: Network Function insertion commands.
>>   northd, tests: Network Function insertion logical flow programming.
>>   controller, tests: Network Function insertion tunneling of cross-host
>>     VLAN traffic.
>
> This patch changes the PHY_TO_LOG() table to automatically load the
> "tun intf id" for decapsulated packets into a "random" register,
> reg5[16..31]. I know we also define a logical field for these 16 bits
> of reg5:
>
> #define MFF_LOG_TUN_OFPORT MFF_REG5 /* 16..31 of the 32 bits */
>
> But that seems a bit error prone. We also (for NF traffic) skip
> clearing these bits when going from the logical ingress pipeline to
> the logical egress one.
>
> I think it might make sense to have a dedicated (set of) registers
> that are preserved between logical pipelines. The problem is that we
> are currently out of OVS registers that we can use:
>
> /* Logical fields.
>  *
>  * These values are documented in ovn-architecture(7), please update the
>  * documentation if you change any of them. */
> #define MFF_LOG_DATAPATH MFF_METADATA /* Logical datapath (64 bits). */
> #define MFF_LOG_FLAGS    MFF_REG10    /* One of MLF_* (32 bits). */
> #define MFF_LOG_DNAT_ZONE MFF_REG11   /* conntrack dnat zone for gateway
>                                        * router (32 bits). */
> #define MFF_LOG_SNAT_ZONE MFF_REG12   /* conntrack snat zone for gateway
>                                        * router (32 bits). */
> #define MFF_LOG_CT_ZONE   MFF_REG13   /* Logical conntrack zone for lports
>                                        * (0..15 of the 32 bits).
>                                        */
> #define MFF_LOG_ENCAP_ID  MFF_REG13   /* Encap ID for lports
>                                        * (16..31 of the 32 bits). */
> #define MFF_LOG_INPORT    MFF_REG14   /* Logical input port (32 bits). */
> #define MFF_LOG_OUTPORT   MFF_REG15   /* Logical output port (32 bits). */
>
> I had a quick glance at OVS' code and I _think_ there should be no
> major issue if we add another xxreg (i.e., 2 extra xregs / 4 extra regs).
>
> We're also running low on available bits in the logical flags register
> (that's also preserved between ingress and egress) so having some extra
> register storage space to work with would make our lives easier. The
> impact of adding new registers should be relatively minimal (I guess) as
> it would only potentially increase the upcall handling time, but likely
> by a marginal amount.
>
> I'm CC-ing Ilya to see if such a change would be acceptable in OVS.
>
>> northd, controller, tests: Network Function Health monitoring.
>
> The new system test in this patch blindly kills any "random" tcpdump
> instance that might be running on the host where the tests are being
> executed. We should probably improve this test to use
> NETNS_START_TCPDUMP() or something similar.
>
> Also, with this last patch, I saw the "622: Network function packet
> flow - outbound" test fail in GitHub CI a couple of times. I couldn't
> reproduce the problem locally, and after a rebase it seems to pass. But
> it would be great if you could monitor this test and fix it in the
> future if it starts failing again.
>
> Thanks again for all the hard work on this new feature!
>
> Best regards,
> Dumitru
>
> _______________________________________________
> dev mailing list
> [email protected]
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
