On 10/7/25 8:36 PM, Sragdhara Datta Chaudhuri wrote:
> Hi Dumitru,
> 

Hi Sragdhara,

> Thanks so much for your detailed review over multiple iterations.
> 
> Numan, thank you as well for your valuable inputs.
> 
> Regarding the points you mentioned in this email:
> 
>   * Agree that dedicated registers would be good. I picked reg5 as no
>     other registers were completely free. The lack of registers would
>     hamper other features in the future as well.
>   * We’ll take a look at the system test tcpdump issue and fix it in a
>     future patch.
>   * We noticed the test 622 (Network function packet flow outbound)
>     failure but couldn’t reproduce it in-house even after running it in
>     a loop. We’ll look into it.
> 

Thank you!  Let me know if you need help reproducing this in CI.

Regards,
Dumitru

> 
> Thanks,
> Sragdhara
> 
> *From: *Dumitru Ceara <[email protected]>
> *Date: *Friday, October 3, 2025 at 2:01 AM
> *To: *Sragdhara Datta Chaudhuri <[email protected]>, ovs-
> [email protected] <[email protected]>
> *Cc: *Numan Siddique <[email protected]>, Naveen Yerramneni
> <[email protected]>, Karthik Chandrashekar
> <[email protected]>, Ilya Maximets <[email protected]>
> *Subject: *Re: [ovs-dev] [PATCH OVN v9 0/5] Network Function Insertion.
> 
> On 9/15/25 9:34 PM, Sragdhara Datta Chaudhuri wrote:
>> RFC: NETWORK FUNCTION INSERTION IN OVN
>>
>> 1. Introduction
>> ================
>> The objective is to insert a Network Function (NF) in the path of
> outbound/inbound traffic from/to a port-group. The use case is to
> integrate a 3rd party service in the path of traffic. An example of such
> a service would be a layer7 firewall. The NF VM will be like a bump in
> the wire and should not modify the packet, i.e. the IP header, the MAC
> addresses, VLAN tag, and sequence numbers remain unchanged.
>>
>> Here are some of the highlights:
>> - A new entity network-function (NF) has been introduced. It contains
> a pair of LSPs. The CMS would designate one as “inport” and the other as
> “outport”.
>> - For high-availability, a network function group (NFG) entity
> consists of a group of NFs. Only one NF in an NFG has an active role,
> based on health monitoring.
>> - ACL would accept NFG as a parameter and traffic matching the ACL
> would be redirected to the associated active NF’s port. NFG is accepted
> for stateful allow action only.
>> - The ACL’s port-group is the point of reference when defining the
> role of the NF ports. The “inport” is the port closer to the port-group
> and “outport” is the one away from it. For from-lport ACLs, the request
> packets would be redirected to the NF “inport” and for to-lport ACLs,
> the request packets would be redirected to NF “outport”. When the same
> packet comes out of the other NF port, it gets simply forwarded.
>> - Statefulness will be maintained, i.e. the response traffic will also
> go through the same pair of NF ports but in reverse order.
>> - For the NF ports we need to disable port security check, fdb
> learning and multicast/broadcast forwarding.
>> - Health monitoring involves ovn-controller periodically injecting
> ICMP probe packets into the NF inport and monitoring the same packet
> coming out of the NF outport.
>> - If the traffic redirection involves cross-host traffic (e.g. for a
> from-lport ACL, if the source VM and NF VM are on different hosts),
> packets would be tunneled to and from the NF VM's host.
>> - If the port-group to which the ACL is being applied has members
> spread across multiple LSs, CMS needs to create child ports for the NF
> ports on each of these LSs. The redirection rules in each LS will use
> the child ports on that LS.
>>
>> 2. NB tables
>> =============
>> New NB tables
>> —------------
>> Network_Function: Each row contains {inport, outport, health_check}
>> Network_Function_Group: Each row contains a list of Network_Function
> entities. It also contains a unique id (between 1 and 255) and a
> reference to the current active NF.
>> Network_Function_Health_Check: Each row contains configuration for
> probes in options field: {interval, timeout, success_count, failure_count}
>>
>>         "Network_Function_Health_Check": {
>>             "columns": {
>>                 "name": {"type": "string"},
>>                 "options": {
>>                      "type": {"key": "string",
>>                               "value": "string",
>>                               "min": 0,
>>                               "max": "unlimited"}},
>>                 "external_ids": {
>>                     "type": {"key": "string", "value": "string",
>>                              "min": 0, "max": "unlimited"}}},
>>             "isRoot": true},
>>         "Network_Function": {
>>             "columns": {
>>                 "name": {"type": "string"},
>>                 "outport": {"type": {"key": {"type": "uuid",
>>                                              "refTable":
> "Logical_Switch_Port",
>>                                              "refType": "strong"},
>>                                      "min": 1, "max": 1}},
>>                 "inport": {"type": {"key": {"type": "uuid",
>>                                             "refTable":
> "Logical_Switch_Port",
>>                                             "refType": "strong"},
>>                                     "min": 1, "max": 1}},
>>                 "health_check": {"type": {
>>                     "key": {"type": "uuid",
>>                             "refTable": "Network_Function_Health_Check",
>>                             "refType": "strong"},
>>                     "min": 0, "max": 1}},
>>                 "external_ids": {
>>                     "type": {"key": "string", "value": "string",
>>                              "min": 0, "max": "unlimited"}}},
>>             "isRoot": true},
>>         "Network_Function_Group": {
>>             "columns": {
>>                 "name": {"type": "string"},
>>                 "network_function": {"type":
>>                                   {"key": {"type": "uuid",
>>                                            "refTable": "Network_Function",
>>                                            "refType": "strong"},
>>                                            "min": 0, "max": "unlimited"}},
>>                 "mode": {"type": {"key": {"type": "string",
>>                                           "enum": ["set", ["inline"]]}}},
>>                 "network_function_active": {"type":
>>                                   {"key": {"type": "uuid",
>>                                            "refTable": "Network_Function",
>>                                            "refType": "strong"},
>>                                            "min": 0, "max": 1}},
>>                 "id": {
>>                      "type": {"key": {"type": "integer",
>>                                       "minInteger": 0,
>>                                       "maxInteger": 255}}},
>>                 "external_ids": {
>>                     "type": {"key": "string", "value": "string",
>>                              "min": 0, "max": "unlimited"}}},
>>             "isRoot": true},
>>
>>
>> Modified NB table
>> —----------------
>> ACL: The ACL entity would have a new optional field that is a
> reference to a Network_Function_Group entity. This field can be present
> only for stateful allow ACLs.
>>
>>         "ACL": {
>>             "columns": {
>>                 "network_function_group": {"type": {"key": {"type":
> "uuid",
>>                                            "refTable":
> "Network_Function_Group",
>>                                            "refType": "strong"},
>>                                            "min": 0,
>>                                            "max": 1}},
>>
>> New options for Logical_Switch_Port
>> —----------------------------------
>> receive_multicast=<boolean>: Default true. If set to false, LS will
> not forward broadcast/multicast traffic to this port. This is to prevent
> looping of such packets.
>>
>> lsp_learn_mac=<boolean>: Default true. If set to false, fdb learning
> will be skipped for packets coming out of this port. Redirected packets
> from the NF port would be carrying the originating VM’s MAC in source,
> and so learning should not happen.
>>
>> CMS needs to set both the above options to false for NF ports, in
> addition to disabling port security.
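>>
>> For example (an illustrative sketch using the generic ovn-nbctl db
>> commands; the port names match the sample configuration below), port
>> security can be cleared with:
>>
>>   ovn-nbctl clear logical_switch_port nfp1 port_security
>>   ovn-nbctl clear logical_switch_port nfp2 port_security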
>>
>> nf-linked-port=<lsp-name>: Each NF port needs to have this set to the
> other NF port of the pair.
>>
>> is-nf=<boolean>: Each NF port needs to have this set to true.
>>
>> New NB_global options
>> —--------------------
>> svc_monitor_mac_dst: destination MAC of probe packets (svc_monitor_mac
> is already there and will be used as source MAC)
>> svc_monitor_ip: source IP of probe packets
>> svc_monitor_ip_dst: destination IP of probe packets
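>>
>> For instance, these could be set with the generic ovn-nbctl set command
>> (a sketch; the MAC and IP values here are purely illustrative):
>>
>>   ovn-nbctl set NB_Global . \
>>       options:svc_monitor_mac_dst="0a:00:00:a8:01:01" \
>>       options:svc_monitor_ip="10.0.0.10" \
>>       options:svc_monitor_ip_dst="10.0.0.11"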
>>
>> Sample configuration
>> —-------------------
>> ovn-nbctl ls-add ls1
>> ovn-nbctl lsp-add ls1 nfp1
>> ovn-nbctl lsp-add ls1 nfp2
>> ovn-nbctl set logical_switch_port nfp1 options:receive_multicast=false
> options:lsp_learn_mac=false options:nf-linked-port=nfp2
>> ovn-nbctl set logical_switch_port nfp2 options:receive_multicast=false
> options:lsp_learn_mac=false options:nf-linked-port=nfp1
>> ovn-nbctl nf-add nf1 nfp1 nfp2
>> ovn-nbctl nfg-add nfg1 123 inline nf1
>> ovn-nbctl lsp-add ls1 p1 -- lsp-set-addresses p1 "50:6b:8d:3e:ed:c4
> 10.1.1.4"
>> ovn-nbctl pg-add pg1 p1
>> ovn-nbctl create Address_Set name=as1 addresses=10.1.1.4
>> ovn-nbctl lsp-add ls1 p2 -- lsp-set-addresses p2 "50:6b:8d:3e:ed:c5
> 10.1.1.5"
>> ovn-nbctl create Address_Set name=as2 addresses=10.1.1.5
>> ovn-nbctl acl-add pg1 from-lport 200 'inport==@pg1 && ip4.dst == $as2'
> allow-related nfg1
>> ovn-nbctl acl-add pg1 to-lport 100 'outport==@pg1 && ip4.src == $as2'
> allow-related nfg1
>>
>> 3. SB tables
>> ============
>> Service_Monitor:
>> This is currently used by Load balancer. New fields are: “type” - to
> indicate LB or NF, “mac” - the destination MAC address for monitor
> packets, “logical_input_port” - the LSP to which the probe packet would
> be sent. Also, “icmp” has been added as a protocol type, used only for NF.
>>
>>          "Service_Monitor": {
>>              "columns": {
>>                "type": {"type": {"key": {
>>                           "type": "string",
>>                           "enum": ["set", ["load-balancer",
>>                                            "network-function"]]}}},
>>                "mac": {"type": "string"},
>>                  "protocol": {
>>                      "type": {"key": {"type": "string",
>>                             "enum": ["set", ["tcp", "udp", "icmp"]]},
>>                               "min": 0, "max": 1}},
>>                "logical_input_port": {"type": "string"},
>>
>> northd would create one Service_Monitor entity for each NF. The
> logical_input_port and logical_port would be populated from the NF
> inport and outport fields respectively. The probe packets would be
> injected into the logical_input_port and would be monitored out of
> logical_port.
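>>
>> As an illustration, the record northd would create for nf1 from the
>> sample configuration above might look roughly like this (trimmed to the
>> relevant columns; the values are hypothetical):
>>
>>   $ ovn-sbctl list Service_Monitor
>>   type                : network-function
>>   protocol            : icmp
>>   logical_input_port  : "nfp1"
>>   logical_port        : "nfp2"
>>   mac                 : "<svc_monitor_mac_dst value>"
>>   status              : online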
>>
>> 4. Logical Flows
>> ================
>> Logical Switch ingress pipeline:
>> - in_network_function added after in_stateful.
>> - Modifications to in_acl_eval, in_stateful and in_l2_lookup.
>> Logical Switch egress pipeline:
>> - out_network_function added after out_stateful.
>> - Modifications to out_pre_acl, out_acl_eval and out_stateful.
>>
>> 4.1 from-lport ACL
>> ------------------
>> The diagram shows the request path for packets from VM1 port p1, which
> is a member of the pg to which ACL is applied. The response would follow
> the reverse path, i.e. packet would be redirected to nfp2 and come out
> of nfp1 and be forwarded to p1.
>> Also, p2 does not need to be on the same LS. Only p1, nfp1 and nfp2
> need to be on the same LS.
>>
>>       -----                  -------                  -----
>>      | VM1 |                | NF VM |                | VM2 |
>>       -----                  -------                  -----
>>         |                    /\    |                   / \
>>         |                    |     |                    |
>>        \ /                   |    \ /                   |
>>    ------------------------------------------------------------
>>   |     p1                 nfp1  nfp2                   p2     |
>>   |                                                            |
>>   |                      Logical Switch                        |
>>    -------------------------------------------------------------
>> pg1: [p1]         as2: [p2-ip]
>> ovn-nbctl nf-add nf1 nfp1 nfp2
>> ovn-nbctl nfg-add nfg1 123 inline nf1
>> ovn-nbctl acl-add pg1 from-lport 200 'inport==@pg1 && ip4.dst == $as2'
> allow-related nfg1
>>
>> The request packets from p1 matching a from-lport ACL with NFG are
> redirected to nfp1 and the NFG id is committed to the ct label in p1's
> zone. When the same packet comes out of nfp2 it gets forwarded the
> normal way.
>> Response packets have p1's MAC as destination. Ingress processing sets
> the outport to p1, the CT lookup in the egress pipeline (in p1's ct zone)
> yields the NFG id, and the packet is injected back into the ingress
> pipeline after setting the outport to nfp2.
>>
>> Below are the changes in detail.
>>
>> 4.1.1 Request processing
>> ------------------------
>>
>> in_acl_eval: For from-lport ACLs with NFG, the existing rule's action
> has been enhanced to set:
>>  - reg8[21] = 1: to indicate that packet has matched a rule with NFG
>>  - reg0[22..29] = <NFG-unique-id>
>>  - reg8[22] = <direction> (1: request, 0: response)
>>
>>   table=8 (ls_in_acl_eval), priority=1200 , match=(reg0[7] == 1 &&
> (inport==@pg1 && ip4.dst == $as2)), action=(reg8[16] = 1; reg0[1] = 1;
> reg8[21] = 1; reg8[22] = 1; reg0[22..29] = 123; next;)
>>   table=8 (ls_in_acl_eval), priority=1200 , match=(reg0[8] == 1 &&
> (inport==@pg1 && ip4.dst == $as2)), action=(reg8[16] = 1; reg8[21] = 1;
> reg8[22] = 1; reg0[22..29] = 123; next;)
>>
>> in_stateful: Priority 110: set NFG id in CT label if reg8[21] is set.
>>  - bit 7 (ct_label.network_function_group): Set to 1 to indicate NF
> insertion.
>>  - bits 17 to 24 (ct_label.network_function_group_id): Stores the 8-bit
> NFG id.
>>
>>   table=21(ls_in_stateful     ), priority=110  , match=(reg0[1] == 1
> && reg0[13] == 0 && reg8[21] == 1), action=(ct_commit { ct_mark.blocked
> = 0; ct_mark.allow_established = reg0[20]; ct_label.acl_id =
> reg2[16..31]; ct_label.network_function_group = 1;
> ct_label.network_function_group_id = reg0[22..29]; }; next;)
>>   table=21(ls_in_stateful     ), priority=110  , match=(reg0[1] == 1
> && reg0[13] == 1 && reg8[21] == 1), action=(ct_commit { ct_mark.blocked
> = 0; ct_mark.allow_established = reg0[20]; ct_mark.obs_stage =
> reg8[19..20]; ct_mark.obs_collector_id = reg8[8..15];
> ct_label.obs_point_id = reg9; ct_label.acl_id = reg2[16..31];
> ct_label.network_function_group = 1; ct_label.network_function_group_id
> = reg0[22..29]; }; next;)
>>   table=21(ls_in_stateful     ), priority=100  , match=(reg0[1] == 1
> && reg0[13] == 0), action=(ct_commit { ct_mark.blocked = 0;
> ct_mark.allow_established = reg0[20]; ct_label.acl_id = reg2[16..31];
> ct_label.network_function_group = 0; ct_label.network_function_group_id
> = 0; }; next;)
>>   table=21(ls_in_stateful     ), priority=100  , match=(reg0[1] == 1
> && reg0[13] == 1), action=(ct_commit { ct_mark.blocked = 0;
> ct_mark.allow_established = reg0[20]; ct_mark.obs_stage = reg8[19..20];
> ct_mark.obs_collector_id = reg8[8..15]; ct_label.obs_point_id = reg9;
> ct_label.acl_id = reg2[16..31]; ct_label.network_function_group = 0;
> ct_label.network_function_group_id = 0; }; next;)
>>   table=21(ls_in_stateful     ), priority=0    , match=(1), action=(next;)
>>
>>
>> For non-NFG cases, the existing priority 100 rules will be hit. There,
> an additional action has been added to clear the NFG bits in the ct label.
>>
>> in_network_function: A new stage with priority 99 rules to redirect
> packets by setting outport to the NF “inport” (or its child port) based
> on the NFG id set by the prior ACL stage.
>> Priority 100 rules ensure that when the same packets come out of the
> NF ports, they are not redirected again (the setting of reg5 here
> relates to the cross-host packet tunneling and will be explained later).
>> Priority 1 rule: if reg8[21] is set, but the NF port (or child port)
> is not present on this LS, drop packets.
>>
>>   table=22(ls_in_network_function), priority=100  , match=(inport ==
> "nfp1"), action=(reg5[16..31] = ct_label.tun_if_id; next;)
>>   table=22(ls_in_network_function), priority=100  , match=(inport ==
> "nfp2"), action=(reg5[16..31] = ct_label.tun_if_id; next;)
>>   table=22(ls_in_network_function), priority=100  , match=(reg8[21] ==
> 1 && eth.mcast), action=(next;)
>>   table=22(ls_in_network_function), priority=99   , match=(reg8[21] ==
> 1 && reg8[22] == 1 && reg0[22..29] == 123), action=(outport = "nfp1";
> output;)
>>   table=22(ls_in_network_function), priority=1    , match=(reg8[21] ==
> 1), action=(drop;)
>>   table=22(ls_in_network_function), priority=0    , match=(1),
> action=(next;)
>>
>>
>> 4.1.2 Response processing
>> -------------------------
>> out_acl_eval: High priority rules that allow response and related
> packets to go through have been enhanced to also copy CT label NFG bit
> into reg8[21].
>>
>>   table=6(ls_out_acl_eval), priority=65532, match=(!ct.est && ct.rel
> && !ct.new && !ct.inv && ct_mark.blocked == 0), action=(reg8[21] =
> ct_label.network_function_group; reg8[16] = 1; ct_commit_nat;)
>>   table=6(ls_out_acl_eval), priority=65532, match=(ct.est && !ct.rel
> && !ct.new && !ct.inv && ct.rpl && ct_mark.blocked == 0),
> action=(reg8[21] = ct_label.network_function_group; reg8[16] = 1; next;)
>>
>> out_network_function: The priority 99 rule matches on the nfg_id in
> ct_label and sets the outport to the NF “outport”. It also sets
> reg8[23]=1 and injects the packet into the ingress pipeline (in_l2_lookup).
>> Priority 100 rules forward all packets destined to NF ports to the next
> table.
>>
>>   table=11 (ls_out_network_function), priority=100  , match=(outport
> == "nfp1"), action=(next;)
>>   table=11 (ls_out_network_function), priority=100  , match=(outport
> == "nfp2"), action=(next;)
>>   table=11(ls_out_network_function), priority=100  , match=(reg8[21]
> == 1 && eth.mcast), action=(next;)
>>   table=11 (ls_out_network_function), priority=99   , match=(reg8[21]
> == 1 && reg8[22] == 0 && ct_label.network_function_group_id == 123),
> action=(outport = "nfp2"; reg8[23] = 1; next(pipeline=ingress, table=29);)
>>   table=11 (ls_out_network_function), priority=1    , match=(reg8[21]
> == 1), action=(drop;)
>>   table=11 (ls_out_network_function), priority=0    , match=(1),
> action=(next;)
>>
>> in_l2_lkup: if reg8[23] == 1 (packet has come back from egress),
> simply forward such packets as outport is already set.
>>
>>   table=29(ls_in_l2_lkup), priority=100  , match=(reg8[23] == 1),
> action=(output;)
>>
>> The above set of rules ensures that the response packet is sent to
> nfp2. When the same packet comes out of nfp1, the ingress pipeline would
> set the outport to p1 and it enters the egress pipeline.
>>
>> out_pre_acl: If the packet is coming from the NF inport, skip the
> egress pipeline up to and including the out_network_function stage, as
> the packet has already gone through it and we don't want the same packet
> to be processed by CT twice.
>>   table=2 (ls_out_pre_acl     ), priority=110  , match=(inport ==
> "nfp1"), action=(next(pipeline=egress, table=12);)
>>
>>
>> 4.2 to-lport ACL
>> ----------------
>>       -----                  --------                  -----
>>      | VM1 |                |  NF VM |                | VM2 |
>>       -----                  --------                  -----
>>        / \                    |   / \                    |
>>         |                     |    |                     |
>>         |                    \ /   |                    \ /
>>    -------------------------------------------------------------
>>   |     p1                  nfp1   nfp2                  p2     |
>>   |                                                             |
>>   |                      Logical Switch                         |
>>    -------------------------------------------------------------
>> ovn-nbctl acl-add pg1 to-lport 100 'outport==@pg1 && ip4.src == $as2'
> allow-related nfg1
>> The diagram shows the request traffic path. The response will follow
> the reverse path.
>>
>> The ingress pipeline sets the outport to p1 based on destination MAC
> lookup. The packet enters the egress pipeline. There the to-lport ACL
> with NFG gets evaluated and the NFG id gets committed to the CT label.
> The outport is then set to nfp2 and the packet is injected back to
> ingress. When the same packet comes out of nfp1, it gets forwarded to p1
> the normal way.
>> For the response packet from p1, the ingress pipeline gets the NFG id
> from the CT label and accordingly redirects it to nfp1. When it comes out
> of nfp2 it is forwarded the normal way.
>>
>> 4.2.1 Request processing
>> ------------------------
>> out_acl_eval: For to-lport ACLs with NFG, the existing rule's action
> has been enhanced to set:
>>  - reg8[21] = 1: to indicate that packet has matched a rule with NFG
>>  - reg0[22..29] = <NFG-unique-id>
>>  - reg8[22] = <direction> (1: request, 0: response)
>>
>>   table=6 (ls_out_acl_eval    ), priority=1100 , match=(reg0[7] == 1
> && (outport==@pg1 && ip4.src == $as2)), action=(reg8[16] = 1; reg0[1] =
> 1; reg8[21] = 1; reg8[22] = 1; reg0[22..29] = 123; next;)
>>   table=6 (ls_out_acl_eval    ), priority=1100 , match=(reg0[8] == 1
> && (outport==@pg1 && ip4.src == $as2)), action=(reg8[16] = 1; reg0[1] =
> 1; reg8[21] = 1; reg8[22] = 1; reg0[22..29] = 123; next;)
>>
>>
>>
>> out_stateful: Priority 110: set NFG id in CT label if reg8[21] is set.
>>
>>   table=10(ls_out_stateful    ), priority=110  , match=(reg0[1] == 1
> && reg0[13] == 0 && reg8[21] == 1), action=(ct_commit { ct_mark.blocked
> = 0; ct_mark.allow_established = reg0[20]; ct_label.acl_id =
> reg2[16..31]; ct_label.network_function_group = 1;
> ct_label.network_function_group_id = reg0[22..29]; }; next;)
>>   table=10(ls_out_stateful    ), priority=110  , match=(reg0[1] == 1
> && reg0[13] == 1 && reg8[21] == 1), action=(ct_commit { ct_mark.blocked
> = 0; ct_mark.allow_established = reg0[20]; ct_mark.obs_stage =
> reg8[19..20]; ct_mark.obs_collector_id = reg8[8..15];
> ct_label.obs_point_id = reg9; ct_label.acl_id = reg2[16..31];
> ct_label.network_function_group = 1; ct_label.network_function_group_id
> = reg0[22..29]; }; next;)
>>   table=10(ls_out_stateful    ), priority=100  , match=(reg0[1] == 1
> && reg0[13] == 0), action=(ct_commit { ct_mark.blocked = 0;
> ct_mark.allow_established = reg0[20]; ct_label.acl_id = reg2[16..31];
> ct_label.network_function_group = 0; ct_label.network_function_group_id
> = 0; }; next;)
>>   table=10(ls_out_stateful    ), priority=100  , match=(reg0[1] == 1
> && reg0[13] == 1), action=(ct_commit { ct_mark.blocked = 0;
> ct_mark.allow_established = reg0[20]; ct_mark.obs_stage = reg8[19..20];
> ct_mark.obs_collector_id = reg8[8..15]; ct_label.obs_point_id = reg9;
> ct_label.acl_id = reg2[16..31]; ct_label.network_function_group = 0;
> ct_label.network_function_group_id = 0; }; next;)
>>   table=10(ls_out_stateful    ), priority=0    , match=(1), action=(next;)
>>
>> out_network_function: A new stage that has priority 99 rules to
> redirect packets by setting the outport to the NF “outport” (or its
> child port) based on the NFG id set by the prior ACL stage, and then
> injecting back to ingress. Priority 100 rules ensure that when the
> packets are going to NF ports, they are not redirected again.
>> Priority 1 rule: if reg8[21] is set, but the NF port (or child port)
> is not present on this LS, drop packets.
>>
>>   table=11(ls_out_network_function), priority=100  , match=(outport ==
> "nfp1"), action=(next;)
>>   table=11(ls_out_network_function), priority=100  , match=(outport ==
> "nfp2"), action=(next;)
>>   table=11(ls_out_network_function), priority=100  , match=(reg8[21]
> == 1 && eth.mcast), action=(next;)
>>   table=11(ls_out_network_function), priority=99   , match=(reg8[21]
> == 1 && reg8[22] == 1 && reg0[22..29] == 123), action=(outport = "nfp2";
> reg8[23] = 1; next(pipeline=ingress, table=29);)
>>   table=11(ls_out_network_function), priority=1    , match=(reg8[21]
> == 1), action=(drop;)
>>   table=11(ls_out_network_function), priority=0    , match=(1),
> action=(next;)
>>
>>
>> in_l2_lkup: As described earlier, the priority 100 rule will forward
> these packets.
>>
>> Then the same packet comes out of nfp1 and goes through the ingress
> processing where the outport gets set to p1. The egress pipeline
> out_pre_acl priority 110 rule described earlier matches against inport
> nfp1 and directly jumps to the stage after out_network_function. Thus
> the packet is not redirected again.
>>
>> 4.2.2 Response processing
>> -------------------------
>> in_acl_eval: High priority rules that allow response and related
> packets to go through have been enhanced to also copy CT label NFG bit
> into reg8[21].
>>
>>   table=8(ls_in_acl_eval), priority=65532, match=(!ct.est && ct.rel
> && !ct.new && !ct.inv && ct_mark.blocked == 0), action=(reg0[17] = 1;
> reg8[21] = ct_label.network_function_group; reg8[16] = 1; ct_commit_nat;)
>>   table=8 (ls_in_acl_eval), priority=65532, match=(ct.est && !ct.rel
> && !ct.new && !ct.inv && ct.rpl && ct_mark.blocked == 0),
> action=(reg0[9] = 0; reg0[10] = 0; reg0[17] = 1; reg8[21] =
> ct_label.network_function_group; reg8[16] = 1; next;)
>>
>> in_network_function: The priority 99 rule matches on the nfg_id in
> ct_label and sets the outport to the NF “inport”.
>> Priority 100 rules forward all packets coming from NF ports to the next
> table.
>>   table=22(ls_in_network_function), priority=99   , match=(reg8[21] ==
> 1 && reg8[22] == 0 && ct_label.network_function_group_id == 123),
> action=(outport = "nfp1"; output;)
>>
>>
>> 5. Cross-host Traffic for VLAN Network
>> ======================================
>> For overlay subnets, all cross-host traffic exchanges are tunneled. In
> the case of VLAN subnets, there needs to be special handling to
> selectively tunnel only the traffic to or from the NF ports.
>> Take the example of a from-lport ACL. Packets from p1 to p2 get
> redirected to nfp1 on host1. If this packet is simply sent out from
> host1, the physical network will directly forward it to host2 where VM2
> is. So, we need to tunnel the redirected packets from host1 to host3.
> Now, once the packets come out of nfp2, if host3 sends the packets out,
> the physical network would learn p1's MAC coming from host3. So, these
> packets need to be tunneled back to host1. From there the packet would
> be forwarded to VM2 via the physical network.
>>
>>       -----                  -----                  --------
>>      | VM2 |                | VM1 |                | NF VM  |
>>       -----                  -----                  --------
>>        / \                     |                    / \   |
>>         | (7)                  |  (1)             (3)|    |(4)
>>         |                     \ /                    |   \ /
>>   --------------        --------------   (2)    ---------------
>>  |      p2      |  (6) |      p1      |______\ |   nfp1  nfp2  |
>>  |              |/____ |              |------/ |               |
>>  |    host2     |\     |     host1    |/______ |     host3     |
>>  |              |      |              |\------ |               |
>>   --------------        --------------   (5)    ---------------
>>
>> The above figure shows the request packet path for a from-lport ACL.
> The response would follow the same path in the reverse direction.
>>
>> To achieve this, the following would be done:
>>
>> On host where the ACL port group members are present (host1)
>> —-----------------------------------------------------------
>> REMOTE_OUTPUT (table 42):
>> Currently, it tunnels traffic destined to all non-local overlay ports
> to their associated hosts. The same rule is now also added for traffic
> to non-local NF ports. Thus the packets from p1 get tunneled to host3.
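>>
>> In the match/action pseudo notation used later in this section, the
>> added rule is conceptually (a sketch, not the exact OpenFlow):
>>   match: outport == <non-local NF port>
>>   action: output to the tunnel towards the chassis hosting the NF port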
>>
>> On host with NF (host3) forward packet to nfp1
>> —----------------------------------------------
>> Upon reaching host3, the following rules come into play:
>> PHY_TO_LOG (table 0):
>> Priority 100: Existing rule - for each geneve tunnel interface on the
> chassis, it copies info from the header to the inport, outport and
> metadata registers. Now the same rule also stores the tunnel interface
> id in a register (reg5[16..31]).
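>>
>> Conceptually, in the same pseudo notation (a sketch, not the exact
>> OpenFlow actions):
>>   match: in_port == <geneve tunnel ofport>
>>   action: move tunnel metadata into the datapath/inport/outport
>>           registers; reg5[16..31] = <tunnel ofport>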
>>
>> CHECK_LOOPBACK (table 44)
>> This table has a rule that clears all the registers. The change is to
> skip the clearing of reg5[16..31].
>>
>> Logical egress pipeline:
>>
>> ls_out_stateful priority 120: If the outport is an NF port, copy
> reg5[16..31] (table 0 had set it) to ct_label.tun_if_id.
>>
>>   table=10(ls_out_stateful    ), priority=120  , match=(outport ==
> "nfp1" && reg0[13] == 0), action=(ct_commit { ct_mark.blocked = 0;
> ct_mark.allow_established = reg0[20]; ct_label.acl_id = reg2[16..31];
> ct_label.tun_if_id = reg5[16..31]; }; next;)
>>   table=10(ls_out_stateful    ), priority=120  , match=(outport ==
> "nfp1" && reg0[13] == 1), action=(ct_commit { ct_mark.blocked = 0;
> ct_mark.allow_established = reg0[20]; ct_label.acl_id = reg2[16..31];
> ct_mark.obs_stage = reg8[19..20]; ct_mark.obs_collector_id =
> reg8[8..15]; ct_label.obs_point_id = reg9; ct_label.tun_if_id =
> reg5[16..31]; }; next;)
>>
>> The above sequence of flows ensures that if a packet is received via
> tunnel on host3, with outport as nfp1, the tunnel interface id is
> committed to the ct entry in nfp1's zone.
>>
>> On host with NF (host3) tunnel packets from nfp2 back to host1
>> —--------------------------------------------------------------
>> When the same packet comes out of nfp2 on host3:
>>
>> LOCAL_OUTPUT (table 43)
>> When the packet comes out of the other NF port (nfp2), the following
> two rules send it back to the host that it originally came from:
>>
>> Priority 110: For each NF port local to this host, the following rule
> processes the packet through the CT of the linked port (for nfp2, it is
> nfp1):
>>   match: inport==nfp2 && RECIRC_BIT==0
>>   action: RECIRC_BIT = 1, ct(zone=nfp1’s zone, table=LOCAL), resubmit
> to table 43
>>
>> Priority 109: For each {tunnel_id, nf port} on this host, if the
> tun_if_id in ct_label matches the tunnel_id, send the recirculated
> packet using tunnel_id:
>>   match: inport==nfp1 && RECIRC_BIT==1 && ct_label.tun_if_id==<tun-id>
>>   action: tunnel packet using tun-id
>>
>> If p1 and nfp1 happen to be on the same host, the tun_if_id would not
> be set and thus none of the priority 109 rules would match. The packet
> would be forwarded the usual way, matching the existing priority 100
> rules in LOCAL_OUTPUT.
>>
>> Special handling of the case where the NF responds back on nfp1 instead
> of forwarding the packet out of nfp2:
>> For example, a SYN packet from p1 got redirected to nfp1. Then the NF,
> which is a firewall VM, drops the SYN and sends an RST back on port nfp1.
> In this case, looking up the linked port's (nfp2's) ct zone will not give
> anything. ct.inv is used to identify such scenarios and nfp1’s CT zone is
> used to send the packet back. To achieve this, the following two rules
> are installed:
>>    
>> in_network_function:
>> The priority 100 rule that allows packets incoming from NF type ports is
> enhanced with an additional action to store the tun_if_id from ct_label
> into reg5[16..31].
>>   table=22(ls_in_network_function), priority=100  , match=(inport ==
> "nfp1"), action=(reg5[16..31] = ct_label.tun_if_id; next;)
>>
>> LOCAL_OUTPUT (table 43)
>> Priority 110 rule: for recirculated packets, if ct (of the linked
> port) is invalid, use the tun id from reg5[16..31] to tunnel the packet
> back to host1 (as the CT zone info has been overwritten by the above
> priority 110 rule in table 43).
>>       match: inport==nfp1 && RECIRC_BIT==1 && ct.inv &&
> MFF_LOG_TUN_OFPORT==<tun-id>
>>       action: tunnel packet using tun-id
>>
>>
>> 6. NF insertion across logical switches
>> =======================================
>> If the port-group where the ACL is being applied has members across
> multiple logical switches, there needs to be an NF port pair on each of
> these switches.
>> The NF VM will have only one inport and one outport. The CMS is
> expected to create child ports linked to these ports on each logical
> switch where port-group members are present.
>> The network-function entity would be configured with the parent ports
> only. When CMS creates the child ports, it does not need to change any
> of the NF, NFG or ACL config tables.
>> When northd configures the redirection rules for a specific LS, it
> will use the parent or child port depending on what it finds on that LS.
>>                                      --------
>>                                     | NF VM  |
>>                                      --------
>>                                      |      |
>>           -----                      |      |              -----
>>          | VM1 |                    nfp1   nfp2           | VM2 |
>>           -----    |     |         --------------          -----    |       |
>>             |      |     |        |    SVC LS    |          |       |       |
>>           p1|  nfp1_ch1  nfp2_ch1  --------------         p3|  nfp1_ch2  nfp2_ch2
>>           --------------------                             --------------------
>>          |         LS1        |                           |        LS2        |
>>           --------------------                             --------------------
>>
>> In this example, the CMS created the parent ports for the NF VM on the
> LS named SVC LS. The ports are nfp1 and nfp2. The CMS configures the NF
> using these ports:
>> ovn-nbctl nf-add nf1 nfp1 nfp2
>> ovn-nbctl nfg-add nfg1 123 inline nf1
>> ovn-nbctl acl-add pg1 from-lport 200 'inport==@pg1 && ip4.dst == $as2'
> allow-related nfg1
>>
>> The port group to which the ACL is applied is pg1 and pg1 has two
> ports: p1 on LS1 and p3 on LS2.
>> The CMS needs to create child ports for the NF ports on LS1 and LS2.
> On LS1: nfp1_ch1 and nfp2_ch1. On LS2: nfp1_ch2 and nfp2_ch2. A sketch
> of this configuration is shown below.
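>>
>> One possible shape for this, assuming the existing "lsp-add SWITCH PORT
>> PARENT TAG" child-port syntax is reused (hypothetical; the switch names
>> and tag semantics here are assumptions, and the exact CMS flow may
>> differ):
>>
>>   ovn-nbctl lsp-add ls1 nfp1_ch1 nfp1 1
>>   ovn-nbctl lsp-add ls1 nfp2_ch1 nfp2 1
>>   ovn-nbctl lsp-add ls2 nfp1_ch2 nfp1 2
>>   ovn-nbctl lsp-add ls2 nfp2_ch2 nfp2 2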
>>
>> When northd creates rules on LS1, it would use nfp1_ch1 and nfp2_ch1.
>>
>>   table=22(ls_in_network_function), priority=100  , match=(inport ==
> "nfp2_ch1"), action=(reg5[16..31] = ct_label.tun_if_id; next;)
>>   table=22(ls_in_network_function), priority=99   , match=(reg8[21] ==
> 1 && reg8[22] == 1 && reg0[22..29] == 123), action=(outport =
> "nfp1_ch1"; output;)
>>
>> When northd is creating rules on LS2, it would use nfp1_ch2 and nfp2_ch2.
>>   table=22(ls_in_network_function), priority=100  , match=(inport ==
> "nfp2_ch2"), action=(reg5[16..31] = ct_label.tun_if_id; next;)
>>   table=22(ls_in_network_function), priority=99   , match=(reg8[21] ==
> 1 && reg8[22] == 1 && reg0[22..29] == 123), action=(outport =
> "nfp1_ch2"; output;)
>>
>>
>> 7. Health Monitoring
>> ====================
>> The LB health monitoring functionality has been extended to support
> NFs. Network_Function_Group has a list of Network_Functions, each of
> which has a reference to a Network_Function_Health_Check that holds the
> monitoring config. There is a corresponding SB Service_Monitor record
> maintaining the online/offline status. When the status changes, northd
> picks one of the “online” NFs and sets it in the network_function_active
> field of the NFG. The redirection rule in the LS uses the ports from
> this NF.
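>>
>> As a configuration sketch (the option names come from the schema in
>> section 2; the probe values and their units are assumptions):
>>
>>   ovn-nbctl -- --id=@hc create Network_Function_Health_Check \
>>       name=hc1 options:interval=5 options:timeout=2 \
>>       options:success_count=3 options:failure_count=3 \
>>       -- set Network_Function nf1 health_check=@hc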
>>
>> ovn-controller performs the health monitoring by sending an ICMP echo
> request with source IP and MAC from the NB global options “svc_monitor_ip4”
> and “svc_monitor_mac”, and destination IP and MAC from the new NB global
> options “svc_monitor_ip4_dst” and “svc_monitor_mac_dst”. The sequence
> number and id are randomly generated and stored in the service_monitor
> record. The NF VM forwards the same packet out of the other port. When
> the packet comes out, ovn-controller matches the sequence number and id
> against the stored values and marks the NF online if they match.
>>
>> V1:
>>   - First patch.
>>
>> V2:
>>   - Rebased code.
>>   - Added "mode" field in the Network_Function_Group table, with the only
>>     allowed value being "inline". This is for future expansion to include
> "mirror" mode.
>>   - Added a flow in the in_network_function and out_network_function
> table to
>>     skip redirection of multicast traffic.
>>
>> V3:
>>  - Rebased code.
>>
>> V4:
>>  - Rebased code.
>>
>> V5:
>>  - Fixed issue of packet duplication for NF on overlay. Added a check
>>    in physical/controller.c so that only for localnet case the new flows
>>    in table 44 (that tunnel packets back to source host based on tunnel
>>    i/f stored in CT) get installed.
>>  - Fixed packet drop issue for NF on overlay. This happens when port1
>>    is on host1, and port2 & NF are on host2; packets sent from port2 to
>>    port1 are redirected to the NF port as intended but do not reach host2.
>>    This is because OVS drops packets that are sent to same port on which
>>    they were received (in this case packet received on host1 on
>>    a tunnel interface and being forwarded to the same tunnel). Updated
>>    the table 43 rule for remote overlay ports to clear in_port to get
>>    around this.
>>  - Added unit tests in ovn.at to cover various multi-chassis scenarios
>>    for both VLAN and overlay cases.
>>
>> V6:
>> - Addressed review comments from Numan Siddique.
>> - Updated ovn-northd.8.xml and ovn-nbctl.8.xml.
>> - Unit test added for health monitoring of network function
>> - System test added for the single chassis case. Multi-chassis system
> tests to be added later; multi-chassis cases were already added in ovn.at.
>>
>> V7:
>> - Rebased
>> - Fixed unit test failures resulting from new IC SB service monitor
> feature
>>
>> V8:
>> - Addressed review comments from Dumitru Ceara. Main changes are
> listed below.
>> - NFG unique id is not generated internally, but is configured by CMS.
>> - NFG register changed from reg5[0..7] to reg0[22..29]
>> - IPv6 support for health probes
>> - Updates to NEWS and various xml files for proper documentation.
>>
>> V9:
>> - Addressed new review comments from Dumitru Ceara. Main changes are
> as below.
>> - Changed isRoot from true to false for Network_Function_Health_Check.
>> - Changed the behaviour of --may-exist from no-op to update.
>> - In nbctl commands, replaced "network-function" with "nf" and
>>   "network-function-group" with "nfg".
>> - The change related to skipping the clearing of tunnel id part of
> reg5 is now
>>   being done only for NF datapaths.
>> - Fixed the intermittent test failures earlier being caused by patch5, by
>>   sorting the svc_macs.
>> - Fixed backward compatibility with respect to "type" field in
> service_monitor.
>> - New system test added for the case without health_check.
>> - Replaced the scoring mechanism for NF election with a deterministic
> method.
>> - Moved some test and xml changes to other patches to align with the
> changes.
>> - Added a few TODO items. Split the NEWS change across multiple patches.
>>
> 
> Hi Sragdhara, Naveen, Karthik, Numan
> 
> Thank you for this new revision and for the reviews.  Also thanks for
> bearing with us while trying to get this new feature in OVN!  I know
> it's been a challenge.
> 
> Overall I didn't see any major issues in the series.  There's quite a
> lot of code being added/changed so I'm quite sure we missed some
> things.  However, we have quite a lot of time to address any of those
> in the remainder of the 26.03 development cycle.
> 
> I did apply some small changes to the last 3 patches of the series,
> mostly style issues, and then pushed the series to main.
> 
> But I also have a few items I'd like to try to follow up on.  It would
> be great if you could set some time aside to look into that.  Please
> see below, listed for the corresponding patches.
> 
>> Sragdhara Datta Chaudhuri (5):
>>   ovn-nb: Network Function insertion OVN-NB schema changes.
>>   ovn-nbctl: Network Function insertion commands.
>>   northd, tests: Network Function insertion logical flow programming.
>>   controller, tests: Network Function insertion tunneling of cross-host
>>     VLAN traffic.
> 
> This patch changes the PHY_TO_LOG() table to automatically load the
> "tun intf id" for decapsulated packets into a "random" register,
> reg5[16..31].  I know we also define a logical field for these 16 bits
> of reg5:
> 
> #define MFF_LOG_TUN_OFPORT MFF_REG5   /* 16..31 of the 32 bits */
> 
> But that seems a bit error prone.  We also (for NF traffic) skip
> clearing these bits when going from the logical ingress pipeline to
> the logical egress one.
> 
> I think it might make sense to have a dedicated (set of) registers
> that are preserved between logical pipelines.  The problem is we
> currently are out of OVS registers that we can use:
> 
> /* Logical fields.
>  *
>  * These values are documented in ovn-architecture(7), please update the
>  * documentation if you change any of them. */
> #define MFF_LOG_DATAPATH MFF_METADATA /* Logical datapath (64 bits). */
> #define MFF_LOG_FLAGS      MFF_REG10  /* One of MLF_* (32 bits). */
> #define MFF_LOG_DNAT_ZONE  MFF_REG11  /* conntrack dnat zone for gateway
> router
>                                        * (32 bits). */
> #define MFF_LOG_SNAT_ZONE  MFF_REG12  /* conntrack snat zone for gateway
> router
>                                        * (32 bits). */
> #define MFF_LOG_CT_ZONE    MFF_REG13  /* Logical conntrack zone for lports
>                                        * (0..15 of the 32 bits). */
> #define MFF_LOG_ENCAP_ID   MFF_REG13  /* Encap ID for lports
>                                        * (16..31 of the 32 bits). */
> #define MFF_LOG_INPORT     MFF_REG14  /* Logical input port (32 bits). */
> #define MFF_LOG_OUTPORT    MFF_REG15  /* Logical output port (32 bits). */
> 
> I had a quick glance at OVS' code and I _think_ there should be no
> major issue if we add another xxreg (i.e., 2 extra xregs / 4 extra regs).
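> 
> Roughly speaking (just a sketch of the idea, not a tested patch), the
> OVS side would mostly amount to growing the register file and wiring up
> the new fields:
> 
>     /* include/openvswitch/flow.h (hypothetical change): one extra
>      * xxreg, i.e. 2 extra xregs / 4 extra 32-bit regs. */
>     #define FLOW_N_REGS 20   /* was 16 */
> 
> plus the matching MFF_REG16..MFF_REG19 / xreg / xxreg entries in
> meta-flow and any flow-geometry assertions that depend on FLOW_N_REGS.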
> 
> We're also running low on available bits in the logical flags register
> (that's also preserved between ingress and egress) so having some extra
> register storage space to work with would make our lives easier.  The
> impact of adding new registers should be relatively minimal (I guess) as
> it would only potentially increase the upcall handling time but likely
> with a marginal amount.
> 
> I'm CC-ing Ilya to see if such a change would be acceptable in OVS.
> 
>>   northd, controller, tests: Network Function Health monitoring.
> 
> The new system test in this patch blindly kills any "random" tcpdump
> instance that might be running on the host where the tests are being
> executed.  We should probably improve this test to use
> NETNS_START_TCPDUMP() or something similar.
> 
> Also, with this last patch, I saw the "622: Network function packet
> flow - outbound" test fail in GitHub CI a couple of times.  I couldn't
> reproduce the problem locally and after rebase it seems to pass. But
> it would be great if you could monitor this test and fix it in the
> future if it starts failing again.
> 
> Thanks again for all the hard work on this new feature!
> 
> Best regards,
> Dumitru
> 
