On Tue, Mar 7, 2023 at 7:58 PM Aaron Conole <acon...@redhat.com> wrote:
>
> Robin Jarry <rja...@redhat.com> writes:
>
> > Some control protocols are used to maintain link status between
> > forwarding engines (e.g. LACP). When the system is not sized properly,
> > the PMD threads may not be able to process all incoming traffic from the
> > configured Rx queues. When a signaling packet of such protocols is
> > dropped, it can cause link flapping, worsening the situation.
> >
> > Use the RTE flow API to redirect these protocols into a dedicated Rx
> > queue. The assumption is made that the ratio between control protocol
> > traffic and user data traffic is very low and thus this dedicated Rx
> > queue will never get full. The RSS redirection table is re-programmed to
> > only use the other Rx queues. The RSS table size is stored in the
> > netdev_dpdk structure at port initialization to avoid requesting the
> > information again when changing the port configuration.
> >
> > The additional Rx queue will be assigned a PMD core like any other Rx
> > queue. Polling that extra queue may introduce increased latency and
> > a slight performance penalty, with the benefit of preventing link flapping.
> >
> > This feature must be enabled per port on specific protocols via the
> > cp-protection option. This option takes a comma-separated list of
> > protocol names. It is only supported on ethernet ports. This feature is
> > experimental.
> >
> > If the user has already configured multiple Rx queues on the port, an
> > additional one will be allocated for control plane packets. If the
> > hardware cannot satisfy the requested number of Rx queues, the
> > last Rx queue will be assigned for control plane. If only one Rx queue
> > is available, the cp-protection feature will be disabled. If the
> > hardware does not support the RTE flow matchers/actions, the feature
> > will be disabled.
> >
> > It cannot be enabled when other_config:hw-offload=true as it may
> > conflict with the offloaded RTE flows. Similarly, if hw-offload is
> > enabled while some ports already have cp-protection enabled, RTE flow
> > offloading will be disabled on these ports.
>
> I'm concerned about this - this is a negative interference with rte_flow
> offload.  And rte_flow offload would also just alleviate these problems,
> yes?  Maybe we should encourage a user to just turn on flow offloading?

I agree that *in the near future* rte_flow offload may be a great
solution, but having a dedicated queue to handle control plane packets
is a valuable addition today.

Yet, I don't think rte_flow offload will "just" alleviate these problems:
- rte_flow offload, in its current state, isn't a working solution for
all NICs (e.g. Niantic)
- with a dedicated queue for LACP (and in the future, other CP packets
such as BFD), we can maintain the link status even when the system is
over capacity, whereas we don't have a way to express this constraint
with rte_flow.

If I'm not mistaken, rte_flow priorities are not supported by Niantic
NICs, but with other NICs, could we have:
- a high priority flow for CP protection
- lower priorities dedicated to rte_flow offload (see the sketch below)
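Something along these lines, maybe (a rough, untested sketch: the
add_lacp_redirect helper is hypothetical, and it assumes the PMD
accepts rte_flow priorities, with 0 being the highest):

#include <rte_byteorder.h>
#include <rte_flow.h>

/* Redirect LACP to the dedicated CP protection queue with the highest
 * rte_flow priority (0). Offloaded flows would then be created with
 * .priority >= 1 so they can never shadow this rule. */
static struct rte_flow *
add_lacp_redirect(uint16_t port_id, uint16_t cp_queue)
{
    const struct rte_flow_attr attr = {
        .ingress = 1,
        .priority = 0,
    };
    const struct rte_flow_item items[] = {
        {
            .type = RTE_FLOW_ITEM_TYPE_ETH,
            .spec = &(const struct rte_flow_item_eth) {
                .type = RTE_BE16(0x8809), /* LACP EtherType. */
            },
            .mask = &(const struct rte_flow_item_eth) {
                .type = RTE_BE16(0xffff),
            },
        },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    const struct rte_flow_action actions[] = {
        {
            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
            .conf = &(const struct rte_flow_action_queue) {
                .index = cp_queue,
            },
        },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };
    struct rte_flow_error error = { 0 };

    return rte_flow_create(port_id, &attr, items, actions, &error);
}

If that works, OVS could keep the CP protection rule at priority 0 and
create its offloaded flows at priority >= 1, instead of making the two
features mutually exclusive.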

So, what do you think about getting that improvement now, and keeping
it in mind as a requirement for rte_flow offload?
Christophe


>
> Additionally, it doesn't seem to have a good solution for the kernel side.
> And I worry about adding flows into the system that the ofproto layer
> doesn't know about but will interact with - but maybe I'm just being
> paranoid.
>
> > Example use:
> >
> >  ovs-vsctl add-bond br-phy bond0 phy0 phy1 -- \
> >    set interface phy0 type=dpdk options:dpdk-devargs=0000:ca:00.0 -- \
> >    set interface phy0 options:cp-protection=lacp -- \
> >    set interface phy1 type=dpdk options:dpdk-devargs=0000:ca:00.1 -- \
> >    set interface phy1 options:cp-protection=lacp
> >
> > As a starting point, only one protocol is supported: LACP. Other
> > protocols can be added in the future. NIC compatibility should be
> > checked.
> >
> > To validate that this works as intended, I used a traffic generator to
> > generate random traffic slightly above the machine capacity at line rate
> > on a two-port bond interface. OVS is configured to receive traffic on
> > two VLANs and pop/push them in a br-int bridge based on tags set on
> > patch ports.
> >
> >    +----------------------+
> >    |         DUT          |
> >    |+--------------------+|
> >    ||       br-int       || in_port=patch10,actions=mod_dl_src:$patch11,mod_dl_dst:$tgen1,output:patch11
> >    ||                    || in_port=patch11,actions=mod_dl_src:$patch10,mod_dl_dst:$tgen0,output:patch10
> >    || patch10    patch11 ||
> >    |+---|-----------|----+|
> >    |    |           |     |
> >    |+---|-----------|----+|
> >    || patch00    patch01 ||
> >    ||  tag:10    tag:20  ||
> >    ||                    ||
> >    ||       br-phy       || default flow, action=NORMAL
> >    ||                    ||
> >    ||       bond0        || balance-slb, lacp=passive, lacp-time=fast
> >    ||    phy0   phy1     ||
> >    |+------|-----|-------+|
> >    +-------|-----|--------+
> >            |     |
> >    +-------|-----|--------+
> >    |     port0  port1     | balance L3/L4, lacp=active, lacp-time=fast
> >    |         lag          | mode trunk VLANs 10, 20
> >    |                      |
> >    |        switch        |
> >    |                      |
> >    |  vlan 10    vlan 20  |  mode access
> >    |   port2      port3   |
> >    +-----|----------|-----+
> >          |          |
> >    +-----|----------|-----+
> >    |   tgen0      tgen1   |  Random traffic that is properly balanced
> >    |                      |  across the bond ports in both directions.
> >    |  traffic generator   |
> >    +----------------------+
> >
> > Without cp-protection, the bond0 links are randomly switching to
> > "defaulted" when one of the LACP packets sent by the switch is dropped
> > because the Rx queues are full and the PMD threads did not process them
> > fast enough. When that happens, all traffic must go through a single
> > link, which causes the traffic above line rate to be dropped.
> >
> >  ~# ovs-appctl lacp/show-stats bond0
> >  ---- bond0 statistics ----
> >  member: phy0:
> >    TX PDUs: 347246
> >    RX PDUs: 14865
> >    RX Bad PDUs: 0
> >    RX Marker Request PDUs: 0
> >    Link Expired: 168
> >    Link Defaulted: 0
> >    Carrier Status Changed: 0
> >  member: phy1:
> >    TX PDUs: 347245
> >    RX PDUs: 14919
> >    RX Bad PDUs: 0
> >    RX Marker Request PDUs: 0
> >    Link Expired: 147
> >    Link Defaulted: 1
> >    Carrier Status Changed: 0
> >
> > When cp-protection is enabled, no LACP packet is dropped and the bond
> > links remain enabled at all times, maximizing the throughput. Neither
> > the "Link Expired" nor the "Link Defaulted" counters are incremented
> > anymore.
> >
> > This feature may be considered a form of "QoS". However, it does not work
> > by limiting the rate of traffic explicitly. It only ensures that some
> > protocols have a lower chance of being dropped because the PMD cores
> > cannot keep up with regular traffic.
> >
> > The choice of protocols is limited on purpose. This is not meant to be
> > configurable by users. Some limited configurability could be considered
> > in the future, but it would expose users to more potential issues if they
> > accidentally redirect all traffic to the control plane queue.
> >
> > Cc: Anthony Harivel <ahari...@redhat.com>
> > Cc: Christophe Fontaine <cfont...@redhat.com>
> > Cc: David Marchand <david.march...@redhat.com>
> > Cc: Kevin Traynor <ktray...@redhat.com>
> > Signed-off-by: Robin Jarry <rja...@redhat.com>
> > ---
> > v8 -> v9:
> >
> > * Rebased on cf288fdfe2bf ("AUTHORS: Add Liang Mancang and Viacheslav
> >   Galaktionov.")
> > * Reset rte_flow_error struct before passing it to functions.
> > * Refined some comments.
> > * Updated check for hw-offload on a per-port basis. That way, if a port
> >   already has cp-protection enabled, hw-offload will not be enabled on
> >   it but cp-protection will continue to work until next restart.
> >   However, On next restart, hw-offload will be checked first and
> >   therefore cp-protection will be disabled on all ports.
> >
> > Unless there are significant reservations about this patch, would it be
> > ok to include it in 3.2?
> >
> > Thanks!
> >
> >  Documentation/topics/dpdk/phy.rst |  77 ++++++++
> >  NEWS                              |   4 +
> >  lib/netdev-dpdk.c                 | 310 +++++++++++++++++++++++++++++-
> >  vswitchd/vswitch.xml              |  26 +++
> >  4 files changed, 414 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/topics/dpdk/phy.rst b/Documentation/topics/dpdk/phy.rst
> > index 4b0fe8dded3a..518b67134639 100644
> > --- a/Documentation/topics/dpdk/phy.rst
> > +++ b/Documentation/topics/dpdk/phy.rst
> > @@ -131,6 +131,83 @@ possible with DPDK acceleration. It is possible to configure multiple Rx queues
> >  for ``dpdk`` ports, thus ensuring this is not a bottleneck for performance. For
> >  information on configuring PMD threads, refer to :doc:`pmd`.
> >
> > +Control Plane Protection
> > +------------------------
> > +
> > +.. warning:: This feature is experimental.
> > +
> > +Some control protocols are used to maintain link status between forwarding
> > +engines. In SDN environments, these packets share the same physical network
> > +as the user data traffic.
> > +
> > +When the system is not sized properly, the PMD threads may not be able to
> > +process all incoming traffic from the configured Rx queues. When a signaling
> > +packet of such protocols is dropped, it can cause link flapping, worsening the
> > +situation.
> > +
> > +Some physical NICs can be programmed to put these protocols in a dedicated
> > +hardware Rx queue using the rte_flow__ API.
> > +
> > +__ https://doc.dpdk.org/guides-22.11/prog_guide/rte_flow.html
> > +
> > +The currently supported control plane protocols are:
> > +
> > +``lacp``
> > +   `Link Aggregation Control Protocol`__. Ether type ``0x8809``.
> > +
> > +   __ https://www.ieee802.org/3/ad/public/mar99/seaman_1_0399.pdf
> > +
> > +.. warning::
> > +
> > +   This feature is not compatible with all NICs. Refer to the DPDK
> > +   `compatibility matrix`__ and vendor documentation for more details.
> > +
> > +   __ https://doc.dpdk.org/guides-22.11/nics/overview.html
> > +
> > +Control plane protection must be enabled on specific protocols per port. The
> > +``cp-protection`` option requires a comma-separated list of protocol names::
> > +
> > +   $ ovs-vsctl add-port br0 dpdk-p0 -- set Interface dpdk-p0 type=dpdk \
> > +        options:dpdk-devargs=0000:01:00.0 options:n_rxq=2 \
> > +        options:cp-protection=lacp
> > +
> > +.. note::
> > +
> > +   If multiple Rx queues are already configured, regular RSS (Receive Side
> > +   Scaling) queue balancing is done on all but the extra control plane
> > +   protection queue.
> > +
> > +.. tip::
> > +
> > +   You can check if control plane protection is supported on a port with the
> > +   following command::
> > +
> > +      $ ovs-vsctl get interface dpdk-p0 status
> > +      {cp_protection_queue="2", driver_name=..., rss_queues="0-1"}
> > +
> > +   This will also show in ``ovs-vswitchd.log``::
> > +
> > +      INFO|dpdk-p0: cp-protection: redirecting lacp traffic to queue 2
> > +      INFO|dpdk-p0: cp-protection: applying rss on queues 0-1
> > +
> > +   If the hardware does not support redirecting control plane traffic to
> > +   a dedicated queue, it will be explicit::
> > +
> > +      $ ovs-vsctl get interface dpdk-p0 status
> > +      {cp_protection=unsupported, driver_name=...}
> > +
> > +   More details can often be found in ``ovs-vswitchd.log``::
> > +
> > +      WARN|dpdk-p0: cp-protection: failed to add lacp flow: Unsupported pattern
> > +
> > +To disable control plane protection on a port, use the following command::
> > +
> > +   $ ovs-vsctl remove Interface dpdk-p0 options cp-protection
> > +
> > +You can see that it has been disabled in ``ovs-vswitchd.log``::
> > +
> > +   INFO|dpdk-p0: cp-protection: disabled
> > +
> >  .. _dpdk-phy-flow-control:
> >
> >  Flow Control
> > diff --git a/NEWS b/NEWS
> > index 85b34962145e..22505ea9d4ad 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -10,6 +10,10 @@ Post-v3.1.0
> >         in order to create OVSDB sockets with access mode of 0770.
> >     - QoS:
> >       * Added new configuration option 'jitter' for a linux-netem QoS type.
> > +   - DPDK:
> > +     * New experimental "cp-protection=<protocol>" option to redirect certain
> > +       protocols (for now, only LACP) to a dedicated hardware queue using
> > +       RTE flow.
> >
> >
> >  v3.1.0 - 16 Feb 2023
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index fb0dd43f75c5..549fc425edba 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -415,6 +415,10 @@ enum dpdk_hw_ol_features {
> >      NETDEV_TX_SCTP_CHECKSUM_OFFLOAD = 1 << 4,
> >  };
> >
> > +enum dpdk_cp_prot_flags {
> > +    DPDK_CP_PROT_LACP = 1 << 0,
> > +};
> > +
> >  /*
> >   * In order to avoid confusion in variables names, following naming convention
> >   * should be used, if possible:
> > @@ -504,11 +508,18 @@ struct netdev_dpdk {
> >           * so we remember the request and update them next time
> >           * netdev_dpdk*_reconfigure() is called */
> >          int requested_mtu;
> > +        /* User input for n_rxq + an optional control plane protection queue
> > +         * (see netdev_dpdk_reconfigure). This field is different from the
> > +         * other requested_* fields as it may contain a different value than
> > +         * the user input. */
> >          int requested_n_txq;
> >          int requested_n_rxq;
> >          int requested_rxq_size;
> >          int requested_txq_size;
> >
> > +        /* User input for n_rxq (see dpdk_set_rxq_config). */
> > +        int user_n_rxq;
> > +
> >          /* Number of rx/tx descriptors for physical devices */
> >          int rxq_size;
> >          int txq_size;
> > @@ -534,6 +545,13 @@ struct netdev_dpdk {
> >
> >          /* VF configuration. */
> >          struct eth_addr requested_hwaddr;
> > +
> > +        /* Requested control plane protection flags,
> > +         * from the enum set 'dpdk_cp_prot_flags'. */
> > +        uint64_t requested_cp_prot_flags;
> > +        uint64_t cp_prot_flags;
> > +        size_t cp_prot_flows_num;
> > +        struct rte_flow **cp_prot_flows;
> >      );
> >
> >      PADDED_MEMBERS(CACHE_LINE_SIZE,
> > @@ -1310,9 +1328,14 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
> >      netdev->n_rxq = 0;
> >      netdev->n_txq = 0;
> >      dev->requested_n_rxq = NR_QUEUE;
> > +    dev->user_n_rxq = NR_QUEUE;
> >      dev->requested_n_txq = NR_QUEUE;
> >      dev->requested_rxq_size = NIC_PORT_DEFAULT_RXQ_SIZE;
> >      dev->requested_txq_size = NIC_PORT_DEFAULT_TXQ_SIZE;
> > +    dev->requested_cp_prot_flags = 0;
> > +    dev->cp_prot_flags = 0;
> > +    dev->cp_prot_flows_num = 0;
> > +    dev->cp_prot_flows = NULL;
> >
> >      /* Initialize the flow control to NULL */
> >      memset(&dev->fc_conf, 0, sizeof dev->fc_conf);
> > @@ -1487,6 +1510,8 @@ common_destruct(struct netdev_dpdk *dev)
> >      ovs_mutex_destroy(&dev->mutex);
> >  }
> >
> > +static void dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev);
> > +
> >  static void
> >  netdev_dpdk_destruct(struct netdev *netdev)
> >  {
> > @@ -1494,6 +1519,9 @@ netdev_dpdk_destruct(struct netdev *netdev)
> >
> >      ovs_mutex_lock(&dpdk_mutex);
> >
> > +    /* Destroy any rte flows to allow RXQs to be removed. */
> > +    dpdk_cp_prot_unconfigure(dev);
> > +
> >      rte_eth_dev_stop(dev->port_id);
> >      dev->started = false;
> >
> > @@ -1908,8 +1936,8 @@ dpdk_set_rxq_config(struct netdev_dpdk *dev, const struct smap *args)
> >      int new_n_rxq;
> >
> >      new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1);
> > -    if (new_n_rxq != dev->requested_n_rxq) {
> > -        dev->requested_n_rxq = new_n_rxq;
> > +    if (new_n_rxq != dev->user_n_rxq) {
> > +        dev->user_n_rxq = new_n_rxq;
> >          netdev_request_reconfigure(&dev->up);
> >      }
> >  }
> > @@ -1931,6 +1959,48 @@ dpdk_process_queue_size(struct netdev *netdev, const struct smap *args,
> >      }
> >  }
> >
> > +static void
> > +dpdk_cp_prot_set_config(struct netdev *netdev, struct netdev_dpdk *dev,
> > +                        const struct smap *args, char **errp)
> > +{
> > +    const char *arg = smap_get_def(args, "cp-protection", "");
> > +    char *token, *saveptr, *buf;
> > +    uint64_t flags = 0;
> > +
> > +    buf = xstrdup(arg);
> > +    token = strtok_r(buf, ",", &saveptr);
> > +    while (token) {
> > +        if (strcmp(token, "lacp") == 0) {
> > +            flags |= DPDK_CP_PROT_LACP;
> > +        } else {
> > +            VLOG_WARN_BUF(errp, "%s options:cp-protection "
> > +                          "unknown protocol '%s'",
> > +                          netdev_get_name(netdev), token);
> > +        }
> > +        token = strtok_r(NULL, ",", &saveptr);
> > +    }
> > +    free(buf);
> > +
> > +    if (flags && dev->type != DPDK_DEV_ETH) {
> > +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> > +                      "is only supported on ethernet ports",
> > +                      netdev_get_name(netdev));
> > +        flags = 0;
> > +    }
> > +
> > +    if (flags && ovsrcu_get(void *, &netdev->hw_info.offload_data)) {
> > +        VLOG_WARN_BUF(errp, "%s options:cp-protection "
> > +                      "is incompatible with hw-offload",
> > +                      netdev_get_name(netdev));
> > +        flags = 0;
> > +    }
> > +
> > +    if (flags != dev->requested_cp_prot_flags) {
> > +        dev->requested_cp_prot_flags = flags;
> > +        netdev_request_reconfigure(netdev);
> > +    }
> > +}
> > +
> >  static int
> >  netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
> >                         char **errp)
> > @@ -1950,6 +2020,8 @@ netdev_dpdk_set_config(struct netdev *netdev, const struct smap *args,
> >      ovs_mutex_lock(&dpdk_mutex);
> >      ovs_mutex_lock(&dev->mutex);
> >
> > +    dpdk_cp_prot_set_config(netdev, dev, args, errp);
> > +
> >      dpdk_set_rxq_config(dev, args);
> >
> >      dpdk_process_queue_size(netdev, args, "n_rxq_desc",
> > @@ -3819,9 +3891,12 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
> >  {
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> >      struct rte_eth_dev_info dev_info;
> > +    size_t cp_prot_flows_num;
> > +    uint64_t cp_prot_flags;
> >      const char *bus_info;
> >      uint32_t link_speed;
> >      uint32_t dev_flags;
> > +    int n_rxq;
> >
> >      if (!rte_eth_dev_is_valid_port(dev->port_id)) {
> >          return ENODEV;
> > @@ -3833,6 +3908,9 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
> >      link_speed = dev->link.link_speed;
> >      dev_flags = *dev_info.dev_flags;
> >      bus_info = rte_dev_bus_info(dev_info.device);
> > +    cp_prot_flags = dev->cp_prot_flags;
> > +    cp_prot_flows_num = dev->cp_prot_flows_num;
> > +    n_rxq = netdev->n_rxq;
> >      ovs_mutex_unlock(&dev->mutex);
> >      ovs_mutex_unlock(&dpdk_mutex);
> >
> > @@ -3875,6 +3953,19 @@ netdev_dpdk_get_status(const struct netdev *netdev, struct smap *args)
> >                          ETH_ADDR_ARGS(dev->hwaddr));
> >      }
> >
> > +    if (cp_prot_flags) {
> > +        if (!cp_prot_flows_num) {
> > +            smap_add(args, "cp_protection", "unsupported");
> > +        } else {
> > +            smap_add_format(args, "cp_protection_queue", "%d", n_rxq - 1);
> > +            if (n_rxq > 2) {
> > +                smap_add_format(args, "rss_queues", "0-%d", n_rxq - 2);
> > +            } else {
> > +                smap_add(args, "rss_queues", "0");
> > +            }
> > +        }
> > +    }
> > +
> >      return 0;
> >  }
> >
> > @@ -5178,16 +5269,203 @@ static const struct dpdk_qos_ops trtcm_policer_ops = {
> >      .qos_queue_dump_state_init = trtcm_policer_qos_queue_dump_state_init
> >  };
> >
> > +static int
> > +dpdk_cp_prot_add_flow(struct netdev_dpdk *dev,
> > +                      const struct rte_flow_item items[],
> > +                      const char *desc)
> > +{
> > +    const struct rte_flow_attr attr = { .ingress = 1 };
> > +    const struct rte_flow_action actions[] = {
> > +        {
> > +            .type = RTE_FLOW_ACTION_TYPE_QUEUE,
> > +            .conf = &(const struct rte_flow_action_queue) {
> > +                .index = dev->up.n_rxq - 1,
> > +            },
> > +        },
> > +        { .type = RTE_FLOW_ACTION_TYPE_END },
> > +    };
> > +    struct rte_flow_error error;
> > +    struct rte_flow *flow;
> > +    size_t num;
> > +    int err;
> > +
> > +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> > +    err = rte_flow_validate(dev->port_id, &attr, items, actions, &error);
> > +    if (err) {
> > +        VLOG_WARN("%s: cp-protection: device does not support %s flow: %s",
> > +                  netdev_get_name(&dev->up), desc,
> > +                  error.message ? error.message : "");
> > +        goto out;
> > +    }
> > +
> > +    set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> > +    flow = rte_flow_create(dev->port_id, &attr, items, actions, &error);
> > +    if (flow == NULL) {
> > +        VLOG_WARN("%s: cp-protection: failed to add %s flow: %s",
> > +                  netdev_get_name(&dev->up), desc,
> > +                  error.message ? error.message : "");
> > +        err = rte_errno;
> > +        goto out;
> > +    }
> > +
> > +    num = dev->cp_prot_flows_num + 1;
> > +    dev->cp_prot_flows = xrealloc(dev->cp_prot_flows, sizeof(flow) * num);
> > +    dev->cp_prot_flows[dev->cp_prot_flows_num] = flow;
> > +    dev->cp_prot_flows_num = num;
> > +
> > +    VLOG_INFO("%s: cp-protection: redirected %s traffic to rx queue %d",
> > +              netdev_get_name(&dev->up), desc, dev->up.n_rxq - 1);
> > +out:
> > +    return err;
> > +}
> > +
> > +#define RETA_CONF_SIZE (RTE_ETH_RSS_RETA_SIZE_512 / RTE_ETH_RETA_GROUP_SIZE)
> > +
> > +static int
> > +dpdk_cp_prot_rss_configure(struct netdev_dpdk *dev, int rss_n_rxq)
> > +{
> > +    struct rte_eth_rss_reta_entry64 reta_conf[RETA_CONF_SIZE];
> > +    struct rte_eth_dev_info info;
> > +    int err;
> > +
> > +    rte_eth_dev_info_get(dev->port_id, &info);
> > +
> > +    if (info.reta_size % rss_n_rxq != 0 &&
> > +        info.reta_size < RTE_ETH_RSS_RETA_SIZE_128) {
> > +        /*
> > +         * Some drivers set reta_size equal to the total number of rxqs that
> > +         * are configured when it is a power of two. Since we are actually
> > +         * reconfiguring the redirection table to exclude the last rxq, we may
> > +         * end up with an imbalanced redirection table. For example, such
> > +         * configuration:
> > +         *
> > +         *   options:n_rxq=3 options:cp-protection=lacp
> > +         *
> > +         * Will actually configure 4 rxqs on the NIC, and the default reta to:
> > +         *
> > +         *   [0, 1, 2, 3]
> > +         *
> > +         * And dpdk_cp_prot_rss_configure() will reconfigure reta to:
> > +         *
> > +         *   [0, 1, 2, 0]
> > +         *
> > +         * Causing queue 0 to receive twice as much traffic as queues 1 and 2.
> > +         *
> > +         * Work around that corner case by forcing a bigger redirection table
> > +         * size to 128 entries when reta_size is not a multiple of rss_n_rxq
> > +         * and when reta_size is less than 128. This value seems to be
> > +         * supported by most of the drivers that also support rte flow.
> > +         */
> > +        info.reta_size = RTE_ETH_RSS_RETA_SIZE_128;
> > +    }
> > +
> > +    memset(reta_conf, 0, sizeof(reta_conf));
> > +    for (uint16_t i = 0; i < info.reta_size; i++) {
> > +        uint16_t idx = i / RTE_ETH_RETA_GROUP_SIZE;
> > +        uint16_t shift = i % RTE_ETH_RETA_GROUP_SIZE;
> > +        reta_conf[idx].mask |= 1ULL << shift;
> > +        reta_conf[idx].reta[shift] = i % rss_n_rxq;
> > +    }
> > +    err = rte_eth_dev_rss_reta_update(dev->port_id, reta_conf, info.reta_size);
> > +    if (err < 0) {
> > +        VLOG_WARN("%s: failed to configure RSS redirection table: err=%d",
> > +                  netdev_get_name(&dev->up), err);
> > +    }
> > +
> > +    return err;
> > +}
> > +
> > +static int
> > +dpdk_cp_prot_configure(struct netdev_dpdk *dev)
> > +{
> > +    int err = 0;
> > +
> > +    if (dev->up.n_rxq < 2) {
> > +        err = ENOTSUP;
> > +        VLOG_WARN("%s: cp-protection: not enough available rx queues",
> > +                  netdev_get_name(&dev->up));
> > +        goto out;
> > +    }
> > +
> > +    if (dev->requested_cp_prot_flags & DPDK_CP_PROT_LACP) {
> > +        const struct rte_flow_item items[] = {
> > +            {
> > +                .type = RTE_FLOW_ITEM_TYPE_ETH,
> > +                .spec = &(const struct rte_flow_item_eth){
> > +                    .type = htons(ETH_TYPE_LACP),
> > +                },
> > +                .mask = &(const struct rte_flow_item_eth){
> > +                    .type = htons(0xffff),
> > +                },
> > +            },
> > +            { .type = RTE_FLOW_ITEM_TYPE_END },
> > +        };
> > +        err = dpdk_cp_prot_add_flow(dev, items, "lacp");
> > +        if (err) {
> > +            goto out;
> > +        }
> > +    }
> > +
> > +    if (dev->cp_prot_flows_num) {
> > +        /* Reconfigure RSS reta in all but the cp protection queue. */
> > +        err = dpdk_cp_prot_rss_configure(dev, dev->up.n_rxq - 1);
> > +        if (!err) {
> > +            if (dev->up.n_rxq == 2) {
> > +                VLOG_INFO("%s: cp-protection: redirected other traffic to "
> > +                          "rx queue 0", netdev_get_name(&dev->up));
> > +            } else {
> > +                VLOG_INFO("%s: cp-protection: applied rss on rx queue 
> > 0-%u",
> > +                          netdev_get_name(&dev->up), dev->up.n_rxq - 2);
> > +            }
> > +        }
> > +    }
> > +
> > +out:
> > +    return err;
> > +}
> > +
> > +static void
> > +dpdk_cp_prot_unconfigure(struct netdev_dpdk *dev)
> > +{
> > +    struct rte_flow_error error;
> > +
> > +    if (!dev->cp_prot_flows_num) {
> > +        return;
> > +    }
> > +
> > +    VLOG_DBG("%s: cp-protection: reset flows", netdev_get_name(&dev->up));
> > +
> > +    for (int i = 0; i < dev->cp_prot_flows_num; i++) {
> > +        set_error(&error, RTE_FLOW_ERROR_TYPE_NONE);
> > +        if (rte_flow_destroy(dev->port_id, dev->cp_prot_flows[i], &error)) {
> > +            VLOG_DBG("%s: cp-protection: failed to destroy flow: %s",
> > +                     netdev_get_name(&dev->up),
> > +                     error.message ? error.message : "");
> > +        }
> > +    }
> > +    free(dev->cp_prot_flows);
> > +    dev->cp_prot_flows_num = 0;
> > +    dev->cp_prot_flows = NULL;
> > +}
> > +
> >  static int
> >  netdev_dpdk_reconfigure(struct netdev *netdev)
> >  {
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > +    bool try_cp_prot;
> >      int err = 0;
> >
> >      ovs_mutex_lock(&dev->mutex);
> >
> > +    try_cp_prot = dev->requested_cp_prot_flags != 0;
> > +    dev->requested_n_rxq = dev->user_n_rxq;
> > +    if (try_cp_prot) {
> > +        dev->requested_n_rxq += 1;
> > +    }
> > +
> >      if (netdev->n_txq == dev->requested_n_txq
> >          && netdev->n_rxq == dev->requested_n_rxq
> > +        && dev->cp_prot_flags == dev->requested_cp_prot_flags
> >          && dev->mtu == dev->requested_mtu
> >          && dev->lsc_interrupt_mode == dev->requested_lsc_interrupt_mode
> >          && dev->rxq_size == dev->requested_rxq_size
> > @@ -5200,6 +5478,9 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> >          goto out;
> >      }
> >
> > +retry:
> > +    dpdk_cp_prot_unconfigure(dev);
> > +
> >      if (dev->reset_needed) {
> >          rte_eth_dev_reset(dev->port_id);
> >          if_notifier_manual_report();
> > @@ -5224,6 +5505,7 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> >      dev->txq_size = dev->requested_txq_size;
> >
> >      rte_free(dev->tx_q);
> > +    dev->tx_q = NULL;
> >
> >      if (!eth_addr_equals(dev->hwaddr, dev->requested_hwaddr)) {
> >          err = netdev_dpdk_set_etheraddr__(dev, dev->requested_hwaddr);
> > @@ -5255,6 +5537,23 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> >       */
> >      dev->requested_hwaddr = dev->hwaddr;
> >
> > +    if (try_cp_prot) {
> > +        err = dpdk_cp_prot_configure(dev);
> > +        if (err) {
> > +            /* No hw support, disable & recover gracefully. */
> > +            try_cp_prot = false;
> > +            /*
> > +             * The extra queue must be explicitly removed here to ensure that
> > +             * it is unconfigured immediately.
> > +             */
> > +            dev->requested_n_rxq = dev->user_n_rxq;
> > +            goto retry;
> > +        }
> > +    } else {
> > +        VLOG_INFO("%s: cp-protection: disabled", 
> > netdev_get_name(&dev->up));
> > +    }
> > +    dev->cp_prot_flags = dev->requested_cp_prot_flags;
> > +
> >      dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
> >      if (!dev->tx_q) {
> >          err = ENOMEM;
> > @@ -5468,7 +5767,12 @@ netdev_dpdk_flow_api_supported(struct netdev *netdev)
> >      ovs_mutex_lock(&dev->mutex);
> >      if (dev->type == DPDK_DEV_ETH) {
> >          /* TODO: Check if we able to offload some minimal flow. */
> > -        ret = true;
> > +        if (dev->requested_cp_prot_flags) {
> > +            VLOG_WARN("%s: hw-offload is mutually exclusive with "
> > +                      "cp-protection", netdev_get_name(netdev));
> > +        } else {
> > +            ret = true;
> > +        }
> >      }
> >      ovs_mutex_unlock(&dev->mutex);
> >  out:
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 05ac1fbe5ef6..e6575f84eb3c 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -3453,6 +3453,32 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
> >          <p>This option may only be used with dpdk VF representors.</p>
> >        </column>
> >
> > +      <column name="options" key="cp-protection"
> > +              type='{"type": "string", "enum": ["set", ["lacp"]]}'>
> > +        <p>
> > +          Allocate an extra Rx queue for control plane packets of the specified
> > +          protocol(s).
> > +        </p>
> > +        <p>
> > +          If the user has already configured multiple Rx queues via
> > +          <code>options:n_rxq</code> on the port, an additional one will be
> > +          allocated for control plane packets. If the hardware cannot satisfy
> > +          the requested number of Rx queues, the last Rx queue will be
> > +          assigned for control plane. If only one Rx queue is available or
> > +          if the hardware does not support the RTE flow matchers/actions
> > +          required to redirect the selected protocols,
> > +          <code>cp-protection</code> will be disabled.
> > +        </p>
> > +        <p>
> > +          This feature is mutually exclusive with
> > +          <code>other_config:hw-offload</code> as it may conflict with the
> > +          offloaded RTE flows.
> > +        </p>
> > +        <p>
> > +          Disabled by default.
> > +        </p>
> > +      </column>
> > +
> >        <column name="other_config" key="tx-steering"
> >                type='{"type": "string",
> >                       "enum": ["set", ["thread", "hash"]]}'>
>
