On 3/11/20 3:15 PM, Vishal Deep Ajmera wrote:
> Problem:
> --------
> In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
> (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
> "HASH" and "RECIRC" datapath actions. After recirculation, the packet is
> forwarded to the bond member port based on 8-bits of the datapath hash
> value computed through dp_hash. This causes performance degradation in the
> following ways:
> 
> 1. The recirculation of the packet implies another lookup of the packet’s
> flow key in the exact match cache (EMC) and potentially Megaflow classifier
> (DPCLS). This is the biggest cost factor.
> 
> 2. The recirculated packets have a new “RSS” hash and compete with the
> original packets for the scarce number of EMC slots. This implies more
> EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
> 
> 3. The 256 extra megaflow entries per bond for dp_hash bond selection put
> additional load on the revalidation threads.
> 
> Owing to this performance degradation, deployments stick to “balance-slb”
> bond mode even though it does not do active-active load balancing for
> VXLAN- and GRE-tunnelled traffic because all tunnel packet have the same
> source MAC address.
> 
> Proposed optimization:
> ----------------------
> This proposal introduces a new load-balancing output action instead of
> recirculation.
> 
> Maintain one table per-bond (could just be an array of uint16's) and
> program it the same way internal flows are created today for each possible
> hash value(256 entries) from ofproto layer. Use this table to load-balance
> flows as part of output action processing.
> 
> Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules()
> -> bond_may_recirc() and compose_output_action__() generate
> “dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In this case the
> RecircID identifies the bond. For the recirculated packets the ofproto layer
> installs megaflow entries that match on RecircID and masked dp_hash and send
> them to the corresponding output port.
> 
> Instead, we will now generate action as
>     "lb_output(bond,<bond id>)"
> 
> This combines hash computation (only if needed, else re-use RSS hash) and
> inline load-balancing over the bond. This action is used *only* for 
> balance-tcp
> bonds in OVS-DPDK datapath (the OVS kernel datapath remains unchanged).
> 
> Example:
> --------
> Current scheme:
> ---------------
> With 1 IP-UDP flow:
> 
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
>  packets:2828969, bytes:181054016, used:0.000s, 
> actions:hash(hash_l4(0)),recirc(0x1)
> 
> recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:2828937, bytes:181051968, used:0.000s, actions:2
> 
> With 8 IP-UDP flows (with random UDP src port): (all hitting same DPCL):
> 
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
>  packets:2674009, bytes:171136576, used:0.000s, 
> actions:hash(hash_l4(0)),recirc(0x1)
> 
> recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:377395, bytes:24153280, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:333486, bytes:21343104, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:348461, bytes:22301504, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:633353, bytes:40534592, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:319901, bytes:20473664, used:0.001s, actions:2
> recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:334985, bytes:21439040, used:0.001s, actions:1
> recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
>  packets:326404, bytes:20889856, used:0.001s, actions:1
> 
> New scheme:
> -----------
> We can do with a single flow entry (for any number of new flows):
> 
> in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no),
>  packets:2674009, bytes:171136576, used:0.000s, actions:lb_output(bond,1)
> 
> A new CLI has been added to dump datapath bond cache as given below.
> 
> “sudo ovs-appctl dpif-netdev/dp-bond-show [dp]”
> 
> root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/dp-bond-show
> Bond cache:
>         bond-id 1 :
>                 bucket 0 - slave 2
>                 bucket 1 - slave 1
>                 bucket 2 - slave 2
>                 bucket 3 - slave 1
> 
> Performance improvement:
> ------------------------
> With a prototype of the proposed idea, the following perf improvement is seen
> with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the improvement
> is even more enhanced (due to reduced number of flows).
> 
> 1 VM:
> *****
> +--------------------------------------+
> |                 mpps                 |
> +--------------------------------------+
> | Flows  master  with-opt.   %delta    |
> +--------------------------------------+
> | 1      4.34    5.59        28.86
> | 10     4.09    5.43        32.61
> | 400    3.44    5.35        55.25
> | 1k     3.31    5.25        58.41
> | 10k    2.46    4.43        79.78
> | 100k   2.32    4.27        83.59
> | 500k   2.29    4.23        84.57
> +--------------------------------------+
> mpps: million packets per second.
> packet size 64 bytes.
> 
> Signed-off-by: Manohar Krishnappa Chidambaraswamy <man...@gmail.com>
> Co-authored-by: Manohar Krishnappa Chidambaraswamy <man...@gmail.com>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajm...@ericsson.com>
> 
> CC: Jan Scheurich <jan.scheur...@ericsson.com>
> CC: Venkatesan Pradeep <venkatesan.prad...@ericsson.com>
> CC: Ben Pfaff <b...@ovn.org>
> CC: Ilya Maximets <i.maxim...@ovn.org>
> CC: Ian Stokes <ian.sto...@intel.com>
> CC: David Marchand <david.march...@redhat.com>
> CC: Matteo Croce <mcr...@redhat.com>
> CC: Eelco Chaudron <echau...@redhat.com>
> 
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajm...@ericsson.com>
> ---
>  datapath/linux/compat/include/linux/openvswitch.h |   1 +
>  lib/dpif-netdev.c                                 | 399 
> +++++++++++++++++++---
>  lib/dpif-netlink.c                                |   3 +
>  lib/dpif-provider.h                               |  11 +
>  lib/dpif.c                                        |  49 +++
>  lib/dpif.h                                        |  13 +
>  lib/odp-execute.c                                 |   2 +
>  lib/odp-util.c                                    |   4 +
>  ofproto/bond.c                                    |  68 +++-
>  ofproto/bond.h                                    |   5 +
>  ofproto/ofproto-dpif-ipfix.c                      |   1 +
>  ofproto/ofproto-dpif-sflow.c                      |   3 +-
>  ofproto/ofproto-dpif-xlate.c                      |  32 +-
>  ofproto/ofproto-dpif.c                            |  30 ++
>  ofproto/ofproto-dpif.h                            |   8 +-
>  ofproto/ofproto-provider.h                        |   3 +
>  ofproto/ofproto.c                                 |  10 +
>  ofproto/ofproto.h                                 |   1 +
>  tests/lacp.at                                     |   9 +
>  vswitchd/bridge.c                                 |  13 +
>  vswitchd/vswitch.xml                              |  23 ++
>  21 files changed, 632 insertions(+), 56 deletions(-)
> 
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
> b/datapath/linux/compat/include/linux/openvswitch.h
> index 2f0c655..8073d39 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -1021,6 +1021,7 @@ enum ovs_action_attr {
>       OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
>       OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
>       OVS_ACTION_ATTR_DROP,          /* u32 xlate_error. */
> +     OVS_ACTION_ATTR_LB_OUTPUT,     /* u32 bond-id. */
>  #endif
>       __OVS_ACTION_ATTR_MAX,        /* Nothing past this will be accepted
>                                      * from userspace. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d393aab..68f3a95 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -309,6 +309,7 @@ struct pmd_auto_lb {
>   *
>   *    dp_netdev_mutex (global)
>   *    port_mutex
> + *    bond_mutex
>   *    non_pmd_mutex
>   */
>  struct dp_netdev {
> @@ -376,6 +377,12 @@ struct dp_netdev {
>  
>      struct conntrack *conntrack;
>      struct pmd_auto_lb pmd_alb;
> +
> +    /* Bonds.
> +     *
> +     */

This should be just '/* Bonds. */' on a single line.

> +    struct ovs_mutex bond_mutex; /* Protects updates of 'tx_bonds'. */
> +    struct cmap tx_bonds; /* Contains 'struct tx_bond'. */
>  };
>  
>  static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -607,6 +614,20 @@ struct tx_port {
>      struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>  };
>  
> +/* Contained by struct tx_bond 'slave_buckets'. */
> +struct slave_entry {
> +    odp_port_t slave_id;
> +    atomic_ullong n_packets;
> +    atomic_ullong n_bytes;
> +};
> +
> +/* Contained by struct dp_netdev_pmd_thread's 'tx_bonds'. */
> +struct tx_bond {
> +    struct cmap_node node;
> +    uint32_t bond_id;
> +    struct slave_entry slave_buckets[BOND_BUCKETS];
> +};
> +
>  /* A set of properties for the current processing loop that is not directly
>   * associated with the pmd thread itself, but with the packets being
>   * processed or the short-term system configuration (for example, time).
> @@ -739,6 +760,11 @@ struct dp_netdev_pmd_thread {
>       * read by the pmd thread. */
>      struct hmap tx_ports OVS_GUARDED;
>  
> +    struct ovs_mutex bond_mutex;    /* Protects updates of 'tx_bonds'. */
> +    /* Map of 'tx_bond's used for transmission.  Written by the main thread
> +     * and read by the pmd thread. */
> +    struct cmap tx_bonds;
> +
>      /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>       * ports (that support push_tunnel/pop_tunnel), the other contains ports
>       * with at least one txq (that support send).  A port can be in both.
> @@ -830,6 +856,10 @@ static void dp_netdev_del_rxq_from_pmd(struct 
> dp_netdev_pmd_thread *pmd,
>  static int
>  dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>                                     bool force);
> +static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> +                                         struct tx_bond *bond);
> +static void dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> +                                           uint32_t bond_id);
>  
>  static void reconfigure_datapath(struct dp_netdev *dp)
>      OVS_REQUIRES(dp->port_mutex);
> @@ -1396,6 +1426,49 @@ pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
>      par.command_type = PMD_INFO_PERF_SHOW;
>      dpif_netdev_pmd_info(conn, argc, argv, &par);
>  }
> +
> +static void
> +dpif_netdev_bond_show(struct unixctl_conn *conn, int argc,
> +                      const char *argv[], void *aux OVS_UNUSED)
> +{
> +    struct ds reply = DS_EMPTY_INITIALIZER;
> +
> +    ovs_mutex_lock(&dp_netdev_mutex);
> +
> +    struct dp_netdev *dp = NULL;

Move this to the function top.

> +    if (argc == 2) {
> +        dp = shash_find_data(&dp_netdevs, argv[1]);
> +    } else if (shash_count(&dp_netdevs) == 1) {
> +        /* There's only one datapath. */
> +        dp = shash_first(&dp_netdevs)->data;
> +    }
> +    if (!dp) {
> +        ovs_mutex_unlock(&dp_netdev_mutex);
> +        unixctl_command_reply_error(conn,
> +                                    "please specify an existing datapath");
> +        return;
> +    }
> +
> +    ovs_mutex_lock(&dp->bond_mutex);

There was no lock in v11 and it should not be here.  We're reading cmap,
RCU should protect us.

> +    if (cmap_count(&dp->tx_bonds) > 0) {
> +        struct tx_bond *dp_bond_entry;

Empty line.

> +        ds_put_cstr(&reply, "\nBonds:\n");
> +        CMAP_FOR_EACH (dp_bond_entry, node, &dp->tx_bonds) {
> +            ds_put_format(&reply, "\tbond-id %"PRIu32":\n",
> +                          dp_bond_entry->bond_id);
> +            for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +                ds_put_format(&reply, "\t\tbucket %d - slave %"PRIu32"\n",
> +                  bucket,
> +                  odp_to_u32(dp_bond_entry->slave_buckets[bucket].slave_id));

Indentation.  If you're confused with the too long argument, just create a 
variable:
---
                odp_port_t id = dp_bond_entry->slave_buckets[bucket].slave_id;

                ds_put_format(&reply, "\t\tbucket %d - slave %"PRIu32"\n",
                              bucket, odp_to_u32(id));
---

> +            }
> +        }
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +    ovs_mutex_unlock(&dp_netdev_mutex);
> +    unixctl_command_reply(conn, ds_cstr(&reply));
> +    ds_destroy(&reply);
> +}
> +
>  
>  static int
>  dpif_netdev_init(void)
> @@ -1427,6 +1500,9 @@ dpif_netdev_init(void)
>                               "[-us usec] [-q qlen]",
>                               0, 10, pmd_perf_log_set_cmd,
>                               NULL);
> +    unixctl_command_register("dpif-netdev/bond-show", "[dp]",
> +                             0, 1, dpif_netdev_bond_show,
> +                             NULL);
>      return 0;
>  }
>  
> @@ -1551,6 +1627,9 @@ create_dp_netdev(const char *name, const struct 
> dpif_class *class,
>      ovs_mutex_init_recursive(&dp->port_mutex);
>      hmap_init(&dp->ports);
>      dp->port_seq = seq_create();
> +    ovs_mutex_init(&dp->bond_mutex);
> +    cmap_init(&dp->tx_bonds);
> +
>      fat_rwlock_init(&dp->upcall_rwlock);
>  
>      dp->reconfigure_seq = seq_create();
> @@ -1657,6 +1736,12 @@ dp_delete_meter(struct dp_netdev *dp, uint32_t 
> meter_id)
>      }
>  }
>  
> +static uint32_t
> +hash_bond_id(uint32_t bond_id)
> +{
> +    return hash_int(bond_id, 0);
> +}
> +
>  /* Requires dp_netdev_mutex so that we can't get a new reference to 'dp'
>   * through the 'dp_netdevs' shash while freeing 'dp'. */
>  static void
> @@ -1664,6 +1749,7 @@ dp_netdev_free(struct dp_netdev *dp)
>      OVS_REQUIRES(dp_netdev_mutex)
>  {
>      struct dp_netdev_port *port, *next;
> +    struct tx_bond *bond;
>  
>      shash_find_and_delete(&dp_netdevs, dp->name);
>  
> @@ -1673,6 +1759,13 @@ dp_netdev_free(struct dp_netdev *dp)
>      }
>      ovs_mutex_unlock(&dp->port_mutex);
>  
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
> +        cmap_remove(&dp->tx_bonds, &bond->node, hash_bond_id(bond->bond_id));
> +        ovsrcu_postpone(free, bond);
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
>      dp_netdev_destroy_all_pmds(dp, true);
>      cmap_destroy(&dp->poll_threads);
>  
> @@ -1691,6 +1784,9 @@ dp_netdev_free(struct dp_netdev *dp)
>      hmap_destroy(&dp->ports);
>      ovs_mutex_destroy(&dp->port_mutex);
>  
> +    cmap_destroy(&dp->tx_bonds);
> +    ovs_mutex_destroy(&dp->bond_mutex);
> +
>      /* Upcalls must be disabled at this point */
>      dp_netdev_destroy_upcall_lock(dp);
>  
> @@ -4421,6 +4517,20 @@ tx_port_lookup(const struct hmap *hmap, odp_port_t 
> port_no)
>      return NULL;
>  }
>  
> +static struct tx_bond *
> +tx_bond_lookup(const struct cmap *tx_bonds, uint32_t bond_id)
> +{
> +    struct tx_bond *tx;
> +    uint32_t hash = hash_bond_id(bond_id);
> +
> +    CMAP_FOR_EACH_WITH_HASH (tx, node, hash, tx_bonds) {
> +        if (tx->bond_id == bond_id) {
> +            return tx;
> +        }
> +    }
> +    return NULL;
> +}
> +
>  static int
>  port_reconfigure(struct dp_netdev_port *port)
>  {
> @@ -4919,6 +5029,7 @@ reconfigure_datapath(struct dp_netdev *dp)
>      struct hmapx busy_threads = HMAPX_INITIALIZER(&busy_threads);
>      struct dp_netdev_pmd_thread *pmd;
>      struct dp_netdev_port *port;
> +    struct tx_bond *bond;
>      int wanted_txqs;
>  
>      dp->last_reconfigure_seq = seq_read(dp->reconfigure_seq);
> @@ -5060,17 +5171,24 @@ reconfigure_datapath(struct dp_netdev *dp)
>          }
>      }
>  
> -    /* Add every port to the tx cache of every pmd thread, if it's not
> -     * there already and if this pmd has at least one rxq to poll. */
> +    /* Add every port to the tx cache and bond to the cmap of every pmd 
> thread,

Above sentence doesn't look good, maybe some commas missed?
Or maybe it should be:
"Add every port and bond to the tx port and bond caches of every pmd thread,"

> +     * if it's not there already and if this pmd has at least one rxq to 
> poll.
> +     */
> +    ovs_mutex_lock(&dp->bond_mutex);

We don't need to hold dp->bond_mutex while reading cmap.  RCU should protect us.

>      CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
>          ovs_mutex_lock(&pmd->port_mutex);
>          if (hmap_count(&pmd->poll_list) || pmd->core_id == NON_PMD_CORE_ID) {
>              HMAP_FOR_EACH (port, node, &dp->ports) {
>                  dp_netdev_add_port_tx_to_pmd(pmd, port);
>              }
> +
> +            CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
> +                dp_netdev_add_bond_tx_to_pmd(pmd, bond);
> +            }
>          }
>          ovs_mutex_unlock(&pmd->port_mutex);
>      }
> +    ovs_mutex_unlock(&dp->bond_mutex);
>  
>      /* Reload affected pmd threads. */
>      reload_affected_pmds(dp);
> @@ -6110,6 +6228,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread 
> *pmd, struct dp_netdev *dp,
>      atomic_init(&pmd->reload, false);
>      ovs_mutex_init(&pmd->flow_mutex);
>      ovs_mutex_init(&pmd->port_mutex);
> +    ovs_mutex_init(&pmd->bond_mutex);
>      cmap_init(&pmd->flow_table);
>      cmap_init(&pmd->classifiers);
>      pmd->ctx.last_rxq = NULL;
> @@ -6120,6 +6239,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread 
> *pmd, struct dp_netdev *dp,
>      hmap_init(&pmd->tx_ports);
>      hmap_init(&pmd->tnl_port_cache);
>      hmap_init(&pmd->send_port_cache);
> +    cmap_init(&pmd->tx_bonds);
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> @@ -6135,11 +6255,17 @@ static void
>  dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>  {
>      struct dpcls *cls;
> +    struct tx_bond *tx;
>  
>      dp_netdev_pmd_flow_flush(pmd);
>      hmap_destroy(&pmd->send_port_cache);
>      hmap_destroy(&pmd->tnl_port_cache);
>      hmap_destroy(&pmd->tx_ports);
> +    CMAP_FOR_EACH (tx, node, &pmd->tx_bonds) {
> +        cmap_remove(&pmd->tx_bonds, &tx->node, hash_bond_id(tx->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    }

This should be done inside dp_netdev_pmd_clear_ports(), not here.

> +    cmap_destroy(&pmd->tx_bonds);
>      hmap_destroy(&pmd->poll_list);
>      /* All flows (including their dpcls_rules) have been deleted already */
>      CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
> @@ -6151,6 +6277,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>      ovs_mutex_destroy(&pmd->flow_mutex);
>      seq_destroy(pmd->reload_seq);
>      ovs_mutex_destroy(&pmd->port_mutex);
> +    ovs_mutex_destroy(&pmd->bond_mutex);
>      free(pmd);
>  }
>  
> @@ -6304,6 +6431,53 @@ dp_netdev_del_port_tx_from_pmd(struct 
> dp_netdev_pmd_thread *pmd,
>      free(tx);
>      pmd->need_reload = true;
>  }
> +
> +/* Add bond to the tx bond cmap of 'pmd'. */
> +static void
> +dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> +                             struct tx_bond *bond)
> +{
> +    struct tx_bond *tx;
> +
> +    ovs_mutex_lock(&pmd->bond_mutex);
> +    tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
> +    if (tx) {
> +        struct tx_bond *new_tx = xmemdup(bond, sizeof *bond);
> +
> +        /* Copy the stats for each bucket. */
> +        for (int i = 0; i < BOND_BUCKETS; i++) {
> +            uint64_t n_packets, n_bytes;
> +
> +            atomic_read_relaxed(&tx->slave_buckets[i].n_packets, &n_packets);
> +            atomic_read_relaxed(&tx->slave_buckets[i].n_bytes, &n_bytes);
> +            atomic_init(&new_tx->slave_buckets[i].n_packets, n_packets);
> +            atomic_init(&new_tx->slave_buckets[i].n_bytes, n_bytes);

Actually, why we're copying stats here?

This function is called directly from dpif_netdev_bond_add() which is called
from update_recirc_rules__() after boning rebalancing.  This means that bucket
to slave mapping is likely changed.  This code copies bucket stats from the
previous state where it was related to a different slave.
Moreover, even if it was the same slave, bond_recirculation_account() always
adds stats received from flow/datapath, i.e. same stats will be incrementally
accounted over and over again messing up slave statistics.
AFAIU, recirculation rules are always brand new, i.e. their datapath stats are
zero after each rebalancing, the same should be here.

Could you, please, check that slave statistics makes any sence on rebalancing?
i.e. it is realistic or it's increased way to fast?  I bet there is an issue 
here.

> +        }
> +        cmap_replace(&pmd->tx_bonds, &tx->node, &new_tx->node,
> +                     hash_bond_id(bond->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    } else {
> +        tx = xmemdup(bond, sizeof *bond);
> +        cmap_insert(&pmd->tx_bonds, &tx->node, hash_bond_id(bond->bond_id));
> +    }
> +    ovs_mutex_unlock(&pmd->bond_mutex);
> +}
> +
> +/* Delete bond from the tx bond cmap of 'pmd'. */
> +static void
> +dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> +                               uint32_t bond_id)
> +{
> +    struct tx_bond *tx;
> +
> +    ovs_mutex_lock(&pmd->bond_mutex);
> +    tx = tx_bond_lookup(&pmd->tx_bonds, bond_id);
> +    if (tx) {
> +        cmap_remove(&pmd->tx_bonds, &tx->node, hash_bond_id(tx->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    }
> +    ovs_mutex_unlock(&pmd->bond_mutex);
> +}
>  
>  static char *
>  dpif_netdev_get_datapath_version(void)
> @@ -7129,6 +7303,88 @@ dp_execute_userspace_action(struct 
> dp_netdev_pmd_thread *pmd,
>      }
>  }
>  
> +static bool
> +dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
> +                         struct dp_packet_batch *packets_,
> +                         bool should_steal, odp_port_t port_no)
> +{

See my other reply regarding these "output" functions.

> +    struct tx_port *p = pmd_send_port_cache_lookup(pmd, port_no);
> +    if (!OVS_LIKELY(p)) {
> +        return false;
> +    }
> +
> +    struct dp_packet_batch out;
> +    if (!should_steal) {
> +        dp_packet_batch_clone(&out, packets_);
> +        dp_packet_batch_reset_cutlen(packets_);
> +        packets_ = &out;
> +    }
> +    dp_packet_batch_apply_cutlen(packets_);
> +#ifdef DPDK_NETDEV
> +    if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> +                     && packets_->packets[0]->source
> +                     != p->output_pkts.packets[0]->source)) {
Once again, please keep the original indentation while copying.

> +        /* netdev-dpdk assumes that all packets in a single
> +         * output batch has the same source. Flush here to
> +         * avoid memory access issues. */
> +        dp_netdev_pmd_flush_output_on_port(pmd, p);
> +    }
> +#endif
> +    if (dp_packet_batch_size(&p->output_pkts)
> +        + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> +        /* Flush here to avoid overflow. */
> +        dp_netdev_pmd_flush_output_on_port(pmd, p);
> +    }
> +    if (dp_packet_batch_is_empty(&p->output_pkts)) {
> +        pmd->n_output_batches++;
> +    }
> +
> +    struct dp_packet *packet;
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> +        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> +            pmd->ctx.last_rxq;
> +        dp_packet_batch_add(&p->output_pkts, packet);
> +    }
> +    return true;
> +}
> +
> +static bool
> +dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd,
> +                            struct dp_packet_batch *packets_,
> +                            bool should_steal, uint32_t bond)
> +{
> +    struct dp_packet *packet;
> +    uint32_t i;
> +    const uint32_t cnt = dp_packet_batch_size(packets_);
> +    struct tx_bond *p_bond = tx_bond_lookup(&pmd->tx_bonds, bond);
> +
> +    if (!p_bond) {
> +        return false;
> +    }
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
> +        /*
> +         * Lookup the bond-hash table using hash to get the slave.
> +         */
> +        uint32_t hash = dp_packet_get_rss_hash(packet);
> +        struct slave_entry *s_entry = &p_bond->slave_buckets[hash & 
> BOND_MASK];
> +        odp_port_t bond_member = s_entry->slave_id;
> +        uint32_t size = dp_packet_size(packet);
> +
> +        struct dp_packet_batch output_pkt;
> +        dp_packet_batch_init_packet(&output_pkt, packet);
> +        if (OVS_LIKELY(dp_execute_output_action(pmd, &output_pkt, 
> should_steal,
> +                                                bond_member))) {
> +            /* Update slave stats. */
> +            non_atomic_ullong_add(&s_entry->n_packets, 1);
> +            non_atomic_ullong_add(&s_entry->n_bytes, size);
> +        } else {
> +            dp_packet_batch_refill(packets_, packet, i);
> +        }
> +    }
> +
> +    return dp_packet_batch_is_empty(packets_);
> +}
> +
>  static void
>  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>                const struct nlattr *a, bool should_steal)
> @@ -7144,43 +7400,18 @@ dp_execute_cb(void *aux_, struct dp_packet_batch 
> *packets_,
>  
>      switch ((enum ovs_action_attr)type) {
>      case OVS_ACTION_ATTR_OUTPUT:
> -        p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
> -        if (OVS_LIKELY(p)) {
> -            struct dp_packet *packet;
> -            struct dp_packet_batch out;
> -
> -            if (!should_steal) {
> -                dp_packet_batch_clone(&out, packets_);
> -                dp_packet_batch_reset_cutlen(packets_);
> -                packets_ = &out;
> -            }
> -            dp_packet_batch_apply_cutlen(packets_);
> -
> -#ifdef DPDK_NETDEV
> -            if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> -                             && packets_->packets[0]->source
> -                                != p->output_pkts.packets[0]->source)) {
> -                /* XXX: netdev-dpdk assumes that all packets in a single
> -                 *      output batch has the same source. Flush here to
> -                 *      avoid memory access issues. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -#endif
> -            if (dp_packet_batch_size(&p->output_pkts)
> -                + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> -                /* Flush here to avoid overflow. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -
> -            if (dp_packet_batch_is_empty(&p->output_pkts)) {
> -                pmd->n_output_batches++;
> -            }
> +        if (dp_execute_output_action(pmd, packets_, should_steal,
> +                                     nl_attr_get_odp_port(a))) {
> +            return;
> +        } else {
> +            COVERAGE_ADD(datapath_drop_invalid_port,
> +                         dp_packet_batch_size(packets_));
> +        }
> +        break;
>  
> -            DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> -                p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> -                                                             
> pmd->ctx.last_rxq;
> -                dp_packet_batch_add(&p->output_pkts, packet);
> -            }
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        if (dp_execute_lb_output_action(pmd, packets_, should_steal,
> +                                        nl_attr_get_u32(a))) {
>              return;
>          } else {
>              COVERAGE_ADD(datapath_drop_invalid_port,
> @@ -7738,6 +7969,95 @@ dpif_netdev_ipf_dump_done(struct dpif *dpif 
> OVS_UNUSED, void *ipf_dump_ctx)
>  
>  }
>  
> +static int
> +dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id,
> +                     odp_port_t *slave_map)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    /* Prepare new bond mapping. */
> +    struct tx_bond *new_tx = xzalloc(sizeof *new_tx);
> +
> +    new_tx->bond_id = bond_id;
> +    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        new_tx->slave_buckets[bucket].slave_id = slave_map[bucket];
> +    }
> +
> +    /* Check if bond already existed. */
> +    struct tx_bond *old_tx = tx_bond_lookup(&dp->tx_bonds, bond_id);
> +    if (old_tx) {
> +        cmap_replace(&dp->tx_bonds, &old_tx->node, &new_tx->node,
> +                     hash_bond_id(bond_id));
> +        ovsrcu_postpone(free, old_tx);
> +    } else {
> +        cmap_insert(&dp->tx_bonds, &new_tx->node,
> +                    hash_bond_id(bond_id));
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
> +    /* Update all PMDs with new bond mapping. */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        dp_netdev_add_bond_tx_to_pmd(pmd, new_tx);
> +    }
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    /* Check if bond existed. */
> +    struct tx_bond *tx = tx_bond_lookup(&dp->tx_bonds, bond_id);
> +    if (tx) {
> +        cmap_remove(&dp->tx_bonds, &tx->node, hash_bond_id(bond_id));
> +        ovsrcu_postpone(free, tx);
> +    } else {
> +        /* Bond is not present. */
> +        ovs_mutex_unlock(&dp->bond_mutex);
> +        return 0;
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
> +    /* Remove the bond map in all pmds. */
> +    struct dp_netdev_pmd_thread *pmd;
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        dp_netdev_del_bond_tx_from_pmd(pmd, bond_id);
> +    }
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                           uint64_t *n_bytes)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +
> +    /* Search the bond in all PMDs. */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        struct tx_bond *pmd_bond_entry
> +            = tx_bond_lookup(&pmd->tx_bonds, bond_id);
> +
> +        if (!pmd_bond_entry) {
> +            continue;
> +        }
> +
> +        /* Read bond stats. */
> +        for (int i = 0; i < BOND_BUCKETS; i++) {
> +            uint64_t pmd_n_bytes;
> +            atomic_read_relaxed(
> +                 &pmd_bond_entry->slave_buckets[i].n_bytes,
> +                 &pmd_n_bytes);

Please, make this 2in 2 lines.

> +            n_bytes[i] += pmd_n_bytes;
> +        }
> +    }
> +    return 0;
> +}
> +
>  const struct dpif_class dpif_netdev_class = {
>      "netdev",
>      true,                       /* cleanup_required */
> @@ -7811,6 +8131,9 @@ const struct dpif_class dpif_netdev_class = {
>      dpif_netdev_meter_set,
>      dpif_netdev_meter_get,
>      dpif_netdev_meter_del,
> +    dpif_netdev_bond_add,
> +    dpif_netdev_bond_del,
> +    dpif_netdev_bond_stats_get,
>  };
>  
>  static void
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 5b5c96d..116f8fc 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -3993,6 +3993,9 @@ const struct dpif_class dpif_netlink_class = {
>      dpif_netlink_meter_set,
>      dpif_netlink_meter_get,
>      dpif_netlink_meter_del,
> +    NULL,                       /* bond_add */
> +    NULL,                       /* bond_del */
> +    NULL,                       /* bond_stats_get */
>  };
>  
>  static int
> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index b77317b..781c7ca 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -616,6 +616,17 @@ struct dpif_class {
>       * zero. */
>      int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
>                       struct ofputil_meter_stats *, uint16_t n_bands);
> +
> +    /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
> +    int (*bond_add)(struct dpif *dpif, uint32_t bond_id,
> +                    odp_port_t *slave_map);
> +
> +    /* Removes bond identified by 'bond_id' from 'dpif'. */
> +    int (*bond_del)(struct dpif *dpif, uint32_t bond_id);
> +
> +    /* Reads bond stats from 'dpif'. */
> +    int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
> +                          uint64_t *n_bytes);
>  };
>  
>  extern const struct dpif_class dpif_netlink_class;
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 9d9c716..c8dc8ed 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1170,6 +1170,7 @@ dpif_execute_helper_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>  
>      case OVS_ACTION_ATTR_CT:
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1220,6 +1221,7 @@ dpif_execute_helper_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>          struct dp_packet *clone = NULL;
>          uint32_t cutlen = dp_packet_get_cutlen(packet);
>          if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
> +                        || type == OVS_ACTION_ATTR_LB_OUTPUT
>                          || type == OVS_ACTION_ATTR_TUNNEL_PUSH
>                          || type == OVS_ACTION_ATTR_TUNNEL_POP
>                          || type == OVS_ACTION_ATTR_USERSPACE)) {
> @@ -1879,6 +1881,16 @@ dpif_supports_explicit_drop_action(const struct dpif 
> *dpif)
>      return dpif_is_netdev(dpif);
>  }
>  
> +bool
> +dpif_supports_lb_output_action(const struct dpif *dpif)
> +{
> +    /*
> +     * Balance-tcp optimization is currently supported in netdev
> +     * datapath only.
> +     */
> +    return dpif_is_netdev(dpif);
> +}
> +
>  /* Meters */
>  void
>  dpif_meter_get_features(const struct dpif *dpif,
> @@ -1976,3 +1988,40 @@ dpif_meter_del(struct dpif *dpif, ofproto_meter_id 
> meter_id,
>      }
>      return error;
>  }
> +
> +int
> +dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t *slave_map)
> +{
> +    int error = -ENOTSUP;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    int error = -ENOTSUP;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_del(dpif, bond_id);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                    uint64_t *n_bytes)
> +{
> +    int error = -ENOTSUP;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) {
> +        error = dpif->dpif_class->bond_stats_get(dpif, bond_id, n_bytes);
> +    }
> +
> +    return error;
> +}
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 4df8f7c..2f63b2c 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -891,6 +891,19 @@ int dpif_meter_get(const struct dpif *, ofproto_meter_id 
> meter_id,
>                     struct ofputil_meter_stats *, uint16_t n_bands);
>  int dpif_meter_del(struct dpif *, ofproto_meter_id meter_id,
>                     struct ofputil_meter_stats *, uint16_t n_bands);
> +
> +/* Bonding. */
> +
> +/* Bit-mask for hashing a flow down to a bucket. */
> +#define BOND_MASK 0xff
> +#define BOND_BUCKETS (BOND_MASK + 1)
> +
> +int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t 
> *slave_map);
> +int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
> +int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                        uint64_t *n_bytes);
> +bool dpif_supports_lb_output_action(const struct dpif *);
> +

No need for extra line here.

>  
>  /* Miscellaneous. */
>  
> diff --git a/lib/odp-execute.c b/lib/odp-execute.c
> index 42d3335..6018e37 100644
> --- a/lib/odp-execute.c
> +++ b/lib/odp-execute.c
> @@ -793,6 +793,7 @@ requires_datapath_assistance(const struct nlattr *a)
>      switch (type) {
>          /* These only make sense in the context of a datapath. */
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1068,6 +1069,7 @@ odp_execute_actions(void *dp, struct dp_packet_batch 
> *batch, bool steal,
>              return;
>          }
>          case OVS_ACTION_ATTR_OUTPUT:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>          case OVS_ACTION_ATTR_TUNNEL_PUSH:
>          case OVS_ACTION_ATTR_TUNNEL_POP:
>          case OVS_ACTION_ATTR_USERSPACE:
> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index 746d1e9..4ef4cdc 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -119,6 +119,7 @@ odp_action_len(uint16_t type)
>  
>      switch ((enum ovs_action_attr) type) {
>      case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
> +    case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
>      case OVS_ACTION_ATTR_TRUNC: return sizeof(struct ovs_action_trunc);
>      case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
>      case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
> @@ -1122,6 +1123,9 @@ format_odp_action(struct ds *ds, const struct nlattr *a,
>      case OVS_ACTION_ATTR_OUTPUT:
>          odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), ds);
>          break;
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        ds_put_format(ds, "lb_output(bond,%"PRIu32")", nl_attr_get_u32(a));

Why the word 'bond' here? ----------------^
It should be part of an action name or not exist at all.  It should not be an
argument.

One more thing is that you're adding flow formatting for this action, but I 
don't
see any changes for parse_odp_action__() that should be able to parse datapath
actions.  Please, add.

It also would make sense to add some tests to tests/odp.at.

> +        break;
>      case OVS_ACTION_ATTR_TRUNC: {
>          const struct ovs_action_trunc *trunc =
>                         nl_attr_get_unspec(a, sizeof *trunc);
> diff --git a/ofproto/bond.c b/ofproto/bond.c
> index 405202f..3503a78 100644
> --- a/ofproto/bond.c
> +++ b/ofproto/bond.c
> @@ -54,10 +54,6 @@ static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER;
>  static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
>  static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = &all_bonds__;
>  
> -/* Bit-mask for hashing a flow down to a bucket. */
> -#define BOND_MASK 0xff
> -#define BOND_BUCKETS (BOND_MASK + 1)
> -
>  /* Priority for internal rules created to handle recirculation */
>  #define RECIRC_RULE_PRIORITY 20
>  
> @@ -126,6 +122,8 @@ struct bond {
>      enum lacp_status lacp_status; /* Status of LACP negotiations. */
>      bool bond_revalidate;       /* True if flows need revalidation. */
>      uint32_t basis;             /* Basis for flow hash function. */
> +    bool use_bond_buckets;      /* Use bond buckets to avoid recirculation.
> +                                   Applicable only for Balance TCP mode. */

As Eelco already mentioned, this name is confusing.
In this patch you have:
1. use_bond_buckets
2. bond_is_cache_mode_enabled
3. is_lb_output_action_supported
All three basically stands for a same thing in different parts of code/user
interface.  Additionally, there are functions like:
1. bond_add_slave_buckets
2. ofproto_dpif_bundle_add
...
These functions operates on the same things, but calls them differently, i.e.
'bond buckets' vs. 'bundle'.  This is kind of confusing too.

>  
>      /* SLB specific bonding info. */
>      struct bond_entry *hash;     /* An array of BOND_BUCKETS elements. */
> @@ -188,6 +186,9 @@ static struct bond_slave *choose_output_slave(const 
> struct bond *,
>  static void update_recirc_rules__(struct bond *bond);
>  static bool bond_is_falling_back_to_ab(const struct bond *);
>  
> +static void bond_add_slave_buckets(const struct bond *bond);
> +static void bond_del_slave_buckets(const struct bond *bond);
> +
>  /* Attempts to parse 's' as the name of a bond balancing mode.  If 
> successful,
>   * stores the mode in '*balance' and returns true.  Otherwise returns false
>   * without modifying '*balance'. */
> @@ -282,6 +283,8 @@ bond_unref(struct bond *bond)
>  
>      /* Free bond resources. Remove existing post recirc rules. */
>      if (bond->recirc_id) {
> +        /* Delete bond buckets from datapath if installed. */
> +        bond_del_slave_buckets(bond);
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
> @@ -355,6 +358,7 @@ update_recirc_rules__(struct bond *bond)
>                              &bond->hash[i].pr_rule);
>              }
>          }
> +        bond_add_slave_buckets(bond);

This updates bonds inside the datapath, but it seems that we're still creating 
and
installing all the post-recirc actions to datapath in the next loop.
This doesn't make sense.  Am I missing something?

>      }
>  
>      HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) {
> @@ -464,9 +468,14 @@ bond_reconfigure(struct bond *bond, const struct 
> bond_settings *s)
>              bond->recirc_id = recirc_alloc_id(bond->ofproto);
>          }
>      } else if (bond->recirc_id) {
> +        /* Delete bond buckets from datapath if installed. */
> +        bond_del_slave_buckets(bond);
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
> +    if (bond->use_bond_buckets != s->use_bond_buckets) {
> +        bond->use_bond_buckets = s->use_bond_buckets;
> +    }
>  
>      if (bond->balance == BM_AB || !bond->hash || revalidate) {
>          bond_entry_reset(bond);
> @@ -944,6 +953,14 @@ bond_recirculation_account(struct bond *bond)
>      OVS_REQ_WRLOCK(rwlock)
>  {
>      int i;
> +    uint64_t n_bytes[BOND_BUCKETS] = {0};
> +    bool use_bond_buckets = bond_is_cache_mode_enabled(bond);
> +
> +    if (use_bond_buckets) {
> +        /* Retrieve bond stats from datapath. */
> +        dpif_bond_stats_get(bond->ofproto->backer->dpif,
> +                            bond->recirc_id, n_bytes);
> +    }
>  
>      for (i=0; i<=BOND_MASK; i++) {
>          struct bond_entry *entry = &bond->hash[i];
> @@ -953,8 +970,12 @@ bond_recirculation_account(struct bond *bond)
>              struct pkt_stats stats;
>              long long int used OVS_UNUSED;
>  
> -            rule->ofproto->ofproto_class->rule_get_stats(
> -                rule, &stats, &used);
> +            if (!use_bond_buckets) {
> +                rule->ofproto->ofproto_class->rule_get_stats(
> +                    rule, &stats, &used);
> +            } else {
> +                stats.n_bytes = n_bytes[i];
> +            }
>              bond_entry_account(entry, stats.n_bytes);
>          }
>      }
> @@ -1365,6 +1386,8 @@ bond_print_details(struct ds *ds, const struct bond 
> *bond)
>                    may_recirc ? "yes" : "no", may_recirc ? recirc_id: -1);
>  
>      ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
> +    ds_put_format(ds, "lb-output-action: %s, bond-id: %u\n",
> +                  bond->use_bond_buckets ? "enabled" : "disabled", 
> recirc_id);
>  
>      ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
>      ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
> @@ -1942,3 +1965,36 @@ bond_get_changed_active_slave(const char *name, struct 
> eth_addr *mac,
>  
>      return false;
>  }
> +
> +bool
> +bond_is_cache_mode_enabled(const struct bond *bond)
> +{
> +    return (bond->balance == BM_TCP) && bond->use_bond_buckets;
> +}
> +
> +static void
> +bond_add_slave_buckets(const struct bond *bond)
> +{
> +    ofp_port_t slave_map[BOND_MASK];
> +
> +    if (bond_is_cache_mode_enabled(bond)) {
> +        for (int i = 0; i < BOND_BUCKETS; i++) {
> +            struct bond_slave *slave = bond->hash[i].slave;
> +
> +            if (slave) {
> +                slave_map[i] = slave->ofp_port;
> +            } else {
> +                slave_map[i] = OFPP_NONE;
> +            }
> +        }
> +        ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id, slave_map);
> +    }
> +}
> +
> +static void
> +bond_del_slave_buckets(const struct bond *bond)
> +{
> +    if (bond_is_cache_mode_enabled(bond)) {
> +        ofproto_dpif_bundle_del(bond->ofproto, bond->recirc_id);
> +    }
> +}
> diff --git a/ofproto/bond.h b/ofproto/bond.h
> index e7c3d9b..376a280 100644
> --- a/ofproto/bond.h
> +++ b/ofproto/bond.h
> @@ -58,6 +58,8 @@ struct bond_settings {
>                                  /* The MAC address of the interface
>                                     that was active during the last
>                                     ovs run. */
> +    bool use_bond_buckets;      /* Use bond buckets. Only applicable for
> +                                   bond mode BALANCE TCP. */
>  };
>  
>  /* Program startup. */
> @@ -122,4 +124,7 @@ void bond_rebalance(struct bond *);
>  */
>  void bond_update_post_recirc_rules(struct bond *, uint32_t *recirc_id,
>                                     uint32_t *hash_basis);
> +
> +bool bond_is_cache_mode_enabled(const struct bond *);
> +
>  #endif /* bond.h */
> diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c
> index b413768..796eb6f 100644
> --- a/ofproto/ofproto-dpif-ipfix.c
> +++ b/ofproto/ofproto-dpif-ipfix.c
> @@ -2979,6 +2979,7 @@ dpif_ipfix_read_actions(const struct flow *flow,
>          enum ovs_action_attr type = nl_attr_type(a);
>          switch (type) {
>          case OVS_ACTION_ATTR_OUTPUT:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>              ipfix_actions->output_action = true;
>              break;
>          case OVS_ACTION_ATTR_SAMPLE:
> diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
> index f9ea47a..f616fb2 100644
> --- a/ofproto/ofproto-dpif-sflow.c
> +++ b/ofproto/ofproto-dpif-sflow.c
> @@ -1175,8 +1175,9 @@ dpif_sflow_read_actions(const struct flow *flow,
>          case OVS_ACTION_ATTR_RECIRC:
>          case OVS_ACTION_ATTR_HASH:
>          case OVS_ACTION_ATTR_CT:
> -    case OVS_ACTION_ATTR_CT_CLEAR:
> +        case OVS_ACTION_ATTR_CT_CLEAR:
>          case OVS_ACTION_ATTR_METER:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>              break;
>  
>          case OVS_ACTION_ATTR_SET_MASKED:
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index adf57a5..970bdbc 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -4190,7 +4190,6 @@ compose_output_action__(struct xlate_ctx *ctx, 
> ofp_port_t ofp_port,
>          xlate_commit_actions(ctx);
>  
>          if (xr) {
> -            /* Recirculate the packet. */
>              struct ovs_action_hash *act_hash;
>  
>              /* Hash action. */
> @@ -4199,15 +4198,27 @@ compose_output_action__(struct xlate_ctx *ctx, 
> ofp_port_t ofp_port,
>                  /* Algorithm supported by all datapaths. */
>                  hash_alg = OVS_HASH_ALG_L4;
>              }
> -            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> -                                                OVS_ACTION_ATTR_HASH,
> -                                                sizeof *act_hash);
> -            act_hash->hash_alg = hash_alg;
> -            act_hash->hash_basis = xr->hash_basis;
>  
> -            /* Recirc action. */
> -            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> -                           xr->recirc_id);
> +            if (bond_is_cache_mode_enabled(xport->xbundle->bond)) {
> +                /*
> +                 * If bond mode is balance-tcp and optimize balance tcp
> +                 * is enabled then use the hash directly for slave selection
> +                 * and avoid recirculation.
> +                 *
> +                 * Currently support for netdev datapath only.
> +                 */
> +                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_LB_OUTPUT,
> +                               xr->recirc_id);
> +            } else {
> +                act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> +                                                    OVS_ACTION_ATTR_HASH,
> +                                                    sizeof *act_hash);
> +                act_hash->hash_alg = hash_alg;
> +                act_hash->hash_basis = xr->hash_basis;
> +                /* Recirculate the packet. */
> +                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> +                               xr->recirc_id);
> +            }
>          } else if (is_native_tunnel) {
>              /* Output to native tunnel port. */
>              native_tunnel_output(ctx, xport, flow, odp_port, truncate);
> @@ -7268,7 +7279,8 @@ count_output_actions(const struct ofpbuf *odp_actions)
>      int n = 0;
>  
>      NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, odp_actions->size) {
> -        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
> +        if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) ||
> +            (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) {
>              n++;
>          }
>      }
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index d21874b..c4d219b 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -1582,6 +1582,8 @@ check_support(struct dpif_backer *backer)
>      backer->rt_support.ct_timeout = check_ct_timeout_policy(backer);
>      backer->rt_support.explicit_drop_action =
>          dpif_supports_explicit_drop_action(backer->dpif);
> +    backer->rt_support.lb_output_action=
> +        dpif_supports_lb_output_action(backer->dpif);
>  
>      /* Flow fields. */
>      backer->rt_support.odp.ct_state = check_ct_state(backer);
> @@ -3441,6 +3443,25 @@ bundle_remove(struct ofport *port_)
>      }
>  }
>  
> +int
> +ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto,
> +                        uint32_t bond_id, const ofp_port_t *slave_map)
> +{
> +    odp_port_t odp_map[BOND_BUCKETS];
> +
> +    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        /* Convert ofp_port to odp_port. */
> +        odp_map[bucket] = ofp_port_to_odp_port(ofproto, slave_map[bucket]);
> +    }
> +    return dpif_bond_add(ofproto->backer->dpif, bond_id, odp_map);
> +}
> +
> +int
> +ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto, uint32_t bond_id)
> +{
> +    return dpif_bond_del(ofproto->backer->dpif, bond_id);
> +}
> +
>  static void
>  send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
>  {
> @@ -5571,6 +5592,7 @@ get_datapath_cap(const char *datapath_type, struct smap 
> *cap)
>      smap_add(cap, "ct_timeout", s.ct_timeout ? "true" : "false");
>      smap_add(cap, "explicit_drop_action",
>               s.explicit_drop_action ? "true" :"false");
> +    smap_add(cap, "lb_output_action", s.lb_output_action ? "true" : "false");
>  }
>  
>  /* Gets timeout policy name in 'backer' based on 'zone', 'dl_type' and
> @@ -6609,6 +6631,13 @@ meter_del(struct ofproto *ofproto_, ofproto_meter_id 
> meter_id)
>      ovsrcu_postpone(free_meter_id, arg);
>  }
>  
> +static bool
> +is_lb_output_action_supported(const struct ofproto *ofproto_)
> +{
> +    struct ofproto_dpif *ofproto = ofproto_dpif_cast(ofproto_);
> +    return ofproto->backer->rt_support.lb_output_action;
> +}
> +
>  const struct ofproto_class ofproto_dpif_class = {
>      init,
>      enumerate_types,
> @@ -6716,4 +6745,5 @@ const struct ofproto_class ofproto_dpif_class = {
>      ct_flush,                   /* ct_flush */
>      ct_set_zone_timeout_policy,
>      ct_del_zone_timeout_policy,
> +    is_lb_output_action_supported,
>  };
> diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
> index aee61d6..9e2293e 100644
> --- a/ofproto/ofproto-dpif.h
> +++ b/ofproto/ofproto-dpif.h
> @@ -202,7 +202,10 @@ struct group_dpif *group_dpif_lookup(struct ofproto_dpif 
> *,
>      DPIF_SUPPORT_FIELD(bool, ct_timeout, "Conntrack timeout policy")        \
>                                                                              \
>      /* True if the datapath supports explicit drop action. */               \
> -    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")
> +    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")  \
> +                                                                            \
> +    /* True if the datapath supports balance_tcp optimization */            \
> +    DPIF_SUPPORT_FIELD(bool, lb_output_action, "Optimized Balance TCP mode")
>  
>  
>  /* Stores the various features which the corresponding backer supports. */
> @@ -382,6 +385,9 @@ int ofproto_dpif_add_internal_flow(struct ofproto_dpif *,
>                                     struct rule **rulep);
>  int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct match *,
>                                        int priority);
> +int ofproto_dpif_bundle_add(struct ofproto_dpif *, uint32_t bond_id,
> +                            const ofp_port_t *slave_map);
> +int ofproto_dpif_bundle_del(struct ofproto_dpif *, uint32_t bond_id);
>  
>  bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
>  
> diff --git a/ofproto/ofproto-provider.h b/ofproto/ofproto-provider.h
> index afecb24..571da24 100644
> --- a/ofproto/ofproto-provider.h
> +++ b/ofproto/ofproto-provider.h
> @@ -1905,6 +1905,9 @@ struct ofproto_class {
>      /* Deletes the timeout policy associated with 'zone' in datapath type
>       * 'dp_type'. */
>      void (*ct_del_zone_timeout_policy)(const char *dp_type, uint16_t zone);
> +
> +    /* Return 'true' if datapath supports 'lb-output-action'. */
> +    bool (*is_lb_output_action_supported)(const struct ofproto *);
>  };
>  
>  extern const struct ofproto_class ofproto_dpif_class;
> diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
> index 0fbd6c3..0d1cb49 100644
> --- a/ofproto/ofproto.c
> +++ b/ofproto/ofproto.c
> @@ -1497,6 +1497,16 @@ ofproto_is_mirror_output_bundle(const struct ofproto 
> *ofproto, void *aux)
>              ? ofproto->ofproto_class->is_mirror_output_bundle(ofproto, aux)
>              : false);
>  }
> +
> +/* Returns true if datapath supports lb-output-action. */
> +bool
> +ofproto_is_lb_output_action_supported(const struct ofproto *ofproto)
> +{
> +    return (ofproto->ofproto_class->is_lb_output_action_supported
> +            ? ofproto->ofproto_class->is_lb_output_action_supported(ofproto)
> +            : false);
> +}
> +
>  
>  /* Configuration of OpenFlow tables. */
>  
> diff --git a/ofproto/ofproto.h b/ofproto/ofproto.h
> index 2dd2531..dda2733 100644
> --- a/ofproto/ofproto.h
> +++ b/ofproto/ofproto.h
> @@ -505,6 +505,7 @@ unsigned int ofproto_aa_vlan_get_queue_size(struct 
> ofproto *ofproto);
>  
>  int ofproto_set_flood_vlans(struct ofproto *, unsigned long *flood_vlans);
>  bool ofproto_is_mirror_output_bundle(const struct ofproto *, void *aux);
> +bool ofproto_is_lb_output_action_supported(const struct ofproto *ofproto);
>  
>  /* Configuration of OpenFlow tables. */
>  struct ofproto_table_settings {
> diff --git a/tests/lacp.at b/tests/lacp.at
> index 7b460d7..d7d98f7 100644
> --- a/tests/lacp.at
> +++ b/tests/lacp.at
> @@ -121,6 +121,7 @@ AT_CHECK([ovs-appctl bond/show], [0], [dnl
>  bond_mode: active-backup
>  bond may use recirculation: no, Recirc-ID : -1
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 0
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -286,6 +287,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 1
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -301,6 +303,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 2
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -423,6 +426,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 1
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -440,6 +444,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 2
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -555,6 +560,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 1
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -572,6 +578,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 2
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -692,6 +699,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 1
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -709,6 +717,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 2
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index fe73c38..ffc3921 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -4600,6 +4600,19 @@ port_configure_bond(struct port *port, struct 
> bond_settings *s)
>          /* OVSDB did not store the last active interface */
>          s->active_slave_mac = eth_addr_zero;
>      }
> +
> +    /* use_bond_buckets is disabled by default. */
> +    s->use_bond_buckets = (s->balance == BM_TCP)

We should probably warn user that lb_output and non-balance-tcp bond
are not compatible.

> +                          && smap_get_bool(&port->cfg->other_config,
> +                                         "lb-output-action", false);
> +
> +    /* Verify if datapath supports lb-output-action. */
> +    if (s->use_bond_buckets
> +        && !ofproto_is_lb_output_action_supported(port->bridge->ofproto)) {
> +        VLOG_WARN("port %s: Unsupported lb-output-action in datapath, "
> +                  "defaulting to false.", port->name);
> +        s->use_bond_buckets = false;
> +    }
>  }
>  
>  /* Returns true if 'port' is synthetic, that is, if we constructed it locally
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 4a74ed3..3da8d0c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -1994,6 +1994,16 @@
>          <code>active-backup</code>.
>        </column>
>  
> +      <column name="other_config" key="lb-output-action"
> +              type='{"type": "boolean"}'>
> +        Enable/disable usage of optimized lb-output-action for
> +        balancing flows among output slaves in load balanced bonds in
> +        <code>balance-tcp</code>. When enabled, it uses optimized path for
> +        balance-tcp mode by using rss hash and avoids recirculation.
> +        It affects only new flows, i.e, existing flows remain unchanged.
> +        This knob does not affect other balancing modes.
> +      </column>
> +
>        <group title="Link Failure Detection">
>          <p>
>            An important part of link bonding is detecting that links are down 
> so
> @@ -5788,6 +5798,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 
> type=patch options:peer=p1 \
>          higher performance for MPLS and active-active load balancing
>          bonding modes.
>        </column>
> +      <column name="capabilities" key="lb_output_action"
> +              type='{"type": "boolean"}'>
> +        If this is true, then the datapath supports optimized balance-tcp
> +        bond mode. This capability replaces existing <code>hash</code> and
> +        <code>recirc</code> actions with new action <code>lb_output</code>
> +        and avoids recirculation of packet in datapath. It is supported
> +        only for balance-tcp bond mode in netdev datapath. The new action
> +        gives higer performance by using bond buckets instead of post
> +        recirculation flows for selection of slave port from bond. By default
> +        this new action is disabled, however it can be enabled by setting
> +        <ref column="other-config" key="lb-output-action"/> in
> +        <ref table="Port"/> table.
> +      </column>
>        <group title="Connection-Tracking Capabilities">
>          <p>
>            These capabilities are granular because Open vSwitch and its
> 

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to