Re: [ovs-dev] [PATCH ovn] lflow: Use learn() action to generate LB hairpin reply flows.

2021-02-08 Thread Numan Siddique
On Mon, Feb 8, 2021 at 4:02 PM Dumitru Ceara  wrote:
>
> On 2/8/21 11:09 AM, Numan Siddique wrote:
> > On Sat, Feb 6, 2021 at 4:00 AM Dumitru Ceara  wrote:
> >>
> >> The main trait of load balancer hairpin traffic is that it never leaves
> >> the local hypervisor.  Essentially this means that only hairpin
> >> openflows installed for logical switches that have at least one logical
> >> switch port bound locally can ever be hit.
> >>
> >> Until now, if a load balancer was applied on multiple logical switches
> >> that are connected through a distributed router, ovn-controller would
> >> install flows to detect hairpin replies for all logical switches. In
> >> practice this leads to a very high number of openflows out of which
> >> most will never be used.
> >>
> >> Instead we now use an additional action, learn(), on flows that match on
> >> packets that create the hairpin session.  The learn() action will then
> >> generate the necessary flows to handle hairpin replies, but only for
> >> the local datapaths which actually generate hairpin traffic.
> >>
> >> For example, simulating how ovn-k8s uses load balancer for services,
> >> in a "switch per node" scenario, the script below would generate
> >> 10K (n_nodes * n_vips * n_backends) openflows on every node in table=69
> >> (hairpin reply).  With this patch the maximum number of openflows that
> >> can be created for hairpin replies is 200 (n_vips * n_backends).
> >>
> >> In general, for deployments that leverage switch-per-node topologies,
> >> the number of openflows is reduced by a factor of N, where N is the
> >> number of nodes.
> >>
> >>$ cat lbs.sh
> >>NODES=50
> >>VIPS=20
> >>BACKENDS=10
> >>ovn-nbctl lr-add rtr
> >>for ((i = 1; i <= $NODES; i++)); do
> >>ovn-nbctl \
> >>-- ls-add ls$i \
> >>-- lsp-add ls$i vm$i \
> >>-- lsp-add ls$i ls$i-rtr \
> >>-- lsp-set-type ls$i-rtr router \
> >>-- lsp-set-options ls$i-rtr router-port=rtr-ls$i \
> >>-- lrp-add rtr rtr-ls$i 00:00:00:00:01:00 42.42.42.$i/24
> >>done
> >>
> >>for ((i = 1; i <= $VIPS; i++)); do
> >>lb=lb$i
> >>vip=10.10.10.$i:1
> >>bip=20.20.20.1:2
> >>for ((j = 2; j <= $BACKENDS; j++)); do
> >>bip="$bip,20.20.20.$j:2"
> >>done
> >>ovn-nbctl lb-add $lb $vip $bip
> >>done
> >>
> >>for ((i = 1; i <= $NODES; i++)); do
> >>for ((j = 1; j <= $VIPS; j++)); do
> >>ovn-nbctl ls-lb-add ls$i lb$j
> >>done
> >>done
> >>
> >>ovs-vsctl add-port br-int vm1 \
> >>-- set interface vm1 type=internal \
> >>-- set interface vm1 external-ids:iface-id=vm1
> >>
> >> Suggested-by: Ilya Maximets 
> >> Signed-off-by: Dumitru Ceara 
> >
> > Hi Dumitru,
>
> Hi Numan,
>
> >
> > Thanks for this patch. The patch LGTM. I tested it out with a huge NB
> > DB with lots of load balancers and backends.  I see a significant
> > reduction in OF flows.
>
> Thanks for the review!
>
> >
> > Please see the few comments below.
> >
> > Thanks
> > Numan
> >
> >> ---
> >>   controller/lflow.c  | 204 ---
> >>   tests/ofproto-macros.at |   5 +-
> >>   tests/ovn.at| 516 
> >> +---
> >>   3 files changed, 455 insertions(+), 270 deletions(-)
> >>
> >> diff --git a/controller/lflow.c b/controller/lflow.c
> >> index 946c1e0..2b7d356 100644
> >> --- a/controller/lflow.c
> >> +++ b/controller/lflow.c
> >> @@ -1171,6 +1171,178 @@ add_neighbor_flows(struct ovsdb_idl_index 
> >> *sbrec_port_binding_by_name,
> >>   }
> >>   }
> >>
> >> +/* Builds the "learn()" action to be triggered by packets initiating a
> >> + * hairpin session.
> >> + *
> >> + * This will generate flows in table OFTABLE_CHK_LB_HAIRPIN_REPLY of the 
> >> form:
> >> + * - match:
> >> + * metadata=,ip/ipv6,ip.src=,ip.dst=
> >> + * nw_proto='lb_proto',tp_src_port=
> >> + * - action:
> >> + * set MLF_LOOKUP_LB_HAIRPIN_BIT=1
> >> + */
> >> +static void
> >> +add_lb_vip_hairpin_reply_action(struct in6_addr *vip6, ovs_be32 vip,
> >> +uint8_t lb_proto, bool has_l4_port,
> >> +uint64_t cookie, struct ofpbuf *ofpacts)
> >> +{
> >> +struct match match = MATCH_CATCHALL_INITIALIZER;
> >> +struct ofpact_learn *ol = ofpact_put_LEARN(ofpacts);
> >> +struct ofpact_learn_spec *ol_spec;
> >> +unsigned int imm_bytes;
> >> +uint8_t *src_imm;
> >> +
> >> +/* Once learned, hairpin reply flows are permanent until the 
> >> VIP/backend
> >> + * is removed.
> >> + */
> >> +ol->flags = NX_LEARN_F_DELETE_LEARNED;
> >> +ol->idle_timeout = OFP_FLOW_PERMANENT;
> >> +ol->hard_timeout = OFP_FLOW_PERMANENT;
> >> +ol->priority = OFP_DEFAULT_PRIORITY;
> >> +ol->table_id = OFTABLE_CHK_LB_HAIRPIN_REPLY;
> >> +ol->cookie = htonll(cookie);
> >> +
> >> +/* Match on metadata of the 
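
A rough way to check the flow-count reduction described in the commit message
above, assuming table 69 is the hairpin-reply table and br-int is the
integration bridge as in that message (run before and after the patch on a
node with locally bound ports):

   $ ovs-ofctl dump-flows br-int table=69 | wc -l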

Re: [ovs-dev] [PATCH ovn v2 10/10] lflow-cache: Make max cache memory usage configurable.

2021-02-08 Thread Numan Siddique
On Thu, Feb 4, 2021 at 6:57 PM Dumitru Ceara  wrote:
>
> Add a new OVS external-id, "ovn-memlimit-lflow-cache-kb", through which
> users can specify the maximum amount of memory (in KB) ovn-controller
> can use for caching logical flows.
>
> To maintain backwards compatibility the default behavior is to not
> enforce any memory limit on the size of the cache.
>
> Add lflow cache reports to "ovn-appctl -t ovn-controller memory/show".
>
> The memory usage for cache entries of type LCACHE_T_EXPR is computed by
> doing another pass through the expression tree.  While this adds a few
> extra cycles, entries of type LCACHE_T_EXPR shouldn't be very common
> because they are created only for flows with matches that include
> "is_chassis_resident()" but do not reference port groups and address sets.
>
> Signed-off-by: Dumitru Ceara 

Hi Dumitru,

Thanks for working on this series.

I have one comment. I think it will be useful to report the used
memory by the cache
in the reply to the command - lflow-cache/show-stats.

Thanks
Numan
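
For reference, a minimal sketch of how this knob would be set and the result
inspected (the external-id name is the one introduced by this patch; the
64 MB value is arbitrary):

   $ ovs-vsctl set Open_vSwitch . external-ids:ovn-memlimit-lflow-cache-kb=65536
   $ ovn-appctl -t ovn-controller memory/show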

> ---
>  NEWS|8 +++--
>  controller/chassis.c|   21 +
>  controller/lflow-cache.c|   63 
> ++-
>  controller/lflow-cache.h|   13 ++--
>  controller/lflow.c  |8 +++--
>  controller/ovn-controller.8.xml |7 
>  controller/ovn-controller.c |8 -
>  controller/test-lflow-cache.c   |   31 +--
>  include/ovn/expr.h  |3 +-
>  lib/expr.c  |   39 
>  tests/ovn-lflow-cache.at|   54 +++--
>  11 files changed, 213 insertions(+), 42 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 6e10557..ddb5977 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -16,9 +16,11 @@ Post-v20.12.0
>- Add a new option to Load_Balancer.options, "hairpin_snat_ip", to allow
>  users to explicitly select which source IP should be used for load
>  balancer hairpin traffic.
> -  - ovn-controller: Add a configuration knob, through OVS external-id
> -"ovn-limit-lflow-cache", to allow enforcing a limit for the size of the
> -logical flow cache.
> +  - ovn-controller: Add configuration knobs, through OVS external-id
> +"ovn-limit-lflow-cache" and "ovn-memlimit-lflow-cache-kb", to allow
> +enforcing a limit for the size of the logical flow cache based on
> +maximum number of entries and/or memory usage.
> +  - ovn-controller: Add lflow cache related memory reports.
>
>  OVN v20.12.0 - 18 Dec 2020
>  --
> diff --git a/controller/chassis.c b/controller/chassis.c
> index c66837a..a4a264c 100644
> --- a/controller/chassis.c
> +++ b/controller/chassis.c
> @@ -50,6 +50,7 @@ struct ovs_chassis_cfg {
>  const char *chassis_macs;
>  const char *enable_lflow_cache;
>  const char *limit_lflow_cache;
> +const char *memlimit_lflow_cache;
>
>  /* Set of encap types parsed from the 'ovn-encap-type' external-id. */
>  struct sset encap_type_set;
> @@ -142,6 +143,12 @@ get_limit_lflow_cache(const struct smap *ext_ids)
>  }
>
>  static const char *
> +get_memlimit_lflow_cache(const struct smap *ext_ids)
> +{
> +return smap_get_def(ext_ids, "ovn-memlimit-lflow-cache-kb", "");
> +}
> +
> +static const char *
>  get_encap_csum(const struct smap *ext_ids)
>  {
>  return smap_get_def(ext_ids, "ovn-encap-csum", "true");
> @@ -264,6 +271,8 @@ chassis_parse_ovs_config(const struct 
> ovsrec_open_vswitch_table *ovs_table,
>  ovs_cfg->chassis_macs = get_chassis_mac_mappings(&cfg->external_ids);
>  ovs_cfg->enable_lflow_cache = get_enable_lflow_cache(&cfg->external_ids);
>  ovs_cfg->limit_lflow_cache = get_limit_lflow_cache(&cfg->external_ids);
> +ovs_cfg->memlimit_lflow_cache =
> +get_memlimit_lflow_cache(&cfg->external_ids);
>
>  if (!chassis_parse_ovs_encap_type(encap_type, &ovs_cfg->encap_type_set)) {
>  return false;
> @@ -293,6 +302,7 @@ chassis_build_other_config(struct smap *config, const 
> char *bridge_mappings,
> const char *iface_types,
> const char *enable_lflow_cache,
> const char *limit_lflow_cache,
> +   const char *memlimit_lflow_cache,
> bool is_interconn)
>  {
>  smap_replace(config, "ovn-bridge-mappings", bridge_mappings);
> @@ -301,6 +311,7 @@ chassis_build_other_config(struct smap *config, const 
> char *bridge_mappings,
>  smap_replace(config, "ovn-monitor-all", monitor_all);
>  smap_replace(config, "ovn-enable-lflow-cache", enable_lflow_cache);
>  smap_replace(config, "ovn-limit-lflow-cache", limit_lflow_cache);
> +smap_replace(config, "ovn-memlimit-lflow-cache-kb", 
> memlimit_lflow_cache);
>  smap_replace(config, "iface-types", iface_types);
>  smap_replace(config, "ovn-chassis-mac-mappings", chassis_macs);
>  

Re: [ovs-dev] [PATCH ovn v2 09/10] lflow-cache: Make maximum number of cache entries configurable.

2021-02-08 Thread Numan Siddique
On Thu, Feb 4, 2021 at 6:57 PM Dumitru Ceara  wrote:
>
> Add a new OVS external-id, "ovn-limit-lflow-cache", through which users
> can specify the maximum size of the ovn-controller logical flow cache.
>
> To maintain backwards compatibility the default behavior is to not
> enforce any limit on the size of the cache.
>
> When the cache becomes full, the rule is to prefer more "important"
> cache entries over less "important" ones.  That is, evict entries of
> type LCACHE_T_CONJ_ID if there's no room to add an entry of type
> LCACHE_T_EXPR.  Similarly, evict entries of type LCACHE_T_CONJ_ID or
> LCACHE_T_EXPR if there's no room to add an entry of type
> LCACHE_T_MATCHES.
>
> Signed-off-by: Dumitru Ceara 

Acked-by: Numan Siddique 

I think the above rules description can also be added as a comment in the
function lflow_cache_make_room__().

Thanks
Numan
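
For reference, a minimal sketch of setting the entry limit discussed above
(the external-id name is the one introduced by this patch; the value is
arbitrary):

   $ ovs-vsctl set Open_vSwitch . external-ids:ovn-limit-lflow-cache=10000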


> ---
>  NEWS|3 +
>  controller/chassis.c|   23 -
>  controller/lflow-cache.c|   48 +-
>  controller/lflow-cache.h|2 -
>  controller/ovn-controller.8.xml |   16 ++
>  controller/ovn-controller.c |8 +++
>  controller/test-lflow-cache.c   |   12 +++--
>  tests/ovn-lflow-cache.at|  104 
> +++
>  8 files changed, 206 insertions(+), 10 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 2044671..6e10557 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -16,6 +16,9 @@ Post-v20.12.0
>- Add a new option to Load_Balancer.options, "hairpin_snat_ip", to allow
>  users to explicitly select which source IP should be used for load
>  balancer hairpin traffic.
> +  - ovn-controller: Add a configuration knob, through OVS external-id
> +"ovn-limit-lflow-cache", to allow enforcing a limit for the size of the
> +logical flow cache.
>
>  OVN v20.12.0 - 18 Dec 2020
>  --
> diff --git a/controller/chassis.c b/controller/chassis.c
> index b4d4b0e..c66837a 100644
> --- a/controller/chassis.c
> +++ b/controller/chassis.c
> @@ -49,6 +49,7 @@ struct ovs_chassis_cfg {
>  const char *monitor_all;
>  const char *chassis_macs;
>  const char *enable_lflow_cache;
> +const char *limit_lflow_cache;
>
>  /* Set of encap types parsed from the 'ovn-encap-type' external-id. */
>  struct sset encap_type_set;
> @@ -135,6 +136,12 @@ get_enable_lflow_cache(const struct smap *ext_ids)
>  }
>
>  static const char *
> +get_limit_lflow_cache(const struct smap *ext_ids)
> +{
> +return smap_get_def(ext_ids, "ovn-limit-lflow-cache", "");
> +}
> +
> +static const char *
>  get_encap_csum(const struct smap *ext_ids)
>  {
>  return smap_get_def(ext_ids, "ovn-encap-csum", "true");
> @@ -256,6 +263,7 @@ chassis_parse_ovs_config(const struct 
> ovsrec_open_vswitch_table *ovs_table,
>  ovs_cfg->monitor_all = get_monitor_all(&cfg->external_ids);
>  ovs_cfg->chassis_macs = get_chassis_mac_mappings(&cfg->external_ids);
>  ovs_cfg->enable_lflow_cache = get_enable_lflow_cache(&cfg->external_ids);
> +ovs_cfg->limit_lflow_cache = get_limit_lflow_cache(&cfg->external_ids);
>
>  if (!chassis_parse_ovs_encap_type(encap_type, &ovs_cfg->encap_type_set)) {
>  return false;
> @@ -283,13 +291,16 @@ chassis_build_other_config(struct smap *config, const 
> char *bridge_mappings,
> const char *datapath_type, const char 
> *cms_options,
> const char *monitor_all, const char *chassis_macs,
> const char *iface_types,
> -   const char *enable_lflow_cache, bool is_interconn)
> +   const char *enable_lflow_cache,
> +   const char *limit_lflow_cache,
> +   bool is_interconn)
>  {
>  smap_replace(config, "ovn-bridge-mappings", bridge_mappings);
>  smap_replace(config, "datapath-type", datapath_type);
>  smap_replace(config, "ovn-cms-options", cms_options);
>  smap_replace(config, "ovn-monitor-all", monitor_all);
>  smap_replace(config, "ovn-enable-lflow-cache", enable_lflow_cache);
> +smap_replace(config, "ovn-limit-lflow-cache", limit_lflow_cache);
>  smap_replace(config, "iface-types", iface_types);
>  smap_replace(config, "ovn-chassis-mac-mappings", chassis_macs);
>  smap_replace(config, "is-interconn", is_interconn ? "true" : "false");
> @@ -305,6 +316,7 @@ chassis_other_config_changed(const char *bridge_mappings,
>   const char *monitor_all,
>   const char *chassis_macs,
>   const char *enable_lflow_cache,
> + const char *limit_lflow_cache,
>   const struct ds *iface_types,
>   bool is_interconn,
>   const struct sbrec_chassis *chassis_rec)
> @@ -344,6 +356,13 @@ chassis_other_config_changed(const char 

Re: [ovs-dev] [PATCH ovn v2 08/10] lflow-cache: Reclaim heap memory after cache flush.

2021-02-08 Thread Numan Siddique
On Thu, Feb 4, 2021 at 6:57 PM Dumitru Ceara  wrote:
>
> If possible, automatically reclaim heap memory when the lflow cache is
> flushed.  This can be an expensive operation but cache flushing is not
> a very common operation.
>
> This change is inspired by Ilya Maximets' OVS commit:
>   f38f98a2c0dd ("ovsdb-server: Reclaim heap memory after compaction.")
>
> Additionally, when flushing the cache, also shrink the backing hmap.
>
> Signed-off-by: Dumitru Ceara 

Acked-by: Numan Siddique 

Numan
> ---
>  configure.ac |1 +
>  controller/lflow-cache.c |9 +
>  2 files changed, 10 insertions(+)
>
> diff --git a/configure.ac b/configure.ac
> index b2d0843..ebb09a5 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -96,6 +96,7 @@ OVN_CHECK_DOT
>  OVS_CHECK_IF_DL
>  OVS_CHECK_STRTOK_R
>  AC_CHECK_DECLS([sys_siglist], [], [], [[#include <signal.h>]])
> +AC_CHECK_DECLS([malloc_trim], [], [], [[#include <malloc.h>]])
>  AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
>[], [], [[#include <sys/stat.h>]])
>  AC_CHECK_MEMBERS([struct ifreq.ifr_flagshigh], [], [], [[#include <net/if.h>]])
> diff --git a/controller/lflow-cache.c b/controller/lflow-cache.c
> index 1549034..2453b10 100644
> --- a/controller/lflow-cache.c
> +++ b/controller/lflow-cache.c
> @@ -17,6 +17,10 @@
>
>  #include <config.h>
>
> +#if HAVE_DECL_MALLOC_TRIM
> +#include <malloc.h>
> +#endif
> +
>  #include "coverage.h"
>  #include "lib/ovn-sb-idl.h"
>  #include "lflow-cache.h"
> @@ -87,7 +91,12 @@ lflow_cache_flush(struct lflow_cache *lc)
>  HMAP_FOR_EACH_SAFE (lce, lce_next, node, &lc->entries[i]) {
>  lflow_cache_delete__(lc, lce);
>  }
> +hmap_shrink(&lc->entries[i]);
>  }
> +
> +#if HAVE_DECL_MALLOC_TRIM
> +malloc_trim(0);
> +#endif
>  }
>
>  void
>


Re: [ovs-dev] [PATCH ovn v2 05/10] lflow-cache: Add unit tests.

2021-02-08 Thread Numan Siddique
On Thu, Feb 4, 2021 at 6:56 PM Dumitru Ceara  wrote:
>
> Signed-off-by: Dumitru Ceara 

Acked-by: Numan Siddique 

Numan

> ---
>  controller/test-lflow-cache.c  |  224 +++
>  controller/test-ofctrl-seqno.c |   18 ---
>  tests/automake.mk  |8 +
>  tests/ovn-lflow-cache.at   |  257 
> 
>  tests/ovn.at   |   68 +++
>  tests/test-utils.c |   49 
>  tests/test-utils.h |   26 
>  tests/testsuite.at |1
>  8 files changed, 633 insertions(+), 18 deletions(-)
>  create mode 100644 controller/test-lflow-cache.c
>  create mode 100644 tests/ovn-lflow-cache.at
>  create mode 100644 tests/test-utils.c
>  create mode 100644 tests/test-utils.h
>
> diff --git a/controller/test-lflow-cache.c b/controller/test-lflow-cache.c
> new file mode 100644
> index 000..d6e962e
> --- /dev/null
> +++ b/controller/test-lflow-cache.c
> @@ -0,0 +1,224 @@
> +/* Copyright (c) 2021, Red Hat, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "ovn/expr.h"
> +#include "ovn-sb-idl.h"
> +#include "tests/ovstest.h"
> +#include "tests/test-utils.h"
> +#include "util.h"
> +
> +#include "lflow-cache.h"
> +
> +static void
> +test_lflow_cache_add__(struct lflow_cache *lc, const char *op_type,
> +   const struct sbrec_logical_flow *lflow,
> +   unsigned int conj_id_ofs,
> +   struct expr *e)
> +{
> +printf("ADD %s:\n", op_type);
> +printf("  conj-id-ofs: %u\n", conj_id_ofs);
> +
> +if (!strcmp(op_type, "conj-id")) {
> +lflow_cache_add_conj_id(lc, lflow, conj_id_ofs);
> +} else if (!strcmp(op_type, "expr")) {
> +lflow_cache_add_expr(lc, lflow, conj_id_ofs, expr_clone(e));
> +} else if (!strcmp(op_type, "matches")) {
> +struct hmap *matches = xmalloc(sizeof *matches);
> +ovs_assert(expr_to_matches(e, NULL, NULL, matches) == 0);
> +ovs_assert(hmap_count(matches) == 1);
> +lflow_cache_add_matches(lc, lflow, matches);
> +} else {
> +OVS_NOT_REACHED();
> +}
> +}
> +
> +static void
> +test_lflow_cache_lookup__(struct lflow_cache *lc,
> +  const struct sbrec_logical_flow *lflow)
> +{
> +struct lflow_cache_value *lcv = lflow_cache_get(lc, lflow);
> +
> +printf("LOOKUP:\n");
> +if (!lcv) {
> +printf("  not found\n");
> +return;
> +}
> +
> +printf("  conj_id_ofs: %"PRIu32"\n", lcv->conj_id_ofs);
> +switch (lcv->type) {
> +case LCACHE_T_CONJ_ID:
> +printf("  type: conj-id\n");
> +break;
> +case LCACHE_T_EXPR:
> +printf("  type: expr\n");
> +break;
> +case LCACHE_T_MATCHES:
> +printf("  type: matches\n");
> +break;
> +case LCACHE_T_NONE:
> +OVS_NOT_REACHED();
> +break;
> +}
> +}
> +
> +static void
> +test_lflow_cache_delete__(struct lflow_cache *lc,
> +  const struct sbrec_logical_flow *lflow)
> +{
> +printf("DELETE\n");
> +lflow_cache_delete(lc, lflow);
> +}
> +
> +static void
> +test_lflow_cache_stats__(struct lflow_cache *lc)
> +{
> +struct lflow_cache_stats *lcs = lflow_cache_get_stats(lc);
> +
> +if (!lcs) {
> +return;
> +}
> +printf("Enabled: %s\n", lflow_cache_is_enabled(lc) ? "true" : "false");
> +for (size_t i = 0; i < LCACHE_T_MAX; i++) {
> +printf("  %s: %"PRIuSIZE"\n", lflow_cache_type_names[i],
> +   lcs->n_entries[i]);
> +}
> +free(lcs);
> +}
> +
> +static void
> +test_lflow_cache_operations(struct ovs_cmdl_context *ctx)
> +{
> +struct lflow_cache *lc = lflow_cache_create();
> +struct expr *e = expr_create_boolean(true);
> +bool enabled = !strcmp(ctx->argv[1], "true");
> +unsigned int shift = 2;
> +unsigned int n_ops;
> +
> +lflow_cache_enable(lc, enabled);
> +test_lflow_cache_stats__(lc);
> +
> +if (!test_read_uint_value(ctx, shift++, "n_ops", &n_ops)) {
> +goto done;
> +}
> +
> +for (unsigned int i = 0; i < n_ops; i++) {
> +const char *op = test_read_value(ctx, shift++, "op");
> +
> +if (!op) {
> +goto done;
> +}
> +
> +struct sbrec_logical_flow lflow;
> +uuid_generate(&lflow.header_.uuid);
> +
> +if (!strcmp(op, "add")) {
> +

Re: [ovs-dev] [PATCH v2] netlink: ignore IFLA_WIRELESS events

2021-02-08 Thread Gregory Rose




On 1/14/2021 1:09 AM, Michal Kazior wrote:

From: Michal Kazior 

Some older wireless drivers - ones relying on the
old and long deprecated wireless extension ioctl
system - can generate quite a bit of IFLA_WIRELESS
events depending on their configuration and
runtime conditions. These are delivered as
RTNLGRP_LINK via RTM_NEWLINK messages.

These tend to be relatively easily identifiable
because they report the change mask being 0. This
isn't guaranteed but in practice it shouldn't be a
problem. None of the wireless events that I ever
observed actually carry any unique information
about netdev states that ovs-vswitchd is
interested in. Hence ignoring these shouldn't
cause any problems.

These events can be responsible for a significant
CPU churn as ovs-vswitchd attempts to do plenty of
work for each and every netlink message regardless
of what that message carries. On low-end devices
such as consumer-grade routers these can lead to a
lot of CPU cycles being wasted, adding up to heat
output and reducing performance.

It could be argued that wireless drivers in
question should be fixed, but that isn't exactly a
trivial thing to do. Patching ovs seems far more
viable while still making sense.

This change requires a slight tweak to the netlink
abstraction code so that the events can be
discarded as soon as possible before getting a
chance to cause the churn.

Signed-off-by: Michal Kazior 


Hi Michal,

The patch looks fine to me.  Applies cleanly to master
and did not cause any regressions in 'make check'.

I'm curious since I don't ever use OVS with wireless ports
if there is somewhere this issue has been reported.  If
so then we should add it to the commit message.

Otherwise,

Acked-by: Greg Rose 


---

Notes:
 v2:
  - fix bracing style [0day robot / checkpatch]

  lib/netdev-linux.c |  4 ++--
  lib/netlink-notifier.c |  4 +++-
  lib/rtnetlink.c| 24 +---
  lib/rtnetlink.h|  2 +-
  4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 6be23dbee..301dd9060 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -742,7 +742,7 @@ netdev_linux_run(const struct netdev_class *netdev_class 
OVS_UNUSED)
  if (!error) {
  struct rtnetlink_change change;
  
-if (rtnetlink_parse(&buf, &change)) {

+if (rtnetlink_parse(&buf, &change) > 0) {
  struct netdev *netdev_ = NULL;
  char dev_name[IFNAMSIZ];
  
@@ -6343,7 +6343,7 @@ netdev_linux_update_via_netlink(struct netdev_linux *netdev)

  return error;
  }
  
-if (rtnetlink_parse(reply, change)

+if (rtnetlink_parse(reply, change) > 0
  && change->nlmsg_type == RTM_NEWLINK) {
  bool changed = false;
  error = 0;
diff --git a/lib/netlink-notifier.c b/lib/netlink-notifier.c
index dfecb9778..ab5a84abb 100644
--- a/lib/netlink-notifier.c
+++ b/lib/netlink-notifier.c
@@ -189,8 +189,10 @@ nln_run(struct nln *nln)
  if (!error) {
  int group = nln->parse(&buf, nln->change);
  
-if (group != 0) {

+if (group > 0) {
  nln_report(nln, nln->change, group);
+} else if (group == 0) {
+/* ignore some events */
  } else {
  VLOG_WARN_RL(, "unexpected netlink message contents");
  nln_report(nln, NULL, 0);
diff --git a/lib/rtnetlink.c b/lib/rtnetlink.c
index 125802925..316524c0f 100644
--- a/lib/rtnetlink.c
+++ b/lib/rtnetlink.c
@@ -82,7 +82,7 @@ rtnetlink_parse_link_info(const struct nlattr *nla,
  /* Parses a rtnetlink message 'buf' into 'change'.  If 'buf' is unparseable,
   * leaves 'change' untouched and returns false.  Otherwise, populates 'change'
   * and returns true. */
-bool
+int
  rtnetlink_parse(struct ofpbuf *buf, struct rtnetlink_change *change)
  {
  const struct nlmsghdr *nlmsg = buf->data;
@@ -99,6 +99,7 @@ rtnetlink_parse(struct ofpbuf *buf, struct rtnetlink_change 
*change)
  [IFLA_MTU]= { .type = NL_A_U32,.optional = true },
  [IFLA_ADDRESS] = { .type = NL_A_UNSPEC, .optional = true },
  [IFLA_LINKINFO] = { .type = NL_A_NESTED, .optional = true },
+[IFLA_WIRELESS] = { .type = NL_A_UNSPEC, .optional = true },
  };
  
  struct nlattr *attrs[ARRAY_SIZE(policy)];

@@ -111,6 +112,23 @@ rtnetlink_parse(struct ofpbuf *buf, struct 
rtnetlink_change *change)
  
  ifinfo = ofpbuf_at(buf, NLMSG_HDRLEN, sizeof *ifinfo);
  
+/* Wireless events can be spammy and cause a

+ * lot of unnecessary churn and CPU load in
+ * ovs-vswitchd. The best way to filter them out
+ * is to rely on the IFLA_WIRELESS and
+ * ifi_change. As per rtnetlink_ifinfo_prep() in
+ * the kernel, the ifi_change = 0. That combined
+ * with the fact wireless events never really
+ * change 

Re: [ovs-dev] [PATCH] netdev-afxdp: Add start qid support.

2021-02-08 Thread William Tu
On Mon, Feb 8, 2021 at 4:58 AM Ilya Maximets  wrote:
>
> On 2/7/21 5:05 PM, Toshiaki Makita wrote:
> > On 2021/02/07 2:00, William Tu wrote:
> >> On Fri, Feb 5, 2021 at 1:08 PM Gregory Rose  wrote:
> >>> On 2/4/2021 7:08 PM, William Tu wrote:
>  On Thu, Feb 4, 2021 at 3:17 PM Gregory Rose  wrote:
> > On 2/3/2021 1:21 PM, William Tu wrote:
> >> Mellanox card has different XSK design. It requires users to create
> >> dedicated queues for XSK. Unlike Intel's NIC which loads XDP program
> >> to all queues, Mellanox only loads XDP program to a subset of its 
> >> queues.
> >>
> >> When OVS uses AF_XDP with mlx5, it doesn't replace the existing RX and 
> >> TX
> >> queues in the channel with XSK RX and XSK TX queues, but it creates an
> >> additional pair of queues for XSK in that channel. To distinguish
> >> regular and XSK queues, mlx5 uses a different range of qids.
> >> That means, if the card has 24 queues, queues 0..11 correspond to
> >> regular queues, and queues 12..23 are XSK queues.
> >> In this case, we should attach the netdev-afxdp with 'start-qid=12'.
> >>
> >> I tested using Mellanox Connect-X 6Dx, by setting 'start-qid=1', and:
> >>  $ ethtool -L enp2s0f0np0 combined 1
> >>  # queue 0 is for non-XDP traffic, queue 1 is for XSK
> >>  $ ethtool -N enp2s0f0np0 flow-type udp4 action 1
> >> note: we need additionally add flow-redirect rule to queue 1
> >
> > Seems awfully hardware dependent.  Is this just for Mellanox or does
> > it have general usefulness?
> >
>  It is just Mellanox's design which requires pre-configure the 
>  flow-director.
>  I only have cards from Intel and Mellanox so I don't know about other 
>  vendors.
> 
>  Thanks,
>  William
> 
> >>>
> >>> I think we need to abstract the HW layer a little bit.  This start-qid
> >>> option is specific to a single piece of HW, at least at this point.
> >>> We should expect that further HW  specific requirements for
> >>> different NIC vendors will come up in the future.  I suggest
> >>> adding a hw_options:mellanox:start-qid type hierarchy  so that
> >>> as new HW requirements come up we can easily scale.  It will
> >>> also make adding new vendors easier in the future.
> >>>
> >>> Even with NIC vendors you can't always count on each new generation
> >>> design to always keep old requirements and methods for feature
> >>> enablement.
> >>>
> >>> What do you think?
> >>>
> >> Thanks for the feedback.
> >> So far I don't know whether other vendors will need this option or not.
> >
> > FWIU, this api "The lower half of the available amount of RX queues are 
> > regular queues, and the upper half are XSK RX queues." is the result of 
> > long discussion to support dedicated/isolated XSK rings, which is not meant 
> > for a mellanox-specific feature.
> >
> > https://patchwork.ozlabs.org/project/netdev/cover/20190524093431.20887-1-maxi...@mellanox.com/
> > https://patchwork.ozlabs.org/project/netdev/cover/20190612155605.22450-1-maxi...@mellanox.com/
> >
> > Toshiaki Makita
>
> Thanks for the links.  Very helpful.
>
> From what I understand lower half of queues should still work, i.e.
> it should still be possible to attach AF_XDP socket to them.  But
> they will not work in zero-copy mode ("generic" only?).
> William, could you check that?  Does it work and with which mode
> "best-effort" ends up with?  And what kind of errors libbpf returns
> if we're trying to enable zero-copy?

Thanks for your feedback.
Yes, only zero-copy mode needs to be aware of this, meaning zero-copy
mode has to use the upper half of the queues (the start-qid option here).
Native mode and SKB mode works OK on upper and lower queues.
When attaching zc XSK to lower half queue, libbpf returns EINVAL at
xsk_socket__create().

William


Re: [ovs-dev] [PATCH v2 ovn 00/10] ovn-controller: Make lflow cache size configurable.

2021-02-08 Thread Mark Michelson

Hi Dumitru,

Thanks for the v2. I had one small nit on patch 6. Aside from that (and 
Numan's comments), looks good.


Acked-by: Mark Michelson 

On 2/4/21 8:24 AM, Dumitru Ceara wrote:

Scale tests have identified the lflow cache to be one of the main memory
consumers in ovn-controller.  This series refactors the lflow cache code
and adds configuration knobs to limit the size (in lines and/or memory)
of the cache.

Patches 1 and 6 fix issues with the already existing lflow cache code.
Even though patch 6 is a bug fix, it's easier to add it later in the
series because it uses the new lflow cache statistics (from patch 4)
to add a unit test that exercises the buggy scenario.
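
As a quick sketch, once the series is applied the cache can be inspected with
the command added in patch 4 (name taken from the patch title):

   $ ovn-appctl -t ovn-controller lflow-cache/show-stats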

Changes in v2:
- Added two bug fixes for already existing problems (patches 1 and 6).
- Added unit tests as requested by Mark.
- Added support for evicting "less important" entries when the cache
   limit is reached.
- Improved cache entries memory accounting.

Dumitru Ceara (10):
   lflow: Fix cache update when I-P engine aborts.
   lflow: Refactor convert_match_to_expr() to explicitly consume prereqs.
   lflow-cache: Move the lflow cache to its own module.
   lflow-cache: Add lflow-cache/show-stats command.
   lflow-cache: Add unit tests.
   lflow: Do not cache non-conjunctive flows that use address 
sets/portgroups.
   lflow-cache: Add coverage counters.
   lflow-cache: Reclaim heap memory after cache flush.
   lflow-cache: Make maximum number of cache entries configurable.
   lflow-cache: Make max cache memory usage configurable.


  NEWS|5
  configure.ac|1
  controller/automake.mk  |2
  controller/chassis.c|   44 
  controller/lflow-cache.c|  363 +++
  controller/lflow-cache.h|   89 +
  controller/lflow.c  |  376 +---
  controller/lflow.h  |   10 -
  controller/ovn-controller.8.xml |   23 ++
  controller/ovn-controller.c |  114 ---
  controller/test-lflow-cache.c   |  239 +++
  controller/test-ofctrl-seqno.c  |   18 --
  include/ovn/expr.h  |3
  lib/expr.c  |   43 
  tests/automake.mk   |8 +
  tests/ovn-lflow-cache.at|  405 +++
  tests/ovn.at|   82 
  tests/test-utils.c  |   49 +
  tests/test-utils.h  |   26 +++
  tests/testsuite.at  |1
  20 files changed, 1593 insertions(+), 308 deletions(-)
  create mode 100644 controller/lflow-cache.c
  create mode 100644 controller/lflow-cache.h
  create mode 100644 controller/test-lflow-cache.c
  create mode 100644 tests/ovn-lflow-cache.at
  create mode 100644 tests/test-utils.c
  create mode 100644 tests/test-utils.h



Re: [ovs-dev] [PATCH ovn v2 06/10] lflow: Do not cache non-conjunctive flows that use address sets/portgroups.

2021-02-08 Thread Mark Michelson

Hi Dumitru, just a small nit below

On 2/4/21 8:25 AM, Dumitru Ceara wrote:

Caching the conjunction id offset for flows that refer to address sets
and/or port groups but do not currently generate conjunctive matches,
e.g., because the port group has only one locally bound port, is wrong.

If ports are later added to the port group and/or addresses are added to
the address set, this flow will be reconsidered but at this point will
generate conjunctive matches.  We cannot use the cached conjunction id
offset because it might create collisions with other conjunction ids
created for unrelated flows.

Fixes: 1213bc827040 ("ovn-controller: Cache logical flow expr matches.")
Signed-off-by: Dumitru Ceara 
---
  controller/lflow.c |2 +-
  tests/ovn.at   |   12 
  2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/controller/lflow.c b/controller/lflow.c
index 3e3605a..c493652 100644
--- a/controller/lflow.c
+++ b/controller/lflow.c
@@ -884,7 +884,7 @@ consider_logical_flow__(const struct sbrec_logical_flow 
*lflow,
  lflow_cache_add_expr(l_ctx_out->lflow_cache, lflow,
   conj_id_ofs, cached_expr);
  cached_expr = NULL;
-} else {
+} else if (n_conjs) {
  lflow_cache_add_conj_id(l_ctx_out->lflow_cache, lflow,
  conj_id_ofs);
  }
diff --git a/tests/ovn.at b/tests/ovn.at
index 0e114cf..4bb92d6 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -22300,6 +22300,18 @@ AT_CHECK([test "$(($conj_id_cnt + 2))" = "$(get_cache_count 
cache-conj-id)"], [0
  AT_CHECK([test "$expr_cnt" = "$(get_cache_count cache-expr)"], [0], [])
  AT_CHECK([test "$matches_cnt" = "$(get_cache_count cache-matches)"], [0], [])
  
+AS_BOX([Check no caching caching for non-conjunctive port group/address set matches])


s/caching caching/caching/


+conj_id_cnt=$(get_cache_count cache-conj-id)
+expr_cnt=$(get_cache_count cache-expr)
+matches_cnt=$(get_cache_count cache-matches)
+
+check ovn-nbctl acl-add ls1 from-lport 1 'inport == @pg2 && outport == @pg2 && 
is_chassis_resident("lsp1")' drop
+check ovn-nbctl --wait=hv sync
+
+AT_CHECK([test "$conj_id_cnt" = "$(get_cache_count cache-conj-id)"], [0], [])
+AT_CHECK([test "$expr_cnt" = "$(get_cache_count cache-expr)"], [0], [])
+AT_CHECK([test "$matches_cnt" = "$(get_cache_count cache-matches)"], [0], [])
+
  OVN_CLEANUP([hv1])
  AT_CLEANUP
  




Re: [ovs-dev] [PATCH ovn] northd: improve OVN BFD documentation

2021-02-08 Thread Ben Pfaff
On Mon, Feb 08, 2021 at 11:49:03AM +0100, Lorenzo Bianconi wrote:
> Signed-off-by: Lorenzo Bianconi 

Thanks!

Acked-by: Ben Pfaff 

I suggest folding in the following as well to make it clearer the roles
of the different groups of columns.

diff --git a/ovn-nb.xml b/ovn-nb.xml
index 58083f101445..cd297a12abcb 100644
--- a/ovn-nb.xml
+++ b/ovn-nb.xml
@@ -3763,6 +3763,10 @@
 
 
 
+  
+ovn-northd reads configuration from these columns.
+  
+
   
 OVN logical port when BFD engine is running.
   
@@ -3802,6 +3806,10 @@
 
 
 
+  
+ovn-northd writes BFD status into these columns.
+  
+
   
 
   BFD port logical states. Possible values are:




[ovs-dev] Vhost Server Ports future plans and container networking

2021-02-08 Thread Marc Magnuson via dev
Hi!

I would like to know what the plans are for going forward with vhost server 
ports. Judging from the documentation, it seems they will be deprecated and 
removed at a future point in time.

This presents a problem for container networking today. The container 
networking solution suggested by DPDK leverages VIRTIO PMD Virtual Devices 
linking up on a socket being hosted by the vhost server. This is the vhost-user 
port we have today in OVS. The vhost-user-client relies on QEMU or an external 
process to create the socket and then a VIRTIO PMD to link up with it for 
communications. This is not ideal for container networking where it is better 
to have OVS start this server as we've seen so far. This also unifies the 
socket creation mechanism to OVS rather than having it be distribution specific.
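
For reference, a minimal sketch of the two existing OVS-DPDK port types being
compared here (the bridge, port names and socket path are illustrative):

   # OVS creates and hosts the socket (vhost-user server):
   $ ovs-vsctl add-port br0 vhu0 -- set Interface vhu0 type=dpdkvhostuser
   # QEMU or another process creates the socket; OVS acts as the client:
   $ ovs-vsctl add-port br0 vhuc0 -- set Interface vhuc0 \
       type=dpdkvhostuserclient options:vhost-server-path=/tmp/vhuc0.sock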

The vhost-user-client does not directly interact with DPDK's vhost-server 
option (beyond registration of its existence). It registers with the socket and 
the two acknowledge each other, but no initialization is done to bring the link 
up. This requires a virtio PMD vdev to initialize the vhost socket correctly 
(it is the only component that actually sends messages during initialization; 
vhost-user-client does not do this), as we've seen so far.
For more info on this, check drivers/net/virtio/virtio_user_ethdev.c for 
virtio_user_dev_init()
Or virtio_user_dev_init definition at 
drivers/net/virtio/virtio_user/virtio_user_dev.c
The above is the actual interaction between a VIRTIO PMD and Vhost server, this 
is the only location I see this invoked where everything is set and the 
'add_device' callback is actually setup on the vhost side.

IPC use cases are not addressed by the current plans going forward for vhost 
(from what we have seen). How does the OVS team plan to maintain support for IPC 
and container networking?

We use vhost-server ovs ports with virtio vdevs on the containers. If the 
current plan is still to deprecate this method, then our next plan will be to 
use Virtio PMD Vdev ports on OVS with an external listener to reset/add/delete 
the vdev port when the socket goes down and when it is created again. This 
method is in fact better for keeping uptime on the containers if OVS goes down, 
for the same reason vhost-user-client is better than vhost-user for QEMU 
environments.



In short, vhost-user-client does not work with vhost-user server on DPDK for 
IPC applications as things stand now.  Vhost-user server is still the 
predominant choice for container networking with Virtio Vdevs. Any feedback or 
suggestions are welcome!

If there are other (better) methods to support IPC/container networking with 
OVS, please let us know. (Looks like DPDK Rings were removed some time back as 
well)

Thanks for your time and help!
Regards,
Marc Magnuson


Re: [ovs-dev] [PATCH] netdev-afxdp: Add start qid support.

2021-02-08 Thread Gregory Rose



On 2/6/2021 9:00 AM, William Tu wrote:

On Fri, Feb 5, 2021 at 1:08 PM Gregory Rose  wrote:




On 2/4/2021 7:08 PM, William Tu wrote:

On Thu, Feb 4, 2021 at 3:17 PM Gregory Rose  wrote:




On 2/3/2021 1:21 PM, William Tu wrote:

Mellanox card has different XSK design. It requires users to create
dedicated queues for XSK. Unlike Intel's NIC which loads XDP program
to all queues, Mellanox only loads the XDP program to a subset of its queues.

When OVS uses AF_XDP with mlx5, it doesn't replace the existing RX and TX
queues in the channel with XSK RX and XSK TX queues, but it creates an
additional pair of queues for XSK in that channel. To distinguish
regular and XSK queues, mlx5 uses a different range of qids.
That means, if the card has 24 queues, queues 0..11 correspond to
regular queues, and queues 12..23 are XSK queues.
In this case, we should attach the netdev-afxdp with 'start-qid=12'.

I tested using Mellanox Connect-X 6Dx, by setting 'start-qid=1', and:
 $ ethtool -L enp2s0f0np0 combined 1
 # queue 0 is for non-XDP traffic, queue 1 is for XSK
 $ ethtool -N enp2s0f0np0 flow-type udp4 action 1
note: we need additionally add flow-redirect rule to queue 1


Seems awfully hardware dependent.  Is this just for Mellanox or does
it have general usefulness?


It is just Mellanox's design which requires pre-configure the flow-director.
I only have cards from Intel and Mellanox so I don't know about other vendors.

Thanks,
William



I think we need to abstract the HW layer a little bit.  This start-qid
option is specific to a single piece of HW, at least at this point.
We should expect that further HW  specific requirements for
different NIC vendors will come up in the future.  I suggest
adding a hw_options:mellanox:start-qid type hierarchy  so that
as new HW requirements come up we can easily scale.  It will
also make adding new vendors easier in the future.

Even with NIC vendors you can't always count on each new generation
design to always keep old requirements and methods for feature
enablement.

What do you think?


Thanks for the feedback.
So far I don't know whether other vendors will need this option or not.
I think adding another "hw_options" is a little confusing because this
is already an option on the device.
Looking at AF_XDP driver at DPDK, it also has similar option:
see start_queue
https://doc.dpdk.org/guides/nics/af_xdp.html
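
A sketch of how the proposed option would be attached, mirroring the mlx5
setup from the commit message (the 'start-qid' option key is the one this
patch proposes):

   $ ethtool -L enp2s0f0np0 combined 1
   $ ovs-vsctl add-port br-int enp2s0f0np0 -- set interface enp2s0f0np0 \
       type=afxdp options:start-qid=1
   $ ethtool -N enp2s0f0np0 flow-type udp4 action 1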



OK, makes sense in this context then.

Thanks,

- Greg


Re: [ovs-dev] Re: Re: [PATCH V3 2/4] Add GSO support for DPDK data path

2021-02-08 Thread William Tu
On Mon, Feb 8, 2021 at 6:36 AM Ilya Maximets  wrote:
>
> On 2/8/21 1:56 AM, Yi Yang (杨燚)-云服务集团 wrote:
> > Yes, GSO is OK, but GRO may have that issue. I didn't see that issue in 
> > my OpenStack environment, so it would be great if we could have a test 
> > case to trigger it.
> >
> > -Original Message-
> > From: William Tu [mailto:u9012...@gmail.com]
> > Sent: February 7, 2021 23:46
> > To: Yi Yang (杨燚)-云服务集团 
> > Cc: i.maxim...@ovn.org; yang_y...@163.com; ovs-dev@openvswitch.org; 
> > f...@sysclose.org
> > Subject: Re: [ovs-dev] Re: [PATCH V3 2/4] Add GSO support for DPDK data path
> >
> > On Tue, Oct 27, 2020 at 6:02 PM Yi Yang (杨燚)-云服务集团  
> > wrote:
> >>
> >> -Original Message-
> >> From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Ilya Maximets
> >> Sent: October 27, 2020 21:03
> >> To: yang_y...@163.com; ovs-dev@openvswitch.org
> >> Cc: f...@sysclose.org; i.maxim...@ovn.org
> >> Subject: Re: [ovs-dev] [PATCH V3 2/4] Add GSO support for DPDK data path
> >>
> >> On 8/7/20 12:56 PM, yang_y...@163.com wrote:
> >>> From: Yi Yang 
> >>>
> >>> GSO(Generic Segment Offload) can segment large UDP  and TCP packet
> >>> to small packets per MTU of destination , especially for the case
> >>> that physical NIC can't do hardware offload VXLAN TSO and VXLAN UFO,
> >>> GSO can make sure userspace TSO can still work but not drop.
> >>>
> >>> In addition, GSO can help improve UDP performane when UFO is enabled
> >>> in VM.
> >>>
> >>> GSO can support TCP, UDP, VXLAN TCP, VXLAN UDP, it is done in Tx
> >>> function of physical NIC.
> >>>
> >>> Signed-off-by: Yi Yang 
> >>> ---
> >>>  lib/dp-packet.h|  21 +++-
> >>>  lib/netdev-dpdk.c  | 358
> >>> +
> >>>  lib/netdev-linux.c |  17 ++-
> >>>  lib/netdev.c   |  67 +++---
> >>>  4 files changed, 417 insertions(+), 46 deletions(-)
> >
> > snip
> >
> >>>
> >>> @@ -2339,24 +2428,19 @@ netdev_dpdk_prep_hwol_batch(struct netdev_dpdk 
> >>> *dev, struct rte_mbuf **pkts,
> >>>  return cnt;
> >>>  }
> >>>
> >>> -/* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes
> >>> ownership of
> >>> - * 'pkts', even in case of failure.
> >>> - *
> >>> - * Returns the number of packets that weren't transmitted. */
> >>> static inline int -netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int 
> >>> qid,
> >>> - struct rte_mbuf **pkts, int cnt)
> >>> +__netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> >>> +   struct rte_mbuf **pkts, int cnt)
> >>>  {
> >>>  uint32_t nb_tx = 0;
> >>> -uint16_t nb_tx_prep = cnt;
> >>> +uint32_t nb_tx_prep;
> >>>
> >>> -if (userspace_tso_enabled()) {
> >>> -nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> >>> -if (nb_tx_prep != cnt) {
> >>> -VLOG_WARN_RL(&rl, "%s: Output batch contains invalid 
> >>> packets. "
> >>> - "Only %u/%u are valid: %s", dev->up.name, 
> >>> nb_tx_prep,
> >>> - cnt, rte_strerror(rte_errno));
> >>> -}
> >>> +nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> >>> +if (nb_tx_prep != cnt) {
> >>> +VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> >>> +  "Only %u/%u are valid: %s",
> >>> + dev->up.name, nb_tx_prep,
> >>> + cnt, rte_strerror(rte_errno));
> >>>  }
> >>>
> >>>  while (nb_tx != nb_tx_prep) {
> >>> @@ -2384,6 +2468,200 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, 
> >>> int qid,
> >>>  return cnt - nb_tx;
> >>>  }
> >>>
> >>> +static inline void
> >>> +set_multiseg_udptcp_cksum(struct rte_mbuf *mbuf)
> >>
> >> I didn't review the patch, only had a quick glance, but this part bothers 
> >> me.  OVS doesn't support multi-segment mbufs, so it should not be possible 
> >> for such mbufs being transmitted by OVS.  So, I do not understand why this 
> >> function needs to work with such mbufs.
> >>
> >> [Yi Yang] Only DPDK driver/Tx function will use it, not OVS, 
> >> set_multiseg_udptcp_cksum is called in GSO part, it is last step before Tx 
> >> function, it is a big external mbuf before rte_gso_segment, that isn't a 
> >> multi-segmented mbuf.
> >>
> >
> > Hi Ilya,
> >
> > Now I understand Yi Yang's point better and I agree with him.
> > Looks like the patch does the GSO at the DPDK TX function.
> > It creates multi-seg mbuf after rte_gso_segment(), but will immediately 
> > send out the multi-seg mbuf to DPDK port, without traversing inside other 
> > part of OVS code. I guess this case it should work OK?
>
> Hi.  GSO itself should be possible to implement, but we should
> not enable support for multi-segment mbufs in the NIC, since this
> might end up in multi-segment packets being received by OVS
> causing lots of trouble (for example dp_packet_clone() uses simple
> memcpy() that will result in invalid memory reads.)

I see your point, thanks!
I guess you're saying 

Re: [ovs-dev] Re: Re: [PATCH] netdev-dpdk: fix incorrect shinfo initialization

2021-02-08 Thread William Tu
On Mon, Feb 8, 2021 at 8:57 AM Ilya Maximets  wrote:
>
> On 2/6/21 5:15 PM, William Tu wrote:
> > On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团  
> > wrote:
> >>
> >> Thanks Ilya, net_tap PMD is handling tap device on host side, so it can 
> >> leverage vnet header to do TSO/GSO, maybe net_pmd authors don't know how 
> >> to do this, from source code, tap fd isn't enabled vnet header and TSO.
> >>
> > thanks, learned a lot from these discussions.
> >
> > I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
> > Do you guys think it makes sense to add TSO at dpdk net_tap?
> > Or simply using the current OVS's userspace-enable-tso on tap/veth is
> > good enough?
> > (using type=system, not using dpdk port type on tap/veth.)
> >
> > Regards,
> > William
> >
>
> I didn't benchmark all types of interfaces, but I'd say that, if you
> need more or less high performance solution for userspace<->kernel
> communication, you should, probably, take a look at virtio-user
> ports with vhost kernel backend:
>   https://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html
> This should be the fastest and also feature-rich solution.
Thanks! I will give it a try.
>
> Tap devices are not designed for high performance in general,
> so I'd not suggest any of them for highly loaded ports.
> If it's only for some small management traffic, it should be fine
> to just use netdev-linux implementation.

That's what I thought until Flavio enabled vnet header support.
>
> netdev-afxdp with pmd or non-pmd modes on a veth devices is another
> (potentially high performance) solution.

When testing intra-host container-to-container performance,
the tap device becomes much faster than netdev-afxdp, especially with iperf TCP,
mostly due to the vnet header's TSO and csum offload features.
It's a big limitation of XDP frames that they can't carry large buffers or
partial csum information.

I've reached the conclusion that, for intra-host container-to-container
TCP performance, the configurations rank from fastest to slowest (ns: namespace):
0) dpdk vhostuser in ns0 -> vhostuser - OVS userspace
(But requires TCP in userspace and application modification)
1) veth0 in ns0 -> veth with TSO - OVS kernel module - veth with TSO
-> veth1 in ns1
2) tap0 in ns0 -> virtio_user - OVS userspace - virtio_user -> tap1 in ns1
3) tap0 in ns0 -> recv_tap - OVS with userspace-tso - tap_batch_send
-> tap1 in ns1
4) veth0 in ns0 -> af_packet sock - OVS with userspace-tso -
af_packet_sock -> veth1 in ns1
5) veth0 in ns0 -> netdev-afxdp - OVS - netdev-afxdp -> veth1 in ns1

I also tested Toshiaki's XDP offload patch,
https://www.mail-archive.com/ovs-dev@openvswitch.org/msg45930.html
I would guess it's between 2 to 4.

Regards,
William


Re: [ovs-dev] Re: Re: [PATCH] netdev-dpdk: fix incorrect shinfo initialization

2021-02-08 Thread Ilya Maximets
On 2/6/21 5:15 PM, William Tu wrote:
> On Mon, Feb 1, 2021 at 5:48 PM Yi Yang (杨燚)-云服务集团  wrote:
>>
>> Thanks Ilya, net_tap PMD is handling tap device on host side, so it can 
>> leverage vnet header to do TSO/GSO, maybe net_pmd authors don't know how to 
>> do this, from source code, tap fd isn't enabled vnet header and TSO.
>>
> thanks, learned a lot from these discussions.
> 
> I looked at the DPDK net_tap and indeed it doesn't support virtio net hdr.
> Do you guys think it makes sense to add TSO at dpdk net_tap?
> Or simply using the current OVS's userspace-enable-tso on tap/veth is
> good enough?
> (using type=system, not using dpdk port type on tap/veth.)
> 
> Regards,
> William
> 

I didn't benchmark all types of interfaces, but I'd say that, if you
need more or less high performance solution for userspace<->kernel
communication, you should, probably, take a look at virtio-user
ports with vhost kernel backend:
  https://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html
This should be the fastest and also feature-rich solution.
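
A rough sketch of what such a port could look like in OVS, following the
linked DPDK howto (the vdev arguments are illustrative and may need adjusting
for the DPDK version in use):

   $ ovs-vsctl add-port br0 virtio_user0 -- set Interface virtio_user0 \
       type=dpdk options:dpdk-devargs=virtio_user0,path=/dev/vhost-net,queue_size=1024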

Tap devices are not designed for high performance in general,
so I'd not suggest any of them for highly loaded ports.
If it's only for some small management traffic, it should be fine
to just use netdev-linux implementation.

netdev-afxdp with pmd or non-pmd modes on a veth devices is another
(potentially high performance) solution.

Best regards, Ilya Maximets.


Re: [ovs-dev] [PATCH 00/15] Netdev vxlan-decap offload

2021-02-08 Thread Sriharsha Basavapatna via dev
On Mon, Feb 8, 2021 at 7:33 PM Eli Britstein  wrote:
>
>
> On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
> > On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein  wrote:
> >>
> >> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
> >>> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
> >>>  wrote:
>  On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein  wrote:
> > VXLAN decap in OVS-DPDK configuration consists of two flows:
> > F1: in_port(ens1f0),eth(),ipv4(),udp(), actions:tnl_pop(vxlan_sys_4789)
> > F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0
> >
> > F1 is a classification flow. It has outer headers matches and it
> > classifies the packet as a VXLAN packet, and using tnl_pop action the
> > packet continues processing in F2.
> > F2 is a flow that has matches on tunnel metadata as well as on the inner
> > packet headers (as any other flow).
> >
> > In order to fully offload VXLAN decap path, both F1 and F2 should be
> > offloaded. As there are more than one flow in HW, it is possible that
> > F1 is done by HW but F2 is not. Packet is received by SW, and should be
> > processed starting from F2 as F1 was already done by HW.
> > Rte_flows are applicable only on physical port IDs. Vport flows (e.g. 
> > F2)
> > are applied on uplink ports attached to OVS.
> >
> > This patch-set makes use of [1] introduced in DPDK 20.11, that adds API
> > for tunnel offloads.
> >
> > Travis:
> > v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552
> >
> > GitHub Actions:
> > v1: https://github.com/elibritstein/OVS/actions/runs/515334647
> >
> > [1] https://mails.dpdk.org/archives/dev/2020-October/187314.html
> >
> > Eli Britstein (13):
> > netdev-offload: Add HW miss packet state recover API
> > netdev-dpdk: Introduce DPDK tunnel APIs
> > netdev-offload-dpdk: Implement flow dump create/destroy APIs
> > netdev-dpdk: Add flow_api support for netdev vxlan vports
> > netdev-offload-dpdk: Implement HW miss packet recover for vport
> > dpif-netdev: Add HW miss packet state recover logic
> > netdev-offload-dpdk: Change log rate limits
> > netdev-offload-dpdk: Support tunnel pop action
> > netdev-offload-dpdk: Refactor offload rule creation
> > netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
> >   port
> > netdev-offload-dpdk: Map netdev and ufid to offload objects
> > netdev-offload-dpdk: Support vports flows offload
> > netdev-dpdk-offload: Add vxlan pattern matching function
> >
> > Ilya Maximets (2):
> > netdev-offload: Allow offloading to netdev without ifindex.
> > netdev-offload: Disallow offloading to unrelated tunneling vports.
> >
> >Documentation/howto/dpdk.rst  |   1 +
> >NEWS  |   2 +
> >lib/dpif-netdev.c |  49 +-
> >lib/netdev-dpdk.c | 135 ++
> >lib/netdev-dpdk.h | 104 -
> >lib/netdev-offload-dpdk.c | 851 
> > +-
> >lib/netdev-offload-provider.h |   5 +
> >lib/netdev-offload-tc.c   |   8 +
> >lib/netdev-offload.c  |  29 +-
> >lib/netdev-offload.h  |   1 +
> >10 files changed, 1033 insertions(+), 152 deletions(-)
> >
> > --
> > 2.28.0.546.g385c171
> >
>  Hi Eli,
> 
>  Thanks for posting this new patchset to support tunnel decap action 
>  offload.
> 
>  I haven't looked at the entire patchset yet. But I focused on the
>  patches that introduce 1-to-many mapping between an OVS flow (f2) and
>  HW offloaded flows.
> 
>  Here is a representation of the design proposed in this patchset. A
>  flow f2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
>  the underlying uplink/physical port is P0, gets offloaded to not only
>  P0, but also to other physical ports P1, P2... and so on.
> 
>    P0 <> VxLAN-vPort <> VFRep-1
> 
>    P1
>    P2
>    ...
>    Pn
> 
>  IMO, the problems with this design are:
> 
>  - Offloading a flow to an unrelated physical device that has nothing
>  to do with that flow (invalid device for the flow).
>  - Offloading to not just one, but several such invalid physical devices.
>  - Consuming HW resources for a flow that is never seen or intended to
>  be processed by those physical devices.
>  - Impacts flow scale on other physical devices, since it would consume
>  their HW resources with a large number of such invalid flows.
>  - The indirect list used to track these multiple mappings complicates
>  the offload layer implementation.
>  - The addition of flow_dump_create() to offload APIs, just to 

Re: [ovs-dev] Re: Re: [PATCH V3 2/4] Add GSO support for DPDK data path

2021-02-08 Thread Ilya Maximets
On 2/8/21 1:56 AM, Yi Yang (杨燚)-云服务集团 wrote:
> Yes, GSO is OK, but GRO may have that issue. I didn't see that issue in 
> my OpenStack environment, so it would be great if we could have a test 
> case to trigger it.
> 
> -Original Message-
> From: William Tu [mailto:u9012...@gmail.com] 
> Sent: February 7, 2021 23:46
> To: Yi Yang (杨燚)-云服务集团 
> Cc: i.maxim...@ovn.org; yang_y...@163.com; ovs-dev@openvswitch.org; 
> f...@sysclose.org
> Subject: Re: [ovs-dev] Re: [PATCH V3 2/4] Add GSO support for DPDK data path
> 
> On Tue, Oct 27, 2020 at 6:02 PM Yi Yang (杨燚)-云服务集团  
> wrote:
>>
>> -Original Message-
>> From: dev [mailto:ovs-dev-boun...@openvswitch.org] On Behalf Of Ilya Maximets
>> Sent: October 27, 2020 21:03
>> To: yang_y...@163.com; ovs-dev@openvswitch.org
>> Cc: f...@sysclose.org; i.maxim...@ovn.org
>> Subject: Re: [ovs-dev] [PATCH V3 2/4] Add GSO support for DPDK data path
>>
>> On 8/7/20 12:56 PM, yang_y...@163.com wrote:
>>> From: Yi Yang 
>>>
>>> GSO(Generic Segment Offload) can segment large UDP  and TCP packet 
>>> to small packets per MTU of destination , especially for the case 
>>> that physical NIC can't do hardware offload VXLAN TSO and VXLAN UFO, 
>>> GSO can make sure userspace TSO can still work but not drop.
>>>
>>> In addition, GSO can help improve UDP performane when UFO is enabled 
>>> in VM.
>>>
>>> GSO can support TCP, UDP, VXLAN TCP, VXLAN UDP, it is done in Tx 
>>> function of physical NIC.
>>>
>>> Signed-off-by: Yi Yang 
>>> ---
>>>  lib/dp-packet.h|  21 +++-
>>>  lib/netdev-dpdk.c  | 358
>>> +
>>>  lib/netdev-linux.c |  17 ++-
>>>  lib/netdev.c   |  67 +++---
>>>  4 files changed, 417 insertions(+), 46 deletions(-)
> 
> snip
> 
>>>
>>> @@ -2339,24 +2428,19 @@ netdev_dpdk_prep_hwol_batch(struct netdev_dpdk 
>>> *dev, struct rte_mbuf **pkts,
>>>  return cnt;
>>>  }
>>>
>>> -/* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes 
>>> ownership of
>>> - * 'pkts', even in case of failure.
>>> - *
>>> - * Returns the number of packets that weren't transmitted. */  
>>> static inline int -netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int 
>>> qid,
>>> - struct rte_mbuf **pkts, int cnt)
>>> +__netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
>>> +   struct rte_mbuf **pkts, int cnt)
>>>  {
>>>  uint32_t nb_tx = 0;
>>> -uint16_t nb_tx_prep = cnt;
>>> +uint32_t nb_tx_prep;
>>>
>>> -if (userspace_tso_enabled()) {
>>> -nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
>>> -if (nb_tx_prep != cnt) {
>>> -VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
>>> - "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
>>> - cnt, rte_strerror(rte_errno));
>>> -}
>>> +nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
>>> +if (nb_tx_prep != cnt) {
>>> +VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
>>> +  "Only %u/%u are valid: %s",
>>> + dev->up.name, nb_tx_prep,
>>> + cnt, rte_strerror(rte_errno));
>>>  }
>>>
>>>  while (nb_tx != nb_tx_prep) {
>>> @@ -2384,6 +2468,200 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, 
>>> int qid,
>>>  return cnt - nb_tx;
>>>  }
>>>
>>> +static inline void
>>> +set_multiseg_udptcp_cksum(struct rte_mbuf *mbuf)
>>
>> I didn't review the patch, only had a quick glance, but this part bothers 
>> me.  OVS doesn't support multi-segment mbufs, so it should not be possible 
>> for such mbufs to be transmitted by OVS.  So, I do not understand why this 
>> function needs to work with such mbufs.
>>
>> [Yi Yang] Only the DPDK driver/Tx function will use it, not OVS.
>> set_multiseg_udptcp_cksum is called in the GSO part, which is the last step
>> before the Tx function; before rte_gso_segment the packet is a big external
>> mbuf, not a multi-segmented one.
>>
> 
> Hi Ilya,
> 
> Now I understand Yi Yang's point better and I agree with him.
> Looks like the patch does the GSO in the DPDK TX function.
> It creates multi-seg mbufs after rte_gso_segment(), but immediately sends
> them out on the DPDK port, without traversing any other part of the
> OVS code. I guess in this case it should work OK?

Hi.  GSO itself should be possible to implement, but we should
not enable support for multi-segment mbufs in the NIC, since this
might end up with multi-segment packets being received by OVS,
causing lots of trouble (for example, dp_packet_clone() uses a simple
memcpy() that would result in invalid memory reads).
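To make the concern concrete, here is a minimal sketch (not from this patch)
of the kind of guard that would be needed before handing an mbuf to OVS if
multi-segment mbufs could ever be received; it assumes only the standard DPDK
mbuf API:

    #include <rte_mbuf.h>

    /* Hypothetical guard, for illustration only: OVS' dp_packet code assumes
     * contiguous packet data, so a multi-segment mbuf would have to be
     * linearized (or dropped) before ever being handed to the datapath. */
    static struct rte_mbuf *
    ensure_contiguous(struct rte_mbuf *m)
    {
        if (m->nb_segs > 1) {
            /* rte_pktmbuf_linearize() copies all segments into the first
             * one, provided there is enough tailroom; otherwise it fails. */
            if (rte_pktmbuf_linearize(m) < 0) {
                rte_pktmbuf_free(m);
                return NULL;
            }
        }
        return m;
    }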

GSO should be implemented in the same way TSO is implemented.  Right now
the TSO implementation has no software fallback to segment large packets
if the netdev doesn't support it - it simply drops them.  However, a
software fallback should be implemented someday and it should be
implemented in a generic way for all the netdev types instead of

Re: [ovs-dev] [PATCH] dpif-netdev: display EMC used entries for PMDs

2021-02-08 Thread Paolo Valerio
Hi Kevin,

thank you for your feedback

Kevin Traynor  writes:

> Hi Paolo,
>
> On 14/01/2021 13:19, Paolo Valerio wrote:
>> adds "emc entries" to "ovs-appctl dpif-netdev/pmd-stats-show" in order
>> to show the number of alive entries.
>> 
>
> Thanks for working on this. Not a full review, but a few high level
> comments. I would like to ask about the motivation - is it so a user can
> judge the effectiveness of using the EMC to decide to disable it? or to
> tune the EMC insertion rate?
>

The purpose is more to give an extra bit of information, which of course
can be used for tuning.

> If I run this and increase the flows, I see:
>   emc hits: 1961314
>   emc entries: 6032 (73.63%)
>   smc hits: 0
>   megaflow hits: 688946
>
> If I look at pmd-perf-stats (below), it shows the % split between emc
> and megaflow and I think I would use that to judge the effectiveness of
> the emc. I'm not too fussed if the emc occupancy is high or low, I just
> want to know if it is being effective at getting hits for my pkts, so
> I'm not sure that 'emc entries: 6032 (73.63%)' helps me decide to
> enable/disable it.
>
>   - EMC hits: 1972766  ( 74.0 %)
>   - SMC hits:   0  (  0.0 %)
>   - Megaflow hits: 693022  ( 26.0 %, 1.00 subtbl lookups/hit)
>
> Of course it doesn't account for the differing cost between them, but if
> emc hits were a low %, I'd want to experiment with disabling it.
>

When you are approaching a reasonably high number of alive entries,
and you expect for any reason that the number of flows for that PMD will
grow significantly, it gives you an insight that the hit rate may
decrease.

> Depending on your motivation, you might also want to take a look at the
> really nice fdb stats that Eelco did, it might give some ideas.
>

Thanks for the heads up.

Paolo

> Kevin.
>
>> Signed-off-by: Paolo Valerio 
>> ---
>>  NEWS   |  2 ++
>>  lib/dpif-netdev-perf.h |  2 ++
>>  lib/dpif-netdev.c  | 76 +++---
>>  tests/pmd.at   |  6 ++--
>>  4 files changed, 65 insertions(+), 21 deletions(-)
>> 
>> diff --git a/NEWS b/NEWS
>> index 617fe8e6a..e5d53a83b 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -20,6 +20,8 @@ Post-v2.14.0
>>   * Add generic IP protocol support to conntrack. With this change, all
>> none UDP, TCP, and ICMP traffic will be treated as general L3
>> traffic, i.e. using 3 tupples.
>> + * EMC alive entries counter has been added to command
>> +   "ovs-appctl dpif-netdev/pmd-stats-show"
>> - The environment variable OVS_UNBOUND_CONF, if set, is now used
>>   as the DNS resolver's (unbound) configuration file.
>> - Linux datapath:
>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>> index 72645b6b3..80f50eae9 100644
>> --- a/lib/dpif-netdev-perf.h
>> +++ b/lib/dpif-netdev-perf.h
>> @@ -157,6 +157,8 @@ struct pmd_perf_stats {
>>  uint64_t last_tsc;
>>  /* Used to space certain checks in time. */
>>  uint64_t next_check_tsc;
>> +/* Exact Match Cache used entries counter. */
>> +atomic_uint32_t emc_n_entries;
>>  /* If non-NULL, outermost cycle timer currently running in PMD. */
>>  struct cycle_timer *cur_timer;
>>  /* Set of PMD counters with their zero offsets. */
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 300861ca5..3eb70ccd5 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -913,7 +913,7 @@ dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread 
>> *pmd,
>> odp_port_t in_port);
>>  
>>  static inline bool emc_entry_alive(struct emc_entry *ce);
>> -static void emc_clear_entry(struct emc_entry *ce);
>> +static void emc_clear_entry(struct emc_entry *ce, struct pmd_perf_stats *s);
>>  static void smc_clear_entry(struct smc_bucket *b, int idx);
>>  
>>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
>> @@ -955,13 +955,16 @@ dfc_cache_init(struct dfc_cache *flow_cache)
>>  }
>>  
>>  static void
>> -emc_cache_uninit(struct emc_cache *flow_cache)
>> +emc_cache_uninit(struct emc_cache *flow_cache, struct pmd_perf_stats *s)
>>  {
>>  int i;
>>  
>> +
>>  for (i = 0; i < ARRAY_SIZE(flow_cache->entries); i++) {
>> -emc_clear_entry(&flow_cache->entries[i]);
>> +emc_clear_entry(&flow_cache->entries[i], s);
>>  }
>> +
>> +atomic_store_relaxed(&s->emc_n_entries, 0);
>>  }
>>  
>>  static void
>> @@ -977,21 +980,21 @@ smc_cache_uninit(struct smc_cache *smc)
>>  }
>>  
>>  static void
>> -dfc_cache_uninit(struct dfc_cache *flow_cache)
>> +dfc_cache_uninit(struct dfc_cache *flow_cache, struct pmd_perf_stats *s)
>>  {
>>  smc_cache_uninit(&flow_cache->smc_cache);
>> -emc_cache_uninit(&flow_cache->emc_cache);
>> +emc_cache_uninit(&flow_cache->emc_cache, s);
>>  }
>>  
>>  /* Check and clear dead flow references slowly (one entry at each
>>   * invocation).  */
>>  static void
>> -emc_cache_slow_sweep(struct emc_cache *flow_cache)
>> 

Re: [ovs-dev] [PATCH ovn v2 04/10] lflow-cache: Add lflow-cache/show-stats command.

2021-02-08 Thread Numan Siddique
On Thu, Feb 4, 2021 at 6:55 PM Dumitru Ceara  wrote:
>
> Signed-off-by: Dumitru Ceara 

Hi Dumitru,

Please see below for  a few comments.

Thanks
Numan

> ---
>  controller/lflow-cache.c|   21 +
>  controller/lflow-cache.h|7 +++
>  controller/ovn-controller.c |   29 +
>  3 files changed, 57 insertions(+)
>
> diff --git a/controller/lflow-cache.c b/controller/lflow-cache.c
> index ce129ac..8e1c71e 100644
> --- a/controller/lflow-cache.c
> +++ b/controller/lflow-cache.c
> @@ -21,6 +21,12 @@
>  #include "lflow-cache.h"
>  #include "ovn/expr.h"
>
> +const char *lflow_cache_type_names[LCACHE_T_MAX] = {
> +[LCACHE_T_CONJ_ID] = "cache-conj-id",
> +[LCACHE_T_EXPR]= "cache-expr",
> +[LCACHE_T_MATCHES] = "cache-matches",
> +};
> +
>  struct lflow_cache {
>  struct hmap entries[LCACHE_T_MAX];
>  bool enabled;
> @@ -103,6 +109,21 @@ lflow_cache_is_enabled(struct lflow_cache *lc)
>  return lc && lc->enabled;
>  }
>
> +struct lflow_cache_stats *
> +lflow_cache_get_stats(const struct lflow_cache *lc)
> +{
> +if (!lc) {
> +return NULL;
> +}
> +
> +struct lflow_cache_stats *stats = xmalloc(sizeof *stats);
> +
> +for (size_t i = 0; i < LCACHE_T_MAX; i++) {
> +stats->n_entries[i] = hmap_count(&lc->entries[i]);
> +}
> +return stats;
> +}
> +

Instead of exposing 'struct lflow_cache_stats', I think it's better if
this function takes a 'struct ds *' as an out parameter and fills in all
the stat-related details.

This way, there is no need to extern the 'lflow_cache_type_names'.
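A rough sketch of that alternative (names are illustrative, not from the
patch): the cache module formats its own stats into a caller-provided
'struct ds', so the type-name array can stay private to lflow-cache.c:

    /* Sketch only, as it might look in lflow-cache.c: the type-name array
     * stays file-local and the caller just passes a dynamic string in. */
    #include "lflow-cache.h"
    #include "openvswitch/dynamic-string.h"

    static const char *lflow_cache_type_names[LCACHE_T_MAX] = {
        [LCACHE_T_CONJ_ID] = "cache-conj-id",
        [LCACHE_T_EXPR]    = "cache-expr",
        [LCACHE_T_MATCHES] = "cache-matches",
    };

    void
    lflow_cache_get_stats(const struct lflow_cache *lc, struct ds *output)
    {
        if (!lc || !output) {
            return;
        }
        for (size_t i = 0; i < LCACHE_T_MAX; i++) {
            ds_put_format(output, "%-16s: %"PRIuSIZE"\n",
                          lflow_cache_type_names[i],
                          hmap_count(&lc->entries[i]));
        }
    }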


>  void
>  lflow_cache_add_conj_id(struct lflow_cache *lc,
>  const struct sbrec_logical_flow *lflow,
> diff --git a/controller/lflow-cache.h b/controller/lflow-cache.h
> index 74a5b81..54873ee 100644
> --- a/controller/lflow-cache.h
> +++ b/controller/lflow-cache.h
> @@ -42,6 +42,8 @@ enum lflow_cache_type {
>  LCACHE_T_NONE = LCACHE_T_MAX, /* Not found in cache. */
>  };
>
> +extern const char *lflow_cache_type_names[LCACHE_T_MAX];
> +
>  struct lflow_cache_value {
>  enum lflow_cache_type type;
>  uint32_t conj_id_ofs;
> @@ -52,11 +54,16 @@ struct lflow_cache_value {
>  };
>  };
>
> +struct lflow_cache_stats {
> +size_t n_entries[LCACHE_T_MAX];
> +};
> +
>  struct lflow_cache *lflow_cache_create(void);
>  void lflow_cache_flush(struct lflow_cache *);
>  void lflow_cache_destroy(struct lflow_cache *);
>  void lflow_cache_enable(struct lflow_cache *, bool enabled);
>  bool lflow_cache_is_enabled(struct lflow_cache *);
> +struct lflow_cache_stats *lflow_cache_get_stats(const struct lflow_cache *);
>
>  void lflow_cache_add_conj_id(struct lflow_cache *,
>   const struct sbrec_logical_flow *,
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index 7d247fd..6ee6119 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -82,6 +82,7 @@ static unixctl_cb_func debug_pause_execution;
>  static unixctl_cb_func debug_resume_execution;
>  static unixctl_cb_func debug_status_execution;
>  static unixctl_cb_func lflow_cache_flush_cmd;
> +static unixctl_cb_func lflow_cache_show_stats_cmd;
>  static unixctl_cb_func debug_delay_nb_cfg_report;
>
>  #define DEFAULT_BRIDGE_NAME "br-int"
> @@ -2662,6 +2663,9 @@ main(int argc, char *argv[])
>  unixctl_command_register("lflow-cache/flush", "", 0, 0,
>   lflow_cache_flush_cmd,
>   _output_data->pd);
> +unixctl_command_register("lflow-cache/show-stats", "", 0, 0,
> + lflow_cache_show_stats_cmd,
> + _output_data->pd);
>
>  bool reset_ovnsb_idl_min_index = false;
>  unixctl_command_register("sb-cluster-state-reset", "", 0, 0,
> @@ -3270,6 +3274,31 @@ lflow_cache_flush_cmd(struct unixctl_conn *conn 
> OVS_UNUSED,
>  }
>
>  static void
> +lflow_cache_show_stats_cmd(struct unixctl_conn *conn, int argc OVS_UNUSED,
> +   const char *argv[] OVS_UNUSED, void *arg_)
> +{
> +struct flow_output_persistent_data *fo_pd = arg_;
> +struct lflow_cache *lc = fo_pd->lflow_cache;
> +struct ds ds = DS_EMPTY_INITIALIZER;
> +
> +ds_put_format(, "Enabled: %s\n",
> +  lflow_cache_is_enabled(lc) ? "true" : "false");
> +
> +struct lflow_cache_stats *stats = lflow_cache_get_stats(lc);
> +if (stats) {
> +for (size_t i = 0; i < LCACHE_T_MAX; i++) {
> +ds_put_format(, "%-16s: %"PRIuSIZE"\n",
> +  lflow_cache_type_names[i],
> +  stats->n_entries[i]);
> +}
> +}
> +unixctl_command_reply(conn, ds_cstr(&ds));
> +
> +ds_destroy(&ds);
> +free(stats);
> +}
> +

Please document this command in ovn-controller.8.xml.

Thanks
Numan

> +static void
>  cluster_state_reset_cmd(struct unixctl_conn *conn, int argc OVS_UNUSED,
> const 

Re: [ovs-dev] [PATCH ovn v2 03/10] lflow-cache: Move the lflow cache to its own module.

2021-02-08 Thread Numan Siddique
On Thu, Feb 4, 2021 at 6:56 PM Dumitru Ceara  wrote:
>
> This abstracts the implementation details of the ovn-controller logical
> flow cache and also refactors how the cache is used, allowing
> further commits to expand its functionality.
>
> Signed-off-by: Dumitru Ceara 

Hi Dumitru,

Acked-by: Numan Siddique 

Please see below for  a comment.

Thanks
Numan

> ---
>  controller/automake.mk  |2
>  controller/lflow-cache.c|  230 +
>  controller/lflow-cache.h|   77 ++
>  controller/lflow.c  |  336 
> ++-
>  controller/lflow.h  |8 -
>  controller/ovn-controller.c |   57 +++
>  lib/expr.c  |4 +
>  7 files changed, 447 insertions(+), 267 deletions(-)
>  create mode 100644 controller/lflow-cache.c
>  create mode 100644 controller/lflow-cache.h
>
> diff --git a/controller/automake.mk b/controller/automake.mk
> index 480578e..75119a6 100644
> --- a/controller/automake.mk
> +++ b/controller/automake.mk
> @@ -14,6 +14,8 @@ controller_ovn_controller_SOURCES = \
> controller/ip-mcast.h \
> controller/lflow.c \
> controller/lflow.h \
> +   controller/lflow-cache.c \
> +   controller/lflow-cache.h \
> controller/lport.c \
> controller/lport.h \
> controller/ofctrl.c \
> diff --git a/controller/lflow-cache.c b/controller/lflow-cache.c
> new file mode 100644
> index 000..ce129ac
> --- /dev/null
> +++ b/controller/lflow-cache.c
> @@ -0,0 +1,230 @@
> +/*
> + * Copyright (c) 2015, 2016 Nicira, Inc.
> + * Copyright (c) 2021, Red Hat, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include 
> +
> +#include "lib/ovn-sb-idl.h"
> +#include "lflow-cache.h"
> +#include "ovn/expr.h"
> +
> +struct lflow_cache {
> +struct hmap entries[LCACHE_T_MAX];
> +bool enabled;
> +};
> +
> +struct lflow_cache_entry {
> +struct hmap_node node;
> +struct uuid lflow_uuid; /* key */
> +
> +struct lflow_cache_value value;
> +};
> +
> +static struct lflow_cache_value *lflow_cache_add__(
> +struct lflow_cache *lc,
> +const struct sbrec_logical_flow *lflow,
> +enum lflow_cache_type type);
> +static void lflow_cache_delete__(struct lflow_cache *lc,
> + struct lflow_cache_entry *lce);
> +
> +struct lflow_cache *
> +lflow_cache_create(void)
> +{
> +struct lflow_cache *lc = xmalloc(sizeof *lc);
> +
> +for (size_t i = 0; i < LCACHE_T_MAX; i++) {
> +hmap_init(&lc->entries[i]);
> +}
> +
> +lc->enabled = true;
> +return lc;
> +}
> +
> +void
> +lflow_cache_flush(struct lflow_cache *lc)
> +{
> +if (!lc) {
> +return;
> +}
> +
> +for (size_t i = 0; i < LCACHE_T_MAX; i++) {
> +struct lflow_cache_entry *lce;
> +struct lflow_cache_entry *lce_next;
> +
> +HMAP_FOR_EACH_SAFE (lce, lce_next, node, &lc->entries[i]) {
> +lflow_cache_delete__(lc, lce);
> +}
> +}
> +}
> +
> +void
> +lflow_cache_destroy(struct lflow_cache *lc)
> +{
> +if (!lc) {
> +return;
> +}
> +
> +lflow_cache_flush(lc);
> +for (size_t i = 0; i < LCACHE_T_MAX; i++) {
> +hmap_destroy(&lc->entries[i]);
> +}
> +free(lc);
> +}
> +
> +void
> +lflow_cache_enable(struct lflow_cache *lc, bool enabled)
> +{
> +if (!lc) {
> +return;
> +}
> +
> +if (lc->enabled && !enabled) {
> +lflow_cache_flush(lc);
> +}
> +lc->enabled = enabled;
> +}
> +
> +bool
> +lflow_cache_is_enabled(struct lflow_cache *lc)
> +{
> +return lc && lc->enabled;
> +}
> +
> +void
> +lflow_cache_add_conj_id(struct lflow_cache *lc,
> +const struct sbrec_logical_flow *lflow,
> +uint32_t conj_id_ofs)

'struct sbrec_logical_flow' is used throughout the file only for the lflow uuid.

I'd suggest that all these functions which take 'sbrec_logical_flow' as
an argument instead take 'struct uuid lflow_uuid'.

Since we are planning to refactor the code base to separate the IDL access,
it would help in that regard.

Numan
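For illustration, the suggestion would turn the prototype above into
something along these lines (a sketch; whether the UUID is passed by value
or by pointer is a detail):

    /* Current prototype (from this patch): */
    void lflow_cache_add_conj_id(struct lflow_cache *,
                                 const struct sbrec_logical_flow *,
                                 uint32_t conj_id_ofs);

    /* Suggested shape - keyed on the logical flow UUID only, with no
     * dependency on the SB IDL row type: */
    void lflow_cache_add_conj_id(struct lflow_cache *,
                                 const struct uuid *lflow_uuid,
                                 uint32_t conj_id_ofs);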


> +{
> +struct lflow_cache_value *lcv =
> +lflow_cache_add__(lc, lflow, LCACHE_T_CONJ_ID);
> +
> +if (!lcv) {
> +return;
> +}
> +lcv->conj_id_ofs = conj_id_ofs;
> +}
> +
> +void
> +lflow_cache_add_expr(struct lflow_cache *lc,
> + const struct sbrec_logical_flow 

Re: [ovs-dev] [PATCH 00/15] Netdev vxlan-decap offload

2021-02-08 Thread Eli Britstein



On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:

On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein  wrote:


On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:

On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
 wrote:

On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein  wrote:

VXLAN decap in OVS-DPDK configuration consists of two flows:
F1: in_port(ens1f0),eth(),ipv4(),udp(), actions:tnl_pop(vxlan_sys_4789)
F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0

F1 is a classification flow. It has outer-header matches and
classifies the packet as a VXLAN packet; via the tnl_pop action the
packet continues processing in F2.
F2 is a flow that has matches on tunnel metadata as well as on the inner
packet headers (like any other flow).

In order to fully offload the VXLAN decap path, both F1 and F2 should be
offloaded. As there is more than one flow in HW, it is possible that
F1 is done by HW but F2 is not. The packet is then received by SW and
should be processed starting from F2, as F1 was already done by HW.
Rte_flows are applicable only to physical port IDs. Vport flows (e.g. F2)
are applied on uplink ports attached to OVS.

This patch-set makes use of [1], introduced in DPDK 20.11, which adds an API
for tunnel offloads.
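For readers unfamiliar with [1], the rough call sequence such a series builds
on looks like this (a sketch against the DPDK 20.11 rte_flow tunnel-offload
API; this is illustrative, not code from the series, and error handling is
omitted):

    #include <rte_flow.h>
    #include <rte_mbuf.h>

    static void
    vxlan_decap_offload_sketch(uint16_t port_id, struct rte_mbuf *miss_pkt)
    {
        struct rte_flow_tunnel tunnel = { .type = RTE_FLOW_ITEM_TYPE_VXLAN };
        struct rte_flow_error err;

        /* F1 (classification flow): ask the PMD for the actions that
         * implement tnl_pop for this tunnel; they are prepended to the
         * flow's actions. */
        struct rte_flow_action *pmd_actions;
        uint32_t num_pmd_actions;
        rte_flow_tunnel_decap_set(port_id, &tunnel, &pmd_actions,
                                  &num_pmd_actions, &err);

        /* F2 (inner flow on the vport): ask the PMD for the items that match
         * packets restored from this tunnel; they are prepended to the
         * pattern installed on the uplink (physical) port. */
        struct rte_flow_item *pmd_items;
        uint32_t num_pmd_items;
        rte_flow_tunnel_match(port_id, &tunnel, &pmd_items,
                              &num_pmd_items, &err);

        /* Partial offload (F1 hit in HW, F2 missed): recover the tunnel
         * metadata from the received mbuf so SW processing can resume
         * at F2. */
        struct rte_flow_restore_info info;
        rte_flow_get_restore_info(port_id, miss_pkt, &info, &err);
    }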

Travis:
v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552

GitHub Actions:
v1: https://github.com/elibritstein/OVS/actions/runs/515334647

[1] https://mails.dpdk.org/archives/dev/2020-October/187314.html

Eli Britstein (13):
netdev-offload: Add HW miss packet state recover API
netdev-dpdk: Introduce DPDK tunnel APIs
netdev-offload-dpdk: Implement flow dump create/destroy APIs
netdev-dpdk: Add flow_api support for netdev vxlan vports
netdev-offload-dpdk: Implement HW miss packet recover for vport
dpif-netdev: Add HW miss packet state recover logic
netdev-offload-dpdk: Change log rate limits
netdev-offload-dpdk: Support tunnel pop action
netdev-offload-dpdk: Refactor offload rule creation
netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
  port
netdev-offload-dpdk: Map netdev and ufid to offload objects
netdev-offload-dpdk: Support vports flows offload
netdev-dpdk-offload: Add vxlan pattern matching function

Ilya Maximets (2):
netdev-offload: Allow offloading to netdev without ifindex.
netdev-offload: Disallow offloading to unrelated tunneling vports.

   Documentation/howto/dpdk.rst  |   1 +
   NEWS  |   2 +
   lib/dpif-netdev.c |  49 +-
   lib/netdev-dpdk.c | 135 ++
   lib/netdev-dpdk.h | 104 -
   lib/netdev-offload-dpdk.c | 851 +-
   lib/netdev-offload-provider.h |   5 +
   lib/netdev-offload-tc.c   |   8 +
   lib/netdev-offload.c  |  29 +-
   lib/netdev-offload.h  |   1 +
   10 files changed, 1033 insertions(+), 152 deletions(-)

--
2.28.0.546.g385c171


Hi Eli,

Thanks for posting this new patchset to support tunnel decap action offload.

I haven't looked at the entire patchset yet. But I focused on the
patches that introduce 1-to-many mapping between an OVS flow (f2) and
HW offloaded flows.

Here is a representation of the design proposed in this patchset. A
flow f2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
the underlying uplink/physical port is P0, gets offloaded to not only
P0, but also to other physical ports P1, P2... and so on.

  P0 <> VxLAN-vPort <> VFRep-1

  P1
  P2
  ...
  Pn

IMO, the problems with this design are:

- Offloading a flow to an unrelated physical device that has nothing
to do with that flow (invalid device for the flow).
- Offloading to not just one, but several such invalid physical devices.
- Consuming HW resources for a flow that is never seen or intended to
be processed by those physical devices.
- Impacts flow scale on other physical devices, since it would consume
their HW resources with a large number of such invalid flows.
- The indirect list used to track these multiple mappings complicates
the offload layer implementation.
- The addition of flow_dump_create() to offload APIs, just to parse
and get a list of user datapath netdevs is confusing and not needed.

I have been exploring an alternate design to address this problem of
figuring out the right physical device for a given tunnel inner-flow.
I will send a patch, please take a look so we can continue the discussion.

I just posted this patch, please see the link below; this is currently
based on the decap offload patchset (but it can be rebased if needed,
without the decap patchset too). The patch provides changes to pass
physical port (orig_in_port) information to the offload layer in the
context of flow F2. Note that additional changes would be needed in
the decap patchset to utilize this, if we agree on this approach.

https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapa...@broadcom.com/
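(Editor's sketch of the idea, using a hypothetical name for the proposed
metadata: the datapath remembers the physical port the tunneled packet
originally arrived on and the offload layer resolves it to a single netdev,
so the vport flow F2 is offloaded only to that uplink.)

    #include "netdev-offload.h"    /* netdev_ports_get() */

    /* 'orig_in_port' is the proposed new piece of metadata carried with the
     * vport flow; the lookup helper already exists in the offload layer. */
    static struct netdev *
    vport_flow_offload_netdev(odp_port_t orig_in_port, const char *dpif_type)
    {
        return netdev_ports_get(orig_in_port, dpif_type);
    }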

Re: [ovs-dev] [PATCH 00/15] Netdev vxlan-decap offload

2021-02-08 Thread Sriharsha Basavapatna via dev
On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein  wrote:
>
>
> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
> > On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
> >  wrote:
> >> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein  wrote:
> >>> VXLAN decap in OVS-DPDK configuration consists of two flows:
> >>> F1: in_port(ens1f0),eth(),ipv4(),udp(), actions:tnl_pop(vxlan_sys_4789)
> >>> F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0
> >>>
> >>> F1 is a classification flow. It has outer headers matches and it
> >>> classifies the packet as a VXLAN packet, and using tnl_pop action the
> >>> packet continues processing in F2.
> >>> F2 is a flow that has matches on tunnel metadata as well as on the inner
> >>> packet headers (as any other flow).
> >>>
> >>> In order to fully offload VXLAN decap path, both F1 and F2 should be
> >>> offloaded. As there are more than one flow in HW, it is possible that
> >>> F1 is done by HW but F2 is not. Packet is received by SW, and should be
> >>> processed starting from F2 as F1 was already done by HW.
> >>> Rte_flows are applicable only on physical port IDs. Vport flows (e.g. F2)
> >>> are applied on uplink ports attached to OVS.
> >>>
> >>> This patch-set makes use of [1] introduced in DPDK 20.11, that adds API
> >>> for tunnel offloads.
> >>>
> >>> Travis:
> >>> v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552
> >>>
> >>> GitHub Actions:
> >>> v1: https://github.com/elibritstein/OVS/actions/runs/515334647
> >>>
> >>> [1] https://mails.dpdk.org/archives/dev/2020-October/187314.html
> >>>
> >>> Eli Britstein (13):
> >>>netdev-offload: Add HW miss packet state recover API
> >>>netdev-dpdk: Introduce DPDK tunnel APIs
> >>>netdev-offload-dpdk: Implement flow dump create/destroy APIs
> >>>netdev-dpdk: Add flow_api support for netdev vxlan vports
> >>>netdev-offload-dpdk: Implement HW miss packet recover for vport
> >>>dpif-netdev: Add HW miss packet state recover logic
> >>>netdev-offload-dpdk: Change log rate limits
> >>>netdev-offload-dpdk: Support tunnel pop action
> >>>netdev-offload-dpdk: Refactor offload rule creation
> >>>netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
> >>>  port
> >>>netdev-offload-dpdk: Map netdev and ufid to offload objects
> >>>netdev-offload-dpdk: Support vports flows offload
> >>>netdev-dpdk-offload: Add vxlan pattern matching function
> >>>
> >>> Ilya Maximets (2):
> >>>netdev-offload: Allow offloading to netdev without ifindex.
> >>>netdev-offload: Disallow offloading to unrelated tunneling vports.
> >>>
> >>>   Documentation/howto/dpdk.rst  |   1 +
> >>>   NEWS  |   2 +
> >>>   lib/dpif-netdev.c |  49 +-
> >>>   lib/netdev-dpdk.c | 135 ++
> >>>   lib/netdev-dpdk.h | 104 -
> >>>   lib/netdev-offload-dpdk.c | 851 +-
> >>>   lib/netdev-offload-provider.h |   5 +
> >>>   lib/netdev-offload-tc.c   |   8 +
> >>>   lib/netdev-offload.c  |  29 +-
> >>>   lib/netdev-offload.h  |   1 +
> >>>   10 files changed, 1033 insertions(+), 152 deletions(-)
> >>>
> >>> --
> >>> 2.28.0.546.g385c171
> >>>
> >> Hi Eli,
> >>
> >> Thanks for posting this new patchset to support tunnel decap action 
> >> offload.
> >>
> >> I haven't looked at the entire patchset yet. But I focused on the
> >> patches that introduce 1-to-many mapping between an OVS flow (f2) and
> >> HW offloaded flows.
> >>
> >> Here is a representation of the design proposed in this patchset. A
> >> flow f2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
> >> the underlying uplink/physical port is P0, gets offloaded to not only
> >> P0, but also to other physical ports P1, P2... and so on.
> >>
> >>  P0 <> VxLAN-vPort <> VFRep-1
> >>
> >>  P1
> >>  P2
> >>  ...
> >>  Pn
> >>
> >> IMO, the problems with this design are:
> >>
> >> - Offloading a flow to an unrelated physical device that has nothing
> >> to do with that flow (invalid device for the flow).
> >> - Offloading to not just one, but several such invalid physical devices.
> >> - Consuming HW resources for a flow that is never seen or intended to
> >> be processed by those physical devices.
> >> - Impacts flow scale on other physical devices, since it would consume
> >> their HW resources with a large number of such invalid flows.
> >> - The indirect list used to track these multiple mappings complicates
> >> the offload layer implementation.
> >> - The addition of flow_dump_create() to offload APIs, just to parse
> >> and get a list of user datapath netdevs is confusing and not needed.
> >>
> >> I have been exploring an alternate design to address this problem of
> >> figuring out the right physical device for a given tunnel inner-flow.
> >> I will send a patch, please take a look so we can continue the discussion.
> > I just posted this patch, please see the link 

Re: [ovs-dev] [PATCH] netdev-afxdp: Add start qid support.

2021-02-08 Thread Ilya Maximets
On 2/7/21 5:05 PM, Toshiaki Makita wrote:
> On 2021/02/07 2:00, William Tu wrote:
>> On Fri, Feb 5, 2021 at 1:08 PM Gregory Rose  wrote:
>>> On 2/4/2021 7:08 PM, William Tu wrote:
 On Thu, Feb 4, 2021 at 3:17 PM Gregory Rose  wrote:
> On 2/3/2021 1:21 PM, William Tu wrote:
>> The Mellanox card has a different XSK design. It requires users to create
>> dedicated queues for XSK. Unlike Intel NICs, which load the XDP program on
>> all queues, Mellanox only loads the XDP program on a subset of its queues.
>>
>> When OVS uses AF_XDP with mlx5, it doesn't replace the existing RX and TX
>> queues in the channel with XSK RX and XSK TX queues, but it creates an
>> additional pair of queues for XSK in that channel. To distinguish
>> regular and XSK queues, mlx5 uses a different range of qids.
>> That means, if the card has 24 queues, queues 0..11 correspond to
>> regular queues, and queues 12..23 are XSK queues.
>> In this case, we should attach the netdev-afxdp with 'start-qid=12'.
>>
>> I tested using Mellanox Connect-X 6Dx, by setting 'start-qid=1', and:
>>  $ ethtool -L enp2s0f0np0 combined 1
>>  # queue 0 is for non-XDP traffic, queue 1 is for XSK
>>  $ ethtool -N enp2s0f0np0 flow-type udp4 action 1
>> note: we need additionally add flow-redirect rule to queue 1
>
> Seems awfully hardware dependent.  Is this just for Mellanox or does
> it have general usefulness?
>
 It is just Mellanox's design, which requires pre-configuring the
 flow director.
 I only have cards from Intel and Mellanox, so I don't know about other
 vendors.

 Thanks,
 William

>>>
>>> I think we need to abstract the HW layer a little bit.  This start-qid
>>> option is specific to a single piece of HW, at least at this point.
>>> We should expect that further HW  specific requirements for
>>> different NIC vendors will come up in the future.  I suggest
>>> adding a hw_options:mellanox:start-qid type hierarchy  so that
>>> as new HW requirements come up we can easily scale.  It will
>>> also make adding new vendors easier in the future.
>>>
>>> Even with NIC vendors you can't always count on each new generation
>>> design to always keep old requirements and methods for feature
>>> enablement.
>>>
>>> What do you think?
>>>
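To make the two shapes concrete, a sketch of how netdev-afxdp could read the
option either way (the vendor-scoped key name is purely illustrative; only
'start-qid' is what this patch actually proposes):

    #include <stdlib.h>
    #include "smap.h"
    #include "util.h"

    /* Flat form, as in this patch: options:start-qid=12 */
    static int
    afxdp_get_start_qid(const struct smap *args)
    {
        return smap_get_int(args, "start-qid", 0);
    }

    /* Vendor-scoped form suggested above, e.g. a hypothetical
     * options:hw-options-mellanox-start-qid=12, so new per-vendor knobs
     * can be added without colliding with generic netdev options. */
    static int
    afxdp_get_vendor_start_qid(const struct smap *args, const char *vendor)
    {
        char *key = xasprintf("hw-options-%s-start-qid", vendor);
        int qid = smap_get_int(args, key, 0);

        free(key);
        return qid;
    }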
>> Thanks for the feedback.
>> So far I don't know whether other vendors will need this option or not.
> 
> FWIU, this API ("The lower half of the available amount of RX queues are
> regular queues, and the upper half are XSK RX queues") is the result of a long
> discussion about supporting dedicated/isolated XSK rings, and is not meant to
> be a Mellanox-specific feature.
> 
> https://patchwork.ozlabs.org/project/netdev/cover/20190524093431.20887-1-maxi...@mellanox.com/
> https://patchwork.ozlabs.org/project/netdev/cover/20190612155605.22450-1-maxi...@mellanox.com/
> 
> Toshiaki Makita

Thanks for the links.  Very helpful.

From what I understand, the lower half of the queues should still work, i.e.
it should still be possible to attach an AF_XDP socket to them.  But
they will not work in zero-copy mode ("generic" only?).
William, could you check that?  Does it work, and which mode does
"best-effort" end up with?  And what kind of errors does libbpf return
if we're trying to enable zero-copy?

There are still few unanswered questions in those discussions and
a clear lack of documentation.  It seems that it's a work-in-progress,
intermediate step towards some better user API.  However it's unclear
how this API will look like and when it will be implemented.


For BPF list and maintainers:

Is it possible to have some of this behavior documented?
How can an application determine which netdevs are using this upper/bottom
half scheme for their regular and xsk channels/queues?  How should users
do that without digging into the kernel code or spending a
few hours googling for presentations from some conferences?

And I actually failed to find any written reference to the fact that
I have to manually configure redirection of the traffic in order to
receive it on XSK queues/channels and not on regular ones.  This is
very confusing and hard to understand, especially for a regular OVS
user who is not familiar with XDP and kernel internals and just wants
to utilize faster userspace networking.

> 
>> I think adding another "hw_options" is a little confusing because this
>> is already an option on the device.
>> Looking at AF_XDP driver at DPDK, it also has similar option:
>> see start_queue
>> https://doc.dpdk.org/guides/nics/af_xdp.html

This option for DPDK is mainly positioned as a way to utilize multi-queue
and a way to create different DPDK ports from different ranges of
queues of the same device, so it's not exactly the same thing.  But
I see that it likely had a double purpose that wasn't mentioned in a
commit message or documentation.

I think, for now, we can have this option in OVS in the same way, i.e.
as a way to partition the device or 

Re: [ovs-dev] [PATCH ovn v2 01/10] lflow: Fix cache update when I-P engine aborts.

2021-02-08 Thread Dumitru Ceara

On 2/8/21 12:01 PM, Numan Siddique wrote:

On Thu, Feb 4, 2021 at 6:55 PM Dumitru Ceara  wrote:


If the I-P engine aborts in the middle of a run, the 'flow_output' node
change handlers or run callback might not be called.

As a side effect this may cause Logical_Flow IDL tracked changes to
not be processed during the iteration.  As a consequence, if a
Logical_Flow was removed from the Southbound DB, then its associated
cache entry (if any) will not be flushed.

This has two side effects:
1. Stale entries are kept in the cache until a full recompute or cache
flush happens.
2. If a new Logical_Flow is added to the Southbound and it happens to
have a UUID that matches one of a stale cache entry then
ovn-controller will install incorrect openflows.

IDL tracked changes are cleared at every iteration of ovn-controller.
Skipping the ovsdb_idl_track_clear() call if the I-P engine aborted is
not a valid option for now because it might lead to some of the IDL
changes to be processed twice.

Instead, lflow_handle_cached_flows() is called now at every iteration
of ovn-controller making sure deleted flows are removed from the cache.

Also, rename the 'flush-lflow-cache' unixctl command to
'lflow-cache/flush' to better match the style of other command names.


Hi Dumitru,


Hi Numan,



The patch LGTM. 2 comments.

1. I think it's better to keep the command - flush-lflow-cache just so that
   we don't break any existing users. Let's say the CMS has scripted running
   this command once in a while; then it would break after this patch.

   So I would suggest having both commands, doing the same thing. We could add
   "deprecated" to the description of flush-lflow-cache when the user
   runs - ovn-appctl -t ovn-controller list-commands.


Makes sense, I'll do that in v3.



2.  Can you please document the new command - lflow-cache/flush in
ovn-controller.8.xml.


Sure.


 My bad that I missed adding documentation to flush-lflow-cache.


No problem.



Thanks
Numan


Thanks,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 02/10] lflow: Refactor convert_match_to_expr() to explicitly consume prereqs.

2021-02-08 Thread Numan Siddique
On Thu, Feb 4, 2021 at 6:55 PM Dumitru Ceara  wrote:
>
> It was (and still is) the responsibility of the caller of
> convert_match_to_expr() to explicitly free any 'prereqs' expression that
> was passed as argument if the expression parsing of the 'lflow' match failed.
>
> However, convert_match_to_expr() now updates the value of '*prereqs' setting
> it to NULL in the successful case.  This makes it easier for callers of the
> function because 'expr_destroy(prereqs)' can now be called unconditionally.
>
> Signed-off-by: Dumitru Ceara 

Acked-by: Numan Siddique 

Numan
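To illustrate the caller-side simplification this enables (a sketch based on
the new semantics; as the note below says, later patches actually take
advantage of it), the call sites can follow this pattern:

    struct expr *prereqs = NULL;
    /* ... ovnacts_parse_string() may have filled in 'prereqs' ... */

    struct expr *expr = convert_match_to_expr(lflow, dp, &prereqs,
                                              l_ctx_in->addr_sets,
                                              l_ctx_in->port_groups,
                                              l_ctx_out->lfrr, NULL);
    /* On success '*prereqs' has been consumed and set to NULL; on failure it
     * is left untouched.  Either way this is now safe unconditionally: */
    expr_destroy(prereqs);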

> ---
> Note: This patch doesn't change the callers of convert_match_to_expr() to
> take advantage of the new semantic but following patches do.
> ---
>  controller/lflow.c |   14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/controller/lflow.c b/controller/lflow.c
> index 04ba0d1..495653f 100644
> --- a/controller/lflow.c
> +++ b/controller/lflow.c
> @@ -758,11 +758,12 @@ add_matches_to_flow_table(const struct 
> sbrec_logical_flow *lflow,
>  /* Converts the match and returns the simplified expr tree.
>   *
>   * The caller should evaluate the conditions and normalize the expr tree.
> + * If parsing is successful, '*prereqs' is also consumed.
>   */
>  static struct expr *
>  convert_match_to_expr(const struct sbrec_logical_flow *lflow,
>const struct sbrec_datapath_binding *dp,
> -  struct expr *prereqs,
> +  struct expr **prereqs,
>const struct shash *addr_sets,
>const struct shash *port_groups,
>struct lflow_resource_ref *lfrr,
> @@ -795,8 +796,9 @@ convert_match_to_expr(const struct sbrec_logical_flow 
> *lflow,
>  sset_destroy(_groups_ref);
>
>  if (!error) {
> -if (prereqs) {
> -e = expr_combine(EXPR_T_AND, e, prereqs);
> +if (*prereqs) {
> +e = expr_combine(EXPR_T_AND, e, *prereqs);
> +*prereqs = NULL;
>  }
>  e = expr_annotate(e, &symtab, &error);
>  }
> @@ -854,7 +856,7 @@ consider_logical_flow__(const struct sbrec_logical_flow 
> *lflow,
>  .n_tables = LOG_PIPELINE_LEN,
>  .cur_ltable = lflow->table_id,
>  };
> -struct expr *prereqs;
> +struct expr *prereqs = NULL;
>  char *error;
>
>  error = ovnacts_parse_string(lflow->actions, &pp, &ovnacts, &prereqs);
> @@ -885,7 +887,7 @@ consider_logical_flow__(const struct sbrec_logical_flow 
> *lflow,
>  struct expr *expr = NULL;
>  if (!l_ctx_out->lflow_cache_map) {
>  /* Caching is disabled. */
> -expr = convert_match_to_expr(lflow, dp, prereqs, l_ctx_in->addr_sets,
> +expr = convert_match_to_expr(lflow, dp, &prereqs, l_ctx_in->addr_sets,
>   l_ctx_in->port_groups, l_ctx_out->lfrr,
>   NULL);
>  if (!expr) {
> @@ -949,7 +951,7 @@ consider_logical_flow__(const struct sbrec_logical_flow 
> *lflow,
>
>  bool pg_addr_set_ref = false;
>  if (!expr) {
> -expr = convert_match_to_expr(lflow, dp, prereqs, l_ctx_in->addr_sets,
> +expr = convert_match_to_expr(lflow, dp, &prereqs, l_ctx_in->addr_sets,
>   l_ctx_in->port_groups, l_ctx_out->lfrr,
>   &pg_addr_set_ref);
>  if (!expr) {
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 01/10] lflow: Fix cache update when I-P engine aborts.

2021-02-08 Thread Numan Siddique
On Thu, Feb 4, 2021 at 6:55 PM Dumitru Ceara  wrote:
>
> If the I-P engine aborts in the middle of a run, the 'flow_output' node
> change handlers or run callback might not be called.
>
> As a side effect this may cause Logical_Flow IDL tracked changes to
> not be processed during the iteration.  As a consequence, if a
> Logical_Flow was removed from the Southbound DB, then its associated
> cache entry (if any) will not be flushed.
>
> This has two side effects:
> 1. Stale entries are kept in the cache until a full recompute or cache
>flush happens.
> 2. If a new Logical_Flow is added to the Southbound and it happens to
>have a UUID that matches one of a stale cache entry then
>ovn-controller will install incorrect openflows.
>
> IDL tracked changes are cleared at every iteration of ovn-controller.
> Skipping the ovsdb_idl_track_clear() call if the I-P engine aborted is
> not a valid option for now because it might lead to some of the IDL
> changes to be processed twice.
>
> Instead, lflow_handle_cached_flows() is called now at every iteration
> of ovn-controller making sure deleted flows are removed from the cache.
>
> Also, rename the 'flush-lflow-cache' unixctl command to
> 'lflow-cache/flush' to better match the style of other command names.

Hi Dumitru,

The patch LGTM. 2 comments.

1. I think it's better to keep the command - flush-lflow-cache just so that
we don't break any existing users. Let's say the CMS has scripted running
this command once in a while; then it would break after this patch.

So I would suggest having both commands, doing the same thing. We could add
"deprecated" to the description of flush-lflow-cache when the user
runs - ovn-appctl -t ovn-controller list-commands.

2.  Can you please document the new command - lflow-cache/flush in
ovn-controller.8.xml.
My bad that I missed adding documentation to flush-lflow-cache.

Thanks
Numan


>
> Fixes: 2662498bfd13 ("ovn-controller: Persist the conjunction ids allocated 
> for conjuctive matches.")
> Signed-off-by: Dumitru Ceara 
> ---
>  controller/lflow.c  |   30 +-
>  controller/lflow.h  |4 +++-
>  controller/ovn-controller.c |   20 
>  tests/ovn.at|2 +-
>  4 files changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/controller/lflow.c b/controller/lflow.c
> index 946c1e0..04ba0d1 100644
> --- a/controller/lflow.c
> +++ b/controller/lflow.c
> @@ -1425,19 +1425,6 @@ lflow_run(struct lflow_ctx_in *l_ctx_in, struct 
> lflow_ctx_out *l_ctx_out)
>  {
>  COVERAGE_INC(lflow_run);
>
> -/* when lflow_run is called, it's possible that some of the logical flows
> - * are deleted. We need to delete the lflow cache for these
> - * lflows (if present), otherwise, they will not be deleted at all. */
> -if (l_ctx_out->lflow_cache_map) {
> -const struct sbrec_logical_flow *lflow;
> -SBREC_LOGICAL_FLOW_TABLE_FOR_EACH_TRACKED (lflow,
> -
> l_ctx_in->logical_flow_table) {
> -if (sbrec_logical_flow_is_deleted(lflow)) {
> -lflow_cache_delete(l_ctx_out->lflow_cache_map, lflow);
> -}
> -}
> -}
> -
>  add_logical_flows(l_ctx_in, l_ctx_out);
>  add_neighbor_flows(l_ctx_in->sbrec_port_binding_by_name,
> l_ctx_in->mac_binding_table, 
> l_ctx_in->local_datapaths,
> @@ -1446,6 +1433,23 @@ lflow_run(struct lflow_ctx_in *l_ctx_in, struct 
> lflow_ctx_out *l_ctx_out)
>   l_ctx_out->flow_table);
>  }
>
> +/* Should be called at every ovn-controller iteration before IDL tracked
> + * changes are cleared to avoid maintaining cache entries for flows that
> + * don't exist anymore.
> + */
> +void
> +lflow_handle_cached_flows(struct hmap *lflow_cache_map,
> +  const struct sbrec_logical_flow_table *flow_table)
> +{
> +const struct sbrec_logical_flow *lflow;
> +
> +SBREC_LOGICAL_FLOW_TABLE_FOR_EACH_TRACKED (lflow, flow_table) {
> +if (sbrec_logical_flow_is_deleted(lflow)) {
> +lflow_cache_delete(lflow_cache_map, lflow);
> +}
> +}
> +}
> +
>  void
>  lflow_destroy(void)
>  {
> diff --git a/controller/lflow.h b/controller/lflow.h
> index ba79cc3..cf4f0e8 100644
> --- a/controller/lflow.h
> +++ b/controller/lflow.h
> @@ -156,7 +156,9 @@ struct lflow_ctx_out {
>  };
>
>  void lflow_init(void);
> -void  lflow_run(struct lflow_ctx_in *, struct lflow_ctx_out *);
> +void lflow_run(struct lflow_ctx_in *, struct lflow_ctx_out *);
> +void lflow_handle_cached_flows(struct hmap *lflow_cache,
> +   const struct sbrec_logical_flow_table *);
>  bool lflow_handle_changed_flows(struct lflow_ctx_in *, struct lflow_ctx_out 
> *);
>  bool lflow_handle_changed_ref(enum ref_type, const char *ref_name,
>struct lflow_ctx_in *, struct lflow_ctx_out *,
> diff --git 

[ovs-dev] [PATCH ovn] northd: improve OVN BFD documentation

2021-02-08 Thread Lorenzo Bianconi
Signed-off-by: Lorenzo Bianconi 
---
 ovn-nb.xml | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/ovn-nb.xml b/ovn-nb.xml
index 58083f101..76c3a3d86 100644
--- a/ovn-nb.xml
+++ b/ovn-nb.xml
@@ -129,10 +129,11 @@
 options are described individually below.
   
 
-  
+  
 
   These options apply when ovn-controller configures
-  BFD on tunnels interfaces.
+  OVS BFD on tunnel interfaces. Please note these parameters refer
+  to the legacy OVS BFD implementation and not to the OVN BFD one.
 
 
 
@@ -2281,11 +2282,10 @@
 
   
 When more than one gateway chassis is specified, OVN only uses
-one at a time.  OVN uses BFD to monitor gateway connectivity,
-preferring the highest-priority gateway that is online.
-Priorities are specified in the priority column
-of  or .
+one at a time. OVN can rely on OVS BFD implementation to monitor
+gateway connectivity, preferring the highest-priority gateway
+that is online.  Priorities are specified in the priority
+column of  or .
   
 
   
@@ -3759,7 +3759,14 @@
 
   
 
-  Contains BFD parameter for ovn-controller bfd configuration.
+  Contains BFD parameters for ovn-controller BFD configuration.
+  The OVN BFD implementation is used to detect failures in the
+  path between adjacent forwarding engines, including the OVN interfaces.
+  OVN BFD provides link status info to OVN northd in order to update
+  logical flows according to the status of BFD endpoints. In the current
+  implementation OVN BFD is used to check next-hop status for ECMP routes.
+  Please note the BFD table refers to the OVN BFD implementation and not
+  to the legacy OVS one.
 
 
 
-- 
2.29.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] tests: Fix Port_Binding up test.

2021-02-08 Thread 陈供明
Thanks,
Gongming

On Feb 8, 2021, at 6:01 PM, Numan Siddique  wrote:

On Mon, Feb 8, 2021 at 2:40 PM Dumitru Ceara  wrote:
>
> On 2/7/21 3:52 AM, gmingchen(陈供明) wrote:
> > From: Gongming Chen 
> >
> > After setting the iface-id, immediately checking the up status of the port
> > binding will occasionally fail, especially when the port binding
> > status is reported later.
> >
> > When it fails, the following will be output:
> > Checking values in sb Port_Binding with logical_port=lsp1 against 
false... found false
> > ovs-vsctl add-port br-int lsp1 -- set Interface lsp1 
external-ids:iface-id=lsp1
> > ./ovn-macros.at:307: "$@"
> > Checking values in sb Port_Binding with logical_port=lsp1 against 
true... found false
> > _uuid   : 15ebabb6-3dbb-4806-aa85-d1c03e3b39f6
> > logical_port: lsp1
> > up  : true
> > ./ovn-macros.at:393: hard failure
> >
> > Fixes: 4d3cb42b076b ("binding: Set Logical_Switch_Port.up when all OVS 
flows are installed.")
> > Signed-off-by: Gongming Chen 
> > ---
>
> Thanks for the fix!
>
> Acked-by: Dumitru Ceara 

Thanks for the fix. I applied this patch to master.

I also added your name to the AUTHORS list.

Numan

>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] lflow: Use learn() action to generate LB hairpin reply flows.

2021-02-08 Thread Dumitru Ceara

On 2/8/21 11:09 AM, Numan Siddique wrote:

On Sat, Feb 6, 2021 at 4:00 AM Dumitru Ceara  wrote:


The main trait of load balancer hairpin traffic is that it never leaves
the local hypervisor.  Essentially this means that only hairpin
openflows installed for logical switches that have at least one logical
switch port bound locally can ever be hit.

Until now, if a load balancer was applied on multiple logical switches
that are connected through a distributed router, ovn-controller would
install flows to detect hairpin replies for all logical switches. In
practice this leads to a very high number of openflows out of which
most will never be used.

Instead we now use an additional action, learn(), on flows that match on
packets that create the hairpin session.  The learn() action will then
generate the necessary flows to handle hairpin replies, but only for
the local datapaths which actually generate hairpin traffic.

For example, simulating how ovn-k8s uses load balancer for services,
in a "switch per node" scenario, the script below would generate
10K (n_nodes * n_vips * n_backends) openflows on every node in table=69
(hairpin reply).  With this patch the maximum number of openflows that
can be created for hairpin replies is 200 (n_vips * n_backends).

In general, for deployments that leverage switch-per-node topologies,
the number of openflows is reduced by a factor of N, where N is the
number of nodes.

   $ cat lbs.sh
   NODES=50
   VIPS=20
   BACKENDS=10
   ovn-nbctl lr-add rtr
   for ((i = 1; i <= $NODES; i++)); do
   ovn-nbctl \
   -- ls-add ls$i \
   -- lsp-add ls$i vm$i \
   -- lsp-add ls$i ls$i-rtr \
   -- lsp-set-type ls$i-rtr router \
   -- lsp-set-options ls$i-rtr router-port=rtr-ls$i \
   -- lrp-add rtr rtr-ls$i 00:00:00:00:01:00 42.42.42.$i/24
   done

   for ((i = 1; i <= $VIPS; i++)); do
   lb=lb$i
   vip=10.10.10.$i:1
   bip=20.20.20.1:2
   for ((j = 2; j <= $BACKENDS; j++)); do
   bip="$bip,20.20.20.$j:2"
   done
   ovn-nbctl lb-add $lb $vip $backends
   done

   for ((i = 1; i <= $NODES; i++)); do
   for ((j = 1; j <= $VIPS; j++)); do
   ovn-nbctl ls-lb-add ls$i lb$j
   done
   done

   ovs-vsctl add-port br-int vm1 \
   -- set interface vm1 type=internal \
   -- set interface vm1 external-ids:iface-id=vm1

Suggested-by: Ilya Maximets 
Signed-off-by: Dumitru Ceara 


Hi Dumitru,


Hi Numan,



Thanks for this patch. The patch LGTM. I tested it out with  huge NB
DB with lots
of load balancer and backends.  I see a significant reduction in OF flows.


Thanks for the review!



Please see the few comments below.

Thanks
Numan


---
  controller/lflow.c  | 204 ---
  tests/ofproto-macros.at |   5 +-
  tests/ovn.at| 516 +---
  3 files changed, 455 insertions(+), 270 deletions(-)

diff --git a/controller/lflow.c b/controller/lflow.c
index 946c1e0..2b7d356 100644
--- a/controller/lflow.c
+++ b/controller/lflow.c
@@ -1171,6 +1171,178 @@ add_neighbor_flows(struct ovsdb_idl_index 
*sbrec_port_binding_by_name,
  }
  }

+/* Builds the "learn()" action to be triggered by packets initiating a
+ * hairpin session.
+ *
+ * This will generate flows in table OFTABLE_CHK_LB_HAIRPIN_REPLY of the form:
+ * - match:
+ * metadata=,ip/ipv6,ip.src=,ip.dst=
+ * nw_proto='lb_proto',tp_src_port=
+ * - action:
+ * set MLF_LOOKUP_LB_HAIRPIN_BIT=1
+ */
+static void
+add_lb_vip_hairpin_reply_action(struct in6_addr *vip6, ovs_be32 vip,
+uint8_t lb_proto, bool has_l4_port,
+uint64_t cookie, struct ofpbuf *ofpacts)
+{
+struct match match = MATCH_CATCHALL_INITIALIZER;
+struct ofpact_learn *ol = ofpact_put_LEARN(ofpacts);
+struct ofpact_learn_spec *ol_spec;
+unsigned int imm_bytes;
+uint8_t *src_imm;
+
+/* Once learned, hairpin reply flows are permanent until the VIP/backend
+ * is removed.
+ */
+ol->flags = NX_LEARN_F_DELETE_LEARNED;
+ol->idle_timeout = OFP_FLOW_PERMANENT;
+ol->hard_timeout = OFP_FLOW_PERMANENT;
+ol->priority = OFP_DEFAULT_PRIORITY;
+ol->table_id = OFTABLE_CHK_LB_HAIRPIN_REPLY;
+ol->cookie = htonll(cookie);
+
+/* Match on metadata of the packet that created the hairpin session. */
+ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+
+ol_spec->dst.field = mf_from_id(MFF_METADATA);
+ol_spec->dst.ofs = 0;
+ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
+ol_spec->n_bits = ol_spec->dst.n_bits;
+ol_spec->dst_type = NX_LEARN_DST_MATCH;
+ol_spec->src_type = NX_LEARN_SRC_FIELD;
+ol_spec->src.field = mf_from_id(MFF_METADATA);
+
+/* Match on the same ETH type as the packet that created the hairpin
+ * session.
+ */
+ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
+ol_spec->dst.field = mf_from_id(MFF_ETH_TYPE);
+ol_spec->dst.ofs = 

Re: [ovs-dev] [PATCH ovn] expr: Combine multiple ipv4 with wildcard mask.

2021-02-08 Thread Dumitru Ceara

On 2/7/21 1:25 PM, gmingchen(陈供明) wrote:

From: Gongming Chen 


Hi Gongming,

First of all, thanks for the contribution!

This is not a full review, just some comments for now.

It seems that there's a memory leak with your patch applied. 
AddressSanitizer reports:


Direct leak of 12 byte(s) in 1 object(s) allocated from:
#0 0x49644d in malloc 
(/home/runner/work/ovn/ovn/ovn-20.12.90/_build/sub/tests/ovstest+0x49644d)
#1 0x538604 in xmalloc 
/home/runner/work/ovn/ovn/ovs_src/lib/util.c:138:15
#2 0x62636f in expr_const_sets_add 
/home/runner/work/ovn/ovn/ovn-20.12.90/_build/sub/../../lib/expr.c:1237:18
#3 0x4cf8f6 in create_addr_sets 
/home/runner/work/ovn/ovn/ovn-20.12.90/_build/sub/../../tests/test-ovn.c:230:5
#4 0x4cf8f6 in test_parse_expr__ 
/home/runner/work/ovn/ovn/ovn-20.12.90/_build/sub/../../tests/test-ovn.c:296:5
#5 0x4dfd04 in ovs_cmdl_run_command__ 
/home/runner/work/ovn/ovn/ovs_src/lib/command-line.c:247:17
#6 0x4c6810 in test_ovn_main 
/home/runner/work/ovn/ovn/ovn-20.12.90/_build/sub/../../tests/test-ovn.c:1635:5
#7 0x4c6810 in ovstest_wrapper_test_ovn_main__ 
/home/runner/work/ovn/ovn/ovn-20.12.90/_build/sub/../../tests/test-ovn.c:1638:1
#8 0x4dfd04 in ovs_cmdl_run_command__ 
/home/runner/work/ovn/ovn/ovs_src/lib/command-line.c:247:17
#9 0x4c5fe3 in main 
/home/runner/work/ovn/ovn/ovn-20.12.90/_build/sub/../../tests/ovstest.c:150:9
#10 0x7f71aa6ddbf6 in __libc_start_main 
(/lib/x86_64-linux-gnu/libc.so.6+0x21bf6)


Full reports are available in the ovsrobot OVN github CI run artifacts:
https://github.com/ovsrobot/ovn/actions/runs/545416492

Just a note, if you push the branch to your own fork it will trigger the 
github action to run CI and the "linux clang test asan" job will also 
enable AddressSanitizer.


There are also a few style related issues (e.g., sizeof args), please 
see the guidelines here:


https://github.com/ovn-org/ovn/blob/master/Documentation/internals/contributing/coding-style.rst



In the OVN security group, each host IP corresponds to at least 4 flow
entries (for the different connection states). As the number of hosts
using the security group increases, the OVS security group flow table
grows sharply, especially when the remote group feature in OpenStack
is applied.

This patch merges IPv4 addresses using wildcard masks and replaces
those IPv4 addresses with the merged ip/mask entries. This greatly
reduces the entries in the OVS security group flow table, especially
when the number of hosts is large. After use in a production
environment, the number of OVS flow entries is reduced by at least 50%
in most scenarios when the remote group in OpenStack is applied.


I think it would be great to describe the algorithm here, in the commit 
log, but also in the comments in the code.




Analysis of the simplest scenario: a 1.1.1.0/24 network, with the
OpenStack security group remote group feature enabled and 253 virtual
machine ports created (1.1.1.2-1.1.1.254).

Focusing only on the number of IP addresses, in table=44:
./configure --disable-combine-ipv4:
1.1.1.2-1.1.1.254 (253 flow entries) * 4 (connection states) *
1 (local net of localport) = 1012

./configure --enable-combine-ipv4(default):
1.1.1.2/31
1.1.1.4/30
1.1.1.8/29
1.1.1.16/28
1.1.1.32/27
1.1.1.64/26
1.1.1.128/26
1.1.1.192/27
1.1.1.224/28
1.1.1.240/29
1.1.1.248/30
1.1.1.252/31
1.1.1.254
13 flow entries * 4 (connection states) * 1 (local net of localport) = 52

Reduced from 1012 flow entries to 52, a 19.4x reduction.
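For reference, the 13 prefixes above are what a standard greedy
range-to-prefix split produces. A minimal sketch follows (CIDR-only, so it
does not cover the non-contiguous wildcard masks this patch can also emit,
and it is not the patch's actual algorithm):

    #include <stdint.h>
    #include <stdio.h>

    /* Greedily split the inclusive IPv4 range [lo, hi] (host byte order,
     * lo > 0) into CIDR blocks, e.g. 1.1.1.2-1.1.1.254 -> 13 prefixes. */
    static void
    split_range_to_cidr(uint32_t lo, uint32_t hi)
    {
        while (lo <= hi) {
            /* Largest power-of-two block starting at 'lo': limited both by
             * the alignment of 'lo' and by how much of the range is left. */
            uint32_t size = lo & -lo;
            uint64_t remaining = (uint64_t) hi - lo + 1;

            while ((uint64_t) size > remaining) {
                size >>= 1;
            }
            printf("%u.%u.%u.%u/%d\n", lo >> 24, (lo >> 16) & 0xff,
                   (lo >> 8) & 0xff, lo & 0xff, 32 - __builtin_ctz(size));
            lo += size;   /* 'lo' stays > 0 here, so this cannot overflow
                           * for ranges that do not end at 255.255.255.255. */
        }
    }

    int
    main(void)
    {
        split_range_to_cidr(0x01010102, 0x010101fe);  /* 1.1.1.2-1.1.1.254 */
        return 0;
    }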

Some scenarios are similar to the following:
1.1.1.2, 1.1.1.6
After combining:
1.1.1.2/255.255.255.251
This will slightly increase the difficulty of finding the flow entry
corresponding to a single address.
For example:
ovs-ofctl dump-flows br-int | grep 1.1.1.6
The result is empty, because 1.1.1.6 matches 1.1.1.2/255.255.255.251.


Would it make sense, and potentially make users' lives easier, to also
add a utility that automatically does the IP/MASK match?


I'm thinking of something like:
$ ovs-ofctl dump-flows br-int | ovn-match-ip 1.1.1.6
[..]
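Such a helper only needs the classic masked comparison; a minimal sketch of
the check it would perform per flow entry (not an actual OVN/OVS utility):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if 'ip' is covered by the entry 'net/mask' (all three in the same
     * byte order), e.g. 1.1.1.6 is covered by 1.1.1.2/255.255.255.251. */
    static bool
    ip_covered_by(uint32_t ip, uint32_t net, uint32_t mask)
    {
        return !((ip ^ net) & mask);
    }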

I didn't look at the implementation in detail but I'm surprised by the 
following:


- change test-ovn.c to define $set1 = {1.1.1.2, .. 1.1.1.254}, basically 
the example you gave above.

- then run:

$ echo 'ip4.src == $set1' | ./tests/ovstest test-ovn expr-to-flows
ip,nw_src=1.1.1.2/31
ip,nw_src=1.1.1.4
$

Shouldn't this output actually match the 13 entries in your sample output?



Signed-off-by: Gongming Chen 
---
  configure.ac |   1 +
  lib/expr.c   | 217 +++
  m4/ovn.m4|  21 
  tests/atlocal.in |   1 +
  tests/ovn.at | 286 +--
  5 files changed, 443 insertions(+), 83 deletions(-)

diff --git a/configure.ac b/configure.ac
index b2d084318..6dc51a5cc 100644
--- a/configure.ac
+++ b/configure.ac
@@ -131,6 +131,7 @@ OVS_LIBTOOL_VERSIONS
  OVS_CHECK_CXX
  AX_FUNC_POSIX_MEMALIGN
  OVN_CHECK_UNBOUND

Re: [ovs-dev] [PATCH ovn] lflow: Use learn() action to generate LB hairpin reply flows.

2021-02-08 Thread Numan Siddique
On Sat, Feb 6, 2021 at 4:00 AM Dumitru Ceara  wrote:
>
> The main trait of load balancer hairpin traffic is that it never leaves
> the local hypervisor.  Essentially this means that only hairpin
> openflows installed for logical switches that have at least one logical
> switch port bound locally can ever be hit.
>
> Until now, if a load balancer was applied on multiple logical switches
> that are connected through a distributed router, ovn-controller would
> install flows to detect hairpin replies for all logical switches. In
> practice this leads to a very high number of openflows out of which
> most will never be used.
>
> Instead we now use an additional action, learn(), on flows that match on
> packets that create the hairpin session.  The learn() action will then
> generate the necessary flows to handle hairpin replies, but only for
> the local datapaths which actually generate hairpin traffic.
>
> For example, simulating how ovn-k8s uses load balancer for services,
> in a "switch per node" scenario, the script below would generate
> 10K (n_nodes * n_vips * n_backends) openflows on every node in table=69
> (hairpin reply).  With this patch the maximum number of openflows that
> can be created for hairpin replies is 200 (n_vips * n_backends).
>
> In general, for deployments that leverage switch-per-node topologies,
> the number of openflows is reduced by a factor of N, where N is the
> number of nodes.
>
>   $ cat lbs.sh
>   NODES=50
>   VIPS=20
>   BACKENDS=10
>   ovn-nbctl lr-add rtr
>   for ((i = 1; i <= $NODES; i++)); do
>   ovn-nbctl \
>   -- ls-add ls$i \
>   -- lsp-add ls$i vm$i \
>   -- lsp-add ls$i ls$i-rtr \
>   -- lsp-set-type ls$i-rtr router \
>   -- lsp-set-options ls$i-rtr router-port=rtr-ls$i \
>   -- lrp-add rtr rtr-ls$i 00:00:00:00:01:00 42.42.42.$i/24
>   done
>
>   for ((i = 1; i <= $VIPS; i++)); do
>   lb=lb$i
>   vip=10.10.10.$i:1
>   bip=20.20.20.1:2
>   for ((j = 2; j <= $BACKENDS; j++)); do
>   bip="$bip,20.20.20.$j:2"
>   done
>   ovn-nbctl lb-add $lb $vip $backends
>   done
>
>   for ((i = 1; i <= $NODES; i++)); do
>   for ((j = 1; j <= $VIPS; j++)); do
>   ovn-nbctl ls-lb-add ls$i lb$j
>   done
>   done
>
>   ovs-vsctl add-port br-int vm1 \
>   -- set interface vm1 type=internal \
>   -- set interface vm1 external-ids:iface-id=vm1
>
> Suggested-by: Ilya Maximets 
> Signed-off-by: Dumitru Ceara 

Hi Dumitru,

Thanks for this patch. The patch LGTM. I tested it out with  huge NB
DB with lots
of load balancer and backends.  I see a significant reduction in OF flows.

Please see the few comments below.

Thanks
Numan

> ---
>  controller/lflow.c  | 204 ---
>  tests/ofproto-macros.at |   5 +-
>  tests/ovn.at| 516 
> +---
>  3 files changed, 455 insertions(+), 270 deletions(-)
>
> diff --git a/controller/lflow.c b/controller/lflow.c
> index 946c1e0..2b7d356 100644
> --- a/controller/lflow.c
> +++ b/controller/lflow.c
> @@ -1171,6 +1171,178 @@ add_neighbor_flows(struct ovsdb_idl_index 
> *sbrec_port_binding_by_name,
>  }
>  }
>
> +/* Builds the "learn()" action to be triggered by packets initiating a
> + * hairpin session.
> + *
> + * This will generate flows in table OFTABLE_CHK_LB_HAIRPIN_REPLY of the 
> form:
> + * - match:
> + * metadata=,ip/ipv6,ip.src=,ip.dst=
> + * nw_proto='lb_proto',tp_src_port=
> + * - action:
> + * set MLF_LOOKUP_LB_HAIRPIN_BIT=1
> + */
> +static void
> +add_lb_vip_hairpin_reply_action(struct in6_addr *vip6, ovs_be32 vip,
> +uint8_t lb_proto, bool has_l4_port,
> +uint64_t cookie, struct ofpbuf *ofpacts)
> +{
> +struct match match = MATCH_CATCHALL_INITIALIZER;
> +struct ofpact_learn *ol = ofpact_put_LEARN(ofpacts);
> +struct ofpact_learn_spec *ol_spec;
> +unsigned int imm_bytes;
> +uint8_t *src_imm;
> +
> +/* Once learned, hairpin reply flows are permanent until the VIP/backend
> + * is removed.
> + */
> +ol->flags = NX_LEARN_F_DELETE_LEARNED;
> +ol->idle_timeout = OFP_FLOW_PERMANENT;
> +ol->hard_timeout = OFP_FLOW_PERMANENT;
> +ol->priority = OFP_DEFAULT_PRIORITY;
> +ol->table_id = OFTABLE_CHK_LB_HAIRPIN_REPLY;
> +ol->cookie = htonll(cookie);
> +
> +/* Match on metadata of the packet that created the hairpin session. */
> +ol_spec = ofpbuf_put_zeros(ofpacts, sizeof *ol_spec);
> +
> +ol_spec->dst.field = mf_from_id(MFF_METADATA);
> +ol_spec->dst.ofs = 0;
> +ol_spec->dst.n_bits = ol_spec->dst.field->n_bits;
> +ol_spec->n_bits = ol_spec->dst.n_bits;
> +ol_spec->dst_type = NX_LEARN_DST_MATCH;
> +ol_spec->src_type = NX_LEARN_SRC_FIELD;
> +ol_spec->src.field = mf_from_id(MFF_METADATA);
> +
> +/* Match on the same ETH type as the packet that created the hairpin
> + * session.
> + */
> +ol_spec = 

Re: [ovs-dev] [PATCH ovn] tests: Fix Port_Binding up test.

2021-02-08 Thread Numan Siddique
On Mon, Feb 8, 2021 at 2:40 PM Dumitru Ceara  wrote:
>
> On 2/7/21 3:52 AM, gmingchen(陈供明) wrote:
> > From: Gongming Chen 
> >
> > If the test checks the up status of the port binding immediately after
> > setting the iface-id, the check will occasionally fail, especially when
> > the port binding status is only reported later.
> >
> > When it fails, the following will be output:
> > Checking values in sb Port_Binding with logical_port=lsp1 against false... 
> > found false
> > ovs-vsctl add-port br-int lsp1 -- set Interface lsp1 
> > external-ids:iface-id=lsp1
> > ./ovn-macros.at:307: "$@"
> > Checking values in sb Port_Binding with logical_port=lsp1 against true... 
> > found false
> > _uuid   : 15ebabb6-3dbb-4806-aa85-d1c03e3b39f6
> > logical_port: lsp1
> > up  : true
> > ./ovn-macros.at:393: hard failure
> >
> > Fixes: 4d3cb42b076b ("binding: Set Logical_Switch_Port.up when all OVS 
> > flows are installed.")
> > Signed-off-by: Gongming Chen 
> > ---
>
> Thanks for the fix!
>
> Acked-by: Dumitru Ceara 

Thanks for the fix. I applied this patch to master.

I also added your name to the AUTHORS list.

Numan
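
For context, the race here is between the test asserting Port_Binding.up
immediately and ovn-controller only reporting the status once all of the
port's flows are installed.  A minimal sketch of the non-racy pattern,
assuming the "check" and "wait_column" helpers from tests/ovn-macros.at
(whether the applied fix uses exactly these helpers is an assumption):

    check ovs-vsctl add-port br-int lsp1 -- \
        set Interface lsp1 external-ids:iface-id=lsp1
    # Poll the Southbound DB until ovn-controller reports the port as up,
    # instead of checking the column right after the interface is added.
    wait_column "true" Port_Binding up logical_port=lsp1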



Re: [ovs-dev] [PATCH ovn] tests: Fix Port_Binding up test.

2021-02-08 Thread Dumitru Ceara

On 2/7/21 3:52 AM, gmingchen(陈供明) wrote:

From: Gongming Chen 

If the test checks the up status of the port binding immediately after
setting the iface-id, the check will occasionally fail, especially when the
port binding status is only reported later.

When it fails, the following will be output:
Checking values in sb Port_Binding with logical_port=lsp1 against false... 
found false
ovs-vsctl add-port br-int lsp1 -- set Interface lsp1 external-ids:iface-id=lsp1
./ovn-macros.at:307: "$@"
Checking values in sb Port_Binding with logical_port=lsp1 against true... found 
false
_uuid   : 15ebabb6-3dbb-4806-aa85-d1c03e3b39f6
logical_port: lsp1
up  : true
./ovn-macros.at:393: hard failure

Fixes: 4d3cb42b076b ("binding: Set Logical_Switch_Port.up when all OVS flows are 
installed.")
Signed-off-by: Gongming Chen 
---


Thanks for the fix!

Acked-by: Dumitru Ceara 
