Re: [ovs-dev] [PATCH ovn v7 4/4] AUTHORS: update email for Mark Gray

2021-07-07 Thread Han Zhou
On Wed, Jul 7, 2021 at 1:28 AM Mark Gray  wrote:
>
> Update email address for Mark Gray
>
> Signed-off-by: Mark Gray 
> ---
>  .mailmap| 1 +
>  AUTHORS.rst | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/.mailmap b/.mailmap
> index f01664e5c1d1..bc32255b5cc4 100644
> --- a/.mailmap
> +++ b/.mailmap
> @@ -52,6 +52,7 @@ Joe Stringer  
>  Justin Pettit  
>  Kmindg 
>  Kyle Mestery  
> +Mark Gray  
>  Mauricio Vasquez  <
mauricio.vasquezber...@studenti.polito.it>
>  Miguel Angel Ajo  
>  Neil McKee 
> diff --git a/AUTHORS.rst b/AUTHORS.rst
> index 4c81a500d47e..5df6110e0230 100644
> --- a/AUTHORS.rst
> +++ b/AUTHORS.rst
> @@ -250,7 +250,7 @@ Manohar K Cman...@gmail.com
>  Manoj Sharma   manoj.sha...@nutanix.com
>  Marcin Mirecki mmire...@redhat.com
>  Mario Cabrera  mario.cabr...@hpe.com
> -Mark D. Gray   mark.d.g...@intel.com
> +Mark D. Gray   mark.d.g...@redhat.com
>  Mark Hamilton
>  Mark Kavanagh  mark.b.kavanag...@gmail.com
>  Mark Maglana   mmagl...@gmail.com
> --
> 2.27.0
>

Thanks Mark. I applied it to the master branch since there is no dependency
to the previous patches in the series.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v7 3/4] ovn.at: Fix whitespace

2021-07-07 Thread Han Zhou
On Wed, Jul 7, 2021 at 1:28 AM Mark Gray  wrote:
>
> Signed-off-by: Mark Gray 
> ---
>  tests/ovn.at | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/tests/ovn.at b/tests/ovn.at
> index eb9bccdc7053..e5d8869a8417 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -5888,7 +5888,7 @@ test_dhcp() {
>  local expect_resume=:
>  local trace=false
>  while :; do
> -case $1 in
> +case $1 in
>  (--no-resume) expect_resume=false; shift ;;
>  # --trace isn't used but it can be useful for debugging:
>  (--trace) trace=:; shift ;;
> @@ -8567,7 +8567,7 @@ check test "$c6_tag" != "$c0_tag"
>  check test "$c6_tag" != "$c2_tag"
>  check test "$c6_tag" != "$c3_tag"
>
> -AS_BOX([restart northd and make sure tag allocation is stable])
> +AS_BOX([restart northd and make sure tag allocation is stable])
>  as northd
>  OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
>  start_daemon NORTHD_TYPE \
> @@ -11554,7 +11554,7 @@ ovn-nbctl --wait=sb ha-chassis-group-add-chassis
hagrp1 hv4 40
>  AS_BOX([Wait till cr-alice is claimed by hv4])
>  hv4_chassis=$(fetch_column Chassis _uuid name=hv4)
>  AS_BOX([check that the chassis redirect port has been claimed by the gw1
chassis])
> -wait_row_count Port_Binding 1 logical_port=cr-alice chassis=$hv4_chassis
> +wait_row_count Port_Binding 1 logical_port=cr-alice chassis=$hv4_chassis
>
>  AS_BOX([Reset the pcap file for hv2/br-ex_n2])
>  # From now on ovn-controller in hv2 should not send GARPs for the router
ports.
> @@ -12246,7 +12246,7 @@ check_row_count HA_Chassis_Group 1 name=outside
>  check_row_count HA_Chassis 2 'chassis!=[[]]'
>
>  ha_ch=$(fetch_column HA_Chassis_Group ha_chassis)
> -check_column "$ha_ch" HA_Chassis _uuid
> +check_column "$ha_ch" HA_Chassis _uuid
>
>  for chassis in gw1 gw2 hv1 hv2; do
>  as $chassis
> @@ -16474,7 +16474,7 @@ test_ip6_packet_larger() {
>  inner_icmp6=800062f1
>  inner_icmp6_and_payload=$(icmp6_csum_inplace
${inner_icmp6}${payload} ${inner_ip6})
>  inner_packet=${inner_ip6}${inner_icmp6_and_payload}
> -
> +
>  # Then the outer.
>  outer_ip6=60883afe${ipv6_rt}${ipv6_src}
>  outer_icmp6_and_payload=$(icmp6_csum_inplace
0200$(printf "%04x" $mtu)${inner_packet} $outer_ip6)
> --
> 2.27.0
>

Thanks Mark. I applied it to the master branch since there is no dependency
to the previous patches in the series.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v7 2/4] northd: Refactor Logical Flows for routers with DNAT/Load Balancers

2021-07-07 Thread Han Zhou
On Wed, Jul 7, 2021 at 1:28 AM Mark Gray  wrote:
>
> This patch addresses a number of interconnected issues with Gateway
Routers
> that have Load Balancing enabled:
>
> 1) In the router pipeline, we have the following stages to handle
> dnat and unsnat.
>
>  - Stage 4 : lr_in_defrag (dnat zone)
>  - Stage 5 : lr_in_unsnat (snat zone)
>  - Stage 6 : lr_in_dnat   (dnat zone)
>
> In the reply direction, the order of traversal of the tables
> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
> datapath flows that check ct_state in the wrong conntrack zone.
> This is illustrated below where reply traffic enters the physical host
> port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
> DNAT zone and then on to Logical Switch Port zone (22). The third
> flow is incorrectly checking the state from the SNAT zone instead
> of the DNAT zone.
>
> recirc_id(0),in_port(6),ct_state(-new-est-rel-rpl-trk)
actions:ct_clear,ct(zone=14),recirc(0xf)
> recirc_id(0xf),in_port(6) actions:ct(nat),recirc(0x10)
> recirc_id(0x10),in_port(6),ct_state(-new+est+trk)
actions:ct(zone=14,nat),recirc(0x11)
> recirc_id(0x11),in_port(6),ct_state(+new-est-rel-rpl+trk) actions:
ct(zone=22,nat),recirc(0x12)
> recirc_id(0x12),in_port(6),ct_state(-new+est-rel+rpl+trk) actions:5
>
> Update the order of these tables to resolve this.
>
> 2) Efficiencies can be gained by using the ct_dnat action in the
> table "lr_in_defrag" instead of ct_next. This removes the need for the
> ct_dnat action for established Load Balancer flows avoiding a
> recirculation.
>
> 3) On a Gateway router with DNAT flows configured, the router will
translate
> the destination IP address from (A) to (B). Reply packets from (B) are
> correctly UNDNATed in the reverse direction.
>
> However, if a new connection is established from (B), this flow is never
> committed to conntrack and, as such, is never established. This will
> cause OVS datapath flows to be added that match on the ct.new flag.
>
> For software-only datapaths this is not a problem. However, for
> datapaths that offload these flows to hardware, this may be problematic
> as some devices are unable to offload flows that match on ct.new.
>
> This patch resolves this by committing these flows to the DNAT zone in
> the new "lr_out_post_undnat" stage. Although this could be done in the
> DNAT zone, by doing this in the new zone we can avoid a recirculation.
>
> This patch also generalizes these changes to distributed routers with
> gateway ports.
>
> Co-authored-by: Numan Siddique 
> Signed-off-by: Mark Gray 
> Signed-off-by: Numan Siddique 
> Reported-at: https://bugzilla.redhat.com/1956740
> Reported-at: https://bugzilla.redhat.com/1953278
> ---
>
> Notes:
> v2:  Addressed Han's comments
>  * fixed ovn-northd.8.xml
>  * added 'is_gw_router' to all cases where relevant
>  * refactor add_router_lb_flow()
>  * added ct_commit/ct_dnat to gateway ports case
>  * updated flows like "ct.new && ip &&  to specify
ip4/ip6 instead of ip
>  * increment ovn_internal_version
> v4:  Fix line length errors from 0-day
> v5:  Add "Reported-at" tag
>
>  lib/ovn-util.c  |   2 +-
>  northd/ovn-northd.8.xml | 285 ++---
>  northd/ovn-northd.c | 176 ++-
>  northd/ovn_northd.dl| 136 +---
>  tests/ovn-northd.at | 685 +++-
>  tests/ovn.at|   8 +-
>  tests/system-ovn.at |  58 +++-
>  7 files changed, 1019 insertions(+), 331 deletions(-)
>
> diff --git a/lib/ovn-util.c b/lib/ovn-util.c
> index acf4b1cd6059..494d6d42d869 100644
> --- a/lib/ovn-util.c
> +++ b/lib/ovn-util.c
> @@ -760,7 +760,7 @@ ip_address_and_port_from_lb_key(const char *key, char
**ip_address,
>
>  /* Increment this for any logical flow changes, if an existing OVN
action is
>   * modified or a stage is added to a logical pipeline. */
> -#define OVN_INTERNAL_MINOR_VER 0
> +#define OVN_INTERNAL_MINOR_VER 1
>
>  /* Returns the OVN version. The caller must free the returned value. */
>  char *
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index b5c961e891f9..c76339ce38e4 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -2637,39 +2637,9 @@ icmp6 {
>
>  
>
> -Ingress Table 4: DEFRAG
>
> -
> -  This is to send packets to connection tracker for tracking and
> -  defragmentation.  It contains a priority-0 flow that simply moves
traffic
> -  to the next table.
> -
> -
> -
> -  If load balancing rules with virtual IP addresses (and ports) are
> -  configured in OVN_Northbound database for a Gateway
router,
> -  a priority-100 flow is added for each configured virtual IP address
> -  VIP. For IPv4 VIPs the flow matches ip
> -  && ip4.dst == VIP.  For IPv6
VIPs,
> -  the flow matches ip && ip6.dst ==
VIP.
> -  The flow uses the action ct_next; to send IP packets
to the
> -  connection tracker 

Re: [ovs-dev] [PATCH ovn v7 1/4] northd: update stage-name if changed

2021-07-07 Thread Han Zhou
On Wed, Jul 7, 2021 at 1:28 AM Mark Gray  wrote:
>
> If a new table is added to a logical flow pipeline, the mapping between
> 'external_ids:stage-name' from the 'Logical_Flow' table in the
> 'OVN_Southbound' database and the 'stage' number may change for some
tables.
>
> If 'ovn-northd' is started against a populated Southbound database,
> 'external_ids' will not be updated to reflect the new, correct
> name. This will cause 'external_ids' to be incorrectly displayed by some
> tools and commands such as `ovn-sbctl dump-flows`.
>
> This commit reconciles these changes as part of build_lflows() when
> 'ovn_internal_version' is updated.
>
> Suggested-by: Ilya Maximets 
> Signed-off-by: Mark Gray 
> ---
>
> Notes:
> v2:  Update all 'external_ids' rather than just 'stage-name'
> v4:  Fix line length errors from 0-day
>
>  lib/ovn-util.c  |  4 ++--
>  northd/ovn-northd.c | 43 ---
>  2 files changed, 42 insertions(+), 5 deletions(-)
>
> diff --git a/lib/ovn-util.c b/lib/ovn-util.c
> index c5af8d1ab340..acf4b1cd6059 100644
> --- a/lib/ovn-util.c
> +++ b/lib/ovn-util.c
> @@ -758,8 +758,8 @@ ip_address_and_port_from_lb_key(const char *key, char
**ip_address,
>  return true;
>  }
>
> -/* Increment this for any logical flow changes or if existing OVN action
is
> - * modified. */
> +/* Increment this for any logical flow changes, if an existing OVN
action is
> + * modified or a stage is added to a logical pipeline. */
>  #define OVN_INTERNAL_MINOR_VER 0
>
>  /* Returns the OVN version. The caller must free the returned value. */
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 570c6a3efd77..eb25e31b1f7d 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -12447,7 +12447,8 @@ build_lflows(struct northd_context *ctx, struct
hmap *datapaths,
>   struct hmap *ports, struct hmap *port_groups,
>   struct hmap *mcgroups, struct hmap *igmp_groups,
>   struct shash *meter_groups,
> - struct hmap *lbs, struct hmap *bfd_connections)
> + struct hmap *lbs, struct hmap *bfd_connections,
> + bool ovn_internal_version_changed)
>  {
>  struct hmap lflows;
>
> @@ -12559,6 +12560,32 @@ build_lflows(struct northd_context *ctx, struct
hmap *datapaths,
>  ovn_stage_build(dp_type, pipeline, sbflow->table_id),
>  sbflow->priority, sbflow->match, sbflow->actions,
sbflow->hash);
>  if (lflow) {
> +if (ovn_internal_version_changed) {
> +const char *stage_name =
smap_get_def(&sbflow->external_ids,
> +  "stage-name", "");
> +const char *stage_hint =
smap_get_def(&sbflow->external_ids,
> +  "stage-hint", "");
> +const char *source = smap_get_def(&sbflow->external_ids,
> +  "source", "");
> +
> +if (strcmp(stage_name, ovn_stage_to_str(lflow->stage))) {
> +sbrec_logical_flow_update_external_ids_setkey(sbflow,
> + "stage-name", ovn_stage_to_str(lflow->stage));
> +}
> +if (lflow->stage_hint) {
> +if (strcmp(stage_hint, lflow->stage_hint)) {
> +
 sbrec_logical_flow_update_external_ids_setkey(sbflow,
> +"stage-hint", lflow->stage_hint);
> +}
> +}
> +if (lflow->where) {
> +if (strcmp(source, lflow->where)) {
> +
 sbrec_logical_flow_update_external_ids_setkey(sbflow,
> +"source", lflow->where);
> +}
> +}
> +}
> +
>  /* This is a valid lflow.  Checking if the datapath group
needs
>   * updates. */
>  bool update_dp_group = false;
> @@ -13390,6 +13417,7 @@ ovnnb_db_run(struct northd_context *ctx,
>  struct shash meter_groups = SHASH_INITIALIZER(&meter_groups);
>  struct hmap lbs;
>  struct hmap bfd_connections = HMAP_INITIALIZER(&bfd_connections);
> +bool ovn_internal_version_changed = true;
>
>  /* Sync ipsec configuration.
>   * Copy nb_cfg from northbound to southbound database.
> @@ -13441,7 +13469,13 @@ ovnnb_db_run(struct northd_context *ctx,
>  smap_replace(&options, "max_tunid", max_tunid);
>  free(max_tunid);
>
> -smap_replace(&options, "northd_internal_version",
ovn_internal_version);
> +if (!strcmp(ovn_internal_version,
> +smap_get_def(&options, "northd_internal_version", ""))) {
> +ovn_internal_version_changed = false;
> +} else {
> +smap_replace(&options, "northd_internal_version",
> + ovn_internal_version);
> +}
>
>  nbrec_nb_global_verify_options(nb);
>  nbrec_nb_global_set_options(nb, &options);
> @@ -13481,7 +13515,8 @@ ovnnb_db_run(st

Re: [ovs-dev] [PATCH ovn v2] inc-proc-eng: Improve debug logging.

2021-07-07 Thread Han Zhou
On Tue, Jul 6, 2021 at 6:47 AM Dumitru Ceara  wrote:
>
> Time how long change/run handlers take and log this at debug level.
> I-P engine debug logs are not so verbose so enabling them is quite
> common when debugging scale/control plane latency related issues.
>
> One of the major missing pieces though was a log about how long each I-P
> node change/run handler took to run.  This commit adds it and also logs
> the reason why engine_recompute() has been called for a given node:
> missing change handler, failed handler, or forced recompute.
>
> Signed-off-by: Dumitru Ceara 
> ---
> v2:
> - Addressed Han's comments:
>   - added reason to engine_recompute()
>   - removed noisy/not so relevant logs.
> Note: Mark Gray had acked v1 but since there are quite a bit of changes
> in v2 I'll not add his ack.
> ---
>  lib/inc-proc-eng.c | 52 +-
>  1 file changed, 37 insertions(+), 15 deletions(-)
>
> diff --git a/lib/inc-proc-eng.c b/lib/inc-proc-eng.c
> index c349efb22..49a1fe2f2 100644
> --- a/lib/inc-proc-eng.c
> +++ b/lib/inc-proc-eng.c
> @@ -27,6 +27,7 @@
>  #include "openvswitch/hmap.h"
>  #include "openvswitch/vlog.h"
>  #include "inc-proc-eng.h"
> +#include "timeval.h"
>  #include "unixctl.h"
>
>  VLOG_DEFINE_THIS_MODULE(inc_proc_eng);
> @@ -45,6 +46,10 @@ static const char
*engine_node_state_name[EN_STATE_MAX] = {
>  [EN_ABORTED]   = "Aborted",
>  };
>
> +static void
> +engine_recompute(struct engine_node *node, bool allowed,
> + const char *reason_fmt, ...) OVS_PRINTF_FORMAT(3, 4);
> +
>  void
>  engine_set_force_recompute(bool val)
>  {
> @@ -315,15 +320,23 @@ engine_init_run(void)
>   * mark the node as "aborted".
>   */
>  static void
> -engine_recompute(struct engine_node *node, bool forced, bool allowed)
> +engine_recompute(struct engine_node *node, bool allowed,
> + const char *reason_fmt, ...)
>  {
> -VLOG_DBG("node: %s, recompute (%s)", node->name,
> - forced ? "forced" : "triggered");
> +char *reason = NULL;
> +
> +if (VLOG_IS_DBG_ENABLED()) {
> +va_list reason_args;
> +
> +va_start(reason_args, reason_fmt);
> +reason = xvasprintf(reason_fmt, reason_args);
> +va_end(reason_args);
> +}
>
>  if (!allowed) {
> -VLOG_DBG("node: %s, recompute aborted", node->name);
> +VLOG_DBG("node: %s, recompute (%s) aborted", node->name, reason);
>  engine_set_node_state(node, EN_ABORTED);
> -return;
> +goto done;
>  }
>
>  /* Clear tracked data before calling run() so that partially tracked
data
> @@ -333,8 +346,13 @@ engine_recompute(struct engine_node *node, bool
forced, bool allowed)
>  }
>
>  /* Run the node handler which might change state. */
> +long long int now = time_msec();
>  node->run(node, node->data);
>  node->stats.recompute++;
> +VLOG_DBG("node: %s, recompute (%s) took %lldms", node->name, reason,
> + time_msec() - now);
> +done:
> +free(reason);
>  }
>
>  /* Return true if the node could be computed, false otherwise. */
> @@ -344,17 +362,19 @@ engine_compute(struct engine_node *node, bool
recompute_allowed)
>  for (size_t i = 0; i < node->n_inputs; i++) {
>  /* If the input node data changed call its change handler. */
>  if (node->inputs[i].node->state == EN_UPDATED) {
> -VLOG_DBG("node: %s, handle change for input %s",
> - node->name, node->inputs[i].node->name);
> -
>  /* If the input change can't be handled incrementally, run
>   * the node handler.
>   */
> -if (!node->inputs[i].change_handler(node, node->data)) {
> -VLOG_DBG("node: %s, can't handle change for input %s, "
> - "fall back to recompute",
> - node->name, node->inputs[i].node->name);
> -engine_recompute(node, false, recompute_allowed);
> +long long int now = time_msec();
> +bool handled = node->inputs[i].change_handler(node,
node->data);
> +
> +VLOG_DBG("node: %s, handler for input %s took %lldms",
> + node->name, node->inputs[i].node->name,
> + time_msec() - now);
> +if (!handled) {
> +engine_recompute(node, recompute_allowed,
> + "failed handler for input %s",
> + node->inputs[i].node->name);
>  return (node->state != EN_ABORTED);
>  }
>  }
> @@ -375,7 +395,7 @@ engine_run_node(struct engine_node *node, bool
recompute_allowed)
>  }
>
>  if (engine_force_recompute) {
> -engine_recompute(node, true, recompute_allowed);
> +engine_recompute(node, recompute_allowed, "forced");
>  return;
>  }
>
> @@ -389,7 +409,9 @@ engine_run_node(struct engine_node *node, bool
recompute_allowed)
>
>

Re: [ovs-dev] is the OVN load balancer also intended to be a firewall?

2021-07-07 Thread Han Zhou
On Wed, Jul 7, 2021 at 3:39 PM Ben Pfaff  wrote:
>
> Hi, I've been talking to Shay Vargaftik (CC'd), also a researcher at
> VMware, about some work he's done on optimizing load balancers.  What
> he's come up with is a technique that in many cases avoids putting
> connections into the connection-tracking table, because it can achieve
> per-connection consistency without needing to do that.  This improves
> performance by reducing the size of the connection-tracking table, which
> is therefore more likely to fit inside a CPU cache and cheaper to
> search.
>
> I'm trying to determine whether this technique would apply to OVN's load
> balancer.  There would be challenges in any case, but one fundamental
> question I have is: is the OVN load balancer also supposed to be a
> firewall?  If it's not, then it's worth continuing to look to see if the
> technique is applicable.  On the other hand, if it is, then every
> connection needs to be tracked in any case, so the technique can't be
> useful.
>
> Anyone's thoughts would be welcome.
>
From my understanding, OVN LB doesn't directly relate to FW (OVN ACLs),
although they both use conntrack. For LB, we use conntrack for NAT (convert
the client IP to an LB owned IP) purposes. Does this technique support NAT
without using conntrack?

Moreover, maybe for the future, we also need to consider the cases when a
LB is applied on an OVN gateway, for HA purposes the NAT tracking entries
need to be able to be replicated across nodes, so that when failover
happens the existing connections can continue working through another
gateway node.

There are also OVN LB use cases that don't require NAT. If this technique
doesn't support NAT, it is probably still useful for those scenarios.

Thanks,
Han
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] is the OVN load balancer also intended to be a firewall?

2021-07-07 Thread Ben Pfaff
Hi, I've been talking to Shay Vargaftik (CC'd), also a researcher at
VMware, about some work he's done on optimizing load balancers.  What
he's come up with is a technique that in many cases avoids putting
connections into the connection-tracking table, because it can achieve
per-connection consistency without needing to do that.  This improves
performance by reducing the size of the connection-tracking table, which
is therefore more likely to fit inside a CPU cache and cheaper to
search.

I'm trying to determine whether this technique would apply to OVN's load
balancer.  There would be challenges in any case, but one fundamental
question I have is: is the OVN load balancer also supposed to be a
firewall?  If it's not, then it's worth continuing to look to see if the
technique is applicable.  On the other hand, if it is, then every
connection needs to be tracked in any case, so the technique can't be
useful.

Anyone's thoughts would be welcome.

Thanks,

Ben.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] bridge: Use correct (legacy) role names in database.

2021-07-07 Thread Flavio Leitner
On Wed, Jul 07, 2021 at 12:58:28PM -0700, Ben Pfaff wrote:
> On Wed, Jul 07, 2021 at 04:37:11PM -0300, Flavio Leitner wrote:
> > 
> > Hi,
> > 
> > On Tue, Jul 06, 2021 at 03:37:09PM -0700, Ben Pfaff wrote:
> > > The vswitchd database schema requires role names to be "master" or
> > > "slave", but this code tried to use "primary" and "secondary".
> > 
> > We have defined the constraints in the schema and we can't change
> > the schema because it would require external applications to do
> > the same change and would affect upgrades, correct?
> 
> Yes.

Thanks!
-- 
fbl
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH V7 00/13] Netdev vxlan-decap offload

2021-07-07 Thread Flavio Leitner


Hi,

On Wed, Jul 07, 2021 at 04:53:14PM +0200, Ilya Maximets wrote:
> On 7/6/21 3:34 PM, Van Haaren, Harry wrote:
> >> -Original Message-
> >> From: Ilya Maximets 
> >> Sent: Thursday, July 1, 2021 11:32 AM
> >> To: Van Haaren, Harry ; Ilya Maximets
> >> 
> >> Cc: Eli Britstein ; ovs dev ; Ivan 
> >> Malov
> >> ; Majd Dibbiny ; Stokes, Ian
> >> ; Ferriter, Cian ; Ben Pfaff
> >> ; Balazs Nemeth ; Sriharsha Basavapatna
> >> 
> >> Subject: Re: [ovs-dev] [PATCH V7 00/13] Netdev vxlan-decap offload
> >>
> >> On 6/29/21 1:53 PM, Van Haaren, Harry wrote:
>  -Original Message-
>  From: Ilya Maximets 
>  Sent: Monday, June 28, 2021 3:33 PM
>  To: Van Haaren, Harry ; Ilya Maximets
>  ; Sriharsha Basavapatna
>  
>  Cc: Eli Britstein ; ovs dev ; 
>  Ivan
> >> Malov
>  ; Majd Dibbiny ; Stokes, Ian
>  ; Ferriter, Cian ; Ben 
>  Pfaff
>  ; Balazs Nemeth 
>  Subject: Re: [ovs-dev] [PATCH V7 00/13] Netdev vxlan-decap offload
> 
>  On 6/25/21 7:28 PM, Van Haaren, Harry wrote:
> >> -Original Message-
> >> From: dev  On Behalf Of Ilya Maximets
> >> Sent: Friday, June 25, 2021 4:26 PM
> >> To: Sriharsha Basavapatna ; Ilya
>  Maximets
> >> 
> >> Cc: Eli Britstein ; ovs dev ; 
> >> Ivan
>  Malov
> >> ; Majd Dibbiny 
> >> Subject: Re: [ovs-dev] [PATCH V7 00/13] Netdev vxlan-decap offload
> >
> > 
> >
>  That looks good to me.  So, I guess, Harsha, we're waiting for
>  your review/tests here.
> >>>
> >>> Thanks Ilya and Eli, looks good to me; I've also tested it and it 
> >>> works fine.
> >>> -Harsha
> >>
> >> Thanks, everyone.  Applied to master.
> >
> > Hi Ilya and OVS Community,
> >
> > There are open questions around this patchset, why has it been merged?
> >
> > Earlier today, new concerns were raised by Cian around the negative
> >> performance
>  impact of these code changes:
> > - https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384445.html
> >
> > Both you (Ilya) and Eli responded, and I was following the 
> > conversation. Various
>  code changes were suggested,
> > and some may seem like they might work, Eli mentioned some solutions 
> > might
> >> not
>  work due to the hardware:
> > I was processing both your comments and input, and planning a technical 
> > reply
>  later today.
> > - suggestions: https://mail.openvswitch.org/pipermail/ovs-dev/2021-
>  June/384446.html
> > - concerns around hw: 
> > https://mail.openvswitch.org/pipermail/ovs-dev/2021-
>  June/384464.html
> 
>  Concerns not really about the hardware, but the API itself
>  that should be clarified a little bit to avoid confusion and
>  avoid incorrect changes like the one I suggested.
>  But this is a small enhancement that could be done on top.
> 
> >
> > Keep in mind that there are open performance issues to be worked out, 
> > that
> >> have
>  not been resolved at this point in the conversation.
> 
>  Performance issue that can be worked out, will be worked out
>  in a separate patch , v1 for which we already have on a mailing
>  list for some time, so it didn't make sense to to re-validate
>  the whole series again due to this one pretty obvious change.
> 
> > There is no agreement on solutions, nor an agreement to ignore the
> >> performance
>  degradation, or to try resolve this degradation later.
> 
>  Particular part of the packet restoration call seems hard
>  to avoid in a long term (I don't see a good solution for that),
>  but the short term solution might be implemented on top.
>  The part with multiple reads of recirc_id and checking if
>  offloading is enabled has a fix already (that needs a v2, but
>  anyway).
> 
> >
> > That these patches have been merged is inappropriate:
> > 1) Not enough time given for responses (11 am concerns raised, 5pm 
> > merged
>  without resolution? (Irish timezone))
> 
>  I responded with suggestions and arguments against solutions
>  suggested in the report, Eli responded with rejection of one
>  one of my suggestions.  And it seems clear (for me) that
>  there is no good solution for this part at the moment.
>  Part of the performance could be won back, but the rest
>  seems to be inevitable.  As a short-term solution we can
>  guard the netdev_hw_miss_packet_recover() with experimental
>  API ifdef, but it will strike back anyway in the future.
> 
> > 2) Open question not addressed/resolved, resulting in a 6% known 
> > negative
>  performance impact being merged.
> 
>  I don't think it wasn't addressed.
> >>>
> >>> Was code merged that resulted in a known regression of 6%?  Yes. Facts 
> >>> are facts.
> >>> I don't care for arguing over exactl

Re: [ovs-dev] [PATCH] odp-util: Stop key parsing if already oversized.

2021-07-07 Thread Ben Pfaff
On Thu, Jun 24, 2021 at 01:44:48PM +0200, Ilya Maximets wrote:
> We don't need to continue parsing if already oversized.  This is not
> very important, but fuzzer times out while parsing very long flow.
> 
> The check could be written as a single 'if' statement, but I found
> my variant much more readable.
> 
> Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=35519
> Signed-off-by: Ilya Maximets 

This seems like a reasonable thing to do and I didn't spot any bugs at a
glance.  I didn't compile it or run the tests.

Acked-by: Ben Pfaff 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v9 4/4] northd: Flood ARPs to routers for "unreachable" addresses.

2021-07-07 Thread Numan Siddique
On Wed, Jul 7, 2021 at 5:03 PM Mark Michelson  wrote:
>
> On 7/7/21 2:50 PM, Mark Michelson wrote:
> > On 7/7/21 1:41 PM, Numan Siddique wrote:
> >> On Wed, Jun 30, 2021 at 7:57 PM Mark Michelson 
> >> wrote:
> >>>
> >>> Previously, ARP TPAs were filtered down only to "reachable" addresses.
> >>> Reachable addresses are all router interface addresses, as well as NAT
> >>> external addresses and load balancer VIPs that are within the subnet
> >>> handled by a router's port.
> >>>
> >>> However, it is possible that in some configurations, CMSes purposely
> >>> configure NAT or load balancer addresses on a router that are outside
> >>> the router's subnets, and they expect the router to respond to ARPs for
> >>> those addresses.
> >>>
> >>> This commit adds a higher priority flow to logical switches that makes
> >>> it so ARPs targeted at "unreachable" addresses are flooded to all ports.
> >>> This way, the ARPs can reach the router appropriately and receive a
> >>> response.
> >>>
> >>> Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1929901
> >>>
> >>> Signed-off-by: Mark Michelson 
> >>
> >> Acked-by: Numan Siddique 
> >>
> >> I've one comment which probably can be addressed.
> >>
> >> If a configured NAT entry on a logical router is unreachable within
> >> that router, this patch floods
> >> the packet for the ARP destined to that NAT IP in the logical switch
> >> pipeline.
> >>
> >> Before adding the flow to flood, can't we check if the NAT IP is
> >> reachable from other router ports
> >> connected to this logical switch.  If so, we can do "outport ==
> >> " instead
> >> of flooding.
> >>
> >> I think this is possible given that the logical flow is added in the
> >> switch pipeline.  We just need to loop through
> >> all the router ports of the logical switch.  The question is - is this
> >> efficient  and takes up some time on a scaled environment ?
> >>
> >> What do you think ?  If this seems fine,  it can be a follow up patch
> >> too.
> >
> > I don't think your suggestion would add any appreciable time compared to
> > what we're already doing in ovn-northd. I will attempt this approach and
> > let you know how it goes.
> >
>
> On second thought, I think this should be a follow-up patch as you
> suggested. This particular work has been going on for too long to delay
> for this purpose, IMO.

Sounds good to me.

Numan

>
> >>
> >> Numan
> >>
> >>> ---
> >>>   northd/ovn-northd.8.xml |   8 ++
> >>>   northd/ovn-northd.c | 162 +++-
> >>>   northd/ovn_northd.dl|  91 ++
> >>>   tests/ovn-northd.at |  99 
> >>>   tests/system-ovn.at | 102 +
> >>>   5 files changed, 395 insertions(+), 67 deletions(-)
> >>>
> >>> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> >>> index beaf5a183..5aedd6619 100644
> >>> --- a/northd/ovn-northd.8.xml
> >>> +++ b/northd/ovn-northd.8.xml
> >>> @@ -1587,6 +1587,14 @@ output;
> >>>   logical ports.
> >>> 
> >>>
> >>> +  
> >>> +Priority-90 flows for each IP address/VIP/NAT address
> >>> configured
> >>> +outside its owning router port's subnet. These flows match ARP
> >>> +requests and ND packets for the specific IP addresses.
> >>> Matched packets
> >>> +are forwarded to the MC_FLOOD multicast group
> >>> which
> >>> +contains all connected logical ports.
> >>> +  
> >>> +
> >>> 
> >>>   Priority-75 flows for each port connected to a logical router
> >>>   matching self originated ARP request/ND packets.  These
> >>> packets
> >>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> >>> index f6fad281b..d0b325748 100644
> >>> --- a/northd/ovn-northd.c
> >>> +++ b/northd/ovn-northd.c
> >>> @@ -6555,38 +6555,41 @@
> >>> build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,
> >>>   ds_destroy(&match);
> >>>   }
> >>>
> >>> -/*
> >>> - * Ingress table 19: Flows that forward ARP/ND requests only to the
> >>> routers
> >>> - * that own the addresses. Other ARP/ND packets are still flooded in
> >>> the
> >>> - * switching domain as regular broadcast.
> >>> - */
> >>>   static void
> >>> -build_lswitch_rport_arp_req_flow_for_ip(struct ds *ip_match,
> >>> -int addr_family,
> >>> -struct ovn_port *patch_op,
> >>> -struct ovn_datapath *od,
> >>> -uint32_t priority,
> >>> -struct hmap *lflows,
> >>> -const struct ovsdb_idl_row
> >>> *stage_hint)
> >>> +arp_nd_ns_match(struct ds *ips, int addr_family, struct ds *match)
> >>>   {
> >>> -struct ds match   = DS_EMPTY_INITIALIZER;
> >>> -struct ds actions = DS_EMPTY_INITIALIZER;
> >>> -
> >>>   /* Packets received from VXLAN tunnels have already been

Re: [ovs-dev] [PATCH ovn v9 4/4] northd: Flood ARPs to routers for "unreachable" addresses.

2021-07-07 Thread Mark Michelson

On 7/7/21 2:50 PM, Mark Michelson wrote:

On 7/7/21 1:41 PM, Numan Siddique wrote:
On Wed, Jun 30, 2021 at 7:57 PM Mark Michelson  
wrote:


Previously, ARP TPAs were filtered down only to "reachable" addresses.
Reachable addresses are all router interface addresses, as well as NAT
external addresses and load balancer VIPs that are within the subnet
handled by a router's port.

However, it is possible that in some configurations, CMSes purposely
configure NAT or load balancer addresses on a router that are outside
the router's subnets, and they expect the router to respond to ARPs for
those addresses.

This commit adds a higher priority flow to logical switches that makes
it so ARPs targeted at "unreachable" addresses are flooded to all ports.
This way, the ARPs can reach the router appropriately and receive a
response.

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1929901

Signed-off-by: Mark Michelson 


Acked-by: Numan Siddique 

I've one comment which probably can be addressed.

If a configured NAT entry on a logical router is unreachable within
that router, this patch floods
the packet for the ARP destined to that NAT IP in the logical switch 
pipeline.


Before adding the flow to flood, can't we check if the NAT IP is
reachable from other router ports
connected to this logical switch.  If so, we can do "outport ==
" instead
of flooding.

I think this is possible given that the logical flow is added in the
switch pipeline.  We just need to loop through
all the router ports of the logical switch.  The question is - is this
efficient  and takes up some time on a scaled environment ?

What do you think ?  If this seems fine,  it can be a follow up patch 
too.


I don't think your suggestion would add any appreciable time compared to 
what we're already doing in ovn-northd. I will attempt this approach and 
let you know how it goes.




On second thought, I think this should be a follow-up patch as you 
suggested. This particular work has been going on for too long to delay 
for this purpose, IMO.




Numan


---
  northd/ovn-northd.8.xml |   8 ++
  northd/ovn-northd.c | 162 +++-
  northd/ovn_northd.dl    |  91 ++
  tests/ovn-northd.at |  99 
  tests/system-ovn.at | 102 +
  5 files changed, 395 insertions(+), 67 deletions(-)

diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index beaf5a183..5aedd6619 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -1587,6 +1587,14 @@ output;
  logical ports.
    

+  
+    Priority-90 flows for each IP address/VIP/NAT address 
configured

+    outside its owning router port's subnet. These flows match ARP
+    requests and ND packets for the specific IP addresses.  
Matched packets
+    are forwarded to the MC_FLOOD multicast group 
which

+    contains all connected logical ports.
+  
+
    
  Priority-75 flows for each port connected to a logical router
  matching self originated ARP request/ND packets.  These 
packets

diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index f6fad281b..d0b325748 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -6555,38 +6555,41 @@ 
build_lswitch_rport_arp_req_self_orig_flow(struct ovn_port *op,

  ds_destroy(&match);
  }

-/*
- * Ingress table 19: Flows that forward ARP/ND requests only to the 
routers
- * that own the addresses. Other ARP/ND packets are still flooded in 
the

- * switching domain as regular broadcast.
- */
  static void
-build_lswitch_rport_arp_req_flow_for_ip(struct ds *ip_match,
-    int addr_family,
-    struct ovn_port *patch_op,
-    struct ovn_datapath *od,
-    uint32_t priority,
-    struct hmap *lflows,
-    const struct ovsdb_idl_row 
*stage_hint)

+arp_nd_ns_match(struct ds *ips, int addr_family, struct ds *match)
  {
-    struct ds match   = DS_EMPTY_INITIALIZER;
-    struct ds actions = DS_EMPTY_INITIALIZER;
-
  /* Packets received from VXLAN tunnels have already been 
through the
   * router pipeline so we should skip them. Normally this is 
done by the
   * multicast_group implementation (VXLAN packets skip table 32 
which

   * delivers to patch ports) but we're bypassing multicast_groups.
   */
-    ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
+    ds_put_cstr(match, FLAGBIT_NOT_VXLAN " && ");

  if (addr_family == AF_INET) {
-    ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
+    ds_put_cstr(match, "arp.op == 1 && arp.tpa == {");
  } else {
-    ds_put_cstr(&match, "nd_ns && nd.target == { ");
+    ds_put_cstr(match, "nd_ns && nd.target == {");
  }

-    ds_put_cstr(&match, ds_cstr_ro(ip_mat

Re: [ovs-dev] [PATCH] odp-util: Stop key parsing if already oversized.

2021-07-07 Thread Ilya Maximets
On 6/24/21 1:44 PM, Ilya Maximets wrote:
> We don't need to continue parsing if already oversized.  This is not
> very important, but fuzzer times out while parsing very long flow.
> 
> The check could be written as a single 'if' statement, but I found
> my variant much more readable.
> 
> Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=35519
> Signed-off-by: Ilya Maximets 
> ---

Gentle reminder.
Would be nice to have some review on this patch.

Best regards, Ilya Maximets.

>  lib/odp-util.c |  9 +
>  tests/odp.at   | 14 ++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index 04a183c7c..7729a9060 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -6077,6 +6077,15 @@ odp_flow_from_string(const char *s, const struct simap 
> *port_names,
>  }
>  
>  retval = parse_odp_key_mask_attr(&context, s, key, mask);
> +
> +if (retval >= 0) {
> +if (nl_attr_oversized(key->size - NLA_HDRLEN)) {
> +retval = -E2BIG;
> +} else if (mask && nl_attr_oversized(mask->size - NLA_HDRLEN)) {
> +retval = -E2BIG;
> +}
> +}
> +
>  if (retval < 0) {
>  if (errorp) {
>  *errorp = xasprintf("syntax error at %s", s);
> diff --git a/tests/odp.at b/tests/odp.at
> index dccafd9d3..07a5cfe39 100644
> --- a/tests/odp.at
> +++ b/tests/odp.at
> @@ -449,6 +449,20 @@ odp_actions_from_string: error
>  ])
>  AT_CLEANUP
>  
> +AT_SETUP([OVS datapath keys parsing and formatting - keys too long])
> +dnl Flow keys should fit into a single netlink message.
> +dnl Empty encap() takes 4 bytes.  So, 16384 is too many, but 16383 still 
> fits.
> +dnl We're getting 'duplicate attribute' error since it's not a logically 
> valid
> +dnl sequence of keys.  'syntax error' indicates oversized list of keys.
> +keys=$(printf 'encap(),%.0s' $(seq 16382))
> +echo "${keys}encap()" > keys.txt
> +echo "${keys}encap(),encap()" >> keys.txt
> +AT_CHECK([ovstest test-odp parse-keys < keys.txt | sed 's/encap(),//g'], 
> [0], [dnl
> +odp_flow_key_to_flow: error (duplicate encap attribute in flow key; the flow 
> key in error is: encap())
> +odp_flow_from_string: error (syntax error at encap())
> +])
> +AT_CLEANUP
> +
>  AT_SETUP([OVS datapath keys parsing and formatting - 33 nested encap ])
>  AT_DATA([odp-in.txt], [dnl
>  
> encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap(encap()
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Python: Fix Idl.run change_seqno update

2021-07-07 Thread Ilya Maximets
On 7/2/21 9:09 AM, aserd...@ovn.org wrote:
> From: Alin Gabriel Serdean 
> 
>> Fix an issue where Idl.run() returned False even if there was a change.
>> If Idl.run() reads multiple messages from the database server, some
>> may constitute changes and some may not. Changed the way change_seqno
>> is reset: if a message is not a change, reset change_seqno only to the
>> value before reading this message, not to the value before reading the
>> first message.
>> This will fix the return value in a scenario where some message was a
>> change and the last one wasn't. The new change_seqno will now be the
>> value after handling the message with the last change.
>>
>> Signed-off-by: Bodo Petermann 
>> ---
>>
> Acked-by: Alin Gabriel Serdean 

Thanks!  Applied and backported down to 2.13.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] ofp-actions: Report an error if there are too many actions to parse.

2021-07-07 Thread Ilya Maximets
On 7/2/21 1:40 PM, aserd...@ovn.org wrote:
> From: Alin-Gabriel Serdean 
> 
>> Not a very important fix, but fuzzer times out trying to test parsing
>> of a huge number of actions.  Fixing that by reporting an error as
>> soon as ofpacts oversized.
>>
>> It would be great to use ofpacts_oversized() function instead of manual
>> size checking, but ofpacts->header here always points to the last
>> pushed action, so the value that ofpacts_oversized() would check is
>> always small.
>>
>> Adding a unit test for this, plus the extra test for too deep nesting.
>>
>> Reported-at: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=20254
>> Signed-off-by: Ilya Maximets 
>>
> Acked-by: Alin-Gabriel Serdean 

Thanks!  Applied.

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1] ovsdb-server.7.rst Fix response result of monitor_cond_change

2021-07-07 Thread Ben Pfaff
On Tue, Jun 15, 2021 at 03:01:47PM +0300, Alexey Roytman wrote:
> From: Alexey Roytman 
> 
> The original version said that "monitor_cond_change" request response 
> should contain '"result": null'. However, if response message has form 
> like {"id":13, "result":null} or {"id":13} the ovsdb client 
> (ovn-controller) returns the folling misleading message and closes 
> connection.
> 
> 2021-06-07T14:32:30.116Z|00026|jsonrpc|WARN|tcp:172.18.0.4:6642: received
>   bad JSON-RPC message: request must have "method"
> 2021-06-07T14:32:30.116Z|00027|reconnect|WARN|tcp:172.18.0.4:6642: 
>   connection dropped (Protocol error)
> 
> Signed-off-by: Alexey Roytman 

Thanks for the correction.  I applied this patch to OVS.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3] stream-ssl: Remove unsafe 1024 bit dh params

2021-07-07 Thread Ben Pfaff
On Wed, Jun 16, 2021 at 08:32:28PM +, Jaime Caamaño Ruiz wrote:
> Using 1024 bit params for DH is considered unsafe [1]. Additionally,
> from [2]:

Thanks, I applied this to master.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] bridge: Use correct (legacy) role names in database.

2021-07-07 Thread Ben Pfaff
On Wed, Jul 07, 2021 at 04:37:11PM -0300, Flavio Leitner wrote:
> 
> Hi,
> 
> On Tue, Jul 06, 2021 at 03:37:09PM -0700, Ben Pfaff wrote:
> > The vswitchd database schema requires role names to be "master" or
> > "slave", but this code tried to use "primary" and "secondary".
> 
> We have defined the constraints in the schema and we can't change
> the schema because it would require external applications to do
> the same change and would affect upgrades, correct?

Yes.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] bridge: Use correct (legacy) role names in database.

2021-07-07 Thread Flavio Leitner


Hi,

On Tue, Jul 06, 2021 at 03:37:09PM -0700, Ben Pfaff wrote:
> The vswitchd database schema requires role names to be "master" or
> "slave", but this code tried to use "primary" and "secondary".

We have defined the constraints in the schema and we can't change
the schema because it would require external applications to do
the same change and would affect upgrades, correct?

Thanks
fbl

> Signed-off-by: Ben Pfaff 
> Reported-at: https://github.com/openvswitch/ovs-issues/issues/218
> Fixes: 807152a4ddfb ("Use primary/secondary, not master/slave, as names for 
> OpenFlow roles.")
> ---
>  vswitchd/bridge.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index 0432d2abf0af..cb7c5cb769da 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -3019,9 +3019,9 @@ ofp12_controller_role_to_str(enum ofp12_controller_role 
> role)
>  case OFPCR12_ROLE_EQUAL:
>  return "other";
>  case OFPCR12_ROLE_PRIMARY:
> -return "primary";
> +return "master";
>  case OFPCR12_ROLE_SECONDARY:
> -return "secondary";
> +return "slave";
>  case OFPCR12_ROLE_NOCHANGE:
>  default:
>  return NULL;
> -- 
> 2.31.1
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

-- 
fbl
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 1/2] ofctrl: Remove unused hashmap.

2021-07-07 Thread Ben Pfaff
On Tue, Jun 15, 2021 at 09:28:25AM -0400, Mark Michelson wrote:
> On 6/15/21 9:24 AM, Mark Michelson wrote:
> > Signed-off-by: Mark Michelson 
> > ---
> >   controller/ofctrl.c | 1 -
> >   1 file changed, 1 deletion(-)
> > 
> > diff --git a/controller/ofctrl.c b/controller/ofctrl.c
> > index 053631590..48d001506 100644
> > --- a/controller/ofctrl.c
> > +++ b/controller/ofctrl.c
> > @@ -1259,7 +1259,6 @@ ofctrl_flood_remove_flows(struct 
> > ovn_desired_flow_table *flow_table,
> >* Copying the sb_uuids into an array. */
> >   struct uuid *sb_uuids;
> >   sb_uuids = xmalloc(hmap_count(flood_remove_nodes) * sizeof *sb_uuids);
> > -struct hmap flood_remove_uuids = HMAP_INITIALIZER(&flood_remove_uuids);
> >   HMAP_FOR_EACH (ofrn, hmap_node, flood_remove_nodes) {
> >   sb_uuids[n++] = ofrn->sb_uuid;
> >   }
> > 
> 
> This series is probably the most trivial I've ever put up for review. I
> almost merged this series directly without first doing a review. However, I
> talked myself into going through the review process. I'm curious how people
> would feel if devs committed changes like this directly instead of engaging
> in the review process.

I normally post them for review anyway.  The only commits I routinely
push without review are ones that just add people to AUTHORS.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] ofproto-dpif-xlate: Fix continuations with OF instructions in OF1.1+.

2021-07-07 Thread Ben Pfaff
Open vSwitch supports OpenFlow "instructions", which were introduced in
OpenFlow 1.1 and act like restricted kinds of actions that can only
appear in a particular order and particular circumstances.  OVS did
not support two of these instructions, "write_metadata" and
"goto_table", properly in the case where they appeared in a flow that
needed to be frozen for continuations.

Both of these instructions had the problem that they couldn't be
properly serialized into the stream of actions, because they're not
actions.  This commit fixes that problem in freeze_unroll_actions()
by converting them into equivalent actions for serialization.

goto_table had the additional problem that it was being serialized to
the frozen stream even after it had been executed.  This was already
properly handled in do_xlate_actions() for resubmit, which is almost
equivalent to goto_table, so this commit applies the same fix to
goto_table.  (The commit removes an assertion from the goto_table
implementation, but there wasn't any real value in that assertion and
I thought the code looked cleaner without it.)

This commit adds tests that would have found these bugs.  This includes
adding a variant of each continuation test that uses OF1.3 for
monitor/resume (which is necessary to trigger these bugs) plus specific
tests for continuations with goto_table and write_metadata.  It also
improves the continuation test infrastructure to add more detail on
the problem if a test fails.

Signed-off-by: Ben Pfaff 
Reported-by: Grayson Wu 
Reported-at: https://github.com/openvswitch/ovs-issues/issues/213
---
 ofproto/ofproto-dpif-xlate.c | 56 +++-
 tests/nsh.at |  4 +--
 tests/ofproto-dpif.at| 40 ++
 3 files changed, 72 insertions(+), 28 deletions(-)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index a6f4ea334017..405996975b28 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -5974,14 +5974,36 @@ freeze_unroll_actions(const struct ofpact *a, const 
struct ofpact *end,
 }
 break;
 
+/* From an OpenFlow point of view, goto_table and write_metadata are
+ * instructions, not actions.  This means that to use them, we'd have
+ * to reformulate the actions as instructions, which is possible, and
+ * we'd have slot them into the frozen actions in a specific order,
+ * which doesn't seem practical.  Instead, we translate these
+ * instructions into equivalent actions. */
+case OFPACT_GOTO_TABLE: {
+struct ofpact_resubmit *resubmit
+= ofpact_put_RESUBMIT(&ctx->frozen_actions);
+resubmit->in_port = OFPP_IN_PORT;
+resubmit->table_id = ofpact_get_GOTO_TABLE(a)->table_id;
+resubmit->with_ct_orig = false;
+}
+continue;
+case OFPACT_WRITE_METADATA: {
+const struct ofpact_metadata *md = ofpact_get_WRITE_METADATA(a);
+const struct mf_field *mf = mf_from_id(MFF_METADATA);
+ovs_assert(mf->n_bytes == sizeof md->metadata);
+ovs_assert(mf->n_bytes == sizeof md->mask);
+ofpact_put_set_field(&ctx->frozen_actions, mf,
+ &md->metadata, &md->mask);
+}
+continue;
+
 case OFPACT_SET_TUNNEL:
 case OFPACT_REG_MOVE:
 case OFPACT_SET_FIELD:
 case OFPACT_STACK_PUSH:
 case OFPACT_STACK_POP:
 case OFPACT_LEARN:
-case OFPACT_WRITE_METADATA:
-case OFPACT_GOTO_TABLE:
 case OFPACT_ENQUEUE:
 case OFPACT_SET_VLAN_VID:
 case OFPACT_SET_VLAN_PCP:
@@ -6911,16 +6933,21 @@ do_xlate_actions(const struct ofpact *ofpacts, size_t 
ofpacts_len,
 }
 break;
 
+/* Freezing complicates resubmit and goto_table.  Some action in the
+ * flow entry found by resubmit might trigger freezing.  If that
+ * happens, then we do not want to execute the resubmit or goto_table
+ * again after during thawing, so we want to skip back to the head of
+ * the loop to avoid that, only adding any actions that follow the
+ * resubmit to the frozen actions.
+ */
 case OFPACT_RESUBMIT:
-/* Freezing complicates resubmit.  Some action in the flow
- * entry found by resubmit might trigger freezing.  If that
- * happens, then we do not want to execute the resubmit again after
- * during thawing, so we want to skip back to the head of the loop
- * to avoid that, only adding any actions that follow the resubmit
- * to the frozen actions.
- */
 xlate_ofpact_resubmit(ctx, ofpact_get_RESUBMIT(a), last);
 continue;
+case OFPACT_GOTO_TABLE:
+xlate_table_action(ctx, ctx->xin->flow.in_port.ofp_port,
+ 

Re: [ovs-dev] [PATCH ovn v9 4/4] northd: Flood ARPs to routers for "unreachable" addresses.

2021-07-07 Thread Mark Michelson

On 7/7/21 1:41 PM, Numan Siddique wrote:

On Wed, Jun 30, 2021 at 7:57 PM Mark Michelson  wrote:


Previously, ARP TPAs were filtered down only to "reachable" addresses.
Reachable addresses are all router interface addresses, as well as NAT
external addresses and load balancer VIPs that are within the subnet
handled by a router's port.

However, it is possible that in some configurations, CMSes purposely
configure NAT or load balancer addresses on a router that are outside
the router's subnets, and they expect the router to respond to ARPs for
those addresses.

This commit adds a higher priority flow to logical switches that makes
it so ARPs targeted at "unreachable" addresses are flooded to all ports.
This way, the ARPs can reach the router appropriately and receive a
response.

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1929901

Signed-off-by: Mark Michelson 


Acked-by: Numan Siddique 

I've one comment which probably can be addressed.

If a configured NAT entry on a logical router is unreachable within
that router, this patch floods
the packet for the ARP destined to that NAT IP in the logical switch pipeline.

Before adding the flow to flood, can't we check if the NAT IP is
reachable from other router ports
connected to this logical switch.  If so, we can do "outport ==
" instead
of flooding.

I think this is possible given that the logical flow is added in the
switch pipeline.  We just need to loop through
all the router ports of the logical switch.  The question is - is this
efficient  and takes up some time on a scaled environment ?

What do you think ?  If this seems fine,  it can be a follow up patch too.


I don't think your suggestion would add any appreciable time compared to 
what we're already doing in ovn-northd. I will attempt this approach and 
let you know how it goes.




Numan


---
  northd/ovn-northd.8.xml |   8 ++
  northd/ovn-northd.c | 162 +++-
  northd/ovn_northd.dl|  91 ++
  tests/ovn-northd.at |  99 
  tests/system-ovn.at | 102 +
  5 files changed, 395 insertions(+), 67 deletions(-)

diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index beaf5a183..5aedd6619 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -1587,6 +1587,14 @@ output;
  logical ports.


+  
+Priority-90 flows for each IP address/VIP/NAT address configured
+outside its owning router port's subnet. These flows match ARP
+requests and ND packets for the specific IP addresses.  Matched packets
+are forwarded to the MC_FLOOD multicast group which
+contains all connected logical ports.
+  
+

  Priority-75 flows for each port connected to a logical router
  matching self originated ARP request/ND packets.  These packets
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index f6fad281b..d0b325748 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -6555,38 +6555,41 @@ build_lswitch_rport_arp_req_self_orig_flow(struct 
ovn_port *op,
  ds_destroy(&match);
  }

-/*
- * Ingress table 19: Flows that forward ARP/ND requests only to the routers
- * that own the addresses. Other ARP/ND packets are still flooded in the
- * switching domain as regular broadcast.
- */
  static void
-build_lswitch_rport_arp_req_flow_for_ip(struct ds *ip_match,
-int addr_family,
-struct ovn_port *patch_op,
-struct ovn_datapath *od,
-uint32_t priority,
-struct hmap *lflows,
-const struct ovsdb_idl_row *stage_hint)
+arp_nd_ns_match(struct ds *ips, int addr_family, struct ds *match)
  {
-struct ds match   = DS_EMPTY_INITIALIZER;
-struct ds actions = DS_EMPTY_INITIALIZER;
-
  /* Packets received from VXLAN tunnels have already been through the
   * router pipeline so we should skip them. Normally this is done by the
   * multicast_group implementation (VXLAN packets skip table 32 which
   * delivers to patch ports) but we're bypassing multicast_groups.
   */
-ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
+ds_put_cstr(match, FLAGBIT_NOT_VXLAN " && ");

  if (addr_family == AF_INET) {
-ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
+ds_put_cstr(match, "arp.op == 1 && arp.tpa == {");
  } else {
-ds_put_cstr(&match, "nd_ns && nd.target == { ");
+ds_put_cstr(match, "nd_ns && nd.target == {");
  }

-ds_put_cstr(&match, ds_cstr_ro(ip_match));
-ds_put_cstr(&match, "}");
+ds_put_cstr(match, ds_cstr_ro(ips));
+ds_put_cstr(match, "}");
+}
+
+/*
+ * Ingress table 19: Flows that forward ARP/ND requests only to the routers
+ * that own the addresses. Other A

Re: [ovs-dev] [PATCH 1/1] match: do not print "igmp" match keyword

2021-07-07 Thread Ilya Maximets
On 7/6/21 3:36 PM, Flavio Leitner wrote:
> On Tue, Jul 06, 2021 at 03:27:41PM +0200, Adrian Moreno wrote:
>>
>>
>> On 7/6/21 2:50 PM, Flavio Leitner wrote:
>>> On Tue, Jul 06, 2021 at 08:25:59AM +0200, Adrian Moreno wrote:


 On 7/5/21 4:15 PM, Flavio Leitner wrote:
>
> Hi,
>
> On Wed, Jun 30, 2021 at 05:43:54PM +0200, Adrian Moreno wrote:
>> The match keyword "igmp" is not supported in ofp-parse, which means
>> that flow dumps cannot be restored. This patch prints the igmp match
>> in the accepted format (ip,nw_proto=2) and adds a test.
>
> I raised concerns about changing the output and break scripts in
> the past.  However, it seems not removing the keyword also cause
> issues, so I am not opposing to remove the igmp keyword anymore.
>
> Acked-by: Flavio Leitner 
>

 Thanks Flavio. Do you think this is an acceptable solution also for stable 
 branches?
>>>
>>> My concern is that changing the output can potentially break
>>> somebody else's script and that is really bad in a stable
>>> release update.
>>>
>>> BTW, this is an user visible change, so I'd say that the patch
>>> needs to highlight that in the NEWS file too.
>>>
>> OK. I'll send another update, thanks.
>>
>>>
 If not, how about replacing the flows in ovs-save so that upgrades of 
 stable
 branches work fine?
>>>
>>> You mean fixing ovs-save in master or in stable branches?
>>>
>> My proposal was:
>> - changing the output + advertise in NEWS in master branch (and future 
>> releases)
>> - add a workaround in ovs-save in stable branches to ensure they can be 
>> upgraded
>> without big datapath impact
>>
>> WDYT?
> 
> Sounds like a good plan to me.

Sounds good to me too.  This way we will change the behavior in current
release and will fix the existing issue in ovs-save on stable branches.

Adrian, could you send a v2 as a patch set where the first patch implements
a workaround in ovs-save (this one we will apply to master and backport)
and the second patch changes the actual output (and removes the workaround
from ovs-save?) ?

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 07/11] test/sytem-dpdk: Add unit test for mfex autovalidator

2021-07-07 Thread Amber, Kumar
Hi Eelco,

Replies are inline.

+++ b/tests/mfex_fuzzy.py
@@ -0,0 +1,32 @@
+#!/usr/bin/python3
+try:
+ from scapy.all import *
+except ModuleNotFoundError as err:
+ print(err + ": Scapy")
+import sys
+import os
+
+path = os.environ['OVS_DIR'] + "/tests/pcap/fuzzy"

This is failing in my setup, as OVS_DIR is not defined:

Is it something you set manually? As from make check-dpdk it fails:

  1.  system-dpdk.at:260: testing OVS-DPDK - MFEX Autovalidator Fuzzy ...

DEPRECATION: The default format will switch to columns in the future. You can 
use --format=(legacy|columns) (or define a format=(legacy|columns) in your 
pip.conf
under the [list] section) to disable this warning.
scapy (2.4.4)
./system-dpdk.at:263: $PYTHON3 $srcdir/mfex_fuzzy.py
--- /dev/null 2021-07-02 08:15:52.158758028 -0400
+++ 
/root/Documents/Scratch/ovs_review_mfex/OVS_master_DPDK_v20.11.1/ovs_github/tests/system-dpdk-testsuite.dir/at-groups/7/stderr
 2021-07-07 10:34:09.364877
754 -0400
@@ -0,0 +1,6 @@
+Traceback (most recent call last):

  *   File "../.././mfex_fuzzy.py", line 10, in 
  *   path = os.environ['OVS_DIR'] + "/tests/pcap/fuzzy"
  *   File "/usr/lib64/python3.6/os.py", line 669, in getitem
  *   raise KeyError(key) from None

+KeyError: 'OVS_DIR'
stdout:
./system-dpdk.at:263: exit code was 1, expected 0
7. system-dpdk.at:260: 7. OVS-DPDK - MFEX Autovalidator Fuzzy 
(system-dpdk.at:260): FAILED (system-dpdk.at:263)

If I set the environment variable it works fine, but should not be needed.



Its fixed now Script used $srcdir.

+pktdump = PcapWriter(path, append=False, sync=True)
+
+for i in range(0, 2000):
+
+ # Generate random protocol bases, use a fuzz() over the combined packet for 
full fuzzing.
+ eth = Ether(src=RandMAC(), dst=RandMAC())
+ vlan = Dot1Q()
+ ipv4 = IP(src=RandIP(), dst=RandIP())
+ ipv6 = IPv6(src=RandIP6(), dst=RandIP6())
+ udp = UDP()
+ tcp = TCP()
+

I think we should also randomize the UDP/TCP ports as they get extracted.

Done .

+ # IPv4 packets with fuzzing
+ pktdump.write(fuzz(eth/ipv4/udp))
+ pktdump.write(fuzz(eth/ipv4/tcp))
+ pktdump.write(fuzz(eth/vlan/ipv4/udp))
+ pktdump.write(fuzz(eth/vlan/ipv4/tcp))
+
+ # IPv6 packets with fuzzing
+ pktdump.write(fuzz(eth/ipv6/udp))
+ pktdump.write(fuzz(eth/ipv6/tcp))
+ pktdump.write(fuzz(eth/vlan/ipv6/udp))
+ pktdump.write(fuzz(eth/vlan/ipv6/tcp))

The generated pcap file does not have an extension, might be nice to at it?

Also add the generated pcap file to the .git_ignore file.

Done.

\ No newline at end of file
diff --git a/tests/pcap/mfex_test b/tests/pcap/mfex_test

I would give the pcap file the .pcap extension just to be clear!



Actually cannot all the pcaps present in test prior to this commit also don't 
hold extension as the automake tries to compile them keeping them without 
extension prevents this compilation error.

new file mode 100644
index 
..1aac67b8d643ecb016c758cba4cc32212a80f52a
GIT binary patch
literal 416
zcmca|c+)~A1{MYw`2U}Qff2}QK`M68ITRa|G@yFii5$Gfk6YL%z>@uY&}o|
z2s4N<1VH2&7y^V87$)XGOtD~MV$cFgfG~zBGGJ2#YtF$KST_NTIwYriok6N4Vm)gX-Q@c^{cp<7_5LgK^UuU{2>VS0RZ!RQ+EIW

literal 0
HcmV?d1

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 802895488..fcab92729 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -232,3 +232,49 @@ OVS_VSWITCHD_STOP(["\@does not exist. The Open vSwitch 
kernel module is probably
\@EAL: No free hugepages reported in hugepages-1048576kB@d"])
AT_CLEANUP
dnl --
+
+dnl --
+dnl Add standard DPDK PHY port

I think we should also skip these tests if we do not have a machine that has 
AVX512. Just to make sure we do not generate an OK where we are not even 
testing the AVX512 functions.

Actually we should not what if someone wants to write a new mfex version 
without AVX but just SIMD or some other way than we are probably blocking the 
testing

+AT_SETUP([OVS-DPDK - MFEX Autovalidator])
+AT_KEYWORDS([dpdk])
+
+OVS_DPDK_START()
+
+dnl Add userspace bridge and attach it to OVS
+AT_CHECK([ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev])
+AT_CHECK([ovs-vsctl add-port br0 p1 -- set Interface p1 type=dpdk 
options:dpdk-devargs=net_pcap1,rx_pcap=$srcdir/pcap/mfex_test,infinite_rx=1], 
[], [stdout], [stderr])
+AT_CHECK([ovs-vsctl show], [], [stdout])
+
+
+AT_CHECK([ovs-appctl dpif-netdev/miniflow-parser-set autovalidator], [0], [dnl
+Miniflow implementation set to autovalidator
+])


Here we set the autovalidator, and we stop it right after how do we know enough 
packets were running through it? I thought we discussed looking at frame count 
to make sure at least the pcap file was run through it once?

OVS_WAIT_UNTIL([test `ovs-vsctl get interface p1 statistics | grep -oP 
'rx_packets=\s*\K\d+'` -ge 10])

Fixed.

+dnl Clean up
+AT_CHECK([ovs-vsctl del-port br0 p1], [], [stdout], [stderr])
+AT_CL

[ovs-dev] [PATCH ovn 2/2] controller: incrementally create ras port_binding list

2021-07-07 Thread Lorenzo Bianconi
Incrementally manage local_active_ports_ras map for interfaces
where periodic router advertisement has been enabled. This patch
allows to avoid looping over all local interfaces to check if
periodic RA is running on the current port binding.

Signed-off-by: Lorenzo Bianconi 
---
 controller/binding.c|  7 +++
 controller/binding.h|  1 +
 controller/ovn-controller.c | 10 +++-
 controller/pinctrl.c| 96 -
 controller/pinctrl.h|  3 +-
 5 files changed, 72 insertions(+), 45 deletions(-)

diff --git a/controller/binding.c b/controller/binding.c
index 2cec7ea84..8a66a867c 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -1675,6 +1675,9 @@ binding_run(struct binding_ctx_in *b_ctx_in, struct 
binding_ctx_out *b_ctx_out)
 update_active_pb_ras_pd(pb, b_ctx_out->local_datapaths,
 b_ctx_out->local_active_ports_ipv6_pd,
 "ipv6_prefix_delegation");
+update_active_pb_ras_pd(pb, b_ctx_out->local_datapaths,
+b_ctx_out->local_active_ports_ras,
+"ipv6_ra_send_periodic");
 
 enum en_lport_type lport_type = get_lport_type(pb);
 
@@ -2517,6 +2520,10 @@ delete_done:
 b_ctx_out->local_active_ports_ipv6_pd,
 "ipv6_prefix_delegation");
 
+update_active_pb_ras_pd(pb, b_ctx_out->local_datapaths,
+b_ctx_out->local_active_ports_ras,
+"ipv6_ra_send_periodic");
+
 enum en_lport_type lport_type = get_lport_type(pb);
 
 struct binding_lport *b_lport =
diff --git a/controller/binding.h b/controller/binding.h
index 5a6f46a14..a0ac47dd4 100644
--- a/controller/binding.h
+++ b/controller/binding.h
@@ -73,6 +73,7 @@ void related_lports_destroy(struct related_lports *);
 struct binding_ctx_out {
 struct hmap *local_datapaths;
 struct hmap *local_active_ports_ipv6_pd;
+struct hmap *local_active_ports_ras;
 struct local_binding_data *lbinding_data;
 
 /* sset of (potential) local lports. */
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index a42b96ddb..b779f0c58 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -1044,6 +1044,7 @@ struct ed_type_runtime_data {
 struct hmap tracked_dp_bindings;
 
 struct hmap local_active_ports_ipv6_pd;
+struct hmap local_active_ports_ras;
 };
 
 /* struct ed_type_runtime_data has the below members for tracking the
@@ -1132,6 +1133,7 @@ en_runtime_data_init(struct engine_node *node OVS_UNUSED,
 smap_init(&data->local_iface_ids);
 local_binding_data_init(&data->lbinding_data);
 hmap_init(&data->local_active_ports_ipv6_pd);
+hmap_init(&data->local_active_ports_ras);
 
 /* Init the tracked data. */
 hmap_init(&data->tracked_dp_bindings);
@@ -1158,6 +1160,7 @@ en_runtime_data_cleanup(void *data)
 }
 hmap_destroy(&rt_data->local_datapaths);
 hmap_destroy(&rt_data->local_active_ports_ipv6_pd);
+hmap_destroy(&rt_data->local_active_ports_ras);
 local_binding_data_destroy(&rt_data->lbinding_data);
 }
 
@@ -1238,6 +1241,8 @@ init_binding_ctx(struct engine_node *node,
 b_ctx_out->local_datapaths = &rt_data->local_datapaths;
 b_ctx_out->local_active_ports_ipv6_pd =
 &rt_data->local_active_ports_ipv6_pd;
+b_ctx_out->local_active_ports_ras =
+&rt_data->local_active_ports_ras;
 b_ctx_out->local_lports = &rt_data->local_lports;
 b_ctx_out->local_lports_changed = false;
 b_ctx_out->related_lports = &rt_data->related_lports;
@@ -1256,6 +1261,7 @@ en_runtime_data_run(struct engine_node *node, void *data)
 struct ed_type_runtime_data *rt_data = data;
 struct hmap *local_datapaths = &rt_data->local_datapaths;
 struct hmap *local_active_ipv6_pd = &rt_data->local_active_ports_ipv6_pd;
+struct hmap *local_active_ras = &rt_data->local_active_ports_ras;
 struct sset *local_lports = &rt_data->local_lports;
 struct sset *active_tunnels = &rt_data->active_tunnels;
 
@@ -1272,6 +1278,7 @@ en_runtime_data_run(struct engine_node *node, void *data)
 }
 hmap_clear(local_datapaths);
 hmap_clear(local_active_ipv6_pd);
+hmap_clear(local_active_ras);
 local_binding_data_destroy(&rt_data->lbinding_data);
 sset_destroy(local_lports);
 related_lports_destroy(&rt_data->related_lports);
@@ -3286,7 +3293,8 @@ main(int argc, char *argv[])
 br_int, chassis,
 &runtime_data->local_datapaths,
 &runtime_data->active_tunnels,
-&runtime_data->local_active_ports_ipv6_pd);
+&runtime_data->local_active_ports_ipv6_pd,
+&runtime_data->loc

[ovs-dev] [PATCH ovn 1/2] controller: incrementally create ipv6 prefix delegation port_binding list

2021-07-07 Thread Lorenzo Bianconi
Incrementally manage local_active_ports_ipv6_pd map for interfaces
where IPv6 prefix-delegation has been enabled. This patch allows to
avoid looping over all local interfaces to check if prefix-delegation
is running on the current port binding.

Signed-off-by: Lorenzo Bianconi 
---
 controller/binding.c|  35 
 controller/binding.h|   1 +
 controller/ovn-controller.c |  25 -
 controller/ovn-controller.h |   8 +++
 controller/pinctrl.c| 109 ++--
 controller/pinctrl.h|   3 +-
 6 files changed, 124 insertions(+), 57 deletions(-)

diff --git a/controller/binding.c b/controller/binding.c
index 594babc98..2cec7ea84 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -574,6 +574,33 @@ remove_related_lport(const struct sbrec_port_binding *pb,
 }
 }
 
+static void
+update_active_pb_ras_pd(const struct sbrec_port_binding *pb,
+struct hmap *local_datapaths,
+struct hmap *map, const char *conf)
+{
+const char *ras_pd_conf = smap_get(&pb->options, conf);
+if (!ras_pd_conf) {
+return;
+}
+
+struct pb_active_ra_pd *ras_pd =
+get_pb_active_ras_pd(map, pb->logical_port);
+if (ras_pd && !strcmp(ras_pd_conf, "false")) {
+hmap_remove(map, &ras_pd->hmap_node);
+return;
+}
+if (!ras_pd && !strcmp(ras_pd_conf, "true")) {
+ras_pd = xzalloc(sizeof *ras_pd);
+ras_pd->pb = pb;
+hmap_insert(map, &ras_pd->hmap_node, hash_string(pb->logical_port, 0));
+}
+if (ras_pd) {
+ras_pd->ld = get_local_datapath(local_datapaths,
+pb->datapath->tunnel_key);
+}
+}
+
 /* Corresponds to each Port_Binding.type. */
 enum en_lport_type {
 LP_UNKNOWN,
@@ -1645,6 +1672,10 @@ binding_run(struct binding_ctx_in *b_ctx_in, struct 
binding_ctx_out *b_ctx_out)
 const struct sbrec_port_binding *pb;
 SBREC_PORT_BINDING_TABLE_FOR_EACH (pb,
b_ctx_in->port_binding_table) {
+update_active_pb_ras_pd(pb, b_ctx_out->local_datapaths,
+b_ctx_out->local_active_ports_ipv6_pd,
+"ipv6_prefix_delegation");
+
 enum en_lport_type lport_type = get_lport_type(pb);
 
 switch (lport_type) {
@@ -2482,6 +2513,10 @@ delete_done:
 continue;
 }
 
+update_active_pb_ras_pd(pb, b_ctx_out->local_datapaths,
+b_ctx_out->local_active_ports_ipv6_pd,
+"ipv6_prefix_delegation");
+
 enum en_lport_type lport_type = get_lport_type(pb);
 
 struct binding_lport *b_lport =
diff --git a/controller/binding.h b/controller/binding.h
index a08011ae2..5a6f46a14 100644
--- a/controller/binding.h
+++ b/controller/binding.h
@@ -72,6 +72,7 @@ void related_lports_destroy(struct related_lports *);
 
 struct binding_ctx_out {
 struct hmap *local_datapaths;
+struct hmap *local_active_ports_ipv6_pd;
 struct local_binding_data *lbinding_data;
 
 /* sset of (potential) local lports. */
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 9050380f3..a42b96ddb 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -133,6 +133,20 @@ get_local_datapath(const struct hmap *local_datapaths, 
uint32_t tunnel_key)
 : NULL);
 }
 
+struct pb_active_ra_pd *
+get_pb_active_ras_pd(const struct hmap *map, const char *name)
+{
+uint32_t key = hash_string(name, 0);
+struct hmap_node *node;
+
+node = hmap_first_with_hash(map, key);
+if (node) {
+return CONTAINER_OF(node, struct pb_active_ra_pd, hmap_node);
+}
+
+return NULL;
+}
+
 uint32_t
 get_tunnel_type(const char *name)
 {
@@ -1028,6 +1042,8 @@ struct ed_type_runtime_data {
 bool tracked;
 bool local_lports_changed;
 struct hmap tracked_dp_bindings;
+
+struct hmap local_active_ports_ipv6_pd;
 };
 
 /* struct ed_type_runtime_data has the below members for tracking the
@@ -1115,6 +1131,7 @@ en_runtime_data_init(struct engine_node *node OVS_UNUSED,
 sset_init(&data->egress_ifaces);
 smap_init(&data->local_iface_ids);
 local_binding_data_init(&data->lbinding_data);
+hmap_init(&data->local_active_ports_ipv6_pd);
 
 /* Init the tracked data. */
 hmap_init(&data->tracked_dp_bindings);
@@ -1140,6 +1157,7 @@ en_runtime_data_cleanup(void *data)
 free(cur_node);
 }
 hmap_destroy(&rt_data->local_datapaths);
+hmap_destroy(&rt_data->local_active_ports_ipv6_pd);
 local_binding_data_destroy(&rt_data->lbinding_data);
 }
 
@@ -1218,6 +1236,8 @@ init_binding_ctx(struct engine_node *node,
 b_ctx_in->ovs_table = ovs_table;
 
 b_ctx_out->local_datapaths = &rt_data->local_datapaths;
+b_ctx_out->local_active_ports_ipv6_pd =
+&rt_data->local_active_ports_ipv6_pd;
 b_ctx_out->local_lports 

[ovs-dev] [PATCH ovn 0/2] incrementally process ras-ipv6 pd router ports

2021-07-07 Thread Lorenzo Bianconi
https://bugzilla.redhat.com/show_bug.cgi?id=1944220

Lorenzo Bianconi (2):
  controller: incrementally create ipv6 prefix delegation port_binding
list
  controller: incrementally create ras port_binding list

 controller/binding.c|  42 
 controller/binding.h|   2 +
 controller/ovn-controller.c |  33 +-
 controller/ovn-controller.h |   8 ++
 controller/pinctrl.c| 203 +++-
 controller/pinctrl.h|   4 +-
 6 files changed, 193 insertions(+), 99 deletions(-)

-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v9 4/4] northd: Flood ARPs to routers for "unreachable" addresses.

2021-07-07 Thread Numan Siddique
On Wed, Jun 30, 2021 at 7:57 PM Mark Michelson  wrote:
>
> Previously, ARP TPAs were filtered down only to "reachable" addresses.
> Reachable addresses are all router interface addresses, as well as NAT
> external addresses and load balancer VIPs that are within the subnet
> handled by a router's port.
>
> However, it is possible that in some configurations, CMSes purposely
> configure NAT or load balancer addresses on a router that are outside
> the router's subnets, and they expect the router to respond to ARPs for
> those addresses.
>
> This commit adds a higher priority flow to logical switches that makes
> it so ARPs targeted at "unreachable" addresses are flooded to all ports.
> This way, the ARPs can reach the router appropriately and receive a
> response.
>
> Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1929901
>
> Signed-off-by: Mark Michelson 

Acked-by: Numan Siddique 

I've one comment which probably can be addressed.

If a configured NAT entry on a logical router is unreachable within
that router, this patch floods
the packet for the ARP destined to that NAT IP in the logical switch pipeline.

Before adding the flow to flood, can't we check if the NAT IP is
reachable from other router ports
connected to this logical switch.  If so, we can do "outport ==
" instead
of flooding.

I think this is possible given that the logical flow is added in the
switch pipeline.  We just need to loop through
all the router ports of the logical switch.  The question is - is this
efficient  and takes up some time on a scaled environment ?

What do you think ?  If this seems fine,  it can be a follow up patch too.

Numan

> ---
>  northd/ovn-northd.8.xml |   8 ++
>  northd/ovn-northd.c | 162 +++-
>  northd/ovn_northd.dl|  91 ++
>  tests/ovn-northd.at |  99 
>  tests/system-ovn.at | 102 +
>  5 files changed, 395 insertions(+), 67 deletions(-)
>
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index beaf5a183..5aedd6619 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -1587,6 +1587,14 @@ output;
>  logical ports.
>
>
> +  
> +Priority-90 flows for each IP address/VIP/NAT address configured
> +outside its owning router port's subnet. These flows match ARP
> +requests and ND packets for the specific IP addresses.  Matched 
> packets
> +are forwarded to the MC_FLOOD multicast group which
> +contains all connected logical ports.
> +  
> +
>
>  Priority-75 flows for each port connected to a logical router
>  matching self originated ARP request/ND packets.  These packets
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index f6fad281b..d0b325748 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -6555,38 +6555,41 @@ build_lswitch_rport_arp_req_self_orig_flow(struct 
> ovn_port *op,
>  ds_destroy(&match);
>  }
>
> -/*
> - * Ingress table 19: Flows that forward ARP/ND requests only to the routers
> - * that own the addresses. Other ARP/ND packets are still flooded in the
> - * switching domain as regular broadcast.
> - */
>  static void
> -build_lswitch_rport_arp_req_flow_for_ip(struct ds *ip_match,
> -int addr_family,
> -struct ovn_port *patch_op,
> -struct ovn_datapath *od,
> -uint32_t priority,
> -struct hmap *lflows,
> -const struct ovsdb_idl_row 
> *stage_hint)
> +arp_nd_ns_match(struct ds *ips, int addr_family, struct ds *match)
>  {
> -struct ds match   = DS_EMPTY_INITIALIZER;
> -struct ds actions = DS_EMPTY_INITIALIZER;
> -
>  /* Packets received from VXLAN tunnels have already been through the
>   * router pipeline so we should skip them. Normally this is done by the
>   * multicast_group implementation (VXLAN packets skip table 32 which
>   * delivers to patch ports) but we're bypassing multicast_groups.
>   */
> -ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
> +ds_put_cstr(match, FLAGBIT_NOT_VXLAN " && ");
>
>  if (addr_family == AF_INET) {
> -ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
> +ds_put_cstr(match, "arp.op == 1 && arp.tpa == {");
>  } else {
> -ds_put_cstr(&match, "nd_ns && nd.target == { ");
> +ds_put_cstr(match, "nd_ns && nd.target == {");
>  }
>
> -ds_put_cstr(&match, ds_cstr_ro(ip_match));
> -ds_put_cstr(&match, "}");
> +ds_put_cstr(match, ds_cstr_ro(ips));
> +ds_put_cstr(match, "}");
> +}
> +
> +/*
> + * Ingress table 19: Flows that forward ARP/ND requests only to the routers
> + * that own the addresses. Other ARP/ND packets are still flooded in the
> + * switching d

Re: [ovs-dev] [PATCH ovn v9 3/4] northd: Add options to automatically add routes for NATs and LBs.

2021-07-07 Thread Numan Siddique
On Wed, Jul 7, 2021 at 1:35 PM Numan Siddique  wrote:
>
> On Wed, Jun 30, 2021 at 7:57 PM Mark Michelson  wrote:
> >
> > Load_Balancer and NAT entries have a new option, "add_route" that can be
> > set to automatically add routes to those addresses to neighbor routers,
> > therefore eliminating the need to create static routes.
> >
> > Signed-off-by: Mark Michelson 
>
> Acked-by: Numan Siddique 

There's also a couple of checkpatch errors which need fixing before applying.

Numan

>
> There is one small comment which you can choose to ignore.  Please see below.
>
> Numan
>
> > ---
> >  northd/ovn-northd.8.xml |  7 -
> >  northd/ovn-northd.c | 57 +
> >  northd/ovn_northd.dl| 23 -
> >  ovn-nb.xml  | 33 +++-
> >  tests/ovn-nbctl.at  |  3 +++
> >  tests/ovn-northd.at | 40 ++---
> >  utilities/ovn-nbctl.c   | 25 +-
> >  7 files changed, 158 insertions(+), 30 deletions(-)
> >
> > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > index b5c961e89..beaf5a183 100644
> > --- a/northd/ovn-northd.8.xml
> > +++ b/northd/ovn-northd.8.xml
> > @@ -3539,7 +3539,12 @@ outport = P
> > 
> > column
> >of  table for of type
> >dnat_and_snat, otherwise the Ethernet address of the
> > -  distributed logical router port.
> > +  distributed logical router port. Note that if the
> > +   is 
> > not
> > +  within a subnet on the owning logical router, then OVN will only
> > +  create ARP resolution flows if the  > column="options:add_route"/>
> > +  is set to true. Otherwise, no ARP resolution flows
> > +  will be added.
> >  
> >
> >  
> > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > index 58132bc5c..f6fad281b 100644
> > --- a/northd/ovn-northd.c
> > +++ b/northd/ovn-northd.c
> > @@ -662,8 +662,14 @@ struct ovn_datapath {
> >  struct lport_addresses dnat_force_snat_addrs;
> >  struct lport_addresses lb_force_snat_addrs;
> >  bool lb_force_snat_router_ip;
> > +/* The "routable" ssets are subsets of the load balancer
> > + * IPs for which IP routes and ARP resolution flows are automatically
> > + * added
> > + */
> >  struct sset lb_ips_v4;
> > +struct sset lb_ips_v4_routable;
> >  struct sset lb_ips_v6;
> > +struct sset lb_ips_v6_routable;
> >
> >  struct ovn_port **localnet_ports;
> >  size_t n_localnet_ports;
> > @@ -834,7 +840,9 @@ static void
> >  init_lb_ips(struct ovn_datapath *od)
> >  {
> >  sset_init(&od->lb_ips_v4);
> > +sset_init(&od->lb_ips_v4_routable);
> >  sset_init(&od->lb_ips_v6);
> > +sset_init(&od->lb_ips_v6_routable);
> >  }
> >
> >  static void
> > @@ -845,7 +853,9 @@ destroy_lb_ips(struct ovn_datapath *od)
> >  }
> >
> >  sset_destroy(&od->lb_ips_v4);
> > +sset_destroy(&od->lb_ips_v4_routable);
> >  sset_destroy(&od->lb_ips_v6);
> > +sset_destroy(&od->lb_ips_v6_routable);
> >  }
> >
> >  /* A group of logical router datapaths which are connected - either
> > @@ -1475,13 +1485,14 @@ destroy_routable_addresses(struct 
> > ovn_port_routable_addresses *ra)
> >  free(ra->laddrs);
> >  }
> >
> > -static char **get_nat_addresses(const struct ovn_port *op, size_t *n);
> > +static char **get_nat_addresses(const struct ovn_port *op, size_t *n,
> > +bool routable_only);
> >
> >  static void
> >  assign_routable_addresses(struct ovn_port *op)
> >  {
> >  size_t n;
> > -char **nats = get_nat_addresses(op, &n);
> > +char **nats = get_nat_addresses(op, &n, true);
> >
> >  if (!nats) {
> >  return;
> > @@ -2541,7 +2552,7 @@ join_logical_ports(struct northd_context *ctx,
> >   * The caller must free each of the n returned strings with free(),
> >   * and must free the returned array when it is no longer needed. */
> >  static char **
> > -get_nat_addresses(const struct ovn_port *op, size_t *n)
> > +get_nat_addresses(const struct ovn_port *op, size_t *n, bool routable_only)
> >  {
> >  size_t n_nats = 0;
> >  struct eth_addr mac;
> > @@ -2564,6 +2575,12 @@ get_nat_addresses(const struct ovn_port *op, size_t 
> > *n)
> >  const struct nbrec_nat *nat = op->od->nbr->nat[i];
> >  ovs_be32 ip, mask;
> >
> > +if (routable_only &&
> > +(!strcmp(nat->type, "snat") ||
> > + !smap_get_bool(&nat->options, "add_route", false))) {
> > +continue;
> > +}
> > +
> >  char *error = ip_parse_masked(nat->external_ip, &ip, &mask);
> >  if (error || mask != OVS_BE32_MAX) {
> >  free(error);
> > @@ -2615,13 +2632,24 @@ get_nat_addresses(const struct ovn_port *op, size_t 
> > *n)
> >  }
> >
> >  const char *ip_address;
> > -SSET_FOR_EACH (ip_address, &op->od->lb_ips_v4) {
> > -ds_put_format(&c_addresses, "

Re: [ovs-dev] [PATCH ovn v9 3/4] northd: Add options to automatically add routes for NATs and LBs.

2021-07-07 Thread Numan Siddique
On Wed, Jun 30, 2021 at 7:57 PM Mark Michelson  wrote:
>
> Load_Balancer and NAT entries have a new option, "add_route" that can be
> set to automatically add routes to those addresses to neighbor routers,
> therefore eliminating the need to create static routes.
>
> Signed-off-by: Mark Michelson 

Acked-by: Numan Siddique 

There is one small comment which you can choose to ignore.  Please see below.

Numan

> ---
>  northd/ovn-northd.8.xml |  7 -
>  northd/ovn-northd.c | 57 +
>  northd/ovn_northd.dl| 23 -
>  ovn-nb.xml  | 33 +++-
>  tests/ovn-nbctl.at  |  3 +++
>  tests/ovn-northd.at | 40 ++---
>  utilities/ovn-nbctl.c   | 25 +-
>  7 files changed, 158 insertions(+), 30 deletions(-)
>
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index b5c961e89..beaf5a183 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -3539,7 +3539,12 @@ outport = P
> column
>of  table for of type
>dnat_and_snat, otherwise the Ethernet address of the
> -  distributed logical router port.
> +  distributed logical router port. Note that if the
> +   is not
> +  within a subnet on the owning logical router, then OVN will only
> +  create ARP resolution flows if the  column="options:add_route"/>
> +  is set to true. Otherwise, no ARP resolution flows
> +  will be added.
>  
>
>  
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 58132bc5c..f6fad281b 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -662,8 +662,14 @@ struct ovn_datapath {
>  struct lport_addresses dnat_force_snat_addrs;
>  struct lport_addresses lb_force_snat_addrs;
>  bool lb_force_snat_router_ip;
> +/* The "routable" ssets are subsets of the load balancer
> + * IPs for which IP routes and ARP resolution flows are automatically
> + * added
> + */
>  struct sset lb_ips_v4;
> +struct sset lb_ips_v4_routable;
>  struct sset lb_ips_v6;
> +struct sset lb_ips_v6_routable;
>
>  struct ovn_port **localnet_ports;
>  size_t n_localnet_ports;
> @@ -834,7 +840,9 @@ static void
>  init_lb_ips(struct ovn_datapath *od)
>  {
>  sset_init(&od->lb_ips_v4);
> +sset_init(&od->lb_ips_v4_routable);
>  sset_init(&od->lb_ips_v6);
> +sset_init(&od->lb_ips_v6_routable);
>  }
>
>  static void
> @@ -845,7 +853,9 @@ destroy_lb_ips(struct ovn_datapath *od)
>  }
>
>  sset_destroy(&od->lb_ips_v4);
> +sset_destroy(&od->lb_ips_v4_routable);
>  sset_destroy(&od->lb_ips_v6);
> +sset_destroy(&od->lb_ips_v6_routable);
>  }
>
>  /* A group of logical router datapaths which are connected - either
> @@ -1475,13 +1485,14 @@ destroy_routable_addresses(struct 
> ovn_port_routable_addresses *ra)
>  free(ra->laddrs);
>  }
>
> -static char **get_nat_addresses(const struct ovn_port *op, size_t *n);
> +static char **get_nat_addresses(const struct ovn_port *op, size_t *n,
> +bool routable_only);
>
>  static void
>  assign_routable_addresses(struct ovn_port *op)
>  {
>  size_t n;
> -char **nats = get_nat_addresses(op, &n);
> +char **nats = get_nat_addresses(op, &n, true);
>
>  if (!nats) {
>  return;
> @@ -2541,7 +2552,7 @@ join_logical_ports(struct northd_context *ctx,
>   * The caller must free each of the n returned strings with free(),
>   * and must free the returned array when it is no longer needed. */
>  static char **
> -get_nat_addresses(const struct ovn_port *op, size_t *n)
> +get_nat_addresses(const struct ovn_port *op, size_t *n, bool routable_only)
>  {
>  size_t n_nats = 0;
>  struct eth_addr mac;
> @@ -2564,6 +2575,12 @@ get_nat_addresses(const struct ovn_port *op, size_t *n)
>  const struct nbrec_nat *nat = op->od->nbr->nat[i];
>  ovs_be32 ip, mask;
>
> +if (routable_only &&
> +(!strcmp(nat->type, "snat") ||
> + !smap_get_bool(&nat->options, "add_route", false))) {
> +continue;
> +}
> +
>  char *error = ip_parse_masked(nat->external_ip, &ip, &mask);
>  if (error || mask != OVS_BE32_MAX) {
>  free(error);
> @@ -2615,13 +2632,24 @@ get_nat_addresses(const struct ovn_port *op, size_t 
> *n)
>  }
>
>  const char *ip_address;
> -SSET_FOR_EACH (ip_address, &op->od->lb_ips_v4) {
> -ds_put_format(&c_addresses, " %s", ip_address);
> -central_ip_address = true;
> -}
> -SSET_FOR_EACH (ip_address, &op->od->lb_ips_v6) {
> -ds_put_format(&c_addresses, " %s", ip_address);
> -central_ip_address = true;
> +if (routable_only) {
> +SSET_FOR_EACH (ip_address, &op->od->lb_ips_v4_routable) {
> +ds_put_format(&c_addresses, " %s", ip_address);
> +central_ip_address

Re: [ovs-dev] [PATCH ovn v9 2/4] northd: Add IP routing and ARP resolution flows for NAT/LB addresses.

2021-07-07 Thread Numan Siddique
On Wed, Jun 30, 2021 at 7:57 PM Mark Michelson  wrote:
>
> Dealing with NAT and load balancer IPs has been a bit of a pain point.
> It requires creating static routes if east-west traffic to those
> addresses is desired. Further, it requires ARPs to be sent between the
> logical routers in order to create MAC Bindings.
>
> This commit seeks to make things easier. NAT and load balancer addresess
> automatically have IP routing logical flows and ARP resolution logical
> flows created for reachable routers. This eliminates the need to create
> static routes, and it also eliminates the need for ARPs to be sent
> between logical routers.
>
> In this commit, the behavior is not optional. The next commit will
> introduce configuration to make the behavior optional.
>
> Signed-off-by: Mark Michelson 

Hi Mark,

There is one small problem.  Please see below.  With that addressed:
Acked-by: Numan Siddique 

Numan

> ---
>  northd/ovn-northd.c  | 129 +-
>  northd/ovn_northd.dl |  57 
>  tests/ovn-northd.at  | 214 +++
>  3 files changed, 395 insertions(+), 5 deletions(-)
>
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 694c3b2c4..58132bc5c 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -1378,6 +1378,21 @@ build_datapaths(struct northd_context *ctx, struct 
> hmap *datapaths,
>  }
>  }
>
> +/* Structure representing logical router port
> + * routable addresses. This includes DNAT and Load Balancer
> + * addresses. This structure will only be filled in if the
> + * router port is a gateway router port. Otherwise, all pointers
> + * will be NULL and n_addrs will be 0.
> + */
> +struct ovn_port_routable_addresses {
> +/* Array of address strings suitable for writing to a database table */
> +char **addresses;
> +/* The addresses field parsed into component parts */
> +struct lport_addresses *laddrs;
> +/* Number of items in each of the above arrays */
> +size_t n_addrs;
> +};
> +
>  /* A logical switch port or logical router port.
>   *
>   * In steady state, an ovn_port points to a northbound Logical_Switch_Port
> @@ -1421,6 +1436,8 @@ struct ovn_port {
>
>  struct lport_addresses lrp_networks;
>
> +struct ovn_port_routable_addresses routables;
> +
>  /* Logical port multicast data. */
>  struct mcast_port_info mcast_info;
>
> @@ -1447,6 +1464,44 @@ struct ovn_port {
>  struct ovs_list list;   /* In list of similar records. */
>  };
>
> +static void
> +destroy_routable_addresses(struct ovn_port_routable_addresses *ra)
> +{
> +for (size_t i = 0; i < ra->n_addrs; i++) {
> +free(ra->addresses[i]);
> +destroy_lport_addresses(&ra->laddrs[i]);
> +}
> +free(ra->addresses);
> +free(ra->laddrs);
> +}
> +
> +static char **get_nat_addresses(const struct ovn_port *op, size_t *n);
> +
> +static void
> +assign_routable_addresses(struct ovn_port *op)
> +{
> +size_t n;
> +char **nats = get_nat_addresses(op, &n);
> +
> +if (!nats) {
> +return;
> +}
> +
> +struct lport_addresses *laddrs = xcalloc(n, sizeof(*laddrs));
> +for (size_t i = 0; i < n; i++) {
> +int ofs;
> +if (!extract_addresses(nats[i], &laddrs[i], &ofs)){
> +continue;
> +}
> +}
> +
> +/* Everything seems to have worked out */
> +op->routables.addresses = nats;
> +op->routables.laddrs = laddrs;
> +op->routables.n_addrs = n;

The n_addrs would have the wrong value if 'extract_addresses()' fails.
You probably want to maintain another variable for n_addrs.

Thanks
Numan

> +}
> +
> +
>  static void
>  ovn_port_set_nb(struct ovn_port *op,
>  const struct nbrec_logical_switch_port *nbsp,
> @@ -1496,6 +1551,8 @@ ovn_port_destroy(struct hmap *ports, struct ovn_port 
> *port)
>  }
>  free(port->ps_addrs);
>
> +destroy_routable_addresses(&port->routables);
> +
>  destroy_lport_addresses(&port->lrp_networks);
>  free(port->json_key);
>  free(port->key);
> @@ -2403,6 +2460,8 @@ join_logical_ports(struct northd_context *ctx,
>   * use during flow creation. */
>  od->l3dgw_port = op;
>  od->l3redirect_port = crp;
> +
> +assign_routable_addresses(op);
>  }
>  }
>  }
> @@ -2486,7 +2545,7 @@ get_nat_addresses(const struct ovn_port *op, size_t *n)
>  {
>  size_t n_nats = 0;
>  struct eth_addr mac;
> -if (!op->nbrp || !op->od || !op->od->nbr
> +if (!op || !op->nbrp || !op->od || !op->od->nbr
>  || (!op->od->nbr->n_nat && !op->od->nbr->n_load_balancer)
>  || !eth_addr_from_string(op->nbrp->mac, &mac)) {
>  *n = n_nats;
> @@ -3067,7 +3126,6 @@ ovn_port_update_sbrec(struct northd_context *ctx,
>  } else {
>  sbrec_port_binding_set_options(op->sb, NULL);
>  }
> 

Re: [ovs-dev] [v14 06/11] dpif-netdev: Add command to get dpif implementations.

2021-07-07 Thread Flavio Leitner


Hello,

Please find my comments below.

On Thu, Jul 01, 2021 at 04:06:14PM +0100, Cian Ferriter wrote:
> From: Harry van Haaren 
> 
> This commit adds a new command to retrieve the list of available
> DPIF implementations. This can be used by to check what implementations
> of the DPIF are available in any given OVS binary. It also returns which
> implementations are in use by the OVS PMD threads.
> 
> Usage:
>  $ ovs-appctl dpif-netdev/dpif-impl-get
> 
> Signed-off-by: Harry van Haaren 
> Co-authored-by: Cian Ferriter 
> Signed-off-by: Cian Ferriter 
> 
> ---
> 
> v14:
> - Rename command to dpif-impl-get.
> - Hide more of the dpif impl details from lib/dpif-netdev.c. Pass a
>   dynamic_string to return the dpif-impl-get CMD output.
> - Add information about which DPIF impl is currently in use by each PMD
>   thread.
> 
> v13:
> - Add NEWS item about DPIF get and set commands here rather than in a
>   later commit.
> - Add documentation items about DPIF set commands here rather than in a
>   later commit.
> ---
>  Documentation/topics/dpdk/bridge.rst |  8 +++
>  NEWS |  1 +
>  lib/dpif-netdev-private-dpif.c   | 33 
>  lib/dpif-netdev-private-dpif.h   |  8 +++
>  lib/dpif-netdev-unixctl.man  |  3 +++
>  lib/dpif-netdev.c| 30 +
>  6 files changed, 83 insertions(+)
> 
> diff --git a/Documentation/topics/dpdk/bridge.rst 
> b/Documentation/topics/dpdk/bridge.rst
> index 06d1f943c..2d0850836 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -226,6 +226,14 @@ stats associated with the datapath.
>  Just like with the SIMD DPCLS feature above, SIMD can be applied to the DPIF 
> to
>  improve performance.
>  
> +OVS provides multiple implementations of the DPIF. The available
> +implementations can be listed with the following command ::
> +
> +$ ovs-appctl dpif-netdev/dpif-impl-get
> +Available DPIF implementations:
> +  dpif_scalar (pmds: none)
> +  dpif_avx512 (pmds: 1,2,6,7)
> +
>  By default, dpif_scalar is used. The DPIF implementation can be selected by
>  name ::
>  
> diff --git a/NEWS b/NEWS
> index e23506225..cf0987a24 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -13,6 +13,7 @@ Post-v2.15.0
>   * Refactor lib/dpif-netdev.c to multiple header files.
>   * Add avx512 implementation of dpif which can process non recirculated
> packets. It supports partial HWOL, EMC, SMC and DPCLS lookups.
> + * Add commands to get and set the dpif implementations.
> - ovs-ctl:
>   * New option '--no-record-hostname' to disable hostname configuration
> in ovsdb on startup.
> diff --git a/lib/dpif-netdev-private-dpif.c b/lib/dpif-netdev-private-dpif.c
> index da3511f51..4eaefb291 100644
> --- a/lib/dpif-netdev-private-dpif.c
> +++ b/lib/dpif-netdev-private-dpif.c
> @@ -92,6 +92,39 @@ dp_netdev_impl_set_default_by_name(const char *name)
>  
>  }
>  
> +uint32_t
> +dp_netdev_impl_get(struct ds *reply, struct dp_netdev_pmd_thread **pmd_list,
> +   size_t n)
> +{
> +/* Add all dpif functions to reply string. */
> +ds_put_cstr(reply, "Available DPIF implementations:\n");
> +
> +for (uint32_t i = 0; i < ARRAY_SIZE(dpif_impls); i++) {
> +ds_put_format(reply, "  %s (pmds: ", dpif_impls[i].name);
> +
> +for (size_t j = 0; j < n; j++) {
> +struct dp_netdev_pmd_thread *pmd = pmd_list[j];
> +if (pmd->core_id == NON_PMD_CORE_ID) {
> +continue;
> +}
> +
> +if (pmd->netdev_input_func == dpif_impls[i].input_func) {
> +ds_put_format(reply, "%u,", pmd->core_id);
> +}
> +}
> +
> +ds_chomp(reply, ',');
> +
> +if (ds_last(reply) == ' ') {
> +ds_put_cstr(reply, "none");
> +}
> +
> +ds_put_cstr(reply, ")\n");
> +}
> +
> +return ARRAY_SIZE(dpif_impls);
> +}
> +
>  /* This function checks all available DPIF implementations, and selects the
>   * returns the function pointer to the one requested by "name".
>   */
> diff --git a/lib/dpif-netdev-private-dpif.h b/lib/dpif-netdev-private-dpif.h
> index 0e58153f4..d2c2cbaf4 100644
> --- a/lib/dpif-netdev-private-dpif.h
> +++ b/lib/dpif-netdev-private-dpif.h
> @@ -22,6 +22,7 @@
>  /* Forward declarations to avoid including files. */
>  struct dp_netdev_pmd_thread;
>  struct dp_packet_batch;
> +struct ds;
>  
>  /* Typedef for DPIF functions.
>   * Returns whether all packets were processed successfully.
> @@ -48,6 +49,13 @@ struct dpif_netdev_impl_info_t {
>  const char *name;
>  };
>  
> +/* This function returns all available implementations to the caller. The
> + * quantity of implementations is returned by the int return value.
> + */
> +uint32_t
> +dp_netdev_impl_get(struct ds *reply, struct dp_netdev_pmd_thread **pmd_list,
> +   size_t n);
> +
>  /* This function chec

Re: [ovs-dev] [PATCH ovn v9 1/4] northd: Factor peer retrieval into its own function.

2021-07-07 Thread Numan Siddique
On Wed, Jun 30, 2021 at 7:57 PM Mark Michelson  wrote:
>
> The same pattern is repeated several times throughout ovn-northd.c, so
> this puts it in its own function. This will be used even more in an
> upcoming commit.
>
> Signed-off-by: Mark Michelson 

Acked-by: Numan Siddique 

Thanks
Numan

> ---
>  northd/ovn-northd.c | 70 -
>  1 file changed, 24 insertions(+), 46 deletions(-)
>
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index 83746f4ab..694c3b2c4 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -1571,6 +1571,21 @@ lrport_is_enabled(const struct 
> nbrec_logical_router_port *lrport)
>  return !lrport->enabled || *lrport->enabled;
>  }
>
> +static struct ovn_port *
> +ovn_port_get_peer(struct hmap *ports, struct ovn_port *op)
> +{
> +if (!op->nbsp || !lsp_is_router(op->nbsp) || op->derived) {
> +return NULL;
> +}
> +
> +const char *peer_name = smap_get(&op->nbsp->options, "router-port");
> +if (!peer_name) {
> +return NULL;
> +}
> +
> +return ovn_port_find(ports, peer_name);
> +}
> +
>  static void
>  ipam_insert_ip_for_datapath(struct ovn_datapath *od, uint32_t ip)
>  {
> @@ -2398,12 +2413,7 @@ join_logical_ports(struct northd_context *ctx,
>  struct ovn_port *op;
>  HMAP_FOR_EACH (op, key_node, ports) {
>  if (op->nbsp && lsp_is_router(op->nbsp) && !op->derived) {
> -const char *peer_name = smap_get(&op->nbsp->options, 
> "router-port");
> -if (!peer_name) {
> -continue;
> -}
> -
> -struct ovn_port *peer = ovn_port_find(ports, peer_name);
> +struct ovn_port *peer = ovn_port_get_peer(ports, op);
>  if (!peer || !peer->nbrp) {
>  continue;
>  }
> @@ -10206,14 +10216,8 @@ build_arp_resolve_flows_for_lrouter_port(
>  /* Get the Logical_Router_Port that the
>   * Logical_Switch_Port is connected to, as
>   * 'peer'. */
> -const char *peer_name = smap_get(
> -&op->od->router_ports[k]->nbsp->options,
> -"router-port");
> -if (!peer_name) {
> -continue;
> -}
> -
> -struct ovn_port *peer = ovn_port_find(ports, peer_name);
> +struct ovn_port *peer = ovn_port_get_peer(
> +ports, op->od->router_ports[k]);
>  if (!peer || !peer->nbrp) {
>  continue;
>  }
> @@ -10243,14 +10247,8 @@ build_arp_resolve_flows_for_lrouter_port(
>  /* Get the Logical_Router_Port that the
>   * Logical_Switch_Port is connected to, as
>   * 'peer'. */
> -const char *peer_name = smap_get(
> -&op->od->router_ports[k]->nbsp->options,
> -"router-port");
> -if (!peer_name) {
> -continue;
> -}
> -
> -struct ovn_port *peer = ovn_port_find(ports, peer_name);
> +struct ovn_port *peer = ovn_port_get_peer(
> +ports, op->od->router_ports[k]);
>  if (!peer || !peer->nbrp) {
>  continue;
>  }
> @@ -10298,14 +10296,8 @@ build_arp_resolve_flows_for_lrouter_port(
>  !op->sb->chassis) {
>  /* The virtual port is not claimed yet. */
>  for (size_t i = 0; i < op->od->n_router_ports; i++) {
> -const char *peer_name = smap_get(
> -&op->od->router_ports[i]->nbsp->options,
> -"router-port");
> -if (!peer_name) {
> -continue;
> -}
> -
> -struct ovn_port *peer = ovn_port_find(ports, peer_name);
> +struct ovn_port *peer = ovn_port_get_peer(
> +ports, op->od->router_ports[i]);
>  if (!peer || !peer->nbrp) {
>  continue;
>  }
> @@ -10340,15 +10332,8 @@ build_arp_resolve_flows_for_lrouter_port(
>  /* Get the Logical_Router_Port that the
>  * Logical_Switch_Port is connected to, as
>  * 'peer'. */
> -const char *peer_name = smap_get(
> -&vp->od->router_ports[j]->nbsp->options,
> -"router-port");
> -if (!peer_name) {
> -continue;
> -}
> -
>  struct ovn_port *peer =
> -ovn_port_find(ports, peer_name);
> +ovn_port_get_peer(ports, vp->od->router_ports[j]);
>  if (!p

Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache

2021-07-07 Thread Eli Britstein


On 7/7/2021 7:35 PM, Gaëtan Rivet wrote:

External email: Use caution opening links or attachments


On Wed, Jul 7, 2021, at 17:05, Eli Britstein wrote:

Port numbers are usually small. Maintain an array of netdev handles indexed
by port numbers. It accelerates looking up for them for
netdev_hw_miss_packet_recover().

Reported-by: Cian Ferriter 
Signed-off-by: Eli Britstein 
Reviewed-by: Gaetan Rivet 
---
  lib/dpif-netdev.c | 41 +
  1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2e654426e..accb23a1a 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -650,6 +650,9 @@ struct dp_netdev_pmd_thread_ctx {
  uint32_t emc_insert_min;
  };

+/* Size of netdev's cache. */
+#define DP_PMD_NETDEV_CACHE_SIZE 1024
+
  /* PMD: Poll modes drivers.  PMD accesses devices via polling to eliminate
   * the performance overhead of interrupt processing.  Therefore netdev can
   * not implement rx-wait for these devices.  dpif-netdev needs to poll
@@ -786,6 +789,7 @@ struct dp_netdev_pmd_thread {
   * other instance will only be accessed by its own pmd thread. */
  struct hmap tnl_port_cache;
  struct hmap send_port_cache;
+struct netdev *send_netdev_cache[DP_PMD_NETDEV_CACHE_SIZE];

  /* Keep track of detailed PMD performance statistics. */
  struct pmd_perf_stats perf_stats;
@@ -5910,6 +5914,10 @@ pmd_free_cached_ports(struct
dp_netdev_pmd_thread *pmd)
  free(tx_port_cached);
  }
  HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->send_port_cache) {
+if (tx_port_cached->port->port_no <

It has some issues in github actions. I'll fix and post v2.

+ARRAY_SIZE(pmd->send_netdev_cache)) {
+pmd->send_netdev_cache[tx_port_cached->port->port_no] =
NULL;
+}
  free(tx_port_cached);
  }
  }
@@ -5939,6 +5947,11 @@ pmd_load_cached_ports(struct
dp_netdev_pmd_thread *pmd)
  tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
  hmap_insert(&pmd->send_port_cache, &tx_port_cached->node,
  hash_port_no(tx_port_cached->port->port_no));
+if (tx_port_cached->port->port_no <
+ARRAY_SIZE(pmd->send_netdev_cache)) {
+pmd->send_netdev_cache[tx_port_cached->port->port_no] =
+tx_port_cached->port->netdev;
+}
  }
  }
  }
@@ -6585,6 +6598,7 @@ dp_netdev_configure_pmd(struct
dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
  hmap_init(&pmd->tx_ports);
  hmap_init(&pmd->tnl_port_cache);
  hmap_init(&pmd->send_port_cache);
+memset(pmd->send_netdev_cache, 0, sizeof pmd->send_netdev_cache);
  cmap_init(&pmd->tx_bonds);
  /* init the 'flow_cache' since there is no
   * actual thread created for NON_PMD_CORE_ID. */
@@ -6603,6 +6617,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread
*pmd)
  struct dpcls *cls;

  dp_netdev_pmd_flow_flush(pmd);
+memset(pmd->send_netdev_cache, 0, sizeof pmd->send_netdev_cache);
  hmap_destroy(&pmd->send_port_cache);
  hmap_destroy(&pmd->tnl_port_cache);
  hmap_destroy(&pmd->tx_ports);
@@ -7090,20 +7105,38 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
  static struct tx_port * pmd_send_port_cache_lookup(
  const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);

+OVS_UNUSED
+static inline struct netdev *
+pmd_netdev_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
+odp_port_t port_no)
+{
+struct tx_port *p;
+
+if (port_no < ARRAY_SIZE(pmd->send_netdev_cache)) {
+return pmd->send_netdev_cache[port_no];
+}
+
+p = pmd_send_port_cache_lookup(pmd, port_no);
+if (p) {
+return p->port->netdev;
+}
+return NULL;
+}
+
  static inline int
  dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
odp_port_t port_no OVS_UNUSED,
struct dp_packet *packet,
struct dp_netdev_flow **flow)
  {
-struct tx_port *p OVS_UNUSED;
+struct netdev *netdev OVS_UNUSED;
  uint32_t mark;

  #ifdef ALLOW_EXPERIMENTAL_API /* Packet restoration API required. */
  /* Restore the packet if HW processing was terminated before completion. 
*/
-p = pmd_send_port_cache_lookup(pmd, port_no);
-if (OVS_LIKELY(p)) {
-int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
+netdev = pmd_netdev_cache_lookup(pmd, port_no);
+if (OVS_LIKELY(netdev)) {
+int err = netdev_hw_miss_packet_recover(netdev, packet);

  if (err && err != EOPNOTSUPP) {
  COVERAGE_INC(datapath_drop_hw_miss_recover);FI-86194-0059
--
2.28.0.2311.g225365fb51

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Hello,

I tested the performance impact of this patch with a partial offload setup.
As reported by 

Re: [ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache

2021-07-07 Thread Gaëtan Rivet
On Wed, Jul 7, 2021, at 17:05, Eli Britstein wrote:
> Port numbers are usually small. Maintain an array of netdev handles indexed
> by port numbers. It accelerates looking up for them for
> netdev_hw_miss_packet_recover().
> 
> Reported-by: Cian Ferriter 
> Signed-off-by: Eli Britstein 
> Reviewed-by: Gaetan Rivet 
> ---
>  lib/dpif-netdev.c | 41 +
>  1 file changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 2e654426e..accb23a1a 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -650,6 +650,9 @@ struct dp_netdev_pmd_thread_ctx {
>  uint32_t emc_insert_min;
>  };
>  
> +/* Size of netdev's cache. */
> +#define DP_PMD_NETDEV_CACHE_SIZE 1024
> +
>  /* PMD: Poll modes drivers.  PMD accesses devices via polling to eliminate
>   * the performance overhead of interrupt processing.  Therefore netdev can
>   * not implement rx-wait for these devices.  dpif-netdev needs to poll
> @@ -786,6 +789,7 @@ struct dp_netdev_pmd_thread {
>   * other instance will only be accessed by its own pmd thread. */
>  struct hmap tnl_port_cache;
>  struct hmap send_port_cache;
> +struct netdev *send_netdev_cache[DP_PMD_NETDEV_CACHE_SIZE];
>  
>  /* Keep track of detailed PMD performance statistics. */
>  struct pmd_perf_stats perf_stats;
> @@ -5910,6 +5914,10 @@ pmd_free_cached_ports(struct 
> dp_netdev_pmd_thread *pmd)
>  free(tx_port_cached);
>  }
>  HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->send_port_cache) {
> +if (tx_port_cached->port->port_no <
> +ARRAY_SIZE(pmd->send_netdev_cache)) {
> +pmd->send_netdev_cache[tx_port_cached->port->port_no] = 
> NULL;
> +}
>  free(tx_port_cached);
>  }
>  }
> @@ -5939,6 +5947,11 @@ pmd_load_cached_ports(struct 
> dp_netdev_pmd_thread *pmd)
>  tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
>  hmap_insert(&pmd->send_port_cache, &tx_port_cached->node,
>  hash_port_no(tx_port_cached->port->port_no));
> +if (tx_port_cached->port->port_no <
> +ARRAY_SIZE(pmd->send_netdev_cache)) {
> +pmd->send_netdev_cache[tx_port_cached->port->port_no] =
> +tx_port_cached->port->netdev;
> +}
>  }
>  }
>  }
> @@ -6585,6 +6598,7 @@ dp_netdev_configure_pmd(struct 
> dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>  hmap_init(&pmd->tx_ports);
>  hmap_init(&pmd->tnl_port_cache);
>  hmap_init(&pmd->send_port_cache);
> +memset(pmd->send_netdev_cache, 0, sizeof pmd->send_netdev_cache);
>  cmap_init(&pmd->tx_bonds);
>  /* init the 'flow_cache' since there is no
>   * actual thread created for NON_PMD_CORE_ID. */
> @@ -6603,6 +6617,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread 
> *pmd)
>  struct dpcls *cls;
>  
>  dp_netdev_pmd_flow_flush(pmd);
> +memset(pmd->send_netdev_cache, 0, sizeof pmd->send_netdev_cache);
>  hmap_destroy(&pmd->send_port_cache);
>  hmap_destroy(&pmd->tnl_port_cache);
>  hmap_destroy(&pmd->tx_ports);
> @@ -7090,20 +7105,38 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>  static struct tx_port * pmd_send_port_cache_lookup(
>  const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>  
> +OVS_UNUSED
> +static inline struct netdev *
> +pmd_netdev_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
> +odp_port_t port_no)
> +{
> +struct tx_port *p;
> +
> +if (port_no < ARRAY_SIZE(pmd->send_netdev_cache)) {
> +return pmd->send_netdev_cache[port_no];
> +}
> +
> +p = pmd_send_port_cache_lookup(pmd, port_no);
> +if (p) {
> +return p->port->netdev;
> +}
> +return NULL;
> +}
> +
>  static inline int
>  dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>odp_port_t port_no OVS_UNUSED,
>struct dp_packet *packet,
>struct dp_netdev_flow **flow)
>  {
> -struct tx_port *p OVS_UNUSED;
> +struct netdev *netdev OVS_UNUSED;
>  uint32_t mark;
>  
>  #ifdef ALLOW_EXPERIMENTAL_API /* Packet restoration API required. */
>  /* Restore the packet if HW processing was terminated before completion. 
> */
> -p = pmd_send_port_cache_lookup(pmd, port_no);
> -if (OVS_LIKELY(p)) {
> -int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
> +netdev = pmd_netdev_cache_lookup(pmd, port_no);
> +if (OVS_LIKELY(netdev)) {
> +int err = netdev_hw_miss_packet_recover(netdev, packet);
>  
>  if (err && err != EOPNOTSUPP) {
>  COVERAGE_INC(datapath_drop_hw_miss_recover);
> -- 
> 2.28.0.2311.g225365fb51
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 

Hello,

I tested the performance impact of 

Re: [ovs-dev] [PATCH v2] dpif-netdev: Report overhead busy cycles per pmd.

2021-07-07 Thread Kevin Traynor
On 01/07/2021 17:41, David Marchand wrote:
> Users complained that per rxq pmd usage was confusing: summing those
> values per pmd would never reach 100% even if increasing traffic load
> beyond pmd capacity.
> 
> This is because the dpif-netdev/pmd-rxq-show command only reports "pure"
> rxq cycles while some cycles are used in the pmd mainloop and adds up to
> the total pmd load.
> 
> dpif-netdev/pmd-stats-show does report per pmd load usage.
> This load is measured since the last dpif-netdev/pmd-stats-clear call.
> On the other hand, the per rxq pmd usage reflects the pmd load on a 10s
> sliding window which makes it non trivial to correlate.
> 
> Gather per pmd busy cycles with the same periodicity and report the
> difference as overhead in dpif-netdev/pmd-rxq-show so that we have all
> info in a single command.
> 
> Example:
> $ ovs-appctl dpif-netdev/pmd-rxq-show
> pmd thread numa_id 1 core_id 3:
>   isolated : true
>   port: dpdk0 queue-id:  0 (enabled)   pmd usage: 90 %
>   overhead:  4 %
> pmd thread numa_id 1 core_id 5:
>   isolated : false
>   port: vhost0queue-id:  0 (enabled)   pmd usage:  0 %
>   port: vhost1queue-id:  0 (enabled)   pmd usage: 93 %
>   port: vhost2queue-id:  0 (enabled)   pmd usage:  0 %
>   port: vhost6queue-id:  0 (enabled)   pmd usage:  0 %
>   overhead:  6 %
> pmd thread numa_id 1 core_id 31:
>   isolated : true
>   port: dpdk1 queue-id:  0 (enabled)   pmd usage: 86 %
>   overhead:  4 %
> pmd thread numa_id 1 core_id 33:
>   isolated : false
>   port: vhost3queue-id:  0 (enabled)   pmd usage:  0 %
>   port: vhost4queue-id:  0 (enabled)   pmd usage:  0 %
>   port: vhost5queue-id:  0 (enabled)   pmd usage: 92 %
>   port: vhost7queue-id:  0 (enabled)   pmd usage:  0 %
>   overhead:  7 %
> 
> Signed-off-by: David Marchand 

LGTM. I did some tests with various configurations of pmds, rxqs,
traffic/no traffic etc. Working as expected, things I reported in v1 are
resolved. Unit test ok, checkpatch ok, GHA ok. Thanks!

Acked-by: Kevin Traynor 

> ---
> Changes since v1:
> - fixed unit test and documentation update,
> - moved documentation update under pmd-rxq-show command description,
> - updated commitlog,
> - renamed variables for better readability,
> - avoided reporting a N/A overhead for idle PMD,
> - reset overhead stats on PMD reconfigure,
> 
> ---
>  Documentation/topics/dpdk/pmd.rst |   5 ++
>  lib/dpif-netdev.c | 113 +-
>  tests/pmd.at  |   8 ++-
>  3 files changed, 89 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst 
> b/Documentation/topics/dpdk/pmd.rst
> index e481e79414..6dfc65724c 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -164,6 +164,11 @@ queue::
> due to traffic pattern or reconfig changes, will take one minute to be 
> fully
> reflected in the stats.
>  
> +.. versionchanged:: 2.16.0
> +
> +   A ``overhead`` statistics is shown per PMD: it represents the number of
> +   cycles inherently consumed by the OVS PMD processing loop.
> +
>  Rx queue to PMD assignment takes place whenever there are configuration 
> changes
>  or can be triggered by using::
>  
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 026b52d27d..14f8e0246f 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -235,11 +235,11 @@ struct dfc_cache {
>  
>  /* Time in microseconds of the interval in which rxq processing cycles used
>   * in rxq to pmd assignments is measured and stored. */
> -#define PMD_RXQ_INTERVAL_LEN 1000LL
> +#define PMD_INTERVAL_LEN 1000LL
>  
>  /* Number of intervals for which cycles are stored
>   * and used during rxq to pmd assignment. */
> -#define PMD_RXQ_INTERVAL_MAX 6
> +#define PMD_INTERVAL_MAX 6
>  
>  /* Time in microseconds to try RCU quiescing. */
>  #define PMD_RCU_QUIESCE_INTERVAL 1LL
> @@ -456,9 +456,9 @@ struct dp_netdev_rxq {
>  
>  /* Counters of cycles spent successfully polling and processing pkts. */
>  atomic_ullong cycles[RXQ_N_CYCLES];
> -/* We store PMD_RXQ_INTERVAL_MAX intervals of data for an rxq and then
> +/* We store PMD_INTERVAL_MAX intervals of data for an rxq and then
> sum them to yield the cycles used for an rxq. */
> -atomic_ullong cycles_intrvl[PMD_RXQ_INTERVAL_MAX];
> +atomic_ullong cycles_intrvl[PMD_INTERVAL_MAX];
>  };
>  
>  /* A port in a netdev-based datapath. */
> @@ -694,13 +694,18 @@ struct dp_netdev_pmd_thread {
>  long long int next_optimization;
>  /* End of the next time interval for which processing cycles
> are stored for each polled rxq. */
> -long long int rxq_next_cycle_store;
> +long long int next_cycle_store;
>  
>  /* Last interval timestamp. */
>  uint64_t intrvl_tsc_prev;
>  /* Last interval cycles. */
>  atomic_ullong intrvl_cycles;
>  

Re: [ovs-dev] [v14 05/11] dpif-netdev: Add command to switch dpif implementation.

2021-07-07 Thread Ferriter, Cian
Hi Flavio,

Thanks for your comments. My responses are inline.

Thanks,
Cian

> -Original Message-
> From: Flavio Leitner 
> Sent: Wednesday 7 July 2021 15:40
> To: Ferriter, Cian 
> Cc: ovs-dev@openvswitch.org; i.maxim...@ovn.org
> Subject: Re: [ovs-dev] [v14 05/11] dpif-netdev: Add command to switch dpif 
> implementation.
> 
> Hi,
> 
> Please see my comments below.
> 



> > diff --git a/lib/dpif-netdev-private-dpif.c b/lib/dpif-netdev-private-dpif.c
> > new file mode 100644
> > index 0..da3511f51
> > --- /dev/null
> > +++ b/lib/dpif-netdev-private-dpif.c
> > @@ -0,0 +1,122 @@
> > +/*
> > + * Copyright (c) 2021 Intel Corporation.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + * http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#include 
> > +
> > +#include "dpif-netdev-private-dpif.h"
> > +#include "dpif-netdev-private-thread.h"
> > +
> > +#include 
> > +#include 
> > +
> > +#include "openvswitch/dynamic-string.h"
> > +#include "openvswitch/vlog.h"
> > +#include "util.h"
> > +
> > +VLOG_DEFINE_THIS_MODULE(dpif_netdev_impl);
> > +
> > +enum dpif_netdev_impl_info_idx {
> > +DPIF_NETDEV_IMPL_SCALAR,
> > +DPIF_NETDEV_IMPL_AVX512
> > +};
> > +
> > +/* Actual list of implementations goes here. */
> > +static struct dpif_netdev_impl_info_t dpif_impls[] = {
> > +/* The default scalar C code implementation. */
> > +[DPIF_NETDEV_IMPL_SCALAR] = { .input_func = dp_netdev_input,
> > +  .probe = NULL,
> > +  .name = "dpif_scalar", },
> > +
> > +#if (__x86_64__ && HAVE_AVX512F && HAVE_LD_AVX512_GOOD && __SSE4_2__)
> > +/* Only available on x86_64 bit builds with SSE 4.2 used for OVS core. 
> > */
> > +[DPIF_NETDEV_IMPL_AVX512] = { .input_func = 
> > dp_netdev_input_outer_avx512,
> > +  .probe = dp_netdev_input_outer_avx512_probe,
> > +  .name = "dpif_avx512", },
> > +#endif
> > +};
> > +
> > +static dp_netdev_input_func default_dpif_func;
> > +
> > +dp_netdev_input_func
> > +dp_netdev_impl_get_default(void)
> > +{
> > +/* For the first call, this will be NULL. Compute the compile time 
> > default.
> > + */
> > +if (!default_dpif_func) {
> > +int dpif_idx = 0;
> 
> That should be DPIF_NETDEV_IMPL_SCALAR.
> 

Agreed, I'll fix this.

> > +
> > +/* Configure-time overriding to run test suite on all implementations. */
> > +#if (__x86_64__ && HAVE_AVX512F && HAVE_LD_AVX512_GOOD && __SSE4_2__)
> > +#ifdef DPIF_AVX512_DEFAULT
> > +ovs_assert(dpif_impls[DPIF_NETDEV_IMPL_AVX512].input_func
> > +   == dp_netdev_input_outer_avx512);
> 
> This assert() makes little sense now. It's not possible to change
> the dpif_impls at runtime, and if we change the code we will notice
> the problem only at runtime. Wouldn't it make more sense to make it
> generic like below?
> 
> #ifdef DPIF_AVX512_DEFAULT
> dp_netdev_input_func_probe probe;
> 
> /* Check if the compiled default is compatible. */
> probe = dpif_impls[DPIF_NETDEV_IMPL_AVX512].probe;
> if (!probe || !probe()) {
> dpif_idx = DPIF_NETDEV_IMPL_AVX512;
> }
> #endif
> 
> 

Makes sense. Your implementation looks clean. I'll use it in the next version. 
Thanks!

> > +if (!dp_netdev_input_outer_avx512_probe()) {
> > +dpif_idx = DPIF_NETDEV_IMPL_AVX512;
> > +};
> > +#endif
> > +#endif
> > +
> > +VLOG_INFO("Default DPIF implementation is %s.\n",
> > +  dpif_impls[dpif_idx].name);
> > +default_dpif_func = dpif_impls[dpif_idx].input_func;
> > +}
> > +
> > +return default_dpif_func;
> > +}
> > +
> > +int32_t
> > +dp_netdev_impl_set_default_by_name(const char *name)
> > +{
> > +dp_netdev_input_func new_default;
> > +
> > +int32_t err = dp_netdev_impl_get_by_name(name, &new_default);
> > +
> > +if (!err) {
> > +default_dpif_func = new_default;
> > +}
> > +
> > +return err;
> > +
> > +}
> > +
> > +/* This function checks all available DPIF implementations, and selects the
> > + * returns the function pointer to the one requested by "name".
> > + */
> > +int32_t
> > +dp_netdev_impl_get_by_name(const char *name, dp_netdev_input_func 
> > *out_func)
> 
> That one should be static and removed from the
> lib/dpif-netdev-private-dpif.h.
> 
> 

Good catch, I'll fix this.

> > +{
> > +ovs_assert(name);
> > +ovs_assert(out_func);
> > +
> > +uint32_t i;
> > +
> > +for (i = 0; i < ARRAY_SIZE(dpif_impls); i++) {
> >

Re: [ovs-dev] [PATCH v2 ovn] Don't suppress localport traffic directed to external port

2021-07-07 Thread Dumitru Ceara
On 7/7/21 5:03 PM, Ihar Hrachyshka wrote:
> Recently, we stopped leaking localport traffic through localnet ports
> into fabric to avoid unnecessary flipping between chassis hosting the
> same localport.
> 
> Despite the type name, in some scenarios localports are supposed to
> talk outside the hosting chassis. Specifically, in OpenStack [1]
> metadata service for SR-IOV ports is implemented as a localport hosted
> on another chassis that is exposed to the chassis owning the SR-IOV
> port through an "external" port. In this case, "leaking" localport
> traffic into fabric is desirable.
> 
> This patch inserts a higher priority flow per external port on the
> same datapath that avoids dropping localport traffic.
> 
> Fixes: 96959e56d634 ("physical: do not forward traffic from localport
> to a localnet one")
> 
> [1] https://docs.openstack.org/neutron/latest/admin/ovn/sriov.html
> 
> Signed-off-by: Ihar Hrachyshka 
> 
> --

Hi Ihar,

Looks like our emails crossed, but I left some comments on the v1 of
this patch and I think they still apply to v2:

https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385130.html

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 06/11] dpif-netdev: Add packet count and core id paramters for study

2021-07-07 Thread Amber, Kumar
Hi Eelco,

Don’t know the formatting keeps breaking . replies are inline.



+optimal implementation. If no packet count is provided then the default value
+128 is chosen.

Not a native speaker, but I think the sentence need some commas?

“If no packet count is provided, then the default value,
128, is chosen.”

Fixed.

Also, as there is no synchronization point between threads, one
+PMD thread might still be running a previous round, and can now decide on
+earlier data.
+
+Study can be selected with packet count by the following command ::
+
+ $ ovs-appctl dpif-netdev/miniflow-parser-set study 1024
+
+Study can be selected with packet count and explicit PMD selection
+by the following command ::
+
+ $ ovs-appctl dpif-netdev/miniflow-parser-set study 1024 3
+
+In the above command the last parameter is the CORE ID of the PMD
+thread and this can also be used to explicitly set the miniflow


+ if ((strcmp(miniflow_funcs[MFEX_IMPL_STUDY].name, name) == 0) &&
+ (pkt_cmp_count != 0)) {
+
+ mfex_study_pkts_count = pkt_cmp_count;

Do we need to set/read this atomically?

Using set in v7.

+ return 0;
+ }
+
+ mfex_study_pkts_count = MFEX_MAX_COUNT;
+ return -EINVAL;
+}
+
uint32_t
mfex_study_traffic(struct dp_packet_batch *packets,
struct netdev_flow_key *keys,
@@ -86,7 +107,7 @@ mfex_study_traffic(struct dp_packet_batch *packets,
/* Choose the best implementation after a minimum packets have been
* processed.
*/
- if (stats->pkt_count >= MFEX_MAX_COUNT) {
+ if (stats->pkt_count >= mfex_study_pkts_count) {
uint32_t best_func_index = MFEX_IMPL_MAX;
uint32_t max_hits = 0;
for (int i = MFEX_IMPL_MAX; i < impl_count; i++) {
diff --git a/lib/dpif-netdev-private-extract.c 
b/lib/dpif-netdev-private-extract.c
index 6ae91a24d..c1239b319 100644
--- a/lib/dpif-netdev-private-extract.c
+++ b/lib/dpif-netdev-private-extract.c
@@ -118,7 +118,7 @@ dp_mfex_impl_get(struct ds *reply, struct 
dp_netdev_pmd_thread **pmd_list,

for (uint32_t i = 0; i < ARRAY_SIZE(mfex_impls); i++) {

- ds_put_format(reply, " %s (available: %s)(pmds: ",
+ ds_put_format(reply, " %s (available: %s, pmds: ",

Rather than changing the format here, we set it correctly in patch introducing 
this?

Yes .

mfex_impls[i].name, mfex_impls[i].available ?
"True" : "False");

diff --git a/lib/dpif-netdev-private-extract.h 
b/lib/dpif-netdev-private-extract.h
index cd46c94dd..a1f48d870 100644
--- a/lib/dpif-netdev-private-extract.h
+++ b/lib/dpif-netdev-private-extract.h
@@ -135,4 +135,13 @@ mfex_study_traffic(struct dp_packet_batch *packets,
uint32_t keys_size, odp_port_t in_port,
struct dp_netdev_pmd_thread *pmd_handle);

+/* Sets the packet count from user to the stats for use in
+ * study function to match against the classified packets to choose
+ * the optimal implementation.
+ * On error, returns EINVAL.
+ * On success, returns 0.
+ */
+uint32_t mfex_set_study_pkt_cnt(uint32_t pkt_cmp_count,
+ const char *name);
+
#endif /* MFEX_AVX512_EXTRACT */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 175d8699f..6bcb24a73 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -1103,9 +1103,13 @@ dpif_miniflow_extract_impl_set(struct unixctl_conn 
*conn, int argc,
const char *argv[], void *aux OVS_UNUSED)
{
/* This function requires just one parameter, the miniflow name.
+ * A second optional parameter can set the packet count to match in study.
+ * A third optional paramter PMD thread ID can be also provided which
+ * allows users to set miniflow implementation on a particular pmd.
*/
const char *mfex_name = argv[1];
struct shash_node *node;
+ struct ds reply = DS_EMPTY_INITIALIZER;

static const char *error_description[2] = {
"Unknown miniflow implementation",

It’s not shown here, but I think the Mutex on line 1105 does not need to be 
taken until 1158, which makes the error path more clean.

Fixed at the right place in v7.

@@ -1116,7 +1120,6 @@ dpif_miniflow_extract_impl_set(struct unixctl_conn *conn, 
int argc,
int32_t err = dp_mfex_impl_set_default_by_name(mfex_name);

if (err) {
- struct ds reply = DS_EMPTY_INITIALIZER;
ds_put_format(&reply,
"Miniflow implementation not available: %s %s.\n",
error_description[ (err == EINVAL) ], mfex_name);
@@ -1128,6 +1131,44 @@ dpif_miniflow_extract_impl_set(struct unixctl_conn 
*conn, int argc,
return;
}

+ /* argv[2] is optional packet count, which user can provide along with
+ * study function to set the minimum packet that must be matched in order
+ * to choose the optimal function. */
+ uint32_t pkt_cmp_count = 0;
+ uint32_t study_ret = 0;
+
+ if ((argc == 3) || (argc == 4)) {
+ if (str_to_uint(argv[2], 10, &pkt_cmp_count)) {
+ study_ret = mfex_set_study_pkt_cnt(pkt_cmp_count, mfex_name);
+ } else {
+ study_ret = -EINVAL;

An invalid input was given so we should error out.

The error is handled later and since we already have a fallback to default 
value we just fall-back.

+ }
+ } else {
+ /* Default packet compare count when packets count not provided. */
+ study_ret = mfex_set_study_pkt_cnt(0, mfex_name

[ovs-dev] [PATCH 1/2] dpif-netdev: Do not execute packet recovery without experimental support

2021-07-07 Thread Eli Britstein
rte_flow_get_restore_info() API is under experimental attribute. Using it
has a performance impact that can be avoided for non-experimental compilation.

Do not call it without experimental support.

Reported-by: Cian Ferriter 
Signed-off-by: Eli Britstein 
Reviewed-by: Gaetan Rivet 
---
 lib/dpif-netdev.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 26218ad72..2e654426e 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -7092,13 +7092,14 @@ static struct tx_port * pmd_send_port_cache_lookup(
 
 static inline int
 dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
-  odp_port_t port_no,
+  odp_port_t port_no OVS_UNUSED,
   struct dp_packet *packet,
   struct dp_netdev_flow **flow)
 {
-struct tx_port *p;
+struct tx_port *p OVS_UNUSED;
 uint32_t mark;
 
+#ifdef ALLOW_EXPERIMENTAL_API /* Packet restoration API required. */
 /* Restore the packet if HW processing was terminated before completion. */
 p = pmd_send_port_cache_lookup(pmd, port_no);
 if (OVS_LIKELY(p)) {
@@ -7109,6 +7110,7 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
 return -1;
 }
 }
+#endif
 
 /* If no mark, no flow to find. */
 if (!dp_packet_has_flow_mark(packet, &mark)) {
-- 
2.28.0.2311.g225365fb51

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH 2/2] dpif-netdev: Introduce netdev array cache

2021-07-07 Thread Eli Britstein
Port numbers are usually small. Maintain an array of netdev handles indexed
by port numbers. It accelerates looking up for them for
netdev_hw_miss_packet_recover().

Reported-by: Cian Ferriter 
Signed-off-by: Eli Britstein 
Reviewed-by: Gaetan Rivet 
---
 lib/dpif-netdev.c | 41 +
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2e654426e..accb23a1a 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -650,6 +650,9 @@ struct dp_netdev_pmd_thread_ctx {
 uint32_t emc_insert_min;
 };
 
+/* Size of netdev's cache. */
+#define DP_PMD_NETDEV_CACHE_SIZE 1024
+
 /* PMD: Poll modes drivers.  PMD accesses devices via polling to eliminate
  * the performance overhead of interrupt processing.  Therefore netdev can
  * not implement rx-wait for these devices.  dpif-netdev needs to poll
@@ -786,6 +789,7 @@ struct dp_netdev_pmd_thread {
  * other instance will only be accessed by its own pmd thread. */
 struct hmap tnl_port_cache;
 struct hmap send_port_cache;
+struct netdev *send_netdev_cache[DP_PMD_NETDEV_CACHE_SIZE];
 
 /* Keep track of detailed PMD performance statistics. */
 struct pmd_perf_stats perf_stats;
@@ -5910,6 +5914,10 @@ pmd_free_cached_ports(struct dp_netdev_pmd_thread *pmd)
 free(tx_port_cached);
 }
 HMAP_FOR_EACH_POP (tx_port_cached, node, &pmd->send_port_cache) {
+if (tx_port_cached->port->port_no <
+ARRAY_SIZE(pmd->send_netdev_cache)) {
+pmd->send_netdev_cache[tx_port_cached->port->port_no] = NULL;
+}
 free(tx_port_cached);
 }
 }
@@ -5939,6 +5947,11 @@ pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
 tx_port_cached = xmemdup(tx_port, sizeof *tx_port_cached);
 hmap_insert(&pmd->send_port_cache, &tx_port_cached->node,
 hash_port_no(tx_port_cached->port->port_no));
+if (tx_port_cached->port->port_no <
+ARRAY_SIZE(pmd->send_netdev_cache)) {
+pmd->send_netdev_cache[tx_port_cached->port->port_no] =
+tx_port_cached->port->netdev;
+}
 }
 }
 }
@@ -6585,6 +6598,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, 
struct dp_netdev *dp,
 hmap_init(&pmd->tx_ports);
 hmap_init(&pmd->tnl_port_cache);
 hmap_init(&pmd->send_port_cache);
+memset(pmd->send_netdev_cache, 0, sizeof pmd->send_netdev_cache);
 cmap_init(&pmd->tx_bonds);
 /* init the 'flow_cache' since there is no
  * actual thread created for NON_PMD_CORE_ID. */
@@ -6603,6 +6617,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
 struct dpcls *cls;
 
 dp_netdev_pmd_flow_flush(pmd);
+memset(pmd->send_netdev_cache, 0, sizeof pmd->send_netdev_cache);
 hmap_destroy(&pmd->send_port_cache);
 hmap_destroy(&pmd->tnl_port_cache);
 hmap_destroy(&pmd->tx_ports);
@@ -7090,20 +7105,38 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
 static struct tx_port * pmd_send_port_cache_lookup(
 const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
 
+OVS_UNUSED
+static inline struct netdev *
+pmd_netdev_cache_lookup(const struct dp_netdev_pmd_thread *pmd,
+odp_port_t port_no)
+{
+struct tx_port *p;
+
+if (port_no < ARRAY_SIZE(pmd->send_netdev_cache)) {
+return pmd->send_netdev_cache[port_no];
+}
+
+p = pmd_send_port_cache_lookup(pmd, port_no);
+if (p) {
+return p->port->netdev;
+}
+return NULL;
+}
+
 static inline int
 dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
   odp_port_t port_no OVS_UNUSED,
   struct dp_packet *packet,
   struct dp_netdev_flow **flow)
 {
-struct tx_port *p OVS_UNUSED;
+struct netdev *netdev OVS_UNUSED;
 uint32_t mark;
 
 #ifdef ALLOW_EXPERIMENTAL_API /* Packet restoration API required. */
 /* Restore the packet if HW processing was terminated before completion. */
-p = pmd_send_port_cache_lookup(pmd, port_no);
-if (OVS_LIKELY(p)) {
-int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
+netdev = pmd_netdev_cache_lookup(pmd, port_no);
+if (OVS_LIKELY(netdev)) {
+int err = netdev_hw_miss_packet_recover(netdev, packet);
 
 if (err && err != EOPNOTSUPP) {
 COVERAGE_INC(datapath_drop_hw_miss_recover);
-- 
2.28.0.2311.g225365fb51

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH V7 13/13] netdev-dpdk-offload: Add vxlan pattern matching function.

2021-07-07 Thread Ilya Maximets
On 7/6/21 3:03 PM, Van Haaren, Harry wrote:
>> -Original Message-
>> From: Ilya Maximets 
>> Sent: Friday, July 2, 2021 2:34 PM
>> To: Van Haaren, Harry ; Eli Britstein
>> ; d...@openvswitch.org; Ilya Maximets 
>> Cc: Ivan Malov ; Majd Dibbiny 
>> Subject: Re: [ovs-dev] [PATCH V7 13/13] netdev-dpdk-offload: Add vxlan 
>> pattern
>> matching function.
>>
>> On 7/1/21 6:36 PM, Van Haaren, Harry wrote:
> 
> 
> 
>>> Is there a build failure in OVS master branch since this has been merged?
>>
>> I didn't notice any build failures.  The main factor here is that you just
>> can't build with DPDK without -Wno-cast-align anyway with neither modern gcc
>> nor clang. 
> 
> Actually, yes it is possible to build OVS with DPDK, I've just tested this 
> with a fresh clone:
> git clone https://github.com/openvswitch/ovs (commit 780b2bde8 on master)
> cd ovs
> ./boot.sh
> ./configure --enable-Werror --with-dpdk=static
> make
> 
> The above steps *fail to compile* due to the code changes introduced in this 
> patch.
> The above steps *pass* if the code here (and in another vxlan offload 
> commit) is fixed.
> (See void* casting suggestion below, which was used to fix the compile issues 
> here).
> 
> If DPDK has issues in cast-align, then those should be addressed in that 
> community.
> Saying that it's OK to introduce issues around cast-align in OVS because DPDK 
> has them
> is not a good argument. If I could point out a DPDK segfault, would it be OK 
> to introduce
> segfaulting code in OVS? No. Bugs originating from cast-align warnings are no 
> different.

I'm not saying that we should introduce issues existed in other projects,
I'm just saying that it's very easy to miss this issue because of flags
that needs to be passed in order to build OVS.  I have no compilers that
can build this code without -fno-cast-align, same for our CI.

> 
> 
>> So, this should not be an issue.
> 
> This is an issue, and the fact that an OVS maintainer is downplaying breaking 
> the build 
> when the issue is pointed out is frustrating.

I don't have a system where this change leads to a build failure, same
for our CI.  I understand that certain compilers might handle things
differently and I'm not suggesting to keep things as is.  If you didn't
notice, I suggested to fix that in may email below.

> 
> 
>> gcc 10.3.1 on fedora generates myriads of cast-align warnings in DPDK 
>> headers:
>>
>> ./configure CC=gcc --enable-Werror --with-dpdk=static
> 
> Perhaps on the system being tested there, but this works fine here. If there 
> are issues in
> DPDK, then file a bug there.

DPDK project is fully aware of this and has no intention to fix.  And they
have -fno-cast-align in their CI systems.

> Do not use potential issues in other projects to try cover up
> having introduced code that breaks the OVS build.

I didn't suggest that.

> 
> 
> 
> 
>>> It seems the above casts (ovs_be32 *) are causing issues with increasing
>> alignment?
>>>
>>> From DPDK's struct rte_flow_item_vxlan, the vni is a "uint8_t vni[3];"
>>>
>>> The VNI is not a 32-bit value, so is not required to be aligned on 32-bits. 
>>> The cast
>> here
>>> technically requires the alignment of the casted pointer to be 32-bit/4 
>>> byte, which
>> is
>>> not guaranteed to be true. I think the compiler rightly flags a cast 
>>> alignment issue?
>>
>> I briefly inspected the resulted binary and I didn't notice any actual
>> changes in alignments, so this, likely, doesn't cause any real problems,
>> at least on x86.  But, I agree that we, probably, should fix that to
>> have a more correct code.
> 
> I agree that the code on x86-64 is likely to be the same with a (void *) cast.
> 
> I disagree that we "probably, should" fix this. Its an issue, it must be 
> fixed.
> It breaks the build when compiled with -Werror enabled (as shown above).

When I'm saying "probably, should", I usually mean "we need to".
That is just a more mild way to put that.  Sorry for my English,
if that wasn't clear.

> 
> 
>>> Casting through a (void *) might solve, or else using some of the BE/LE 
>>> conversion
>>> and alignment helpers in byte-order.h and unaligned.h could perhaps work?
>>
>> I think, get_unaligned_be32() is a way to go here. 
> 
> A (void *) cast will likely have no impact on resulting ASM, but just mutes 
> the warning.
> get_unaligned_be32() may result in 4x byte loads and shifts and ORs, instead 
> of a 4-byte load.

So why you're suggesting to just silence the warning instead of
fixing the problem here?  In the same way you may just ignore
the warning with -fno-cast-align.  Please, be consistent.

> 
> 
>> Feel free to submit a patch.
> 
> I have no intention of fixing bugs introduced in a patch that was merged 
> while there
> were open outstanding issues to be resolved.  Consider this email a 
> "Reported-by".

I have no issues with Reported-by and nobody should be obligated
to fix the issue that they found, that is perfectly fine.

Best regards, Ilya Maximets.

> 
> For anybo

[ovs-dev] [PATCH v2 ovn] Don't suppress localport traffic directed to external port

2021-07-07 Thread Ihar Hrachyshka
Recently, we stopped leaking localport traffic through localnet ports
into fabric to avoid unnecessary flipping between chassis hosting the
same localport.

Despite the type name, in some scenarios localports are supposed to
talk outside the hosting chassis. Specifically, in OpenStack [1]
metadata service for SR-IOV ports is implemented as a localport hosted
on another chassis that is exposed to the chassis owning the SR-IOV
port through an "external" port. In this case, "leaking" localport
traffic into fabric is desirable.

This patch inserts a higher priority flow per external port on the
same datapath that avoids dropping localport traffic.

Fixes: 96959e56d634 ("physical: do not forward traffic from localport
to a localnet one")

[1] https://docs.openstack.org/neutron/latest/admin/ovn/sriov.html

Signed-off-by: Ihar Hrachyshka 

--

v1: initial version
v2: fixed code for unbound external ports
v2: rebased
---
 controller/physical.c | 50 +++
 tests/ovn.at  | 54 +++
 2 files changed, 104 insertions(+)

diff --git a/controller/physical.c b/controller/physical.c
index 17ca5afbb..9235c48f4 100644
--- a/controller/physical.c
+++ b/controller/physical.c
@@ -920,6 +920,7 @@ get_binding_peer(struct ovsdb_idl_index 
*sbrec_port_binding_by_name,
 
 static void
 consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
+  const struct sbrec_port_binding_table *pb_table,
   enum mf_field_id mff_ovn_geneve,
   const struct simap *ct_zones,
   const struct sset *active_tunnels,
@@ -1281,6 +1282,52 @@ consider_port_binding(struct ovsdb_idl_index 
*sbrec_port_binding_by_name,
 ofctrl_add_flow(flow_table, OFTABLE_CHECK_LOOPBACK, 160,
 binding->header_.uuid.parts[0], &match,
 ofpacts_p, &binding->header_.uuid);
+
+/* Localport traffic directed to external is *not* local. */
+const struct sbrec_port_binding *peer;
+SBREC_PORT_BINDING_TABLE_FOR_EACH (peer, pb_table) {
+if (strcmp(peer->type, "external")) {
+continue;
+}
+if (!peer->chassis) {
+continue;
+}
+if (peer->datapath->tunnel_key != dp_key) {
+continue;
+}
+if (strcmp(peer->chassis->name, chassis->name)) {
+continue;
+}
+
+ofpbuf_clear(ofpacts_p);
+for (int i = 0; i < MFF_N_LOG_REGS; i++) {
+put_load(0, MFF_REG0 + i, 0, 32, ofpacts_p);
+}
+put_resubmit(OFTABLE_LOG_EGRESS_PIPELINE, ofpacts_p);
+
+for (int i = 0; i < peer->n_mac; i++) {
+char *err_str;
+struct eth_addr peer_mac;
+if ((err_str = str_to_mac(peer->mac[i], &peer_mac))) {
+VLOG_WARN("Parsing MAC failed for external port: %s, "
+"with error: %s", peer->logical_port, err_str);
+free(err_str);
+continue;
+}
+
+match_init_catchall(&match);
+match_set_metadata(&match, htonll(dp_key));
+match_set_reg(&match, MFF_LOG_OUTPORT - MFF_REG0,
+  port_key);
+match_set_reg_masked(&match, MFF_LOG_FLAGS - MFF_REG0,
+ MLF_LOCALPORT, MLF_LOCALPORT);
+match_set_dl_dst(&match, peer_mac);
+
+ofctrl_add_flow(flow_table, OFTABLE_CHECK_LOOPBACK, 170,
+binding->header_.uuid.parts[0], &match,
+ofpacts_p, &binding->header_.uuid);
+}
+}
 }
 
 } else if (!tun && !is_ha_remote) {
@@ -1504,6 +1551,7 @@ physical_handle_port_binding_changes(struct physical_ctx 
*p_ctx,
 ofctrl_remove_flows(flow_table, &binding->header_.uuid);
 }
 consider_port_binding(p_ctx->sbrec_port_binding_by_name,
+  p_ctx->port_binding_table,
   p_ctx->mff_ovn_geneve, p_ctx->ct_zones,
   p_ctx->active_tunnels,
   p_ctx->local_datapaths,
@@ -1684,6 +1732,7 @@ physical_run(struct physical_ctx *p_ctx,
 const struct sbrec_port_binding *binding;
 SBREC_PORT_BINDING_TABLE_FOR_EACH (binding, p_ctx->port_binding_table) {
 consider_port_binding(p_ctx->sbrec_port_binding_by_name,
+  p_ctx->port_binding_table,
   p_ctx->mff_ovn_geneve, p_ctx->ct_zones,
  

[ovs-dev] [PATCH ovn] controller: Avoid unnecessary load balancer flow processing.

2021-07-07 Thread Dumitru Ceara
Whenever a Load_Balancer is updated, e.g., a VIP is added, the following
sequence of events happens:

1. The Southbound Load_Balancer record is updated.
2. The Southbound Datapath_Binding records on which the Load_Balancer is
   applied are updated.
3. Southbound ovsdb-server sends updates about the Load_Balancer and
   Datapath_Binding records to ovn-controller.
4. The IDL layer in ovn-controller processes the updates at #3, but
   because of the SB schema references between tables [0] all logical
   flows referencing the updated Datapath_Binding are marked as
   "updated".  The same is true for Logical_DP_Group records
   referencing the Datapath_Binding, and also for all logical flows
   pointing to the new "updated" datapath groups.
5. ovn-controller ends up recomputing (removing/readding) all flows for
   all these tracked updates.

From the SB Schema:
"Datapath_Binding": {
"columns": {
[...]
"load_balancers": {"type": {"key": {"type": "uuid",
   "refTable": "Load_Balancer",
   "refType": "weak"},
"min": 0,
"max": "unlimited"}},
[...]
"Load_Balancer": {
"columns": {
"datapaths": {
[...]
"type": {"key": {"type": "uuid",
 "refTable": "Datapath_Binding"},
 "min": 0, "max": "unlimited"}},
[...]
"Logical_DP_Group": {
"columns": {
"datapaths":
{"type": {"key": {"type": "uuid",
  "refTable": "Datapath_Binding",
  "refType": "weak"},
  "min": 0, "max": "unlimited"}}},
[...]
"Logical_Flow": {
"columns": {
"logical_datapath":
{"type": {"key": {"type": "uuid",
  "refTable": "Datapath_Binding"},
  "min": 0, "max": 1}},
"logical_dp_group":
{"type": {"key": {"type": "uuid",
  "refTable": "Logical_DP_Group"},

In order to avoid this unnecessary Logical_Flow notification storm we
now remove the explicit reference from Datapath_Binding to
Load_Balancer and instead store raw UUIDs.

This means that on the ovn-controller side we need to perform a
Load_Balancer table lookup by UUID whenever a new datapath is added,
but that doesn't happen too often and the cost of the lookup is
negligible compared to the huge cost of processing the unnecessary
logical flow updates.

This change is backwards compatible because the contents stored in the
database are not changed, just that the schema constraints are relaxed a
bit.

Some performance measurements, on a scale test deployment simulating an
ovn-kubernetes deployment with 120 nodes and a large load balancer
with 16K VIPs associated to each node's logical switch, the event
processing loop time in ovn-controller, when adding a new VIP, is
reduced from ~39 seconds to ~8 seconds.

There's no need to change the northd DDlog implementation.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1978605
Signed-off-by: Dumitru Ceara 
---
 controller/lflow.c  |  6 --
 controller/lflow.h  |  2 ++
 controller/ovn-controller.c |  2 ++
 lib/inc-proc-eng.h  |  1 +
 northd/ovn-northd.c | 14 ++
 ovn-sb.ovsschema|  6 ++
 6 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/controller/lflow.c b/controller/lflow.c
index 60aa011ff..877bd49b0 100644
--- a/controller/lflow.c
+++ b/controller/lflow.c
@@ -1744,8 +1744,10 @@ lflow_processing_end:
 /* Add load balancer hairpin flows if the datapath has any load balancers
  * associated. */
 for (size_t i = 0; i < dp->n_load_balancers; i++) {
-consider_lb_hairpin_flows(dp->load_balancers[i],
-  l_ctx_in->local_datapaths,
+const struct sbrec_load_balancer *lb =
+sbrec_load_balancer_get_for_uuid(l_ctx_in->sb_idl,
+ &dp->load_balancers[i]);
+consider_lb_hairpin_flows(lb, l_ctx_in->local_datapaths,
   l_ctx_out->flow_table);
 }
 
diff --git a/controller/lflow.h b/controller/lflow.h
index c17ff6dd4..5ba7af894 100644
--- a/controller/lflow.h
+++ b/controller/lflow.h
@@ -40,6 +40,7 @@
 #include "openvswitch/list.h"
 
 struct ovn_extend_table;
+struct ovsdb_idl;
 struct ovsdb_idl_index;
 struct ovn_desired_flow_table;
 struct hmap;
@@ -126,6 +127,7 @@ void lflow_resource_destroy(struct lflow_resource_ref *);
 void lflow_resource_clear(struct lflow_resource_ref *);
 
 struct lflow_ctx_in {
+stru

Re: [ovs-dev] [PATCH V7 00/13] Netdev vxlan-decap offload

2021-07-07 Thread Ilya Maximets
On 7/6/21 3:34 PM, Van Haaren, Harry wrote:
>> -Original Message-
>> From: Ilya Maximets 
>> Sent: Thursday, July 1, 2021 11:32 AM
>> To: Van Haaren, Harry ; Ilya Maximets
>> 
>> Cc: Eli Britstein ; ovs dev ; Ivan 
>> Malov
>> ; Majd Dibbiny ; Stokes, Ian
>> ; Ferriter, Cian ; Ben Pfaff
>> ; Balazs Nemeth ; Sriharsha Basavapatna
>> 
>> Subject: Re: [ovs-dev] [PATCH V7 00/13] Netdev vxlan-decap offload
>>
>> On 6/29/21 1:53 PM, Van Haaren, Harry wrote:
 -Original Message-
 From: Ilya Maximets 
 Sent: Monday, June 28, 2021 3:33 PM
 To: Van Haaren, Harry ; Ilya Maximets
 ; Sriharsha Basavapatna
 
 Cc: Eli Britstein ; ovs dev ; Ivan
>> Malov
 ; Majd Dibbiny ; Stokes, Ian
 ; Ferriter, Cian ; Ben Pfaff
 ; Balazs Nemeth 
 Subject: Re: [ovs-dev] [PATCH V7 00/13] Netdev vxlan-decap offload

 On 6/25/21 7:28 PM, Van Haaren, Harry wrote:
>> -Original Message-
>> From: dev  On Behalf Of Ilya Maximets
>> Sent: Friday, June 25, 2021 4:26 PM
>> To: Sriharsha Basavapatna ; Ilya
 Maximets
>> 
>> Cc: Eli Britstein ; ovs dev ; 
>> Ivan
 Malov
>> ; Majd Dibbiny 
>> Subject: Re: [ovs-dev] [PATCH V7 00/13] Netdev vxlan-decap offload
>
> 
>
 That looks good to me.  So, I guess, Harsha, we're waiting for
 your review/tests here.
>>>
>>> Thanks Ilya and Eli, looks good to me; I've also tested it and it works 
>>> fine.
>>> -Harsha
>>
>> Thanks, everyone.  Applied to master.
>
> Hi Ilya and OVS Community,
>
> There are open questions around this patchset, why has it been merged?
>
> Earlier today, new concerns were raised by Cian around the negative
>> performance
 impact of these code changes:
> - https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384445.html
>
> Both you (Ilya) and Eli responded, and I was following the conversation. 
> Various
 code changes were suggested,
> and some may seem like they might work, Eli mentioned some solutions might
>> not
 work due to the hardware:
> I was processing both your comments and input, and planning a technical 
> reply
 later today.
> - suggestions: https://mail.openvswitch.org/pipermail/ovs-dev/2021-
 June/384446.html
> - concerns around hw: https://mail.openvswitch.org/pipermail/ovs-dev/2021-
 June/384464.html

 Concerns not really about the hardware, but the API itself
 that should be clarified a little bit to avoid confusion and
 avoid incorrect changes like the one I suggested.
 But this is a small enhancement that could be done on top.

>
> Keep in mind that there are open performance issues to be worked out, that
>> have
 not been resolved at this point in the conversation.

 Performance issue that can be worked out, will be worked out
 in a separate patch , v1 for which we already have on a mailing
 list for some time, so it didn't make sense to to re-validate
 the whole series again due to this one pretty obvious change.

> There is no agreement on solutions, nor an agreement to ignore the
>> performance
 degradation, or to try resolve this degradation later.

 Particular part of the packet restoration call seems hard
 to avoid in a long term (I don't see a good solution for that),
 but the short term solution might be implemented on top.
 The part with multiple reads of recirc_id and checking if
 offloading is enabled has a fix already (that needs a v2, but
 anyway).

>
> That these patches have been merged is inappropriate:
> 1) Not enough time given for responses (11 am concerns raised, 5pm merged
 without resolution? (Irish timezone))

 I responded with suggestions and arguments against solutions
 suggested in the report, Eli responded with rejection of one
 one of my suggestions.  And it seems clear (for me) that
 there is no good solution for this part at the moment.
 Part of the performance could be won back, but the rest
 seems to be inevitable.  As a short-term solution we can
 guard the netdev_hw_miss_packet_recover() with experimental
 API ifdef, but it will strike back anyway in the future.

> 2) Open question not addressed/resolved, resulting in a 6% known negative
 performance impact being merged.

 I don't think it wasn't addressed.
>>>
>>> Was code merged that resulted in a known regression of 6%?  Yes. Facts are 
>>> facts.
>>> I don't care for arguing over exactly what "addressed" means in this 
>>> context.
>>>
>>>
> 3) Suggestions provided were not reviewed technically in detail (no 
> technical
 collaboration or code-changes/patches reviewed)

 Patches was heavily reviewed/tested by at least 4 different
 parties including 2 test rounds from Intel engineers that,
 I believe, included testing of partial 

Re: [ovs-dev] [v6 00/11] MFEX Infrastructure + Optimizations

2021-07-07 Thread Van Haaren, Harry
> -Original Message-
> From: Stokes, Ian 
> Sent: Wednesday, July 7, 2021 3:38 PM
> To: Eelco Chaudron ; Van Haaren, Harry
> ; Amber, Kumar 
> Cc: Amber, Kumar ; Ferriter, Cian
> ; ovs-dev@openvswitch.org; f...@sysclose.org;
> i.maxim...@ovn.org
> Subject: RE: [v6 00/11] MFEX Infrastructure + Optimizations
> 
> > On 7 Jul 2021, at 12:13, Van Haaren, Harry wrote:
> >
> > > Hi All,
> > >
> > > This thread has dissolved into unnecessary time-wasting on nitpick 
> > > changes.
> > There is no
> > > technical issue with uint32_t, so this patch remains as is, and this 
> > > should be
> > accepted for merge.
> > >
> > > If you feel differently, reply to this with a detailed description of a 
> > > genuine
> > technical bug.
> >
> > Reviews are not only about technical correctness but also about coding style
> > and consistency.
> > In this case the dp_packet_batch_size() API returns a size_t, so we should 
> > try
> to
> > use it.
> >
> +1

OK, lets work on a viable solution that OVS build-system & the compilers 
support.

> > But I leave it to the maintainers to decide if they accept this as is, or 
> > not :)
> 
> If the standard is to use size_t then I believe we should stick with that. I 
> think it's
> a separate conversation with regards to if size_t should be used and is best 
> for
> the dp_packet_batch_size API and probably beyond the scope of this patch
> series.

Agree that it's beyond scope of patchset. 

> For the moment I'd prefer to see a solution using size_t.

OK, based on coding-style.rst, the PRIuSIZE format spec. will be used.

> Regards
> Ian
> 
> > //Eelco

Thanks for input all.


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 07/11] test/sytem-dpdk: Add unit test for mfex autovalidator

2021-07-07 Thread Eelco Chaudron




On 6 Jul 2021, at 15:11, Cian Ferriter wrote:


From: Kumar Amber 

Tests:
  6: OVS-DPDK - MFEX Autovalidator
  7: OVS-DPDK - MFEX Autovalidator Fuzzy

Added a new directory to store the PCAP file used
in the tests and a script to generate the fuzzy traffic
type pcap to be used in fuzzy unit test.

Signed-off-by: Kumar Amber 

---

v5:
- fix review comments(Ian, Flavio, Eelco)
- remove sleep from first test and added minor 5 sec sleep to fuzzy
---
---
 Documentation/topics/dpdk/bridge.rst |  55 
+++

 tests/automake.mk|   5 +++
 tests/mfex_fuzzy.py  |  32 
 tests/pcap/mfex_test | Bin 0 -> 416 bytes
 tests/system-dpdk.at |  46 ++
 5 files changed, 138 insertions(+)
 create mode 100755 tests/mfex_fuzzy.py
 create mode 100644 tests/pcap/mfex_test

diff --git a/Documentation/topics/dpdk/bridge.rst 
b/Documentation/topics/dpdk/bridge.rst

index 8495687e8..8a8ef3782 100644
--- a/Documentation/topics/dpdk/bridge.rst
+++ b/Documentation/topics/dpdk/bridge.rst
@@ -341,3 +341,58 @@ A compile time option is available in order to 
test it with the OVS unit

 test suite. Use the following configure option ::

 $ ./configure --enable-mfex-default-autovalidator
+
+Unit Test Miniflow Extract
+++
+
+Unit test can also be used to test the workflow mentioned above by 
running

+the following test-case in tests/system-dpdk.at ::
+
+make check-dpdk TESTSUITEFLAGS='-k MFEX'
+OVS-DPDK - MFEX Autovalidator
+
+The unit test uses multiple traffic types to test the correctness of 
the

+implementations.
+
+Running Fuzzy test with Autovalidator
++
+
+Fuzzy tests can also be done on miniflow extract with the help of
+auto-validator and Scapy. The steps below describe how to
+reproduce the setup with IP being fuzzed to generate packets.
+
+Scapy is used to create fuzzy IP packets and save them into a PCAP ::
+
+pkt = fuzz(Ether()/IP()/TCP())
+
+Set the miniflow extract to autovalidator using ::
+
+$ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator
+
+OVS is configured to receive the generated packets ::
+
+$ ovs-vsctl add-port br0 pcap0 -- \
+set Interface pcap0 type=dpdk options:dpdk-devargs=net_pcap0
+"rx_pcap=fuzzy.pcap"
+
+With this workflow, the autovalidator will ensure that all MFEX
+implementations are classifying each packet in exactly the same way.
+If an optimized MFEX implementation causes a different miniflow to be
+generated, the autovalidator has ovs_assert and logging statements 
that

+will inform about the issue.
+
+Unit Fuzzy test with Autovalidator
++
+
+The prerequisite before running the unit test is to run the script 
provided ::

+
+tests/mfex_fuzzy.py
+
+This script generates a pcap with multiple types of fuzzed packets to 
be used

+in the below unit test-case.
+
+Unit test can also be used to test the workflow mentioned above by 
running

+the following test-case in tests/system-dpdk.at ::
+
+make check-dpdk TESTSUITEFLAGS='-k MFEX'
+OVS-DPDK - MFEX Autovalidator Fuzzy
diff --git a/tests/automake.mk b/tests/automake.mk
index f45f8d76c..e94ccd27c 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -143,6 +143,11 @@ $(srcdir)/tests/fuzz-regression-list.at: 
tests/automake.mk

echo "TEST_FUZZ_REGRESSION([$$basename])"; \
done > $@.tmp && mv $@.tmp $@

+EXTRA_DIST += $(MFEX_AUTOVALIDATOR_TESTS)
+MFEX_AUTOVALIDATOR_TESTS = \
+   tests/pcap/mfex_test \
+   tests/mfex_fuzzy.py
+
 OVSDB_CLUSTER_TESTSUITE_AT = \
tests/ovsdb-cluster-testsuite.at \
tests/ovsdb-execution.at \
diff --git a/tests/mfex_fuzzy.py b/tests/mfex_fuzzy.py
new file mode 100755
index 0..a8051ba2b
--- /dev/null
+++ b/tests/mfex_fuzzy.py
@@ -0,0 +1,32 @@
+#!/usr/bin/python3
+try:
+   from scapy.all import *
+except ModuleNotFoundError as err:
+   print(err + ": Scapy")
+import sys
+import os
+
+path = os.environ['OVS_DIR'] + "/tests/pcap/fuzzy"


This is failing in my setup, as OVS_DIR is not defined:


Is it something you set manually? As from make check-dpdk it fails:

7. system-dpdk.at:260: testing OVS-DPDK - MFEX Autovalidator Fuzzy ...
DEPRECATION: The default format will switch to columns in the future. 
You can use --format=(legacy|columns) (or define a 
format=(legacy|columns) in your pip.conf

under the [list] section) to disable this warning.
scapy (2.4.4)
./system-dpdk.at:263: $PYTHON3 $srcdir/mfex_fuzzy.py
--- /dev/null   2021-07-02 08:15:52.158758028 -0400
+++ 
/root/Documents/Scratch/ovs_review_mfex/OVS_master_DPDK_v20.11.1/ovs_github/tests/system-dpdk-testsuite.dir/at-groups/7/stderr	2021-07-07 
10:34:09.364877

754 -0400
@@ -0,0 +1,6 @@
+Traceback (most recent call last):
+  File "../.././mfex_fuzzy.py", line 10, in 
+path = os.environ['OVS_DIR'] + "/tests/pcap/fuzzy"
+  File "/usr/lib64/p

Re: [ovs-dev] [PATCH v5 3/3] dpif-netlink: Introduce per-cpu upcall dispatch

2021-07-07 Thread Flavio Leitner
On Wed, Jul 07, 2021 at 04:43:21AM -0400, Mark Gray wrote:
> The Open vSwitch kernel module uses the upcall mechanism to send
> packets from kernel space to user space when it misses in the kernel
> space flow table. The upcall sends packets via a Netlink socket.
> Currently, a Netlink socket is created for every vport. In this way,
> there is a 1:1 mapping between a vport and a Netlink socket.
> When a packet is received by a vport, if it needs to be sent to
> user space, it is sent via the corresponding Netlink socket.
> 
> This mechanism, with various iterations of the corresponding user
> space code, has seen some limitations and issues:
> 
> * On systems with a large number of vports, there is correspondingly
> a large number of Netlink sockets which can limit scaling.
> (https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
> * Packet reordering on upcalls.
> (https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
> * A thundering herd issue.
> (https://bugzilla.redhat.com/show_bug.cgi?id=183)
> 
> This patch introduces an alternative, feature-negotiated, upcall
> mode using a per-cpu dispatch rather than a per-vport dispatch.
> 
> In this mode, the Netlink socket to be used for the upcall is
> selected based on the CPU of the thread that is executing the upcall.
> In this way, it resolves the issues above as:
> 
> a) The number of Netlink sockets scales with the number of CPUs
> rather than the number of vports.
> b) Ordering per-flow is maintained as packets are distributed to
> CPUs based on mechanisms such as RSS and flows are distributed
> to a single user space thread.
> c) Packets from a flow can only wake up one user space thread.
> 
> Reported-at: https://bugzilla.redhat.com/1844576
> Signed-off-by: Mark Gray 
> ---

Acked-by: Flavio Leitner 

Thanks Mark!
fbl

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v14 05/11] dpif-netdev: Add command to switch dpif implementation.

2021-07-07 Thread Flavio Leitner
Hi,

Please see my comments below.

On Thu, Jul 01, 2021 at 04:06:13PM +0100, Cian Ferriter wrote:
> From: Harry van Haaren 
> 
> This commit adds a new command to allow the user to switch
> the active DPIF implementation at runtime. A probe function
> is executed before switching the DPIF implementation, to ensure
> the CPU is capable of running the ISA required. For example, the
> below code will switch to the AVX512 enabled DPIF assuming
> that the runtime CPU is capable of running AVX512 instructions:
> 
>  $ ovs-appctl dpif-netdev/dpif-impl-set dpif_avx512
> 
> A new configuration flag is added to allow selection of the
> default DPIF. This is useful for running the unit-tests against
> the available DPIF implementations, without modifying each unit test.
> 
> The design of the testing & validation for ISA optimized DPIF
> implementations is based around the work already upstream for DPCLS.
> Note however that a DPCLS lookup has no state or side-effects, allowing
> the auto-validator implementation to perform multiple lookups and
> provide consistent statistic counters.
> 
> The DPIF component does have state, so running two implementations in
> parallel and comparing output is not a valid testing method, as there
> are changes in DPIF statistic counters (side effects). As a result, the
> DPIF is tested directly against the unit-tests.
> 
> Signed-off-by: Harry van Haaren 
> Co-authored-by: Cian Ferriter 
> Signed-off-by: Cian Ferriter 
> 
> ---
> 
> v14:
> - Change command name to dpif-impl-set
> - Fix the order of includes to what is layed out in the coding-style.rst
> - Use bool not int to capture return value of dpdk_get_cpu_has_isa()
> - Use an enum to index DPIF impls array.
> - Hide more of the dpif impl details from lib/dpif-netdev.c.
> - Fix comment on *dp_netdev_input_func() typedef.
> - Rename dp_netdev_input_func func to input_func.
> - Remove the datapath or dp argument from the dpif-impl-set CMD.
> - Set the DPIF function pointer atomically.
> 
> v13:
> - Add Docs items about the switch DPIF command here rather than in
>   later commit.
> - Document operation in manpages as well as rST.
> - Minor code refactoring to address review comments.
> ---
>  Documentation/topics/dpdk/bridge.rst |  34 
>  acinclude.m4 |  15 
>  configure.ac |   1 +
>  lib/automake.mk  |   1 +
>  lib/dpif-netdev-avx512.c |  14 +++
>  lib/dpif-netdev-private-dpif.c   | 122 +++
>  lib/dpif-netdev-private-dpif.h   |  47 +++
>  lib/dpif-netdev-private-thread.h |  10 ---
>  lib/dpif-netdev-unixctl.man  |   3 +
>  lib/dpif-netdev.c|  74 ++--
>  10 files changed, 306 insertions(+), 15 deletions(-)
>  create mode 100644 lib/dpif-netdev-private-dpif.c
> 
> diff --git a/Documentation/topics/dpdk/bridge.rst 
> b/Documentation/topics/dpdk/bridge.rst
> index 526d5c959..06d1f943c 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -214,3 +214,37 @@ implementation ::
>  
>  Compile OVS in debug mode to have `ovs_assert` statements error out if
>  there is a mis-match in the DPCLS lookup implementation.
> +
> +Datapath Interface Performance
> +--
> +
> +The datapath interface (DPIF) or dp_netdev_input() is responsible for taking
> +packets through the major components of the userspace datapath; such as
> +miniflow_extract, EMC, SMC and DPCLS lookups, and a lot of the performance
> +stats associated with the datapath.
> +
> +Just like with the SIMD DPCLS feature above, SIMD can be applied to the DPIF 
> to
> +improve performance.
> +
> +By default, dpif_scalar is used. The DPIF implementation can be selected by
> +name ::
> +
> +$ ovs-appctl dpif-netdev/dpif-impl-set dpif_avx512
> +DPIF implementation set to dpif_avx512.
> +
> +$ ovs-appctl dpif-netdev/dpif-impl-set dpif_scalar
> +DPIF implementation set to dpif_scalar.
> +
> +Running Unit Tests with AVX512 DPIF
> +~~~
> +
> +Since the AVX512 DPIF is disabled by default, a compile time option is
> +available in order to test it with the OVS unit test suite. When building 
> with
> +a CPU that supports AVX512, use the following configure option ::
> +
> +$ ./configure --enable-dpif-default-avx512
> +
> +The following line should be seen in the configure output when the above 
> option
> +is used ::
> +
> +checking whether DPIF AVX512 is default implementation... yes
> diff --git a/acinclude.m4 b/acinclude.m4
> index 15a54d636..5fbcd9872 100644
> --- a/acinclude.m4
> +++ b/acinclude.m4
> @@ -30,6 +30,21 @@ AC_DEFUN([OVS_CHECK_DPCLS_AUTOVALIDATOR], [
>fi
>  ])
>  
> +dnl Set OVS DPIF default implementation at configure time for running the 
> unit
> +dnl tests on the whole codebase without modifying tests per DPIF impl
> +AC_DEFUN([OVS_CHECK_DPIF_AVX512_DEFAULT], [
> +  AC_ARG

Re: [ovs-dev] [v6 00/11] MFEX Infrastructure + Optimizations

2021-07-07 Thread Stokes, Ian
> On 7 Jul 2021, at 12:13, Van Haaren, Harry wrote:
> 
> > Hi All,
> >
> > This thread has dissolved into unnecessary time-wasting on nitpick changes.
> There is no
> > technical issue with uint32_t, so this patch remains as is, and this should 
> > be
> accepted for merge.
> >
> > If you feel differently, reply to this with a detailed description of a 
> > genuine
> technical bug.
> 
> Reviews are not only about technical correctness but also about coding style
> and consistency.
> In this case the dp_packet_batch_size() API returns a size_t, so we should 
> try to
> use it.
> 
+1

> But I leave it to the maintainers to decide if they accept this as is, or not 
> :)
> 

If the standard is to use size_t then I believe we should stick with that. I 
think it's a separate conversation with regards to if size_t should be used and 
is best for the dp_packet_batch_size API and probably beyond the scope of this 
patch series.

For the moment I'd prefer to see a solution using size_t.

Regards
Ian

> //Eelco
> 
> > Regards, -Harry
> >
> >
> > From: Amber, Kumar 
> > Sent: Wednesday, July 7, 2021 11:04 AM
> > To: Eelco Chaudron ; Van Haaren, Harry
> 
> > Cc: Ferriter, Cian ; ovs-dev@openvswitch.org;
> f...@sysclose.org; i.maxim...@ovn.org; Stokes, Ian 
> > Subject: RE: [v6 00/11] MFEX Infrastructure + Optimizations
> >
> > Hi Eelco,
> >
> >
> > I tried with the suggestion “zd” is deprecated and in place of it 
> > %"PRIdSIZE`` is
> mentioned which still causes build failure on non-ssl 32 bit builds.
> >
> > Regards
> > Amber
> >
> > From: Eelco Chaudron
> mailto:echau...@redhat.com>>
> > Sent: Wednesday, July 7, 2021 3:02 PM
> > To: Van Haaren, Harry
> mailto:harry.van.haa...@intel.com>>
> > Cc: Amber, Kumar
> mailto:kumar.am...@intel.com>> ; Ferriter, Cian
> mailto:cian.ferri...@intel.com>> ; ovs-
> d...@openvswitch.org ;
> f...@sysclose.org ;
> i.maxim...@ovn.org ; Stokes, Ian
> mailto:ian.sto...@intel.com>>
> > Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations
> >
> >
> > On 7 Jul 2021, at 11:09, Van Haaren, Harry wrote:
> >
> > -Original Message-
> > From: Eelco Chaudron
> mailto:echau...@redhat.com>>
> > Sent: Wednesday, July 7, 2021 9:35 AM
> > To: Amber, Kumar
> mailto:kumar.am...@intel.com>>
> > Cc: Ferriter, Cian 
> > mailto:cian.ferri...@intel.com>> ;
> ovs-dev@openvswitch.org ;
> > f...@sysclose.org ;
> i.maxim...@ovn.org ; Van Haaren, Harry
> > mailto:harry.van.haa...@intel.com>> ; Stokes,
> Ian mailto:ian.sto...@intel.com>>
> > Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations
> >
> > On 6 Jul 2021, at 17:06, Amber, Kumar wrote:
> >
> > Hi Eelco ,
> >
> > Here is the diff vor v6 vs v5 :
> >
> > Patch 1 :
> >
> > diff --git a/lib/dpif-netdev-private-extract.c 
> > b/lib/dpif-netdev-private-extract.c
> > index 1aebf3656d..4987d628a4 100644
> > --- a/lib/dpif-netdev-private-extract.c
> > +++ b/lib/dpif-netdev-private-extract.c
> > @@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct
> >
> > dp_packet_batch *packets,
> >
> > uint32_t keys_size, odp_port_t in_port,
> > struct dp_netdev_pmd_thread *pmd_handle)
> > {
> > - const size_t cnt = dp_packet_batch_size(packets);
> > + const uint32_t cnt = dp_packet_batch_size(packets);
> > uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
> > uint16_t good_l3_ofs[NETDEV_MAX_BURST];
> > uint16_t good_l4_ofs[NETDEV_MAX_BURST];
> > @@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct
> >
> > dp_packet_batch *packets,
> >
> > atomic_uintptr_t *pmd_func = (void *)&pmd->miniflow_extract_opt;
> > atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
> > VLOG_ERR("Invalid key size supplied, Key_size: %d less than"
> > - "batch_size: %ld", keys_size, cnt);
> > + "batch_size: %d", keys_size, cnt);
> >
> > What was the reason for changing this size_t to uint32_t? Is see other
> instances
> > where %ld is used for logging?
> > And other functions like dp_netdev_run_meter() have it as a size_t?
> >
> > The reason to change this is because 32-bit builds were breaking due to
> incorrect
> > format-specifier in the printf. Root cause is because size_t requires 
> > different
> printf
> > format specifier based on 32 or 64 bit arch.
> >
> > (As you likely know, size_t is to describe objects in memory, or the return 
> > of
> sizeof operator.
> > Because 32-bit and 64-bit can have different amounts of memory, size_t can
> be "unsigned int"
> > or "unsigned long long").
> >
> > It does not make sense to me to use a type of variable that changes width
> based on
> > architecture to count batch size (a value from 0 to 32).
> >
> > Simplicity and obvious-ness is nice, and a uint32_t is always exactly what 
> > you
> read it to be,
> > and %d will always be correct for uint32_t regardless of 32 or 64 bit.
> >
> > We should not change this back to the more complex and error-prone

Re: [ovs-dev] [v6 06/11] dpif-netdev: Add packet count and core id paramters for study

2021-07-07 Thread Eelco Chaudron
Did not do a full review, just some small comments as this will change 
with the suggested -pmd option.


On 6 Jul 2021, at 15:11, Cian Ferriter wrote:


From: Kumar Amber 

This commit introduces an additional command line parameter
for the mfex study function. If the user provides an additional packet count,
it is used in study to compare the minimum packets which must be processed,
else a default value is chosen.
Also introduces a third parameter for choosing a particular pmd core.

$ ovs-appctl dpif-netdev/miniflow-parser-set study 500 3

Signed-off-by: Kumar Amber 

---

v5:
- fix review comments(Ian, Flavio, Eelco)
- introduce pmd core id parameter
---
---
 Documentation/topics/dpdk/bridge.rst | 35 +++-
 lib/dpif-netdev-extract-study.c  | 23 +++-
 lib/dpif-netdev-private-extract.c|  2 +-
 lib/dpif-netdev-private-extract.h|  9 
 lib/dpif-netdev.c| 79 
++--

 5 files changed, 139 insertions(+), 9 deletions(-)

diff --git a/Documentation/topics/dpdk/bridge.rst 
b/Documentation/topics/dpdk/bridge.rst

index c79e108b7..8495687e8 100644
--- a/Documentation/topics/dpdk/bridge.rst
+++ b/Documentation/topics/dpdk/bridge.rst
@@ -282,12 +282,43 @@ command also shows whether the CPU supports each 
implementation ::


 An implementation can be selected manually by the following command 
::


-$ ovs-appctl dpif-netdev/miniflow-parser-set study
+$ ovs-appctl dpif-netdev/miniflow-parser-set [name] [study_cnt] 
[core_id]

+
+The above command has two optional parameters, study_cnt and core_id, 
which may
+be set. The second parameter study_cnt is specific to 
study
+where how many packets needed to choose best implementation can be 
provided.
+Third parameter core_id can also be provided to set a particular 
miniflow
+extract function to a specific pmd thread on the core. In case of any 
other
+implementation other than study, the second parameter [study_cnt] can be 
provided

+with any arbitrary number and is ignored.

 Also user can select the study implementation which studies the 
traffic for
 a specific number of packets by applying all available implementaions 
of
 miniflow extract and than chooses the one with most optimal result 
for that

-traffic pattern.
+traffic pattern. A user can also provide an additional packet count 
parameter
+which is the minimum number of packets that OVS must study before 
choosing an
+optimal implementation. If no packet count is provided then the 
default value

+128 is chosen.


Not a native speaker, but I think the sentence need some commas?

“If no packet count is provided, then the default value,
128, is chosen.”


Also, as there is no synchronization point between threads, one
+PMD thread might still be running a previous round, and can now 
decide on

+earlier data.
+
+Study can be selected with packet count by the following command ::
+
+$ ovs-appctl dpif-netdev/miniflow-parser-set study 1024
+
+Study can be selected with packet count and explicit PMD selection
+by the following command ::
+
+$ ovs-appctl dpif-netdev/miniflow-parser-set study 1024 3
+
+In the above command the last parameter is the CORE ID of the PMD
+thread and this can also be used to explicitly set the miniflow
+extraction function pointer on different PMD threads.
+
+Scalar can be selected on core 3 by the following command where
+study count can be put as any arbitrary number::
+
+$ ovs-appctl dpif-netdev/miniflow-parser-set scalar 0 3

 Miniflow Extract Validation
 ~~~
diff --git a/lib/dpif-netdev-extract-study.c 
b/lib/dpif-netdev-extract-study.c

index 32b76bd03..9b36d1974 100644
--- a/lib/dpif-netdev-extract-study.c
+++ b/lib/dpif-netdev-extract-study.c
@@ -51,6 +51,27 @@ mfex_study_get_study_stats_ptr(void)
 return stats;
 }

+uint32_t mfex_set_study_pkt_cnt(uint32_t pkt_cmp_count,
+const char *name)
+{
+struct dpif_miniflow_extract_impl *miniflow_funcs;
+dpif_mfex_impl_info_get(&miniflow_funcs);
+
+/* If the packet count is set and implementation called is study 
then
+ * set packet counter to requested number else set the packet 
counter

+ * to default number.
+ */
+if ((strcmp(miniflow_funcs[MFEX_IMPL_STUDY].name, name) == 0) &&
+(pkt_cmp_count != 0)) {
+
+mfex_study_pkts_count = pkt_cmp_count;


Do we need to set/read this atomically?


+return 0;
+}
+
+mfex_study_pkts_count = MFEX_MAX_COUNT;
+return -EINVAL;
+}
+
 uint32_t
 mfex_study_traffic(struct dp_packet_batch *packets,
struct netdev_flow_key *keys,
@@ -86,7 +107,7 @@ mfex_study_traffic(struct dp_packet_batch *packets,
 /* Choose the best implementation after a minimum packets have 
been

  * processed.
  */
-if (stats->pkt_count >= MFEX_MAX_COUNT) {
+if (stats->pkt_count >= mfex_study_pkts_count) {
 uint32_t best_func_index = MFEX_IMPL_MAX;
 uint32_t max_hits = 0;
 for (int i =

[ovs-dev] [PATCH ovn] northd-ddlog: Fix IP family match for DNAT flows.

2021-07-07 Thread Dumitru Ceara
This was causing some IPv6 system tests to fail when run with
ovn-northd-ddlog.

Also fix cleanup of the northd process in system-ovn.at.  A few tests
were trying to stop ovn-northd (C version) even when run with
ovn-northd-ddlog.

Signed-off-by: Dumitru Ceara 
---
Note: There are some system-ovn.at tests that still fail with
ovn-northd-ddlog  and need more investigation to see if it's a
test issue or a real bug.
---
 northd/ovn_northd.dl |  2 +-
 tests/system-ovn.at  | 24 
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/northd/ovn_northd.dl b/northd/ovn_northd.dl
index e27c944a0..dea13a91f 100644
--- a/northd/ovn_northd.dl
+++ b/northd/ovn_northd.dl
@@ -5687,7 +5687,7 @@ for (r in &Router(._uuid = lr_uuid,
} in
 if (nat.nat.__type == "dnat" or nat.nat.__type == "dnat_and_snat") 
{
 None = l3dgw_port in
-var __match = "ip && ip4.dst == ${nat.nat.external_ip}" in
+var __match = "ip && ${ipX}.dst == ${nat.nat.external_ip}" in
 (var ext_ip_match, var ext_flow) = 
lrouter_nat_add_ext_ip_match(
 r, nat, __match, ipX, true, mask) in
 {
diff --git a/tests/system-ovn.at b/tests/system-ovn.at
index f42cfc0db..c01fde131 100644
--- a/tests/system-ovn.at
+++ b/tests/system-ovn.at
@@ -1348,7 +1348,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -3121,7 +3121,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -4577,7 +4577,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -4663,7 +4663,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -4903,7 +4903,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -5287,7 +5287,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -5717,7 +5717,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -5879,7 +5879,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -5928,7 +5928,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/failed to query port patch-.*/d
@@ -6021,7 +6021,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
@@ -6083,7 +6083,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
@@ -6234,7 +6234,7 @@ as ovn-nb
 OVS_APP_EXIT_AND_WAIT([ovsdb-server])
 
 as northd
-OVS_APP_EXIT_AND_WAIT([ovn-northd])
+OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 
 as
 OVS_TRAFFIC_VSWITCHD_STOP(["/.*error receiving.*/d
-- 
2.27.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 05/11] dpif-netdev: Add configure to enable autovalidator at build time.

2021-07-07 Thread Amber, Kumar
Hi Eelco,

Ah great catch :



+#ifdef MFEX_AUTOVALIDATOR_DEFAULT
+ VLOG_INFO("Default miniflow Extract implementation %s",
+ mfex_impls[MFEX_IMPL_AUTOVALIDATOR].name);
+ default_mfex_func = mfex_impls[MFEX_IMPL_AUTOVALIDATOR].extract_func;
+#else

The change above forces you to always use the autovalidator, even if you 
configure another option. I think this would make it impossible to run 
potential tests cases with the autovalidator as default.
I think this should probably be protected by ovsthread_once_start(), so it will 
only be run once at startup time. It might be even better to do this at init 
time? For example (did not test):


I think this should be protected by the :

if (!default_mfex_func) { }

that way we will only set it at default time

will fix it in v7.




Regards

Amber


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 05/11] dpif-netdev: Add configure to enable autovalidator at build time.

2021-07-07 Thread Eelco Chaudron




On 6 Jul 2021, at 15:11, Cian Ferriter wrote:


From: Kumar Amber 

This commit adds a new command to allow the user to enable
the autovalidator by default at build time, thus allowing for
running unit tests by default.

 $ ./configure --enable-mfex-default-autovalidator

Signed-off-by: Kumar Amber 
Co-authored-by: Harry van Haaren 
Signed-off-by: Harry van Haaren 

---

v5:
- fix review comments(Ian, Flavio, Eelco)
---
---
 Documentation/topics/dpdk/bridge.rst |  5 +
 NEWS |  5 +++--
 acinclude.m4 | 16 
 configure.ac |  1 +
 lib/dpif-netdev-private-extract.c|  8 +++-
 5 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/Documentation/topics/dpdk/bridge.rst 
b/Documentation/topics/dpdk/bridge.rst

index 2901e8096..c79e108b7 100644
--- a/Documentation/topics/dpdk/bridge.rst
+++ b/Documentation/topics/dpdk/bridge.rst
@@ -305,3 +305,8 @@ implementations provide the same results.
 To set the Miniflow autovalidator, use this command ::

 $ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator
+
+A compile time option is available in order to test it with the OVS 
unit

+test suite. Use the following configure option ::
+
+$ ./configure --enable-mfex-default-autovalidator
diff --git a/NEWS b/NEWS
index 275aa1868..608c5a32f 100644
--- a/NEWS
+++ b/NEWS
@@ -25,9 +25,11 @@ Post-v2.15.0
  * Add command line option to switch between mfex function 
pointers.
  * Add miniflow extract auto-validator function to compare 
different
miniflow extract implementations against default 
implementation.
-*  Add study function to miniflow function table which studies 
packet
+ * Add study function to miniflow function table which studies 
packet
and automatically chooses the best miniflow implementation for 
that

traffic.
+ * Add build time configure command to enable auto-validator as 
default

+   miniflow implementation at build time.
- ovs-ctl:
  * New option '--no-record-hostname' to disable hostname 
configuration

in ovsdb on startup.
@@ -44,7 +46,6 @@ Post-v2.15.0
  * New option '--election-timer' to the 'create-cluster' command 
to set the

leader election timer during cluster creation.

-
 v2.15.0 - 15 Feb 2021
 -
- OVSDB:
diff --git a/acinclude.m4 b/acinclude.m4
index 5fbcd9872..e2704cfda 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -14,6 +14,22 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+dnl Set OVS MFEX Autovalidator as default miniflow extract at compile 
time?

+dnl This enables automatically running all unit tests with all MFEX
+dnl implementations.
+AC_DEFUN([OVS_CHECK_MFEX_AUTOVALIDATOR], [
+  AC_ARG_ENABLE([mfex-default-autovalidator],
+
[AC_HELP_STRING([--enable-mfex-default-autovalidator], [Enable MFEX 
autovalidator as default miniflow_extract implementation.])],

+[autovalidator=yes],[autovalidator=no])
+  AC_MSG_CHECKING([whether MFEX Autovalidator is default 
implementation])

+  if test "$autovalidator" != yes; then
+AC_MSG_RESULT([no])
+  else
+OVS_CFLAGS="$OVS_CFLAGS -DMFEX_AUTOVALIDATOR_DEFAULT"
+AC_MSG_RESULT([yes])
+  fi
+])
+
 dnl Set OVS DPCLS Autovalidator as default subtable search at compile 
time?

 dnl This enables automatically running all unit tests with all DPCLS
 dnl implementations.
diff --git a/configure.ac b/configure.ac
index e45685a6c..46c402892 100644
--- a/configure.ac
+++ b/configure.ac
@@ -186,6 +186,7 @@ OVS_ENABLE_SPARSE
 OVS_CTAGS_IDENTIFIERS
 OVS_CHECK_DPCLS_AUTOVALIDATOR
 OVS_CHECK_DPIF_AVX512_DEFAULT
+OVS_CHECK_MFEX_AUTOVALIDATOR
 OVS_CHECK_BINUTILS_AVX512

 AC_ARG_VAR(KARCH, [Kernel Architecture String])
diff --git a/lib/dpif-netdev-private-extract.c 
b/lib/dpif-netdev-private-extract.c

index eaddeceaf..6ae91a24d 100644
--- a/lib/dpif-netdev-private-extract.c
+++ b/lib/dpif-netdev-private-extract.c
@@ -76,6 +76,12 @@ dpif_miniflow_extract_init(void)
 miniflow_extract_func
 dp_mfex_impl_get_default(void)
 {
+
+#ifdef MFEX_AUTOVALIDATOR_DEFAULT
+VLOG_INFO("Default miniflow Extract implementation %s",
+  mfex_impls[MFEX_IMPL_AUTOVALIDATOR].name);
+default_mfex_func = 
mfex_impls[MFEX_IMPL_AUTOVALIDATOR].extract_func;

+#else


The change above forces you to always use the autovalidator, even if you 
configure another option. I think this would make it impossible to run 
potential test cases with the autovalidator as default.
I think this should probably be protected by ovsthread_once_start(), so 
it will only be run once at startup time. It might be even better to do 
this at init time? For example (did not test):



102 void
103 dpif_miniflow_extract_init(void)
104 {
105 /* Call probe on each impl, and cache the result. */
106 uint32_t i;
107 for (i = 0; i < ARRAY_SIZE(mfex_impls); i++) {
108 bool

Re: [ovs-dev] [v6 03/11] dpif-netdev: Add study function to select the best mfex function

2021-07-07 Thread Amber, Kumar
Hi Eelco,

Pls find replies inline.



> > +uint32_t
> > +mfex_study_traffic(struct dp_packet_batch *packets,
> > +   struct netdev_flow_key *keys,
> > +   uint32_t keys_size, odp_port_t in_port,
> > +   struct dp_netdev_pmd_thread *pmd_handle) {
> > +uint32_t hitmask = 0;
> > +uint32_t mask = 0;
> > +struct dp_netdev_pmd_thread *pmd = pmd_handle;
> > +struct dpif_miniflow_extract_impl *miniflow_funcs;
> > +uint32_t impl_count = dpif_mfex_impl_info_get(&miniflow_funcs);
> 
> This function returns an -errno on failure, so we should test the return.
> 

Added check in v7.
> > +struct study_stats *stats = mfex_study_get_study_stats_ptr();
> > +
> > +/* Run traffic optimized miniflow_extract to collect the hitmask
> > + * to be compared after certain packets have been hit to choose
> > + * the best miniflow_extract version for that traffic.
> > + */
> > +for (int i = MFEX_IMPL_MAX; i < impl_count; i++) {
> 
> See comment on patch 2 on using an explicit minimum value.
> 
> > +if (miniflow_funcs[i].available) {
> 
> For consistency, and to safe one indent level, why not do the below, as you 
> did
> in your other patch:
> 
> if (!miniflow_funcs[i].available) {
>   Continue;
> }

Applied to v7.
> 
> > +hitmask = miniflow_funcs[i].extract_func(packets, keys, 
> > keys_size,
> > + in_port, pmd_handle);
> > +stats->impl_hitcount[i] += count_1bits(hitmask);
> > +
> > +/* If traffic is not classified then we dont overwrite the keys
> > + * array in minfiflow implementations so its safe to create a
> > + * mask for all those packets whose miniflow have been created.
> > + */
> > +mask |= hitmask;
> > +}
> > +}
> > +stats->pkt_count += dp_packet_batch_size(packets);
> > +
> > +/* Choose the best implementation after a minimum packets have been
> > + * processed.
> > + */
> > +if (stats->pkt_count >= MFEX_MAX_COUNT) {
> > +uint32_t best_func_index = MFEX_IMPL_MAX;
> 
> See comment on MFEX_IMPL_START_IDX for above and for loop below.
> 

Fixed in v7.

> > +uint32_t max_hits = 0;
> > +for (int i = MFEX_IMPL_MAX; i < impl_count; i++) {
> > +if (stats->impl_hitcount[i] > max_hits) {
> > +max_hits = stats->impl_hitcount[i];
> > +best_func_index = i;
> > +}
> > +}
> > +
> > +/* If 50% of the packets hit, enable the function. */
> > +if (max_hits >= (mfex_study_pkts_count / 2)) {
> > +miniflow_extract_func mf_func =
> > +miniflow_funcs[best_func_index].extract_func;
> > +atomic_uintptr_t *pmd_func = (void 
> > *)&pmd->miniflow_extract_opt;
> > +atomic_store_relaxed(pmd_func, (uintptr_t) mf_func);
> > +VLOG_INFO("MFEX study chose impl %s: (hits %d/%d pkts)",
> > +  miniflow_funcs[best_func_index].name, max_hits,
> > +  stats->pkt_count);
> > +} else {
> > +/* Set the implementation to null for default miniflow. */
> > +miniflow_extract_func mf_func =
> > +miniflow_funcs[MFEX_IMPL_SCALAR].extract_func;
> > +atomic_uintptr_t *pmd_func = (void 
> > *)&pmd->miniflow_extract_opt;
> > +atomic_store_relaxed(pmd_func, (uintptr_t) mf_func);
> > +VLOG_INFO("Not enough packets matched (%d/%d), disabling"
> > +  " optimized MFEX.", max_hits, stats->pkt_count);
> > +}
> 
> I still would like to see the hits for all hits when debugging is enabled.
> Maybe something like
>
> if (VLOG_IS_DBG_ENABLED()) {
>   for each imp in i:
>VLOG_DBG(“MFEX study results for implementation %s: (hits %d/%d pkts)",
>   miniflow_funcs[i].name, stats->impl_hitcount[i],
>   stats->pkt_count);
> 
> }
> 

Applied in v7.
> > +/* Reset stats so that study function can be called again
> > + * for next traffic type and optimal function ptr can be
> > + * chosen.
> > + */
> > +memset(stats, 0, sizeof(struct study_stats));
> > +}
> > +return mask;
> > +}
> > diff --git a/lib/dpif-netdev-private-extract.c
> > b/lib/dpif-netdev-private-extract.c
> > index 62170ff6c..eaddeceaf 100644
> > --- a/lib/dpif-netdev-private-extract.c
> > +++ b/lib/dpif-netdev-private-extract.c
> > @@ -47,6 +47,11 @@ static struct dpif_miniflow_extract_impl mfex_impls[] =
> {
> >  .probe = NULL,
> >  .extract_func = NULL,
> >  .name = "scalar", },
> > +
> > +[MFEX_IMPL_STUDY] = {
> > +.probe = NULL,
> > +.extract_func = mfex_study_traffic,
> > +.name = "study", },
> >  };
> >
> >  BUILD_ASSERT_DECL(MFEX_IMPL_MAX >= ARRAY_SIZE(mfex_impls)); @@ -
> 88,7
> > +9

Re: [ovs-dev] [v14 04/11] dpif-avx512: Add ISA implementation of dpif.

2021-07-07 Thread Flavio Leitner
On Thu, Jul 01, 2021 at 04:06:12PM +0100, Cian Ferriter wrote:
> From: Harry van Haaren 
> 
> This commit adds the AVX512 implementation of DPIF functionality,
> specifically the dp_netdev_input_outer_avx512 function. This function
> only handles outer (no re-circulations), and is optimized to use the
> AVX512 ISA for packet batching and other DPIF work.
> 
> Sparse is not able to handle the AVX512 intrinsics, causing compile
> time failures, so it is disabled for this file.
> 
> Signed-off-by: Harry van Haaren 
> Co-authored-by: Cian Ferriter 
> Signed-off-by: Cian Ferriter 
> Co-authored-by: Kumar Amber 
> Signed-off-by: Kumar Amber 
> 
> ---

Thanks for addressing all the previous comments.

It seems that if we had used an emc_hit along with the existing
smc_hit, the code would be easier to read, instead of adding/removing
bitmasks from an emc_smc variable.

Acked-by: Flavio Leitner 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 04/11] docs/dpdk/bridge: add miniflow extract section.

2021-07-07 Thread Eelco Chaudron



On 6 Jul 2021, at 15:11, Cian Ferriter wrote:

> From: Kumar Amber 
>
> This commit adds a section to the dpdk/bridge.rst netdev documentation,
> detailing the added miniflow functionality. The newly added commands are
> documented, and sample output is provided.
>
> The use of auto-validator and special study function is also described
> in detail as well as running fuzzy tests.
>
> Signed-off-by: Kumar Amber 
> Co-authored-by: Cian Ferriter 
> Signed-off-by: Cian Ferriter 
> Co-authored-by: Harry van Haaren 
> Signed-off-by: Harry van Haaren 
>
> ---
>
> v5:
> - fix review comments(Ian, Flavio, Eelco)
> ---
> ---
>  Documentation/topics/dpdk/bridge.rst | 49 
>  1 file changed, 49 insertions(+)
>
> diff --git a/Documentation/topics/dpdk/bridge.rst 
> b/Documentation/topics/dpdk/bridge.rst
> index 2d0850836..2901e8096 100644
> --- a/Documentation/topics/dpdk/bridge.rst
> +++ b/Documentation/topics/dpdk/bridge.rst
> @@ -256,3 +256,52 @@ The following line should be seen in the configure 
> output when the above option
>  is used ::
>
>  checking whether DPIF AVX512 is default implementation... yes
> +
> +Miniflow Extract
> +
> +
> +Miniflow extract (MFEX) performs parsing of the raw packets and extracts the
> +important header information into a compressed miniflow. This miniflow is
> +composed of bits and blocks where the bits signify which blocks are set or
> +have values where as the blocks hold the metadata, ip, udp, vlan, etc. These
> +values are used by the datapath for switching decisions later.

I think this text should also explain why we have different miniflow extract 
functions, i.e. that a miniflow extract is traffic specific to speed up the 
lookup, whereas the scalar works for ALL traffic patterns.

> +
> +Most modern CPUs have SIMD capabilities. These SIMD instructions are able
> +to process a vector rather than act on one single data. OVS provides multiple
> +implementations of miniflow extract. This allows the user to take advantage
> +of SIMD instructions like AVX512 to gain additional performance.
> +
> +A list of implementations can be obtained by the following command. The
> +command also shows whether the CPU supports each implementation ::
> +
> +$ ovs-appctl dpif-netdev/miniflow-parser-get
> +Available Optimized Miniflow Extracts:

Not sure how it will look in v7, but the actual output on v6 shows:

$ ovs-appctl dpif-netdev/miniflow-parser-get
Available MFEX implementations:
  autovalidator (available: True, pmds: none)
  scalar (available: True, pmds: 1,15)
  study (available: True, pmds: none)


> +autovalidator (available: True)(pmds: none)
> +scalar (available: True)(pmds: 3)
> +study (available: True)(pmds: none)
> +
> +An implementation can be selected manually by the following command ::
> +
> +$ ovs-appctl dpif-netdev/miniflow-parser-set study
> +
> +Also user can select the study implementation which studies the traffic for
> +a specific number of packets by applying all available implementaions of
> +miniflow extract and than chooses the one with most optimal result for that
> +traffic pattern.
> +
> +Miniflow Extract Validation
> +~~~
> +
> +As multiple versions of miniflow extract can co-exist, each with different
> +CPU ISA optimizations, it is important to validate that they all give the
> +exact same results. To easily test all miniflow implementations, an
> +``autovalidator`` implementation of the miniflow exists. This implementation
> +runs all other available miniflow extract implementations, and verifies that
> +the results are identical.
> +
> +Running the OVS unit tests with the autovalidator enabled ensures all
> +implementations provide the same results.
> +
> +To set the Miniflow autovalidator, use this command ::
> +
> +$ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator
> -- 
> 2.32.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 03/11] dpif-netdev: Add study function to select the best mfex function

2021-07-07 Thread Eelco Chaudron


On 6 Jul 2021, at 15:11, Cian Ferriter wrote:

> From: Kumar Amber 
>
> The study function runs all the available implementations
> of miniflow_extract and makes a choice whose hitmask has
> maximum hits and sets the mfex to that function.
>
> Study can be run at runtime using the following command:
>
> $ ovs-appctl dpif-netdev/miniflow-parser-set study
>
> Signed-off-by: Kumar Amber 
> Co-authored-by: Harry van Haaren 
> Signed-off-by: Harry van Haaren 
>
> ---
>
> v5:
> - fix review comments(Ian, Flavio, Eelco)
> - add Atomic set in study
> ---
> ---
>  NEWS  |   3 +
>  lib/automake.mk   |   1 +
>  lib/dpif-netdev-extract-study.c   | 124 ++
>  lib/dpif-netdev-private-extract.c |  19 -
>  lib/dpif-netdev-private-extract.h |  20 +
>  5 files changed, 165 insertions(+), 2 deletions(-)
>  create mode 100644 lib/dpif-netdev-extract-study.c
>
> diff --git a/NEWS b/NEWS
> index ccf9a0f1e..275aa1868 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -25,6 +25,9 @@ Post-v2.15.0
>   * Add command line option to switch between mfex function pointers.
>   * Add miniflow extract auto-validator function to compare different
> miniflow extract implementations against default implementation.
> +*  Add study function to miniflow function table which studies packet
> +   and automatically chooses the best miniflow implementation for that
> +   traffic.
> - ovs-ctl:
>   * New option '--no-record-hostname' to disable hostname configuration
> in ovsdb on startup.
> diff --git a/lib/automake.mk b/lib/automake.mk
> index 6657b9ae5..5223d321b 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -107,6 +107,7 @@ lib_libopenvswitch_la_SOURCES = \
>   lib/dp-packet.h \
>   lib/dp-packet.c \
>   lib/dpdk.h \
> + lib/dpif-netdev-extract-study.c \
>   lib/dpif-netdev-lookup.h \
>   lib/dpif-netdev-lookup.c \
>   lib/dpif-netdev-lookup-autovalidator.c \
> diff --git a/lib/dpif-netdev-extract-study.c b/lib/dpif-netdev-extract-study.c
> new file mode 100644
> index 0..32b76bd03
> --- /dev/null
> +++ b/lib/dpif-netdev-extract-study.c
> @@ -0,0 +1,124 @@
> +/*
> + * Copyright (c) 2021 Intel.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "dpif-netdev-private-thread.h"
> +#include "openvswitch/vlog.h"
> +#include "ovs-thread.h"
> +
> +VLOG_DEFINE_THIS_MODULE(dpif_mfex_extract_study);
> +
> +/* Max count of packets to be compared. */
> +#define MFEX_MAX_COUNT (128)
> +
> +static uint32_t mfex_study_pkts_count = 0;
> +
> +/* Struct to hold miniflow study stats. */
> +struct study_stats {
> +uint32_t pkt_count;
> +uint32_t impl_hitcount[MFEX_IMPL_MAX];
> +};
> +
> +/* Define per thread data to hold the study stats. */
> +DEFINE_PER_THREAD_MALLOCED_DATA(struct study_stats *, study_stats);
> +
> +/* Allocate per thread PMD pointer space for study_stats. */
> +static inline struct study_stats *
> +mfex_study_get_study_stats_ptr(void)
> +{
> +struct study_stats *stats = study_stats_get();
> +if (OVS_UNLIKELY(!stats)) {
> +   stats = xzalloc(sizeof *stats);
> +   study_stats_set_unsafe(stats);
> +}
> +return stats;
> +}
> +
> +uint32_t
> +mfex_study_traffic(struct dp_packet_batch *packets,
> +   struct netdev_flow_key *keys,
> +   uint32_t keys_size, odp_port_t in_port,
> +   struct dp_netdev_pmd_thread *pmd_handle)
> +{
> +uint32_t hitmask = 0;
> +uint32_t mask = 0;
> +struct dp_netdev_pmd_thread *pmd = pmd_handle;
> +struct dpif_miniflow_extract_impl *miniflow_funcs;
> +uint32_t impl_count = dpif_mfex_impl_info_get(&miniflow_funcs);

This function returns an -errno on failure, so we should test the return.

> +struct study_stats *stats = mfex_study_get_study_stats_ptr();
> +
> +/* Run traffic optimized miniflow_extract to collect the hitmask
> + * to be compared after certain packets have been hit to choose
> + * the best miniflow_extract version for that traffic.
> + */
> +for (int i = MFEX_IMPL_MAX; i < impl_count; i++) {

See comment on patch 2 on using an explicit minimum value.

> +if (miniflow_funcs[i].available) {

For consistency, and to save one indent level, why not do the below, as you did 
in your other patch:

if (!miniflow_fun

Re: [ovs-dev] [v14 03/11] dpif-netdev: Add function pointer for netdev input.

2021-07-07 Thread Flavio Leitner
On Thu, Jul 01, 2021 at 04:06:11PM +0100, Cian Ferriter wrote:
> From: Harry van Haaren 
> 
> This commit adds a function pointer to the pmd thread data structure,
> giving the pmd thread flexibility in its dpif-input function choice.
> This allows choosing of the implementation based on ISA capabilities
> of the runtime CPU, leading to optimizations and higher performance.
> 
> Signed-off-by: Harry van Haaren 
> Co-authored-by: Cian Ferriter 
> Signed-off-by: Cian Ferriter 
> 
> ---

Acked-by: Flavio Leitner 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 02/11] dpif-netdev: Add auto validation function for miniflow extract

2021-07-07 Thread Amber, Kumar
Hi Eelco,

Again, it looks like the email format is broken. I will snip the mail
Pls find the replies inline.



+ uint32_t batch_failed = 0;
+ /* Iterate through each version of miniflow implementations. */
+ for (int j = MFEX_IMPL_MAX; j < MFEX_IMPL_MAX; j++) {

This is really confusing ;) Maybe we could define an enum for the first 
implementation, as you had in your previous patch, MFEX_IMPL_START_IDX?
So that even if someone changes the enum order it’s clear? Maybe with a 
separate #define or doing something like:

enum dpif_miniflow_extract_impl_idx {
MFEX_IMPL_AUTOVALIDATOR,
MFEX_IMPL_SCALAR,
MFEX_IMPL_START_IDX,
MFEX_IMPL_MAX = MFEX_IMPL_START_IDX
};

I have decided to create a new define MFEX_IMPL_START_IDX  that is cleaner and 
nicer to implement. Applied to v7.

+ if ((j < MFEX_IMPL_MAX) || (!mfex_impls[j].available)) {

Why the (j < MFEX_IMPL_MAX) so the below code is always skipped? So the 
validator is not executed ever?

This was introduced to avoid seg faults if someone tried to use this without 
the AVX flag enabled.

+ continue;
+ }
+
+ /* Reset keys and offsets before each implementation. */
+
+ uint32_t failed = 0;
+
+ struct ds log_msg = DS_EMPTY_INITIALIZER;
+ ds_put_format(&log_msg, "mfex autovalidator pkt %d\n", i);

Capital MFEX?

Applied to v7.

+ /* Check miniflow bits are equal. */
+ if ((keys[i].mf.map.bits[0] != test_keys[i].mf.map.bits[0]) ||
+ (keys[i].mf.map.bits[1] != test_keys[i].mf.map.bits[1])) {

In this code we assume FLOWMAP_UNITS == 2; can we add a static assert to make 
sure this does not change?
There is also flowmap_equal() which might be better, but the assert might still 
be needed for the below logging.

Introduced  BUILD_ASSERT_DECL(FLOWMAP_UNITS == 2);

+ ds_put_format(&log_msg, "Good 0x%llx 0x%llx\tTest 0x%llx"

Think here we need some more details like for the ones below:

  dd_put_format(&log_msg, "Autovalidation map 
failed\n”

   “Good 0x%llx 
0x%llx\tTest 0x%llx”



Applied to v7

+ " 0x%llx\n", keys[i].mf.map.bits[0],
+ keys[i].mf.map.bits[1],
+ test_keys[i].mf.map.bits[0],
+ test_keys[i].mf.map.bits[1]);
+ failed = 1;
+ }
+
+ if (!miniflow_equal(&keys[i].mf, &test_keys[i].mf)) {
+ uint32_t block_cnt = miniflow_n_values(&keys[i].mf);
+ ds_put_format(&log_msg, "Autovalidation blocks failed for %s"
+ "pkt %d\nGood hex:\n", mfex_impls[j].name, i);

To save on log file content, we can remove the packet and name, as it’s already 
logged on error in the if(failed) condition below.
So it will become:

ds_put_format(&log_msg, "Autovalidation blocks failed\n”
"nGood hex:\n", mfex_impls[j].name, i);

Applied to v7.

+ ds_put_hex_dump(&log_msg, &keys[i].buf, block_cnt * 8, 0,
+ false);
+ ds_put_format(&log_msg, "Test hex:\n");
+ ds_put_hex_dump(&log_msg, &test_keys[i].buf, block_cnt * 8, 0,
+ false);
+ failed = 1;
+ }
+
+ packet = packets->packets[i];
+ if ((packet->l2_pad_size != good_l2_pad_size[i]) ||
+ (packet->l2_5_ofs != good_l2_5_ofs[i]) ||
+ (packet->l3_ofs != good_l3_ofs[i]) ||
+ (packet->l4_ofs != good_l4_ofs[i])) {
+ ds_put_format(&log_msg, "Autovalidation packet offsets failed"
+ " for %s pkt %d\n", mfex_impls[j].name, i);

To save on log file content, we can remove the packet and name, as it’s already 
logged on error in the if(failed) condition below.

Applied to v7.

+ ds_put_format(&log_msg, "Good offsets: l2_pad_size %u,"
+ " l2_5_ofs : %u l3_ofs %u, l4_ofs %u\n",




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function for miniflow extract

2021-07-07 Thread Stokes, Ian
> > -Original Message-
> > From: Eelco Chaudron 
> > Sent: Wednesday, July 7, 2021 10:41 AM
> > To: Van Haaren, Harry 
> > Cc: Amber, Kumar ; d...@openvswitch.org;
> > i.maxim...@ovn.org; Flavio Leitner ; Stokes, Ian
> > 
> > Subject: Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function 
> > for
> > miniflow extract
> >
> >
> >
> > On 7 Jul 2021, at 11:33, Van Haaren, Harry wrote:
> 
> 
> 
> > > By removing scalar DPIF enabling of MFEX opt pointer (details below) we
> > remove any
> > > urgency on benchmark results?
> >
> > I’ll wrap up the review first, and hopefully, when you are working on 
> > potential
> > changes, I can run the tests and get some results.
> 
> As we're nearing the merge dates, I'd prefer to focus on getting merged.
> To help review & merge, v7 will contain the following patch split change:
> 
> Scalar DPIF usage of the MFEX Optimized function is now in its own patch at
> the end of the series. This allows all other MFEX patches to be merged, 
> without
> any hazard to scalar DPIF datapath performance.
> 
> 
> > I understand now what you meant with disabling it in the scalar part, so if 
> > I still
> > see 1%+ deltas I’ll try it out.
> 
> Eelco's testing results can inform the inclusion of Scalar DPIF usage of the 
> MFEX
> function pointer. As this enabling is now in a separate patch at the end of 
> the
> series, it means that the patch can be easily merged, or not merged. No 
> rebasing
> or rework required.
> 
> If the main MFEX code is ready for merge before the testing results are in, 
> this
> allows the merge of MFEX. Scalar enabling can be merged later in the 2.16
> merge window if desired, or re-visited in a future release.

+1 to this approach. If there is more discussion needed then let's keep it 
separate for the moment as described above and not block the main series.

Regards
Ian
> 
> 
> 
> Regards, -Harry

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function for miniflow extract

2021-07-07 Thread Van Haaren, Harry
> -Original Message-
> From: Eelco Chaudron 
> Sent: Wednesday, July 7, 2021 10:41 AM
> To: Van Haaren, Harry 
> Cc: Amber, Kumar ; d...@openvswitch.org;
> i.maxim...@ovn.org; Flavio Leitner ; Stokes, Ian
> 
> Subject: Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function 
> for
> miniflow extract
> 
> 
> 
> On 7 Jul 2021, at 11:33, Van Haaren, Harry wrote:



> > By removing scalar DPIF enabling of MFEX opt pointer (details below) we
> remove any
> > urgency on benchmark results?
> 
> I’ll wrap up the review first, and hopefully, when you are working on 
> potential
> changes, I can run the tests and get some results.

As we're nearing the merge dates, I'd prefer to focus on getting merged.
To help review & merge, v7 will contain the following patch split change:

Scalar DPIF usage of the MFEX Optimized function is now in its own patch at
the end of the series. This allows all other MFEX patches to be merged, without
any hazard to scalar DPIF datapath performance.


> I understand now what you meant with disabling it in the scalar part, so if I 
> still
> see 1%+ deltas I’ll try it out.

Eelco's testing results can inform the inclusion of Scalar DPIF usage of the 
MFEX
function pointer. As this enabling is now in a separate patch at the end of the
series, it means that the patch can be easily merged, or not merged. No rebasing
or rework required.

If the main MFEX code is ready for merge before the testing results are in, this
allows the merge of MFEX. Scalar enabling can be merged later in the 2.16
merge window if desired, or re-visited in a future release.



Regards, -Harry
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 00/11] MFEX Infrastructure + Optimizations

2021-07-07 Thread Eelco Chaudron


On 7 Jul 2021, at 12:13, Van Haaren, Harry wrote:

> Hi All,
>
> This thread has dissolved into unnecessary time-wasting on nitpick changes. 
> There is no
> technical issue with uint32_t, so this patch remains as is, and this should 
> be accepted for merge.
>
> If you feel differently, reply to this with a detailed description of a 
> genuine technical bug.

Reviews are not only about technical correctness but also about coding style 
and consistency.
In this case the dp_packet_batch_size() API returns a size_t, so we should try 
to use it.

But I leave it to the maintainers to decide if they accept this as is, or not :)

//Eelco

> Regards, -Harry
>
>
> From: Amber, Kumar 
> Sent: Wednesday, July 7, 2021 11:04 AM
> To: Eelco Chaudron ; Van Haaren, Harry 
> 
> Cc: Ferriter, Cian ; ovs-dev@openvswitch.org; 
> f...@sysclose.org; i.maxim...@ovn.org; Stokes, Ian 
> Subject: RE: [v6 00/11] MFEX Infrastructure + Optimizations
>
> Hi Eelco,
>
>
> I tried with the suggestion “zd” is deprecated and in place of it 
> %"PRIdSIZE`` is mentioned which still causes build failure on non-ssl 32 bit 
> builds.
>
> Regards
> Amber
>
> From: Eelco Chaudron mailto:echau...@redhat.com>>
> Sent: Wednesday, July 7, 2021 3:02 PM
> To: Van Haaren, Harry 
> mailto:harry.van.haa...@intel.com>>
> Cc: Amber, Kumar mailto:kumar.am...@intel.com>> ; 
> Ferriter, Cian mailto:cian.ferri...@intel.com>> ; 
> ovs-dev@openvswitch.org ; 
> f...@sysclose.org ; 
> i.maxim...@ovn.org ; Stokes, Ian 
> mailto:ian.sto...@intel.com>>
> Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations
>
>
> On 7 Jul 2021, at 11:09, Van Haaren, Harry wrote:
>
> -Original Message-
> From: Eelco Chaudron mailto:echau...@redhat.com>>
> Sent: Wednesday, July 7, 2021 9:35 AM
> To: Amber, Kumar mailto:kumar.am...@intel.com>>
> Cc: Ferriter, Cian mailto:cian.ferri...@intel.com>> 
> ; ovs-dev@openvswitch.org ;
> f...@sysclose.org ; 
> i.maxim...@ovn.org ; Van Haaren, Harry
> mailto:harry.van.haa...@intel.com>> ; Stokes, Ian 
> mailto:ian.sto...@intel.com>>
> Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations
>
> On 6 Jul 2021, at 17:06, Amber, Kumar wrote:
>
> Hi Eelco ,
>
> Here is the diff vor v6 vs v5 :
>
> Patch 1 :
>
> diff --git a/lib/dpif-netdev-private-extract.c 
> b/lib/dpif-netdev-private-extract.c
> index 1aebf3656d..4987d628a4 100644
> --- a/lib/dpif-netdev-private-extract.c
> +++ b/lib/dpif-netdev-private-extract.c
> @@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct
>
> dp_packet_batch *packets,
>
> uint32_t keys_size, odp_port_t in_port,
> struct dp_netdev_pmd_thread *pmd_handle)
> {
> - const size_t cnt = dp_packet_batch_size(packets);
> + const uint32_t cnt = dp_packet_batch_size(packets);
> uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
> uint16_t good_l3_ofs[NETDEV_MAX_BURST];
> uint16_t good_l4_ofs[NETDEV_MAX_BURST];
> @@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct
>
> dp_packet_batch *packets,
>
> atomic_uintptr_t *pmd_func = (void *)&pmd->miniflow_extract_opt;
> atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
> VLOG_ERR("Invalid key size supplied, Key_size: %d less than"
> - "batch_size: %ld", keys_size, cnt);
> + "batch_size: %d", keys_size, cnt);
>
> What was the reason for changing this size_t to uint32_t? Is see other 
> instances
> where %ld is used for logging?
> And other functions like dp_netdev_run_meter() have it as a size_t?
>
> The reason to change this is because 32-bit builds were breaking due to 
> incorrect
> format-specifier in the printf. Root cause is because size_t requires 
> different printf
> format specifier based on 32 or 64 bit arch.
>
> (As you likely know, size_t is to describe objects in memory, or the return 
> of sizeof operator.
> Because 32-bit and 64-bit can have different amounts of memory, size_t can be 
> "unsigned int"
> or "unsigned long long").
>
> It does not make sense to me to use a type of variable that changes width 
> based on
> architecture to count batch size (a value from 0 to 32).
>
> Simplicity and obvious-ness is nice, and a uint32_t is always exactly what 
> you read it to be,
> and %d will always be correct for uint32_t regardless of 32 or 64 bit.
>
> We should not change this back to the more complex and error-prone "size_t", 
> uint32_t is better.
>
> I don't think it’s more error-prone if the right type qualifier is used, i.e. 
> %zd. See also the coding style document, so I would suggest changing it to:
>
> @@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct 
> dp_packet_batch *packets,
> uint32_t keys_size, odp_port_t in_port,
> struct dp_netdev_pmd_thread *pmd_handle)
> {
>
>   *   const uint32_t cnt = dp_packet_batch_size(packets);
>
>   *   const size_t cnt = dp_packet_batch_size(packets);
> uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
> uint16_t good_l3_of

Re: [ovs-dev] [v6 02/11] dpif-netdev: Add auto validation function for miniflow extract

2021-07-07 Thread Eelco Chaudron



On 6 Jul 2021, at 15:11, Cian Ferriter wrote:


From: Kumar Amber 

This patch introduced the auto-validation function which
allows users to compare the batch of packets obtained from
different miniflow implementations against the linear
miniflow extract and return a hitmask.

The autovaidator function can be triggered at runtime using the
following command:

$ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator

Signed-off-by: Kumar Amber 
Co-authored-by: Harry van Haaren 
Signed-off-by: Harry van Haaren 

---

v5:
- fix review comments(Ian, Flavio, Eelco)
- remove ovs assert and switch to default after a batch of packets
  is processed
- Atomic set and get introduced
- fix raw_ctz for windows build
---
---
 NEWS  |   2 +
 lib/dpif-netdev-private-extract.c | 149 
++

 lib/dpif-netdev-private-extract.h |  13 +++
 lib/dpif-netdev.c |   2 +-
 4 files changed, 165 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index 60db823c4..ccf9a0f1e 100644
--- a/NEWS
+++ b/NEWS
@@ -23,6 +23,8 @@ Post-v2.15.0
CPU supports it. This enhances performance by using the native 
vpopcount

instructions, instead of the emulated version of vpopcount.
  * Add command line option to switch between mfex function 
pointers.
+ * Add miniflow extract auto-validator function to compare 
different
+   miniflow extract implementations against default 
implementation.

- ovs-ctl:
  * New option '--no-record-hostname' to disable hostname 
configuration

in ovsdb on startup.
diff --git a/lib/dpif-netdev-private-extract.c 
b/lib/dpif-netdev-private-extract.c

index f7ad2d5b5..62170ff6c 100644
--- a/lib/dpif-netdev-private-extract.c
+++ b/lib/dpif-netdev-private-extract.c
@@ -38,6 +38,11 @@ static miniflow_extract_func default_mfex_func = 
NULL;

  */
 static struct dpif_miniflow_extract_impl mfex_impls[] = {

+[MFEX_IMPL_AUTOVALIDATOR] = {
+.probe = NULL,
+.extract_func = dpif_miniflow_extract_autovalidator,
+.name = "autovalidator", },
+
 [MFEX_IMPL_SCALAR] = {
 .probe = NULL,
 .extract_func = NULL,
@@ -157,3 +162,147 @@ dp_mfex_impl_get_by_name(const char *name, 
miniflow_extract_func *out_func)


 return -EINVAL;
 }
+
+uint32_t
+dpif_miniflow_extract_autovalidator(struct dp_packet_batch *packets,
+struct netdev_flow_key *keys,
+uint32_t keys_size, odp_port_t 
in_port,
+struct dp_netdev_pmd_thread 
*pmd_handle)

+{
+const uint32_t cnt = dp_packet_batch_size(packets);
+uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
+uint16_t good_l3_ofs[NETDEV_MAX_BURST];
+uint16_t good_l4_ofs[NETDEV_MAX_BURST];
+uint16_t good_l2_pad_size[NETDEV_MAX_BURST];
+struct dp_packet *packet;
+struct dp_netdev_pmd_thread *pmd = pmd_handle;
+struct netdev_flow_key test_keys[NETDEV_MAX_BURST];
+
+if (keys_size < cnt) {
+miniflow_extract_func default_func = NULL;
+atomic_uintptr_t *pmd_func = (void 
*)&pmd->miniflow_extract_opt;

+atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
+VLOG_ERR("Invalid key size supplied, Key_size: %d less than"
+ "batch_size: %d", keys_size, cnt);
+return 0;
+}
+
+/* Run scalar miniflow_extract to get default result. */
+DP_PACKET_BATCH_FOR_EACH (i, packet, packets) {
+pkt_metadata_init(&packet->md, in_port);
+miniflow_extract(packet, &keys[i].mf);
+
+/* Store known good metadata to compare with optimized 
metadata. */

+good_l2_5_ofs[i] = packet->l2_5_ofs;
+good_l3_ofs[i] = packet->l3_ofs;
+good_l4_ofs[i] = packet->l4_ofs;
+good_l2_pad_size[i] = packet->l2_pad_size;
+}
+
+uint32_t batch_failed = 0;
+/* Iterate through each version of miniflow implementations. */
+for (int j = MFEX_IMPL_MAX; j < MFEX_IMPL_MAX; j++) {



This is really confusing ;) Maybe we could define an enum for the first 
implementation, as you had in your previous patch, MFEX_IMPL_START_IDX?
So that even if someone changes the enum order it’s clear? Maybe with 
a separate #define or doing something like:


 enum dpif_miniflow_extract_impl_idx {
 MFEX_IMPL_AUTOVALIDATOR,
 MFEX_IMPL_SCALAR,
 MFEX_IMPL_START_IDX,
 MFEX_IMPL_MAX = MFEX_IMPL_START_IDX
 };


+if ((j < MFEX_IMPL_MAX) || (!mfex_impls[j].available)) {


Why the (j < MFEX_IMPL_MAX) so the below code is always skipped? So the 
validator is not executed ever?



+continue;
+}
+
+/* Reset keys and offsets before each implementation. */
+memset(test_keys, 0, keys_size * sizeof(struct 
netdev_flow_key));

+DP_PACKET_BATCH_FOR_EACH (i, packet, packets) {
+dp_packet_reset_offsets(packet);
+}
+/* Call optimized miniflow for each batch of packet. */
+uint32_t 

Re: [ovs-dev] [v6 00/11] MFEX Infrastructure + Optimizations

2021-07-07 Thread Van Haaren, Harry
Hi All,

This thread has dissolved into unnecessary time-wasting on nitpick changes. 
There is no
technical issue with uint32_t, so this patch remains as is, and this should be 
accepted for merge.

If you feel differently, reply to this with a detailed description of a genuine 
technical bug.

Regards, -Harry


From: Amber, Kumar 
Sent: Wednesday, July 7, 2021 11:04 AM
To: Eelco Chaudron ; Van Haaren, Harry 

Cc: Ferriter, Cian ; ovs-dev@openvswitch.org; 
f...@sysclose.org; i.maxim...@ovn.org; Stokes, Ian 
Subject: RE: [v6 00/11] MFEX Infrastructure + Optimizations

Hi Eelco,


I tried with the suggestion “zd” is deprecated and in place of it %"PRIdSIZE`` 
is mentioned which still causes build failure on non-ssl 32 bit builds.

Regards
Amber

From: Eelco Chaudron mailto:echau...@redhat.com>>
Sent: Wednesday, July 7, 2021 3:02 PM
To: Van Haaren, Harry 
mailto:harry.van.haa...@intel.com>>
Cc: Amber, Kumar mailto:kumar.am...@intel.com>>; 
Ferriter, Cian mailto:cian.ferri...@intel.com>>; 
ovs-dev@openvswitch.org; 
f...@sysclose.org; 
i.maxim...@ovn.org; Stokes, Ian 
mailto:ian.sto...@intel.com>>
Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations


On 7 Jul 2021, at 11:09, Van Haaren, Harry wrote:

-Original Message-
From: Eelco Chaudron mailto:echau...@redhat.com>>
Sent: Wednesday, July 7, 2021 9:35 AM
To: Amber, Kumar mailto:kumar.am...@intel.com>>
Cc: Ferriter, Cian mailto:cian.ferri...@intel.com>>; 
ovs-dev@openvswitch.org;
f...@sysclose.org; 
i.maxim...@ovn.org; Van Haaren, Harry
mailto:harry.van.haa...@intel.com>>; Stokes, Ian 
mailto:ian.sto...@intel.com>>
Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations

On 6 Jul 2021, at 17:06, Amber, Kumar wrote:

Hi Eelco ,

Here is the diff for v6 vs v5:

Patch 1 :

diff --git a/lib/dpif-netdev-private-extract.c 
b/lib/dpif-netdev-private-extract.c
index 1aebf3656d..4987d628a4 100644
--- a/lib/dpif-netdev-private-extract.c
+++ b/lib/dpif-netdev-private-extract.c
@@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct

dp_packet_batch *packets,

uint32_t keys_size, odp_port_t in_port,
struct dp_netdev_pmd_thread *pmd_handle)
{
- const size_t cnt = dp_packet_batch_size(packets);
+ const uint32_t cnt = dp_packet_batch_size(packets);
uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
uint16_t good_l3_ofs[NETDEV_MAX_BURST];
uint16_t good_l4_ofs[NETDEV_MAX_BURST];
@@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct

dp_packet_batch *packets,

atomic_uintptr_t *pmd_func = (void *)&pmd->miniflow_extract_opt;
atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
VLOG_ERR("Invalid key size supplied, Key_size: %d less than"
- "batch_size: %ld", keys_size, cnt);
+ "batch_size: %d", keys_size, cnt);

What was the reason for changing this size_t to uint32_t? I see other instances
where %ld is used for logging?
And other functions like dp_netdev_run_meter() have it as a size_t?

The reason to change this is because 32-bit builds were breaking due to 
incorrect
format-specifier in the printf. Root cause is because size_t requires different 
printf
format specifier based on 32 or 64 bit arch.

(As you likely know, size_t is to describe objects in memory, or the return of 
sizeof operator.
Because 32-bit and 64-bit can have different amounts of memory, size_t can be 
"unsigned int"
or "unsigned long long").

It does not make sense to me to use a type of variable that changes width based 
on
architecture to count batch size (a value from 0 to 32).

Simplicity and obvious-ness is nice, and a uint32_t is always exactly what you 
read it to be,
and %d will always be correct for uint32_t regardless of 32 or 64 bit.

We should not change this back to the more complex and error-prone "size_t", 
uint32_t is better.

I don't think it’s more error-prone if the right type qualifier is used, i.e. 
%zd. See also the coding style document, so I would suggest changing it to:

@@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct dp_packet_batch 
*packets,
uint32_t keys_size, odp_port_t in_port,
struct dp_netdev_pmd_thread *pmd_handle)
{

  *   const uint32_t cnt = dp_packet_batch_size(packets);

  *   const size_t cnt = dp_packet_batch_size(packets);
uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
uint16_t good_l3_ofs[NETDEV_MAX_BURST];
uint16_t good_l4_ofs[NETDEV_MAX_BURST];

@@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct dp_packet_batch 
*packets,
atomic_uintptr_t *pmd_func = (void *)&pmd->miniflow_extract_opt;
atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
VLOG_ERR("Invalid key size supplied, Key_size: %d less than"

·"batch_size: %d", keys_size, cnt);

  *

·"batch_size: %"PRIdSIZE, keys_size, cnt);

·   return 0;

  *   }


___
dev mailing list
d...@openvswitch.org

Re: [ovs-dev] [v14 02/11] dpif-netdev: Split HWOL out to own header file.

2021-07-07 Thread Ferriter, Cian
Hi Flavio,

Thanks for your comment. My response is inline.

Cian

> -Original Message-
> From: Flavio Leitner 
> Sent: Wednesday 7 July 2021 00:06
> To: Ferriter, Cian 
> Cc: ovs-dev@openvswitch.org; i.maxim...@ovn.org
> Subject: Re: [ovs-dev] [v14 02/11] dpif-netdev: Split HWOL out to own header 
> file.
> 
> 
> Hi,
> 
> After the refactoring and rebasing, this patch doesn't seem
> necessary anymore. I don't see value in keeping it.
> Can we drop it? What do you think?
> 
> fbl
> 
> 

Good catch and agreed! This is no longer necessary. I'll drop it for the next 
version.

> On Thu, Jul 01, 2021 at 04:06:10PM +0100, Cian Ferriter wrote:
> > From: Harry van Haaren 
> >
> > This commit moves the datapath lookup functions required for
> > hardware offload to a separate file. This allows other DPIF
> > implementations to access the lookup functions, encouraging
> > code reuse.
> >
> > Signed-off-by: Harry van Haaren 
> >
> > ---
> >
> > Cc: Gaetan Rivet 
> > Cc: Sriharsha Basavapatna 
> >
> > v14:
> > - Fix spelling mistake in commit message.
> >
> > v13:
> > - Minor code refactor to address review comments.
> > ---
> >  lib/automake.mk|  1 +
> >  lib/dpif-netdev-private-hwol.h | 63 ++
> >  lib/dpif-netdev-private.h  |  1 +
> >  lib/dpif-netdev.c  | 38 ++--
> >  4 files changed, 67 insertions(+), 36 deletions(-)
> >  create mode 100644 lib/dpif-netdev-private-hwol.h
> >
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index fdba3c6c0..3a33cdd5c 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -115,6 +115,7 @@ lib_libopenvswitch_la_SOURCES = \
> > lib/dpif-netdev-private-dfc.h \
> > lib/dpif-netdev-private-dpcls.h \
> > lib/dpif-netdev-private-flow.h \
> > +   lib/dpif-netdev-private-hwol.h \
> > lib/dpif-netdev-private-thread.h \
> > lib/dpif-netdev-private.h \
> > lib/dpif-netdev-perf.c \
> > diff --git a/lib/dpif-netdev-private-hwol.h b/lib/dpif-netdev-private-hwol.h
> > new file mode 100644
> > index 0..b93297a74
> > --- /dev/null
> > +++ b/lib/dpif-netdev-private-hwol.h
> > @@ -0,0 +1,63 @@
> > +/*
> > + * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2015 Nicira, Inc.
> > + * Copyright (c) 2021 Intel Corporation.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + * http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef DPIF_NETDEV_PRIVATE_HWOL_H
> > +#define DPIF_NETDEV_PRIVATE_HWOL_H 1
> > +
> > +#include "dpif-netdev-private-flow.h"
> > +
> > +#define MAX_FLOW_MARK   (UINT32_MAX - 1)
> > +#define INVALID_FLOW_MARK   0
> > +/* Zero flow mark is used to indicate the HW to remove the mark. A packet
> > + * marked with zero mark is received in SW without a mark at all, so it
> > + * cannot be used as a valid mark.
> > + */
> > +
> > +struct megaflow_to_mark_data {
> > +const struct cmap_node node;
> > +ovs_u128 mega_ufid;
> > +uint32_t mark;
> > +};
> > +
> > +struct flow_mark {
> > +struct cmap megaflow_to_mark;
> > +struct cmap mark_to_flow;
> > +struct id_pool *pool;
> > +};
> > +
> > +/* allocated in dpif-netdev.c */
> > +extern struct flow_mark flow_mark;
> > +
> > +static inline struct dp_netdev_flow *
> > +mark_to_flow_find(const struct dp_netdev_pmd_thread *pmd,
> > +  const uint32_t mark)
> > +{
> > +struct dp_netdev_flow *flow;
> > +
> > +CMAP_FOR_EACH_WITH_HASH (flow, mark_node, hash_int(mark, 0),
> > + &flow_mark.mark_to_flow) {
> > +if (flow->mark == mark && flow->pmd_id == pmd->core_id &&
> > +flow->dead == false) {
> > +return flow;
> > +}
> > +}
> > +
> > +return NULL;
> > +}
> > +
> > +
> > +#endif /* dpif-netdev-private-hwol.h */
> > diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
> > index d7b6fd7ec..62e3616c1 100644
> > --- a/lib/dpif-netdev-private.h
> > +++ b/lib/dpif-netdev-private.h
> > @@ -30,5 +30,6 @@
> >  #include "dpif-netdev-private-dpcls.h"
> >  #include "dpif-netdev-private-dfc.h"
> >  #include "dpif-netdev-private-thread.h"
> > +#include "dpif-netdev-private-hwol.h"
> >
> >  #endif /* netdev-private.h */
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 2e29980c5..b9b10c6bb 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -18,6 +18,7 @@
> >  #include "dpif-netdev.h"
> >  #include "dpif-netdev-private.h"
> >  #include "dpi

Re: [ovs-dev] [v6 00/11] MFEX Infrastructure + Optimizations

2021-07-07 Thread Amber, Kumar
Hi Eelco,


I tried with the suggestion “zd” is deprecated and in place of it %"PRIdSIZE`` 
is mentioned which still causes build failure on non-ssl 32 bit builds.

Regards
Amber

From: Eelco Chaudron 
Sent: Wednesday, July 7, 2021 3:02 PM
To: Van Haaren, Harry 
Cc: Amber, Kumar ; Ferriter, Cian 
; ovs-dev@openvswitch.org; f...@sysclose.org; 
i.maxim...@ovn.org; Stokes, Ian 
Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations


On 7 Jul 2021, at 11:09, Van Haaren, Harry wrote:

-Original Message-
From: Eelco Chaudron mailto:echau...@redhat.com>>
Sent: Wednesday, July 7, 2021 9:35 AM
To: Amber, Kumar mailto:kumar.am...@intel.com>>
Cc: Ferriter, Cian mailto:cian.ferri...@intel.com>>; 
ovs-dev@openvswitch.org;
f...@sysclose.org; 
i.maxim...@ovn.org; Van Haaren, Harry
mailto:harry.van.haa...@intel.com>>; Stokes, Ian 
mailto:ian.sto...@intel.com>>
Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations

On 6 Jul 2021, at 17:06, Amber, Kumar wrote:

Hi Eelco ,

Here is the diff for v6 vs v5:

Patch 1 :

diff --git a/lib/dpif-netdev-private-extract.c 
b/lib/dpif-netdev-private-extract.c
index 1aebf3656d..4987d628a4 100644
--- a/lib/dpif-netdev-private-extract.c
+++ b/lib/dpif-netdev-private-extract.c
@@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct

dp_packet_batch *packets,

uint32_t keys_size, odp_port_t in_port,
struct dp_netdev_pmd_thread *pmd_handle)
{
- const size_t cnt = dp_packet_batch_size(packets);
+ const uint32_t cnt = dp_packet_batch_size(packets);
uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
uint16_t good_l3_ofs[NETDEV_MAX_BURST];
uint16_t good_l4_ofs[NETDEV_MAX_BURST];
@@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct

dp_packet_batch *packets,

atomic_uintptr_t *pmd_func = (void *)&pmd->miniflow_extract_opt;
atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
VLOG_ERR("Invalid key size supplied, Key_size: %d less than"
- "batch_size: %ld", keys_size, cnt);
+ "batch_size: %d", keys_size, cnt);

What was the reason for changing this size_t to uint32_t? I see other instances
where %ld is used for logging?
And other functions like dp_netdev_run_meter() have it as a size_t?

The reason to change this is because 32-bit builds were breaking due to 
incorrect
format-specifier in the printf. Root cause is because size_t requires different 
printf
format specifier based on 32 or 64 bit arch.

(As you likely know, size_t is to describe objects in memory, or the return of 
sizeof operator.
Because 32-bit and 64-bit can have different amounts of memory, size_t can be 
"unsigned int"
or "unsigned long long").

It does not make sense to me to use a type of variable that changes width based 
on
architecture to count batch size (a value from 0 to 32).

Simplicity and obvious-ness is nice, and a uint32_t is always exactly what you 
read it to be,
and %d will always be correct for uint32_t regardless of 32 or 64 bit.

We should not change this back to the more complex and error-prone "size_t", 
uint32_t is better.

I don't think it’s more error-prone if the right type qualifier is used, i.e. 
%zd. See also the coding style document, so I would suggest changing it to:

@@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct dp_packet_batch 
*packets,
uint32_t keys_size, odp_port_t in_port,
struct dp_netdev_pmd_thread *pmd_handle)
{

  *   const uint32_t cnt = dp_packet_batch_size(packets);

  *   const size_t cnt = dp_packet_batch_size(packets);
uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
uint16_t good_l3_ofs[NETDEV_MAX_BURST];
uint16_t good_l4_ofs[NETDEV_MAX_BURST];

@@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct dp_packet_batch 
*packets,
atomic_uintptr_t *pmd_func = (void *)&pmd->miniflow_extract_opt;
atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
VLOG_ERR("Invalid key size supplied, Key_size: %d less than"

·"batch_size: %d", keys_size, cnt);

  *

·"batch_size: %"PRIdSIZE, keys_size, cnt);

·   return 0;

  *   }


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function for miniflow extract

2021-07-07 Thread Eelco Chaudron


On 7 Jul 2021, at 11:33, Van Haaren, Harry wrote:

>> -Original Message-
>> From: Eelco Chaudron 
>> Sent: Wednesday, July 7, 2021 9:59 AM
>> To: Van Haaren, Harry 
>> Cc: Amber, Kumar ; d...@openvswitch.org;
>> i.maxim...@ovn.org; Flavio Leitner ; Stokes, Ian
>> 
>> Subject: Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function 
>> for
>> miniflow extract
>>
>>
>>
>> On 6 Jul 2021, at 15:58, Van Haaren, Harry wrote:
>>
 -Original Message-
 From: Eelco Chaudron 
 Sent: Friday, July 2, 2021 8:10 AM
 To: Van Haaren, Harry 
 Cc: Amber, Kumar ; d...@openvswitch.org;
 i.maxim...@ovn.org; Flavio Leitner ; Stokes, Ian
 
 Subject: Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation 
 function for
 miniflow extract



 On 1 Jul 2021, at 19:24, Van Haaren, Harry wrote:

>> -Original Message-
>> From: Eelco Chaudron 
>>>
>>> 
>>>
>> I’ll share the google sheet with you directly as it also has the config, 
>> and PVP
 results.
>
> I can't actually access that doc, sorry. Results above are enough to go 
> by for
>> now :)

 It’s attached.
>>>
>>> Thanks.
>>>
> We can investigate if there's any optimizations to be done to improve the
>> scalar DPIF
> enabling of the miniflow extract func ptr, but I'm not sure there is.
>>>
>>> Note the v6 of MFEX has some minor changes/optimizations in place, as per
>> scalar DPIF enabling in this patch:
>>>
>> https://patchwork.ozlabs.org/project/openvswitch/patch/20210706131150.45513
>> -2-cian.ferri...@intel.com/
>>>
>>>
> If we cannot improve the perf data from above, there is an option to not
>> enable
 the scalar DPIF with the AVX512 MFEX optimizations. (Logic being if AVX512 
 is
>> present,
 running both the DPIF + MFEX makes sense). What do you think?
>>>
>>> If you feel it is required before merge, would you re-run the benchmark on 
>>> v6?
>>> If so, we're targeting Thursday for merge, so data ASAP, or by EOD tomorrow
>> would be required.
>>
>> I’m reviewing your v6 now, so I have no cycles to also do the testing before 
>> the
>> end of the week. But the tests are simple, so maybe you guys can try it and
>> report the difference with and without the two patchsets applied on a non
>> AVX512 machine?
>
> Yes, we have done scalar-only code benchmarking of master vs with DPIF 
> patchset.
> By not enabling AVX512 at runtime we get the "non AVX512 machine" behaviour.
> (All the scalar code is common, no need to a specific CPU in that instance).
>
> Testing OVS master branch vs with patchset did not show up any performance 
> delta
> on the test machines here, so there's nothing I can do.
>
> By removing scalar DPIF enabling of MFEX opt pointer (details below) we 
> remove any
> urgency on benchmark results?

I’ll wrap up the review first, and hopefully, when you are working on potential 
changes, I can run the tests and get some results.

I understand now what you meant with disabling it in the scalar part, so if I 
still see 1%+ deltas I’ll try it out.

>>> As mentioned above, there is an option to remove the AVX512-Optimized
>> MFEX enabling
>>> from the scalar datapath, if there is measurable/significant performance
>> reduction in this v6 code.
>>
>> It not clear to me what you mean by this? Can you elaborate? I’m running 
>> this on
>> a non AVX512 machine, all with default configs.
>
> I'm suggesting that if you're not OK with merging the ~1.x% negative 
> performance on scalar
> DPIF performance to enable MFEX, we can remove the MFEX enabling from the 
> scalar DPIF.
>
> Logically, if AVX512 is in use for MFEX, it is logical to use the AVX512 DPIF 
> too, hence
> this is a workable solution/workaround for scalar DPIF performance loss.
>
> Taking this approach would ensure that scalar DPIF performance is not reduced 
> in
> this release, and we can re-visit scalar DPIF enabling of MFEX in future if 
> desired?
>
> Overall, this seems the pragmatic way of reducing risk around performance and 
> getting merged.
>
>
 This is on a system without AVX512 support, so all is disabled. The 
 “without
>> patch”
 has both the new AVX patches removed (mfex and dpif framework).

>
>> //Eelco
>>>
>>> Thanks again for testing & follow up! Regards, -Harry

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function for miniflow extract

2021-07-07 Thread Van Haaren, Harry
> -Original Message-
> From: Eelco Chaudron 
> Sent: Wednesday, July 7, 2021 9:59 AM
> To: Van Haaren, Harry 
> Cc: Amber, Kumar ; d...@openvswitch.org;
> i.maxim...@ovn.org; Flavio Leitner ; Stokes, Ian
> 
> Subject: Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function 
> for
> miniflow extract
> 
> 
> 
> On 6 Jul 2021, at 15:58, Van Haaren, Harry wrote:
> 
> >> -Original Message-
> >> From: Eelco Chaudron 
> >> Sent: Friday, July 2, 2021 8:10 AM
> >> To: Van Haaren, Harry 
> >> Cc: Amber, Kumar ; d...@openvswitch.org;
> >> i.maxim...@ovn.org; Flavio Leitner ; Stokes, Ian
> >> 
> >> Subject: Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation 
> >> function for
> >> miniflow extract
> >>
> >>
> >>
> >> On 1 Jul 2021, at 19:24, Van Haaren, Harry wrote:
> >>
>  -Original Message-
>  From: Eelco Chaudron 
> >
> > 
> >
>  I’ll share the google sheet with you directly as it also has the config, 
>  and PVP
> >> results.
> >>>
> >>> I can't actually access that doc, sorry. Results above are enough to go 
> >>> by for
> now :)
> >>
> >> It’s attached.
> >
> > Thanks.
> >
> >>> We can investigate if there's any optimizations to be done to improve the
> scalar DPIF
> >>> enabling of the miniflow extract func ptr, but I'm not sure there is.
> >
> > Note the v6 of MFEX has some minor changes/optimizations in place, as per
> scalar DPIF enabling in this patch:
> >
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210706131150.45513
> -2-cian.ferri...@intel.com/
> >
> >
> >>> If we cannot improve the perf data from above, there is an option to not
> enable
> >> the scalar DPIF with the AVX512 MFEX optimizations. (Logic being if AVX512 
> >> is
> present,
> >> running both the DPIF + MFEX makes sense). What do you think?
> >
> > If you feel it is required before merge, would you re-run the benchmark on 
> > v6?
> > If so, we're targeting Thursday for merge, so data ASAP, or by EOD tomorrow
> would be required.
> 
> I’m reviewing your v6 now, so I have no cycles to also do the testing before 
> the
> end of the week. But the tests are simple, so maybe you guys can try it and
> report the difference with and without the two patchsets applied on a non
> AVX512 machine?

Yes, we have done scalar-only code benchmarking of master vs with DPIF patchset.
By not enabling AVX512 at runtime we get the "non AVX512 machine" behaviour.
(All the scalar code is common, no need to a specific CPU in that instance).

Testing OVS master branch vs with patchset did not show up any performance delta
on the test machines here, so there's nothing I can do.

By removing scalar DPIF enabling of MFEX opt pointer (details below) we remove 
any
urgency on benchmark results?


> > As mentioned above, there is an option to remove the AVX512-Optimized
> MFEX enabling
> > from the scalar datapath, if there is measurable/significant performance
> reduction in this v6 code.
> 
> It not clear to me what you mean by this? Can you elaborate? I’m running this 
> on
> a non AVX512 machine, all with default configs.

I'm suggesting that if you're not OK with merging the ~1.x% negative 
performance on scalar
DPIF performance to enable MFEX, we can remove the MFEX enabling from the 
scalar DPIF.

Logically, if AVX512 is in use for MFEX, it is logical to use the AVX512 DPIF 
too, hence
this is a workable solution/workaround for scalar DPIF performance loss.

Taking this approach would ensure that scalar DPIF performance is not reduced in
this release, and we can re-visit scalar DPIF enabling of MFEX in future if 
desired?

Overall, this seems the pragmatic way of reducing risk around performance and 
getting merged.


> >> This is on a system without AVX512 support, so all is disabled. The 
> >> “without
> patch”
> >> has both the new AVX patches removed (mfex and dpif framework).
> >>
> >>>
>  //Eelco
> >
> > Thanks again for testing & follow up! Regards, -Harry

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 00/11] MFEX Infrastructure + Optimizations

2021-07-07 Thread Eelco Chaudron



On 7 Jul 2021, at 11:09, Van Haaren, Harry wrote:


-Original Message-
From: Eelco Chaudron 
Sent: Wednesday, July 7, 2021 9:35 AM
To: Amber, Kumar 
Cc: Ferriter, Cian ; 
ovs-dev@openvswitch.org;

f...@sysclose.org; i.maxim...@ovn.org; Van Haaren, Harry
; Stokes, Ian 
Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations



On 6 Jul 2021, at 17:06, Amber, Kumar wrote:


Hi Eelco ,


Here is the diff  vor v6 vs v5 :

Patch 1 :

diff --git a/lib/dpif-netdev-private-extract.c 
b/lib/dpif-netdev-private-extract.c

index 1aebf3656d..4987d628a4 100644
--- a/lib/dpif-netdev-private-extract.c
+++ b/lib/dpif-netdev-private-extract.c
@@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct

dp_packet_batch *packets,
 uint32_t keys_size, odp_port_t 
in_port,
 struct dp_netdev_pmd_thread 
*pmd_handle)

 {
-const size_t cnt = dp_packet_batch_size(packets);
+const uint32_t cnt = dp_packet_batch_size(packets);
 uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
 uint16_t good_l3_ofs[NETDEV_MAX_BURST];
 uint16_t good_l4_ofs[NETDEV_MAX_BURST];
@@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct

dp_packet_batch *packets,
 atomic_uintptr_t *pmd_func = (void 
*)&pmd->miniflow_extract_opt;

 atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
 VLOG_ERR("Invalid key size supplied, Key_size: %d less 
than"

- "batch_size: %ld", keys_size, cnt);
+ "batch_size: %d", keys_size, cnt);


What was the reason for changing this size_t to uint32_t? I see 
other instances

where %ld is used for logging?
And other functions like dp_netdev_run_meter() have it as a size_t?


The reason to change this is because 32-bit builds were breaking due 
to incorrect
format-specifier in the printf. Root cause is because size_t requires 
different printf

format specifier based on 32 or 64 bit arch.

(As you likely know, size_t is to describe objects in memory, or the 
return of sizeof operator.
Because 32-bit and 64-bit can have different amounts of memory, size_t 
can be "unsigned int"

or "unsigned long long").

It does not make sense to me to use a type of variable that changes 
width based on

architecture to count batch size (a value from 0 to 32).

Simplicity and obvious-ness is nice, and a uint32_t is always exactly 
what you read it to be,

and %d will always be correct for uint32_t regardless of 32 or 64 bit.

We should not change this back to the more complex and error-prone 
"size_t", uint32_t is better.


I don't think it’s more error-prone if the right type qualifier is 
used, i.e. %zd. See also the coding style document, so I would suggest 
changing it to:


@@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct 
dp_packet_batch *packets,
 uint32_t keys_size, odp_port_t 
in_port,
 struct dp_netdev_pmd_thread 
*pmd_handle)

 {
-const uint32_t cnt = dp_packet_batch_size(packets);
+const size_t cnt = dp_packet_batch_size(packets);
 uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
 uint16_t good_l3_ofs[NETDEV_MAX_BURST];
 uint16_t good_l4_ofs[NETDEV_MAX_BURST];
@@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct 
dp_packet_batch *packets,
 atomic_uintptr_t *pmd_func = (void 
*)&pmd->miniflow_extract_opt;

 atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
 VLOG_ERR("Invalid key size supplied, Key_size: %d less than"
- "batch_size: %d", keys_size, cnt);
+ "batch_size: %"PRIdSIZE, keys_size, cnt);
 return 0;
 }






___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v14 01/11] dpif-netdev: Refactor to multiple header files.

2021-07-07 Thread Ferriter, Cian
Hi Flavio,

Thanks for the info. My response is inline.

Cian

> -Original Message-
> From: Flavio Leitner 
> Sent: Tuesday 6 July 2021 20:36
> To: Ferriter, Cian 
> Cc: ovs-dev@openvswitch.org; i.maxim...@ovn.org
> Subject: Re: [ovs-dev] [v14 01/11] dpif-netdev: Refactor to multiple header 
> files.
> 
> On Tue, Jul 06, 2021 at 04:20:59PM -0300, Flavio Leitner wrote:
> >
> > Hi,
> >
> > I was reviewing the patch while testing and I can consistently
> > loss 1Mpps (or more) on a P2P scenario with this flow table:
> > ovs-ofctl add-flow br0 in_port=dpdk0,actions=output:dpdk1
> >
> > TX: 14Mpps
> > RX without patch: +12.6Mpps
> > RX with patch: 11.67Mpps
> 
> FYI: the performance is consistently recovered with patch 03.
> fbl
> 

Good that the performance is recovered. This is probably a compiler-inlining 
behaviour change due to files moving around (and compiler seeing "re-use" of 
functions, resulting in not being inlined).

As the movement of inline functions takes its final shape, the compiler 
identifies better inlining again, resulting in the same performance as before 
patch 01. (Note that using -O3 instead of e.g. -O2 causes more aggressive 
inlining, and may continually have best performance).

Since the performance is recovered in patch 03, this should be acceptable.

> > CPU: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
> >
> > Perf diff:
> > # Event 'cycles'
> > #
> > # Baseline  Delta Abs  Shared Object   Symbol
> > #   .  ..  
> > ..
> > #
> >  8.32% -2.64%  libc-2.28.so[.] __memcmp_avx2_movbe
> >+2.04%  ovs-vswitchd[.] 
> > dp_netdev_pmd_flush_output_packets.part.41
> > 14.60% -1.78%  ovs-vswitchd[.] mlx5_rx_burst_vec
> >  2.78% +1.78%  ovs-vswitchd[.] non_atomic_ullong_add
> > 23.95% +1.60%  ovs-vswitchd[.] miniflow_extract
> >  2.02% +1.54%  ovs-vswitchd[.] netdev_dpdk_rxq_recv
> > 14.77% -0.82%  ovs-vswitchd[.] dp_netdev_input__
> >  5.46% +0.79%  ovs-vswitchd[.] mlx5_tx_burst_none_empw
> >  2.79% -0.77%  ovs-vswitchd[.] 
> > dp_netdev_pmd_flush_output_on_port
> >  3.70% -0.58%  ovs-vswitchd[.] dp_execute_output_action
> >  3.34% -0.58%  ovs-vswitchd[.] netdev_send
> >  2.92% +0.41%  ovs-vswitchd[.] dp_netdev_process_rxq_port
> >  0.36% +0.39%  ovs-vswitchd[.] netdev_dpdk_vhost_rxq_recv
> >  4.25% +0.38%  ovs-vswitchd[.] 
> > mlx5_tx_handle_completion.isra.49
> >  0.82% +0.30%  ovs-vswitchd[.] pmd_perf_end_iteration
> >  1.53% -0.12%  ovs-vswitchd[.] netdev_dpdk_filter_packet_len
> >  0.53% -0.11%  ovs-vswitchd[.] netdev_is_flow_api_enabled
> >  0.54% -0.10%  [vdso]  [.] 0x09c0
> >  1.72% +0.10%  ovs-vswitchd[.] netdev_rxq_recv
> >  0.61% +0.09%  ovs-vswitchd[.] pmd_thread_main
> >  0.08% +0.07%  ovs-vswitchd[.] userspace_tso_enabled
> >  0.45% -0.07%  ovs-vswitchd[.] memcmp@plt
> >  0.22% -0.05%  ovs-vswitchd[.] dp_execute_cb
> >
> >
> >
> > On Thu, Jul 01, 2021 at 04:06:09PM +0100, Cian Ferriter wrote:
> > > From: Harry van Haaren 
> > >
> > > Split the very large file dpif-netdev.c and the datastructures
> > > it contains into multiple header files. Each header file is
> > > responsible for the datastructures of that component.
> > >
> > > This logical split allows better reuse and modularity of the code,
> > > and reduces the very large file dpif-netdev.c to be more managable.
> > >
> > > Due to dependencies between components, it is not possible to
> > > move component in smaller granularities than this patch.
> > >
> > > To explain the dependencies better, eg:
> > >
> > > DPCLS has no deps (from dpif-netdev.c file)
> > > FLOW depends on DPCLS (struct dpcls_rule)
> > > DFC depends on DPCLS (netdev_flow_key) and FLOW (netdev_flow_key)
> > > THREAD depends on DFC (struct dfc_cache)
> > >
> > > DFC_PROC depends on THREAD (struct pmd_thread)
> > >
> > > DPCLS lookup.h/c require only DPCLS
> > > DPCLS implementations require only dpif-netdev-lookup.h.
> > > - This change was made in 2.12 release with function pointers
> > > - This commit only refactors the name to "private-dpcls.h"
> > >
> > > netdev_flow_key_equal_mf() is renamed to emc_flow_key_equal_mf().
> > >
> > > Rename functions specific to dpcls from netdev_* namespace to the
> > > dpcls_* namespace, as they are only used by dpcls code.
> > >
> > > 'inline' is added to the dp_netdev_flow_hash() when it is moved
> > > definition to fix a compiler error.
> > >
> > > One valid checkpatch issue with the use of the
> > > EMC_FOR_EACH_POS_WITH_HASH() macro was fixed.
> > >
> > > Signed-off-by: Harry van Haaren 
> > > Co-authored-by: Cian Ferriter 
> > > Signed-off-by: 

Re: [ovs-dev] [v6 00/11] MFEX Infrastructure + Optimizations

2021-07-07 Thread Van Haaren, Harry
> -Original Message-
> From: Eelco Chaudron 
> Sent: Wednesday, July 7, 2021 9:35 AM
> To: Amber, Kumar 
> Cc: Ferriter, Cian ; ovs-dev@openvswitch.org;
> f...@sysclose.org; i.maxim...@ovn.org; Van Haaren, Harry
> ; Stokes, Ian 
> Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations
> 
> 
> 
> On 6 Jul 2021, at 17:06, Amber, Kumar wrote:
> 
> > Hi Eelco ,
> >
> >
> > Here is the diff for v6 vs v5:
> >
> > Patch 1 :
> >
> > diff --git a/lib/dpif-netdev-private-extract.c 
> > b/lib/dpif-netdev-private-extract.c
> > index 1aebf3656d..4987d628a4 100644
> > --- a/lib/dpif-netdev-private-extract.c
> > +++ b/lib/dpif-netdev-private-extract.c
> > @@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct
> dp_packet_batch *packets,
> >  uint32_t keys_size, odp_port_t in_port,
> >  struct dp_netdev_pmd_thread 
> > *pmd_handle)
> >  {
> > -const size_t cnt = dp_packet_batch_size(packets);
> > +const uint32_t cnt = dp_packet_batch_size(packets);
> >  uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
> >  uint16_t good_l3_ofs[NETDEV_MAX_BURST];
> >  uint16_t good_l4_ofs[NETDEV_MAX_BURST];
> > @@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct
> dp_packet_batch *packets,
> >  atomic_uintptr_t *pmd_func = (void *)&pmd->miniflow_extract_opt;
> >  atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
> >  VLOG_ERR("Invalid key size supplied, Key_size: %d less than"
> > - "batch_size: %ld", keys_size, cnt);
> > + "batch_size: %d", keys_size, cnt);
> 
> What was the reason for changing this size_t to uint32_t? I see other 
> instances
> where %ld is used for logging?
> And other functions like dp_netdev_run_meter() have it as a size_t?

The reason to change this is because 32-bit builds were breaking due to 
incorrect
format-specifier in the printf. Root cause is because size_t requires different 
printf
format specifier based on 32 or 64 bit arch.

(As you likely know, size_t is to describe objects in memory, or the return of 
sizeof operator.
Because 32-bit and 64-bit can have different amounts of memory, size_t can be 
"unsigned int"
or "unsigned long long").

It does not make sense to me to use a type of variable that changes width based 
on
architecture to count batch size (a value from 0 to 32).

Simplicity and obvious-ness is nice, and a uint32_t is always exactly what you 
read it to be,
and %d will always be correct for uint32_t regardless of 32 or 64 bit.

We should not change this back to the more complex and error-prone "size_t", 
uint32_t is better.


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v5 01/11] dpif-netdev: Add command line and function pointer for miniflow extract

2021-07-07 Thread Amber, Kumar
Hi Eelco,

Thanks for the reviews — good find. I missed it; it will be fixed in v7.

< Snip>

> 
> >>> + * for that packet.
> >>> + */
> >>> +uint32_t mfex_hit = (mf_mask & (1 << i));
> >>
> >> This was supposed to become a bool?
> >>
> >
> > This cannot be a bool as this is used like a bit-mask and set bits are used 
> > to
> iterate the packets.
> >
> 

Fixed.

> Guess the problem is that it should be a bool in this instance, but later on 
> in the
> code you redefine the same variable and use it as a count!
> I would suggest changing this to a bool, and renaming the other instance to 
> hits
> or mfex_hit_cnt.
> 
> 
> 313 /* At this point we don't return error anymore, so commit stats here. 
> */
> 314 uint32_t mfex_hit = __builtin_popcountll(mf_mask);
> 
> Change this to “uint32_t mfex_hit_cnt / or mfex_hits”
> 
> 315 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RECV,
> batch_size);
> 316 pms_perf_update_counter(&pmd->perf_stats, PMD_STAT_PHWOL_HIT,
> phwol_hits);
> 317 pmd_perf_update_counter(&pmd->perf_stats,
> PMD_STAT_MFEX_OPT_HIT, mfex_hit);
> 
> And this to:
> 
> pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MFEX_OPT_HIT,
> mfex_hits);
> 
> 318 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
> emc_hits);
> 319 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT,
> smc_hits);
> 320 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> 321 dpcls_key_idx);
> 322 pmd_perf_update_counter(&pmd->perf_stats,
> PMD_STAT_MASKED_LOOKUP,
> 323 dpcls_key_idx);
> 
> 

Fixed.
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function for miniflow extract

2021-07-07 Thread Eelco Chaudron


On 6 Jul 2021, at 15:58, Van Haaren, Harry wrote:

>> -Original Message-
>> From: Eelco Chaudron 
>> Sent: Friday, July 2, 2021 8:10 AM
>> To: Van Haaren, Harry 
>> Cc: Amber, Kumar ; d...@openvswitch.org;
>> i.maxim...@ovn.org; Flavio Leitner ; Stokes, Ian
>> 
>> Subject: Re: [ovs-dev] [v4 02/12] dpif-netdev: Add auto validation function 
>> for
>> miniflow extract
>>
>>
>>
>> On 1 Jul 2021, at 19:24, Van Haaren, Harry wrote:
>>
 -Original Message-
 From: Eelco Chaudron 
>
> 
>
 I’ll share the google sheet with you directly as it also has the config, 
 and PVP
>> results.
>>>
>>> I can't actually access that doc, sorry. Results above are enough to go by 
>>> for now :)
>>
>> It’s attached.
>
> Thanks.
>
>>> We can investigate if there's any optimizations to be done to improve the 
>>> scalar DPIF
>>> enabling of the miniflow extract func ptr, but I'm not sure there is.
>
> Note the v6 of MFEX has some minor changes/optimizations in place, as per 
> scalar DPIF enabling in this patch:
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210706131150.45513-2-cian.ferri...@intel.com/
>
>
>>> If we cannot improve the perf data from above, there is an option to not 
>>> enable
>> the scalar DPIF with the AVX512 MFEX optimizations. (Logic being if AVX512 
>> is present,
>> running both the DPIF + MFEX makes sense). What do you think?
>
> If you feel it is required before merge, would you re-run the benchmark on v6?
> If so, we're targeting Thursday for merge, so data ASAP, or by EOD tomorrow 
> would be required.

I’m reviewing your v6 now, so I have no cycles to also do the testing before 
the end of the week. But the tests are simple, so maybe you guys can try it and 
report the difference with and without the two patchsets applied on a non 
AVX512 machine?

> As mentioned above, there is an option to remove the AVX512-Optimized MFEX 
> enabling
> from the scalar datapath, if there is measurable/significant performance 
> reduction in this v6 code.

It's not clear to me what you mean by this? Can you elaborate? I’m running this 
on a non AVX512 machine, all with default configs.

>> This is on a system without AVX512 support, so all is disabled. The “without 
>> patch”
>> has both the new AVX patches removed (mfex and dpif framework).
>>
>>>
 //Eelco
>
> Thanks again for testing & follow up! Regards, -Harry

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v5 01/11] dpif-netdev: Add command line and function pointer for miniflow extract

2021-07-07 Thread Eelco Chaudron


On 7 Jul 2021, at 6:34, Amber, Kumar wrote:

> Hi Eelco,
>
> MFEX v7 will be available shortly EOD today.
> Some comments are inline.

Thanks, looks like one item is still not clear, and I think why (see below).

> 

>>> + * for that packet.
>>> + */
>>> +uint32_t mfex_hit = (mf_mask & (1 << i));
>>
>> This was supposed to become a bool?
>>
>
> This cannot be a bool as this is used like a bit-mask and set bits are used 
> to iterate the packets.
>

Guess the problem is that it should be a bool in this instance, but later on in 
the code you redefine the same variable and use it as a count!
I would suggest changing this to a bool, and renaming the other instance to 
hits or mfex_hit_cnt.


313 /* At this point we don't return error anymore, so commit stats here. */
314 uint32_t mfex_hit = __builtin_popcountll(mf_mask);

Change this to “uint32_t mfex_hit_cnt / or mfex_hits”

315 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RECV, batch_size);
316 pms_perf_update_counter(&pmd->perf_stats, PMD_STAT_PHWOL_HIT, 
phwol_hits);
317 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MFEX_OPT_HIT, 
mfex_hit);

And this to:

pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MFEX_OPT_HIT, mfex_hits);

318 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, emc_hits);
319 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, smc_hits);
320 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
321 dpcls_key_idx);
322 pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_LOOKUP,
323 dpcls_key_idx);




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4 3/3] dpif-netlink: Introduce per-cpu upcall dispatch

2021-07-07 Thread Mark Gray
On 06/07/2021 13:35, Flavio Leitner wrote:
> 
> The two functions below are Linux specific, so they need to be conditional
> to the #ifdef above like you did in dpif_netlink_recv_wait():
> 
> #ifdef _WIN32
> dpif_netlink_recv_wait_windows()
> #else
> dpif_netlink_recv_wait_vport_dispatch()
> dpif_netlink_recv_wait_cpu_dispatch()
> #endif
> 
> Otherwise I don't see anything else. Tests are good on my end.
> Thanks
> fbl
> 

Done. Thanks for the review and testing.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 3/3] dpif-netlink: Introduce per-cpu upcall dispatch

2021-07-07 Thread Mark Gray
On 06/07/2021 10:05, David Marchand wrote:
> Small nits on the NEWS update.
> 
> On Mon, Jul 5, 2021 at 3:39 PM Mark Gray  wrote:
>> diff --git a/NEWS b/NEWS
>> index a2a2dcf95d7d..80b13e358685 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -29,7 +29,12 @@ Post-v2.15.0
>> - ovsdb-tool:
>>   * New option '--election-timer' to the 'create-cluster' command to set 
>> the
>> leader election timer during cluster creation.
>> -
> 
> I think versions are separated with 2 empty lines in NEWS, so you
> should leave this one.
> 
>> +   - Per-cpu upcall dispatching:
> 
> And since this change affects the kernel datapath on Linux:
> 
> - Linux datapath:
> 
>> + * ovs-vswitchd will configure the kernel module using per-cpu dispatch
>> +   mode (if available). This changes the way upcalls are delivered to 
>> user
>> +   space in order to resolve a number of issues with per-vport dispatch.
>> +   The new debug appctl command `dpif-netlink/dispatch-mode`
>> +   will return the current dispatch mode for each datapath.
> 
> 
>>
>>  v2.15.0 - 15 Feb 2021
>>  -
> 
> 

Done! Thanks David

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5 0/3] dpif-netlink: Introduce per-cpu upcall dispatching

2021-07-07 Thread Mark Gray
On 07/07/2021 09:43, Mark Gray wrote:
> This series proposes a new method of distributing upcalls
> to user space threads attempting to resolve a number of
> issues with the current method.
> 
> v2 - Rebase
>  Address Flavio's comments
> v3 - Add man page to automake
> v4 - Rebase and address Flavio's comments
> v5 - Rebase and address Flavio and David's comments
> 
> Mark Gray (3):
>   ofproto: change type of n_handlers and n_revalidators
>   dpif-netlink: fix report_loss() message
>   dpif-netlink: Introduce per-cpu upcall dispatch
> 
>  NEWS  |   6 +
>  .../linux/compat/include/linux/openvswitch.h  |   7 +
>  lib/automake.mk   |   1 +
>  lib/dpif-netdev.c |   1 +
>  lib/dpif-netlink-unixctl.man  |   6 +
>  lib/dpif-netlink.c| 463 --
>  lib/dpif-provider.h   |  32 +-
>  lib/dpif.c|  17 +
>  lib/dpif.h|   1 +
>  ofproto/ofproto-dpif-upcall.c |  71 ++-
>  ofproto/ofproto-dpif-upcall.h |   5 +-
>  ofproto/ofproto-provider.h|   2 +-
>  ofproto/ofproto.c |  14 +-
>  vswitchd/ovs-vswitchd.8.in|   1 +
>  vswitchd/vswitch.xml  |  23 +-
>  15 files changed, 541 insertions(+), 109 deletions(-)
>  create mode 100644 lib/dpif-netlink-unixctl.man
> 

This series is still associated with v1 of the kernel space series at
https://marc.info/?l=linux-netdev&m=162504684016825&w=2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v5 3/3] dpif-netlink: Introduce per-cpu upcall dispatch

2021-07-07 Thread Mark Gray
The Open vSwitch kernel module uses the upcall mechanism to send
packets from kernel space to user space when it misses in the kernel
space flow table. The upcall sends packets via a Netlink socket.
Currently, a Netlink socket is created for every vport. In this way,
there is a 1:1 mapping between a vport and a Netlink socket.
When a packet is received by a vport, if it needs to be sent to
user space, it is sent via the corresponding Netlink socket.

This mechanism, with various iterations of the corresponding user
space code, has seen some limitations and issues:

* On systems with a large number of vports, there is correspondingly
a large number of Netlink sockets which can limit scaling.
(https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
* Packet reordering on upcalls.
(https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
* A thundering herd issue.
(https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

This patch introduces an alternative, feature-negotiated, upcall
mode using a per-cpu dispatch rather than a per-vport dispatch.

In this mode, the Netlink socket to be used for the upcall is
selected based on the CPU of the thread that is executing the upcall.
In this way, it resolves the issues above as:

a) The number of Netlink sockets scales with the number of CPUs
rather than the number of vports.
b) Ordering per-flow is maintained as packets are distributed to
CPUs based on mechanisms such as RSS and flows are distributed
to a single user space thread.
c) Packets from a flow can only wake up one user space thread.

Reported-at: https://bugzilla.redhat.com/1844576
Signed-off-by: Mark Gray 
---

Notes:
v1 - Reworked based on Flavio's comments:
 * change DISPATCH_MODE_PER_CPU() to inline function
 * add `ovs-appctl` command to check dispatch mode for datapaths
 * fixed issue with userspace actions (tested using `ovs-ofctl monitor 
br0 65534 -P nxt_packet_in`)
 * update documentation as requested
v2 - Reworked based on Flavio's comments:
 * Used dpif_netlink_upcall_per_cpu() for check in 
dpif_netlink_set_handler_pids()
 * Added macro for (ignored) Netlink PID
 * Fixed indentation issue
 * Added NEWS entry
 * Added section to ovs-vswitchd.8 man page
v4 - Reworked based on Flavio's comments:
 * Cleaned up log message when dispatch mode is set
v5 - Reworked based on Flavio's comments:
 * Added macros to remove functions for Window's build
 Reworked based on David's comments:
 * Updated the NEWS file

 NEWS  |   6 +
 .../linux/compat/include/linux/openvswitch.h  |   7 +
 lib/automake.mk   |   1 +
 lib/dpif-netdev.c |   1 +
 lib/dpif-netlink-unixctl.man  |   6 +
 lib/dpif-netlink.c| 463 --
 lib/dpif-provider.h   |  32 +-
 lib/dpif.c|  17 +
 lib/dpif.h|   1 +
 ofproto/ofproto-dpif-upcall.c |  51 +-
 ofproto/ofproto.c |  12 -
 vswitchd/ovs-vswitchd.8.in|   1 +
 vswitchd/vswitch.xml  |  23 +-
 13 files changed, 526 insertions(+), 95 deletions(-)
 create mode 100644 lib/dpif-netlink-unixctl.man

diff --git a/NEWS b/NEWS
index a2a2dcf95d7d..6b5da85316bc 100644
--- a/NEWS
+++ b/NEWS
@@ -29,6 +29,12 @@ Post-v2.15.0
- ovsdb-tool:
  * New option '--election-timer' to the 'create-cluster' command to set the
leader election timer during cluster creation.
+   - Linux datapath:
+ * ovs-vswitchd will configure the kernel module using per-cpu dispatch
+   mode (if available). This changes the way upcalls are delivered to user
+   space in order to resolve a number of issues with per-vport dispatch.
+ * New vswitchd unixctl command `dpif-netlink/dispatch-mode` will return
+   the current dispatch mode for each datapath.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
b/datapath/linux/compat/include/linux/openvswitch.h
index 875de20250ce..f29265df055e 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -89,6 +89,8 @@ enum ovs_datapath_cmd {
  * set on the datapath port (for OVS_ACTION_ATTR_MISS).  Only valid on
  * %OVS_DP_CMD_NEW requests. A value of zero indicates that upcalls should
  * not be sent.
+ * OVS_DP_ATTR_PER_CPU_PIDS: Per-cpu array of PIDs for upcalls when
+ * OVS_DP_F_DISPATCH_UPCALL_PER_CPU feature is set.
  * @OVS_DP_ATTR_STATS: Statistics about packets that have passed through the
  * datapath.  Always present in notifications.
  * @OVS_DP_ATTR_MEGAFLOW_STATS: Statistics about mega flow masks usage for the
@@ -105,6 +107,8 @@ enum ovs_datapath_attr {
OVS_DP_ATTR_MEGAFLOW_STATS, /* struct ovs_d

[ovs-dev] [PATCH v5 2/3] dpif-netlink: fix report_loss() message

2021-07-07 Thread Mark Gray
Fixes: 1579cf677fcb ("dpif-linux: Implement the API functions to allow multiple 
handler threads read upcall.")
Signed-off-by: Mark Gray 
Acked-by: Flavio Leitner 
---

Notes:
v1 - Reworked based on Flavio's comments:
 * Added "Fixes" tag

 lib/dpif-netlink.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 73d5608a81a2..f92905dd83fd 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -4666,7 +4666,7 @@ report_loss(struct dpif_netlink *dpif, struct 
dpif_channel *ch, uint32_t ch_idx,
   time_msec() - ch->last_poll);
 }
 
-VLOG_WARN("%s: lost packet on port channel %u of handler %u",
-  dpif_name(&dpif->dpif), ch_idx, handler_id);
+VLOG_WARN("%s: lost packet on port channel %u of handler %u%s",
+  dpif_name(&dpif->dpif), ch_idx, handler_id, ds_cstr(&s));
 ds_destroy(&s);
 }
-- 
2.27.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v5 1/3] ofproto: change type of n_handlers and n_revalidators

2021-07-07 Thread Mark Gray
'n_handlers' and 'n_revalidators' are declared as type 'size_t'.
However, dpif_handlers_set() requires parameter 'n_handlers' as
type 'uint32_t'. This patch fixes this type mismatch.

Signed-off-by: Mark Gray 
Acked-by: Flavio Leitner 
---

Notes:
v1 - Reworked based on Flavio's comments:
 * fixed inconsistency with change of size_t -> uint32_t

 ofproto/ofproto-dpif-upcall.c | 20 ++--
 ofproto/ofproto-dpif-upcall.h |  5 +++--
 ofproto/ofproto-provider.h|  2 +-
 ofproto/ofproto.c |  2 +-
 4 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
index ccf97266c0b9..d22f7f07361f 100644
--- a/ofproto/ofproto-dpif-upcall.c
+++ b/ofproto/ofproto-dpif-upcall.c
@@ -129,10 +129,10 @@ struct udpif {
 struct dpif_backer *backer;/* Opaque dpif_backer pointer. */
 
 struct handler *handlers;  /* Upcall handlers. */
-size_t n_handlers;
+uint32_t n_handlers;
 
 struct revalidator *revalidators;  /* Flow revalidators. */
-size_t n_revalidators;
+uint32_t n_revalidators;
 
 struct latch exit_latch;   /* Tells child threads to exit. */
 
@@ -335,8 +335,8 @@ static int process_upcall(struct udpif *, struct upcall *,
   struct ofpbuf *odp_actions, struct flow_wildcards *);
 static void handle_upcalls(struct udpif *, struct upcall *, size_t n_upcalls);
 static void udpif_stop_threads(struct udpif *, bool delete_flows);
-static void udpif_start_threads(struct udpif *, size_t n_handlers,
-size_t n_revalidators);
+static void udpif_start_threads(struct udpif *, uint32_t n_handlers,
+uint32_t n_revalidators);
 static void udpif_pause_revalidators(struct udpif *);
 static void udpif_resume_revalidators(struct udpif *);
 static void *udpif_upcall_handler(void *);
@@ -562,8 +562,8 @@ udpif_stop_threads(struct udpif *udpif, bool delete_flows)
 
 /* Starts the handler and revalidator threads. */
 static void
-udpif_start_threads(struct udpif *udpif, size_t n_handlers_,
-size_t n_revalidators_)
+udpif_start_threads(struct udpif *udpif, uint32_t n_handlers_,
+uint32_t n_revalidators_)
 {
 if (udpif && n_handlers_ && n_revalidators_) {
 /* Creating a thread can take a significant amount of time on some
@@ -632,8 +632,8 @@ udpif_resume_revalidators(struct udpif *udpif)
  * datapath handle must have packet reception enabled before starting
  * threads. */
 void
-udpif_set_threads(struct udpif *udpif, size_t n_handlers_,
-  size_t n_revalidators_)
+udpif_set_threads(struct udpif *udpif, uint32_t n_handlers_,
+  uint32_t n_revalidators_)
 {
 ovs_assert(udpif);
 ovs_assert(n_handlers_ && n_revalidators_);
@@ -691,8 +691,8 @@ udpif_get_memory_usage(struct udpif *udpif, struct simap 
*usage)
 void
 udpif_flush(struct udpif *udpif)
 {
-size_t n_handlers_ = udpif->n_handlers;
-size_t n_revalidators_ = udpif->n_revalidators;
+uint32_t n_handlers_ = udpif->n_handlers;
+uint32_t n_revalidators_ = udpif->n_revalidators;
 
 udpif_stop_threads(udpif, true);
 dpif_flow_flush(udpif->dpif);
diff --git a/ofproto/ofproto-dpif-upcall.h b/ofproto/ofproto-dpif-upcall.h
index 693107ae56c1..b4dfed32046e 100644
--- a/ofproto/ofproto-dpif-upcall.h
+++ b/ofproto/ofproto-dpif-upcall.h
@@ -16,6 +16,7 @@
 #define OFPROTO_DPIF_UPCALL_H
 
 #include 
+#include 
 
 struct dpif;
 struct dpif_backer;
@@ -31,8 +32,8 @@ struct simap;
 void udpif_init(void);
 struct udpif *udpif_create(struct dpif_backer *, struct dpif *);
 void udpif_run(struct udpif *udpif);
-void udpif_set_threads(struct udpif *, size_t n_handlers,
-   size_t n_revalidators);
+void udpif_set_threads(struct udpif *, uint32_t n_handlers,
+   uint32_t n_revalidators);
 void udpif_destroy(struct udpif *);
 void udpif_revalidate(struct udpif *);
 void udpif_get_memory_usage(struct udpif *, struct simap *usage);
diff --git a/ofproto/ofproto-provider.h b/ofproto/ofproto-provider.h
index 9ad2b71d23eb..57c7d17cb28f 100644
--- a/ofproto/ofproto-provider.h
+++ b/ofproto/ofproto-provider.h
@@ -534,7 +534,7 @@ extern unsigned ofproto_min_revalidate_pps;
 
 /* Number of upcall handler and revalidator threads. Only affects the
  * ofproto-dpif implementation. */
-extern size_t n_handlers, n_revalidators;
+extern uint32_t n_handlers, n_revalidators;
 
 static inline struct rule *rule_from_cls_rule(const struct cls_rule *);
 
diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
index 80ec2d9ac9c7..53002f082b52 100644
--- a/ofproto/ofproto.c
+++ b/ofproto/ofproto.c
@@ -309,7 +309,7 @@ unsigned ofproto_max_idle = OFPROTO_MAX_IDLE_DEFAULT;
 unsigned ofproto_max_revalidator = OFPROTO_MAX_REVALIDATOR_DEFAULT;
 unsigned ofproto_min_revalidate_pps = OFPROTO_MIN_REVALIDATE_PPS_DEFAULT;
 
-size_t n_handlers, n_revalidators;
+uin

[ovs-dev] [PATCH v5 0/3] dpif-netlink: Introduce per-cpu upcall dispatching

2021-07-07 Thread Mark Gray
This series proposes a new method of distributing upcalls
to user space threads attempting to resolve a number of
issues with the current method.

v2 - Rebase
 Address Flavio's comments
v3 - Add man page to automake
v4 - Rebase and address Flavio's comments
v5 - Rebase and address Flavio and David's comments

Mark Gray (3):
  ofproto: change type of n_handlers and n_revalidators
  dpif-netlink: fix report_loss() message
  dpif-netlink: Introduce per-cpu upcall dispatch

 NEWS  |   6 +
 .../linux/compat/include/linux/openvswitch.h  |   7 +
 lib/automake.mk   |   1 +
 lib/dpif-netdev.c |   1 +
 lib/dpif-netlink-unixctl.man  |   6 +
 lib/dpif-netlink.c| 463 --
 lib/dpif-provider.h   |  32 +-
 lib/dpif.c|  17 +
 lib/dpif.h|   1 +
 ofproto/ofproto-dpif-upcall.c |  71 ++-
 ofproto/ofproto-dpif-upcall.h |   5 +-
 ofproto/ofproto-provider.h|   2 +-
 ofproto/ofproto.c |  14 +-
 vswitchd/ovs-vswitchd.8.in|   1 +
 vswitchd/vswitch.xml  |  23 +-
 15 files changed, 541 insertions(+), 109 deletions(-)
 create mode 100644 lib/dpif-netlink-unixctl.man

-- 
2.27.0


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v6 00/11] MFEX Infrastructure + Optimizations

2021-07-07 Thread Eelco Chaudron



On 6 Jul 2021, at 17:06, Amber, Kumar wrote:

> Hi Eelco ,
>
>
> Here is the diff for v6 vs v5:
>
> Patch 1 :
>
> diff --git a/lib/dpif-netdev-private-extract.c 
> b/lib/dpif-netdev-private-extract.c
> index 1aebf3656d..4987d628a4 100644
> --- a/lib/dpif-netdev-private-extract.c
> +++ b/lib/dpif-netdev-private-extract.c
> @@ -233,7 +233,7 @@ dpif_miniflow_extract_autovalidator(struct 
> dp_packet_batch *packets,
>  uint32_t keys_size, odp_port_t in_port,
>  struct dp_netdev_pmd_thread *pmd_handle)
>  {
> -const size_t cnt = dp_packet_batch_size(packets);
> +const uint32_t cnt = dp_packet_batch_size(packets);
>  uint16_t good_l2_5_ofs[NETDEV_MAX_BURST];
>  uint16_t good_l3_ofs[NETDEV_MAX_BURST];
>  uint16_t good_l4_ofs[NETDEV_MAX_BURST];
> @@ -247,7 +247,7 @@ dpif_miniflow_extract_autovalidator(struct 
> dp_packet_batch *packets,
>  atomic_uintptr_t *pmd_func = (void *)&pmd->miniflow_extract_opt;
>  atomic_store_relaxed(pmd_func, (uintptr_t) default_func);
>  VLOG_ERR("Invalid key size supplied, Key_size: %d less than"
> - "batch_size: %ld", keys_size, cnt);
> + "batch_size: %d", keys_size, cnt);

What was the reason for changing this size_t to uint32_t? I see other 
instances where %ld is used for logging?
And other functions like dp_netdev_run_meter() have it as a size_t?

>  return 0;
>  }
>
> Patch 7 :
>
> AT_SKIP_IF([! pip3 list | grep scapy], [], [])
>
>> -Original Message-
>> From: Eelco Chaudron 
>> Sent: Tuesday, July 6, 2021 8:04 PM
>> To: Ferriter, Cian 
>> Cc: ovs-dev@openvswitch.org; f...@sysclose.org; i.maxim...@ovn.org; Van
>> Haaren, Harry ; Amber, Kumar
>> ; Stokes, Ian 
>> Subject: Re: [v6 00/11] MFEX Infrastructure + Optimizations
>>
>> Cian,
>>
>> Which patches change, so I know where to update my review? None of the
>> commit messages show v6 changes.
>>
>> //Eelco
>>
>
> Regards
> Amber

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v7 2/4] northd: Refactor Logical Flows for routers with DNAT/Load Balancers

2021-07-07 Thread Mark Gray
On 07/07/2021 09:28, Mark Gray wrote:
> This patch addresses a number of interconnected issues with Gateway Routers
> that have Load Balancing enabled:
> 
> 1) In the router pipeline, we have the following stages to handle
> dnat and unsnat.
> 
>  - Stage 4 : lr_in_defrag (dnat zone)
>  - Stage 5 : lr_in_unsnat (snat zone)
>  - Stage 6 : lr_in_dnat   (dnat zone)
> 
> In the reply direction, the order of traversal of the tables
> "lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
> datapath flows that check ct_state in the wrong conntrack zone.
> This is illustrated below where reply traffic enters the physical host
> port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
> DNAT zone and then on to Logical Switch Port zone (22). The third
> flow is incorrectly checking the state from the SNAT zone instead
> of the DNAT zone.
> 
> recirc_id(0),in_port(6),ct_state(-new-est-rel-rpl-trk) 
> actions:ct_clear,ct(zone=14),recirc(0xf)
> recirc_id(0xf),in_port(6) actions:ct(nat),recirc(0x10)
> recirc_id(0x10),in_port(6),ct_state(-new+est+trk) 
> actions:ct(zone=14,nat),recirc(0x11)
> recirc_id(0x11),in_port(6),ct_state(+new-est-rel-rpl+trk) actions: 
> ct(zone=22,nat),recirc(0x12)
> recirc_id(0x12),in_port(6),ct_state(-new+est-rel+rpl+trk) actions:5
> 
> Update the order of these tables to resolve this.
> 
> 2) Efficiencies can be gained by using the ct_dnat action in the
> table "lr_in_defrag" instead of ct_next. This removes the need for the
> ct_dnat action for established Load Balancer flows avoiding a
> recirculation.
> 
> 3) On a Gateway router with DNAT flows configured, the router will translate
> the destination IP address from (A) to (B). Reply packets from (B) are
> correctly UNDNATed in the reverse direction.
> 
> However, if a new connection is established from (B), this flow is never
> committed to conntrack and, as such, is never established. This will
> cause OVS datapath flows to be added that match on the ct.new flag.
> 
> For software-only datapaths this is not a problem. However, for
> datapaths that offload these flows to hardware, this may be problematic
> as some devices are unable to offload flows that match on ct.new.
> 
> This patch resolves this by committing these flows to the DNAT zone in
> the new "lr_out_post_undnat" stage. Although this could be done in the
> DNAT zone, by doing this in the new zone we can avoid a recirculation.
> 
> This patch also generalizes these changes to distributed routers with
> gateway ports.
> 
> Co-authored-by: Numan Siddique 
> Signed-off-by: Mark Gray 
> Signed-off-by: Numan Siddique 
> Reported-at: https://bugzilla.redhat.com/1956740
> Reported-at: https://bugzilla.redhat.com/1953278
> ---


This was a tricky rebase due to Lorenzo's refactor LB series. It would
be worth having another read over the northd.c code which should be the
only code changed.

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v7 2/4] northd: Refactor Logical Flows for routers with DNAT/Load Balancers

2021-07-07 Thread Mark Gray
This patch addresses a number of interconnected issues with Gateway Routers
that have Load Balancing enabled:

1) In the router pipeline, we have the following stages to handle
dnat and unsnat.

 - Stage 4 : lr_in_defrag (dnat zone)
 - Stage 5 : lr_in_unsnat (snat zone)
 - Stage 6 : lr_in_dnat   (dnat zone)

In the reply direction, the order of traversal of the tables
"lr_in_defrag", "lr_in_unsnat" and "lr_in_dnat" adds incorrect
datapath flows that check ct_state in the wrong conntrack zone.
This is illustrated below where reply traffic enters the physical host
port (6) and traverses DNAT zone (14), SNAT zone (default), back to the
DNAT zone and then on to Logical Switch Port zone (22). The third
flow is incorrectly checking the state from the SNAT zone instead
of the DNAT zone.

recirc_id(0),in_port(6),ct_state(-new-est-rel-rpl-trk) 
actions:ct_clear,ct(zone=14),recirc(0xf)
recirc_id(0xf),in_port(6) actions:ct(nat),recirc(0x10)
recirc_id(0x10),in_port(6),ct_state(-new+est+trk) 
actions:ct(zone=14,nat),recirc(0x11)
recirc_id(0x11),in_port(6),ct_state(+new-est-rel-rpl+trk) actions: 
ct(zone=22,nat),recirc(0x12)
recirc_id(0x12),in_port(6),ct_state(-new+est-rel+rpl+trk) actions:5

Update the order of these tables to resolve this.

2) Efficiencies can be gained by using the ct_dnat action in the
table "lr_in_defrag" instead of ct_next. This removes the need for the
ct_dnat action for established Load Balancer flows avoiding a
recirculation.

3) On a Gateway router with DNAT flows configured, the router will translate
the destination IP address from (A) to (B). Reply packets from (B) are
correctly UNDNATed in the reverse direction.

However, if a new connection is established from (B), this flow is never
committed to conntrack and, as such, is never established. This will
cause OVS datapath flows to be added that match on the ct.new flag.

For software-only datapaths this is not a problem. However, for
datapaths that offload these flows to hardware, this may be problematic
as some devices are unable to offload flows that match on ct.new.

This patch resolves this by committing these flows to the DNAT zone in
the new "lr_out_post_undnat" stage. Although this could be done in the
DNAT zone, by doing this in the new zone we can avoid a recirculation.

This patch also generalizes these changes to distributed routers with
gateway ports.

Co-authored-by: Numan Siddique 
Signed-off-by: Mark Gray 
Signed-off-by: Numan Siddique 
Reported-at: https://bugzilla.redhat.com/1956740
Reported-at: https://bugzilla.redhat.com/1953278
---

Notes:
v2:  Addressed Han's comments
 * fixed ovn-northd.8.xml
 * added 'is_gw_router' to all cases where relevant
 * refactor add_router_lb_flow()
 * added ct_commit/ct_dnat to gateway ports case
 * updated flows like "ct.new && ip &&  to specify ip4/ip6 
instead of ip
 * increment ovn_internal_version
v4:  Fix line length errors from 0-day
v5:  Add "Reported-at" tag

 lib/ovn-util.c  |   2 +-
 northd/ovn-northd.8.xml | 285 ++---
 northd/ovn-northd.c | 176 ++-
 northd/ovn_northd.dl| 136 +---
 tests/ovn-northd.at | 685 +++-
 tests/ovn.at|   8 +-
 tests/system-ovn.at |  58 +++-
 7 files changed, 1019 insertions(+), 331 deletions(-)

diff --git a/lib/ovn-util.c b/lib/ovn-util.c
index acf4b1cd6059..494d6d42d869 100644
--- a/lib/ovn-util.c
+++ b/lib/ovn-util.c
@@ -760,7 +760,7 @@ ip_address_and_port_from_lb_key(const char *key, char 
**ip_address,
 
 /* Increment this for any logical flow changes, if an existing OVN action is
  * modified or a stage is added to a logical pipeline. */
-#define OVN_INTERNAL_MINOR_VER 0
+#define OVN_INTERNAL_MINOR_VER 1
 
 /* Returns the OVN version. The caller must free the returned value. */
 char *
diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index b5c961e891f9..c76339ce38e4 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -2637,39 +2637,9 @@ icmp6 {
   
 
 
-Ingress Table 4: DEFRAG
 
-
-  This is to send packets to connection tracker for tracking and
-  defragmentation.  It contains a priority-0 flow that simply moves traffic
-  to the next table.
-
-
-
-  If load balancing rules with virtual IP addresses (and ports) are
-  configured in OVN_Northbound database for a Gateway router,
-  a priority-100 flow is added for each configured virtual IP address
-  VIP. For IPv4 VIPs the flow matches ip
-  && ip4.dst == VIP.  For IPv6 VIPs,
-  the flow matches ip && ip6.dst == VIP.
-  The flow uses the action ct_next; to send IP packets to the
-  connection tracker for packet de-fragmentation and tracking before
-  sending it to the next table.
-
-
-
-  If ECMP routes with symmetric reply are configured in the
-  OVN_Northbound database for a gateway router, a priority-300
-  flow is added fo

[ovs-dev] [PATCH ovn v7 4/4] AUTHORS: update email for Mark Gray

2021-07-07 Thread Mark Gray
Update email address for Mark Gray

Signed-off-by: Mark Gray 
---
 .mailmap| 1 +
 AUTHORS.rst | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index f01664e5c1d1..bc32255b5cc4 100644
--- a/.mailmap
+++ b/.mailmap
@@ -52,6 +52,7 @@ Joe Stringer  
 Justin Pettit  
 Kmindg 
 Kyle Mestery  
+Mark Gray  
 Mauricio Vasquez  

 Miguel Angel Ajo  
 Neil McKee 
diff --git a/AUTHORS.rst b/AUTHORS.rst
index 4c81a500d47e..5df6110e0230 100644
--- a/AUTHORS.rst
+++ b/AUTHORS.rst
@@ -250,7 +250,7 @@ Manohar K Cman...@gmail.com
 Manoj Sharma   manoj.sha...@nutanix.com
 Marcin Mirecki mmire...@redhat.com
 Mario Cabrera  mario.cabr...@hpe.com
-Mark D. Gray   mark.d.g...@intel.com
+Mark D. Gray   mark.d.g...@redhat.com
 Mark Hamilton
 Mark Kavanagh  mark.b.kavanag...@gmail.com
 Mark Maglana   mmagl...@gmail.com
-- 
2.27.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v7 3/4] ovn.at: Fix whitespace

2021-07-07 Thread Mark Gray
Signed-off-by: Mark Gray 
---
 tests/ovn.at | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tests/ovn.at b/tests/ovn.at
index eb9bccdc7053..e5d8869a8417 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -5888,7 +5888,7 @@ test_dhcp() {
 local expect_resume=:
 local trace=false
 while :; do
-case $1 in 
+case $1 in
 (--no-resume) expect_resume=false; shift ;;
 # --trace isn't used but it can be useful for debugging:
 (--trace) trace=:; shift ;;
@@ -8567,7 +8567,7 @@ check test "$c6_tag" != "$c0_tag"
 check test "$c6_tag" != "$c2_tag"
 check test "$c6_tag" != "$c3_tag"
 
-AS_BOX([restart northd and make sure tag allocation is stable]) 
+AS_BOX([restart northd and make sure tag allocation is stable])
 as northd
 OVS_APP_EXIT_AND_WAIT([NORTHD_TYPE])
 start_daemon NORTHD_TYPE \
@@ -11554,7 +11554,7 @@ ovn-nbctl --wait=sb ha-chassis-group-add-chassis hagrp1 
hv4 40
 AS_BOX([Wait till cr-alice is claimed by hv4])
 hv4_chassis=$(fetch_column Chassis _uuid name=hv4)
 AS_BOX([check that the chassis redirect port has been claimed by the gw1 
chassis])
-wait_row_count Port_Binding 1 logical_port=cr-alice chassis=$hv4_chassis 
+wait_row_count Port_Binding 1 logical_port=cr-alice chassis=$hv4_chassis
 
 AS_BOX([Reset the pcap file for hv2/br-ex_n2])
 # From now on ovn-controller in hv2 should not send GARPs for the router ports.
@@ -12246,7 +12246,7 @@ check_row_count HA_Chassis_Group 1 name=outside
 check_row_count HA_Chassis 2 'chassis!=[[]]'
 
 ha_ch=$(fetch_column HA_Chassis_Group ha_chassis)
-check_column "$ha_ch" HA_Chassis _uuid 
+check_column "$ha_ch" HA_Chassis _uuid
 
 for chassis in gw1 gw2 hv1 hv2; do
 as $chassis
@@ -16474,7 +16474,7 @@ test_ip6_packet_larger() {
 inner_icmp6=800062f1
 inner_icmp6_and_payload=$(icmp6_csum_inplace ${inner_icmp6}${payload} 
${inner_ip6})
 inner_packet=${inner_ip6}${inner_icmp6_and_payload}
-
+
 # Then the outer.
 outer_ip6=60883afe${ipv6_rt}${ipv6_src}
 outer_icmp6_and_payload=$(icmp6_csum_inplace 0200$(printf 
"%04x" $mtu)${inner_packet} $outer_ip6)
-- 
2.27.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v7 0/4] northd: Refactor Logical Flows for routers with DNAT/Load Balancers

2021-07-07 Thread Mark Gray
There are a number of issues with logical flows in the DNAT/SNAT/DEFRAG tables. 
This
series addresses these issues.

This is a continuation of the series 
https://patchwork.ozlabs.org/project/ovn/list/?series=245191.
As additional changes and patches have been added and the scope of the patch is
broader, the series starts from v1.

v3:  resending as my mail client gave an error message for patch 4/4
v4:  fix line length errors from 0-day
v6:  rebase
v7:  rebase

Mark Gray (4):
  northd: update stage-name if changed
  northd: Refactor Logical Flows for routers with DNAT/Load Balancers
  ovn.at: Fix whitespace
  AUTHORS: update email for Mark Gray

 .mailmap|   1 +
 AUTHORS.rst |   2 +-
 lib/ovn-util.c  |   6 +-
 northd/ovn-northd.8.xml | 285 ++---
 northd/ovn-northd.c | 219 -
 northd/ovn_northd.dl| 136 +---
 tests/ovn-northd.at | 685 +++-
 tests/ovn.at|  18 +-
 tests/system-ovn.at |  58 +++-
 9 files changed, 1068 insertions(+), 342 deletions(-)

-- 
2.27.0


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v7 1/4] northd: update stage-name if changed

2021-07-07 Thread Mark Gray
If a new table is added to a logical flow pipeline, the mapping between
'external_ids:stage-name' from the 'Logical_Flow' table in the
'OVN_Southbound' database and the 'stage' number may change for some tables.

If 'ovn-northd' is started against a populated Southbound database,
'external_ids' will not be updated to reflect the new, correct
name. This will cause 'external_ids' to be incorrectly displayed by some
tools and commands such as `ovn-sbctl dump-flows`.

This commit, reconciles these changes as part of build_lflows() when
'ovn_internal_version' is updated.

Suggested-by: Ilya Maximets 
Signed-off-by: Mark Gray 
---

Notes:
v2:  Update all 'external_ids' rather than just 'stage-name'
v4:  Fix line length errors from 0-day

 lib/ovn-util.c  |  4 ++--
 northd/ovn-northd.c | 43 ---
 2 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/lib/ovn-util.c b/lib/ovn-util.c
index c5af8d1ab340..acf4b1cd6059 100644
--- a/lib/ovn-util.c
+++ b/lib/ovn-util.c
@@ -758,8 +758,8 @@ ip_address_and_port_from_lb_key(const char *key, char 
**ip_address,
 return true;
 }
 
-/* Increment this for any logical flow changes or if existing OVN action is
- * modified. */
+/* Increment this for any logical flow changes, if an existing OVN action is
+ * modified or a stage is added to a logical pipeline. */
 #define OVN_INTERNAL_MINOR_VER 0
 
 /* Returns the OVN version. The caller must free the returned value. */
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 570c6a3efd77..eb25e31b1f7d 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -12447,7 +12447,8 @@ build_lflows(struct northd_context *ctx, struct hmap 
*datapaths,
  struct hmap *ports, struct hmap *port_groups,
  struct hmap *mcgroups, struct hmap *igmp_groups,
  struct shash *meter_groups,
- struct hmap *lbs, struct hmap *bfd_connections)
+ struct hmap *lbs, struct hmap *bfd_connections,
+ bool ovn_internal_version_changed)
 {
 struct hmap lflows;
 
@@ -12559,6 +12560,32 @@ build_lflows(struct northd_context *ctx, struct hmap 
*datapaths,
 ovn_stage_build(dp_type, pipeline, sbflow->table_id),
 sbflow->priority, sbflow->match, sbflow->actions, sbflow->hash);
 if (lflow) {
+if (ovn_internal_version_changed) {
+const char *stage_name = smap_get_def(&sbflow->external_ids,
+  "stage-name", "");
+const char *stage_hint = smap_get_def(&sbflow->external_ids,
+  "stage-hint", "");
+const char *source = smap_get_def(&sbflow->external_ids,
+  "source", "");
+
+if (strcmp(stage_name, ovn_stage_to_str(lflow->stage))) {
+sbrec_logical_flow_update_external_ids_setkey(sbflow,
+ "stage-name", ovn_stage_to_str(lflow->stage));
+}
+if (lflow->stage_hint) {
+if (strcmp(stage_hint, lflow->stage_hint)) {
+sbrec_logical_flow_update_external_ids_setkey(sbflow,
+"stage-hint", lflow->stage_hint);
+}
+}
+if (lflow->where) {
+if (strcmp(source, lflow->where)) {
+sbrec_logical_flow_update_external_ids_setkey(sbflow,
+"source", lflow->where);
+}
+}
+}
+
 /* This is a valid lflow.  Checking if the datapath group needs
  * updates. */
 bool update_dp_group = false;
@@ -13390,6 +13417,7 @@ ovnnb_db_run(struct northd_context *ctx,
 struct shash meter_groups = SHASH_INITIALIZER(&meter_groups);
 struct hmap lbs;
 struct hmap bfd_connections = HMAP_INITIALIZER(&bfd_connections);
+bool ovn_internal_version_changed = true;
 
 /* Sync ipsec configuration.
  * Copy nb_cfg from northbound to southbound database.
@@ -13441,7 +13469,13 @@ ovnnb_db_run(struct northd_context *ctx,
 smap_replace(&options, "max_tunid", max_tunid);
 free(max_tunid);
 
-smap_replace(&options, "northd_internal_version", ovn_internal_version);
+if (!strcmp(ovn_internal_version,
+smap_get_def(&options, "northd_internal_version", ""))) {
+ovn_internal_version_changed = false;
+} else {
+smap_replace(&options, "northd_internal_version",
+ ovn_internal_version);
+}
 
 nbrec_nb_global_verify_options(nb);
 nbrec_nb_global_set_options(nb, &options);
@@ -13481,7 +13515,8 @@ ovnnb_db_run(struct northd_context *ctx,
 build_meter_groups(ctx, &meter_groups);
 build_bfd_table(ctx, &bfd_connections, ports);
 build_lflows(ctx, datapaths, ports, &port_groups, &mcast_groups,
-

Re: [ovs-dev] [v5 01/11] dpif-netdev: Add command line and function pointer for miniflow extract

2021-07-07 Thread Eelco Chaudron


On 7 Jul 2021, at 9:21, Amber, Kumar wrote:

> Hi Eelco
>
> Pls find my comments inline.
>
 See comments inline...
>>>
>>> 
>>>
> +return;
> +}


 Argument handling is not as it should be, see my previous comment. I
 think the packets count should only be available for the study option
 (this might not be the correct patch, but just want to make sure it’s
>> addressed, and I do not forget).

 So as an example this looks odd trying to set it for a specific PMD:

   $ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator 15 1
   Miniflow implementation set to autovalidator, on pmd thread 1

 Why do I have to put in the dummy value 15. Here is a quote from my
 previous
 comment:

 “
   We also might need to re-think the command to make sure
 packet_count_to_study is only needed for the study command.
   So the help text might become something like:

   dpif-netdev/miniflow-parser-set {miniflow_implementation_name |
 study [pkt_cnt]} [dp] [pmd_core]
>>>
>>>
>>> I don't particularly like the "insert extra variable" with PMD_core moving 
>>> up an
>> index if study is used.
>>> Special casing specific implementations like that (if study then change 
>>> indexes)
>> is nasty.
>>>
>>> (Note that based on Flavio's feedback the [dp] argument was removed.)
>>>
>>> Thoughts on the this suggestion:
>>> $ dpif-netdev/miniflow-parser-set miniflow_implementation_name
>>> [pmd_core] [pkt_cnt]
>>>
>>> Notes:
>>> 1) All arguments are positional, optional arguments at the end
>>> 2) Based on "power-user-ness", the command required will get longer
>>>- simple usage is simple, with just $  miniflow-parser-set
>>> 
>>> 3) The worst-part is that to specify [pkt_cnt] to study, it also requires 
>>> setting
>> [pmd_core]. Given [pkt_cnt] is a power-user option, I think this is the right
>> compromise.
>>>
>>> Agree to implement & merge the above?
>>
>> You are right, and your suggestion eases the general use case but makes the
>> pkt_cnt option hard to use, as you have to execute the command for each PMD.
>>
>> I was looking at other commands with the same problem, and they use a -pmd
>> keyword approach. Some examples:
>>
>>   dpif-netdev/pmd-rxq-show [-pmd core] [dp]
>>   dpif-netdev/pmd-stats-clear [-pmd core] [dp]
>>   dpif-netdev/pmd-stats-show [-pmd core] [dp]
>>
>> Guess we can do the same here:
>>
>>   dpif-netdev/miniflow-parser-set [-pmd core] miniflow_implementation_name
>> [pkt_cnt]
>>
>
> We can certainly do that but there is a problem making [-pmd-core] first 
> parameter makes it required before every set command  which should not be the 
> case as this is
> Effectively still optional and it doesn’t look pleasant and will introduce 
> if-else kind of dirty code in a already complex set command ?

Looking at the other existing commands, options come before the bridge name, or 
other switches, so I think we should keep it consistent. The 
dpif-netdev/pmd-stats-show has a clean implementation, and I think you can use 
something similar.

>
> I still think the

Guess you sent the email too fast?

>>> .
>>>
>>> Regards, -Harry

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] Don't suppress localport traffic directed to external port

2021-07-07 Thread Dumitru Ceara
On 7/7/21 5:20 AM, Ihar Hrachyshka wrote:
> Recently, we stopped leaking localport traffic through localnet ports
> into fabric to avoid unnecessary flipping between chassis hosting the
> same localport.
> 
> Despite the type name, in some scenarios localports are supposed to talk
> outside the hosting chassis. Specifically, in OpenStack [1] metadata
> service for SR-IOV ports is implemented as a localport hosted on another
> chassis that is exposed to the chassis owning the SR-IOV port through an
> "external" port. In this case, "leaking" localport traffic into fabric
> is desirable.
> 
> This patch inserts a higher priority flow per external port on the
> same datapath that avoids dropping localport traffic.
> 
> Fixes: 96959e56d634 ("physical: do not forward traffic from localport to
> a localnet one")
> 
> [1] https://docs.openstack.org/neutron/latest/admin/ovn/sriov.html
> 
> Signed-off-by: Ihar Hrachyshka 
> ---

Hi Ihar,

Thanks for working on this!

I just had a glance at the change so this is not a full review.

>  controller/physical.c | 48 ++
>  tests/ovn.at  | 54 +++
>  2 files changed, 102 insertions(+)
> 
> diff --git a/controller/physical.c b/controller/physical.c
> index 17ca5afbb..c2de30941 100644
> --- a/controller/physical.c
> +++ b/controller/physical.c
> @@ -920,6 +920,7 @@ get_binding_peer(struct ovsdb_idl_index 
> *sbrec_port_binding_by_name,
>  
>  static void
>  consider_port_binding(struct ovsdb_idl_index *sbrec_port_binding_by_name,
> +  const struct sbrec_port_binding_table *pb_table,
>enum mf_field_id mff_ovn_geneve,
>const struct simap *ct_zones,
>const struct sset *active_tunnels,
> @@ -1281,6 +1282,49 @@ consider_port_binding(struct ovsdb_idl_index 
> *sbrec_port_binding_by_name,
>  ofctrl_add_flow(flow_table, OFTABLE_CHECK_LOOPBACK, 160,
>  binding->header_.uuid.parts[0], &match,
>  ofpacts_p, &binding->header_.uuid);
> +
> +/* Localport traffic directed to external is *not* local. */
> +const struct sbrec_port_binding *peer;
> +SBREC_PORT_BINDING_TABLE_FOR_EACH (peer, pb_table) {
> +if (strcmp(peer->type, "external")) {
> +continue;
> +}
> +if (peer->datapath->tunnel_key != dp_key) {
> +continue;
> +}
> +if (strcmp(peer->chassis->name, chassis->name)) {
> +continue;
> +}

Won't this create a scalability issue?  If I'm reading this correctly,
every time consider_port_binding() is called for a localnet port we'll
be walking all port_bindings in the SB DB table (there can be a lot of
them in scaled scenarios) and skip most of them because they're not of
type external or they're not owned by the local chassis or they're on a
different datapath.

One option would be to use an IDL index instead (although that's still
log(n) complexity for every localnet port, I think).  Another option
would be to precompute the set of external ports for each datapath so we
don't have to walk all ports every time.

> +
> +ofpbuf_clear(ofpacts_p);
> +for (int i = 0; i < MFF_N_LOG_REGS; i++) {
> +put_load(0, MFF_REG0 + i, 0, 32, ofpacts_p);
> +}
> +put_resubmit(OFTABLE_LOG_EGRESS_PIPELINE, ofpacts_p);
> +
> +for (int i = 0; i < peer->n_mac; i++) {
> +char *err_str;
> +struct eth_addr peer_mac;
> +if ((err_str = str_to_mac(peer->mac[i], &peer_mac))) {
> +VLOG_WARN("Parsing MAC failed for external port: %s, 
> "
> +"with error: %s", peer->logical_port, 
> err_str);

This probably needs rate limiting.

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v5 01/11] dpif-netdev: Add command line and function pointer for miniflow extract

2021-07-07 Thread Van Haaren, Harry
> -Original Message-
> From: Eelco Chaudron 
> Sent: Wednesday, July 7, 2021 7:45 AM
> To: Van Haaren, Harry 
> Cc: Ferriter, Cian ; ovs-dev@openvswitch.org;
> f...@sysclose.org; i.maxim...@ovn.org; Amber, Kumar ;
> Stokes, Ian 
> Subject: Re: [v5 01/11] dpif-netdev: Add command line and function pointer for
> miniflow extract
> 
> 
> 
> On 6 Jul 2021, at 17:32, Van Haaren, Harry wrote:



> > Agree to implement & merge the above?
> 
> You are right, and your suggestion eases the general use case but makes the 
> pkt_cnt
> option hard to use, as you have to execute the command for each PMD.
> 
> I was looking at other commands with the same problem, and they use a -pmd
> keyword approach. Some examples:
> 
>   dpif-netdev/pmd-rxq-show [-pmd core] [dp]
>   dpif-netdev/pmd-stats-clear [-pmd core] [dp]
>   dpif-netdev/pmd-stats-show [-pmd core] [dp]
> 
> Guess we can do the same here:
> 
>   dpif-netdev/miniflow-parser-set [-pmd core] miniflow_implementation_name
> [pkt_cnt]

Ah, yes, that's better than the other suggestions;
1) Consistency for power users with how -pmd thread can be set
2) Shortest possible command for simplest/most-common usage (just set study)

Thanks for input, v7 targeting this command method!

> > .
> >
> > Regards, -Harry

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v5 01/11] dpif-netdev: Add command line and function pointer for miniflow extract

2021-07-07 Thread Amber, Kumar
Hi Eelco 

Pls find my comments inline.

> >> See comments inline...
> >
> > 
> >
> >>> +return;
> >>> +}
> >>
> >>
> >> Argument handling is not as it should be, see my previous comment. I
> >> think the packets count should only be available for the study option
> >> (this might not be the correct patch, but just want to make sure it’s
> addressed, and I do not forget).
> >>
> >> So as an example this looks odd trying to set it for a specific PMD:
> >>
> >>   $ ovs-appctl dpif-netdev/miniflow-parser-set autovalidator 15 1
> >>   Miniflow implementation set to autovalidator, on pmd thread 1
> >>
> >> Why do I have to put in the dummy value 15. Here is a quote from my
> >> previous
> >> comment:
> >>
> >> “
> >>   We also might need to re-think the command to make sure
> >> packet_count_to_study is only needed for the study command.
> >>   So the help text might become something like:
> >>
> >>   dpif-netdev/miniflow-parser-set {miniflow_implementation_name |
> >> study [pkt_cnt]} [dp] [pmd_core]
> >
> >
> > I don't particularly like the "insert extra variable" with PMD_core moving 
> > up an
> index if study is used.
> > Special casing specific implementations like that (if study then change 
> > indexes)
> is nasty.
> >
> > (Note that based on Flavio's feedback the [dp] argument was removed.)
> >
> > Thoughts on the this suggestion:
> > $ dpif-netdev/miniflow-parser-set miniflow_implementation_name
> > [pmd_core] [pkt_cnt]
> >
> > Notes:
> > 1) All arguments are positional, optional arguments at the end
> > 2) Based on "power-user-ness", the command required will get longer
> >- simple usage is simple, with just $  miniflow-parser-set
> > 
> > 3) The worst-part is that to specify [pkt_cnt] to study, it also requires 
> > setting
> [pmd_core]. Given [pkt_cnt] is a power-user option, I think this is the right
> compromise.
> >
> > Agree to implement & merge the above?
> 
> You are right, and your suggestion eases the general use case but makes the
> pkt_cnt option hard to use, as you have to execute the command for each PMD.
> 
> I was looking at other commands with the same problem, and they use a -pmd
> keyword approach. Some examples:
> 
>   dpif-netdev/pmd-rxq-show [-pmd core] [dp]
>   dpif-netdev/pmd-stats-clear [-pmd core] [dp]
>   dpif-netdev/pmd-stats-show [-pmd core] [dp]
> 
> Guess we can do the same here:
> 
>   dpif-netdev/miniflow-parser-set [-pmd core] miniflow_implementation_name
> [pkt_cnt] 
> 

We can certainly do that but there is a problem making [-pmd-core] first 
parameter makes it required before every set command  which should not be the 
case as this is
Effectively still optional and it doesn’t look pleasant and will introduce 
if-else kind of dirty code in a already complex set command ?

I still think the
> > .
> >
> > Regards, -Harry

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] northd: Add config option to specify # of threads

2021-07-07 Thread 0-day Robot
Bleep bloop.  Greetings Fabrizio D'Angelo, I am a robot and I have tried out 
your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 81 characters long (recommended limit is 79)
#98 FILE: lib/ovn-parallel-hmap.h:269:
#define add_worker_pool(start, thread_num) ovn_add_worker_pool(start, 
thread_num)

Lines checked: 157, Warnings: 1, Errors: 0


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn] northd: Add config option to specify # of threads

2021-07-07 Thread Fabrizio D'Angelo
Uses northd database to specify number of threads that should be used
when lflow parallel computation is enabled.

Example:
ovn-nbctl set NB_Global . options:num_parallel_threads=16

Reported at:
https://bugzilla.redhat.com/show_bug.cgi?id=1975345

Signed-off-by: Fabrizio D'Angelo 
---
 lib/ovn-parallel-hmap.c | 12 ++--
 lib/ovn-parallel-hmap.h |  5 +++--
 northd/ovn-northd.c |  7 ++-
 ovn-nb.xml  | 10 ++
 4 files changed, 25 insertions(+), 9 deletions(-)

diff --git a/lib/ovn-parallel-hmap.c b/lib/ovn-parallel-hmap.c
index b8c7ac786..cae0b3110 100644
--- a/lib/ovn-parallel-hmap.c
+++ b/lib/ovn-parallel-hmap.c
@@ -62,7 +62,7 @@ static int pool_size;
 static int sembase;
 
 static void worker_pool_hook(void *aux OVS_UNUSED);
-static void setup_worker_pools(bool force);
+static void setup_worker_pools(bool force, unsigned int thread_num);
 static void merge_list_results(struct worker_pool *pool OVS_UNUSED,
void *fin_result, void *result_frags,
int index);
@@ -86,14 +86,14 @@ ovn_can_parallelize_hashes(bool force_parallel)
 &test,
 true)) {
 ovs_mutex_lock(&init_mutex);
-setup_worker_pools(force_parallel);
+setup_worker_pools(force_parallel, 0);
 ovs_mutex_unlock(&init_mutex);
 }
 return can_parallelize;
 }
 
 struct worker_pool *
-ovn_add_worker_pool(void *(*start)(void *))
+ovn_add_worker_pool(void *(*start)(void *), unsigned int thread_num)
 {
 struct worker_pool *new_pool = NULL;
 struct worker_control *new_control;
@@ -109,7 +109,7 @@ ovn_add_worker_pool(void *(*start)(void *))
 &test,
 true)) {
 ovs_mutex_lock(&init_mutex);
-setup_worker_pools(false);
+setup_worker_pools(false, thread_num);
 ovs_mutex_unlock(&init_mutex);
 }
 
@@ -401,14 +401,14 @@ worker_pool_hook(void *aux OVS_UNUSED) {
 }
 
 static void
-setup_worker_pools(bool force) {
+setup_worker_pools(bool force, unsigned int thread_num) {
 int cores, nodes;
 
 nodes = ovs_numa_get_n_numas();
 if (nodes == OVS_NUMA_UNSPEC || nodes <= 0) {
 nodes = 1;
 }
-cores = ovs_numa_get_n_cores();
+cores = thread_num ? thread_num : ovs_numa_get_n_cores();
 
 /* If there is no NUMA config, use 4 cores.
  * If there is NUMA config use half the cores on
diff --git a/lib/ovn-parallel-hmap.h b/lib/ovn-parallel-hmap.h
index 0af8914c4..9637a273d 100644
--- a/lib/ovn-parallel-hmap.h
+++ b/lib/ovn-parallel-hmap.h
@@ -95,7 +95,8 @@ struct worker_pool {
 /* Add a worker pool for thread function start() which expects a pointer to
  * a worker_control structure as an argument. */
 
-struct worker_pool *ovn_add_worker_pool(void *(*start)(void *));
+struct worker_pool *ovn_add_worker_pool(void *(*start)(void *),
+unsigned int thread_num);
 
 /* Setting this to true will make all processing threads exit */
 
@@ -265,7 +266,7 @@ bool ovn_can_parallelize_hashes(bool force_parallel);
 
 #define stop_parallel_processing() ovn_stop_parallel_processing()
 
-#define add_worker_pool(start) ovn_add_worker_pool(start)
+#define add_worker_pool(start, thread_num) ovn_add_worker_pool(start, 
thread_num)
 
 #define fast_hmap_size_for(hmap, size) ovn_fast_hmap_size_for(hmap, size)
 
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index 570c6a3ef..ffefac361 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -4153,6 +4153,7 @@ ovn_lflow_init(struct ovn_lflow *lflow, struct 
ovn_datapath *od,
  * logical datapath only by creating a datapath group. */
 static bool use_logical_dp_groups = false;
 static bool use_parallel_build = true;
+static unsigned int num_parallel_threads;
 
 static struct hashrow_locks lflow_locks;
 
@@ -12219,7 +12220,8 @@ init_lflows_thread_pool(void)
 int index;
 
 if (!pool_init_done) {
-struct worker_pool *pool = add_worker_pool(build_lflows_thread);
+struct worker_pool *pool = add_worker_pool(build_lflows_thread,
+   num_parallel_threads);
 pool_init_done = true;
 if (pool) {
 build_lflows_pool = xmalloc(sizeof(*build_lflows_pool));
@@ -13456,6 +13458,9 @@ ovnnb_db_run(struct northd_context *ctx,
 (smap_get_bool(&nb->options, "use_parallel_build", false) &&
  ovn_can_parallelize_hashes(false));
 
+num_parallel_threads =
+smap_get_uint(&nb->options, "num_parallel_threads", 0);
+
 use_logical_dp_groups = smap_get_bool(&nb->options,
   "use_logical_dp_groups", false);
 use_ct_inv_match = smap_get_bool(&nb->options,
diff --git a/ovn-nb.xml b/ovn-nb.xml
index 36a77097c..d5bfb7ece 100644
--- a/ovn-nb.xml
+++ b/ovn-nb.xml
@@ -226,6 +226,16 @@
   The default value is false.
 
   
+  
+
+  Manually specify the number of threads to