Re: [ovs-dev] [PATCH RFC v2 0/6] conntrack: Introduce buckets and reduce contention.

2022-03-10 Thread Peng He
Hi,

just some thought on the results.

the test-conntrack.c now only prepares one batch size of packets for each
thread,
and repeat running ct on the same batch until the packet number reaches to
the parameter of n_pkt, which is, in your case, 16777216

So I guess the optimization results largely comes from the fact that you
remove
the global lock on the timeout lists, which is taken/released by each
packet.

the split bucket patch reduces the contention and thus increase the rate of
connection setup,
I think maybe we need a new benchmark on the setup rate.



Paolo Valerio  于2022年3月7日周一 20:39写道:

> This series aims to share the work done so far, and to start a
> discussion of a slightly different approach in terms of the way we
> handle the connections.
> It's not considered ready as further work and extensive tests are
> needed.
>
> These are some performance numbers obtained in the following way:
>
> ./tests/ovstest test-conntrack benchmark $i 16777216 32 1
>
> $i   baseline  ct-scale  ct-bucket
>  2   16480 ms   3527 ms3180 ms
>  4   37375 ms   4997 ms4477 ms
>  8   63239 ms   9692 ms8954 ms
> 16  131583 ms  19303 ms   18062 ms
>
> Both ct-scale and ct-bucket introduce a noticeable improvement in
> terms of performace. It's worth to mention that the ct-scale, keeping
> the exp_lists, is more efficient during the sweeping.
>
> There are two main logical changes that the series tries to introduce.
> The first one is the reintroduction of buckets, the second one is the
> removal of the expiration lists.
>
> The first two patches are a prereq and were picked (untouched) from
> this series (by Gaetan):
>
>
> https://patchwork.ozlabs.org/project/openvswitch/list/?series=249027&state=*
>
> The third patch is a follow up of the following:
>
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/20201129033255.64647-2-hepeng.0...@bytedance.com/
>
> Some logic has changed, but the overall idea remained the same.
> Additional details in the patch description.
>
> The fourth patch introduces the buckets and removes the expiration
> lists.
> The buckets got reintroduced as cmaps instead of hmaps, trying to
> narrow down the critical sections as well.
>
> An alternative approach may involve the replacement of cmaps with
> linked lists (without increasing the number of locks), increasing the
> number of buckets (it could involve the ability to change the buckets
> number when the user changes the connection limit). This could improve
> the stale conns eviction during the processing of the packet, but this
> is currently out of scope for this series (can be discussed, though).
>
> The insertion, namely the creation of the connection, can probably be
> improved (or at least an improvement can be evaluated), as, in case of
> nat, it acquires the bucket locks calling nat_get_unique_tuple_lock().
> This is needed because we may need to know the reverse tuple before
> locking in order to acquire the locks in the right way to avoid ABBA
> deadlocks.
>
> Further details about locking may be found in patch #4 description.
>
> There's a known limitation, and it is related mostly to zone
> limit. Keeping the stale entries, leads to a mismatch between the
> current number of per zone connections and the actual number of valid
> connections. This means that if we have a small zone limit number, we
> may wait a whole "next_wakeup" time before the entries get cleaned up
> (assuming we have candidates for removal). To avoid this, a zone clean
> up from the packet path has been added, but alternative approaches as
> triggering a per zone cleanup when a threshold of connections has
> been reached (storing the connections per bucket and per zone during
> the creation) may be still viable.
>
> Expiration lists have been removed for the sake of evaluation, but
> they could be included and duplicated among buckets. In this way we
> could distribute the contention (as a matter of fact reducing the
> number of connections per list) that affects the write access to a
> single data structure during every connection update.
>
> Patch #6 reintegrates the command below with multiple buckets:
>
> ovs-appctl dpctl/ct-bkts
>
> A couple of minor patches have been kept out of the series, for the
> time being, as they strongly depend on #4.
>
> Gaetan Rivet (3):
>   conntrack: Use a cmap to store zone limits
>   conntrack-tp: Use a cmap to store timeout policies
>   conntrack: Use an atomic conn expiration value
>
> Paolo Valerio (2):
>   conntrack: Split single cmap to multiple buckets.
>   conntrack: Make ovs-appctl dpctl/ct-bkts work with multiple buckets
>
> Peng He (1):
>   conntrack: Replaces nat_conn introducing key directionality.
>
>
>  lib/conntrack-private.h |   60 ++-
>  lib/conntrack-tp.c  |  102 ++--
>  lib/conntrack.c | 1062 +++
>  lib/conntrack.h |6 +-
>  lib/dpif-netdev.c   |5 +-
>  tests/system-traffic.at |5 +-
>  6 files 

Re: [ovs-dev] [PATCH net-next v2] net: openvswitch: fix uAPI incompatibility with existing user space

2022-03-10 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net-next.git (master)
by Jakub Kicinski :

On Wed,  9 Mar 2022 23:20:33 +0100 you wrote:
> Few years ago OVS user space made a strange choice in the commit [1]
> to define types only valid for the user space inside the copy of a
> kernel uAPI header.  '#ifndef __KERNEL__' and another attribute was
> added later.
> 
> This leads to the inevitable clash between user space and kernel types
> when the kernel uAPI is extended.  The issue was unveiled with the
> addition of a new type for IPv6 extension header in kernel uAPI.
> 
> [...]

Here is the summary with links:
  - [net-next,v2] net: openvswitch: fix uAPI incompatibility with existing user 
space
https://git.kernel.org/netdev/net-next/c/1926407a4ab0

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v20 3/8] dpif-offload-provider: Introduce dpif-offload-provider layer

2022-03-10 Thread Chris Mi via dev

On 2022-03-04 6:59 PM, Eelco Chaudron wrote:

We are waiting on Ilya’s feedback on this, which he told us he will look at 
after the v2.17 release.

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.ozlabs.org%2Fproject%2Fopenvswitch%2Fpatch%2F2025025312.786953-4-cmi%40nvidia.com%2F&data=04%7C01%7Ccmi%40nvidia.com%7C01938f4df4e54cb7fd1008d9fdce2259%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637819884060258877%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=6iEQpwH%2FxEJyFuhdtqey%2F53I5cT0rSsz4ueKbHAiWJU%3D&reserved=0

Looking at the thread again, I still feel like we need to remove the 1:1 
mapping, if not we should remove the entire offload class.
OK. Let's wait on Ilya's feedback. If the design is finalized, I'll 
change it accordingly.


Thanks,
Chris


//Eelco

On 15 Feb 2022, at 10:17, Chris Mi wrote:


Some offload actions require functionality that is not netdev
based, but dpif. For example, sFlow action requires to create
a psample netlink socket to receive the sampled packets from
TC or kernel driver.

Create dpif-offload-provider layer to support such actions.

Issue:  2181036
Change-Id: I880b8b2aebbf6886fd8064409108ee59974ff195
Signed-off-by: Chris Mi 
Reviewed-by: Eli Britstein 
---
  lib/automake.mk |  2 ++
  lib/dpif-offload-provider.h | 52 +
  lib/dpif-offload.c  | 43 ++
  lib/dpif-provider.h |  4 ++-
  4 files changed, 100 insertions(+), 1 deletion(-)
  create mode 100644 lib/dpif-offload-provider.h
  create mode 100644 lib/dpif-offload.c

diff --git a/lib/automake.mk b/lib/automake.mk
index a23cdc4ad..781fba47a 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -129,6 +129,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpif-netdev-private.h \
lib/dpif-netdev-perf.c \
lib/dpif-netdev-perf.h \
+   lib/dpif-offload.c \
+   lib/dpif-offload-provider.h \
lib/dpif-provider.h \
lib/dpif.c \
lib/dpif.h \
diff --git a/lib/dpif-offload-provider.h b/lib/dpif-offload-provider.h
new file mode 100644
index 0..3b94740df
--- /dev/null
+++ b/lib/dpif-offload-provider.h
@@ -0,0 +1,52 @@
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * 
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.apache.org%2Flicenses%2FLICENSE-2.0&data=04%7C01%7Ccmi%40nvidia.com%7C01938f4df4e54cb7fd1008d9fdce2259%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637819884060258877%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=z5yeYkZ654e15ykYLHQjYeAMGL%2B9N94LTkoFyNg%2Fdgs%3D&reserved=0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef DPIF_OFFLOAD_PROVIDER_H
+#define DPIF_OFFLOAD_PROVIDER_H
+
+struct dpif;
+struct dpif_offload_sflow;
+
+/* Datapath interface offload structure, to be defined by each implementation
+ * of a datapath interface.
+ */
+struct dpif_offload_class {
+/* Type of dpif offload in this class, e.g. "system", "netdev", etc.
+ *
+ * One of the providers should supply a "system" type, since this is
+ * the type assumed if no type is specified when opening a dpif. */
+const char *type;
+
+/* Called when the dpif offload provider is registered. */
+int (*init)(void);
+
+/* Free all dpif offload resources. */
+void (*destroy)(void);
+
+/* Arranges for the poll loop for an upcall handler to wake up when psample
+ * has a message queued to be received. */
+void (*sflow_recv_wait)(void);
+
+/* Polls for an upcall from psample for an upcall handler.
+ * Return 0 for success. */
+int (*sflow_recv)(struct dpif_offload_sflow *sflow);
+};
+
+void dpif_offload_sflow_recv_wait(const struct dpif *dpif);
+int dpif_offload_sflow_recv(const struct dpif *dpif,
+struct dpif_offload_sflow *sflow);
+
+#endif /* DPIF_OFFLOAD_PROVIDER_H */
diff --git a/lib/dpif-offload.c b/lib/dpif-offload.c
new file mode 100644
index 0..f2bf3e634
--- /dev/null
+++ b/lib/dpif-offload.c
@@ -0,0 +1,43 @@
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * 
https://nam11.safelinks.protection.outlook.com/?url=http%

Re: [ovs-dev] [PATCH v20 5/8] dpif-offload-netlink: Implement dpif-offload-provider API

2022-03-10 Thread Chris Mi via dev

On 2022-03-04 6:50 PM, Eelco Chaudron wrote:

I’m re-adding some of my v18 comments, which I know are depending on Ilya’s 
feedback, although I still think we should have provider classes, i.e., not 
based on the datapath class.

Some other questions could still be answered regardless of Ilya’s feedback 
(will mark them with **). Please answer them inline, before sending another 
revision so we can have some discussion around them if needed.

OK.


//Eelco

On 15 Feb 2022, at 10:17, Chris Mi wrote:


Implement dpif-offload API for netlink datapath. And implement a
dummy dpif-offload API for netdev datapath to make tests pass.


** Why do we need a dummy? If there is no hardware offload class we should just 
pass, shouldn’t we?
According to one of your below comment, I moved dp_offload_class_lookup 
from dpif to dpif-netlink.

After that, dpif-offload for dummy is not needed.



Issue: 2181036
Change-Id: I950c3cc3c7092c3d87602da8928ad4aa2df4f38a
Signed-off-by: Chris Mi 
Reviewed-by: Eli Britstein 
---
  lib/automake.mk |   2 +
  lib/dpif-netdev.c   |   8 +-
  lib/dpif-netlink.c  |   4 +-
  lib/dpif-offload-netdev.c   |  43 +++
  lib/dpif-offload-netlink.c  | 221 
  lib/dpif-offload-provider.h |  25 +++-
  lib/dpif-offload.c  | 157 +
  lib/dpif-provider.h |   6 +-
  lib/dpif.c  |  17 ++-
  9 files changed, 476 insertions(+), 7 deletions(-)
  create mode 100644 lib/dpif-offload-netdev.c
  create mode 100644 lib/dpif-offload-netlink.c

diff --git a/lib/automake.mk b/lib/automake.mk
index 781fba47a..9ed61029a 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -130,6 +130,7 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpif-netdev-perf.c \
lib/dpif-netdev-perf.h \
lib/dpif-offload.c \
+   lib/dpif-offload-netdev.c \
lib/dpif-offload-provider.h \
lib/dpif-provider.h \
lib/dpif.c \
@@ -453,6 +454,7 @@ lib_libopenvswitch_la_SOURCES += \
lib/dpif-netlink.h \
lib/dpif-netlink-rtnl.c \
lib/dpif-netlink-rtnl.h \
+   lib/dpif-offload-netlink.c \
lib/if-notifier.c \
lib/netdev-linux.c \
lib/netdev-linux.h \
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index e28e0b554..5ebd127b9 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -1683,7 +1683,8 @@ create_dpif_netdev(struct dp_netdev *dp)
  ovs_refcount_ref(&dp->ref_cnt);

  dpif = xmalloc(sizeof *dpif);
-dpif_init(&dpif->dpif, dp->class, dp->name, netflow_id >> 8, netflow_id);
+dpif_init(&dpif->dpif, dp->class, NULL, dp->name, netflow_id >> 8,
+  netflow_id);
  dpif->dp = dp;
  dpif->last_port_seq = seq_read(dp->port_seq);

@@ -9639,6 +9640,10 @@ dpif_dummy_override(const char *type)
  if (error == 0 || error == EAFNOSUPPORT) {
  dpif_dummy_register__(type);
  }
+error = dp_offload_unregister_provider(type);
+if (error == 0 || error == EAFNOSUPPORT) {
+dpif_offload_dummy_register(type);
+}
  }

  void
@@ -9659,6 +9664,7 @@ dpif_dummy_register(enum dummy_level level)
  }

  dpif_dummy_register__("dummy");
+dpif_offload_dummy_register("dummy");

  unixctl_command_register("dpif-dummy/change-port-number",
   "dp port new-number",
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 71e35ccdd..75f85c254 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -461,8 +461,8 @@ open_dpif(const struct dpif_netlink_dp *dp, struct dpif 
**dpifp)
  dpif->port_notifier = NULL;
  fat_rwlock_init(&dpif->upcall_lock);

-dpif_init(&dpif->dpif, &dpif_netlink_class, dp->name,
-  dp->dp_ifindex, dp->dp_ifindex);
+dpif_init(&dpif->dpif, &dpif_netlink_class, &dpif_offload_netlink_class,
+  dp->name, dp->dp_ifindex, dp->dp_ifindex);

  dpif->dp_ifindex = dp->dp_ifindex;
  dpif->user_features = dp->user_features;
diff --git a/lib/dpif-offload-netdev.c b/lib/dpif-offload-netdev.c
new file mode 100644
index 0..35dac2cc3
--- /dev/null
+++ b/lib/dpif-offload-netdev.c
@@ -0,0 +1,43 @@
+/*
+ * Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * 
https://nam11.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.apache.org%2Flicenses%2FLICENSE-2.0&data=04%7C01%7Ccmi%40nvidia.com%7Cadd31a18c66c46ea056408d9fdccd97d%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637819878551136144%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=A5fNu%2BjqL48flvxwkfz1BHOAiu6LLuBTdp%2F4HkYVEdw%3D&reserved=0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "A

[ovs-dev] [PATCH] ovsdb: raft: Fix inability to read the database with DNS host names.

2022-03-10 Thread Ilya Maximets
Clustered OVSDB allows to use DNS names as addresses of raft members.
However, if DNS resolution fails during the initial database read,
this causes a fatal failure and exit of the ovsdb-server process.

Also, if DNS name of a joining server is not resolvable for one of the
followers, this follower will reject append requests for a new server
to join until the name is successfully resolved.  This makes a follower
effectively non-functional while DNS is unavailable.

To fix the problem relaxing the address verification.  Allowing
validation to pass if only name resolution failed and the address
is valid otherwise.  This will allow addresses to be added to the
database, so connections could be established later when the DNS
is available.

Additionally fixing missed initialization of the dns-resolve module.
Wihtout it, DNS requests are blocking.  This causes unexpected delays
in runtime.

Fixes: 771680d96fb6 ("DNS: Add basic support for asynchronous DNS resolving")
Reported-at: https://bugzilla.redhat.com/2055097
Signed-off-by: Ilya Maximets 
---
 lib/dns-resolve.c|  2 +-
 lib/socket-util.c| 39 
 lib/socket-util.h|  3 ++-
 lib/stream.c |  2 +-
 ofproto/ofproto-dpif-sflow.c |  3 ++-
 ovsdb/ovsdb-server.c |  3 +++
 ovsdb/raft-private.c |  5 -
 7 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/lib/dns-resolve.c b/lib/dns-resolve.c
index d34451434..8bcecb90c 100644
--- a/lib/dns-resolve.c
+++ b/lib/dns-resolve.c
@@ -265,7 +265,7 @@ resolve_callback__(void *req_, int err, struct ub_result 
*result)
 if (err != 0 || (result->qtype == ns_t_ && !result->havedata)) {
 ub_resolve_free(result);
 req->state = RESOLVE_ERROR;
-VLOG_ERR_RL(&rl, "%s: failed to resolve", req->name);
+VLOG_WARN_RL(&rl, "%s: failed to resolve", req->name);
 return;
 }
 
diff --git a/lib/socket-util.c b/lib/socket-util.c
index 4f1ffecf5..38705cc51 100644
--- a/lib/socket-util.c
+++ b/lib/socket-util.c
@@ -62,7 +62,8 @@ static bool parse_sockaddr_components(struct sockaddr_storage 
*ss,
   const char *port_s,
   uint16_t default_port,
   const char *s,
-  bool resolve_host);
+  bool resolve_host,
+  bool *dns_failure);
 
 /* Sets 'fd' to non-blocking mode.  Returns 0 if successful, otherwise a
  * positive errno value. */
@@ -438,7 +439,7 @@ parse_sockaddr_components_dns(struct sockaddr_storage *ss 
OVS_UNUSED,
 dns_resolve(host_s, &tmp_host_s);
 if (tmp_host_s != NULL) {
 parse_sockaddr_components(ss, tmp_host_s, port_s,
-  default_port, s, false);
+  default_port, s, false, NULL);
 free(tmp_host_s);
 return true;
 }
@@ -450,11 +451,15 @@ parse_sockaddr_components(struct sockaddr_storage *ss,
   char *host_s,
   const char *port_s, uint16_t default_port,
   const char *s,
-  bool resolve_host)
+  bool resolve_host, bool *dns_failure)
 {
 struct sockaddr_in *sin = sin_cast(sa_cast(ss));
 int port;
 
+if (dns_failure) {
+*dns_failure = false;
+}
+
 if (port_s && port_s[0]) {
 if (!str_to_int(port_s, 10, &port) || port < 0 || port > 65535) {
 VLOG_ERR("%s: bad port number \"%s\"", s, port_s);
@@ -501,10 +506,15 @@ parse_sockaddr_components(struct sockaddr_storage *ss,
 return true;
 
 resolve:
-if (resolve_host && parse_sockaddr_components_dns(ss, host_s, port_s,
-  default_port, s)) {
-return true;
-} else if (!resolve_host) {
+if (resolve_host) {
+if (parse_sockaddr_components_dns(ss, host_s, port_s,
+  default_port, s)) {
+return true;
+}
+if (dns_failure) {
+*dns_failure = true;
+}
+} else {
 VLOG_ERR("%s: bad IP address \"%s\"", s, host_s);
 }
 exit:
@@ -521,10 +531,12 @@ exit:
  * It resolves the host if 'resolve_host' is true.
  *
  * On success, returns true and stores the parsed remote address into '*ss'.
- * On failure, logs an error, stores zeros into '*ss', and returns false. */
+ * On failure, logs an error, stores zeros into '*ss', and returns false,
+ * '*dns_failure' indicates if the host resolution failed. */
 bool
 inet_parse_active(const char *target_, int default_port,
-  struct sockaddr_storage *ss, bool resolve_host)
+  struct sockaddr_storage *ss,
+  bool resolve_host, bool *dns_failure)
 {
 char *target = xstrdup(target_);
 char *port, *ho

[ovs-dev] [PATCH ovn branch-21.12 2/2] Prepare for 21.12.2.

2022-03-10 Thread Mark Michelson
Signed-off-by: Mark Michelson 
---
 NEWS | 3 +++
 configure.ac | 2 +-
 debian/changelog | 6 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index b00644b59..4cb40cd7f 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,6 @@
+OVN v21.12.2 - xx xxx 
+--
+
 OVN v21.12.1 - 10 Mar 2022
 --
- Bug fixes
diff --git a/configure.ac b/configure.ac
index e44c86c74..c4bf08db7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -13,7 +13,7 @@
 # limitations under the License.
 
 AC_PREREQ(2.63)
-AC_INIT(ovn, 21.12.1, b...@openvswitch.org)
+AC_INIT(ovn, 21.12.2, b...@openvswitch.org)
 AC_CONFIG_MACRO_DIR([m4])
 AC_CONFIG_AUX_DIR([build-aux])
 AC_CONFIG_HEADERS([config.h])
diff --git a/debian/changelog b/debian/changelog
index eda96af03..9f3d2ebfb 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,9 @@
+OVN (21.12.2-1) unstable; urgency=low
+   [ OVN team ]
+   * New upstream version
+
+ -- OVN team   Thu, 10 Mar 2022 15:47:33 -0500
+
 OVN (21.12.1-1) unstable; urgency=low
[ OVN team ]
* New upstream version
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn branch-21.12 0/2] Release patches for v21.12.1.

2022-03-10 Thread Mark Michelson


Mark Michelson (2):
  Set release date for 21.12.1.
  Prepare for 21.12.2.

 NEWS | 6 +-
 configure.ac | 2 +-
 debian/changelog | 8 +++-
 3 files changed, 13 insertions(+), 3 deletions(-)

-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn branch-22.03 2/2] Prepare for 22.03.1.

2022-03-10 Thread Mark Michelson
Signed-off-by: Mark Michelson 
---
 NEWS | 3 +++
 configure.ac | 2 +-
 debian/changelog | 6 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index 0f8e068b1..15fa545d2 100644
--- a/NEWS
+++ b/NEWS
@@ -1,3 +1,6 @@
+OVN v22.03.1 - xx xxx 
+--
+
 OVN v22.03.0 - 11 Mar 2022
 --
   - Refactor CoPP commands introducing a unique name index in CoPP NB table.
diff --git a/configure.ac b/configure.ac
index 283381b4e..70f86e1f5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -13,7 +13,7 @@
 # limitations under the License.
 
 AC_PREREQ(2.63)
-AC_INIT(ovn, 22.03.0, b...@openvswitch.org)
+AC_INIT(ovn, 22.03.1, b...@openvswitch.org)
 AC_CONFIG_MACRO_DIR([m4])
 AC_CONFIG_AUX_DIR([build-aux])
 AC_CONFIG_HEADERS([config.h])
diff --git a/debian/changelog b/debian/changelog
index a26420b7a..a22ba6e91 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,9 @@
+OVN (22.03.1-1) unstable; urgency=low
+   [ OVN team ]
+   * New upstream version
+
+ -- OVN team   Thu, 10 Mar 2022 15:47:41 -0500
+
 ovn (22.03.0-1) unstable; urgency=low
 
* New upstream version
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn branch-22.03 0/2] Release patches for v22.03.0.

2022-03-10 Thread Mark Michelson


Mark Michelson (2):
  Set release date for 22.03.0.
  Prepare for 22.03.1.

 NEWS | 5 -
 configure.ac | 2 +-
 debian/changelog | 8 +++-
 3 files changed, 12 insertions(+), 3 deletions(-)

-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn branch-22.03 1/2] Set release date for 22.03.0.

2022-03-10 Thread Mark Michelson
Signed-off-by: Mark Michelson 
---
 NEWS | 2 +-
 debian/changelog | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 9f3ce3cf3..0f8e068b1 100644
--- a/NEWS
+++ b/NEWS
@@ -1,4 +1,4 @@
-OVN v22.03.0 - XX XXX 
+OVN v22.03.0 - 11 Mar 2022
 --
   - Refactor CoPP commands introducing a unique name index in CoPP NB table.
 Add following new CoPP commands to manage CoPP table:
diff --git a/debian/changelog b/debian/changelog
index fe0e7f488..a26420b7a 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -2,7 +2,7 @@ ovn (22.03.0-1) unstable; urgency=low
 
* New upstream version
 
- -- OVN team   Thu, 24 Feb 2022 16:55:45 -0500
+ -- OVN team   Thu, 10 Mar 2022 15:47:41 -0500
 
 ovn (21.12.0-1) unstable; urgency=low
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn branch-21.12 1/2] Set release date for 21.12.1.

2022-03-10 Thread Mark Michelson
Signed-off-by: Mark Michelson 
---
 NEWS | 3 ++-
 debian/changelog | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 4043ecf20..b00644b59 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,6 @@
-OVN v21.12.1 - xx xxx 
+OVN v21.12.1 - 10 Mar 2022
 --
+   - Bug fixes
 
 OVN v21.12.0 - 22 Dec 2021
 --
diff --git a/debian/changelog b/debian/changelog
index 0d8593083..eda96af03 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -2,7 +2,7 @@ OVN (21.12.1-1) unstable; urgency=low
[ OVN team ]
* New upstream version
 
- -- OVN team   Wed, 22 Dec 2021 09:45:52 -0500
+ -- OVN team   Thu, 10 Mar 2022 15:47:33 -0500
 
 ovn (21.12.0-1) unstable; urgency=low
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v2] ovn-northd: Don't commit to conntrack if the conntrack was skipped.

2022-03-10 Thread numans
From: Numan Siddique 

Presently, if the inport or outport is a peer port (of router port),
then we skip sending the packet to conntrack by not setting the
reg0[0]/reg0[1] bits.  But the packet still goes through the
stages - ls_in_acl_hint, ls_in_acl and ls_in_stateful in the ingress
pipeline and similarly ls_out_acl_hint, ls_out_acl and ls_out_stateful
in the egress pipeline of the logical switch.

In these mentioned stages, the logical flows match on the ct states
(ct.new, ct.est etc) and this can be problematic if the packet was
sent to conntrack and committed in the ingress pipeline.  When
that packet enters the egress pipeline with outport set to the peer
port,  we skip sending the packet to conntrack (which is expected)
but ct state values carry over from the ingress state.  If the ct.new
flag was set in the ingress pipeline, then the below flows are hit and
the packet gets committed in the conntrack zone of the inport logical port.

table=4 (ls_out_acl ), priority=1,
 match=(ip && (!ct.est || (ct.est && ct_label.blocked == 1))),
 action=(reg0[1] = 1; next;)
table=7 (ls_out_stateful), priority=100  ,
 match=(reg0[1] == 1 && reg0[13] == 0),
 action=(ct_commit { ct_label.blocked = 0; }; next;)

With this extra commit to the same conntrack zone, sometimes the packet gets
dropped or doesn't reach the destination.  It is not very clear how
the packet gets misplaced, but avoiding this resolves the issue.
And OVN ideally should not do this extra commit.

This patch addresses this issue by adding the below logical flow
both in ls_in_acl and ls_out_acl stage

priority=2, match=(reg0[0] == 0 && reg0[1] == 0), action=(next;)

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2060688
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2062431
Reported-by: Michael Washer 
Signed-off-by: Numan Siddique 
---

v1 -> v2
---
  * Changed the approach to solve the issue.

 northd/northd.c | 17 +++--
 northd/ovn-northd.8.xml |  6 ++
 tests/ovn-northd.at |  5 +
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/northd/northd.c b/northd/northd.c
index b264fb850b..23ab5ac260 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -6529,7 +6529,7 @@ build_acls(struct ovn_datapath *od, struct hmap *lflows,
 ovn_lflow_add(lflows, od, S_SWITCH_IN_ACL_AFTER_LB, 0, "1", "next;");
 
 if (has_stateful) {
-/* Ingress and Egress ACL Table (Priority 1).
+/* Ingress and Egress ACL Table (Priority 1 and 2).
  *
  * By default, traffic is allowed.  This is partially handled by
  * the Priority 0 ACL flows added earlier, but we also need to
@@ -6549,7 +6549,20 @@ build_acls(struct ovn_datapath *od, struct hmap *lflows,
  * We need to set ct_label.blocked=0 to let the connection continue,
  * which will be done by ct_commit() in the "stateful" stage.
  * Subsequent packets will hit the flow at priority 0 that just
- * uses "next;". */
+ * uses "next;".
+ *
+ * However, we don't want to commit if the packet was not sent to
+ * conntrack in the first place.
+ * Eg. if inport or outport is a router peer port. */
+ovn_lflow_add(lflows, od, S_SWITCH_IN_ACL, 2,
+  REGBIT_CONNTRACK_DEFRAG" == 0 && "
+  REGBIT_CONNTRACK_NAT" == 0",
+  "next;");
+ovn_lflow_add(lflows, od, S_SWITCH_OUT_ACL, 2,
+  REGBIT_CONNTRACK_DEFRAG" == 0 && "
+  REGBIT_CONNTRACK_NAT" == 0",
+  "next;");
+
 ovn_lflow_add(lflows, od, S_SWITCH_IN_ACL, 1,
   "ip && (!ct.est || (ct.est && ct_label.blocked == 1))",
REGBIT_CONNTRACK_COMMIT" = 1; next;");
diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
index 4784bff041..d788efe70b 100644
--- a/northd/ovn-northd.8.xml
+++ b/northd/ovn-northd.8.xml
@@ -784,6 +784,12 @@
 
 
 
+  
+A priority-2 flow that advances the packet to the next stage if the
+packet was not sent to conntrack earlier, by matching on
+reg0[0] == 0 && reg0[1] == 1.
+  
+
   
 A priority-1 flow that sets the hint to commit IP traffic to the
 connection tracker (with action reg0[1] = 1; next;).  This
diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
index 60d91a7717..bfd0f2382b 100644
--- a/tests/ovn-northd.at
+++ b/tests/ovn-northd.at
@@ -2265,6 +2265,7 @@ AT_CHECK([ovn-sbctl lflow-list ls | grep -e 
ls_in_acl_hint -e ls_out_acl_hint -e
   table=4 (ls_out_acl ), priority=1, match=(ip && (!ct.est || 
(ct.est && ct_label.blocked == 1))), action=(reg0[[1]] = 1; next;)
   table=4 (ls_out_acl ), priority=1001 , match=(reg0[[7]] == 1 && 
(ip)), action=(reg0[[1]] = 1; next;)
   table=4 (ls_out_acl ), priority=1001 , match=(reg0[[8]] == 1 && 
(ip)), action=(n

Re: [ovs-dev] [PATCH net-next v2] net: openvswitch: fix uAPI incompatibility with existing user space

2022-03-10 Thread Aaron Conole
Ilya Maximets  writes:

> Few years ago OVS user space made a strange choice in the commit [1]
> to define types only valid for the user space inside the copy of a
> kernel uAPI header.  '#ifndef __KERNEL__' and another attribute was
> added later.
>
> This leads to the inevitable clash between user space and kernel types
> when the kernel uAPI is extended.  The issue was unveiled with the
> addition of a new type for IPv6 extension header in kernel uAPI.
>
> When kernel provides the OVS_KEY_ATTR_IPV6_EXTHDRS attribute to the
> older user space application, application tries to parse it as
> OVS_KEY_ATTR_PACKET_TYPE and discards the whole netlink message as
> malformed.  Since OVS_KEY_ATTR_IPV6_EXTHDRS is supplied along with
> every IPv6 packet that goes to the user space, IPv6 support is fully
> broken.
>
> Fixing that by bringing these user space attributes to the kernel
> uAPI to avoid the clash.  Strictly speaking this is not the problem
> of the kernel uAPI, but changing it is the only way to avoid breakage
> of the older user space applications at this point.
>
> These 2 types are explicitly rejected now since they should not be
> passed to the kernel.  Additionally, OVS_KEY_ATTR_TUNNEL_INFO moved
> out from the '#ifdef __KERNEL__' as there is no good reason to hide
> it from the userspace.  And it's also explicitly rejected now, because
> it's for in-kernel use only.
>
> Comments with warnings were added to avoid the problem coming back.
>
> (1 << type) converted to (1ULL << type) to avoid integer overflow on
> OVS_KEY_ATTR_IPV6_EXTHDRS, since it equals 32 now.
>
>  [1] beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
>
> Fixes: 28a3f0601727 ("net: openvswitch: IPv6: Add IPv6 extension header 
> support")
> Link: 
> https://lore.kernel.org/netdev/3adf00c7-fe65-3ef4-b6d7-6d8a0cad8...@nvidia.com
> Link: 
> https://github.com/openvswitch/ovs/commit/beb75a40fdc295bfd6521b0068b4cd12f6de507c
> Reported-by: Roi Dayan 
> Signed-off-by: Ilya Maximets 
> ---

Acked-by: Aaron Conole 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-03-10 Thread Kevin Traynor

On 04/03/2022 17:57, Anurag Agarwal wrote:

Hello Kevin,
I have prepared a patch for "per port cross-numa-polling" and 
attached herewith.

The results are captured in 'cross-numa-results.txt'. We see PMD to RxQ 
assignment evenly balanced across all PMDs with this patch.

Please take a look and let us know your inputs.


Hi Anurag,

I think what this is showing is more related to txqs used for sending to 
the VM. As you are allowing the rxqs from the phy port to be handled by 
more pmds, and all those rxqs have traffic then in turn more txqs are 
used for sending to the VM. The result of using more txqs when sending 
to the VM in this case is that the traffic is returned on the more rxqs.


Allowing cross-numa does not guarantee that the different pmd cores will 
poll rxqs from an interface. At least with group algorithm, the pmds 
will be selected purely on load. The right way to ensure that all VM 
txqs(/rxqs) are used is to enable the Tx-steering feature [0].


So you might be seeing some benefit in this case, but to me it's not the 
core use case of cross-numa polling. That is more about allowing the 
pmds on every numa to be used when the traffic load is primarily coming 
from one numa.


Kevin.

[0] https://docs.openvswitch.org/en/latest/topics/userspace-tx-steering/


Please find some of my inputs inline, in response to your comments.

Regards,
Anurag

-Original Message-
From: Kevin Traynor 
Sent: Thursday, February 24, 2022 7:54 PM
To: Jan Scheurich ; Wan Junjie 

Cc: d...@openvswitch.org; Anurag Agarwal 
Subject: Re: [External] Re: [PATCH] dpif-netdev: add an option to assign pmd 
rxq to all numas

Hi Jan,

On 17/02/2022 14:21, Jan Scheurich wrote:

Hi Kevin,


We have done extensive benchmarking and found that we get better
overall

PMD load balance and resulting OVS performance when we do not
statically pin any rx queues and instead let the auto-load-balancing
find the optimal distribution of phy rx queues over both NUMA nodes
to balance an asymmetric load of vhu rx queues (polled only on the local NUMA 
node).


Cross-NUMA polling of vhu rx queues comes with a very high latency
cost due

to cross-NUMA access to volatile virtio ring pointers in every
iteration (not only when actually copying packets). Cross-NUMA
polling of phy rx queues doesn't have a similar issue.




I agree that for vhost rxq polling, it always causes a performance
penalty when there is cross-numa polling.

For polling phy rxq, when phy and vhost are in different numas, I
don't see any additional penalty for cross-numa polling the phy rxq.

For the case where phy and vhost are both in the same numa, if I
change to poll the phy rxq cross-numa, then I see about a >20% tput
drop for traffic from phy -

vhost. Are you seeing that too?


Yes, but the performance drop is mostly due to the extra cost of copying the 
packets across the UPI bus to the virtio buffers on the other NUMA, not because 
of polling the phy rxq on the other NUMA.



Just to be clear, phy and vhost are on the same numa in my test. I see the drop 
when polling the phy rxq with a pmd from a different numa.



Also, the fact that a different numa can poll the phy rxq after every
rebalance means that the ability of the auto-load-balancer to
estimate and trigger a rebalance is impacted.


Agree, there is some inaccuracy in the estimation of the load a phy rx queue 
creates when it is moved to another NUMA node. So far we have not seen that as 
a practical problem.



It seems like simple pinning some phy rxqs cross-numa would avoid all
the issues above and give most of the benefit of cross-numa polling for phy 
rxqs.


That is what we have done in the past (far a lack of alternatives). But any 
static pinning reduces the ability of the auto-load balancer to do its job. 
Consider the following scenarios:

1. The phy ingress traffic is not evenly distributed by RSS due to lack of 
entropy (Examples for this are IP-IP encapsulated traffic, e.g. Calico, or 
MPLSoGRE encapsulated traffic).

2. VM traffic is very asymmetric, e.g. due to a large dual-NUMA VM whose vhu 
ports are all on NUMA 0.

In all such scenarios, static pinning of phy rxqs may lead to unnecessarily 
uneven PMD load and loss of overall capacity.



I agree that static pinning may cause a bottleneck if you have more than one rx 
pinned on a core. On the flip side, pinning removes uncertainty about the 
ability of OVS to make good assignments and ALB.
[Anurag] Echoing what Jan said, static pinning wouldn't allow rebalancing in 
case the traffic across DPDK and VHU queues is asymmetric. With the 
introduction of per port cross-numa-polling, the user has one more option in 
his tool box, to allow full auto load balancing without worrying at all about 
the rxq to PMD assignments. This also makes the deployment of OVS much simpler. 
The user only now needs to provide the list of CPUs, enable AUTO LB and 
cross-numa-polling (in case necessary). All of the rest is handled in software.



Re: [ovs-dev] [PATCH ovn] ovn-northd: Skip conntrack related stages for router ports in switch pipeline.

2022-03-10 Thread Numan Siddique
On Thu, Mar 10, 2022 at 4:22 AM Dumitru Ceara  wrote:
>
> On 3/10/22 04:04, num...@ovn.org wrote:
> > From: Numan Siddique 
> >
>
> Hi Numan,
>
> > Presently, if the inport or outport is a peer port (of router port),
> > then we skip sending the packet to conntrack by not setting the
> > reg0[0]/reg0[1] bits.  But the packet still goes through the
> > stages - ls_in_acl_hint, ls_in_acl and ls_in_stateful in the ingress
> > pipeline and similarly ls_out_acl_hint, ls_out_acl and ls_out_stateful
> > in the egress pipeline of the logical switch.
> >
> > In these mentioned stages, the logical flows match on the ct states
> > (ct.new, ct.est etc) and this can be problematic if the inport or
> > outport is peer port.  It is more problematic if the packet was
> > sent to conntrack and committed in the ingress pipeline.  When
> > that packet enters the egress pipeline with outport set to the peer
> > port,  we skip sending the packet to conntrack (which is expected)
> > but ct state values carry over from the ingress state.  If the ct.new
> > flag was set in the ingress pipeline, then the below flows are hit and
> > the packet gets committed in the conntrack zone of the inport logical port.
> >
> > table=4 (ls_out_acl ), priority=1,
> >  match=(ip && (!ct.est || (ct.est && ct_label.blocked == 1))),
> >  action=(reg0[1] = 1; next;)
> > table=7 (ls_out_stateful), priority=100  ,
> >  match=(reg0[1] == 1 && reg0[13] == 0),
> >  action=(ct_commit { ct_label.blocked = 0; }; next;)
> >
> > With this extra commit to the same conntrack zone, sometimes the packet gets
> > dropped or doesn't reach the destination.  It is not very clear how
> > the packet gets misplaced, but avoiding this resolves the issue.
> > And OVN ideally should not do this extra commit.
> >
> > To address this issue, this patch adds the logical flows to skip all
> > the conntrack related stages if the inport or outport is peer logical
> > port.
> >
> > table=5 (ls_in_pre_acl  ), priority=110  ,
> >  match=(ip && inport == "sw0-lr0"),
> >  action=(next(pipeline=ingress,table=18);)
> >
> > table=0 (ls_out_pre_lb  ), priority=110  ,
> >  match=(ip && outport == "sw0-lr0"),
> >  action=(next(pipeline=egress,table=8);)
> >
>
> This is a bit worrying.  We're skipping a lot of stages, some of which
> are not stateful at all, e.g., "ls_in_qos_mark", "ls_in_qos_meter".
>
> Also, I think we would be breaking the scenario in which LB happens on
> traffic received from a logical router.  Simplified setup with a client
> connecting to VIP1:
>
> client (IP1) --> LS1 --> LR (VIP1 in the subnet of LS2) --> LS2 (LB:
> VIP1->IP1)
>
> Without your patch the hairpin stage on LS2 would have performed the
> DNAT and SNAT (for hairpin) properly.  I'm not sure that still works
> with your patch.
>
> Wouldn't it be safer and simpler to change the 110-priority flows in
> stages ls_in/out_pre_acl and ls_in/out_pre_lb to apply action="ct_clear;
> next;" instead?

Hi Dumitru,

Thanks for the comments.

I agree that skipping many pipelines would not be prudent as there are
qos related stages too.

I'll submit another patch which skips commiting to the conntrack if
the packet was not sent to
conntrack in the pre_stateful stage.

we could also do ct_clear in table 38/39 when the packet transitions
from the ingress stage to egress stage.
I'll look into that too.  But I'll do that as a follow up.

Thanks
Numan

>
> > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2060688
> > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2062431
> > Reported-by: Michael Washer 
> > Signed-off-by: Numan Siddique 
> > ---
>
> Regards,
> Dumitru
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [External] Re: [PATCH] dpif-netdev: add an option to assign pmd rxq to all numas

2022-03-10 Thread Kevin Traynor

Hi Jan,

On 09/03/2022 15:48, Jan Scheurich wrote:

Thanks for sharing your experience with it. My fear with the proposal is that
someone turns this on and then tells us performance is worse and/or OVS
assignments/ALB are broken, because it has an impact on their case.

In terms of limiting possible negative effects,
- it can be opt-in and recommended only for phy ports
- could print a warning when it is enabled
- ALB is currently disabled with cross-numa polling (except a limited
case) but it's clear you want to remove that restriction too
- for ALB, a user could increase the improvement threshold to account for any
reassignments triggered by inaccuracies


[Jan] Yes, we want to enable cross-NUMA polling of selected (typically phy) ports in ALB 
"group" mode as an opt-in config option (default off). Based on our 
observations we are not too much concerned with the loss of ALB prediction accuracy but 
increasing the threshold may be a way of taking that into account, if wanted.



There is also some improvements that can be made to the proposed method
when used with group assignment,
- we can prefer local numa where there is no difference between pmd cores.
(e.g. two unused cores available, pick the local numa one)
- we can flatten the list of pmds, so best pmd can be selected. This will remove
issues with RR numa when there are different num of pmd cores or loads per
numa.
- I wrote an RFC that does these two items, I can post when(/if!) consensus is
reached on the broader topic


[Jan] In our alternative version of the current upstream "group" ALB [1] we 
already maintained a flat list of PMDs. So we would support that feature. Using 
NUMA-locality as a tie-breaker makes sense.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384546.html



In summary, it's a trade-off,

With no cross-numa polling (current):
- won't have any impact to OVS assignment or ALB accuracy
- there could be a bottleneck on one numa pmds while other numa pmd cores
are idle and unused

With cross-numa rx pinning (current):
- will have access to pmd cores on all numas
- may require more cycles for some traffic paths
- won't have any impact to OVS assignment or ALB accuracy
- >1 pinned rxqs per core may cause a bottleneck depending on traffic

With cross-numa interface setting (proposed):
- will have access to all pmd cores on all numas (i.e. no unused pmd cores
during highest load)
- will require more cycles for some traffic paths
- will impact on OVS assignment and ALB accuracy

Anything missing above, or is it a reasonable summary?


I think that is a reasonable summary, albeit I would have characterized the 
third option a bit more positively:
- Gives ALB maximum freedom to balance load of PMDs on all NUMA nodes (in the 
likely scenario of uneven VM load on the NUMAs)
- Accepts an increase of cycles on cross-NUMA paths for a better utilization of 
a free PMD cycles
- Mostly suitable for phy ports due to limited cycle increase for cross-NUMA 
polling of phy rx queues
- Could negatively impact the ALB prediction accuracy in certain scenarios



It's the estimation accuracy during the assignments. That might be be 
seen as part of the ALB dry-run, but it could also be during an actual 
reconfigure/reassignment itself. e.g. we think pmdx is the lowest loaded 
pmd and assign another rxq, only to find out a previously assigned rxq 
now requires 30% more cycles after changing numa and it was a bad 
selection to add more rxqs. Now there's overload on that pmdx etc.


OTOH, in some cases you are now also getting to utilize more pmds from 
other numas, so there is more net compute power. That means less risk of 
overload on any pmd, but I'm sure people will still try and push it to 
the limits.


I think we're both on the same page regarding functionality, pros and 
cons etc. It's just the inaccuracy and likelihood of problems occurring 
where we are viewing differently. You are saying you think it's low risk 
as that is your experience, while I am a bit more cautious about it.



We will post a new version of our patch [2] for cross-numa polling on selected 
ports adapted to the current OVS master shortly.

[2] https://mail.openvswitch.org/pipermail/ovs-dev/2021-June/384547.html



I think we should give a bit more time for more eyes and any discussion 
at the high level before progressing too much on the code, but as you 
are talking about it...I mentioned an RFC so I put it up here [0].


I didn't add the user enabling part, that's straightforward, I was just 
working on how adapt the current rxq scheduling for not always selecting 
a numa first (as it was based like this), flattening numa and a local 
numa tiebreaker.


[0] https://github.com/kevintraynor/ovs/commits/crossnuma

thanks,
Kevin.


Thanks, Jan




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 21.12] controller: reconfigure ovs meters for ovn meters

2022-03-10 Thread Mark Michelson
Thanks Lorenzo for this backport. I did a quick sanity check on the 
21.12 branch and it looks good to me. So I Acked and pushed the patch to 
branch-21.12.


On 3/10/22 11:01, Lorenzo Bianconi wrote:

At the moment ovs meters are reconfigured by ovn just when a
meter is allocated or removed while updates for an already
allocated meter are ignored. This issue can be easily verified
with the following reproducer:

$ovn-nbctl meter-add meter0 drop 10 pktps
$ovn-nbctl --log --meter=meter0 acl-add sw0 to-lport 1000 'tcp.dst == 80' drop
$ovn-nbctl --may-exist meter-add meter0 drop 20 pktps
$ovs-ofctl -O OpenFlow15 dump-meters br-int

In order to fix the issue introduce SB_meter node in the incremetal
processing engine and add it as input for lflow_output one.
Sync OVS meters in ofctl_put() to push changes in SB meter/SB meter
bands tables.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1939524
Acked-by: Mark Michelson 
Signed-off-by: Lorenzo Bianconi 
Signed-off-by: Han Zhou 
---
  controller/ofctrl.c | 212 +++-
  controller/ovn-controller.c |  25 -
  lib/extend-table.c  |  14 +++
  lib/extend-table.h  |   4 +
  tests/ovn-performance.at|  17 +++
  tests/ovn.at|  73 +
  tests/system-ovn.at |  17 +++
  7 files changed, 332 insertions(+), 30 deletions(-)

diff --git a/controller/ofctrl.c b/controller/ofctrl.c
index 08fcfed8b..22f80620a 100644
--- a/controller/ofctrl.c
+++ b/controller/ofctrl.c
@@ -335,6 +335,22 @@ static struct ovn_extend_table *groups;
  /* A reference to the meter_table. */
  static struct ovn_extend_table *meters;
  
+/* Installed meter bands. */

+struct meter_band_data {
+int64_t burst_size;
+int64_t rate;
+};
+
+struct meter_band_entry {
+struct meter_band_data *bands;
+size_t n_bands;
+};
+
+static struct shash meter_bands;
+
+static void ofctrl_meter_bands_destroy(void);
+static void ofctrl_meter_bands_clear(void);
+
  /* MFF_* field ID for our Geneve option.  In S_TLV_TABLE_MOD_SENT, this is
   * the option we requested (we don't know whether we obtained it yet).  In
   * S_CLEAR_FLOWS or S_UPDATE_FLOWS, this is really the option we have. */
@@ -372,6 +388,7 @@ ofctrl_init(struct ovn_extend_table *group_table,
  ovn_init_symtab(&symtab);
  groups = group_table;
  meters = meter_table;
+shash_init(&meter_bands);
  }
  
  /* S_NEW, for a new connection.
@@ -615,6 +632,7 @@ run_S_CLEAR_FLOWS(void)
  /* Clear existing meters, to match the state of the switch. */
  if (meters) {
  ovn_extend_table_clear(meters, true);
+ofctrl_meter_bands_clear();
  }
  
  /* All flow updates are irrelevant now. */

@@ -789,6 +807,7 @@ ofctrl_destroy(void)
  rconn_packet_counter_destroy(tx_counter);
  expr_symtab_destroy(&symtab);
  shash_destroy(&symtab);
+ofctrl_meter_bands_destroy();
  }
  
  uint64_t

@@ -1802,26 +1821,14 @@ add_meter_string(struct ovn_extend_table_info 
*m_desired,
  }
  
  static void

-add_meter(struct ovn_extend_table_info *m_desired,
-  const struct sbrec_meter_table *meter_table,
-  struct ovs_list *msgs)
+update_ovs_meter(struct ovn_extend_table_info *entry,
+ const struct sbrec_meter *sb_meter, int cmd,
+ struct ovs_list *msgs)
  {
-const struct sbrec_meter *sb_meter;
-SBREC_METER_TABLE_FOR_EACH (sb_meter, meter_table) {
-if (!strcmp(m_desired->name, sb_meter->name)) {
-break;
-}
-}
-
-if (!sb_meter) {
-static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
-VLOG_ERR_RL(&rl, "could not find meter named \"%s\"", m_desired->name);
-return;
-}
  
  struct ofputil_meter_mod mm;

-mm.command = OFPMC13_ADD;
-mm.meter.meter_id = m_desired->table_id;
+mm.command = cmd;
+mm.meter.meter_id = entry->table_id;
  mm.meter.flags = OFPMF13_STATS;
  
  if (!strcmp(sb_meter->unit, "pktps")) {

@@ -1854,6 +1861,152 @@ add_meter(struct ovn_extend_table_info *m_desired,
  free(mm.meter.bands);
  }
  
+static void

+ofctrl_meter_bands_clear(void)
+{
+struct shash_node *node, *next;
+SHASH_FOR_EACH_SAFE (node, next, &meter_bands) {
+struct meter_band_entry *mb = node->data;
+shash_delete(&meter_bands, node);
+free(mb->bands);
+free(mb);
+}
+}
+
+static void
+ofctrl_meter_bands_destroy(void)
+{
+ofctrl_meter_bands_clear();
+shash_destroy(&meter_bands);
+}
+
+static bool
+ofctrl_meter_bands_is_equal(const struct sbrec_meter *sb_meter,
+struct meter_band_entry *mb)
+{
+if (mb->n_bands != sb_meter->n_bands) {
+return false;
+}
+
+for (int i = 0; i < sb_meter->n_bands; i++) {
+int j;
+for (j = 0; j < mb->n_bands; j++) {
+if (sb_meter->bands[i]->rate == mb->bands[j].rate &&
+sb_meter->bands[i]->burst_size == mb->bands[j].bu

Re: [ovs-dev] [PATCH ovn 21.12] controller: reconfigure ovs meters for ovn meters

2022-03-10 Thread 0-day Robot
Bleep bloop.  Greetings Lorenzo Bianconi, I am a robot and I have tried out 
your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


git-am:
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0001 controller: reconfigure ovs meters for ovn meters
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn 21.12] controller: reconfigure ovs meters for ovn meters

2022-03-10 Thread Lorenzo Bianconi
At the moment ovs meters are reconfigured by ovn just when a
meter is allocated or removed while updates for an already
allocated meter are ignored. This issue can be easily verified
with the following reproducer:

$ovn-nbctl meter-add meter0 drop 10 pktps
$ovn-nbctl --log --meter=meter0 acl-add sw0 to-lport 1000 'tcp.dst == 80' drop
$ovn-nbctl --may-exist meter-add meter0 drop 20 pktps
$ovs-ofctl -O OpenFlow15 dump-meters br-int

In order to fix the issue introduce SB_meter node in the incremetal
processing engine and add it as input for lflow_output one.
Sync OVS meters in ofctl_put() to push changes in SB meter/SB meter
bands tables.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1939524
Acked-by: Mark Michelson 
Signed-off-by: Lorenzo Bianconi 
Signed-off-by: Han Zhou 
---
 controller/ofctrl.c | 212 +++-
 controller/ovn-controller.c |  25 -
 lib/extend-table.c  |  14 +++
 lib/extend-table.h  |   4 +
 tests/ovn-performance.at|  17 +++
 tests/ovn.at|  73 +
 tests/system-ovn.at |  17 +++
 7 files changed, 332 insertions(+), 30 deletions(-)

diff --git a/controller/ofctrl.c b/controller/ofctrl.c
index 08fcfed8b..22f80620a 100644
--- a/controller/ofctrl.c
+++ b/controller/ofctrl.c
@@ -335,6 +335,22 @@ static struct ovn_extend_table *groups;
 /* A reference to the meter_table. */
 static struct ovn_extend_table *meters;
 
+/* Installed meter bands. */
+struct meter_band_data {
+int64_t burst_size;
+int64_t rate;
+};
+
+struct meter_band_entry {
+struct meter_band_data *bands;
+size_t n_bands;
+};
+
+static struct shash meter_bands;
+
+static void ofctrl_meter_bands_destroy(void);
+static void ofctrl_meter_bands_clear(void);
+
 /* MFF_* field ID for our Geneve option.  In S_TLV_TABLE_MOD_SENT, this is
  * the option we requested (we don't know whether we obtained it yet).  In
  * S_CLEAR_FLOWS or S_UPDATE_FLOWS, this is really the option we have. */
@@ -372,6 +388,7 @@ ofctrl_init(struct ovn_extend_table *group_table,
 ovn_init_symtab(&symtab);
 groups = group_table;
 meters = meter_table;
+shash_init(&meter_bands);
 }
 
 /* S_NEW, for a new connection.
@@ -615,6 +632,7 @@ run_S_CLEAR_FLOWS(void)
 /* Clear existing meters, to match the state of the switch. */
 if (meters) {
 ovn_extend_table_clear(meters, true);
+ofctrl_meter_bands_clear();
 }
 
 /* All flow updates are irrelevant now. */
@@ -789,6 +807,7 @@ ofctrl_destroy(void)
 rconn_packet_counter_destroy(tx_counter);
 expr_symtab_destroy(&symtab);
 shash_destroy(&symtab);
+ofctrl_meter_bands_destroy();
 }
 
 uint64_t
@@ -1802,26 +1821,14 @@ add_meter_string(struct ovn_extend_table_info 
*m_desired,
 }
 
 static void
-add_meter(struct ovn_extend_table_info *m_desired,
-  const struct sbrec_meter_table *meter_table,
-  struct ovs_list *msgs)
+update_ovs_meter(struct ovn_extend_table_info *entry,
+ const struct sbrec_meter *sb_meter, int cmd,
+ struct ovs_list *msgs)
 {
-const struct sbrec_meter *sb_meter;
-SBREC_METER_TABLE_FOR_EACH (sb_meter, meter_table) {
-if (!strcmp(m_desired->name, sb_meter->name)) {
-break;
-}
-}
-
-if (!sb_meter) {
-static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
-VLOG_ERR_RL(&rl, "could not find meter named \"%s\"", m_desired->name);
-return;
-}
 
 struct ofputil_meter_mod mm;
-mm.command = OFPMC13_ADD;
-mm.meter.meter_id = m_desired->table_id;
+mm.command = cmd;
+mm.meter.meter_id = entry->table_id;
 mm.meter.flags = OFPMF13_STATS;
 
 if (!strcmp(sb_meter->unit, "pktps")) {
@@ -1854,6 +1861,152 @@ add_meter(struct ovn_extend_table_info *m_desired,
 free(mm.meter.bands);
 }
 
+static void
+ofctrl_meter_bands_clear(void)
+{
+struct shash_node *node, *next;
+SHASH_FOR_EACH_SAFE (node, next, &meter_bands) {
+struct meter_band_entry *mb = node->data;
+shash_delete(&meter_bands, node);
+free(mb->bands);
+free(mb);
+}
+}
+
+static void
+ofctrl_meter_bands_destroy(void)
+{
+ofctrl_meter_bands_clear();
+shash_destroy(&meter_bands);
+}
+
+static bool
+ofctrl_meter_bands_is_equal(const struct sbrec_meter *sb_meter,
+struct meter_band_entry *mb)
+{
+if (mb->n_bands != sb_meter->n_bands) {
+return false;
+}
+
+for (int i = 0; i < sb_meter->n_bands; i++) {
+int j;
+for (j = 0; j < mb->n_bands; j++) {
+if (sb_meter->bands[i]->rate == mb->bands[j].rate &&
+sb_meter->bands[i]->burst_size == mb->bands[j].burst_size) {
+break;
+}
+}
+if (j == mb->n_bands) {
+return false;
+}
+}
+return true;
+}
+
+static void
+ofctrl_meter_bands_alloc(const struct sbrec_meter *sb_meter,
+  

Re: [ovs-dev] [PATCH] Fix Apache license notice.

2022-03-10 Thread David Marchand
On Thu, Mar 10, 2022 at 1:51 PM Ilya Maximets  wrote:
>
> On 3/10/22 13:12, David Marchand wrote:
> > The hacking flake8 plugin generates a lot of noise in GHA logs about an
> > issue on the Apache license contained in OVS python files:
> > https://github.com/openvswitch/ovs/runs/5453426047?#step:11:1519
> >
> > Those prints seem like a bug in the hacking plugin [1] as the check
> > about the license is not enabled in OVS checks.
> >
> > Though, when comparing with https://www.apache.org/licenses/LICENSE-2.0,
> > OVS files have an additional ':'.
> >
> > Align all files in OVS with the official notice and enable the check to
> > avoid reintroducing a wrong notice.
> >
> > 1: 
> > https://github.com/openstack/hacking/blob/4.1.0/hacking/checks/comments.py#L168
> >
> > Signed-off-by: David Marchand 
> > ---
>
> Hi, David.
>
> OVS doesn't enable H103 check, so I think that the problem is in
> the hacking plugin itself.  It prints the license comparison
> unconditionally even if the check wasn't requested.
>
> I actually prepared a fix for hacking a couple of weeks ago:
>   https://review.opendev.org/c/openstack/hacking/+/830537

Thanks, that saves me the time to fix it :-).

> Had no responses yet, though... :(

I reviewed your change, let's see if it helps...


-- 
David Marchand

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] Fix Apache license notice.

2022-03-10 Thread Ilya Maximets
On 3/10/22 13:12, David Marchand wrote:
> The hacking flake8 plugin generates a lot of noise in GHA logs about an
> issue on the Apache license contained in OVS python files:
> https://github.com/openvswitch/ovs/runs/5453426047?#step:11:1519
> 
> Those prints seem like a bug in the hacking plugin [1] as the check
> about the license is not enabled in OVS checks.
> 
> Though, when comparing with https://www.apache.org/licenses/LICENSE-2.0,
> OVS files have an additional ':'.
> 
> Align all files in OVS with the official notice and enable the check to
> avoid reintroducing a wrong notice.
> 
> 1: 
> https://github.com/openstack/hacking/blob/4.1.0/hacking/checks/comments.py#L168
> 
> Signed-off-by: David Marchand 
> ---

Hi, David.

OVS doesn't enable H103 check, so I think that the problem is in
the hacking plugin itself.  It prints the license comparison
unconditionally even if the check wasn't requested.

I actually prepared a fix for hacking a couple of weeks ago:
  https://review.opendev.org/c/openstack/hacking/+/830537

Had no responses yet, though... :(

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [[PATCH RFC] 13/17] Enable IP checksum offloading by default.

2022-03-10 Thread Flavio Leitner
On Mon, Jan 24, 2022 at 02:21:35PM -0500, Mike Pattrick wrote:
> On Tue, Dec 7, 2021 at 11:54 AM Flavio Leitner  wrote:
> >
> > The netdev receiving packets is supposed to provide the flags
> > indicating if the IP csum was verified and it is OK or BAD,
> > otherwise the stack will check when appropriate by software.
> >
> > If the packet comes with good checksum, then postpone the
> > checksum calculation to the egress device if needed.
> >
> > When encapsulate a packet with that flag, set the checksum
> > of the inner IP header since that is not yet supported.
> >
> > Calculate the IP csum when the packet is going to be sent over
> > a device that doesn't support the feature.
> >
> > Linux devices don't support IP csum offload alone, so the
> > support is not enabled.
> >
> > Signed-off-by: Flavio Leitner 
> > ---
> >  lib/conntrack.c | 12 ++---
> >  lib/dp-packet.c | 12 +
> >  lib/dp-packet.h | 63 ---
> >  lib/dpif.h  |  2 +-
> >  lib/flow.c  | 16 --
> >  lib/ipf.c   |  9 ++--
> >  lib/netdev-dpdk.c   | 78 ++--
> >  lib/netdev-dummy.c  | 21 
> >  lib/netdev-native-tnl.c | 19 +--
> >  lib/netdev.c| 22 
> >  lib/odp-execute.c   | 21 ++--
> >  lib/packets.c   | 34 ++---
> >  ofproto/ofproto-dpif-upcall.c   | 14 +++--
> >  tests/automake.mk   |  1 +
> >  tests/system-userspace-offload.at   | 79 +
> >  tests/system-userspace-testsuite.at |  1 +
> >  16 files changed, 322 insertions(+), 82 deletions(-)
> >  create mode 100644 tests/system-userspace-offload.at
> >
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index 2392a2ea4..5b4ca4dfc 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -2089,16 +2089,12 @@ conn_key_extract(struct conntrack *ct, struct 
> > dp_packet *pkt, ovs_be16 dl_type,
> >  ctx->key.dl_type = dl_type;
> >
> >  if (ctx->key.dl_type == htons(ETH_TYPE_IP)) {
> > -bool hwol_bad_l3_csum = dp_packet_ol_ip_csum_bad(pkt);
> > -if (hwol_bad_l3_csum) {
> > +if (dp_packet_ol_ip_csum_bad(pkt)) {
> >  ok = false;
> >  COVERAGE_INC(conntrack_l3csum_err);
> >  } else {
> > -bool hwol_good_l3_csum = dp_packet_ol_ip_csum_good(pkt)
> > - || dp_packet_ol_tx_ipv4(pkt);
> > -/* Validate the checksum only when hwol is not supported. */
> >  ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), 
> > NULL,
> > - !hwol_good_l3_csum);
> > + !dp_packet_ol_ip_csum_good(pkt));
> >  }
> >  } else if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> >  ok = extract_l3_ipv6(&ctx->key, l3, dp_packet_l3_size(pkt), NULL);
> > @@ -3402,7 +3398,9 @@ handle_ftp_ctl(struct conntrack *ct, const struct 
> > conn_lookup_ctx *ctx,
> >  }
> >  if (seq_skew) {
> >  ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> > -if (!dp_packet_ol_tx_ipv4(pkt)) {
> > +if (dp_packet_ol_tx_ip_csum(pkt)) {
> > +dp_packet_ol_reset_ip_csum_good(pkt);
> > +} else {
> >  l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> >  l3_hdr->ip_tot_len,
> >  htons(ip_len));
> 
> This is more of a general comment for the whole patch series, but I
> see that a lot of the diffs use the motif:
> 
> if (dp_packet_ol_tx_ip_csum(pkt)) {
> dp_packet_ol_reset_ip_csum_good(pkt);
> } else {
> recalc_csumXX()
> }
> 
> Would it make sense instead to simply flag for non-offload tainted
> checksum, and then only one call to csum() on packet egress?

That's a good point. I see that in most cases it recalculates
only what has changed, so it is supposed to be faster because
1) cache is hot and 2) fewer operations. If we leave to the
egress port, then we need to calculate full checksum and the
headers may not be in the cache.

Therefore, to avoid regressions in the non offloaded case,
I left the original behavior. Maybe my assumption doesn't
make sense.

fbl


> 
> -M
> 
> > diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> > index a4ca5a052..369f3561e 100644
> > --- a/lib/dp-packet.c
> > +++ b/lib/dp-packet.c
> > @@ -21,6 +21,7 @@
> >  #include "dp-packet.h"
> >  #include "netdev-afxdp.h"
> >  #include "netdev-dpdk.h"
> > +#include "netdev-provider.h"
> >  #include "openvswitch/dynamic-string.h"
> >  #include "util.h"
> >
> > @@ -506,3 +507,14 @@ dp_packet_resize_l2(struct dp_packet *p, int increment)
> >

Re: [ovs-dev] [[PATCH RFC] 06/17] dp-packet: Use p for packet and b for batch.

2022-03-10 Thread Flavio Leitner
On Mon, Jan 24, 2022 at 01:39:57PM -0500, Mike Pattrick wrote:
> On Tue, Dec 7, 2021 at 11:53 AM Flavio Leitner  wrote:
> >
> > Currently 'p' and 'b' and used for packets, so use
> > a convention that struct dp_packet is 'p' and
> > struct dp_packet_batch is 'b'.
> >
> > Some comments needed new formatting to not pass the
> > 80 column.
> >
> > Some variables were using 'p' or 'b' were renamed
> > as well.
> >
> > There should be no functional change with this patch.
> >
> > Signed-off-by: Flavio Leitner 
> > ---
> 
> I think this patch goes a long way to improve the readability of Would
> it also make sense to change b to p in lib/packets.c and
> lib/netdev-linux.c as well?

I think 'b' meant buffer before, but IMHO 'b' for batches
and 'p' for packets make more sense and then we should 
keep the consistency in all files, unless someone else
has a better idea.

-- 
fbl
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] ovn-northd: Skip conntrack related stages for router ports in switch pipeline.

2022-03-10 Thread Dumitru Ceara
On 3/10/22 04:04, num...@ovn.org wrote:
> From: Numan Siddique 
> 

Hi Numan,

> Presently, if the inport or outport is a peer port (of router port),
> then we skip sending the packet to conntrack by not setting the
> reg0[0]/reg0[1] bits.  But the packet still goes through the
> stages - ls_in_acl_hint, ls_in_acl and ls_in_stateful in the ingress
> pipeline and similarly ls_out_acl_hint, ls_out_acl and ls_out_stateful
> in the egress pipeline of the logical switch.
> 
> In these mentioned stages, the logical flows match on the ct states
> (ct.new, ct.est etc) and this can be problematic if the inport or
> outport is peer port.  It is more problematic if the packet was
> sent to conntrack and committed in the ingress pipeline.  When
> that packet enters the egress pipeline with outport set to the peer
> port,  we skip sending the packet to conntrack (which is expected)
> but ct state values carry over from the ingress state.  If the ct.new
> flag was set in the ingress pipeline, then the below flows are hit and
> the packet gets committed in the conntrack zone of the inport logical port.
> 
> table=4 (ls_out_acl ), priority=1,
>  match=(ip && (!ct.est || (ct.est && ct_label.blocked == 1))),
>  action=(reg0[1] = 1; next;)
> table=7 (ls_out_stateful), priority=100  ,
>  match=(reg0[1] == 1 && reg0[13] == 0),
>  action=(ct_commit { ct_label.blocked = 0; }; next;)
> 
> With this extra commit to the same conntrack zone, sometimes the packet gets
> dropped or doesn't reach the destination.  It is not very clear how
> the packet gets misplaced, but avoiding this resolves the issue.
> And OVN ideally should not do this extra commit.
> 
> To address this issue, this patch adds the logical flows to skip all
> the conntrack related stages if the inport or outport is peer logical
> port.
> 
> table=5 (ls_in_pre_acl  ), priority=110  ,
>  match=(ip && inport == "sw0-lr0"),
>  action=(next(pipeline=ingress,table=18);)
> 
> table=0 (ls_out_pre_lb  ), priority=110  ,
>  match=(ip && outport == "sw0-lr0"),
>  action=(next(pipeline=egress,table=8);)
> 

This is a bit worrying.  We're skipping a lot of stages, some of which
are not stateful at all, e.g., "ls_in_qos_mark", "ls_in_qos_meter".

Also, I think we would be breaking the scenario in which LB happens on
traffic received from a logical router.  Simplified setup with a client
connecting to VIP1:

client (IP1) --> LS1 --> LR (VIP1 in the subnet of LS2) --> LS2 (LB:
VIP1->IP1)

Without your patch the hairpin stage on LS2 would have performed the
DNAT and SNAT (for hairpin) properly.  I'm not sure that still works
with your patch.

Wouldn't it be safer and simpler to change the 110-priority flows in
stages ls_in/out_pre_acl and ls_in/out_pre_lb to apply action="ct_clear;
next;" instead?

> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2060688
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=2062431
> Reported-by: Michael Washer 
> Signed-off-by: Numan Siddique 
> ---

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v2] net: openvswitch: fix uAPI incompatibility with existing user space

2022-03-10 Thread Nicolas Dichtel

Le 09/03/2022 à 23:20, Ilya Maximets a écrit :
> Few years ago OVS user space made a strange choice in the commit [1]
> to define types only valid for the user space inside the copy of a
> kernel uAPI header.  '#ifndef __KERNEL__' and another attribute was
> added later.
> 
> This leads to the inevitable clash between user space and kernel types
> when the kernel uAPI is extended.  The issue was unveiled with the
> addition of a new type for IPv6 extension header in kernel uAPI.
> 
> When kernel provides the OVS_KEY_ATTR_IPV6_EXTHDRS attribute to the
> older user space application, application tries to parse it as
> OVS_KEY_ATTR_PACKET_TYPE and discards the whole netlink message as
> malformed.  Since OVS_KEY_ATTR_IPV6_EXTHDRS is supplied along with
> every IPv6 packet that goes to the user space, IPv6 support is fully
> broken.
> 
> Fixing that by bringing these user space attributes to the kernel
> uAPI to avoid the clash.  Strictly speaking this is not the problem
> of the kernel uAPI, but changing it is the only way to avoid breakage
> of the older user space applications at this point.
> 
> These 2 types are explicitly rejected now since they should not be
> passed to the kernel.  Additionally, OVS_KEY_ATTR_TUNNEL_INFO moved
> out from the '#ifdef __KERNEL__' as there is no good reason to hide
> it from the userspace.  And it's also explicitly rejected now, because
> it's for in-kernel use only.
> 
> Comments with warnings were added to avoid the problem coming back.
> 
> (1 << type) converted to (1ULL << type) to avoid integer overflow on
> OVS_KEY_ATTR_IPV6_EXTHDRS, since it equals 32 now.
> 
>  [1] beb75a40fdc2 ("userspace: Switching of L3 packets in L2 pipeline")
> 
> Fixes: 28a3f0601727 ("net: openvswitch: IPv6: Add IPv6 extension header 
> support")
> Link: 
> https://lore.kernel.org/netdev/3adf00c7-fe65-3ef4-b6d7-6d8a0cad8...@nvidia.com
> Link: 
> https://github.com/openvswitch/ovs/commit/beb75a40fdc295bfd6521b0068b4cd12f6de507c
> Reported-by: Roi Dayan 
> Signed-off-by: Ilya Maximets 
Thanks for finally doing this. I also suggest it months ago:
https://lore.kernel.org/lkml/a4894aef-b82a-8224-611d-07be229f5...@6wind.com/

Acked-by: Nicolas Dichtel 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev