Re: [ovs-dev] [PATCH v4] dpdk: Allow retaining CAP_SYS_RAWIO privileges

2023-03-16 Thread Gaetan Rivet via dev
> From: Aaron Conole 
> Date: Thursday, 16 March 2023 at 13:00
> To: d...@openvswitch.org 
> Cc: Robin Jarry , Gaetan Rivet , Ilya 
> Maximets , Eli Britstein , Maxime 
> Coquelin , Jason Gunthorpe , 
> Majd Dibbiny , David Marchand <march...@redhat.com>, Simon Horman , Flavio 
> Leitner 
> Subject: [PATCH v4] dpdk: Allow retaining CAP_SYS_RAWIO privileges
>
> Open vSwitch generally tries to let the underlying operating system
> manage the low-level details of hardware, for example DMA mapping,
> bus arbitration, etc.  However, when using DPDK, the underlying
> operating system yields control of many of these details to userspace
> for management.
>
> In the case of some DPDK port drivers, configuring rte_flow or even
> allocating resources may require access to iopl/ioperm calls, which
> are guarded by the CAP_SYS_RAWIO privilege on Linux systems.  These
> calls are dangerous, and can allow a process to completely compromise
> a system.  However, they are needed in the case of some userspace
> driver code which manages the hardware (for example, the mlx
> implementation of backend support for rte_flow).
>
> Here, we create an opt-in flag passed to the command line to allow
> this access.  We need to do this before ever accessing the database,
> because we want to drop all privileges asap, and cannot wait for
> a connection to the database to be established and functional before
> dropping.  There may be distribution specific ways to do capability
> management as well (using for example, systemd), but they are not
> as universal to the vswitchd as a flag.
>
> Reviewed-by: Simon Horman 
> Signed-off-by: Aaron Conole 
> ---

Thank you for handling this, Aaron.
Acked-by: Gaetan Rivet 
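
For context, the capability handling being discussed boils down to something
like the following libcap-ng sketch: keep CAP_SYS_RAWIO across the user switch
and drop everything else. This is an illustrative sketch only (the helper name
and the unconditional retention are made up here), not the actual patch hunk,
which is not quoted above.

/* Illustrative sketch only: retain CAP_SYS_RAWIO while switching to an
 * unprivileged user and drop every other capability.  The real patch gates
 * this behind the opt-in command-line flag described above. */
#include <cap-ng.h>
#include <linux/capability.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

static void
switch_user_keep_rawio(uid_t uid, gid_t gid)
{
    /* Start from an empty capability set, then add back only CAP_SYS_RAWIO. */
    capng_clear(CAPNG_SELECT_BOTH);
    capng_update(CAPNG_ADD, CAPNG_EFFECTIVE | CAPNG_PERMITTED, CAP_SYS_RAWIO);

    /* Change uid/gid while keeping the selected capabilities. */
    if (capng_change_id(uid, gid, CAPNG_DROP_SUPP_GRP | CAPNG_CLEAR_BOUNDING)) {
        fprintf(stderr, "failed to switch user while keeping CAP_SYS_RAWIO\n");
        exit(EXIT_FAILURE);
    }
}

In vswitchd the uid/gid would come from the usual --user handling, and the
retention would only happen when the new flag is set.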


Re: [ovs-dev] [RFC] dpdk: Allow retaining cap_sys_rawio privileges

2023-02-23 Thread Gaetan Rivet via dev
> -----Original Message-----
> From: Aaron Conole <acon...@redhat.com>
> Date: Wednesday 22 February 2023 at 18:11
> To: d...@openvswitch.org
> Cc: Eli Britstein <el...@nvidia.com>, Gaetan Rivet <gaet...@nvidia.com>,
> Ilya Maximets <i.maxim...@ovn.org>, Maxime Coquelin <maxime.coque...@redhat.com>,
> Jason Gunthorpe <j...@nvidia.com>, Majd Dibbiny <m...@nvidia.com>,
> David Marchand <david.march...@redhat.com>
> Subject: Re: [ovs-dev] [RFC] dpdk: Allow retaining cap_sys_rawio privileges
>
> Apologies - I mis-typed Gaetan's email when I entered it into my mail
> file. CC'd correctly on this email (but I can resend the patch, if you
> think it is better).
>
> Aaron Conole mailto:acon...@redhat.com>> writes:
>
> > Open vSwitch generally tries to let the underlying operating system
> > manage the low-level details of hardware, for example DMA mapping,
> > bus arbitration, etc. However, when using DPDK, the underlying
> > operating system yields control of many of these details to userspace
> > for management.
> >
> > In the case of some DPDK port drivers, configuring rte_flow or even
> > allocating resources may require access to iopl/ioperm calls, which
> > are guarded by the CAP_SYS_RAWIO privilege on Linux systems. These
> > calls are dangerous, and can allow a process to completely compromise
> > a system. However, they are needed in the case of some userspace
> > driver code which manages the hardware (for example, the mlx
> > implementation of backend support for rte_flow).
> >
> > Here, we create an opt-in flag passed to the command line to allow
> > this access. We need to do this before ever accessing the database,
> > because we want to drop all privileges asap, and cannot wait for
> > a connection to the database to be established and functional before
> > dropping. There may be distribution specific ways to do capability
> > management as well (using for example, systemd), but they are not
> > as universal to the vswitchd as a flag.
> >
> > Signed-off-by: Aaron Conole mailto:acon...@redhat.com>>
> > ---

Hello Aaron,

Thank you for proposing this change.

If users want to use mlx5 ports with OVS without being root, this capability
will be required. Adding a vswitchd option to enable it seems the simplest way
to offer some control.

If vendor-specific logic were allowed, I could add a function to detect Mellanox
ports and enable this option in that case. Otherwise we can document as much as
possible, but hopefully the errors will be made clear on the DPDK side, because
it will be hard to explain those errors without vendor-specific code.

Regarding the implementation, I had a few comments:
> @@ -877,11 +890,11 @@ daemon_become_new_user__(bool access_datapath)
>   * However, there in case the user switch needs to be done
>   * before daemonize_start(), the following API can be used.  */
>  void
> -daemon_become_new_user(bool access_datapath)
> +daemon_become_new_user(bool access_datapath, bool access_hardware_ports)
>  {
>  assert_single_threaded();
>  if (switch_user) {
> -daemon_become_new_user__(access_datapath);
> +daemon_become_new_user__(access_datapath, access_hardware_ports);
>  /* daemonize_start() should not switch user again. */
>  switch_user = false;
>  }

Grepping for daemon_become_new_user, I see the following that might need a change:
lib/daemon-windows.c:529:daemon_become_new_user(bool access_datapath OVS_UNUSED)
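
Presumably the Windows stub would just grow the extra parameter and keep
ignoring it, something like the following (hypothetical sketch, not from the
posted patch; OVS_UNUSED is OVS's unused-argument macro from compiler.h):

/* lib/daemon-windows.c -- hypothetical counterpart of the POSIX signature
 * change.  There is no privilege switch to perform on Windows, so both
 * arguments are ignored. */
void
daemon_become_new_user(bool access_datapath OVS_UNUSED,
                       bool access_hardware_ports OVS_UNUSED)
{
    /* Nothing to do. */
}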

> diff --git a/vswitchd/ovs-vswitchd.c b/vswitchd/ovs-vswitchd.c
> index 407bfc60e..f62d1ad75 100644
> --- a/vswitchd/ovs-vswitchd.c
> +++ b/vswitchd/ovs-vswitchd.c
> @@ -60,6 +60,9 @@ VLOG_DEFINE_THIS_MODULE(vswitchd);
>   * the kernel from paging any of its memory to disk. */
>  static bool want_mlockall;
>
> +/* --hw-access: If set, retains CAP_SYS_RAWIO privileges.  */
> +static bool hw_access;
> +
>  static unixctl_cb_func ovs_vswitchd_exit;
>
>  static char *parse_options(int argc, char *argv[], char **unixctl_path);
> @@ -89,7 +92,7 @@ main(int argc, char *argv[])
>  remote = parse_options(argc, argv, &unixctl_path);
>  fatal_ignore_sigpipe();
>
> -daemonize_start(true);
> +daemonize_start(true, true);
 
Here I think it should be daemonize_start(true, hw_access);
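
To make the wiring concrete, here is a small standalone sketch of the intended
flow: parse the opt-in flag before anything privileged happens, then forward it
to the daemonize call. Only the --hw-access option name comes from the quoted
patch; the function names and the main() harness here are illustrative.

/* Standalone sketch: parse an opt-in --hw-access flag early, then pass it on.
 * Illustrative only; vswitchd integrates this into its existing
 * parse_options() and daemonize_start() instead. */
#include <getopt.h>
#include <stdbool.h>
#include <stdio.h>

static bool hw_access;   /* True if CAP_SYS_RAWIO should be retained. */

static void
parse_hw_access(int argc, char *argv[])
{
    static const struct option longopts[] = {
        {"hw-access", no_argument, NULL, 'H'},
        {NULL, 0, NULL, 0},
    };
    int c;

    while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
        if (c == 'H') {
            hw_access = true;
        }
    }
}

int
main(int argc, char *argv[])
{
    parse_hw_access(argc, argv);
    printf("retain CAP_SYS_RAWIO: %s\n", hw_access ? "yes" : "no");
    /* In vswitchd this would then be forwarded as
     * daemonize_start(true, hw_access); as suggested above. */
    return 0;
}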

Best regards,



Re: [ovs-dev] [RFC] dpdk: Allow retaining cap_sys_rawio privileges

2023-02-23 Thread Gaetan Rivet via dev
> -----Original Message-----
> From: Robin Jarry <rja...@redhat.com>
> Date: Thursday 23 February 2023 at 22:43
> To: Gaetan Rivet <gaet...@nvidia.com>, Aaron Conole <acon...@redhat.com>
> Cc: d...@openvswitch.org, Eli Britstein <el...@nvidia.com>,
> Ilya Maximets <i.maxim...@ovn.org>, Maxime Coquelin <maxime.coque...@redhat.com>,
> Jason Gunthorpe <j...@nvidia.com>, Majd Dibbiny <m...@nvidia.com>,
> David Marchand <david.march...@redhat.com>, Gaetan Rivet <grive@u256.net>,
> Eelco Chaudron <echau...@redhat.com>
> Subject: Re: [ovs-dev] [RFC] dpdk: Allow retaining cap_sys_rawio privileges
>
>
> Salut Gaëtan,
>
>
> Gaetan Rivet, Feb 23, 2023 at 22:33:
> > I've looked at your patch Robin and the offloads you insert in
> > dpdk_cp_prot_add_flow use the following:
> >
> > const struct rte_flow_attr attr = { .ingress = 1 };
> >
> > implicitly setting transfer and group to 0. If either of those had
> > been non-zero instead, cap_sys_rawio would be required.
>
>
> Oh I was not aware that this would change anything. Is there some
> document/code snippet/anything that explains why is that so? Is that
> specific to the mlx5 driver?
>
>
> Thanks!

You can find some scarce info there: 
https://doc.dpdk.org/guides/platform/mlx5.html#linux-environment
Check out section 5.5.1.5. "Run as Non-Root".

This doc is incomplete, which is one of the root causes of these threads.



Re: [ovs-dev] [RFC] dpdk: Allow retaining cap_sys_rawio privileges

2023-02-23 Thread Gaetan Rivet via dev
> -----Original Message-----
> From: Robin Jarry <rja...@redhat.com>
> Date: Thursday 23 February 2023 at 22:14
> To: Aaron Conole <acon...@redhat.com>
> Cc: d...@openvswitch.org, Eli Britstein <el...@nvidia.com>, Gaetan Rivet <gaet...@nvidia.com>,
> Ilya Maximets <i.maxim...@ovn.org>, Maxime Coquelin <maxime.coque...@redhat.com>,
> Jason Gunthorpe <j...@nvidia.com>, Majd Dibbiny <m...@nvidia.com>,
> David Marchand <david.march...@redhat.com>, Gaetan Rivet <gr...@u256.net>,
> Eelco Chaudron <echau...@redhat.com>
> Subject: Re: [ovs-dev] [RFC] dpdk: Allow retaining cap_sys_rawio privileges
>
>
> Aaron Conole, Feb 23, 2023 at 22:09:
> > Thanks for taking a look. You're saying that you tested without this
> > patch applied, yes? That could be. I only know of one hardware which
> > requires CAP_SYS_RAWIO for rte_flow to function.
>
>
> Yes that is correct, I tested *without* this patch applied and with
> a non-root user (ovs-vswitchd linked with libcap-ng).
>
>
> ovs-ctl --ovs-user="openvswitch:hugetlbfs" start
>
>
> The basic RTE flow rules (matching on the ether type field and redirecting
> to a specific queue) were created without errors with both NICs
> I had available (Intel X710 and Mellanox ConnectX-5 Ex).
>
>
> cp-protection: redirected lacp traffic to rx queue 1
> cp-protection: redirected other traffic to rx queue 0

Hello,

I've looked at your patch Robin, and the offloads you insert in
dpdk_cp_prot_add_flow use the following:

const struct rte_flow_attr attr = { .ingress = 1 };

implicitly setting transfer and group to 0. If either of those had been
non-zero instead, cap_sys_rawio would be required.
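
To make the distinction concrete, here is a sketch (not code from either patch;
the privilege requirement noted in the comments is the mlx5 behaviour described
in this thread):

/* Sketch contrasting the two attribute shapes discussed above. */
#include <rte_byteorder.h>
#include <rte_flow.h>

/* Ingress-only, group 0, no transfer: per this thread, mlx5 handles this
 * without CAP_SYS_RAWIO.  This is the shape cp-protection uses, e.g. to
 * redirect LACP (ether type 0x8809) to a dedicated queue. */
static const struct rte_flow_attr attr_ingress_only = {
    .ingress = 1,
};

/* Non-zero group with transfer set: per this thread, mlx5 needs
 * CAP_SYS_RAWIO for rules using this shape.  Shown only for contrast. */
static const struct rte_flow_attr attr_transfer = {
    .group = 1,
    .ingress = 1,
    .transfer = 1,
};

static struct rte_flow *
add_lacp_redirect_rule(uint16_t port_id, uint16_t rx_queue,
                       struct rte_flow_error *error)
{
    const struct rte_flow_item_eth eth_spec = {
        .hdr.ether_type = RTE_BE16(0x8809),    /* Slow protocols (LACP). */
    };
    const struct rte_flow_item_eth eth_mask = {
        .hdr.ether_type = RTE_BE16(0xffff),
    };
    const struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH, .spec = &eth_spec, .mask = &eth_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    const struct rte_flow_action_queue queue = { .index = rx_queue };
    const struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(port_id, &attr_ingress_only, pattern, actions, error);
}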

Otherwise, thank you very much Aaron for your patch. I was reading it and will
comment on it directly.

Best regards,
Gaetan 



Re: [ovs-dev] [PATCH 1/1] daemon-unix: Support OVS-DPDK HW offloads for non-root user

2023-02-09 Thread Gaetan Rivet via dev
>-----Original Message-----
>From: dev <ovs-dev-boun...@openvswitch.org> on behalf of Eelco Chaudron <echau...@redhat.com>
>Date: Monday 23 January 2023 at 10:44
>To: Gaetan Rivet <gaet...@nvidia.com>
>Cc: ovs dev <d...@openvswitch.org>, Eli Britstein <el...@nvidia.com>,
>Ilya Maximets <i.maxim...@ovn.org>, Maxime Coquelin <maxime.coque...@redhat.com>,
>Jason Gunthorpe <j...@nvidia.com>, Majd Dibbiny <m...@nvidia.com>,
>David Marchand <david.march...@redhat.com>
>Subject: Re: [ovs-dev] [PATCH 1/1] daemon-unix: Support OVS-DPDK HW offloads
>for non-root user
>
>On 22 Mar 2021, at 15:21, David Marchand wrote:
>
>
>> Hello Gaëtan,
>>
>> On Fri, Mar 19, 2021 at 5:59 PM Gaetan Rivet <gaet...@nvidia.com> wrote:
>>> Our rte_flow implementation uses ICM mappings to program our hardware,
>>> which requires super privileged access. We are looking into ways to avoid 
>>> it.
>>
>> Ok, thanks for looking into it.
>
>
>Was any progress made on this? Or was the conclusion that this is the only way?
>
>

Hello Eelco,

I'll start by clarifying something: this issue is two-fold, even though there
is a single root cause. The previous email thread is about using the rte_flow
API with the mlx5 PMD, while the bug you described is about lack of TX on
ports with no offloads inserted by the user.

On the matter of the above rte_flow requirements: it was discussed, and the
conclusion was that nothing could be done with the current offload
architecture. A patch was written to improve the logs and the documentation,
but I see that it didn't make it to upstream DPDK. I have brought it to the
attention of the DPDK team and it will be submitted.

>>>
>>> In the meantime, we failed to properly communicate this need in the 
>>> rte_flow API.
>>> We will improve the documentation and the error path in DPDK.
>>
>> Without this capa, mlx5 rte_flow full hw offloading errors with logs like:
>> 2021-03-22T14:12:40.274Z|1|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>> rte_flow creation failed: 1 ((null)).
>> 2021-03-22T14:12:40.274Z|2|netdev_offload_dpdk(dp_netdev_flow_9)|WARN|dpdk0-rep0:
>> Failed flow: flow create 3 ingress priority 0 group 0 transfer
>> pattern eth src is 6a:20:8f:82:52:49 dst is 0c:42:a1:00:a8:7c type is
>> 0x0800 / ipv4 / end actions count / port_id original 0 id 2 / end
>>
>> First log is useless.
>> This is more bugfixing than enhancement.
>> Though logs do not need to tell the full story, they can point at the
>> mlx5 pmd documentation where the full explanation is.
>
>
>
>
>I was running into this issue also and spent a decent amount of time trying to 
>figure out what was going on.
>I did not have HW offload enabled yet, but just the basic VF/port representer 
>configuration and no error messages or packets were arriving.
>It would be good to get some logging indicating the configuration/system was 
>not valid. All I got was silent packet drops, but counters were incremented :(

I was able to reproduce this issue. The DPDK team was adamant that it should
have resulted in an error log, but none can be seen. The absence of a log is
a bug; however, the behavior itself is intended.

To give some more details, port probe should work without SYS_RAWIO, but start
should fail. On start, the PMD installs hardware rules, using the same
underlying API as rte_flow. This makes SYS_RAWIO a requirement for even basic
port functions. This will remain the case in the future.

Similarly, this issue has been brought to the attention of the DPDK team, and
they will make sure the user gets an error log in that case, as well as
clarifying the mlx5 doc.

Thanks for reporting this!

Gaetan



[ovs-dev] [PATCH v4 5/5] conntrack: Use an atomic conn expiration value

2022-03-25 Thread Gaetan Rivet
A lock is taken during conn_lookup() to check whether a connection is
expired before returning it. This lock can have some contention.

Even though this lock ensures a consistent sequence of writes, it does
not imply a specific order. A ct_clean thread taking the lock first
could read a value that would be updated immediately after by a PMD
waiting on the same lock, just as well as the inverse order.

As such, the expiration time can be stale anytime it is read. In this
context, using an atomic will ensure the same guarantees for either
writes or reads, i.e. writes are consistent and reads are not undefined
behaviour. Reading an atomic is however less costly than taking and
releasing a lock.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Acked-by: William Tu 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack-tp.c  |  2 +-
 lib/conntrack.c | 27 +++
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 183785d8a..0aec2d611 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -144,7 +144,7 @@ struct conn {
 /* Mutable data. */
 struct ovs_mutex lock; /* Guards all mutable fields. */
 ovs_u128 label;
-long long expiration;
+atomic_llong expiration;
 uint32_t mark;
 int seq_skew;
 
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 22363e7fe..5bf2816ca 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -240,7 +240,7 @@ static void
 conn_schedule_expiration(struct conn *conn, enum ct_timeout tm, long long now,
  uint32_t tp_value)
 {
-conn->expiration = now + tp_value * 1000;
+atomic_store_relaxed(&conn->expiration, now + tp_value * 1000);
 conn->exp.tm = tm;
 ignore(atomic_flag_test_and_set(&conn->exp.reschedule));
 }
diff --git a/lib/conntrack.c b/lib/conntrack.c
index e4262fdf3..9132ebc32 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -101,6 +101,7 @@ static enum ct_update_res conn_update(struct conntrack *ct, 
struct conn *conn,
   struct dp_packet *pkt,
   struct conn_lookup_ctx *ctx,
   long long now);
+static long long int conn_expiration(const struct conn *);
 static bool conn_expired(struct conn *, long long now);
 static void set_mark(struct dp_packet *, struct conn *,
  uint32_t val, uint32_t mask);
@@ -1017,13 +1018,10 @@ un_nat_packet(struct dp_packet *pkt, const struct conn 
*conn,
 static void
 conn_seq_skew_set(struct conntrack *ct, const struct conn *conn_in,
   long long now, int seq_skew, bool seq_skew_dir)
-OVS_NO_THREAD_SAFETY_ANALYSIS
 {
 struct conn *conn;
-ovs_mutex_unlock(&conn_in->lock);
-conn_lookup(ct, &conn_in->key, now, &conn, NULL);
-ovs_mutex_lock(&conn_in->lock);
 
+conn_lookup(ct, &conn_in->key, now, &conn, NULL);
 if (conn && seq_skew) {
 conn->seq_skew = seq_skew;
 conn->seq_skew_dir = seq_skew_dir;
@@ -1624,9 +1622,7 @@ ct_sweep(struct conntrack *ct, long long now, size_t 
limit)
 continue;
 }
 
-ovs_mutex_lock(&conn->lock);
-expiration = conn->expiration;
-ovs_mutex_unlock(&conn->lock);
+expiration = conn_expiration(conn);
 
 if (conn == end_of_queue) {
 /* If we already re-enqueued this conn during this sweep,
@@ -2653,14 +2649,21 @@ conn_update(struct conntrack *ct, struct conn *conn, 
struct dp_packet *pkt,
 return update_res;
 }
 
+static long long int
+conn_expiration(const struct conn *conn)
+{
+long long int expiration;
+
+atomic_read_relaxed(&CONST_CAST(struct conn *, conn)->expiration,
+&expiration);
+return expiration;
+}
+
 static bool
 conn_expired(struct conn *conn, long long now)
 {
 if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-ovs_mutex_lock(&conn->lock);
-bool expired = now >= conn->expiration ? true : false;
-ovs_mutex_unlock(&conn->lock);
-return expired;
+return now >= conn_expiration(conn);
 }
 return false;
 }
@@ -2802,7 +2805,7 @@ conn_to_ct_dpif_entry(const struct conn *conn, struct 
ct_dpif_entry *entry,
 entry->mark = conn->mark;
 memcpy(&entry->labels, &conn->label, sizeof entry->labels);
 
-long long expiration = conn->expiration - now;
+long long expiration = conn_expiration(conn) - now;
 
 struct ct_l4_proto *class = l4_protos[conn->key.nw_proto];
 if (class->conn_get_protoinfo) {
-- 
2.31.1



[ovs-dev] [PATCH v4 4/5] conntrack: Inverse conn and ct lock precedence

2022-03-25 Thread Gaetan Rivet
The lock priority order is for the global 'ct_lock' to be taken first
and then 'conn->lock'. This is an issue, as multiple operations on
connections are thus blocked between threads contending on the
global 'ct_lock'.

This was previously necessary due to how the expiration lists, timeout
policies and zone limits were managed. They are now using RCU-friendly
structures that allow concurrent readers. The mutual exclusion now only
needs to happen during writes.

This allows reducing the 'ct_lock' precedence and taking it only
when writing the relevant structures. This will reduce contention on
'ct_lock', which impairs scalability when the connection tracker is
used by many threads.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  7 --
 lib/conntrack-tp.c  | 30 +
 lib/conntrack.c | 49 ++---
 3 files changed, 37 insertions(+), 49 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 87107a05d..183785d8a 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -134,6 +134,9 @@ struct conn {
 uint16_t nat_action;
 char *alg;
 struct conn *nat_conn; /* The NAT 'conn' context, if there is one. */
+atomic_flag reclaimed; /* False during the lifetime of the connection,
+* True as soon as a thread has started freeing
+* its memory. */
 
 /* Inserted once by a PMD, then managed by the 'ct_clean' thread. */
 struct conn_expire exp;
@@ -229,8 +232,8 @@ struct conntrack {
 };
 
 /* Lock acquisition order:
- *1. 'ct_lock'
- *2. 'conn->lock'
+ *1. 'conn->lock'
+ *2. 'ct_lock'
  *3. 'resources_lock'
  */
 
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 592e10c6f..22363e7fe 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -245,58 +245,30 @@ conn_schedule_expiration(struct conn *conn, enum 
ct_timeout tm, long long now,
 ignore(atomic_flag_test_and_set(&conn->exp.reschedule));
 }
 
-static void
-conn_update_expiration__(struct conntrack *ct, struct conn *conn,
- enum ct_timeout tm, long long now,
- uint32_t tp_value)
-OVS_REQUIRES(conn->lock)
-{
-ovs_mutex_unlock(&conn->lock);
-
-ovs_mutex_lock(&ct->ct_lock);
-ovs_mutex_lock(&conn->lock);
-conn_schedule_expiration(conn, tm, now, tp_value);
-ovs_mutex_unlock(&conn->lock);
-ovs_mutex_unlock(&ct->ct_lock);
-
-ovs_mutex_lock(&conn->lock);
-}
-
 /* The conn entry lock must be held on entry and exit. */
 void
 conn_update_expiration(struct conntrack *ct, struct conn *conn,
enum ct_timeout tm, long long now)
-OVS_REQUIRES(conn->lock)
 {
 struct timeout_policy *tp;
 uint32_t val;
 
-ovs_mutex_unlock(&conn->lock);
-
-ovs_mutex_lock(&ct->ct_lock);
-ovs_mutex_lock(&conn->lock);
 tp = timeout_policy_lookup(ct, conn->tp_id);
 if (tp) {
 val = tp->policy.attrs[tm_to_ct_dpif_tp(tm)];
 } else {
 val = ct_dpif_netdev_tp_def[tm_to_ct_dpif_tp(tm)];
 }
-ovs_mutex_unlock(&conn->lock);
-ovs_mutex_unlock(&ct->ct_lock);
-
-ovs_mutex_lock(&conn->lock);
 VLOG_DBG_RL(&rl, "Update timeout %s zone=%u with policy id=%d "
 "val=%u sec.",
 ct_timeout_str[tm], conn->key.zone, conn->tp_id, val);
 
-conn_update_expiration__(ct, conn, tm, now, val);
+conn_schedule_expiration(conn, tm, now, val);
 }
 
-/* ct_lock must be held. */
 void
 conn_init_expiration(struct conntrack *ct, struct conn *conn,
  enum ct_timeout tm, long long now)
-OVS_REQUIRES(ct->ct_lock)
 {
 struct timeout_policy *tp;
 uint32_t val;
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 81322e405..e4262fdf3 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -468,7 +468,7 @@ zone_limit_delete(struct conntrack *ct, uint16_t zone)
 
 static void
 conn_clean_cmn(struct conntrack *ct, struct conn *conn)
-OVS_REQUIRES(ct->ct_lock)
+OVS_REQUIRES(conn->lock, ct->ct_lock)
 {
 if (conn->alg) {
 expectation_clean(ct, &conn->key);
@@ -498,18 +498,29 @@ conn_unref(struct conn *conn)
  * removes the associated nat 'conn' from the lookup datastructures. */
 static void
 conn_clean(struct conntrack *ct, struct conn *conn)
-OVS_REQUIRES(ct->ct_lock)
+OVS_EXCLUDED(conn->lock, ct->ct_lock)
 {
 ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
 
+if (atomic_flag_test_and_set(&conn->reclaimed)) {
+return;
+}
+
+ovs_mutex_lock(&conn->lock);
+
+ovs_m

[ovs-dev] [PATCH v4 3/5] conntrack-tp: Use a cmap to store timeout policies

2022-03-25 Thread Gaetan Rivet
Multiple lookups are done to stored timeout policies, each time blocking
the global 'ct_lock'. This is usually not necessary and it should be
acceptable to get policy updates slightly delayed (by one RCU sync
at most). Using a CMAP reduces multiple lock taking and releasing in
the connection insertion path.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Acked-by: William Tu 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack-tp.c  | 54 +++--
 lib/conntrack.c |  9 ---
 lib/conntrack.h |  2 +-
 4 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index d622ba5e0..87107a05d 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -206,7 +206,7 @@ struct conntrack {
 struct cmap conns OVS_GUARDED;
 struct mpsc_queue exp_lists[N_CT_TM];
 struct cmap zone_limits OVS_GUARDED;
-struct hmap timeout_policies OVS_GUARDED;
+struct cmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
 pthread_t clean_thread; /* Periodically cleans up connection tracker. */
 struct latch clean_thread_exit; /* To destroy the 'clean_thread'. */
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 6de2354c0..592e10c6f 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -47,14 +47,15 @@ static unsigned int ct_dpif_netdev_tp_def[] = {
 };
 
 static struct timeout_policy *
-timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
+timeout_policy_lookup_protected(struct conntrack *ct, int32_t tp_id)
 OVS_REQUIRES(ct->ct_lock)
 {
 struct timeout_policy *tp;
 uint32_t hash;
 
 hash = hash_int(tp_id, ct->hash_basis);
-HMAP_FOR_EACH_IN_BUCKET (tp, node, hash, &ct->timeout_policies) {
+CMAP_FOR_EACH_WITH_HASH_PROTECTED (tp, node, hash,
+   &ct->timeout_policies) {
 if (tp->policy.id == tp_id) {
 return tp;
 }
@@ -62,20 +63,25 @@ timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
 return NULL;
 }
 
-struct timeout_policy *
-timeout_policy_get(struct conntrack *ct, int32_t tp_id)
+static struct timeout_policy *
+timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
 {
 struct timeout_policy *tp;
+uint32_t hash;
 
-ovs_mutex_lock(&ct->ct_lock);
-tp = timeout_policy_lookup(ct, tp_id);
-if (!tp) {
-ovs_mutex_unlock(&ct->ct_lock);
-return NULL;
+hash = hash_int(tp_id, ct->hash_basis);
+CMAP_FOR_EACH_WITH_HASH (tp, node, hash, &ct->timeout_policies) {
+if (tp->policy.id == tp_id) {
+return tp;
+}
 }
+return NULL;
+}
 
-ovs_mutex_unlock(&ct->ct_lock);
-return tp;
+struct timeout_policy *
+timeout_policy_get(struct conntrack *ct, int32_t tp_id)
+{
+return timeout_policy_lookup(ct, tp_id);
 }
 
 static void
@@ -125,27 +131,30 @@ timeout_policy_create(struct conntrack *ct,
 init_default_tp(tp, tp_id);
 update_existing_tp(tp, new_tp);
 hash = hash_int(tp_id, ct->hash_basis);
-hmap_insert(&ct->timeout_policies, &tp->node, hash);
+cmap_insert(&ct->timeout_policies, &tp->node, hash);
 }
 
 static void
 timeout_policy_clean(struct conntrack *ct, struct timeout_policy *tp)
 OVS_REQUIRES(ct->ct_lock)
 {
-hmap_remove(&ct->timeout_policies, &tp->node);
-free(tp);
+uint32_t hash = hash_int(tp->policy.id, ct->hash_basis);
+cmap_remove(&ct->timeout_policies, &tp->node, hash);
+ovsrcu_postpone(free, tp);
 }
 
 static int
-timeout_policy_delete__(struct conntrack *ct, uint32_t tp_id)
+timeout_policy_delete__(struct conntrack *ct, uint32_t tp_id,
+bool warn_on_error)
 OVS_REQUIRES(ct->ct_lock)
 {
+struct timeout_policy *tp;
 int err = 0;
-struct timeout_policy *tp = timeout_policy_lookup(ct, tp_id);
 
+tp = timeout_policy_lookup_protected(ct, tp_id);
 if (tp) {
 timeout_policy_clean(ct, tp);
-} else {
+} else if (warn_on_error) {
 VLOG_WARN_RL(&rl, "Failed to delete a non-existent timeout "
  "policy: id=%d", tp_id);
 err = ENOENT;
@@ -159,7 +168,7 @@ timeout_policy_delete(struct conntrack *ct, uint32_t tp_id)
 int err;
 
 ovs_mutex_lock(&ct->ct_lock);
-err = timeout_policy_delete__(ct, tp_id);
+err = timeout_policy_delete__(ct, tp_id, true);
 ovs_mutex_unlock(&ct->ct_lock);
 return err;
 }
@@ -170,7 +179,7 @@ timeout_policy_init(struct conntrack *ct)
 {
 struct timeout_policy tp;
 
-hmap_init(&ct->timeout_policies);
+cmap_init(&ct->timeout_policies);
 
 /* Create default timeout policy. */
 memset(&tp, 0, sizeof tp);
@@ -182,14 +191,11 @@ int
 timeout_

[ovs-dev] [PATCH v4 2/5] conntrack: Use a cmap to store zone limits

2022-03-25 Thread Gaetan Rivet
Change the data structure from hmap to cmap for zone limits.
As they are shared amongst multiple conntrack users, multiple
readers want to check the current zone limit state before progressing in
their processing. Using a CMAP allows doing lookups without taking the
global 'ct_lock', thus reducing contention.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack.c | 70 -
 lib/conntrack.h |  2 +-
 lib/dpif-netdev.c   |  5 +--
 4 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 4a6f7a787..d622ba5e0 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -205,7 +205,7 @@ struct conntrack {
 struct ovs_mutex ct_lock; /* Protects 2 following fields. */
 struct cmap conns OVS_GUARDED;
 struct mpsc_queue exp_lists[N_CT_TM];
-struct hmap zone_limits OVS_GUARDED;
+struct cmap zone_limits OVS_GUARDED;
 struct hmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
 pthread_t clean_thread; /* Periodically cleans up connection tracker. */
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 1320e5afb..0b095b706 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -81,7 +81,7 @@ enum ct_alg_ctl_type {
 };
 
 struct zone_limit {
-struct hmap_node node;
+struct cmap_node node;
 struct conntrack_zone_limit czl;
 };
 
@@ -311,7 +311,7 @@ conntrack_init(void)
 for (unsigned i = 0; i < ARRAY_SIZE(ct->exp_lists); i++) {
 mpsc_queue_init(&ct->exp_lists[i]);
 }
-hmap_init(&ct->zone_limits);
+cmap_init(&ct->zone_limits);
 ct->zone_limit_seq = 0;
 timeout_policy_init(ct);
 ovs_mutex_unlock(&ct->ct_lock);
@@ -346,12 +346,25 @@ zone_key_hash(int32_t zone, uint32_t basis)
 }
 
 static struct zone_limit *
-zone_limit_lookup(struct conntrack *ct, int32_t zone)
+zone_limit_lookup_protected(struct conntrack *ct, int32_t zone)
 OVS_REQUIRES(ct->ct_lock)
 {
 uint32_t hash = zone_key_hash(zone, ct->hash_basis);
 struct zone_limit *zl;
-HMAP_FOR_EACH_IN_BUCKET (zl, node, hash, &ct->zone_limits) {
+CMAP_FOR_EACH_WITH_HASH_PROTECTED (zl, node, hash, &ct->zone_limits) {
+if (zl->czl.zone == zone) {
+return zl;
+}
+}
+return NULL;
+}
+
+static struct zone_limit *
+zone_limit_lookup(struct conntrack *ct, int32_t zone)
+{
+uint32_t hash = zone_key_hash(zone, ct->hash_basis);
+struct zone_limit *zl;
+CMAP_FOR_EACH_WITH_HASH (zl, node, hash, &ct->zone_limits) {
 if (zl->czl.zone == zone) {
 return zl;
 }
@@ -361,7 +374,6 @@ zone_limit_lookup(struct conntrack *ct, int32_t zone)
 
 static struct zone_limit *
 zone_limit_lookup_or_default(struct conntrack *ct, int32_t zone)
-OVS_REQUIRES(ct->ct_lock)
 {
 struct zone_limit *zl = zone_limit_lookup(ct, zone);
 return zl ? zl : zone_limit_lookup(ct, DEFAULT_ZONE);
@@ -370,13 +382,16 @@ zone_limit_lookup_or_default(struct conntrack *ct, 
int32_t zone)
 struct conntrack_zone_limit
 zone_limit_get(struct conntrack *ct, int32_t zone)
 {
-ovs_mutex_lock(&ct->ct_lock);
-struct conntrack_zone_limit czl = {DEFAULT_ZONE, 0, 0, 0};
+struct conntrack_zone_limit czl = {
+.zone = DEFAULT_ZONE,
+.limit = 0,
+.count = ATOMIC_COUNT_INIT(0),
+.zone_limit_seq = 0,
+};
 struct zone_limit *zl = zone_limit_lookup_or_default(ct, zone);
 if (zl) {
 czl = zl->czl;
 }
-ovs_mutex_unlock(&ct->ct_lock);
 return czl;
 }
 
@@ -384,13 +399,19 @@ static int
 zone_limit_create(struct conntrack *ct, int32_t zone, uint32_t limit)
 OVS_REQUIRES(ct->ct_lock)
 {
+struct zone_limit *zl = zone_limit_lookup_protected(ct, zone);
+
+if (zl) {
+return 0;
+}
+
 if (zone >= DEFAULT_ZONE && zone <= MAX_ZONE) {
-struct zone_limit *zl = xzalloc(sizeof *zl);
+zl = xzalloc(sizeof *zl);
 zl->czl.limit = limit;
 zl->czl.zone = zone;
 zl->czl.zone_limit_seq = ct->zone_limit_seq++;
 uint32_t hash = zone_key_hash(zone, ct->hash_basis);
-hmap_insert(&ct->zone_limits, &zl->node, hash);
+cmap_insert(&ct->zone_limits, &zl->node, hash);
 return 0;
 } else {
 return EINVAL;
@@ -401,13 +422,14 @@ int
 zone_limit_update(struct conntrack *ct, int32_t zone, uint32_t limit)
 {
 int err = 0;
-ovs_mutex_lock(&ct->ct_lock);
 struct zone_limit *zl = zone_limit_lookup(ct, zone);
 if (zl) {
 zl->czl.limit = limit;
 VLOG_INFO("Changed zone limit of %u for zone %d", limit, zone);
 } else {
+ovs_mutex_lock(&ct->ct_lock);
 err = 

[ovs-dev] [PATCH v4 0/5] conntrack: improve multithread scalability

2022-03-25 Thread Gaetan Rivet
The conntrack is executed within the datapath. Locks along this path are crucial
and their critical sections should be minimal. The global 'ct_lock' must be
taken before any action on connection states. This lock is needed for many
operations on the conntrack, slowing down the datapath.

The cleanup thread 'ct_clean' will take it to do its job. As it can hold it a
long time, the thread is limited in the number of connections cleaned per round,
and calls are rate-limited.

* Timeout policy locking is contrived to avoid deadlock.
  Anytime a connection state is updated, the connection is first unlocked,
  'ct_lock' is taken, then the connection is locked again. The reverse
  is done to unlock.

* Scalability is poor. The global ct_lock needs to be taken before applying
  any change to a conn object. This is backward: local changes to smaller
  objects should be independent, then the global lock should only be taken once
  the rest of the work is done, the goal being to have the smallest possible
  critical section.

It can be improved. Using RCU-friendly structures for connections, zone limits
and timeout policies, read-mostly workloads are improved and the precedence of
the global 'ct_lock' and the local 'conn->lock' can be inverted.

Running the conntrack benchmark we see these changes:
  ./tests/ovstest test-conntrack benchmark  300 32

code \ N      1     2     4     8
  Before   2310  2766  6117 19838  (ms)
   After   2072  2084  2653  4541  (ms)

One thread in the benchmark executes the task of a PMD, while the 'ct_clean'
thread runs in the background as well.

Github actions: https://github.com/grivet/ovs/actions/runs/574446345

v2:

An mpsc-queue is used instead of rculist to manage the connection expiration lists.
PMDs and ct_clean all act as producers, while ct_clean is the sole consumer thread.
A PMD now needs to take the 'ct_lock' only when creating a new connection, and
only while inserting it in the conn CMAP. For any update, only the conn lock is
now required to properly change its state.

The mpsc-queue implementation is identical to the one from the parallel offload 
series [1].

CI: https://github.com/grivet/ovs/actions/runs/772118640

[1]: https://patchwork.ozlabs.org/project/openvswitch/list/?series=238779

v3:

The last part of the series, modifying the rate limit of conntrack_clean, is
dropped. It is not necessary to improve scalability and can be done later if
needed.

CI: https://github.com/grivet/ovs/actions/runs/940610003

v4:

  * Rebase on master.
  * Fix race condition introduced by patch [v3] 6/7 [1]

I prepared this version last September but got sidetracked.
Paolo's alternative series [2] can also improve the same metric.
I am not sure which of the two would be best, so I am sending this
revised version so that it is available for public comment.

[1]: https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385470.html
[2]: 
https://patchwork.ozlabs.org/project/openvswitch/list/?series=291239&state=*

Gaetan Rivet (5):
  conntrack: Use mpsc-queue to store conn expirations
  conntrack: Use a cmap to store zone limits
  conntrack-tp: Use a cmap to store timeout policies
  conntrack: Inverse conn and ct lock precedence
  conntrack: Use an atomic conn expiration value

 lib/conntrack-private.h |  97 ++-
 lib/conntrack-tp.c  | 100 ++-
 lib/conntrack.c | 265 +---
 lib/conntrack.h |   4 +-
 lib/dpif-netdev.c   |   5 +-
 5 files changed, 307 insertions(+), 164 deletions(-)

--
2.31.1



[ovs-dev] [PATCH v4 1/5] conntrack: Use mpsc-queue to store conn expirations

2022-03-25 Thread Gaetan Rivet
Change the connection expiration lists from ovs_list to mpsc-queue.
This is a pre-step towards reducing the granularity of 'ct_lock'.

It simplifies the responsibilities toward updating the expiration queue.
The dataplane now appends the new conn for expiration once during
creation.  Any further update will only consist in writing the conn
expiration limit and marking the conn for expiration rescheduling.

The ageing thread 'ct_clean' is the only one consuming the expiration
lists.  If a conn was marked for rescheduling by a dataplane thread,
'ct_clean' will move the conn to the end of the queue.

Once the locks have been reworked, neither the dataplane threads nor
'ct_clean' will have to take a lock to update the expiration lists
(assuming the consumer lock is perpetually held by 'ct_clean').

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  84 +++-
 lib/conntrack-tp.c  |  28 +-
 lib/conntrack.c | 118 ++--
 3 files changed, 173 insertions(+), 57 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index dfdf4e676..4a6f7a787 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -29,6 +29,7 @@
 #include "openvswitch/list.h"
 #include "openvswitch/types.h"
 #include "packets.h"
+#include "mpsc-queue.h"
 #include "unaligned.h"
 #include "dp-packet.h"
 
@@ -86,22 +87,57 @@ struct alg_exp_node {
 bool nat_rpl_dst;
 };
 
+/* Timeouts: all the possible timeout states passed to update_expiration()
+ * are listed here. The name will be prefix by CT_TM_ and the value is in
+ * milliseconds */
+#define CT_TIMEOUTS \
+CT_TIMEOUT(TCP_FIRST_PACKET) \
+CT_TIMEOUT(TCP_OPENING) \
+CT_TIMEOUT(TCP_ESTABLISHED) \
+CT_TIMEOUT(TCP_CLOSING) \
+CT_TIMEOUT(TCP_FIN_WAIT) \
+CT_TIMEOUT(TCP_CLOSED) \
+CT_TIMEOUT(OTHER_FIRST) \
+CT_TIMEOUT(OTHER_MULTIPLE) \
+CT_TIMEOUT(OTHER_BIDIR) \
+CT_TIMEOUT(ICMP_FIRST) \
+CT_TIMEOUT(ICMP_REPLY)
+
+enum ct_timeout {
+#define CT_TIMEOUT(NAME) CT_TM_##NAME,
+CT_TIMEOUTS
+#undef CT_TIMEOUT
+N_CT_TM
+};
+
 enum OVS_PACKED_ENUM ct_conn_type {
 CT_CONN_TYPE_DEFAULT,
 CT_CONN_TYPE_UN_NAT,
 };
 
+struct conn_expire {
+struct mpsc_queue_node node;
+/* Timeout state of the connection.
+ * It follows the connection state updates.
+ */
+enum ct_timeout tm;
+atomic_flag reschedule;
+struct ovs_refcount refcount;
+};
+
 struct conn {
 /* Immutable data. */
 struct conn_key key;
 struct conn_key rev_key;
 struct conn_key parent_key; /* Only used for orig_tuple support. */
-struct ovs_list exp_node;
 struct cmap_node cm_node;
 uint16_t nat_action;
 char *alg;
 struct conn *nat_conn; /* The NAT 'conn' context, if there is one. */
 
+/* Inserted once by a PMD, then managed by the 'ct_clean' thread. */
+struct conn_expire exp;
+
 /* Mutable data. */
 struct ovs_mutex lock; /* Guards all mutable fields. */
 ovs_u128 label;
@@ -132,22 +168,6 @@ enum ct_update_res {
 CT_UPDATE_VALID_NEW,
 };
 
-/* Timeouts: all the possible timeout states passed to update_expiration()
- * are listed here. The name will be prefix by CT_TM_ and the value is in
- * milliseconds */
-#define CT_TIMEOUTS \
-CT_TIMEOUT(TCP_FIRST_PACKET) \
-CT_TIMEOUT(TCP_OPENING) \
-CT_TIMEOUT(TCP_ESTABLISHED) \
-CT_TIMEOUT(TCP_CLOSING) \
-CT_TIMEOUT(TCP_FIN_WAIT) \
-CT_TIMEOUT(TCP_CLOSED) \
-CT_TIMEOUT(OTHER_FIRST) \
-CT_TIMEOUT(OTHER_MULTIPLE) \
-CT_TIMEOUT(OTHER_BIDIR) \
-CT_TIMEOUT(ICMP_FIRST) \
-CT_TIMEOUT(ICMP_REPLY)
-
 #define NAT_ACTION_SNAT_ALL (NAT_ACTION_SRC | NAT_ACTION_SRC_PORT)
 #define NAT_ACTION_DNAT_ALL (NAT_ACTION_DST | NAT_ACTION_DST_PORT)
 
@@ -181,17 +201,10 @@ enum ct_ephemeral_range {
 #define FOR_EACH_PORT_IN_RANGE(curr, min, max) \
 FOR_EACH_PORT_IN_RANGE__(curr, min, max, OVS_JOIN(idx, __COUNTER__))
 
-enum ct_timeout {
-#define CT_TIMEOUT(NAME) CT_TM_##NAME,
-CT_TIMEOUTS
-#undef CT_TIMEOUT
-N_CT_TM
-};
-
 struct conntrack {
 struct ovs_mutex ct_lock; /* Protects 2 following fields. */
 struct cmap conns OVS_GUARDED;
-struct ovs_list exp_lists[N_CT_TM] OVS_GUARDED;
+struct mpsc_queue exp_lists[N_CT_TM];
 struct hmap zone_limits OVS_GUARDED;
 struct hmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
@@ -237,4 +250,25 @@ struct ct_l4_proto {
struct ct_dpif_protoinfo *);
 };
 
+static inline void
+conn_expire_push_back(struct conntrack *ct, struct conn *conn)
+{
+if (ovs_refcount_try_ref_rcu(&conn->exp.refcount)) {
+atomic_flag_clear(&conn->exp.reschedule);
+mpsc_queue_insert(&ct->exp_lists[conn->exp.tm], &co

[ovs-dev] [PATCH] ofproto: Use xlate map for uuid lookups

2022-02-23 Thread Gaetan Rivet
The ofproto map 'all_ofproto_dpifs_by_uuid' does not support
concurrent accesses. It is however read by upcall handler threads
and written by the main thread at the same time.

Additionally, handler threads will change the ams_seq while
an ofproto is being destroyed, triggering crashes with the
following backtrace:

(gdb) bt
  hmap_next (hmap.h:398)
  seq_wake_waiters (seq.c:326)
  seq_change_protected (seq.c:134)
  seq_change (seq.c:144)
  ofproto_dpif_send_async_msg (ofproto_dpif.c:263)
  process_upcall (ofproto_dpif_upcall.c:1782)
  recv_upcalls (ofproto_dpif_upcall.c:1026)
  udpif_upcall_handler (ofproto/ofproto_dpif_upcall.c:945)
  ovsthread_wrapper (ovs_thread.c:734)

To solve both issues, remove the 'all_ofproto_dpifs_by_uuid'.
Instead, another map already storing ofprotos in xlate can be used.

During an ofproto destruction, its reference is removed from the current
xlate xcfg. Such a change is committed only after all threads have quiesced
at least once during xlate_txn_commit(). This wait ensures that the
removal is seen by all threads, making it impossible for a thread to
still hold a reference while the destruction proceeds.

Furthermore, the xlate maps are copied during updates instead of
being written in place. It is thus correct to read xcfg->xbridges while
inserting or removing from new_xcfg->xbridges.

Finally, now that ofproto_dpif lookups are done through xcfg->xbridges,
it is important to use a high level of entropy. Because the hash was computed
from the ofproto pointer, fewer bits were random than with the uuid key used in
'all_ofproto_dpifs_by_uuid'. To solve this, use the ofproto uuid as the key
in xbridges as well, improving entropy.

Fixes: fcb9579be3c7 ("ofproto: Add 'ofproto_uuid' and 'ofp_in_port' to user action cookie.")
Suggested-by: Adrian Moreno 
Signed-off-by: Yunjian Wang 
Signed-off-by: Gaetan Rivet 
---

Following the discussion on the fix
https://patchwork.ozlabs.org/project/openvswitch/patch/1638530715-44436-1-git-send-email-wangyunj...@huawei.com/

I tested it with Peng's ofproto refcount patch in tree:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220219032607.15757-1-hepeng.0...@bytedance.com/

CI result: https://github.com/grivet/ovs/actions/runs/1889083195

 ofproto/ofproto-dpif-xlate.c | 21 +++--
 ofproto/ofproto-dpif-xlate.h |  1 +
 ofproto/ofproto-dpif.c   | 19 +--
 3 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 578cbfe58..4a7388071 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -865,7 +865,7 @@ xlate_xbridge_init(struct xlate_cfg *xcfg, struct xbridge 
*xbridge)
 ovs_list_init(&xbridge->xbundles);
 hmap_init(&xbridge->xports);
 hmap_insert(&xcfg->xbridges, &xbridge->hmap_node,
-hash_pointer(xbridge->ofproto, 0));
+uuid_hash(&xbridge->ofproto->uuid));
 }
 
 static void
@@ -1639,7 +1639,7 @@ xbridge_lookup(struct xlate_cfg *xcfg, const struct 
ofproto_dpif *ofproto)
 
 xbridges = &xcfg->xbridges;
 
-HMAP_FOR_EACH_IN_BUCKET (xbridge, hmap_node, hash_pointer(ofproto, 0),
+HMAP_FOR_EACH_IN_BUCKET (xbridge, hmap_node, uuid_hash(&ofproto->uuid),
  xbridges) {
 if (xbridge->ofproto == ofproto) {
 return xbridge;
@@ -1661,6 +1661,23 @@ xbridge_lookup_by_uuid(struct xlate_cfg *xcfg, const 
struct uuid *uuid)
 return NULL;
 }
 
+struct ofproto_dpif *
+xlate_ofproto_lookup(const struct uuid *uuid)
+{
+struct xlate_cfg *xcfg = ovsrcu_get(struct xlate_cfg *, &xcfgp);
+struct xbridge *xbridge;
+
+if (!xcfg) {
+return NULL;
+}
+
+xbridge = xbridge_lookup_by_uuid(xcfg, uuid);
+if (xbridge != NULL) {
+return xbridge->ofproto;
+}
+return NULL;
+}
+
 static struct xbundle *
 xbundle_lookup(struct xlate_cfg *xcfg, const struct ofbundle *ofbundle)
 {
diff --git a/ofproto/ofproto-dpif-xlate.h b/ofproto/ofproto-dpif-xlate.h
index 851088d79..2ba90e999 100644
--- a/ofproto/ofproto-dpif-xlate.h
+++ b/ofproto/ofproto-dpif-xlate.h
@@ -176,6 +176,7 @@ void xlate_ofproto_set(struct ofproto_dpif *, const char 
*name, struct dpif *,
bool forward_bpdu, bool has_in_band,
const struct dpif_backer_support *support);
 void xlate_remove_ofproto(struct ofproto_dpif *);
+struct ofproto_dpif *xlate_ofproto_lookup(const struct uuid *uuid);
 
 void xlate_bundle_set(struct ofproto_dpif *, struct ofbundle *,
   const char *name, enum port_vlan_mode,
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 8143dd965..7b4a1b3d8 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -215,10 +215,6 @@ struct shash all_dpif_backers = 
SHASH_INITIALIZER(&all_dpif_backer

[ovs-dev] [PATCH v1 1/3] dpif-netdev: Move port flush after datapath reconfiguration

2022-02-04 Thread Gaetan Rivet
Port flush and offload uninit should be moved after the datapath
has been reconfigured. That way, no other thread, such as a PMD, will
find this port to poll and enqueue further offload requests.

After a flush, almost no further offload request for this port should
be found in the queue.

There will still be some issued by revalidators, but they
will be caught when the offload thread fails to take a netdev ref.

This change fixes the issue of datapath reference being improperly
accessed by offload threads while it is being destroyed.

Fixes: 5b0aa55776cb ("dpif-netdev: Execute flush from offload thread.")
Fixes: 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")
Signed-off-by: Ilya Maximets 
Signed-off-by: Gaetan Rivet 
---
 lib/dpif-netdev.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index e28e0b554..b5702e6a1 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2313,13 +2313,22 @@ static void
 do_del_port(struct dp_netdev *dp, struct dp_netdev_port *port)
 OVS_REQ_WRLOCK(dp->port_rwlock)
 {
-dp_netdev_offload_flush(dp, port);
-netdev_uninit_flow_api(port->netdev);
 hmap_remove(&dp->ports, &port->node);
 seq_change(dp->port_seq);
 
 reconfigure_datapath(dp);
 
+/* Flush and disable offloads only after 'port' has been made
+ * inaccessible through datapath reconfiguration.
+ * This prevents having PMDs enqueuing offload requests after
+ * the flush.
+ * When only this port is deleted instead of the whole datapath,
+ * revalidator threads are still active and can still enqueue
+ * offload modification or deletion. Managing those stray requests
+ * is done in the offload threads. */
+dp_netdev_offload_flush(dp, port);
+netdev_uninit_flow_api(port->netdev);
+
 port_destroy(port);
 }
 
-- 
2.31.1



[ovs-dev] [PATCH v1 3/3] dpif-netdev: Use dp_netdev reference in offload threads

2022-02-04 Thread Gaetan Rivet
The PMD reference taken is not actually used; it is only needed to reach
the linked dp_netdev. Additionally, taking the PMD reference
does not protect against the disappearance of the dp_netdev,
so it is misleading.

The dp reference is protected by the way the ports are being deleted
during datapath deletion. No further offload request should be found
past a flush, so it is safe to keep this reference in the offload item.

Signed-off-by: Gaetan Rivet 
---
 lib/dpif-netdev.c | 50 +++
 1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index bb03cf137..4d886092b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -348,7 +348,6 @@ enum {
 };
 
 struct dp_offload_flow_item {
-struct dp_netdev_pmd_thread *pmd;
 struct dp_netdev_flow *flow;
 int op;
 struct match match;
@@ -358,7 +357,6 @@ struct dp_offload_flow_item {
 };
 
 struct dp_offload_flush_item {
-struct dp_netdev *dp;
 struct netdev *netdev;
 struct ovs_barrier *barrier;
 };
@@ -372,6 +370,7 @@ struct dp_offload_thread_item {
 struct mpsc_queue_node node;
 enum dp_offload_type type;
 long long int timestamp;
+struct dp_netdev *dp;
 union dp_offload_thread_data data[0];
 };
 
@@ -2559,10 +2558,10 @@ flow_mark_has_no_ref(uint32_t mark)
 }
 
 static int
-mark_to_flow_disassociate(struct dp_netdev_pmd_thread *pmd,
+mark_to_flow_disassociate(struct dp_netdev *dp,
   struct dp_netdev_flow *flow)
 {
-const char *dpif_type_str = dpif_normalize_type(pmd->dp->class->type);
+const char *dpif_type_str = dpif_normalize_type(dp->class->type);
 struct cmap_node *mark_node = CONST_CAST(struct cmap_node *,
  &flow->mark_node);
 unsigned int tid = netdev_offload_thread_id();
@@ -2591,9 +2590,9 @@ mark_to_flow_disassociate(struct dp_netdev_pmd_thread 
*pmd,
 if (port) {
 /* Taking a global 'port_rwlock' to fulfill thread safety
  * restrictions regarding netdev port mapping. */
-ovs_rwlock_rdlock(&pmd->dp->port_rwlock);
+ovs_rwlock_rdlock(&dp->port_rwlock);
 ret = netdev_flow_del(port, &flow->mega_ufid, NULL);
-ovs_rwlock_unlock(&pmd->dp->port_rwlock);
+ovs_rwlock_unlock(&dp->port_rwlock);
 netdev_close(port);
 }
 
@@ -2635,7 +2634,7 @@ mark_to_flow_find(const struct dp_netdev_pmd_thread *pmd,
 }
 
 static struct dp_offload_thread_item *
-dp_netdev_alloc_flow_offload(struct dp_netdev_pmd_thread *pmd,
+dp_netdev_alloc_flow_offload(struct dp_netdev *dp,
  struct dp_netdev_flow *flow,
  int op)
 {
@@ -2646,13 +2645,12 @@ dp_netdev_alloc_flow_offload(struct 
dp_netdev_pmd_thread *pmd,
 flow_offload = &item->data->flow;
 
 item->type = DP_OFFLOAD_FLOW;
+item->dp = dp;
 
-flow_offload->pmd = pmd;
 flow_offload->flow = flow;
 flow_offload->op = op;
 
 dp_netdev_flow_ref(flow);
-dp_netdev_pmd_try_ref(pmd);
 
 return item;
 }
@@ -2671,7 +2669,6 @@ dp_netdev_free_flow_offload(struct dp_offload_thread_item 
*offload)
 {
 struct dp_offload_flow_item *flow_offload = &offload->data->flow;
 
-dp_netdev_pmd_unref(flow_offload->pmd);
 dp_netdev_flow_unref(flow_offload->flow);
 ovsrcu_postpone(dp_netdev_free_flow_offload__, offload);
 }
@@ -2714,9 +2711,9 @@ dp_netdev_offload_flow_enqueue(struct 
dp_offload_thread_item *item)
 }
 
 static int
-dp_netdev_flow_offload_del(struct dp_offload_flow_item *offload)
+dp_netdev_flow_offload_del(struct dp_offload_thread_item *item)
 {
-return mark_to_flow_disassociate(offload->pmd, offload->flow);
+return mark_to_flow_disassociate(item->dp, item->data->flow.flow);
 }
 
 /*
@@ -2731,12 +2728,13 @@ dp_netdev_flow_offload_del(struct dp_offload_flow_item 
*offload)
  * valid, thus only item 2 needed.
  */
 static int
-dp_netdev_flow_offload_put(struct dp_offload_flow_item *offload)
+dp_netdev_flow_offload_put(struct dp_offload_thread_item *item)
 {
-struct dp_netdev_pmd_thread *pmd = offload->pmd;
+struct dp_offload_flow_item *offload = &item->data->flow;
+struct dp_netdev *dp = item->dp;
 struct dp_netdev_flow *flow = offload->flow;
 odp_port_t in_port = flow->flow.in_port.odp_port;
-const char *dpif_type_str = dpif_normalize_type(pmd->dp->class->type);
+const char *dpif_type_str = dpif_normalize_type(dp->class->type);
 bool modification = offload->op == DP_NETDEV_FLOW_OFFLOAD_OP_MOD
 && flow->mark != INVALID_FLOW_MARK;
 struct offload_info info;
@@ -2782,12 +2780,12 @@ dp_netdev_flow_offload_put(struct dp_offload_flow_item 
*offload)
 
 /* Taking 

[ovs-dev] [PATCH v1 2/3] dpif-netdev: Fix a race condition in deletion of offloaded flows

2022-02-04 Thread Gaetan Rivet
From: Sriharsha Basavapatna 

In dp_netdev_pmd_remove_flow() we schedule the deletion of an
offloaded flow, if a mark has been assigned to the flow. But if
this occurs in the window in which the offload thread completes
offloading the flow and assigns a mark to the flow, then we miss
deleting the flow. This problem has been observed while adding
and deleting flows in a loop. To fix this, always enqueue flow
deletion regardless of the flow->mark being set.

Fixes: 241bad15d99a ("dpif-netdev: associate flow with a mark id")
Signed-off-by: Sriharsha Basavapatna 
Signed-off-by: Gaetan Rivet 
---
 lib/dpif-netdev.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index b5702e6a1..bb03cf137 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2927,6 +2927,10 @@ queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
 {
 struct dp_offload_thread_item *offload;
 
+if (!netdev_is_flow_api_enabled()) {
+return;
+}
+
 offload = dp_netdev_alloc_flow_offload(pmd, flow,
DP_NETDEV_FLOW_OFFLOAD_OP_DEL);
 offload->timestamp = pmd->ctx.now;
@@ -3038,9 +3042,7 @@ dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread 
*pmd,
 dp_netdev_simple_match_remove(pmd, flow);
 cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid));
 ccmap_dec(&pmd->n_flows, odp_to_u32(in_port));
-if (flow->mark != INVALID_FLOW_MARK) {
-queue_netdev_flow_del(pmd, flow);
-}
+queue_netdev_flow_del(pmd, flow);
 flow->dead = true;
 
 dp_netdev_flow_unref(flow);
-- 
2.31.1



[ovs-dev] [PATCH v1 0/3] Fix offload rule flush race condition

2022-02-04 Thread Gaetan Rivet
A race condition has been identified during datapath destruction,
involving the port offload flushes issued at that time.

This series addresses these race conditions, cleaning up the
port deletion process.

The last patch also cleans up the offload structure.
It is not strictly necessary, unlike the first two fixes,
so I put it last. It can wait until after the code freeze
to be integrated.

I tested for a few hours without ASAN enabled and did not see any
issues. ASAN has been executed as part of the GitHub CI:
https://github.com/grivet/ovs/actions/runs/1795624401
It is however not too relevant, as no offloads are inserted during CI.

The following patch was used to fix an unrelated CI issue:
https://patchwork.ozlabs.org/project/openvswitch/patch/20220204150445.1481457-1-i.maxim...@ovn.org/

I also ran datapath creation + deletion loop with ASAN on an offload
test setup, but the execution was excruciatingly slow and could
not progress much. It reached datapath deletion without panicking
and no crash was seen, even though I had to interrupt the test after
a few hours.

Gaetan Rivet (2):
  dpif-netdev: Move port flush after datapath reconfiguration
  dpif-netdev: Use dp_netdev reference in offload threads

Sriharsha Basavapatna (1):
  dpif-netdev: Fix a race condition in deletion of offloaded flows

 lib/dpif-netdev.c | 71 ++-
 1 file changed, 40 insertions(+), 31 deletions(-)

--
2.31.1



[ovs-dev] [PATCH v5 27/27] netdev-dpdk: Remove rte-flow API access locks

2021-09-08 Thread Gaetan Rivet
The rte_flow DPDK API was made thread-safe [1] in release 20.11.
Now that the DPDK offload provider in OVS is thread safe, remove the
locks.

[1]: http://mails.dpdk.org/archives/dev/2020-October/184251.html

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-dpdk.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 45a96b9be..65f4ef086 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -5247,9 +5247,7 @@ netdev_dpdk_rte_flow_destroy(struct netdev *netdev,
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 int ret;
 
-ovs_mutex_lock(&dev->mutex);
 ret = rte_flow_destroy(dev->port_id, rte_flow, error);
-ovs_mutex_unlock(&dev->mutex);
 return ret;
 }
 
@@ -5263,9 +5261,7 @@ netdev_dpdk_rte_flow_create(struct netdev *netdev,
 struct rte_flow *flow;
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 
-ovs_mutex_lock(&dev->mutex);
 flow = rte_flow_create(dev->port_id, attr, items, actions, error);
-ovs_mutex_unlock(&dev->mutex);
 return flow;
 }
 
@@ -5293,9 +5289,7 @@ netdev_dpdk_rte_flow_query_count(struct netdev *netdev,
 }
 
 dev = netdev_dpdk_cast(netdev);
-ovs_mutex_lock(&dev->mutex);
 ret = rte_flow_query(dev->port_id, rte_flow, actions, query, error);
-ovs_mutex_unlock(&dev->mutex);
 return ret;
 }
 
-- 
2.31.1



[ovs-dev] [PATCH v5 25/27] dpif-netdev: Replace port mutex by rwlock

2021-09-08 Thread Gaetan Rivet
The port mutex protects the netdev mapping, that can be changed by port
addition or port deletion. HW offloads operations can be considered read
operations on the port mapping itself. Use a rwlock to differentiate
between read and write operations, allowing concurrent queries and
offload insertions.

Because offload queries, deletions, and reconfigure_datapath() calls all
take the rdlock, the deadlock fixed by [1] is still avoided, as the rdlock
side is recursive as prescribed by the POSIX standard. Executing
'reconfigure_datapath()' only requires the rdlock to be taken, but it is
sometimes executed in contexts where the wrlock is taken ('do_add_port()'
and 'do_del_port()').

This means that the deadlock described in [2] is still valid and should
be mitigated. The rdlock is taken using 'tryrdlock()' during offload queries,
keeping the current behavior.

[1]: 81e89d5c2645 ("dpif-netdev: Make datapath port mutex recursive.")

[2]: 12d0edd75eba ("dpif-netdev: Avoid deadlock with offloading during PMD
 thread deletion.").
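
To make the locking scheme concrete, here is a minimal sketch (not part
of the patch) of how the offload query side is expected to use the new
rwlock; it only relies on the ovs_rwlock helpers named above, and the
function name is illustrative:

    /* Sketch only: the query side takes the rwlock in read mode and never
     * blocks, so a writer (port addition or deletion) is never delayed. */
    static bool
    offload_query_example(struct dp_netdev *dp)
        OVS_NO_THREAD_SAFETY_ANALYSIS
    {
        if (ovs_rwlock_tryrdlock(&dp->port_rwlock)) {
            /* A writer holds the lock: skip this query, keeping the
             * pre-existing behavior of the offload status path. */
            return false;
        }
        /* ... netdev lookups and flow queries under rdlock ... */
        ovs_rwlock_unlock(&dp->port_rwlock);
        return true;
    }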

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 143 +++---
 lib/netdev-offload-dpdk.c |   4 +-
 2 files changed, 74 insertions(+), 73 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 365726ed5..30547c0ec 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -245,7 +245,7 @@ enum sched_assignment_type {
  * Acquisition order is, from outermost to innermost:
  *
  *dp_netdev_mutex (global)
- *port_mutex
+ *port_rwlock
  *bond_mutex
  *non_pmd_mutex
  */
@@ -258,8 +258,8 @@ struct dp_netdev {
 /* Ports.
  *
  * Any lookup into 'ports' or any access to the dp_netdev_ports found
- * through 'ports' requires taking 'port_mutex'. */
-struct ovs_mutex port_mutex;
+ * through 'ports' requires taking 'port_rwlock'. */
+struct ovs_rwlock port_rwlock;
 struct hmap ports;
 struct seq *port_seq;   /* Incremented whenever a port changes. */
 
@@ -323,7 +323,7 @@ struct dp_netdev {
 
 static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
 odp_port_t)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_RDLOCK(dp->port_rwlock);
 
 enum rxq_cycles_counter_type {
 RXQ_CYCLES_PROC_CURR,   /* Cycles spent successfully polling and
@@ -491,17 +491,17 @@ struct dpif_netdev {
 
 static int get_port_by_number(struct dp_netdev *dp, odp_port_t port_no,
   struct dp_netdev_port **portp)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_RDLOCK(dp->port_rwlock);
 static int get_port_by_name(struct dp_netdev *dp, const char *devname,
 struct dp_netdev_port **portp)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_RDLOCK(dp->port_rwlock);
 static void dp_netdev_free(struct dp_netdev *)
 OVS_REQUIRES(dp_netdev_mutex);
 static int do_add_port(struct dp_netdev *dp, const char *devname,
const char *type, odp_port_t port_no)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_WRLOCK(dp->port_rwlock);
 static void do_del_port(struct dp_netdev *dp, struct dp_netdev_port *)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_WRLOCK(dp->port_rwlock);
 static int dpif_netdev_open(const struct dpif_class *, const char *name,
 bool create, struct dpif **);
 static void dp_netdev_execute_actions(struct dp_netdev_pmd_thread *pmd,
@@ -520,7 +520,7 @@ static void dp_netdev_configure_pmd(struct 
dp_netdev_pmd_thread *pmd,
 int numa_id);
 static void dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_set_nonpmd(struct dp_netdev *dp)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_WRLOCK(dp->port_rwlock);
 
 static void *pmd_thread_main(void *);
 static struct dp_netdev_pmd_thread *dp_netdev_get_pmd(struct dp_netdev *dp,
@@ -557,7 +557,7 @@ static void dp_netdev_offload_flush(struct dp_netdev *dp,
 struct dp_netdev_port *port);
 
 static void reconfigure_datapath(struct dp_netdev *dp)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_RDLOCK(dp->port_rwlock);
 static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
@@ -1003,7 +1003,7 @@ dpif_netdev_subtable_lookup_set(struct unixctl_conn 
*conn, int argc OVS_UNUSED,
 sorted_poll_thread_list(dp, &pmd_list, &n);
 
 /* take port mutex as HMAP iters over them. */
-ovs_mutex_lock(&dp->port_mutex);
+ovs_rwlock_rdlock(&dp->port_rwlock);
 
 for (size_t i = 0; i 

[ovs-dev] [PATCH v5 21/27] netdev-offload-dpdk: Lock rte_flow map access

2021-09-08 Thread Gaetan Rivet
Add a lock to access the ufid to rte_flow map.  This will protect it
from concurrent write accesses when multiple threads attempt it.

At this point, the reason for taking the lock is no longer to fulfill the
needs of the DPDK offload implementation. Rewrite the comments
to reflect this change. The lock is still needed to protect against
changes to netdev port mapping.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c |  8 ++---
 lib/netdev-offload-dpdk.c | 61 ---
 2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 381c959af..bf5785981 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2521,7 +2521,7 @@ mark_to_flow_disassociate(struct dp_netdev_pmd_thread 
*pmd,
 port = netdev_ports_get(in_port, dpif_type_str);
 if (port) {
 /* Taking a global 'port_mutex' to fulfill thread safety
- * restrictions for the netdev-offload-dpdk module. */
+ * restrictions regarding netdev port mapping. */
 ovs_mutex_lock(&pmd->dp->port_mutex);
 ret = netdev_flow_del(port, &flow->mega_ufid, NULL);
 ovs_mutex_unlock(&pmd->dp->port_mutex);
@@ -2690,8 +2690,8 @@ dp_netdev_flow_offload_put(struct dp_offload_flow_item 
*offload)
 goto err_free;
 }
 
-/* Taking a global 'port_mutex' to fulfill thread safety restrictions for
- * the netdev-offload-dpdk module. */
+/* Taking a global 'port_mutex' to fulfill thread safety
+ * restrictions regarding the netdev port mapping. */
 ovs_mutex_lock(&pmd->dp->port_mutex);
 ret = netdev_flow_put(port, &offload->match,
   CONST_CAST(struct nlattr *, offload->actions),
@@ -3533,7 +3533,7 @@ dpif_netdev_get_flow_offload_status(const struct 
dp_netdev *dp,
 }
 ofpbuf_use_stack(&buf, &act_buf, sizeof act_buf);
 /* Taking a global 'port_mutex' to fulfill thread safety
- * restrictions for the netdev-offload-dpdk module.
+ * restrictions regarding netdev port mapping.
  *
  * XXX: Main thread will try to pause/stop all revalidators during datapath
  *  reconfiguration via datapath purge callback (dp_purge_cb) while
diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index 8ed869eb3..e76c50b72 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -40,9 +40,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(600, 
600);
  *
  * Below API is NOT thread safe in following terms:
  *
- *  - The caller must be sure that none of these functions will be called
- *simultaneously.  Even for different 'netdev's.
- *
  *  - The caller must be sure that 'netdev' will not be destructed/deallocated.
  *
  *  - The caller must be sure that 'netdev' configuration will not be changed.
@@ -69,6 +66,7 @@ struct ufid_to_rte_flow_data {
 struct netdev_offload_dpdk_data {
 struct cmap ufid_to_rte_flow;
 uint64_t *rte_flow_counters;
+struct ovs_mutex map_lock;
 };
 
 static int
@@ -77,6 +75,7 @@ offload_data_init(struct netdev *netdev)
 struct netdev_offload_dpdk_data *data;
 
 data = xzalloc(sizeof *data);
+ovs_mutex_init(&data->map_lock);
 cmap_init(&data->ufid_to_rte_flow);
 data->rte_flow_counters = xcalloc(netdev_offload_thread_nb(),
   sizeof *data->rte_flow_counters);
@@ -89,6 +88,7 @@ offload_data_init(struct netdev *netdev)
 static void
 offload_data_destroy__(struct netdev_offload_dpdk_data *data)
 {
+ovs_mutex_destroy(&data->map_lock);
 free(data->rte_flow_counters);
 free(data);
 }
@@ -120,6 +120,34 @@ offload_data_destroy(struct netdev *netdev)
 ovsrcu_set(&netdev->hw_info.offload_data, NULL);
 }
 
+static void
+offload_data_lock(struct netdev *netdev)
+OVS_NO_THREAD_SAFETY_ANALYSIS
+{
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+if (!data) {
+return;
+}
+ovs_mutex_lock(&data->map_lock);
+}
+
+static void
+offload_data_unlock(struct netdev *netdev)
+OVS_NO_THREAD_SAFETY_ANALYSIS
+{
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+if (!data) {
+return;
+}
+ovs_mutex_unlock(&data->map_lock);
+}
+
 static struct cmap *
 offload_data_map(struct netdev *netdev)
 {
@@ -158,6 +186,24 @@ ufid_to_rte_flow_data_find(struct netdev *netdev,
 return NULL;
 }
 
+/* Find rte_flow with @ufid, lock-protected. */
+static struct ufid_to_rte_flow_data *
+ufid_to_rte_flow_data_find_protected(struct 

[ovs-dev] [PATCH v5 23/27] dpif-netdev: Use lockless queue to manage offloads

2021-09-08 Thread Gaetan Rivet
The dataplane threads (PMDs) send offloading commands to a dedicated
offload management thread. The current implementation uses a lock
and benchmarks show a high contention on the queue in some cases.

With high contention, the mutex will more often lead to the locking
thread yielding to wait, using a syscall. This should be avoided in
a userland dataplane.

The mpsc-queue can be used instead. It uses fewer cycles and has
lower latency. Benchmarks show better behavior as multiple
revalidators and one or multiple PMDs write to a single queue
while another thread polls it.

One trade-off of the new scheme, however, is that the offload thread is
forced to poll the queue. Without a mutex, a cond_wait cannot be used
for signaling. The offload thread implements an exponential backoff and
sleeps in short increments when no data is available. This makes the
thread yield, at the price of some latency in handling offloads after
an inactivity period.
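
Since the diff below is cut short in this archive, here is a rough
sketch of the polling loop described above (the names follow the patch,
the exact final code may differ):

    #define DP_NETDEV_OFFLOAD_BACKOFF_MIN 1   /* millisecond */
    #define DP_NETDEV_OFFLOAD_BACKOFF_MAX 64  /* milliseconds */

    for (;;) {
        uint64_t backoff = DP_NETDEV_OFFLOAD_BACKOFF_MIN;

        /* Poll with exponential backoff while the queue is empty. */
        while (mpsc_queue_tail(queue) == NULL) {
            xnanosleep(backoff * 1E6);        /* ms -> ns */
            if (backoff < DP_NETDEV_OFFLOAD_BACKOFF_MAX) {
                backoff <<= 1;
            }
        }
        /* Pop and process every available offload item, then loop. */
    }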

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 109 --
 1 file changed, 57 insertions(+), 52 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index bf5785981..4e91926fd 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -55,6 +55,7 @@
 #include "id-pool.h"
 #include "ipf.h"
 #include "mov-avg.h"
+#include "mpsc-queue.h"
 #include "netdev.h"
 #include "netdev-offload.h"
 #include "netdev-provider.h"
@@ -366,25 +367,22 @@ union dp_offload_thread_data {
 };
 
 struct dp_offload_thread_item {
-struct ovs_list node;
+struct mpsc_queue_node node;
 enum dp_offload_type type;
 long long int timestamp;
 union dp_offload_thread_data data[0];
 };
 
 struct dp_offload_thread {
-struct ovs_mutex mutex;
-struct ovs_list list;
-uint64_t enqueued_item;
+struct mpsc_queue queue;
+atomic_uint64_t enqueued_item;
 struct mov_avg_cma cma;
 struct mov_avg_ema ema;
-pthread_cond_t cond;
 };
 
 static struct dp_offload_thread dp_offload_thread = {
-.mutex = OVS_MUTEX_INITIALIZER,
-.list  = OVS_LIST_INITIALIZER(&dp_offload_thread.list),
-.enqueued_item = 0,
+.queue = MPSC_QUEUE_INITIALIZER(&dp_offload_thread.queue),
+.enqueued_item = ATOMIC_VAR_INIT(0),
 .cma = MOV_AVG_CMA_INITIALIZER,
 .ema = MOV_AVG_EMA_INITIALIZER(100),
 };
@@ -2616,11 +2614,8 @@ dp_netdev_free_offload(struct dp_offload_thread_item 
*offload)
 static void
 dp_netdev_append_offload(struct dp_offload_thread_item *offload)
 {
-ovs_mutex_lock(&dp_offload_thread.mutex);
-ovs_list_push_back(&dp_offload_thread.list, &offload->node);
-dp_offload_thread.enqueued_item++;
-xpthread_cond_signal(&dp_offload_thread.cond);
-ovs_mutex_unlock(&dp_offload_thread.mutex);
+mpsc_queue_insert(&dp_offload_thread.queue, &offload->node);
+atomic_count_inc64(&dp_offload_thread.enqueued_item);
 }
 
 static int
@@ -2765,58 +2760,68 @@ dp_offload_flush(struct dp_offload_thread_item *item)
 ovs_barrier_block(flush->barrier);
 }
 
+#define DP_NETDEV_OFFLOAD_BACKOFF_MIN 1
+#define DP_NETDEV_OFFLOAD_BACKOFF_MAX 64
 #define DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US (10 * 1000) /* 10 ms */
 
 static void *
 dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 {
 struct dp_offload_thread_item *offload;
-struct ovs_list *list;
+struct mpsc_queue_node *node;
+struct mpsc_queue *queue;
 long long int latency_us;
 long long int next_rcu;
 long long int now;
+uint64_t backoff;
 
-next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
-for (;;) {
-ovs_mutex_lock(&dp_offload_thread.mutex);
-if (ovs_list_is_empty(&dp_offload_thread.list)) {
-ovsrcu_quiesce_start();
-ovs_mutex_cond_wait(&dp_offload_thread.cond,
-&dp_offload_thread.mutex);
-ovsrcu_quiesce_end();
-next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
-}
-list = ovs_list_pop_front(&dp_offload_thread.list);
-dp_offload_thread.enqueued_item--;
-offload = CONTAINER_OF(list, struct dp_offload_thread_item, node);
-ovs_mutex_unlock(&dp_offload_thread.mutex);
-
-switch (offload->type) {
-case DP_OFFLOAD_FLOW:
-dp_offload_flow(offload);
-break;
-case DP_OFFLOAD_FLUSH:
-dp_offload_flush(offload);
-break;
-default:
-OVS_NOT_REACHED();
+queue = &dp_offload_thread.queue;
+mpsc_queue_acquire(queue);
+
+while (true) {
+backoff = DP_NETDEV_OFFLOAD_BACKOFF_MIN;
+while (mpsc_queue_tail(queue) == NULL) {
+xnanosleep(backoff * 1E6);
+if (backoff < DP_NETDEV_OFFLOAD_BACKOFF_MAX) {
+ 

[ovs-dev] [PATCH v5 26/27] dpif-netdev: Use one or more offload threads

2021-09-08 Thread Gaetan Rivet
Read the user configuration in the netdev-offload module to modify the
number of threads used to manage hardware offload requests.

This allows processing insertion, deletion and modification
concurrently.

The offload thread structure was modified to contain all needed
elements. One instance of this structure is allocated per requested
thread, and each instance is used separately.
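
As a rough illustration of the dispatching this enables (a sketch using
the helpers introduced earlier in the series, with an illustrative
function name): each request is steered to one of the per-thread queues
by hashing its mega ufid, so a given flow is always handled by the same
offload thread.

    /* Sketch: enqueue an offload item on the queue of its managing thread. */
    static void
    dp_netdev_offload_dispatch(struct dp_offload_thread_item *item,
                               const ovs_u128 *mega_ufid)
    {
        unsigned int tid = netdev_offload_ufid_to_thread_id(*mega_ufid);

        dp_netdev_offload_init();   /* Lazily start the offload threads. */
        mpsc_queue_insert(&dp_offload_threads[tid].queue, &item->node);
        atomic_count_inc64(&dp_offload_threads[tid].enqueued_item);
    }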

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 290 --
 lib/netdev-offload-dpdk.c |   7 +-
 2 files changed, 193 insertions(+), 104 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 30547c0ec..cdeb11811 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -374,25 +374,47 @@ struct dp_offload_thread_item {
 };
 
 struct dp_offload_thread {
-struct mpsc_queue queue;
-atomic_uint64_t enqueued_item;
-struct cmap megaflow_to_mark;
-struct cmap mark_to_flow;
-struct mov_avg_cma cma;
-struct mov_avg_ema ema;
+PADDED_MEMBERS(CACHE_LINE_SIZE,
+struct mpsc_queue queue;
+atomic_uint64_t enqueued_item;
+struct cmap megaflow_to_mark;
+struct cmap mark_to_flow;
+struct mov_avg_cma cma;
+struct mov_avg_ema ema;
+);
 };
+static struct dp_offload_thread *dp_offload_threads;
+static void *dp_netdev_flow_offload_main(void *arg);
 
-static struct dp_offload_thread dp_offload_thread = {
-.queue = MPSC_QUEUE_INITIALIZER(&dp_offload_thread.queue),
-.megaflow_to_mark = CMAP_INITIALIZER,
-.mark_to_flow = CMAP_INITIALIZER,
-.enqueued_item = ATOMIC_VAR_INIT(0),
-.cma = MOV_AVG_CMA_INITIALIZER,
-.ema = MOV_AVG_EMA_INITIALIZER(100),
-};
+static void
+dp_netdev_offload_init(void)
+{
+static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+unsigned int nb_offload_thread = netdev_offload_thread_nb();
+unsigned int tid;
+
+if (!ovsthread_once_start(&once)) {
+return;
+}
+
+dp_offload_threads = xcalloc(nb_offload_thread,
+ sizeof *dp_offload_threads);
 
-static struct ovsthread_once offload_thread_once
-= OVSTHREAD_ONCE_INITIALIZER;
+for (tid = 0; tid < nb_offload_thread; tid++) {
+struct dp_offload_thread *thread;
+
+thread = &dp_offload_threads[tid];
+mpsc_queue_init(&thread->queue);
+cmap_init(&thread->megaflow_to_mark);
+cmap_init(&thread->mark_to_flow);
+atomic_init(&thread->enqueued_item, 0);
+mov_avg_cma_init(&thread->cma);
+mov_avg_ema_init(&thread->ema, 100);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, thread);
+}
+
+ovsthread_once_done(&once);
+}
 
 #define XPS_TIMEOUT 50LL/* In microseconds. */
 
@@ -2409,11 +2431,12 @@ megaflow_to_mark_associate(const ovs_u128 *mega_ufid, 
uint32_t mark)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data = xzalloc(sizeof(*data));
+unsigned int tid = netdev_offload_thread_id();
 
 data->mega_ufid = *mega_ufid;
 data->mark = mark;
 
-cmap_insert(&dp_offload_thread.megaflow_to_mark,
+cmap_insert(&dp_offload_threads[tid].megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 }
 
@@ -2423,11 +2446,12 @@ megaflow_to_mark_disassociate(const ovs_u128 *mega_ufid)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 CMAP_FOR_EACH_WITH_HASH (data, node, hash,
- &dp_offload_thread.megaflow_to_mark) {
+ &dp_offload_threads[tid].megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
-cmap_remove(&dp_offload_thread.megaflow_to_mark,
+cmap_remove(&dp_offload_threads[tid].megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 ovsrcu_postpone(free, data);
 return;
@@ -2443,9 +2467,10 @@ megaflow_to_mark_find(const ovs_u128 *mega_ufid)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 CMAP_FOR_EACH_WITH_HASH (data, node, hash,
- &dp_offload_thread.megaflow_to_mark) {
+ &dp_offload_threads[tid].megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
 return data->mark;
 }
@@ -2460,9 +2485,10 @@ megaflow_to_mark_find(const ovs_u128 *mega_ufid)
 static void
 mark_to_flow_associate(const uint32_t mark, struct dp_netdev_flow *flow)
 {
+unsigned int tid = netdev_offload_thread_id();
 dp_netdev_flow_ref(flow);
 
-cmap_insert(&dp_of

[ovs-dev] [PATCH v5 24/27] dpif-netdev: Make megaflow and mark mappings thread objects

2021-09-08 Thread Gaetan Rivet
In later commits hardware offloads are managed in several threads.
Each offload is managed by a thread determined by its flow's 'mega_ufid'.

As megaflow-to-mark and mark-to-flow mappings are 1:1 and 1:N
respectively, a single mark exists for a single 'mega_ufid', and
multiple flows use the same 'mega_ufid'. Because the managing thread is
chosen using the 'mega_ufid', each mapping does not need to be
shared with other offload threads.

The mappings are kept as cmap as upcalls will sometimes query them before
enqueuing orders to the offload threads.

To prepare this change, move the mappings within the offload thread
structure.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 45 +
 1 file changed, 21 insertions(+), 24 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 4e91926fd..365726ed5 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -376,12 +376,16 @@ struct dp_offload_thread_item {
 struct dp_offload_thread {
 struct mpsc_queue queue;
 atomic_uint64_t enqueued_item;
+struct cmap megaflow_to_mark;
+struct cmap mark_to_flow;
 struct mov_avg_cma cma;
 struct mov_avg_ema ema;
 };
 
 static struct dp_offload_thread dp_offload_thread = {
 .queue = MPSC_QUEUE_INITIALIZER(&dp_offload_thread.queue),
+.megaflow_to_mark = CMAP_INITIALIZER,
+.mark_to_flow = CMAP_INITIALIZER,
 .enqueued_item = ATOMIC_VAR_INIT(0),
 .cma = MOV_AVG_CMA_INITIALIZER,
 .ema = MOV_AVG_EMA_INITIALIZER(100),
@@ -2368,32 +2372,23 @@ struct megaflow_to_mark_data {
 uint32_t mark;
 };
 
-struct flow_mark {
-struct cmap megaflow_to_mark;
-struct cmap mark_to_flow;
-struct id_fpool *pool;
-};
-
-static struct flow_mark flow_mark = {
-.megaflow_to_mark = CMAP_INITIALIZER,
-.mark_to_flow = CMAP_INITIALIZER,
-};
+static struct id_fpool *flow_mark_pool;
 
 static uint32_t
 flow_mark_alloc(void)
 {
-static struct ovsthread_once pool_init = OVSTHREAD_ONCE_INITIALIZER;
+static struct ovsthread_once init_once = OVSTHREAD_ONCE_INITIALIZER;
 unsigned int tid = netdev_offload_thread_id();
 uint32_t mark;
 
-if (ovsthread_once_start(&pool_init)) {
+if (ovsthread_once_start(&init_once)) {
 /* Haven't initiated yet, do it here */
-flow_mark.pool = id_fpool_create(netdev_offload_thread_nb(),
+flow_mark_pool = id_fpool_create(netdev_offload_thread_nb(),
  1, MAX_FLOW_MARK);
-ovsthread_once_done(&pool_init);
+ovsthread_once_done(&init_once);
 }
 
-if (id_fpool_new_id(flow_mark.pool, tid, &mark)) {
+if (id_fpool_new_id(flow_mark_pool, tid, &mark)) {
 return mark;
 }
 
@@ -2405,7 +2400,7 @@ flow_mark_free(uint32_t mark)
 {
 unsigned int tid = netdev_offload_thread_id();
 
-id_fpool_free_id(flow_mark.pool, tid, mark);
+id_fpool_free_id(flow_mark_pool, tid, mark);
 }
 
 /* associate megaflow with a mark, which is a 1:1 mapping */
@@ -2418,7 +2413,7 @@ megaflow_to_mark_associate(const ovs_u128 *mega_ufid, 
uint32_t mark)
 data->mega_ufid = *mega_ufid;
 data->mark = mark;
 
-cmap_insert(&flow_mark.megaflow_to_mark,
+cmap_insert(&dp_offload_thread.megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 }
 
@@ -2429,9 +2424,10 @@ megaflow_to_mark_disassociate(const ovs_u128 *mega_ufid)
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
 
-CMAP_FOR_EACH_WITH_HASH (data, node, hash, &flow_mark.megaflow_to_mark) {
+CMAP_FOR_EACH_WITH_HASH (data, node, hash,
+ &dp_offload_thread.megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
-cmap_remove(&flow_mark.megaflow_to_mark,
+cmap_remove(&dp_offload_thread.megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 ovsrcu_postpone(free, data);
 return;
@@ -2448,7 +2444,8 @@ megaflow_to_mark_find(const ovs_u128 *mega_ufid)
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
 
-CMAP_FOR_EACH_WITH_HASH (data, node, hash, &flow_mark.megaflow_to_mark) {
+CMAP_FOR_EACH_WITH_HASH (data, node, hash,
+ &dp_offload_thread.megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
 return data->mark;
 }
@@ -2465,7 +2462,7 @@ mark_to_flow_associate(const uint32_t mark, struct 
dp_netdev_flow *flow)
 {
 dp_netdev_flow_ref(flow);
 
-cmap_insert(&flow_mark.mark_to_flow,
+cmap_insert(&dp_offload_thread.mark_to_flow,
 CONST_CAST(struct cmap_node *, &a

[ovs-dev] [PATCH v5 20/27] netdev-offload-dpdk: Use per-thread HW offload stats

2021-09-08 Thread Gaetan Rivet
The implementation of hardware offload counters is currently meant to be
managed by a single thread. Use the offload thread pool API to manage
one counter per thread.
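
A consumer of the per-thread layout can then sum the counters to get a
total; a minimal sketch, where 'flow_get_n_flows' stands for any
implementation of the provider callback that fills one slot per offload
thread, as in this patch:

    uint64_t *n_flows = xcalloc(netdev_offload_thread_nb(), sizeof *n_flows);
    uint64_t total = 0;

    if (!flow_get_n_flows(netdev, n_flows)) {
        for (unsigned int tid = 0; tid < netdev_offload_thread_nb(); tid++) {
            total += n_flows[tid];
        }
    }
    free(n_flows);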

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-dpdk.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index 931518ab0..8ed869eb3 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -68,7 +68,7 @@ struct ufid_to_rte_flow_data {
 
 struct netdev_offload_dpdk_data {
 struct cmap ufid_to_rte_flow;
-uint64_t rte_flow_counter;
+uint64_t *rte_flow_counters;
 };
 
 static int
@@ -78,6 +78,8 @@ offload_data_init(struct netdev *netdev)
 
 data = xzalloc(sizeof *data);
 cmap_init(&data->ufid_to_rte_flow);
+data->rte_flow_counters = xcalloc(netdev_offload_thread_nb(),
+  sizeof *data->rte_flow_counters);
 
 ovsrcu_set(&netdev->hw_info.offload_data, (void *) data);
 
@@ -87,6 +89,7 @@ offload_data_init(struct netdev *netdev)
 static void
 offload_data_destroy__(struct netdev_offload_dpdk_data *data)
 {
+free(data->rte_flow_counters);
 free(data);
 }
 
@@ -732,10 +735,11 @@ netdev_offload_dpdk_flow_create(struct netdev *netdev,
 flow = netdev_dpdk_rte_flow_create(netdev, attr, items, actions, error);
 if (flow) {
 struct netdev_offload_dpdk_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 data = (struct netdev_offload_dpdk_data *)
 ovsrcu_get(void *, &netdev->hw_info.offload_data);
-data->rte_flow_counter++;
+data->rte_flow_counters[tid]++;
 
 if (!VLOG_DROP_DBG(&rl)) {
 dump_flow(&s, &s_extra, attr, flow_patterns, flow_actions);
@@ -1985,10 +1989,11 @@ netdev_offload_dpdk_flow_destroy(struct 
ufid_to_rte_flow_data *rte_flow_data)
 
 if (ret == 0) {
 struct netdev_offload_dpdk_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 data = (struct netdev_offload_dpdk_data *)
 ovsrcu_get(void *, &netdev->hw_info.offload_data);
-data->rte_flow_counter--;
+data->rte_flow_counters[tid]--;
 
 ufid_to_rte_flow_disassociate(rte_flow_data);
 VLOG_DBG_RL(&rl, "%s/%s: rte_flow 0x%"PRIxPTR
@@ -2343,6 +2348,7 @@ netdev_offload_dpdk_get_n_flows(struct netdev *netdev,
 uint64_t *n_flows)
 {
 struct netdev_offload_dpdk_data *data;
+unsigned int tid;
 
 data = (struct netdev_offload_dpdk_data *)
 ovsrcu_get(void *, &netdev->hw_info.offload_data);
@@ -2350,7 +2356,9 @@ netdev_offload_dpdk_get_n_flows(struct netdev *netdev,
 return -1;
 }
 
-*n_flows = data->rte_flow_counter;
+for (tid = 0; tid < netdev_offload_thread_nb(); tid++) {
+n_flows[tid] = data->rte_flow_counters[tid];
+}
 
 return 0;
 }
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v5 22/27] netdev-offload-dpdk: Protect concurrent offload destroy/query

2021-09-08 Thread Gaetan Rivet
The rte_flow API in DPDK is now thread safe for insertion and deletion.
It is not however safe for concurrent query while the offload is being
inserted or deleted.

Insertion is not an issue as the rte_flow handle will be published to
other threads only once it has been inserted in the hardware, so the
query will only be able to proceed once it is already available.

For the deletion path however, offload status queries can be made while
an offload is being destroyed. This would create race conditions and
use-after-free if not properly protected.

As a pre-step before removing the OVS-level locks on the rte_flow API,
mutually exclude offload query and deletion from concurrent execution.
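
Condensed, the protection is a per-entry mutex plus a 'dead' flag that
is re-checked once the lock is held; the diff below has the full
context, this is only the shape of the query side:

    /* Sketch: never block, and re-check 'dead' after locking. */
    if (data->dead || ovs_mutex_trylock(&data->lock)) {
        return -1;                 /* Being destroyed or busy: skip. */
    }
    if (data->dead) {              /* Destroy may have won the race. */
        ovs_mutex_unlock(&data->lock);
        return -1;
    }
    /* ... rte_flow_query() is safe here ... */
    ovs_mutex_unlock(&data->lock);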

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-dpdk.c | 39 ---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index e76c50b72..28cb2f96b 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -61,6 +61,8 @@ struct ufid_to_rte_flow_data {
 bool actions_offloaded;
 struct dpif_flow_stats stats;
 struct netdev *physdev;
+struct ovs_mutex lock;
+bool dead;
 };
 
 struct netdev_offload_dpdk_data {
@@ -238,6 +240,7 @@ ufid_to_rte_flow_associate(const ovs_u128 *ufid, struct 
netdev *netdev,
 data->physdev = netdev != physdev ? netdev_ref(physdev) : physdev;
 data->rte_flow = rte_flow;
 data->actions_offloaded = actions_offloaded;
+ovs_mutex_init(&data->lock);
 
 cmap_insert(map, CONST_CAST(struct cmap_node *, &data->node), hash);
 
@@ -245,8 +248,16 @@ ufid_to_rte_flow_associate(const ovs_u128 *ufid, struct 
netdev *netdev,
 return data;
 }
 
+static void
+rte_flow_data_unref(struct ufid_to_rte_flow_data *data)
+{
+ovs_mutex_destroy(&data->lock);
+free(data);
+}
+
 static inline void
 ufid_to_rte_flow_disassociate(struct ufid_to_rte_flow_data *data)
+OVS_REQUIRES(data->lock)
 {
 size_t hash = hash_bytes(&data->ufid, sizeof data->ufid, 0);
 struct cmap *map = offload_data_map(data->netdev);
@@ -263,7 +274,7 @@ ufid_to_rte_flow_disassociate(struct ufid_to_rte_flow_data 
*data)
 netdev_close(data->netdev);
 }
 netdev_close(data->physdev);
-ovsrcu_postpone(free, data);
+ovsrcu_postpone(rte_flow_data_unref, data);
 }
 
 /*
@@ -2033,6 +2044,15 @@ netdev_offload_dpdk_flow_destroy(struct 
ufid_to_rte_flow_data *rte_flow_data)
 ovs_u128 *ufid;
 int ret;
 
+ovs_mutex_lock(&rte_flow_data->lock);
+
+if (rte_flow_data->dead) {
+ovs_mutex_unlock(&rte_flow_data->lock);
+return 0;
+}
+
+rte_flow_data->dead = true;
+
 rte_flow = rte_flow_data->rte_flow;
 physdev = rte_flow_data->physdev;
 netdev = rte_flow_data->netdev;
@@ -2062,6 +2082,8 @@ netdev_offload_dpdk_flow_destroy(struct 
ufid_to_rte_flow_data *rte_flow_data)
  UUID_ARGS((struct uuid *) ufid));
 }
 
+ovs_mutex_unlock(&rte_flow_data->lock);
+
 return ret;
 }
 
@@ -2194,8 +2216,19 @@ netdev_offload_dpdk_flow_get(struct netdev *netdev,
 struct rte_flow_error error;
 int ret = 0;
 
+attrs->dp_extra_info = NULL;
+
 rte_flow_data = ufid_to_rte_flow_data_find(netdev, ufid, false);
-if (!rte_flow_data || !rte_flow_data->rte_flow) {
+if (!rte_flow_data || !rte_flow_data->rte_flow ||
+rte_flow_data->dead || ovs_mutex_trylock(&rte_flow_data->lock)) {
+return -1;
+}
+
+/* Check again whether the data is dead, as it could have been
+ * updated while the lock was not yet taken. The first check above
+ * was only to avoid unnecessary locking if possible.
+ */
+if (rte_flow_data->dead) {
 ret = -1;
 goto out;
 }
@@ -2223,7 +2256,7 @@ netdev_offload_dpdk_flow_get(struct netdev *netdev,
 }
 memcpy(stats, &rte_flow_data->stats, sizeof *stats);
 out:
-attrs->dp_extra_info = NULL;
+ovs_mutex_unlock(&rte_flow_data->lock);
 return ret;
 }
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v5 19/27] dpif-netdev: Execute flush from offload thread

2021-09-08 Thread Gaetan Rivet
When a port is deleted, its offloads must be flushed.  The operation
runs in the thread that initiated it.  Offload data is thus accessed
jointly by the port deletion thread(s) and the offload thread, which
complicates the data access model.

To simplify this model, as a pre-step toward introducing parallel
offloads, execute the flush operation in the offload thread.
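
The initiating thread pairs with the flush handler below through a
two-phase barrier.  A rough sketch of that side (the real enqueue
function appears further down but is truncated in this archive; the
function name and details here are illustrative):

    /* Sketch: the port deletion thread hands the flush over to the
     * offload thread and waits for it to complete.  Simplified: the
     * real code must keep the barrier storage valid until the offload
     * thread is fully done with it. */
    static void
    dp_netdev_offload_flush_example(struct dp_netdev *dp,
                                    struct dp_netdev_port *port)
        OVS_NO_THREAD_SAFETY_ANALYSIS
    {
        struct ovs_barrier barrier;

        ovs_barrier_init(&barrier, 2);

        ovs_mutex_unlock(&dp->port_mutex);
        dp_netdev_offload_flush_enqueue(dp, port->netdev, &barrier);
        ovs_barrier_block(&barrier);    /* Wait for the flush itself. */
        ovs_mutex_lock(&dp->port_mutex);

        /* The second block releases the offload thread only once the port
         * lock has been re-taken here, mirroring dp_offload_flush(). */
        ovs_barrier_block(&barrier);
        ovs_barrier_destroy(&barrier);
    }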

Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 126 --
 1 file changed, 122 insertions(+), 4 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index e0052a65b..381c959af 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -335,6 +335,7 @@ enum rxq_cycles_counter_type {
 
 enum dp_offload_type {
 DP_OFFLOAD_FLOW,
+DP_OFFLOAD_FLUSH,
 };
 
 enum {
@@ -353,8 +354,15 @@ struct dp_offload_flow_item {
 odp_port_t orig_in_port; /* Originating in_port for tnl flows. */
 };
 
+struct dp_offload_flush_item {
+struct dp_netdev *dp;
+struct netdev *netdev;
+struct ovs_barrier *barrier;
+};
+
 union dp_offload_thread_data {
 struct dp_offload_flow_item flow;
+struct dp_offload_flush_item flush;
 };
 
 struct dp_offload_thread_item {
@@ -543,6 +551,9 @@ static void dp_netdev_del_bond_tx_from_pmd(struct 
dp_netdev_pmd_thread *pmd,
uint32_t bond_id)
 OVS_EXCLUDED(pmd->bond_mutex);
 
+static void dp_netdev_offload_flush(struct dp_netdev *dp,
+struct dp_netdev_port *port);
+
 static void reconfigure_datapath(struct dp_netdev *dp)
 OVS_REQUIRES(dp->port_mutex);
 static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
@@ -2242,7 +2253,7 @@ static void
 do_del_port(struct dp_netdev *dp, struct dp_netdev_port *port)
 OVS_REQUIRES(dp->port_mutex)
 {
-netdev_flow_flush(port->netdev);
+dp_netdev_offload_flush(dp, port);
 netdev_uninit_flow_api(port->netdev);
 hmap_remove(&dp->ports, &port->node);
 seq_change(dp->port_seq);
@@ -2594,13 +2605,16 @@ dp_netdev_free_offload(struct dp_offload_thread_item 
*offload)
 case DP_OFFLOAD_FLOW:
 dp_netdev_free_flow_offload(offload);
 break;
+case DP_OFFLOAD_FLUSH:
+free(offload);
+break;
 default:
 OVS_NOT_REACHED();
 };
 }
 
 static void
-dp_netdev_append_flow_offload(struct dp_offload_thread_item *offload)
+dp_netdev_append_offload(struct dp_offload_thread_item *offload)
 {
 ovs_mutex_lock(&dp_offload_thread.mutex);
 ovs_list_push_back(&dp_offload_thread.list, &offload->node);
@@ -2734,6 +2748,23 @@ dp_offload_flow(struct dp_offload_thread_item *item)
  UUID_ARGS((struct uuid *) &flow_offload->flow->mega_ufid));
 }
 
+static void
+dp_offload_flush(struct dp_offload_thread_item *item)
+{
+struct dp_offload_flush_item *flush = &item->data->flush;
+
+ovs_mutex_lock(&flush->dp->port_mutex);
+netdev_flow_flush(flush->netdev);
+ovs_mutex_unlock(&flush->dp->port_mutex);
+
+ovs_barrier_block(flush->barrier);
+
+/* Allow the other thread to take again the port lock, before
+ * continuing offload operations in this thread.
+ */
+ovs_barrier_block(flush->barrier);
+}
+
 #define DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US (10 * 1000) /* 10 ms */
 
 static void *
@@ -2764,6 +2795,9 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 case DP_OFFLOAD_FLOW:
 dp_offload_flow(offload);
 break;
+case DP_OFFLOAD_FLUSH:
+dp_offload_flush(offload);
+break;
 default:
 OVS_NOT_REACHED();
 }
@@ -2801,7 +2835,7 @@ queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
 offload = dp_netdev_alloc_flow_offload(pmd, flow,
DP_NETDEV_FLOW_OFFLOAD_OP_DEL);
 offload->timestamp = pmd->ctx.now;
-dp_netdev_append_flow_offload(offload);
+dp_netdev_append_offload(offload);
 }
 
 static void
@@ -2902,7 +2936,7 @@ queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
 flow_offload->orig_in_port = orig_in_port;
 
 item->timestamp = pmd->ctx.now;
-dp_netdev_append_flow_offload(item);
+dp_netdev_append_offload(item);
 }
 
 static void
@@ -2926,6 +2960,90 @@ dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread 
*pmd,
 dp_netdev_flow_unref(flow);
 }
 
+static void
+dp_netdev_offload_flush_enqueue(struct dp_netdev *dp,
+struct netdev *netdev,
+struct ovs_barrier *barrier)
+{
+struct dp_offload_thread_item *item;
+struct dp_offload_flush_item *flush;
+
+if (ovsthread_once_start(&offload_thread_once)) {
+xpthread_cond_init(&dp_offload_thread.cond, NULL);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, NULL)

[ovs-dev] [PATCH v5 16/27] dpif-netdev: Postpone flow offload item freeing

2021-09-08 Thread Gaetan Rivet
Profiling the HW offload thread shows that freeing flow offload items
takes approximately 25% of the time. Most of this time is spent waiting
on the futex used by the libc free(), as it triggers a syscall and
reschedules the thread.

Avoid the syscall and its expensive context switch. Batch the freeing
of offload messages using RCU.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index c4672e6e5..c3d211858 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2544,14 +2544,19 @@ dp_netdev_alloc_flow_offload(struct 
dp_netdev_pmd_thread *pmd,
 return offload;
 }
 
+static void
+dp_netdev_free_flow_offload__(struct dp_offload_thread_item *offload)
+{
+free(offload->actions);
+free(offload);
+}
+
 static void
 dp_netdev_free_flow_offload(struct dp_offload_thread_item *offload)
 {
 dp_netdev_pmd_unref(offload->pmd);
 dp_netdev_flow_unref(offload->flow);
-
-free(offload->actions);
-free(offload);
+ovsrcu_postpone(dp_netdev_free_flow_offload__, offload);
 }
 
 static void
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v5 18/27] dpif-netdev: Introduce tagged union of offload requests

2021-09-08 Thread Gaetan Rivet
Offload requests currently only support flow offloads.
As a pre-step before supporting an offload flush request,
modify the layout of an offload request item to become a tagged union.

Future offload types won't be forced to re-use the full flow offload
structure, which consumes a lot of memory.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 137 --
 1 file changed, 96 insertions(+), 41 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 24ecaa0a8..e0052a65b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -333,13 +333,17 @@ enum rxq_cycles_counter_type {
 RXQ_N_CYCLES
 };
 
+enum dp_offload_type {
+DP_OFFLOAD_FLOW,
+};
+
 enum {
 DP_NETDEV_FLOW_OFFLOAD_OP_ADD,
 DP_NETDEV_FLOW_OFFLOAD_OP_MOD,
 DP_NETDEV_FLOW_OFFLOAD_OP_DEL,
 };
 
-struct dp_offload_thread_item {
+struct dp_offload_flow_item {
 struct dp_netdev_pmd_thread *pmd;
 struct dp_netdev_flow *flow;
 int op;
@@ -347,9 +351,17 @@ struct dp_offload_thread_item {
 struct nlattr *actions;
 size_t actions_len;
 odp_port_t orig_in_port; /* Originating in_port for tnl flows. */
-long long int timestamp;
+};
+
+union dp_offload_thread_data {
+struct dp_offload_flow_item flow;
+};
 
+struct dp_offload_thread_item {
 struct ovs_list node;
+enum dp_offload_type type;
+long long int timestamp;
+union dp_offload_thread_data data[0];
 };
 
 struct dp_offload_thread {
@@ -2538,34 +2550,55 @@ dp_netdev_alloc_flow_offload(struct 
dp_netdev_pmd_thread *pmd,
  struct dp_netdev_flow *flow,
  int op)
 {
-struct dp_offload_thread_item *offload;
+struct dp_offload_thread_item *item;
+struct dp_offload_flow_item *flow_offload;
+
+item = xzalloc(sizeof *item + sizeof *flow_offload);
+flow_offload = &item->data->flow;
+
+item->type = DP_OFFLOAD_FLOW;
 
-offload = xzalloc(sizeof(*offload));
-offload->pmd = pmd;
-offload->flow = flow;
-offload->op = op;
+flow_offload->pmd = pmd;
+flow_offload->flow = flow;
+flow_offload->op = op;
 
 dp_netdev_flow_ref(flow);
 dp_netdev_pmd_try_ref(pmd);
 
-return offload;
+return item;
 }
 
 static void
 dp_netdev_free_flow_offload__(struct dp_offload_thread_item *offload)
 {
-free(offload->actions);
+struct dp_offload_flow_item *flow_offload = &offload->data->flow;
+
+free(flow_offload->actions);
 free(offload);
 }
 
 static void
 dp_netdev_free_flow_offload(struct dp_offload_thread_item *offload)
 {
-dp_netdev_pmd_unref(offload->pmd);
-dp_netdev_flow_unref(offload->flow);
+struct dp_offload_flow_item *flow_offload = &offload->data->flow;
+
+dp_netdev_pmd_unref(flow_offload->pmd);
+dp_netdev_flow_unref(flow_offload->flow);
 ovsrcu_postpone(dp_netdev_free_flow_offload__, offload);
 }
 
+static void
+dp_netdev_free_offload(struct dp_offload_thread_item *offload)
+{
+switch (offload->type) {
+case DP_OFFLOAD_FLOW:
+dp_netdev_free_flow_offload(offload);
+break;
+default:
+OVS_NOT_REACHED();
+};
+}
+
 static void
 dp_netdev_append_flow_offload(struct dp_offload_thread_item *offload)
 {
@@ -2577,7 +2610,7 @@ dp_netdev_append_flow_offload(struct 
dp_offload_thread_item *offload)
 }
 
 static int
-dp_netdev_flow_offload_del(struct dp_offload_thread_item *offload)
+dp_netdev_flow_offload_del(struct dp_offload_flow_item *offload)
 {
 return mark_to_flow_disassociate(offload->pmd, offload->flow);
 }
@@ -2594,7 +2627,7 @@ dp_netdev_flow_offload_del(struct dp_offload_thread_item 
*offload)
  * valid, thus only item 2 needed.
  */
 static int
-dp_netdev_flow_offload_put(struct dp_offload_thread_item *offload)
+dp_netdev_flow_offload_put(struct dp_offload_flow_item *offload)
 {
 struct dp_netdev_pmd_thread *pmd = offload->pmd;
 struct dp_netdev_flow *flow = offload->flow;
@@ -2672,6 +2705,35 @@ err_free:
 return -1;
 }
 
+static void
+dp_offload_flow(struct dp_offload_thread_item *item)
+{
+struct dp_offload_flow_item *flow_offload = &item->data->flow;
+const char *op;
+int ret;
+
+switch (flow_offload->op) {
+case DP_NETDEV_FLOW_OFFLOAD_OP_ADD:
+op = "add";
+ret = dp_netdev_flow_offload_put(flow_offload);
+break;
+case DP_NETDEV_FLOW_OFFLOAD_OP_MOD:
+op = "modify";
+ret = dp_netdev_flow_offload_put(flow_offload);
+break;
+case DP_NETDEV_FLOW_OFFLOAD_OP_DEL:
+op = "delete";
+ret = dp_netdev_flow_offload_del(flow_offload);
+break;
+default:
+OVS_NOT_REACHED();
+}
+
+VLOG_DBG("%s to %s netdev flow "UUID_FMT,
+ ret == 0 ? "succeed" : "failed", op,
+ UUID_ARGS

[ovs-dev] [PATCH v5 17/27] dpif-netdev: Use id-fpool for mark allocation

2021-09-08 Thread Gaetan Rivet
Use the netdev-offload multithread API to allow multiple threads to
allocate marks concurrently.

Initialize the pool only once in a multithread context by using
the ovsthread_once type.

Use the id-fpool module for faster concurrent ID allocation.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index c3d211858..24ecaa0a8 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -51,6 +51,7 @@
 #include "fat-rwlock.h"
 #include "flow.h"
 #include "hmapx.h"
+#include "id-fpool.h"
 #include "id-pool.h"
 #include "ipf.h"
 #include "mov-avg.h"
@@ -2349,7 +2350,7 @@ struct megaflow_to_mark_data {
 struct flow_mark {
 struct cmap megaflow_to_mark;
 struct cmap mark_to_flow;
-struct id_pool *pool;
+struct id_fpool *pool;
 };
 
 static struct flow_mark flow_mark = {
@@ -2360,14 +2361,18 @@ static struct flow_mark flow_mark = {
 static uint32_t
 flow_mark_alloc(void)
 {
+static struct ovsthread_once pool_init = OVSTHREAD_ONCE_INITIALIZER;
+unsigned int tid = netdev_offload_thread_id();
 uint32_t mark;
 
-if (!flow_mark.pool) {
+if (ovsthread_once_start(&pool_init)) {
 /* Haven't initiated yet, do it here */
-flow_mark.pool = id_pool_create(1, MAX_FLOW_MARK);
+flow_mark.pool = id_fpool_create(netdev_offload_thread_nb(),
+ 1, MAX_FLOW_MARK);
+ovsthread_once_done(&pool_init);
 }
 
-if (id_pool_alloc_id(flow_mark.pool, &mark)) {
+if (id_fpool_new_id(flow_mark.pool, tid, &mark)) {
 return mark;
 }
 
@@ -2377,7 +2382,9 @@ flow_mark_alloc(void)
 static void
 flow_mark_free(uint32_t mark)
 {
-id_pool_free_id(flow_mark.pool, mark);
+unsigned int tid = netdev_offload_thread_id();
+
+id_fpool_free_id(flow_mark.pool, tid, mark);
 }
 
 /* associate megaflow with a mark, which is a 1:1 mapping */
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v5 15/27] dpif-netdev: Quiesce offload thread periodically

2021-09-08 Thread Gaetan Rivet
Similar to what was done for the PMD threads [1], reduce the performance
impact of quiescing too often in the offload thread.

After each processed offload, the offload thread currently quiesces
and syncs with RCU. This synchronization can be lengthy and makes the
thread unnecessarily slow.

Instead attempt to quiesce every 10 ms at most. While the queue is
empty, the offload thread remains quiescent.

[1]: 81ac8b3b194c ("dpif-netdev: Do RCU synchronization at fixed interval
 in PMD main loop.")

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index c592f8c1d..c4672e6e5 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2660,15 +2660,20 @@ err_free:
 return -1;
 }
 
+#define DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US (10 * 1000) /* 10 ms */
+
 static void *
 dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 {
 struct dp_offload_thread_item *offload;
 struct ovs_list *list;
 long long int latency_us;
+long long int next_rcu;
+long long int now;
 const char *op;
 int ret;
 
+next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
 for (;;) {
 ovs_mutex_lock(&dp_offload_thread.mutex);
 if (ovs_list_is_empty(&dp_offload_thread.list)) {
@@ -2676,6 +2681,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 ovs_mutex_cond_wait(&dp_offload_thread.cond,
 &dp_offload_thread.mutex);
 ovsrcu_quiesce_end();
+next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
 }
 list = ovs_list_pop_front(&dp_offload_thread.list);
 dp_offload_thread.enqueued_item--;
@@ -2699,7 +2705,9 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 OVS_NOT_REACHED();
 }
 
-latency_us = time_usec() - offload->timestamp;
+now = time_usec();
+
+latency_us = now - offload->timestamp;
 mov_avg_cma_update(&dp_offload_thread.cma, latency_us);
 mov_avg_ema_update(&dp_offload_thread.ema, latency_us);
 
@@ -2707,7 +2715,12 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
  ret == 0 ? "succeed" : "failed", op,
  UUID_ARGS((struct uuid *) &offload->flow->mega_ufid));
 dp_netdev_free_flow_offload(offload);
-ovsrcu_quiesce();
+
+/* Do RCU synchronization at fixed interval. */
+if (now > next_rcu) {
+ovsrcu_quiesce();
+next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
+}
 }
 
 return NULL;
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v5 12/27] mpsc-queue: Module for lock-free message passing

2021-09-08 Thread Gaetan Rivet
Add a lockless multi-producer/single-consumer (MPSC), linked-list based,
intrusive, unbounded queue that does not require deferred memory
management.

The queue is designed for this specific MPSC setup.  A benchmark
accompanies the unit tests to measure the difference in this configuration.
A single reader thread polls the queue while N writers enqueue elements
as fast as possible.  The mpsc-queue is compared against the regular ovs-list
as well as the guarded list.  The latter usually offers a slight improvement
by batching the element removal, but the mpsc-queue is faster.

The average shown is over each producer thread's time:

   $ ./tests/ovstest test-mpsc-queue benchmark 300 1
   Benchmarking n=300 on 1 + 1 threads.
    type\thread:  Reader      1    Avg
     mpsc-queue:     167    167    167 ms
     list(spin):      89     80     80 ms
    list(mutex):     745    745    745 ms
   guarded list:     788    788    788 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 2
   Benchmarking n=300 on 1 + 2 threads.
    type\thread:  Reader      1      2    Avg
     mpsc-queue:      98     97     94     95 ms
     list(spin):     185    171    173    172 ms
    list(mutex):     203    199    203    201 ms
   guarded list:     269    269    188    228 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 3
   Benchmarking n=300 on 1 + 3 threads.
    type\thread:  Reader      1      2      3    Avg
     mpsc-queue:      76     76     65     76     72 ms
     list(spin):     246    110    240    238    196 ms
    list(mutex):     542    541    541    539    540 ms
   guarded list:     535    535    507    511    517 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 4
   Benchmarking n=300 on 1 + 4 threads.
    type\thread:  Reader      1      2      3      4    Avg
     mpsc-queue:      73     68     68     68     68     68 ms
     list(spin):     294    275    279    277    282    278 ms
    list(mutex):     346    309    287    345    302    310 ms
   guarded list:     378    319    334    378    351    345 ms
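
For reference, a minimal usage sketch of the API added by this patch
(producer and consumer sides).  It only uses calls visible in this
series, except mpsc_queue_pop() and mpsc_queue_release(), which are
assumed here; see mpsc-queue.h in the patch for the exact consumer
interface:

    static struct mpsc_queue queue = MPSC_QUEUE_INITIALIZER(&queue);

    struct element {
        struct mpsc_queue_node node;
        int data;
    };

    /* Producer side: any thread, one atomic exchange per insertion. */
    void
    element_enqueue(struct element *e)
    {
        mpsc_queue_insert(&queue, &e->node);
    }

    /* Consumer side: a single thread owns the queue while polling. */
    void
    element_consume_all(void)
    {
        struct mpsc_queue_node *node;

        mpsc_queue_acquire(&queue);
        while ((node = mpsc_queue_pop(&queue)) != NULL) {  /* assumed API */
            struct element *e = CONTAINER_OF(node, struct element, node);
            /* ... process 'e' ... */
        }
        mpsc_queue_release(&queue);
    }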

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/automake.mk |   2 +
 lib/mpsc-queue.c| 251 +
 lib/mpsc-queue.h| 190 ++
 tests/automake.mk   |   1 +
 tests/library.at|   5 +
 tests/test-mpsc-queue.c | 772 
 6 files changed, 1221 insertions(+)
 create mode 100644 lib/mpsc-queue.c
 create mode 100644 lib/mpsc-queue.h
 create mode 100644 tests/test-mpsc-queue.c

diff --git a/lib/automake.mk b/lib/automake.mk
index 804c8da6f..098337078 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -180,6 +180,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/memory.h \
lib/meta-flow.c \
lib/mov-avg.h \
+   lib/mpsc-queue.c \
+   lib/mpsc-queue.h \
lib/multipath.c \
lib/multipath.h \
lib/namemap.c \
diff --git a/lib/mpsc-queue.c b/lib/mpsc-queue.c
new file mode 100644
index 0..4e99c94f7
--- /dev/null
+++ b/lib/mpsc-queue.c
@@ -0,0 +1,251 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include "ovs-atomic.h"
+
+#include "mpsc-queue.h"
+
+/* Multi-producer, single-consumer queue
+ * =
+ *
+ * This an implementation of the MPSC queue described by Dmitri Vyukov [1].
+ *
+ * One atomic exchange operation is done per insertion.  Removal in most cases
+ * will not require atomic operation and will use one atomic exchange to close
+ * the queue chain.
+ *
+ * Insertion
+ * =
+ *
+ * The queue is implemented using a linked-list.  Insertion is done at the
+ * back of the queue, by swapping the current end with the new node atomically,
+ * then pointing the previous end toward the new node.  To follow Vyukov
+ * nomenclature, the end-node of the chain is called head.  A producer will
+ * only manipulate the head.
+ *
+ * The head swap is atomic, however the link from the previous head to the new
+ * one is done in a separate operation.  This means that the chain is
+ * momentarily broken, when the previous head still points to NULL and the
+ * current head has been inserted.
+ *
+ * Considering a series of insertions, the queue state will remain consistent
+ * and the insertions order is compatible with their preced

[ovs-dev] [PATCH v5 13/27] id-fpool: Module for fast ID generation

2021-09-08 Thread Gaetan Rivet
The current id-pool module is slow to allocate the
next valid ID, and can be optimized if some
properties of the pool are restricted.

Those restrictions are:

  * No ability to add a random ID to the pool.

  * A new ID is no longer the smallest possible ID.
    It is however guaranteed to be in the range of

       [floor, last_alloc + nb_user * cache_size + 1].

    where 'cache_size' is the number of IDs in each per-user
    cache.  It is defined by 'ID_FPOOL_CACHE_SIZE' as 64.

  * A user should never free an ID that is not allocated.
    No checks are done and doing so will duplicate the spurious
    ID.  Refcounting or another memory management scheme should
    be used to ensure an object and its ID are only freed once.

This allocator is designed to scale reasonably well in a multithread
setup.  As it is aimed at being a faster replacement for the current
id-pool, a benchmark has been implemented alongside unit tests.

The benchmark is composed of 4 rounds: 'new', 'del', 'mix', and 'rnd'.
Respectively

  + 'new': only allocate IDs
  + 'del': only free IDs
  + 'mix': allocate, sequential free, then allocate ID.
  + 'rnd': allocate, random free, allocate ID.

Randomized freeing is done by swapping the latest allocated ID with any
ID from the range of currently allocated IDs, which is reminiscent of the
Fisher-Yates shuffle.  This evaluates freeing non-sequential IDs,
which is the more natural use case.

For this specific round, the id-pool performance is such that a timeout
of 10 seconds is added to the benchmark:

   $ ./tests/ovstest test-id-fpool benchmark 1 1
   Benchmarking n=1 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:       1      1 ms
   id-fpool del:       1      1 ms
   id-fpool mix:       2      2 ms
   id-fpool rnd:       2      2 ms
    id-pool new:       4      4 ms
    id-pool del:       2      2 ms
    id-pool mix:       6      6 ms
    id-pool rnd:     431    431 ms

   $ ./tests/ovstest test-id-fpool benchmark 10 1
   Benchmarking n=10 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:       2      2 ms
   id-fpool del:       2      2 ms
   id-fpool mix:       3      3 ms
   id-fpool rnd:       4      4 ms
    id-pool new:      12     12 ms
    id-pool del:       5      5 ms
    id-pool mix:      16     16 ms
    id-pool rnd:      1+     -1 ms

   $ ./tests/ovstest test-id-fpool benchmark 100 1
   Benchmarking n=100 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:      15     15 ms
   id-fpool del:      12     12 ms
   id-fpool mix:      34     34 ms
   id-fpool rnd:      48     48 ms
    id-pool new:     276    276 ms
    id-pool del:     286    286 ms
    id-pool mix:     448    448 ms
    id-pool rnd:      1+     -1 ms

Running only a performance test on the fast pool:

   $ ./tests/ovstest test-id-fpool perf 100 1
   Benchmarking n=100 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:      15     15 ms
   id-fpool del:      12     12 ms
   id-fpool mix:      34     34 ms
   id-fpool rnd:      47     47 ms

   $ ./tests/ovstest test-id-fpool perf 100 2
   Benchmarking n=100 on 2 threads.
    type\thread:       1      2    Avg
   id-fpool new:      11     11     11 ms
   id-fpool del:      10     10     10 ms
   id-fpool mix:      24     24     24 ms
   id-fpool rnd:      30     30     30 ms

   $ ./tests/ovstest test-id-fpool perf 100 4
   Benchmarking n=100 on 4 threads.
    type\thread:       1      2      3      4    Avg
   id-fpool new:       9     11     11     10     10 ms
   id-fpool del:       5      6      6      5      5 ms
   id-fpool mix:      16     16     16     16     16 ms
   id-fpool rnd:      20     20     20     20     20 ms
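
A minimal usage sketch of the new allocator, following the calling
convention used later in the series (one pool shared by several users,
each passing its own user index); id_fpool_destroy() is assumed here
for the cleanup:

    #define N_USERS 4
    #define N_IDS   1000000

    /* One pool shared by N_USERS threads; the per-user index lets most
     * allocations hit a per-user cache without contention. */
    struct id_fpool *pool = id_fpool_create(N_USERS, 1, N_IDS);
    unsigned int uid = 0;           /* This thread's user index. */
    uint32_t id;

    if (id_fpool_new_id(pool, uid, &id)) {
        /* ... use 'id' ... */
        id_fpool_free_id(pool, uid, id);
    }

    id_fpool_destroy(pool);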

Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/automake.mk   |   2 +
 lib/id-fpool.c| 279 +++
 lib/id-fpool.h|  66 +
 tests/automake.mk |   1 +
 tests/library.at  |   4 +
 tests/test-id-fpool.c | 615 ++
 6 files changed, 967 insertions(+)
 create mode 100644 lib/id-fpool.c
 create mode 100644 lib/id-fpool.h
 create mode 100644 tests/test-id-fpool.c

diff --git a/lib/automake.mk b/lib/automake.mk
index 098337078..ec1306b49 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -151,6 +151,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/hmap.c \
lib/hmapx.c \
lib/hmapx.h \
+   lib/id-fpool.c \
+   lib/id-fpool.h \
lib/id-pool.c \
lib/id-pool.h \
lib/if-notifier-manual.c \
diff --git a/lib/id-fpool.c b/lib/id-fpool.c
new file mode 100644
index 0..15cef5d00
--- /dev/null
+++ b/lib/id-fpool.c
@@ -0,0 +1,279 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance wit

[ovs-dev] [PATCH v5 11/27] ovs-atomic: Expose atomic exchange operation

2021-09-08 Thread Gaetan Rivet
The atomic exchange operation is a useful primitive that should be
available as well.  Most compilers already expose or offer a way
to use it, but a single symbol needs to be defined.
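
As a small illustration (sketch only) of why the primitive matters for
the following patches: the MPSC queue insertion swaps the end of the
chain and retrieves the previous end in a single atomic step, then
links it.

    struct mpsc_node {
        ATOMIC(struct mpsc_node *) next;
    };

    /* Simplified: the real queue keeps a stub node, so 'prev' below is
     * never NULL. */
    static ATOMIC(struct mpsc_node *) chain_end;

    void
    chain_insert(struct mpsc_node *node)
    {
        struct mpsc_node *prev;

        atomic_store_relaxed(&node->next, NULL);
        /* Swap the end of the chain; no intermediate state is visible. */
        prev = atomic_exchange(&chain_end, node);
        atomic_store_relaxed(&prev->next, node);
    }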

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/ovs-atomic-c++.h  |  3 +++
 lib/ovs-atomic-clang.h|  5 +
 lib/ovs-atomic-gcc4+.h|  5 +
 lib/ovs-atomic-gcc4.7+.h  |  5 +
 lib/ovs-atomic-i586.h |  5 +
 lib/ovs-atomic-locked.h   |  9 +
 lib/ovs-atomic-msvc.h | 22 ++
 lib/ovs-atomic-pthreads.h |  5 +
 lib/ovs-atomic-x86_64.h   |  5 +
 lib/ovs-atomic.h  |  8 +++-
 10 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/lib/ovs-atomic-c++.h b/lib/ovs-atomic-c++.h
index d47b8dd39..8605fa9d3 100644
--- a/lib/ovs-atomic-c++.h
+++ b/lib/ovs-atomic-c++.h
@@ -29,6 +29,9 @@ using std::atomic_compare_exchange_strong_explicit;
 using std::atomic_compare_exchange_weak;
 using std::atomic_compare_exchange_weak_explicit;
 
+using std::atomic_exchange;
+using std::atomic_exchange_explicit;
+
 #define atomic_read(SRC, DST) \
 atomic_read_explicit(SRC, DST, memory_order_seq_cst)
 #define atomic_read_explicit(SRC, DST, ORDER)   \
diff --git a/lib/ovs-atomic-clang.h b/lib/ovs-atomic-clang.h
index 34cc2faa7..cdf02a512 100644
--- a/lib/ovs-atomic-clang.h
+++ b/lib/ovs-atomic-clang.h
@@ -67,6 +67,11 @@ typedef enum {
 #define atomic_compare_exchange_weak_explicit(DST, EXP, SRC, ORD1, ORD2) \
 __c11_atomic_compare_exchange_weak(DST, EXP, SRC, ORD1, ORD2)
 
+#define atomic_exchange(RMW, ARG) \
+atomic_exchange_explicit(RMW, ARG, memory_order_seq_cst)
+#define atomic_exchange_explicit(RMW, ARG, ORDER) \
+__c11_atomic_exchange(RMW, ARG, ORDER)
+
 #define atomic_add(RMW, ARG, ORIG) \
 atomic_add_explicit(RMW, ARG, ORIG, memory_order_seq_cst)
 #define atomic_sub(RMW, ARG, ORIG) \
diff --git a/lib/ovs-atomic-gcc4+.h b/lib/ovs-atomic-gcc4+.h
index 25bcf20a0..f9accde1a 100644
--- a/lib/ovs-atomic-gcc4+.h
+++ b/lib/ovs-atomic-gcc4+.h
@@ -128,6 +128,11 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit   \
 atomic_compare_exchange_strong_explicit
 
+#define atomic_exchange_explicit(DST, SRC, ORDER) \
+__sync_lock_test_and_set(DST, SRC)
+#define atomic_exchange(DST, SRC) \
+atomic_exchange_explicit(DST, SRC, memory_order_seq_cst)
+
 #define atomic_op__(RMW, OP, ARG, ORIG) \
 ({  \
 typeof(RMW) rmw__ = (RMW);  \
diff --git a/lib/ovs-atomic-gcc4.7+.h b/lib/ovs-atomic-gcc4.7+.h
index 4c197ebe0..846e05775 100644
--- a/lib/ovs-atomic-gcc4.7+.h
+++ b/lib/ovs-atomic-gcc4.7+.h
@@ -61,6 +61,11 @@ typedef enum {
 #define atomic_compare_exchange_weak_explicit(DST, EXP, SRC, ORD1, ORD2) \
 __atomic_compare_exchange_n(DST, EXP, SRC, true, ORD1, ORD2)
 
+#define atomic_exchange_explicit(DST, SRC, ORDER) \
+__atomic_exchange_n(DST, SRC, ORDER)
+#define atomic_exchange(DST, SRC) \
+atomic_exchange_explicit(DST, SRC, memory_order_seq_cst)
+
 #define atomic_add(RMW, OPERAND, ORIG) \
 atomic_add_explicit(RMW, OPERAND, ORIG, memory_order_seq_cst)
 #define atomic_sub(RMW, OPERAND, ORIG) \
diff --git a/lib/ovs-atomic-i586.h b/lib/ovs-atomic-i586.h
index 9a385ce84..35a0959ff 100644
--- a/lib/ovs-atomic-i586.h
+++ b/lib/ovs-atomic-i586.h
@@ -400,6 +400,11 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit   \
 atomic_compare_exchange_strong_explicit
 
+#define atomic_exchange_explicit(RMW, ARG, ORDER) \
+atomic_exchange__(RMW, ARG, ORDER)
+#define atomic_exchange(RMW, ARG) \
+atomic_exchange_explicit(RMW, ARG, memory_order_seq_cst)
+
 #define atomic_add__(RMW, ARG, CLOB)\
 asm volatile("lock; xadd %0,%1 ; "  \
  "# atomic_add__ "  \
diff --git a/lib/ovs-atomic-locked.h b/lib/ovs-atomic-locked.h
index f8f0ba2a5..bf38c4a43 100644
--- a/lib/ovs-atomic-locked.h
+++ b/lib/ovs-atomic-locked.h
@@ -31,6 +31,15 @@ void atomic_unlock__(void *);
  atomic_unlock__(DST),  \
  false)))
 
+#define atomic_exchange_locked(DST, SRC) \
+({   \
+atomic_lock__(DST);  \
+typeof(*(DST)) __tmp = *(DST);   \
+*(DST) = SRC;\
+atomic_unlock__(DST);\
+__tmp;   \
+})
+
 #define atomic_op_locked_add +=
 #define atomic_op_locked_sub -=
 #define atomic_op_locked_or  |=
diff --git a/lib/ovs-atomic-msvc.h b/lib/ovs-atomic-msvc.h
index 9def887d3..ef8310269 100644
--- a/lib/ovs-atomic-msvc.h
+++ b/lib/ovs-atomic-msvc.h
@@ -345,6 +345,28 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit \
 atomic_compare_exch

[ovs-dev] [PATCH v5 14/27] netdev-offload: Add multi-thread API

2021-09-08 Thread Gaetan Rivet
Expose functions reporting user configuration of offloading threads, as
well as utility functions for multithreading.

This only exposes the configuration knob to the user; no datapath
implements the multiple-thread request yet.

This allows relevant layers to use this API for offload thread
management before the actual dataplane implementation is enabled.

The offload thread ID is lazily allocated and may therefore not match
the offload thread start order.

The RCU thread will sometimes access hardware-offload objects from
a provider for reclamation purposes.  In such cases, it gets
a default offload thread ID of 0. Care must be taken that using
this thread ID is safe concurrently with the offload threads.
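
As an illustration of how a dispatching layer might consume this API, a
minimal sketch follows.  The queue type, push helper, and array sizing are
illustrative only; the netdev_offload_* helpers are the ones added here:

    /* Hypothetical per-thread queues, sized by the configuration knob. */
    static struct offload_queue queues[MAX_OFFLOAD_THREAD_NB];

    static void
    dispatch_offload(struct offload_queue_item *item, const ovs_u128 ufid)
    {
        /* Requests for the same ufid always hash to the same thread, so
         * per-flow ordering is preserved without extra locking. */
        unsigned int tid = netdev_offload_ufid_to_thread_id(ufid);

        offload_queue_push(&queues[tid], item);
    }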

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-provider.h |  1 +
 lib/netdev-offload.c  | 88 ++-
 lib/netdev-offload.h  | 19 
 vswitchd/vswitch.xml  | 16 +++
 4 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-offload-provider.h b/lib/netdev-offload-provider.h
index bc52a3f61..8ff2de983 100644
--- a/lib/netdev-offload-provider.h
+++ b/lib/netdev-offload-provider.h
@@ -84,6 +84,7 @@ struct netdev_flow_api {
 struct dpif_flow_stats *);
 
 /* Get the number of flows offloaded to netdev.
+ * 'n_flows' is an array of counters, one per offload thread.
  * Return 0 if successful, otherwise returns a positive errno value. */
 int (*flow_get_n_flows)(struct netdev *, uint64_t *n_flows);
 
diff --git a/lib/netdev-offload.c b/lib/netdev-offload.c
index 5ddd4d01d..fc5f815d0 100644
--- a/lib/netdev-offload.c
+++ b/lib/netdev-offload.c
@@ -60,6 +60,12 @@ VLOG_DEFINE_THIS_MODULE(netdev_offload);
 
 static bool netdev_flow_api_enabled = false;
 
+#define DEFAULT_OFFLOAD_THREAD_NB 1
+#define MAX_OFFLOAD_THREAD_NB 10
+
+static unsigned int offload_thread_nb = DEFAULT_OFFLOAD_THREAD_NB;
+DEFINE_EXTERN_PER_THREAD_DATA(netdev_offload_thread_id, OVSTHREAD_ID_UNSET);
+
 /* Protects 'netdev_flow_apis'.  */
 static struct ovs_mutex netdev_flow_api_provider_mutex = OVS_MUTEX_INITIALIZER;
 
@@ -448,6 +454,64 @@ netdev_is_flow_api_enabled(void)
 return netdev_flow_api_enabled;
 }
 
+unsigned int
+netdev_offload_thread_nb(void)
+{
+return offload_thread_nb;
+}
+
+unsigned int
+netdev_offload_ufid_to_thread_id(const ovs_u128 ufid)
+{
+uint32_t ufid_hash;
+
+if (netdev_offload_thread_nb() == 1) {
+return 0;
+}
+
+ufid_hash = hash_words64_inline(
+(const uint64_t [2]){ ufid.u64.lo,
+  ufid.u64.hi }, 2, 1);
+return ufid_hash % netdev_offload_thread_nb();
+}
+
+unsigned int
+netdev_offload_thread_init(void)
+{
+static atomic_count next_id = ATOMIC_COUNT_INIT(0);
+bool thread_is_hw_offload;
+bool thread_is_rcu;
+
+thread_is_hw_offload = !strncmp(get_subprogram_name(),
+"hw_offload", strlen("hw_offload"));
+thread_is_rcu = !strncmp(get_subprogram_name(), "urcu", strlen("urcu"));
+
+/* Panic if any other thread besides offload and RCU tries
+ * to initialize their thread ID. */
+ovs_assert(thread_is_hw_offload || thread_is_rcu);
+
+if (*netdev_offload_thread_id_get() == OVSTHREAD_ID_UNSET) {
+unsigned int id;
+
+if (thread_is_rcu) {
+/* RCU will compete with other threads for shared object access.
+ * Reclamation functions using a thread ID must be thread-safe.
+ * For that end, and because RCU must consider all potential shared
+ * objects anyway, its thread-id can be whichever, so return 0.
+ */
+id = 0;
+} else {
+/* Only the actual offload threads have their own ID. */
+id = atomic_count_inc(&next_id);
+}
+/* Panic if any offload thread is getting a spurious ID. */
+ovs_assert(id < netdev_offload_thread_nb());
+return *netdev_offload_thread_id_get() = id;
+} else {
+return *netdev_offload_thread_id_get();
+}
+}
+
 void
 netdev_ports_flow_flush(const char *dpif_type)
 {
@@ -660,7 +724,16 @@ netdev_ports_get_n_flows(const char *dpif_type, odp_port_t 
port_no,
 ovs_rwlock_rdlock(&netdev_hmap_rwlock);
 data = netdev_ports_lookup(port_no, dpif_type);
 if (data) {
-ret = netdev_flow_get_n_flows(data->netdev, n_flows);
+uint64_t thread_n_flows[MAX_OFFLOAD_THREAD_NB] = {0};
+unsigned int tid;
+
+ret = netdev_flow_get_n_flows(data->netdev, thread_n_flows);
+*n_flows = 0;
+if (!ret) {
+for (tid = 0; tid < netdev_offload_thread_nb(); tid++) {
+*n_flows += thread_n_flows[tid];
+}
+}
 }

[ovs-dev] [PATCH v5 09/27] mov-avg: Add a moving average helper structure

2021-09-08 Thread Gaetan Rivet
Add a new module offering a helper to compute the Cumulative
Moving Average (CMA) and the Exponential Moving Average (EMA)
of a series of values.

Use the new helpers to add latency metrics in dpif-netdev.
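
As a minimal usage sketch (the latency source is illustrative; the types,
initializers, and update calls are defined in the new header):

    #include "mov-avg.h"

    static struct mov_avg_cma lat_cma = MOV_AVG_CMA_INITIALIZER;
    static struct mov_avg_ema lat_ema = MOV_AVG_EMA_INITIALIZER(100);

    static void
    sample_latency(double latency_us)
    {
        mov_avg_cma_update(&lat_cma, latency_us);
        mov_avg_ema_update(&lat_ema, latency_us);
    }

    /* The means and standard deviations can then be queried when
     * reporting statistics, as dpif-netdev does later in the series. */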

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/automake.mk |   1 +
 lib/mov-avg.h   | 171 
 2 files changed, 172 insertions(+)
 create mode 100644 lib/mov-avg.h

diff --git a/lib/automake.mk b/lib/automake.mk
index 46f869a33..804c8da6f 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -179,6 +179,7 @@ lib_libopenvswitch_la_SOURCES = \
lib/memory.c \
lib/memory.h \
lib/meta-flow.c \
+   lib/mov-avg.h \
lib/multipath.c \
lib/multipath.h \
lib/namemap.c \
diff --git a/lib/mov-avg.h b/lib/mov-avg.h
new file mode 100644
index 0..36a6ceb76
--- /dev/null
+++ b/lib/mov-avg.h
@@ -0,0 +1,171 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef _MOV_AVG_H
+#define _MOV_AVG_H 1
+
+#include 
+
+/* Moving average helpers. */
+
+/* Cumulative Moving Average.
+ *
+ * Computes the arithmetic mean over a whole series of values.
+ * Online equivalent of sum(V) / len(V).
+ *
+ * As all values have equal weight, this average will
+ * be slow to show recent changes in the series.
+ *
+ */
+
+struct mov_avg_cma {
+unsigned long long int count;
+double mean;
+double sum_dsquared;
+};
+
+#define MOV_AVG_CMA_INITIALIZER \
+{ .count = 0, .mean = .0, .sum_dsquared = .0 }
+
+static inline void
+mov_avg_cma_init(struct mov_avg_cma *cma)
+{
+*cma = (struct mov_avg_cma) MOV_AVG_CMA_INITIALIZER;
+}
+
+static inline void
+mov_avg_cma_update(struct mov_avg_cma *cma, double new_val)
+{
+double new_mean;
+
+cma->count++;
+new_mean = cma->mean + (new_val - cma->mean) / cma->count;
+
+cma->sum_dsquared += (new_val - new_mean) * (new_val - cma->mean);
+cma->mean = new_mean;
+}
+
+static inline double
+mov_avg_cma(struct mov_avg_cma *cma)
+{
+return cma->mean;
+}
+
+static inline double
+mov_avg_cma_std_dev(struct mov_avg_cma *cma)
+{
+double variance = 0.0;
+
+if (cma->count > 1) {
+variance = cma->sum_dsquared / (cma->count - 1);
+}
+
+return sqrt(variance);
+}
+
+/* Exponential Moving Average.
+ *
+ * Each value in the series has an exponentially decreasing weight,
+ * the older they get the less weight they have.
+ *
+ * The smoothing factor 'alpha' must be within 0 < alpha < 1.
+ * The closer this factor is to zero, the more equal the weight between
+ * recent and older values. As it approaches one, the more recent values
+ * will have more weight.
+ *
+ * The EMA can be thought of as an estimator for the next value when measures
+ * are dependent. In this case, it can make sense to consider the mean square
+ * error of the prediction. An 'alpha' minimizing this error would be the
+ * better choice to improve the estimation.
+ *
+ * A common way to choose 'alpha' is to use the following formula:
+ *
+ *   a = 2 / (N + 1)
+ *
+ * With this 'alpha', the EMA will have the same 'center of mass' as an
+ * equivalent N-values Simple Moving Average.
+ *
+ * When using this factor, the N last values of the EMA will have a sum weight
+ * converging toward 0.8647, meaning that those values will account for 86% of
+ * the average[1].
+ *
+ * [1] https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
+ */
+
+struct mov_avg_ema {
+double alpha; /* 'Smoothing' factor. */
+double mean;
+double variance;
+bool initialized;
+};
+
+/* Choose alpha explicitly. */
+#define MOV_AVG_EMA_INITIALIZER_ALPHA(a) { \
+.initialized = false, \
+.alpha = (a), .variance = 0.0, .mean = 0.0 \
+}
+
+/* Choose alpha to consider 'N' past periods as 86% of the EMA. */
+#define MOV_AVG_EMA_INITIALIZER(n_elem) \
+MOV_AVG_EMA_INITIALIZER_ALPHA(2.0 / ((double)(n_elem) + 1.0))
+
+static inline void
+mov_avg_ema_init_alpha(struct mov_avg_ema *ema,
+   double alpha)
+{
+*ema = (struct mov_avg_ema) MOV_AVG_EMA_INITIALIZER_ALPHA(alpha);
+}
+
+static inline void
+mov_avg_ema_init(struct mov_avg_ema *ema,
+ unsigned long long int n_elem)
+{
+*ema = (struct mov_avg_ema) 

[ovs-dev] [PATCH v5 10/27] dpif-netdev: Implement hardware offloads stats query

2021-09-08 Thread Gaetan Rivet
In the netdev datapath, keep track of the enqueued offloads between
the PMDs and the offload thread.  Additionally, query each netdev
for their hardware offload counters.
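
Condensed from the diff below: the latency metric is simply the delta
between the timestamp taken at enqueue time (PMD context) and the
completion time measured in the offload thread:

    /* Enqueue side, PMD context: */
    offload->timestamp = pmd->ctx.now;

    /* Completion side, offload thread: */
    latency_us = time_usec() - offload->timestamp;
    mov_avg_cma_update(&dp_offload_thread.cma, latency_us);
    mov_avg_ema_update(&dp_offload_thread.ema, latency_us);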

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 90 ++-
 1 file changed, 89 insertions(+), 1 deletion(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 36d6a5962..c592f8c1d 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -53,6 +53,7 @@
 #include "hmapx.h"
 #include "id-pool.h"
 #include "ipf.h"
+#include "mov-avg.h"
 #include "netdev.h"
 #include "netdev-offload.h"
 #include "netdev-provider.h"
@@ -345,6 +346,7 @@ struct dp_offload_thread_item {
 struct nlattr *actions;
 size_t actions_len;
 odp_port_t orig_in_port; /* Originating in_port for tnl flows. */
+long long int timestamp;
 
 struct ovs_list node;
 };
@@ -352,12 +354,18 @@ struct dp_offload_thread_item {
 struct dp_offload_thread {
 struct ovs_mutex mutex;
 struct ovs_list list;
+uint64_t enqueued_item;
+struct mov_avg_cma cma;
+struct mov_avg_ema ema;
 pthread_cond_t cond;
 };
 
 static struct dp_offload_thread dp_offload_thread = {
 .mutex = OVS_MUTEX_INITIALIZER,
 .list  = OVS_LIST_INITIALIZER(&dp_offload_thread.list),
+.enqueued_item = 0,
+.cma = MOV_AVG_CMA_INITIALIZER,
+.ema = MOV_AVG_EMA_INITIALIZER(100),
 };
 
 static struct ovsthread_once offload_thread_once
@@ -2551,6 +2559,7 @@ dp_netdev_append_flow_offload(struct 
dp_offload_thread_item *offload)
 {
 ovs_mutex_lock(&dp_offload_thread.mutex);
 ovs_list_push_back(&dp_offload_thread.list, &offload->node);
+dp_offload_thread.enqueued_item++;
 xpthread_cond_signal(&dp_offload_thread.cond);
 ovs_mutex_unlock(&dp_offload_thread.mutex);
 }
@@ -2656,6 +2665,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 {
 struct dp_offload_thread_item *offload;
 struct ovs_list *list;
+long long int latency_us;
 const char *op;
 int ret;
 
@@ -2668,6 +2678,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 ovsrcu_quiesce_end();
 }
 list = ovs_list_pop_front(&dp_offload_thread.list);
+dp_offload_thread.enqueued_item--;
 offload = CONTAINER_OF(list, struct dp_offload_thread_item, node);
 ovs_mutex_unlock(&dp_offload_thread.mutex);
 
@@ -2688,6 +2699,10 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 OVS_NOT_REACHED();
 }
 
+latency_us = time_usec() - offload->timestamp;
+mov_avg_cma_update(&dp_offload_thread.cma, latency_us);
+mov_avg_ema_update(&dp_offload_thread.ema, latency_us);
+
 VLOG_DBG("%s to %s netdev flow "UUID_FMT,
  ret == 0 ? "succeed" : "failed", op,
  UUID_ARGS((struct uuid *) &offload->flow->mega_ufid));
@@ -2712,6 +2727,7 @@ queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
 
 offload = dp_netdev_alloc_flow_offload(pmd, flow,
DP_NETDEV_FLOW_OFFLOAD_OP_DEL);
+offload->timestamp = pmd->ctx.now;
 dp_netdev_append_flow_offload(offload);
 }
 
@@ -2805,6 +2821,7 @@ queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
 offload->actions_len = actions_len;
 offload->orig_in_port = orig_in_port;
 
+offload->timestamp = pmd->ctx.now;
 dp_netdev_append_flow_offload(offload);
 }
 
@@ -4123,6 +4140,77 @@ dpif_netdev_operate(struct dpif *dpif, struct dpif_op 
**ops, size_t n_ops,
 }
 }
 
+static int
+dpif_netdev_offload_stats_get(struct dpif *dpif,
+  struct netdev_custom_stats *stats)
+{
+enum {
+DP_NETDEV_HW_OFFLOADS_STATS_ENQUEUED,
+DP_NETDEV_HW_OFFLOADS_STATS_INSERTED,
+DP_NETDEV_HW_OFFLOADS_STATS_LAT_CMA_MEAN,
+DP_NETDEV_HW_OFFLOADS_STATS_LAT_CMA_STDDEV,
+DP_NETDEV_HW_OFFLOADS_STATS_LAT_EMA_MEAN,
+DP_NETDEV_HW_OFFLOADS_STATS_LAT_EMA_STDDEV,
+};
+const char *names[] = {
+[DP_NETDEV_HW_OFFLOADS_STATS_ENQUEUED] =
+"Enqueued offloads",
+[DP_NETDEV_HW_OFFLOADS_STATS_INSERTED] =
+"Inserted offloads",
+[DP_NETDEV_HW_OFFLOADS_STATS_LAT_CMA_MEAN] =
+"  Cumulative Average latency (us)",
+[DP_NETDEV_HW_OFFLOADS_STATS_LAT_CMA_STDDEV] =
+"   Cumulative Latency stddev (us)",
+[DP_NETDEV_HW_OFFLOADS_STATS_LAT_EMA_MEAN] =
+" Exponential Average latency (us)",
+[DP_NETDEV_HW_OFFLOADS_STATS_LAT_EMA_STDDEV] =
+"  Exponential Latency stddev (us)",
+};
+struct dp_netdev *dp = get_dp_netdev(dpif);
+struct dp

[ovs-dev] [PATCH v5 07/27] dpctl: Add function to read hardware offload statistics

2021-09-08 Thread Gaetan Rivet
Expose a function to query datapath offload statistics.
This function is separate from the current one in netdev-offload
as it exposes more detailed statistics from the datapath, instead of
only from the netdev-offload provider.

Each datapath is meant to use the custom counters as it sees fit for its
handling of hardware offloads.

Call the new API from dpctl.
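
For illustration, a datapath class only has to fill the new callback for
the command to report something.  The stub below is a sketch, not the
dpif-netdev implementation added later in the series; the counter name and
value are arbitrary:

    static int
    dpif_example_offload_stats_get(struct dpif *dpif OVS_UNUSED,
                                   struct netdev_custom_stats *stats)
    {
        stats->size = 1;
        stats->counters = xcalloc(stats->size, sizeof *stats->counters);
        ovs_strlcpy(stats->counters[0].name, "Enqueued offloads",
                    sizeof stats->counters[0].name);
        stats->counters[0].value = 0;
        return 0;
    }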

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpctl.c | 36 
 lib/dpif-netdev.c   |  1 +
 lib/dpif-netlink.c  |  1 +
 lib/dpif-provider.h |  7 +++
 lib/dpif.c  |  8 
 lib/dpif.h  |  9 +
 6 files changed, 62 insertions(+)

diff --git a/lib/dpctl.c b/lib/dpctl.c
index ef8ae7402..6ff73e2d9 100644
--- a/lib/dpctl.c
+++ b/lib/dpctl.c
@@ -1541,6 +1541,40 @@ dpctl_del_flows(int argc, const char *argv[], struct 
dpctl_params *dpctl_p)
 return error;
 }
 
+static int
+dpctl_offload_stats_show(int argc, const char *argv[],
+ struct dpctl_params *dpctl_p)
+{
+struct netdev_custom_stats stats;
+struct dpif *dpif;
+int error;
+size_t i;
+
+error = opt_dpif_open(argc, argv, dpctl_p, 2, &dpif);
+if (error) {
+return error;
+}
+
+memset(&stats, 0, sizeof(stats));
+error = dpif_offload_stats_get(dpif, &stats);
+if (error) {
+dpctl_error(dpctl_p, error, "retrieving offload statistics");
+goto close_dpif;
+}
+
+dpctl_print(dpctl_p, "HW Offload stats:\n");
+for (i = 0; i < stats.size; i++) {
+dpctl_print(dpctl_p, "   %s: %6" PRIu64 "\n",
+stats.counters[i].name, stats.counters[i].value);
+}
+
+netdev_free_custom_stats_counters(&stats);
+
+close_dpif:
+dpif_close(dpif);
+return error;
+}
+
 static int
 dpctl_help(int argc OVS_UNUSED, const char *argv[] OVS_UNUSED,
struct dpctl_params *dpctl_p)
@@ -2697,6 +2731,8 @@ static const struct dpctl_command all_commands[] = {
 { "add-flows", "[dp] file", 1, 2, dpctl_process_flows, DP_RW },
 { "mod-flows", "[dp] file", 1, 2, dpctl_process_flows, DP_RW },
 { "del-flows", "[dp] [file]", 0, 2, dpctl_del_flows, DP_RW },
+{ "offload-stats-show", "[dp]",
+  0, 1, dpctl_offload_stats_show, DP_RO },
 { "dump-conntrack", "[-m] [-s] [dp] [zone=N]",
   0, 4, dpctl_dump_conntrack, DP_RO },
 { "flush-conntrack", "[dp] [zone=N] [ct-tuple]", 0, 3,
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 42c078657..a36ff9456 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -8741,6 +8741,7 @@ const struct dpif_class dpif_netdev_class = {
 dpif_netdev_flow_dump_thread_destroy,
 dpif_netdev_flow_dump_next,
 dpif_netdev_operate,
+NULL,   /* offload_stats_get */
 NULL,   /* recv_set */
 NULL,   /* handlers_set */
 NULL,   /* number_handlers_required */
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 34fc04237..b403b062f 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -4293,6 +4293,7 @@ const struct dpif_class dpif_netlink_class = {
 dpif_netlink_flow_dump_thread_destroy,
 dpif_netlink_flow_dump_next,
 dpif_netlink_operate,
+NULL,   /* offload_stats_get */
 dpif_netlink_recv_set,
 dpif_netlink_handlers_set,
 dpif_netlink_number_handlers_required,
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index 7e11b9697..8bdda37f0 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -331,6 +331,13 @@ struct dpif_class {
 void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
 enum dpif_offload_type offload_type);
 
+/* Get hardware-offloads activity counters from a dataplane.
+ * Those counters are not offload statistics (which are accessible through
+ * netdev statistics), but a status of hardware offload management:
+ * how many offloads are currently waiting, inserted, etc. */
+int (*offload_stats_get)(struct dpif *dpif,
+ struct netdev_custom_stats *stats);
+
 /* Enables or disables receiving packets with dpif_recv() for 'dpif'.
  * Turning packet receive off and then back on is allowed to change Netlink
  * PID assignments (see ->port_get_pid()).  The client is responsible for
diff --git a/lib/dpif.c b/lib/dpif.c
index 8c4aed47b..97cca2841 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1427,6 +1427,14 @@ dpif_operate(struct dpif *dpif, struct dpif_op **ops, 
size_t n_ops,
 }
 }
 
+int dpif_offload_stats_get(struct dpif *dpif,
+   struct netdev_custom_stats *stats)
+{
+return (dpif->dpif_class->o

[ovs-dev] [PATCH v5 08/27] dpif-netdev: Rename offload thread structure

2021-09-08 Thread Gaetan Rivet
The offload management in userspace is done through a separate thread.
The naming of the structure holding the objects used for synchronization
with the dataplane is generic and nondescript.

Clarify the structure's role by renaming it.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 52 +++
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index a36ff9456..36d6a5962 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -337,7 +337,7 @@ enum {
 DP_NETDEV_FLOW_OFFLOAD_OP_DEL,
 };
 
-struct dp_flow_offload_item {
+struct dp_offload_thread_item {
 struct dp_netdev_pmd_thread *pmd;
 struct dp_netdev_flow *flow;
 int op;
@@ -349,15 +349,15 @@ struct dp_flow_offload_item {
 struct ovs_list node;
 };
 
-struct dp_flow_offload {
+struct dp_offload_thread {
 struct ovs_mutex mutex;
 struct ovs_list list;
 pthread_cond_t cond;
 };
 
-static struct dp_flow_offload dp_flow_offload = {
+static struct dp_offload_thread dp_offload_thread = {
 .mutex = OVS_MUTEX_INITIALIZER,
-.list  = OVS_LIST_INITIALIZER(&dp_flow_offload.list),
+.list  = OVS_LIST_INITIALIZER(&dp_offload_thread.list),
 };
 
 static struct ovsthread_once offload_thread_once
@@ -2518,12 +2518,12 @@ mark_to_flow_find(const struct dp_netdev_pmd_thread 
*pmd,
 return NULL;
 }
 
-static struct dp_flow_offload_item *
+static struct dp_offload_thread_item *
 dp_netdev_alloc_flow_offload(struct dp_netdev_pmd_thread *pmd,
  struct dp_netdev_flow *flow,
  int op)
 {
-struct dp_flow_offload_item *offload;
+struct dp_offload_thread_item *offload;
 
 offload = xzalloc(sizeof(*offload));
 offload->pmd = pmd;
@@ -2537,7 +2537,7 @@ dp_netdev_alloc_flow_offload(struct dp_netdev_pmd_thread 
*pmd,
 }
 
 static void
-dp_netdev_free_flow_offload(struct dp_flow_offload_item *offload)
+dp_netdev_free_flow_offload(struct dp_offload_thread_item *offload)
 {
 dp_netdev_pmd_unref(offload->pmd);
 dp_netdev_flow_unref(offload->flow);
@@ -2547,16 +2547,16 @@ dp_netdev_free_flow_offload(struct dp_flow_offload_item 
*offload)
 }
 
 static void
-dp_netdev_append_flow_offload(struct dp_flow_offload_item *offload)
+dp_netdev_append_flow_offload(struct dp_offload_thread_item *offload)
 {
-ovs_mutex_lock(&dp_flow_offload.mutex);
-ovs_list_push_back(&dp_flow_offload.list, &offload->node);
-xpthread_cond_signal(&dp_flow_offload.cond);
-ovs_mutex_unlock(&dp_flow_offload.mutex);
+ovs_mutex_lock(&dp_offload_thread.mutex);
+ovs_list_push_back(&dp_offload_thread.list, &offload->node);
+xpthread_cond_signal(&dp_offload_thread.cond);
+ovs_mutex_unlock(&dp_offload_thread.mutex);
 }
 
 static int
-dp_netdev_flow_offload_del(struct dp_flow_offload_item *offload)
+dp_netdev_flow_offload_del(struct dp_offload_thread_item *offload)
 {
 return mark_to_flow_disassociate(offload->pmd, offload->flow);
 }
@@ -2573,7 +2573,7 @@ dp_netdev_flow_offload_del(struct dp_flow_offload_item 
*offload)
  * valid, thus only item 2 needed.
  */
 static int
-dp_netdev_flow_offload_put(struct dp_flow_offload_item *offload)
+dp_netdev_flow_offload_put(struct dp_offload_thread_item *offload)
 {
 struct dp_netdev_pmd_thread *pmd = offload->pmd;
 struct dp_netdev_flow *flow = offload->flow;
@@ -2654,22 +2654,22 @@ err_free:
 static void *
 dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 {
-struct dp_flow_offload_item *offload;
+struct dp_offload_thread_item *offload;
 struct ovs_list *list;
 const char *op;
 int ret;
 
 for (;;) {
-ovs_mutex_lock(&dp_flow_offload.mutex);
-if (ovs_list_is_empty(&dp_flow_offload.list)) {
+ovs_mutex_lock(&dp_offload_thread.mutex);
+if (ovs_list_is_empty(&dp_offload_thread.list)) {
 ovsrcu_quiesce_start();
-ovs_mutex_cond_wait(&dp_flow_offload.cond,
-&dp_flow_offload.mutex);
+ovs_mutex_cond_wait(&dp_offload_thread.cond,
+&dp_offload_thread.mutex);
 ovsrcu_quiesce_end();
 }
-list = ovs_list_pop_front(&dp_flow_offload.list);
-offload = CONTAINER_OF(list, struct dp_flow_offload_item, node);
-ovs_mutex_unlock(&dp_flow_offload.mutex);
+list = ovs_list_pop_front(&dp_offload_thread.list);
+offload = CONTAINER_OF(list, struct dp_offload_thread_item, node);
+ovs_mutex_unlock(&dp_offload_thread.mutex);
 
 switch (offload->op) {
 case DP_NETDEV_FLOW_OFFLOAD_OP_ADD:
@@ -2702,10 +2702,10 @@ static void
 queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
   struct dp_netdev_flow *fl

[ovs-dev] [PATCH v5 06/27] netdev-offload-dpdk: Implement hw-offload statistics read

2021-09-08 Thread Gaetan Rivet
In the DPDK offload provider, keep track of inserted rte_flow rules and
report the count when queried.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-dpdk.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index 2d1e31ece..931518ab0 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -68,6 +68,7 @@ struct ufid_to_rte_flow_data {
 
 struct netdev_offload_dpdk_data {
 struct cmap ufid_to_rte_flow;
+uint64_t rte_flow_counter;
 };
 
 static int
@@ -730,6 +731,12 @@ netdev_offload_dpdk_flow_create(struct netdev *netdev,
 
 flow = netdev_dpdk_rte_flow_create(netdev, attr, items, actions, error);
 if (flow) {
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+data->rte_flow_counter++;
+
 if (!VLOG_DROP_DBG(&rl)) {
 dump_flow(&s, &s_extra, attr, flow_patterns, flow_actions);
 extra_str = ds_cstr(&s_extra);
@@ -1977,6 +1984,12 @@ netdev_offload_dpdk_flow_destroy(struct 
ufid_to_rte_flow_data *rte_flow_data)
 ret = netdev_dpdk_rte_flow_destroy(physdev, rte_flow, &error);
 
 if (ret == 0) {
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+data->rte_flow_counter--;
+
 ufid_to_rte_flow_disassociate(rte_flow_data);
 VLOG_DBG_RL(&rl, "%s/%s: rte_flow 0x%"PRIxPTR
 " flow destroy %d ufid " UUID_FMT,
@@ -2325,6 +2338,23 @@ close_vport_netdev:
 return ret;
 }
 
+static int
+netdev_offload_dpdk_get_n_flows(struct netdev *netdev,
+uint64_t *n_flows)
+{
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+if (!data) {
+return -1;
+}
+
+*n_flows = data->rte_flow_counter;
+
+return 0;
+}
+
 const struct netdev_flow_api netdev_offload_dpdk = {
 .type = "dpdk_flow_api",
 .flow_put = netdev_offload_dpdk_flow_put,
@@ -2334,4 +2364,5 @@ const struct netdev_flow_api netdev_offload_dpdk = {
 .flow_get = netdev_offload_dpdk_flow_get,
 .flow_flush = netdev_offload_dpdk_flow_flush,
 .hw_miss_packet_recover = netdev_offload_dpdk_hw_miss_packet_recover,
+.flow_get_n_flows = netdev_offload_dpdk_get_n_flows,
 };
-- 
2.31.1



[ovs-dev] [PATCH v5 05/27] netdev-offload-dpdk: Use per-netdev offload metadata

2021-09-08 Thread Gaetan Rivet
Add a per-netdev offload data field as part of the netdev hw_info structure.
Use this field in netdev-offload-dpdk to map offload metadata (ufid to
rte_flow). Use flow API deinit ops to destroy the per-netdev metadata
when deallocating a netdev. Use RCU primitives to ensure coherency
during port deletion.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-dpdk.c | 126 +-
 lib/netdev-offload.h  |   2 +
 2 files changed, 113 insertions(+), 15 deletions(-)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index b87a50b40..2d1e31ece 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -28,6 +28,7 @@
 #include "odp-util.h"
 #include "openvswitch/match.h"
 #include "openvswitch/vlog.h"
+#include "ovs-rcu.h"
 #include "packets.h"
 #include "uuid.h"
 
@@ -54,7 +55,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(600, 
600);
 /*
  * A mapping from ufid to dpdk rte_flow.
  */
-static struct cmap ufid_to_rte_flow = CMAP_INITIALIZER;
 
 struct ufid_to_rte_flow_data {
 struct cmap_node node;
@@ -66,14 +66,81 @@ struct ufid_to_rte_flow_data {
 struct netdev *physdev;
 };
 
+struct netdev_offload_dpdk_data {
+struct cmap ufid_to_rte_flow;
+};
+
+static int
+offload_data_init(struct netdev *netdev)
+{
+struct netdev_offload_dpdk_data *data;
+
+data = xzalloc(sizeof *data);
+cmap_init(&data->ufid_to_rte_flow);
+
+ovsrcu_set(&netdev->hw_info.offload_data, (void *) data);
+
+return 0;
+}
+
+static void
+offload_data_destroy__(struct netdev_offload_dpdk_data *data)
+{
+free(data);
+}
+
+static void
+offload_data_destroy(struct netdev *netdev)
+{
+struct netdev_offload_dpdk_data *data;
+struct ufid_to_rte_flow_data *node;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+if (data == NULL) {
+return;
+}
+
+if (!cmap_is_empty(&data->ufid_to_rte_flow)) {
+VLOG_ERR("Incomplete flush: %s contains rte_flow elements",
+ netdev_get_name(netdev));
+}
+
+CMAP_FOR_EACH (node, node, &data->ufid_to_rte_flow) {
+ovsrcu_postpone(free, node);
+}
+
+cmap_destroy(&data->ufid_to_rte_flow);
+ovsrcu_postpone(offload_data_destroy__, data);
+
+ovsrcu_set(&netdev->hw_info.offload_data, NULL);
+}
+
+static struct cmap *
+offload_data_map(struct netdev *netdev)
+{
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+
+return data ? &data->ufid_to_rte_flow : NULL;
+}
+
 /* Find rte_flow with @ufid. */
 static struct ufid_to_rte_flow_data *
-ufid_to_rte_flow_data_find(const ovs_u128 *ufid, bool warn)
+ufid_to_rte_flow_data_find(struct netdev *netdev,
+   const ovs_u128 *ufid, bool warn)
 {
 size_t hash = hash_bytes(ufid, sizeof *ufid, 0);
 struct ufid_to_rte_flow_data *data;
+struct cmap *map = offload_data_map(netdev);
+
+if (!map) {
+return NULL;
+}
 
-CMAP_FOR_EACH_WITH_HASH (data, node, hash, &ufid_to_rte_flow) {
+CMAP_FOR_EACH_WITH_HASH (data, node, hash, map) {
 if (ovs_u128_equals(*ufid, data->ufid)) {
 return data;
 }
@@ -93,8 +160,15 @@ ufid_to_rte_flow_associate(const ovs_u128 *ufid, struct 
netdev *netdev,
bool actions_offloaded)
 {
 size_t hash = hash_bytes(ufid, sizeof *ufid, 0);
-struct ufid_to_rte_flow_data *data = xzalloc(sizeof *data);
+struct cmap *map = offload_data_map(netdev);
 struct ufid_to_rte_flow_data *data_prev;
+struct ufid_to_rte_flow_data *data;
+
+if (!map) {
+return NULL;
+}
+
+data = xzalloc(sizeof *data);
 
 /*
  * We should not simply overwrite an existing rte flow.
@@ -102,7 +176,7 @@ ufid_to_rte_flow_associate(const ovs_u128 *ufid, struct 
netdev *netdev,
  * Thus, if following assert triggers, something is wrong:
  * the rte_flow is not destroyed.
  */
-data_prev = ufid_to_rte_flow_data_find(ufid, false);
+data_prev = ufid_to_rte_flow_data_find(netdev, ufid, false);
 if (data_prev) {
 ovs_assert(data_prev->rte_flow == NULL);
 }
@@ -113,8 +187,7 @@ ufid_to_rte_flow_associate(const ovs_u128 *ufid, struct 
netdev *netdev,
 data->rte_flow = rte_flow;
 data->actions_offloaded = actions_offloaded;
 
-cmap_insert(&ufid_to_rte_flow,
-CONST_CAST(struct cmap_node *, &data->node), hash);
+cmap_insert(map, CONST_CAST(struct cmap_node *, &data->node), hash);
 return data;
 }
 
@@ -122,9 +195,13 @@ static inline void
 ufid_to_rte_flow_disassociate(struct ufid_to_rte_flow_data *data)
 {
 size_t hash

[ovs-dev] [PATCH v5 04/27] netdev: Add flow API uninit function

2021-09-08 Thread Gaetan Rivet
Add a new operation for flow API providers to
uninitialize when the API is disassociated from a netdev.
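
A provider opts in by filling the new callback in its netdev_flow_api
definition.  A sketch with an imaginary provider (the DPDK provider gains
a real implementation later in the series):

    static int
    example_init_flow_api(struct netdev *netdev OVS_UNUSED)
    {
        /* Allocate per-netdev offload metadata here. */
        return 0;
    }

    static void
    example_uninit_flow_api(struct netdev *netdev OVS_UNUSED)
    {
        /* Release whatever init_flow_api() allocated. */
    }

    const struct netdev_flow_api netdev_offload_example = {
        .type = "example_flow_api",
        .init_flow_api = example_init_flow_api,
        .uninit_flow_api = example_uninit_flow_api,
    };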

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-provider.h | 3 +++
 lib/netdev-offload.c  | 4 
 2 files changed, 7 insertions(+)

diff --git a/lib/netdev-offload-provider.h b/lib/netdev-offload-provider.h
index 348ca7081..bc52a3f61 100644
--- a/lib/netdev-offload-provider.h
+++ b/lib/netdev-offload-provider.h
@@ -96,6 +96,9 @@ struct netdev_flow_api {
 /* Initializies the netdev flow api.
  * Return 0 if successful, otherwise returns a positive errno value. */
 int (*init_flow_api)(struct netdev *);
+
+/* Uninitializes the netdev flow api. */
+void (*uninit_flow_api)(struct netdev *);
 };
 
 int netdev_register_flow_api_provider(const struct netdev_flow_api *);
diff --git a/lib/netdev-offload.c b/lib/netdev-offload.c
index 8075cfbd8..5ddd4d01d 100644
--- a/lib/netdev-offload.c
+++ b/lib/netdev-offload.c
@@ -332,6 +332,10 @@ netdev_uninit_flow_api(struct netdev *netdev)
 return;
 }
 
+if (flow_api->uninit_flow_api) {
+flow_api->uninit_flow_api(netdev);
+}
+
 ovsrcu_set(&netdev->flow_api, NULL);
 rfa = netdev_lookup_flow_api(flow_api->type);
 ovs_refcount_unref(&rfa->refcnt);
-- 
2.31.1



[ovs-dev] [PATCH v5 03/27] tests: Add ovs-barrier unit test

2021-09-08 Thread Gaetan Rivet
No unit test currently exists for the ovs-barrier type.
It is however a crucial building block and should be verified to work
as expected.

Create a simple test verifying the basic function of ovs-barrier.
Integrate the test as part of the test suite.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 tests/automake.mk|   1 +
 tests/library.at |   5 +
 tests/test-barrier.c | 264 +++
 3 files changed, 270 insertions(+)
 create mode 100644 tests/test-barrier.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 43731d097..99765 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -454,6 +454,7 @@ tests_ovstest_SOURCES = \
tests/ovstest.h \
tests/test-aes128.c \
tests/test-atomic.c \
+   tests/test-barrier.c \
tests/test-bundle.c \
tests/test-byte-order.c \
tests/test-classifier.c \
diff --git a/tests/library.at b/tests/library.at
index b2914ae6c..42a5ce1aa 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -246,6 +246,11 @@ AT_SETUP([ofpbuf module])
 AT_CHECK([ovstest test-ofpbuf], [0], [])
 AT_CLEANUP
 
+AT_SETUP([barrier module])
+AT_KEYWORDS([barrier])
+AT_CHECK([ovstest test-barrier], [0], [])
+AT_CLEANUP
+
 AT_SETUP([rcu])
 AT_CHECK([ovstest test-rcu-quiesce], [0], [])
 AT_CLEANUP
diff --git a/tests/test-barrier.c b/tests/test-barrier.c
new file mode 100644
index 0..3bc5291cc
--- /dev/null
+++ b/tests/test-barrier.c
@@ -0,0 +1,264 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include 
+
+#include "ovs-thread.h"
+#include "ovs-rcu.h"
+#include "ovstest.h"
+#include "random.h"
+#include "util.h"
+
+#define DEFAULT_N_THREADS 4
+#define NB_STEPS 4
+
+static bool verbose;
+static struct ovs_barrier barrier;
+
+struct blocker_aux {
+unsigned int tid;
+bool leader;
+int step;
+};
+
+static void *
+basic_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+ovs_barrier_block(&barrier);
+aux->step++;
+ovs_barrier_block(&barrier);
+}
+
+return NULL;
+}
+
+static void
+basic_block_check(struct blocker_aux *aux, size_t n, int expected)
+{
+size_t i;
+
+for (i = 0; i < n; i++) {
+if (verbose) {
+printf("aux[%" PRIuSIZE "]=%d == %d", i, aux[i].step, expected);
+if (aux[i].step != expected) {
+printf(" <--- X");
+}
+printf("\n");
+} else {
+ovs_assert(aux[i].step == expected);
+}
+}
+ovs_barrier_block(&barrier);
+ovs_barrier_block(&barrier);
+}
+
+/*
+ * Basic barrier test.
+ *
+ * N writers and 1 reader participate in the test.
+ * Each thread goes through M steps (=NB_STEPS).
+ * The main thread participates as the reader.
+ *
+ * A Step is divided in three parts:
+ *1. before
+ *  (barrier)
+ *2. during
+ *  (barrier)
+ *3. after
+ *
+ * Each writer updates a thread-local variable with the
+ * current step number within part 2 and waits.
+ *
+ * The reader checks all variables during part 3, expecting
+ * all variables to be equal. If any variable differs, it means
+ * its thread was not properly blocked by the barrier.
+ */
+static void
+test_barrier_basic(size_t n_threads)
+{
+struct blocker_aux *aux;
+pthread_t *threads;
+size_t i;
+
+ovs_barrier_init(&barrier, n_threads + 1);
+
+aux = xcalloc(n_threads, sizeof *aux);
+threads = xmalloc(n_threads * sizeof *threads);
+for (i = 0; i < n_threads; i++) {
+threads[i] = ovs_thread_create("ovs-barrier",
+   basic_blocker_main, &aux[i]);
+}
+
+for (i = 0; i < NB_STEPS; i++) {
+basic_block_check(aux, n_threads, i);
+}
+ovs_barrier_destroy(&barrier);
+
+for (i = 0; i < n_threads; i++) {
+xpthread_join(threads[i], NULL);
+}
+
+free(threads);
+free(aux);
+}
+
+static unsigned int *shared_mem;
+
+static void *
+lead_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+if (aux->leader) 

[ovs-dev] [PATCH v5 02/27] dpif-netdev: Rename flow offload thread

2021-09-08 Thread Gaetan Rivet
ovs_strlcpy silently fails to copy the thread name if it is too long.
Rename the flow offload thread to differentiate it from the main thread.

Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index b3e57bb95..42c078657 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2706,8 +2706,7 @@ queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
 
 if (ovsthread_once_start(&offload_thread_once)) {
 xpthread_cond_init(&dp_flow_offload.cond, NULL);
-ovs_thread_create("dp_netdev_flow_offload",
-  dp_netdev_flow_offload_main, NULL);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, NULL);
 ovsthread_once_done(&offload_thread_once);
 }
 
@@ -2795,8 +2794,7 @@ queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
 
 if (ovsthread_once_start(&offload_thread_once)) {
 xpthread_cond_init(&dp_flow_offload.cond, NULL);
-ovs_thread_create("dp_netdev_flow_offload",
-  dp_netdev_flow_offload_main, NULL);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, NULL);
 ovsthread_once_done(&offload_thread_once);
 }
 
-- 
2.31.1



[ovs-dev] [PATCH v5 01/27] ovs-thread: Fix barrier use-after-free

2021-09-08 Thread Gaetan Rivet
When a thread is blocked on a barrier, there is no guarantee
regarding the moment it will resume, only that it will at some point in
the future.

One thread can resume first and proceed to destroy the barrier while
another thread has not yet awoken. When the latter finally resumes, it
will attempt a seq_read() on the barrier seq that the first thread has
already destroyed, triggering a use-after-free.
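
A sketch of the failing interleaving, assuming the pre-patch layout where
the seq is embedded directly in the barrier:

    /* Thread A (last to arrive)        Thread B (still blocked)
     *
     * ovs_barrier_block(&b)
     *   seq_change(b.seq)              woken up, but not yet scheduled
     * ovs_barrier_destroy(&b)
     *   seq_destroy(b.seq)
     *                                  seq_read(b.seq)  <-- use-after-free
     */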

Introduce an additional indirection layer within the barrier.
An internal barrier implementation holds all the necessary elements
for a thread to safely block and destroy. Whenever a barrier is
destroyed, the internal implementation remains available to any
still-blocked threads if necessary. A reference counter is used to track
threads still using the implementation.

Note that current uses of ovs-barrier are not affected: RCU and
revalidators will not destroy their barrier immediately after blocking
on it.

Fixes: d8043da7182a ("ovs-thread: Implement OVS specific barrier.")
Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/ovs-thread.c | 61 +++-
 lib/ovs-thread.h |  6 ++---
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index b686e4548..805cba622 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -299,21 +299,53 @@ ovs_spin_init(const struct ovs_spin *spin)
 }
 #endif
 
+struct ovs_barrier_impl {
+uint32_t size;/* Number of threads to wait. */
+atomic_count count;   /* Number of threads already hit the barrier. */
+struct seq *seq;
+struct ovs_refcount refcnt;
+};
+
+static void
+ovs_barrier_impl_ref(struct ovs_barrier_impl *impl)
+{
+ovs_refcount_ref(&impl->refcnt);
+}
+
+static void
+ovs_barrier_impl_unref(struct ovs_barrier_impl *impl)
+{
+if (ovs_refcount_unref(&impl->refcnt) == 1) {
+seq_destroy(impl->seq);
+free(impl);
+}
+}
+
 /* Initializes the 'barrier'.  'size' is the number of threads
  * expected to hit the barrier. */
 void
 ovs_barrier_init(struct ovs_barrier *barrier, uint32_t size)
 {
-barrier->size = size;
-atomic_count_init(&barrier->count, 0);
-barrier->seq = seq_create();
+struct ovs_barrier_impl *impl;
+
+impl = xmalloc(sizeof *impl);
+impl->size = size;
+atomic_count_init(&impl->count, 0);
+impl->seq = seq_create();
+ovs_refcount_init(&impl->refcnt);
+
+ovsrcu_set(&barrier->impl, impl);
 }
 
 /* Destroys the 'barrier'. */
 void
 ovs_barrier_destroy(struct ovs_barrier *barrier)
 {
-seq_destroy(barrier->seq);
+struct ovs_barrier_impl *impl;
+
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovsrcu_set(&barrier->impl, NULL);
+ovs_barrier_impl_unref(impl);
 }
 
 /* Makes the calling thread block on the 'barrier' until all
@@ -325,23 +357,30 @@ ovs_barrier_destroy(struct ovs_barrier *barrier)
 void
 ovs_barrier_block(struct ovs_barrier *barrier)
 {
-uint64_t seq = seq_read(barrier->seq);
+struct ovs_barrier_impl *impl;
 uint32_t orig;
+uint64_t seq;
 
-orig = atomic_count_inc(&barrier->count);
-if (orig + 1 == barrier->size) {
-atomic_count_set(&barrier->count, 0);
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovs_barrier_impl_ref(impl);
+
+seq = seq_read(impl->seq);
+orig = atomic_count_inc(&impl->count);
+if (orig + 1 == impl->size) {
+atomic_count_set(&impl->count, 0);
 /* seq_change() serves as a release barrier against the other threads,
  * so the zeroed count is visible to them as they continue. */
-seq_change(barrier->seq);
+seq_change(impl->seq);
 } else {
 /* To prevent thread from waking up by other event,
  * keeps waiting for the change of 'barrier->seq'. */
-while (seq == seq_read(barrier->seq)) {
-seq_wait(barrier->seq, seq);
+while (seq == seq_read(impl->seq)) {
+seq_wait(impl->seq, seq);
 poll_block();
 }
 }
+
+ovs_barrier_impl_unref(impl);
 }
 
 DEFINE_EXTERN_PER_THREAD_DATA(ovsthread_id, OVSTHREAD_ID_UNSET);
diff --git a/lib/ovs-thread.h b/lib/ovs-thread.h
index 7ee98bd4e..3b444ccdc 100644
--- a/lib/ovs-thread.h
+++ b/lib/ovs-thread.h
@@ -21,16 +21,16 @@
 #include 
 #include 
 #include "ovs-atomic.h"
+#include "ovs-rcu.h"
 #include "openvswitch/thread.h"
 #include "util.h"
 
 struct seq;
 
 /* Poll-block()-able barrier similar to pthread_barrier_t. */
+struct ovs_barrier_impl;
 struct ovs_barrier {
-uint32_t size;/* Number of threads to wait. */
-atomic_count count;   /* Number of threads already hit the barrier. */
-struct seq *s

[ovs-dev] [PATCH v5 00/27] dpif-netdev: Parallel offload processing

2021-09-08 Thread Gaetan Rivet
This patch series aims to improve the performance of the management
of hw-offloads in dpif-netdev. In the current version, some setup
will experience high memory usage and poor latency between a flow
decision and its execution regarding hardware offloading.

This series starts by measuring key metrics regarding both issues.
Those patches are introduced first, to allow comparing the current
status with each improvement introduced.
Both the number of offloads enqueued and inserted and the latency
from queue insertion to hardware insertion are measured. A new
command 'ovs-appctl dpctl/offload-stats-show' is introduced
to show the current measurements.

In my current performance test setup I am measuring an
average latency hovering between 1~2 seconds.
After the optimizations, it is reduced to 500~900 ms.
Finally when using multiple threads and with proper driver
support[1], it is measured in the order of 1 ms.

A few modules are introduced:

  * An ID pool with reduced capabilities, simplifying its
operations and allowing better performances in both
single and multi-thread setup.

  * A lockless queue between PMDs / revalidators and
offload thread(s). As the number of PMDs increases,
contention can be high on the shared queue.
This queue is designed to serve as a message queue
between threads.

  * A bounded lockless MPMC ring and some helpers for
calculating moving averages.

  * A moving average module for Cumulative and Exponential
moving averages.

The netdev-offload-dpdk module is made thread-safe.
Internal maps are made per-netdev, and locks are
taken over shorter critical sections within the module.

CI result: https://github.com/grivet/ovs/actions/runs/554918929

[1]: The rte_flow API was made thread-safe in the 20.11 DPDK
 release. Drivers that do not implement those operations
 concurrently are protected by a lock. Others will
 allow better concurrency, that improve the result
 of this series.

v2:

  * Improved the MPSC queue API to simplify usage.

  * Moved flush operation from initiator thread to offload
thread(s). This ensures offload metadata are shared only
among the offload thread pool.

  * Flush operation needs additional thread synchronization.
The ovs_barrier currently triggers a UAF. Add a unit-test to
validate its operations and a fix for the UAF.

CI result: https://github.com/grivet/ovs/actions/runs/741430135
   The error comes from a failure to download 'automake' on
   osx, unrelated to any change in this series.

v3:

  * Re-ordered commits so fixes are first. No conflict seen currently,
but it might prevent them if some requested changes to the series
were to move code in the same parts.

  * Modified the reduced quiescing of the offload thread to use ovsrcu_quiesce(),
and base next_rcu on the current time value (after quiescing happened,
however long it takes).

  * Added Reviewed-by tags to the relevant commits.

CI result: https://github.com/grivet/ovs/actions/runs/782655601

v4:

  * Modified the seq-pool to use batches of IDs with a spinlock
instead of lockless rings.

  * The llring structure is removed.

  * Due to the length of the changes to the structure, some
acked-by or reviewed-by were not ported to the id-fpool patch.

CI result: https://github.com/grivet/ovs/actions/runs/921095015

v5:

  * Rebase on master.
Conflicts were seen related to the vxlan-decap and pmd rebalance
series.

  * Fix typo in xchg patch spotted by Maxime Coquelin.

  * Added Reviewed-by Maxime Coquelin on 4 patches.

CI result: https://github.com/grivet/ovs/actions/runs/1212804378

Gaetan Rivet (27):
  ovs-thread: Fix barrier use-after-free
  dpif-netdev: Rename flow offload thread
  tests: Add ovs-barrier unit test
  netdev: Add flow API uninit function
  netdev-offload-dpdk: Use per-netdev offload metadata
  netdev-offload-dpdk: Implement hw-offload statistics read
  dpctl: Add function to read hardware offload statistics
  dpif-netdev: Rename offload thread structure
  mov-avg: Add a moving average helper structure
  dpif-netdev: Implement hardware offloads stats query
  ovs-atomic: Expose atomic exchange operation
  mpsc-queue: Module for lock-free message passing
  id-fpool: Module for fast ID generation
  netdev-offload: Add multi-thread API
  dpif-netdev: Quiesce offload thread periodically
  dpif-netdev: Postpone flow offload item freeing
  dpif-netdev: Use id-fpool for mark allocation
  dpif-netdev: Introduce tagged union of offload requests
  dpif-netdev: Execute flush from offload thread
  netdev-offload-dpdk: Use per-thread HW offload stats
  netdev-offload-dpdk: Lock rte_flow map access
  netdev-offload-dpdk: Protect concurrent offload destroy/query
  dpif-netdev: Use lockless queue to manage offloads
  dpif-netdev: Make megaflow and mark mappings thread objects
  dpif-netdev: Replace port mutex by rwlock
  dpif-netdev: Use one or more offload threads
  netdev-dpdk: Remove rte-f

[ovs-dev] [PATCH v3 8/8] conntrack: Use an atomic conn expiration value

2021-06-15 Thread Gaetan Rivet
A lock is taken during conn_lookup() to check whether a connection is
expired before returning it. This lock can have some contention.

Even though this lock ensures a consistent sequence of writes, it does
not imply a specific order. A ct_clean thread taking the lock first
could read a value that would be updated immediately after by a PMD
waiting on the same lock, just as well as the inverse order.

As such, the expiration time can be stale anytime it is read. In this
context, using an atomic will ensure the same guarantees for either
writes or reads, i.e. writes are consistent and reads are not undefined
behaviour. Reading an atomic is however less costly than taking and
releasing a lock.
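
Condensed writer/reader pair, assuming 'struct conn' as modified by this
patch (the timeout computation is illustrative):

    static void
    conn_set_expiration_sketch(struct conn *conn, long long int now,
                               long long int timeout_ms)
    {
        atomic_store_relaxed(&conn->expiration, now + timeout_ms);
    }

    static bool
    conn_expired_sketch(struct conn *conn, long long int now)
    {
        long long int expiration;

        /* Relaxed is enough: only the value itself must be consistent,
         * no ordering with other connection fields is implied. */
        atomic_read_relaxed(&conn->expiration, &expiration);
        return now >= expiration;
    }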

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Acked-by: William Tu 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack-tp.c  |  2 +-
 lib/conntrack.c | 27 +++
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index bb82252e8..d61ab4f36 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -144,7 +144,7 @@ struct conn {
 /* Mutable data. */
 struct ovs_mutex lock; /* Guards all mutable fields. */
 ovs_u128 label;
-long long expiration;
+atomic_llong expiration;
 uint32_t mark;
 int seq_skew;
 
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 22363e7fe..5bf2816ca 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -240,7 +240,7 @@ static void
 conn_schedule_expiration(struct conn *conn, enum ct_timeout tm, long long now,
  uint32_t tp_value)
 {
-conn->expiration = now + tp_value * 1000;
+atomic_store_relaxed(&conn->expiration, now + tp_value * 1000);
 conn->exp.tm = tm;
 ignore(atomic_flag_test_and_set(&conn->exp.reschedule));
 }
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 045710e8d..03aa21e78 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -99,6 +99,7 @@ static enum ct_update_res conn_update(struct conntrack *ct, 
struct conn *conn,
   struct dp_packet *pkt,
   struct conn_lookup_ctx *ctx,
   long long now);
+static long long int conn_expiration(const struct conn *);
 static bool conn_expired(struct conn *, long long now);
 static void set_mark(struct dp_packet *, struct conn *,
  uint32_t val, uint32_t mask);
@@ -1018,13 +1019,10 @@ un_nat_packet(struct dp_packet *pkt, const struct conn 
*conn,
 static void
 conn_seq_skew_set(struct conntrack *ct, const struct conn *conn_in,
   long long now, int seq_skew, bool seq_skew_dir)
-OVS_NO_THREAD_SAFETY_ANALYSIS
 {
 struct conn *conn;
-ovs_mutex_unlock(&conn_in->lock);
-conn_lookup(ct, &conn_in->key, now, &conn, NULL);
-ovs_mutex_lock(&conn_in->lock);
 
+conn_lookup(ct, &conn_in->key, now, &conn, NULL);
 if (conn && seq_skew) {
 conn->seq_skew = seq_skew;
 conn->seq_skew_dir = seq_skew_dir;
@@ -1596,9 +1594,7 @@ ct_sweep(struct conntrack *ct, long long now, size_t 
limit)
 continue;
 }
 
-ovs_mutex_lock(&conn->lock);
-expiration = conn->expiration;
-ovs_mutex_unlock(&conn->lock);
+expiration = conn_expiration(conn);
 
 if (conn == end_of_queue) {
 /* If we already re-enqueued this conn during this sweep,
@@ -2483,14 +2479,21 @@ conn_update(struct conntrack *ct, struct conn *conn, 
struct dp_packet *pkt,
 return update_res;
 }
 
+static long long int
+conn_expiration(const struct conn *conn)
+{
+long long int expiration;
+
+atomic_read_relaxed(&CONST_CAST(struct conn *, conn)->expiration,
+&expiration);
+return expiration;
+}
+
 static bool
 conn_expired(struct conn *conn, long long now)
 {
 if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-ovs_mutex_lock(&conn->lock);
-bool expired = now >= conn->expiration ? true : false;
-ovs_mutex_unlock(&conn->lock);
-return expired;
+return now >= conn_expiration(conn);
 }
 return false;
 }
@@ -2633,7 +2636,7 @@ conn_to_ct_dpif_entry(const struct conn *conn, struct 
ct_dpif_entry *entry,
 entry->mark = conn->mark;
 memcpy(&entry->labels, &conn->label, sizeof entry->labels);
 
-long long expiration = conn->expiration - now;
+long long expiration = conn_expiration(conn) - now;
 
 struct ct_l4_proto *class = l4_protos[conn->key.nw_proto];
 if (class->conn_get_protoinfo) {
-- 
2.31.1



[ovs-dev] [PATCH v3 7/8] conntrack: Inverse conn and ct lock precedence

2021-06-15 Thread Gaetan Rivet
The lock acquisition order is for the global 'ct_lock' to be taken first
and then 'conn->lock'. This is an issue, as operations on different
connections end up serialized by threads contending on the
global 'ct_lock'.

This was previously necessary due to how the expiration lists, timeout
policies and zone limits were managed. They are now using RCU-friendly
structures that allow concurrent readers. The mutual exclusion now only
needs to happen during writes.

This allows lowering the 'ct_lock' precedence and taking it only
when writing the relevant structures. This reduces contention on
'ct_lock', which impairs scalability when the connection tracker is
used by many threads.
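
The practical effect on a per-connection update can be sketched as follows
(before vs. after, function bodies elided):

    /* Before: the global lock brackets every per-connection update.
     *
     *     ovs_mutex_lock(&ct->ct_lock);
     *     ovs_mutex_lock(&conn->lock);
     *     ...update one connection...
     *     ovs_mutex_unlock(&conn->lock);
     *     ovs_mutex_unlock(&ct->ct_lock);
     *
     * After: only 'conn->lock' is needed for per-connection state;
     * 'ct_lock' is taken inside it, and only when the global maps are
     * actually written (e.g. when a connection is cleaned).
     *
     *     ovs_mutex_lock(&conn->lock);
     *     ...update one connection...
     *     ovs_mutex_unlock(&conn->lock);
     */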

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  7 --
 lib/conntrack-tp.c  | 30 +-
 lib/conntrack.c | 56 +
 3 files changed, 41 insertions(+), 52 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index ea2e7ed4d..bb82252e8 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -134,6 +134,9 @@ struct conn {
 struct nat_action_info_t *nat_info;
 char *alg;
 struct conn *nat_conn; /* The NAT 'conn' context, if there is one. */
+atomic_flag reclaimed; /* False during the lifetime of the connection,
+* True as soon as a thread has started freeing
+* its memory. */
 
 /* Inserted once by a PMD, then managed by the 'ct_clean' thread. */
 struct conn_expire exp;
@@ -196,8 +199,8 @@ struct conntrack {
 };
 
 /* Lock acquisition order:
- *1. 'ct_lock'
- *2. 'conn->lock'
+ *1. 'conn->lock'
+ *2. 'ct_lock'
  *3. 'resources_lock'
  */
 
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 592e10c6f..22363e7fe 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -245,58 +245,30 @@ conn_schedule_expiration(struct conn *conn, enum 
ct_timeout tm, long long now,
 ignore(atomic_flag_test_and_set(&conn->exp.reschedule));
 }
 
-static void
-conn_update_expiration__(struct conntrack *ct, struct conn *conn,
- enum ct_timeout tm, long long now,
- uint32_t tp_value)
-OVS_REQUIRES(conn->lock)
-{
-ovs_mutex_unlock(&conn->lock);
-
-ovs_mutex_lock(&ct->ct_lock);
-ovs_mutex_lock(&conn->lock);
-conn_schedule_expiration(conn, tm, now, tp_value);
-ovs_mutex_unlock(&conn->lock);
-ovs_mutex_unlock(&ct->ct_lock);
-
-ovs_mutex_lock(&conn->lock);
-}
-
 /* The conn entry lock must be held on entry and exit. */
 void
 conn_update_expiration(struct conntrack *ct, struct conn *conn,
enum ct_timeout tm, long long now)
-OVS_REQUIRES(conn->lock)
 {
 struct timeout_policy *tp;
 uint32_t val;
 
-ovs_mutex_unlock(&conn->lock);
-
-ovs_mutex_lock(&ct->ct_lock);
-ovs_mutex_lock(&conn->lock);
 tp = timeout_policy_lookup(ct, conn->tp_id);
 if (tp) {
 val = tp->policy.attrs[tm_to_ct_dpif_tp(tm)];
 } else {
 val = ct_dpif_netdev_tp_def[tm_to_ct_dpif_tp(tm)];
 }
-ovs_mutex_unlock(&conn->lock);
-ovs_mutex_unlock(&ct->ct_lock);
-
-ovs_mutex_lock(&conn->lock);
 VLOG_DBG_RL(&rl, "Update timeout %s zone=%u with policy id=%d "
 "val=%u sec.",
 ct_timeout_str[tm], conn->key.zone, conn->tp_id, val);
 
-conn_update_expiration__(ct, conn, tm, now, val);
+conn_schedule_expiration(conn, tm, now, val);
 }
 
-/* ct_lock must be held. */
 void
 conn_init_expiration(struct conntrack *ct, struct conn *conn,
  enum ct_timeout tm, long long now)
-OVS_REQUIRES(ct->ct_lock)
 {
 struct timeout_policy *tp;
 uint32_t val;
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 71f51f3d9..045710e8d 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -465,7 +465,7 @@ zone_limit_delete(struct conntrack *ct, uint16_t zone)
 
 static void
 conn_clean_cmn(struct conntrack *ct, struct conn *conn)
-OVS_REQUIRES(ct->ct_lock)
+OVS_REQUIRES(conn->lock, ct->ct_lock)
 {
 if (conn->alg) {
 expectation_clean(ct, &conn->key);
@@ -495,18 +495,29 @@ conn_unref(struct conn *conn)
  * removes the associated nat 'conn' from the lookup datastructures. */
 static void
 conn_clean(struct conntrack *ct, struct conn *conn)
-OVS_REQUIRES(ct->ct_lock)
+OVS_EXCLUDED(conn->lock, ct->ct_lock)
 {
 ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
 
+if (atomic_flag_test_and_set(&conn->reclaimed)) {
+return;
+}
+
+ovs_mutex_lock(&conn->lock);
+
+ovs_m

[ovs-dev] [PATCH v3 6/8] conntrack-tp: Use a cmap to store timeout policies

2021-06-15 Thread Gaetan Rivet
Multiple lookups are done on stored timeout policies, each time taking
the global 'ct_lock'. This is usually not necessary, and it should be
acceptable for policy updates to be slightly delayed (by one RCU sync
at most). Using a cmap avoids repeatedly taking and releasing the lock in
the connection insertion path.
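
One practical note for callers (an observation, not from the patch text):
with RCU-deferred frees, a pointer returned by timeout_policy_get() is only
guaranteed to remain valid until the calling thread quiesces, so the needed
attribute should be copied out right away:

    struct timeout_policy *tp = timeout_policy_get(ct, tp_id);
    uint32_t val;

    /* Copy the value while the RCU-protected pointer is still valid. */
    val = tp ? tp->policy.attrs[tm_to_ct_dpif_tp(tm)]
             : ct_dpif_netdev_tp_def[tm_to_ct_dpif_tp(tm)];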

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Acked-by: William Tu 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack-tp.c  | 54 +++--
 lib/conntrack.c |  9 ---
 lib/conntrack.h |  2 +-
 4 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 7eb3ca297..ea2e7ed4d 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -173,7 +173,7 @@ struct conntrack {
 struct cmap conns OVS_GUARDED;
 struct mpsc_queue exp_lists[N_CT_TM];
 struct cmap zone_limits OVS_GUARDED;
-struct hmap timeout_policies OVS_GUARDED;
+struct cmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
 pthread_t clean_thread; /* Periodically cleans up connection tracker. */
 struct latch clean_thread_exit; /* To destroy the 'clean_thread'. */
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 6de2354c0..592e10c6f 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -47,14 +47,15 @@ static unsigned int ct_dpif_netdev_tp_def[] = {
 };
 
 static struct timeout_policy *
-timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
+timeout_policy_lookup_protected(struct conntrack *ct, int32_t tp_id)
 OVS_REQUIRES(ct->ct_lock)
 {
 struct timeout_policy *tp;
 uint32_t hash;
 
 hash = hash_int(tp_id, ct->hash_basis);
-HMAP_FOR_EACH_IN_BUCKET (tp, node, hash, &ct->timeout_policies) {
+CMAP_FOR_EACH_WITH_HASH_PROTECTED (tp, node, hash,
+   &ct->timeout_policies) {
 if (tp->policy.id == tp_id) {
 return tp;
 }
@@ -62,20 +63,25 @@ timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
 return NULL;
 }
 
-struct timeout_policy *
-timeout_policy_get(struct conntrack *ct, int32_t tp_id)
+static struct timeout_policy *
+timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
 {
 struct timeout_policy *tp;
+uint32_t hash;
 
-ovs_mutex_lock(&ct->ct_lock);
-tp = timeout_policy_lookup(ct, tp_id);
-if (!tp) {
-ovs_mutex_unlock(&ct->ct_lock);
-return NULL;
+hash = hash_int(tp_id, ct->hash_basis);
+CMAP_FOR_EACH_WITH_HASH (tp, node, hash, &ct->timeout_policies) {
+if (tp->policy.id == tp_id) {
+return tp;
+}
 }
+return NULL;
+}
 
-ovs_mutex_unlock(&ct->ct_lock);
-return tp;
+struct timeout_policy *
+timeout_policy_get(struct conntrack *ct, int32_t tp_id)
+{
+return timeout_policy_lookup(ct, tp_id);
 }
 
 static void
@@ -125,27 +131,30 @@ timeout_policy_create(struct conntrack *ct,
 init_default_tp(tp, tp_id);
 update_existing_tp(tp, new_tp);
 hash = hash_int(tp_id, ct->hash_basis);
-hmap_insert(&ct->timeout_policies, &tp->node, hash);
+cmap_insert(&ct->timeout_policies, &tp->node, hash);
 }
 
 static void
 timeout_policy_clean(struct conntrack *ct, struct timeout_policy *tp)
 OVS_REQUIRES(ct->ct_lock)
 {
-hmap_remove(&ct->timeout_policies, &tp->node);
-free(tp);
+uint32_t hash = hash_int(tp->policy.id, ct->hash_basis);
+cmap_remove(&ct->timeout_policies, &tp->node, hash);
+ovsrcu_postpone(free, tp);
 }
 
 static int
-timeout_policy_delete__(struct conntrack *ct, uint32_t tp_id)
+timeout_policy_delete__(struct conntrack *ct, uint32_t tp_id,
+bool warn_on_error)
 OVS_REQUIRES(ct->ct_lock)
 {
+struct timeout_policy *tp;
 int err = 0;
-struct timeout_policy *tp = timeout_policy_lookup(ct, tp_id);
 
+tp = timeout_policy_lookup_protected(ct, tp_id);
 if (tp) {
 timeout_policy_clean(ct, tp);
-} else {
+} else if (warn_on_error) {
 VLOG_WARN_RL(&rl, "Failed to delete a non-existent timeout "
  "policy: id=%d", tp_id);
 err = ENOENT;
@@ -159,7 +168,7 @@ timeout_policy_delete(struct conntrack *ct, uint32_t tp_id)
 int err;
 
 ovs_mutex_lock(&ct->ct_lock);
-err = timeout_policy_delete__(ct, tp_id);
+err = timeout_policy_delete__(ct, tp_id, true);
 ovs_mutex_unlock(&ct->ct_lock);
 return err;
 }
@@ -170,7 +179,7 @@ timeout_policy_init(struct conntrack *ct)
 {
 struct timeout_policy tp;
 
-hmap_init(&ct->timeout_policies);
+cmap_init(&ct->timeout_policies);
 
 /* Create default timeout policy. */
 memset(&tp, 0, sizeof tp);
@@ -182,14 +191,11 @@ int
 timeout_

[ovs-dev] [PATCH v3 2/8] ovs-atomic: Expose atomic exchange operation

2021-06-15 Thread Gaetan Rivet
The atomic exchange operation is a useful primitive that should be
available as well.  Most compilers already expose it or offer a way
to use it; only a single symbol needs to be defined.
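
As an illustration (not part of the patch), a read-and-reset of a shared
counter can then be done with a single primitive; the expression form
assumed here matches the macros added below:

    static atomic_uint64_t pending = ATOMIC_VAR_INIT(0);

    /* Atomically stores 0 and returns the previous value. */
    uint64_t batch = atomic_exchange(&pending, 0);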

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/ovs-atomic-c++.h  |  3 +++
 lib/ovs-atomic-clang.h|  5 +
 lib/ovs-atomic-gcc4+.h|  5 +
 lib/ovs-atomic-gcc4.7+.h  |  5 +
 lib/ovs-atomic-i586.h |  5 +
 lib/ovs-atomic-locked.h   |  9 +
 lib/ovs-atomic-msvc.h | 22 ++
 lib/ovs-atomic-pthreads.h |  5 +
 lib/ovs-atomic-x86_64.h   |  5 +
 lib/ovs-atomic.h  |  8 +++-
 10 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/lib/ovs-atomic-c++.h b/lib/ovs-atomic-c++.h
index d47b8dd39..8605fa9d3 100644
--- a/lib/ovs-atomic-c++.h
+++ b/lib/ovs-atomic-c++.h
@@ -29,6 +29,9 @@ using std::atomic_compare_exchange_strong_explicit;
 using std::atomic_compare_exchange_weak;
 using std::atomic_compare_exchange_weak_explicit;
 
+using std::atomic_exchange;
+using std::atomic_exchange_explicit;
+
 #define atomic_read(SRC, DST) \
 atomic_read_explicit(SRC, DST, memory_order_seq_cst)
 #define atomic_read_explicit(SRC, DST, ORDER)   \
diff --git a/lib/ovs-atomic-clang.h b/lib/ovs-atomic-clang.h
index 34cc2faa7..cdf02a512 100644
--- a/lib/ovs-atomic-clang.h
+++ b/lib/ovs-atomic-clang.h
@@ -67,6 +67,11 @@ typedef enum {
 #define atomic_compare_exchange_weak_explicit(DST, EXP, SRC, ORD1, ORD2) \
 __c11_atomic_compare_exchange_weak(DST, EXP, SRC, ORD1, ORD2)
 
+#define atomic_exchange(RMW, ARG) \
+atomic_exchange_explicit(RMW, ARG, memory_order_seq_cst)
+#define atomic_exchange_explicit(RMW, ARG, ORDER) \
+__c11_atomic_exchange(RMW, ARG, ORDER)
+
 #define atomic_add(RMW, ARG, ORIG) \
 atomic_add_explicit(RMW, ARG, ORIG, memory_order_seq_cst)
 #define atomic_sub(RMW, ARG, ORIG) \
diff --git a/lib/ovs-atomic-gcc4+.h b/lib/ovs-atomic-gcc4+.h
index 25bcf20a0..f9accde1a 100644
--- a/lib/ovs-atomic-gcc4+.h
+++ b/lib/ovs-atomic-gcc4+.h
@@ -128,6 +128,11 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit   \
 atomic_compare_exchange_strong_explicit
 
+#define atomic_exchange_explicit(DST, SRC, ORDER) \
+__sync_lock_test_and_set(DST, SRC)
+#define atomic_exchange(DST, SRC) \
+atomic_exchange_explicit(DST, SRC, memory_order_seq_cst)
+
 #define atomic_op__(RMW, OP, ARG, ORIG) \
 ({  \
 typeof(RMW) rmw__ = (RMW);  \
diff --git a/lib/ovs-atomic-gcc4.7+.h b/lib/ovs-atomic-gcc4.7+.h
index 4c197ebe0..846e05775 100644
--- a/lib/ovs-atomic-gcc4.7+.h
+++ b/lib/ovs-atomic-gcc4.7+.h
@@ -61,6 +61,11 @@ typedef enum {
 #define atomic_compare_exchange_weak_explicit(DST, EXP, SRC, ORD1, ORD2) \
 __atomic_compare_exchange_n(DST, EXP, SRC, true, ORD1, ORD2)
 
+#define atomic_exchange_explicit(DST, SRC, ORDER) \
+__atomic_exchange_n(DST, SRC, ORDER)
+#define atomic_exchange(DST, SRC) \
+atomic_exchange_explicit(DST, SRC, memory_order_seq_cst)
+
 #define atomic_add(RMW, OPERAND, ORIG) \
 atomic_add_explicit(RMW, OPERAND, ORIG, memory_order_seq_cst)
 #define atomic_sub(RMW, OPERAND, ORIG) \
diff --git a/lib/ovs-atomic-i586.h b/lib/ovs-atomic-i586.h
index 9a385ce84..35a0959ff 100644
--- a/lib/ovs-atomic-i586.h
+++ b/lib/ovs-atomic-i586.h
@@ -400,6 +400,11 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit   \
 atomic_compare_exchange_strong_explicit
 
+#define atomic_exchange_explicit(RMW, ARG, ORDER) \
+atomic_exchange__(RMW, ARG, ORDER)
+#define atomic_exchange(RMW, ARG) \
+atomic_exchange_explicit(RMW, ARG, memory_order_seq_cst)
+
 #define atomic_add__(RMW, ARG, CLOB)\
 asm volatile("lock; xadd %0,%1 ; "  \
  "# atomic_add__ "  \
diff --git a/lib/ovs-atomic-locked.h b/lib/ovs-atomic-locked.h
index f8f0ba2a5..bf38c4a43 100644
--- a/lib/ovs-atomic-locked.h
+++ b/lib/ovs-atomic-locked.h
@@ -31,6 +31,15 @@ void atomic_unlock__(void *);
  atomic_unlock__(DST),  \
  false)))
 
+#define atomic_exchange_locked(DST, SRC) \
+({   \
+atomic_lock__(DST);  \
+typeof(*(DST)) __tmp = *(DST);   \
+*(DST) = SRC;\
+atomic_unlock__(DST);\
+__tmp;   \
+})
+
 #define atomic_op_locked_add +=
 #define atomic_op_locked_sub -=
 #define atomic_op_locked_or  |=
diff --git a/lib/ovs-atomic-msvc.h b/lib/ovs-atomic-msvc.h
index 9def887d3..ef8310269 100644
--- a/lib/ovs-atomic-msvc.h
+++ b/lib/ovs-atomic-msvc.h
@@ -345,6 +345,28 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit \
 atomic_compare_exchange_strong_explicit
 

[ovs-dev] [PATCH v3 5/8] conntrack: Use a cmap to store zone limits

2021-06-15 Thread Gaetan Rivet
Change the zone limits data structure from hmap to cmap.
As zone limits are shared amongst multiple conntrack users, multiple
readers want to check the current zone limit state before proceeding with
their processing. Using a CMAP allows lookups to be done without taking the
global 'ct_lock', thus reducing contention.
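
A rough sketch of the reader path this enables (illustrative only, details
such as the exact limit check are simplified): a datapath thread can consult
the zone limit without 'ct_lock', falling back to the default zone:

    struct zone_limit *zl = zone_limit_lookup(ct, zone);

    if (!zl) {
        zl = zone_limit_lookup(ct, DEFAULT_ZONE);
    }
    if (zl && zl->czl.limit
        && atomic_count_get(&zl->czl.count) >= zl->czl.limit) {
        /* Over the configured limit: refuse to track the new connection. */
    }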

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack.c | 70 -
 lib/conntrack.h |  2 +-
 lib/dpif-netdev.c   |  5 +--
 4 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 537e56534..7eb3ca297 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -172,7 +172,7 @@ struct conntrack {
 struct ovs_mutex ct_lock; /* Protects 2 following fields. */
 struct cmap conns OVS_GUARDED;
 struct mpsc_queue exp_lists[N_CT_TM];
-struct hmap zone_limits OVS_GUARDED;
+struct cmap zone_limits OVS_GUARDED;
 struct hmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
 pthread_t clean_thread; /* Periodically cleans up connection tracker. */
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 45de13ebf..094367733 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -79,7 +79,7 @@ enum ct_alg_ctl_type {
 };
 
 struct zone_limit {
-struct hmap_node node;
+struct cmap_node node;
 struct conntrack_zone_limit czl;
 };
 
@@ -308,7 +308,7 @@ conntrack_init(void)
 for (unsigned i = 0; i < ARRAY_SIZE(ct->exp_lists); i++) {
 mpsc_queue_init(&ct->exp_lists[i]);
 }
-hmap_init(&ct->zone_limits);
+cmap_init(&ct->zone_limits);
 ct->zone_limit_seq = 0;
 timeout_policy_init(ct);
 ovs_mutex_unlock(&ct->ct_lock);
@@ -343,12 +343,25 @@ zone_key_hash(int32_t zone, uint32_t basis)
 }
 
 static struct zone_limit *
-zone_limit_lookup(struct conntrack *ct, int32_t zone)
+zone_limit_lookup_protected(struct conntrack *ct, int32_t zone)
 OVS_REQUIRES(ct->ct_lock)
 {
 uint32_t hash = zone_key_hash(zone, ct->hash_basis);
 struct zone_limit *zl;
-HMAP_FOR_EACH_IN_BUCKET (zl, node, hash, &ct->zone_limits) {
+CMAP_FOR_EACH_WITH_HASH_PROTECTED (zl, node, hash, &ct->zone_limits) {
+if (zl->czl.zone == zone) {
+return zl;
+}
+}
+return NULL;
+}
+
+static struct zone_limit *
+zone_limit_lookup(struct conntrack *ct, int32_t zone)
+{
+uint32_t hash = zone_key_hash(zone, ct->hash_basis);
+struct zone_limit *zl;
+CMAP_FOR_EACH_WITH_HASH (zl, node, hash, &ct->zone_limits) {
 if (zl->czl.zone == zone) {
 return zl;
 }
@@ -358,7 +371,6 @@ zone_limit_lookup(struct conntrack *ct, int32_t zone)
 
 static struct zone_limit *
 zone_limit_lookup_or_default(struct conntrack *ct, int32_t zone)
-OVS_REQUIRES(ct->ct_lock)
 {
 struct zone_limit *zl = zone_limit_lookup(ct, zone);
 return zl ? zl : zone_limit_lookup(ct, DEFAULT_ZONE);
@@ -367,13 +379,16 @@ zone_limit_lookup_or_default(struct conntrack *ct, 
int32_t zone)
 struct conntrack_zone_limit
 zone_limit_get(struct conntrack *ct, int32_t zone)
 {
-ovs_mutex_lock(&ct->ct_lock);
-struct conntrack_zone_limit czl = {DEFAULT_ZONE, 0, 0, 0};
+struct conntrack_zone_limit czl = {
+.zone = DEFAULT_ZONE,
+.limit = 0,
+.count = ATOMIC_COUNT_INIT(0),
+.zone_limit_seq = 0,
+};
 struct zone_limit *zl = zone_limit_lookup_or_default(ct, zone);
 if (zl) {
 czl = zl->czl;
 }
-ovs_mutex_unlock(&ct->ct_lock);
 return czl;
 }
 
@@ -381,13 +396,19 @@ static int
 zone_limit_create(struct conntrack *ct, int32_t zone, uint32_t limit)
 OVS_REQUIRES(ct->ct_lock)
 {
+struct zone_limit *zl = zone_limit_lookup_protected(ct, zone);
+
+if (zl) {
+return 0;
+}
+
 if (zone >= DEFAULT_ZONE && zone <= MAX_ZONE) {
-struct zone_limit *zl = xzalloc(sizeof *zl);
+zl = xzalloc(sizeof *zl);
 zl->czl.limit = limit;
 zl->czl.zone = zone;
 zl->czl.zone_limit_seq = ct->zone_limit_seq++;
 uint32_t hash = zone_key_hash(zone, ct->hash_basis);
-hmap_insert(&ct->zone_limits, &zl->node, hash);
+cmap_insert(&ct->zone_limits, &zl->node, hash);
 return 0;
 } else {
 return EINVAL;
@@ -398,13 +419,14 @@ int
 zone_limit_update(struct conntrack *ct, int32_t zone, uint32_t limit)
 {
 int err = 0;
-ovs_mutex_lock(&ct->ct_lock);
 struct zone_limit *zl = zone_limit_lookup(ct, zone);
 if (zl) {
 zl->czl.limit = limit;
 VLOG_INFO("Changed zone limit of %u for zone %d", limit, zone);
 } else {
+ovs_mutex_lock(&ct->ct_lock);
 err = 

[ovs-dev] [PATCH v3 3/8] mpsc-queue: Module for lock-free message passing

2021-06-15 Thread Gaetan Rivet
Add a lockless multi-producer/single-consumer (MPSC), linked-list based,
intrusive, unbounded queue that does not require deferred memory
management.

The queue is designed for this specific MPSC setup.  A benchmark
accompanies the unit tests to measure the difference in this configuration:
a single reader thread polls the queue while N writers enqueue elements
as fast as possible.  The mpsc-queue is compared against the regular ovs-list
as well as the guarded list.  The latter usually offers a slight improvement
by batching the element removal; the mpsc-queue is nevertheless faster.
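
As a rough usage sketch (not from the patch): producers only ever call
mpsc_queue_insert(), while the single consumer first takes ownership of the
consumer side.  The pop primitive name below is an assumption, as the
consumer-side API is not fully visible in this excerpt:

    struct my_item {
        struct mpsc_queue_node node;
        int value;
    };

    static struct mpsc_queue queue = MPSC_QUEUE_INITIALIZER(&queue);

    /* Producer side: any thread, lockless. */
    struct my_item *item = xmalloc(sizeof *item);
    item->value = 42;
    mpsc_queue_insert(&queue, &item->node);

    /* Consumer side: a single thread owning the consumer end. */
    struct mpsc_queue_node *node;

    mpsc_queue_acquire(&queue);
    while ((node = mpsc_queue_pop(&queue))) {   /* Assumed pop primitive. */
        struct my_item *it = CONTAINER_OF(node, struct my_item, node);
        /* ... process it->value ... */
        free(it);
    }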

The 'Avg' column is the average of the producer threads' times:

   $ ./tests/ovstest test-mpsc-queue benchmark 300 1
   Benchmarking n=300 on 1 + 1 threads.
    type\thread:  Reader      1    Avg
      mpsc-queue:     167    167    167 ms
      list(spin):      89     80     80 ms
     list(mutex):     745    745    745 ms
    guarded list:     788    788    788 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 2
   Benchmarking n=300 on 1 + 2 threads.
    type\thread:  Reader      1      2    Avg
      mpsc-queue:      98     97     94     95 ms
      list(spin):     185    171    173    172 ms
     list(mutex):     203    199    203    201 ms
    guarded list:     269    269    188    228 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 3
   Benchmarking n=300 on 1 + 3 threads.
    type\thread:  Reader      1      2      3    Avg
      mpsc-queue:      76     76     65     76     72 ms
      list(spin):     246    110    240    238    196 ms
     list(mutex):     542    541    541    539    540 ms
    guarded list:     535    535    507    511    517 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 4
   Benchmarking n=300 on 1 + 4 threads.
    type\thread:  Reader      1      2      3      4    Avg
      mpsc-queue:      73     68     68     68     68     68 ms
      list(spin):     294    275    279    277    282    278 ms
     list(mutex):     346    309    287    345    302    310 ms
    guarded list:     378    319    334    378    351    345 ms

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/automake.mk |   2 +
 lib/mpsc-queue.c| 251 +
 lib/mpsc-queue.h| 190 ++
 tests/automake.mk   |   1 +
 tests/library.at|   5 +
 tests/test-mpsc-queue.c | 772 
 6 files changed, 1221 insertions(+)
 create mode 100644 lib/mpsc-queue.c
 create mode 100644 lib/mpsc-queue.h
 create mode 100644 tests/test-mpsc-queue.c

diff --git a/lib/automake.mk b/lib/automake.mk
index db9017591..4b68c7227 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -166,6 +166,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/memory.c \
lib/memory.h \
lib/meta-flow.c \
+   lib/mpsc-queue.c \
+   lib/mpsc-queue.h \
lib/multipath.c \
lib/multipath.h \
lib/namemap.c \
diff --git a/lib/mpsc-queue.c b/lib/mpsc-queue.c
new file mode 100644
index 0..ee762e1dc
--- /dev/null
+++ b/lib/mpsc-queue.c
@@ -0,0 +1,251 @@
+/*
+ * Copyright (c) 2020 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include "ovs-atomic.h"
+
+#include "mpsc-queue.h"
+
+/* Multi-producer, single-consumer queue
+ * =
+ *
+ * This is an implementation of the MPSC queue described by Dmitri Vyukov [1].
+ *
+ * One atomic exchange operation is done per insertion.  Removal in most cases
+ * will not require atomic operation and will use one atomic exchange to close
+ * the queue chain.
+ *
+ * Insertion
+ * =
+ *
+ * The queue is implemented using a linked-list.  Insertion is done at the
+ * back of the queue, by swapping the current end with the new node atomically,
+ * then pointing the previous end toward the new node.  To follow Vyukov
+ * nomenclature, the end-node of the chain is called head.  A producer will
+ * only manipulate the head.
+ *
+ * The head swap is atomic, however the link from the previous head to the new
+ * one is done in a separate operation.  This means that the chain is
+ * momentarily broken, when the previous head still points to NULL and the
+ * current head has been inserted.
+ *
+ * Considering a series of insertions, the queue state will remain consistent
+ * and the insertions order is compatible with their precedence, thus the
+ * queue is seri

[ovs-dev] [PATCH v3 4/8] conntrack: Use mpsc-queue to store conn expirations

2021-06-15 Thread Gaetan Rivet
Change the connection expiration lists from ovs_list to mpsc-queue.
This is a pre-step towards reducing the granularity of 'ct_lock'.

It simplifies the responsibilities around updating the expiration queue.
The dataplane now appends a new conn for expiration only once, during
creation.  Any further update consists only of writing the conn
expiration limit and marking the conn for expiration rescheduling.

The ageing thread 'ct_clean' is the only one consuming the expiration
lists.  If a conn was marked for rescheduling by a dataplane thread,
'ct_clean' will move the conn to the end of the queue.

Once the locks have been reworked, neither the dataplane threads nor
'ct_clean' will have to take a lock to update the expiration lists
(assuming the consumer lock is perpetually held by 'ct_clean').
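
Since the queue node is embedded in the conn, the consumer can map a
dequeued node back to its connection without any auxiliary lookup; as a
short sketch (illustrative only, 'node' assumed to come from one of
ct->exp_lists[]):

    struct conn *conn = CONTAINER_OF(node, struct conn, exp.node);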

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  84 +++-
 lib/conntrack-tp.c  |  28 +-
 lib/conntrack.c | 118 ++--
 3 files changed, 173 insertions(+), 57 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index e8332bdba..537e56534 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -29,6 +29,7 @@
 #include "openvswitch/list.h"
 #include "openvswitch/types.h"
 #include "packets.h"
+#include "mpsc-queue.h"
 #include "unaligned.h"
 #include "dp-packet.h"
 
@@ -86,22 +87,57 @@ struct alg_exp_node {
 bool nat_rpl_dst;
 };
 
+/* Timeouts: all the possible timeout states passed to update_expiration()
+ * are listed here. The name will be prefix by CT_TM_ and the value is in
+ * milliseconds */
+#define CT_TIMEOUTS \
+CT_TIMEOUT(TCP_FIRST_PACKET) \
+CT_TIMEOUT(TCP_OPENING) \
+CT_TIMEOUT(TCP_ESTABLISHED) \
+CT_TIMEOUT(TCP_CLOSING) \
+CT_TIMEOUT(TCP_FIN_WAIT) \
+CT_TIMEOUT(TCP_CLOSED) \
+CT_TIMEOUT(OTHER_FIRST) \
+CT_TIMEOUT(OTHER_MULTIPLE) \
+CT_TIMEOUT(OTHER_BIDIR) \
+CT_TIMEOUT(ICMP_FIRST) \
+CT_TIMEOUT(ICMP_REPLY)
+
+enum ct_timeout {
+#define CT_TIMEOUT(NAME) CT_TM_##NAME,
+CT_TIMEOUTS
+#undef CT_TIMEOUT
+N_CT_TM
+};
+
 enum OVS_PACKED_ENUM ct_conn_type {
 CT_CONN_TYPE_DEFAULT,
 CT_CONN_TYPE_UN_NAT,
 };
 
+struct conn_expire {
+struct mpsc_queue_node node;
+/* Timeout state of the connection.
+ * It follows the connection state updates.
+ */
+enum ct_timeout tm;
+atomic_flag reschedule;
+struct ovs_refcount refcount;
+};
+
 struct conn {
 /* Immutable data. */
 struct conn_key key;
 struct conn_key rev_key;
 struct conn_key parent_key; /* Only used for orig_tuple support. */
-struct ovs_list exp_node;
 struct cmap_node cm_node;
 struct nat_action_info_t *nat_info;
 char *alg;
 struct conn *nat_conn; /* The NAT 'conn' context, if there is one. */
 
+/* Inserted once by a PMD, then managed by the 'ct_clean' thread. */
+struct conn_expire exp;
+
 /* Mutable data. */
 struct ovs_mutex lock; /* Guards all mutable fields. */
 ovs_u128 label;
@@ -132,33 +168,10 @@ enum ct_update_res {
 CT_UPDATE_VALID_NEW,
 };
 
-/* Timeouts: all the possible timeout states passed to update_expiration()
- * are listed here. The name will be prefix by CT_TM_ and the value is in
- * milliseconds */
-#define CT_TIMEOUTS \
-CT_TIMEOUT(TCP_FIRST_PACKET) \
-CT_TIMEOUT(TCP_OPENING) \
-CT_TIMEOUT(TCP_ESTABLISHED) \
-CT_TIMEOUT(TCP_CLOSING) \
-CT_TIMEOUT(TCP_FIN_WAIT) \
-CT_TIMEOUT(TCP_CLOSED) \
-CT_TIMEOUT(OTHER_FIRST) \
-CT_TIMEOUT(OTHER_MULTIPLE) \
-CT_TIMEOUT(OTHER_BIDIR) \
-CT_TIMEOUT(ICMP_FIRST) \
-CT_TIMEOUT(ICMP_REPLY)
-
-enum ct_timeout {
-#define CT_TIMEOUT(NAME) CT_TM_##NAME,
-CT_TIMEOUTS
-#undef CT_TIMEOUT
-N_CT_TM
-};
-
 struct conntrack {
 struct ovs_mutex ct_lock; /* Protects 2 following fields. */
 struct cmap conns OVS_GUARDED;
-struct ovs_list exp_lists[N_CT_TM] OVS_GUARDED;
+struct mpsc_queue exp_lists[N_CT_TM];
 struct hmap zone_limits OVS_GUARDED;
 struct hmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
@@ -204,4 +217,25 @@ struct ct_l4_proto {
struct ct_dpif_protoinfo *);
 };
 
+static inline void
+conn_expire_push_back(struct conntrack *ct, struct conn *conn)
+{
+if (ovs_refcount_try_ref_rcu(&conn->exp.refcount)) {
+atomic_flag_clear(&conn->exp.reschedule);
+mpsc_queue_insert(&ct->exp_lists[conn->exp.tm], &conn->exp.node);
+}
+}
+
+static inline void
+conn_expire_push_front(struct conntrack *ct, struct conn *conn)
+OVS_REQUIRES(ct->exp_lists[conn->exp.tm].read_lock)
+{
+if (ovs_refcount_try_ref_rcu(&conn->exp.refcount)) {
+/* Do not change 'reschedule' sta

[ovs-dev] [PATCH v3 0/8] conntrack: improve multithread scalability

2021-06-15 Thread Gaetan Rivet
Conntrack is executed within the datapath. Locks along this path are crucial
and their critical sections should be minimal. The global 'ct_lock' must be
taken before any action on connection states. This lock is needed for many
operations on the conntrack, slowing down the datapath.

The cleanup thread 'ct_clean' will take it to do its job. As it can hold it a
long time, the thread is limited in the number of connections cleaned per
round, and calls are rate-limited.

* Timeout policy locking is contrived to avoid deadlock.
  Anytime a connection state is updated, the connection is unlocked,
  'ct_lock' is taken, then the connection is locked again; the reverse
  is done to unlock.

* Scalability is poor. The global ct_lock needs to be taken before applying
  any change to a conn object. This is backward: local changes to smaller
  objects should be independent, and the global lock should only be taken once
  the rest of the work is done, the goal being the smallest possible
  critical section.

This can be improved. Using RCU-friendly structures for connections, zone
limits and timeout policies, read-mostly workloads are improved and the
precedence of the global 'ct_lock' and local 'conn->lock' can be inverted.

Running the conntrack benchmark we see these changes:
  ./tests/ovstest test-conntrack benchmark  300 32

code \ N      1     2     4     8
  Before   2310  2766  6117 19838  (ms)
   After   2072  2084  2653  4541  (ms)

One thread in the benchmark executes the task of a PMD, while the 'ct_clean'
thread runs in the background as well.

Github actions: https://github.com/grivet/ovs/actions/runs/574446345

v2:

An mpsc-queue is used instead of an rculist to manage the connection
expiration lists. PMDs and ct_clean all act as producers, while ct_clean is
the sole consumer thread. A PMD now needs to take the 'ct_lock' only when
creating a new connection, and only while inserting it in the conn CMAP.
For any update, only the conn lock is now required to properly change its
state.

The mpsc-queue implementation is identical to the one from the parallel offload 
series [1].

CI: https://github.com/grivet/ovs/actions/runs/772118640

[1]: https://patchwork.ozlabs.org/project/openvswitch/list/?series=238779

v3:

The last part of the series, modifying the rate limit of conntrack_clean, is
dropped. It is not necessary for improving scalability and can be done later
if needed.

CI: https://github.com/grivet/ovs/actions/runs/940610003

On my local development laptop, the benchmark gives different numbers since v1:
  ./tests/ovstest test-conntrack benchmark  300 32

code \ N      1     2     4     8
  Before    598  1656 12612 39301  (ms)
   After    293   337   427   893  (ms)

I replicated the numbers on a 24-core machine as well.
The benchmark is not very accurate as no core pinning and no isolation are done.

Gaetan Rivet (8):
  conntrack: Init hash basis first at creation
  ovs-atomic: Expose atomic exchange operation
  mpsc-queue: Module for lock-free message passing
  conntrack: Use mpsc-queue to store conn expirations
  conntrack: Use a cmap to store zone limits
  conntrack-tp: Use a cmap to store timeout policies
  conntrack: Inverse conn and ct lock precedence
  conntrack: Use an atomic conn expiration value

 lib/automake.mk   |   2 +
 lib/conntrack-private.h   |  97 +++--
 lib/conntrack-tp.c| 100 ++---
 lib/conntrack.c   | 278 ++
 lib/conntrack.h   |   4 +-
 lib/dpif-netdev.c |   5 +-
 lib/mpsc-queue.c  | 251 +
 lib/mpsc-queue.h  | 190 ++
 lib/ovs-atomic-c++.h  |   3 +
 lib/ovs-atomic-clang.h|   5 +
 lib/ovs-atomic-gcc4+.h|   5 +
 lib/ovs-atomic-gcc4.7+.h  |   5 +
 lib/ovs-atomic-i586.h |   5 +
 lib/ovs-atomic-locked.h   |   9 +
 lib/ovs-atomic-msvc.h |  22 ++
 lib/ovs-atomic-pthreads.h |   5 +
 lib/ovs-atomic-x86_64.h   |   5 +
 lib/ovs-atomic.h  |   8 +-
 tests/automake.mk |   1 +
 tests/library.at  |   5 +
 tests/test-mpsc-queue.c   | 772 ++
 21 files changed, 1608 insertions(+), 169 deletions(-)
 create mode 100644 lib/mpsc-queue.c
 create mode 100644 lib/mpsc-queue.h
 create mode 100644 tests/test-mpsc-queue.c

--
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3 1/8] conntrack: Init hash basis first at creation

2021-06-15 Thread Gaetan Rivet
The 'hash_basis' field is sometimes used during sub-system init
routines. It will be 0 by default, before randomization, so sub-systems
would then init some nodes with incorrect hash values.

The timeout policies module is affected, leaving the default policy
referenced using an incorrect hash value.

Fixes: 2078901a4c14 ("userspace: Add conntrack timeout policy support.")
Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Acked-by: William Tu 
---
 lib/conntrack.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/conntrack.c b/lib/conntrack.c
index 99198a601..a5efb37aa 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -291,6 +291,11 @@ conntrack_init(void)
 static struct ovsthread_once setup_l4_once = OVSTHREAD_ONCE_INITIALIZER;
 struct conntrack *ct = xzalloc(sizeof *ct);
 
+/* This value can be used during init (e.g. timeout_policy_init()),
+ * set it first to ensure it is available.
+ */
+ct->hash_basis = random_uint32();
+
 ovs_rwlock_init(&ct->resources_lock);
 ovs_rwlock_wrlock(&ct->resources_lock);
 hmap_init(&ct->alg_expectations);
@@ -308,7 +313,6 @@ conntrack_init(void)
 timeout_policy_init(ct);
 ovs_mutex_unlock(&ct->ct_lock);
 
-ct->hash_basis = random_uint32();
 atomic_count_init(&ct->n_conn, 0);
 atomic_init(&ct->n_conn_limit, DEFAULT_N_CONN_LIMIT);
 atomic_init(&ct->tcp_seq_chk, true);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v4 24/27] dpif-netdev: Make megaflow and mark mappings thread objects

2021-06-09 Thread Gaetan Rivet
In later commits hardware offloads are managed in several threads.
Each offload is managed by a thread determined by its flow's 'mega_ufid'.

As megaflow-to-mark and mark-to-flow mappings are 1:1 and 1:N
respectively, a single mark exists for a single 'mega_ufid', and
multiple flows use the same 'mega_ufid'. Because the managing thread will
be chosen using the 'mega_ufid', each mapping does not need to be
shared with other offload threads.

The mappings are kept as cmaps, as upcalls will sometimes query them before
enqueuing orders to the offload threads.

To prepare for this change, move the mappings within the offload thread
structure.
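
Illustrative sketch only (the helper name is hypothetical; the series
provides its own selection function): one way to derive the managing
offload thread from a flow's 'mega_ufid':

    static unsigned int
    ufid_to_offload_tid_sketch(const ovs_u128 *mega_ufid)
    {
        size_t hash = dp_netdev_flow_hash(mega_ufid);

        return hash % netdev_offload_thread_nb();
    }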

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 47 ++-
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 68dcdf39a..8fe794557 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -462,12 +462,16 @@ struct dp_offload_thread_item {
 struct dp_offload_thread {
 struct mpsc_queue queue;
 atomic_uint64_t enqueued_item;
+struct cmap megaflow_to_mark;
+struct cmap mark_to_flow;
 struct mov_avg_cma cma;
 struct mov_avg_ema ema;
 };
 
 static struct dp_offload_thread dp_offload_thread = {
 .queue = MPSC_QUEUE_INITIALIZER(&dp_offload_thread.queue),
+.megaflow_to_mark = CMAP_INITIALIZER,
+.mark_to_flow = CMAP_INITIALIZER,
 .enqueued_item = ATOMIC_VAR_INIT(0),
 .cma = MOV_AVG_CMA_INITIALIZER,
 .ema = MOV_AVG_EMA_INITIALIZER(100),
@@ -2437,32 +2441,23 @@ struct megaflow_to_mark_data {
 uint32_t mark;
 };
 
-struct flow_mark {
-struct cmap megaflow_to_mark;
-struct cmap mark_to_flow;
-struct id_fpool *pool;
-};
-
-static struct flow_mark flow_mark = {
-.megaflow_to_mark = CMAP_INITIALIZER,
-.mark_to_flow = CMAP_INITIALIZER,
-};
+static struct id_fpool *flow_mark_pool;
 
 static uint32_t
 flow_mark_alloc(void)
 {
-static struct ovsthread_once pool_init = OVSTHREAD_ONCE_INITIALIZER;
+static struct ovsthread_once init_once = OVSTHREAD_ONCE_INITIALIZER;
 unsigned int tid = netdev_offload_thread_id();
 uint32_t mark;
 
-if (ovsthread_once_start(&pool_init)) {
+if (ovsthread_once_start(&init_once)) {
 /* Haven't initiated yet, do it here */
-flow_mark.pool = id_fpool_create(netdev_offload_thread_nb(),
+flow_mark_pool = id_fpool_create(netdev_offload_thread_nb(),
  1, MAX_FLOW_MARK);
-ovsthread_once_done(&pool_init);
+ovsthread_once_done(&init_once);
 }
 
-if (id_fpool_new_id(flow_mark.pool, tid, &mark)) {
+if (id_fpool_new_id(flow_mark_pool, tid, &mark)) {
 return mark;
 }
 
@@ -2474,7 +2469,7 @@ flow_mark_free(uint32_t mark)
 {
 unsigned int tid = netdev_offload_thread_id();
 
-id_fpool_free_id(flow_mark.pool, tid, mark);
+id_fpool_free_id(flow_mark_pool, tid, mark);
 }
 
 /* associate megaflow with a mark, which is a 1:1 mapping */
@@ -2487,7 +2482,7 @@ megaflow_to_mark_associate(const ovs_u128 *mega_ufid, 
uint32_t mark)
 data->mega_ufid = *mega_ufid;
 data->mark = mark;
 
-cmap_insert(&flow_mark.megaflow_to_mark,
+cmap_insert(&dp_offload_thread.megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 }
 
@@ -2498,9 +2493,10 @@ megaflow_to_mark_disassociate(const ovs_u128 *mega_ufid)
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
 
-CMAP_FOR_EACH_WITH_HASH (data, node, hash, &flow_mark.megaflow_to_mark) {
+CMAP_FOR_EACH_WITH_HASH (data, node, hash,
+ &dp_offload_thread.megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
-cmap_remove(&flow_mark.megaflow_to_mark,
+cmap_remove(&dp_offload_thread.megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 ovsrcu_postpone(free, data);
 return;
@@ -2517,7 +2513,8 @@ megaflow_to_mark_find(const ovs_u128 *mega_ufid)
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
 
-CMAP_FOR_EACH_WITH_HASH (data, node, hash, &flow_mark.megaflow_to_mark) {
+CMAP_FOR_EACH_WITH_HASH (data, node, hash,
+ &dp_offload_thread.megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
 return data->mark;
 }
@@ -2534,7 +2531,7 @@ mark_to_flow_associate(const uint32_t mark, struct 
dp_netdev_flow *flow)
 {
 dp_netdev_flow_ref(flow);
 
-cmap_insert(&flow_mark.mark_to_flow,
+cmap_insert(&dp_offload_thread.mark_to_flow,
 CONST_CAST(struct cmap_node *, &a

[ovs-dev] [PATCH v4 23/27] dpif-netdev: Use lockless queue to manage offloads

2021-06-09 Thread Gaetan Rivet
The dataplane threads (PMDs) send offloading commands to a dedicated
offload management thread. The current implementation uses a lock,
and benchmarks show high contention on the queue in some cases.

With high contention, the mutex will more often lead to the locking
thread yielding in wait, using a syscall. This should be avoided in
a userland dataplane.

The mpsc-queue can be used instead. It uses fewer cycles and has
lower latency. Benchmarks show better behavior as multiple
revalidators and one or multiple PMDs write to a single queue
while another thread polls it.

One trade-off with the new scheme, however, is being forced to poll
the queue from the offload thread. Without a mutex, a cond_wait
cannot be used for signaling. The offload thread implements
an exponential backoff and will sleep in short increments when no
data is available. This makes the thread yield, at the price of
some latency to manage offloads after an inactivity period.
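
A condensed sketch of the polling scheme described above (the bounds are the
ones added by this patch, 1 ms to 64 ms; the exact loop in the diff below is
truncated in this excerpt):

    uint64_t backoff = DP_NETDEV_OFFLOAD_BACKOFF_MIN;

    while (mpsc_queue_tail(queue) == NULL) {
        xnanosleep(backoff * 1E6);              /* 'backoff' is in ms. */
        if (backoff < DP_NETDEV_OFFLOAD_BACKOFF_MAX) {
            backoff <<= 1;                      /* Double up to the max. */
        }
    }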

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 109 --
 1 file changed, 57 insertions(+), 52 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 1daaecb1c..68dcdf39a 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -53,6 +53,7 @@
 #include "id-pool.h"
 #include "ipf.h"
 #include "mov-avg.h"
+#include "mpsc-queue.h"
 #include "netdev.h"
 #include "netdev-offload.h"
 #include "netdev-provider.h"
@@ -452,25 +453,22 @@ union dp_offload_thread_data {
 };
 
 struct dp_offload_thread_item {
-struct ovs_list node;
+struct mpsc_queue_node node;
 enum dp_offload_type type;
 long long int timestamp;
 union dp_offload_thread_data data[0];
 };
 
 struct dp_offload_thread {
-struct ovs_mutex mutex;
-struct ovs_list list;
-uint64_t enqueued_item;
+struct mpsc_queue queue;
+atomic_uint64_t enqueued_item;
 struct mov_avg_cma cma;
 struct mov_avg_ema ema;
-pthread_cond_t cond;
 };
 
 static struct dp_offload_thread dp_offload_thread = {
-.mutex = OVS_MUTEX_INITIALIZER,
-.list  = OVS_LIST_INITIALIZER(&dp_offload_thread.list),
-.enqueued_item = 0,
+.queue = MPSC_QUEUE_INITIALIZER(&dp_offload_thread.queue),
+.enqueued_item = ATOMIC_VAR_INIT(0),
 .cma = MOV_AVG_CMA_INITIALIZER,
 .ema = MOV_AVG_EMA_INITIALIZER(100),
 };
@@ -2697,11 +2695,8 @@ dp_netdev_free_offload(struct dp_offload_thread_item 
*offload)
 static void
 dp_netdev_append_offload(struct dp_offload_thread_item *offload)
 {
-ovs_mutex_lock(&dp_offload_thread.mutex);
-ovs_list_push_back(&dp_offload_thread.list, &offload->node);
-dp_offload_thread.enqueued_item++;
-xpthread_cond_signal(&dp_offload_thread.cond);
-ovs_mutex_unlock(&dp_offload_thread.mutex);
+mpsc_queue_insert(&dp_offload_thread.queue, &offload->node);
+atomic_count_inc64(&dp_offload_thread.enqueued_item);
 }
 
 static int
@@ -2845,58 +2840,68 @@ dp_offload_flush(struct dp_offload_thread_item *item)
 ovs_barrier_block(flush->barrier);
 }
 
+#define DP_NETDEV_OFFLOAD_BACKOFF_MIN 1
+#define DP_NETDEV_OFFLOAD_BACKOFF_MAX 64
 #define DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US (10 * 1000) /* 10 ms */
 
 static void *
 dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 {
 struct dp_offload_thread_item *offload;
-struct ovs_list *list;
+struct mpsc_queue_node *node;
+struct mpsc_queue *queue;
 long long int latency_us;
 long long int next_rcu;
 long long int now;
+uint64_t backoff;
 
-next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
-for (;;) {
-ovs_mutex_lock(&dp_offload_thread.mutex);
-if (ovs_list_is_empty(&dp_offload_thread.list)) {
-ovsrcu_quiesce_start();
-ovs_mutex_cond_wait(&dp_offload_thread.cond,
-&dp_offload_thread.mutex);
-ovsrcu_quiesce_end();
-next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
-}
-list = ovs_list_pop_front(&dp_offload_thread.list);
-dp_offload_thread.enqueued_item--;
-offload = CONTAINER_OF(list, struct dp_offload_thread_item, node);
-ovs_mutex_unlock(&dp_offload_thread.mutex);
-
-switch (offload->type) {
-case DP_OFFLOAD_FLOW:
-dp_offload_flow(offload);
-break;
-case DP_OFFLOAD_FLUSH:
-dp_offload_flush(offload);
-break;
-default:
-OVS_NOT_REACHED();
+queue = &dp_offload_thread.queue;
+mpsc_queue_acquire(queue);
+
+while (true) {
+backoff = DP_NETDEV_OFFLOAD_BACKOFF_MIN;
+while (mpsc_queue_tail(queue) == NULL) {
+xnanosleep(backoff * 1E6);
+if (backoff < DP_NETDEV_OFFLOAD_BACKOFF_MAX) {
+ 

[ovs-dev] [PATCH v4 27/27] netdev-dpdk: Remove rte-flow API access locks

2021-06-09 Thread Gaetan Rivet
The rte_flow DPDK API was made thread-safe [1] in release 20.11.
Now that the DPDK offload provider in OVS is thread safe, remove the
locks.

[1]: http://mails.dpdk.org/archives/dev/2020-October/184251.html

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-dpdk.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 9d8096668..c7ebeb4d5 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -5239,9 +5239,7 @@ netdev_dpdk_rte_flow_destroy(struct netdev *netdev,
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 int ret;
 
-ovs_mutex_lock(&dev->mutex);
 ret = rte_flow_destroy(dev->port_id, rte_flow, error);
-ovs_mutex_unlock(&dev->mutex);
 return ret;
 }
 
@@ -5255,9 +5253,7 @@ netdev_dpdk_rte_flow_create(struct netdev *netdev,
 struct rte_flow *flow;
 struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
 
-ovs_mutex_lock(&dev->mutex);
 flow = rte_flow_create(dev->port_id, attr, items, actions, error);
-ovs_mutex_unlock(&dev->mutex);
 return flow;
 }
 
@@ -5285,9 +5281,7 @@ netdev_dpdk_rte_flow_query_count(struct netdev *netdev,
 }
 
 dev = netdev_dpdk_cast(netdev);
-ovs_mutex_lock(&dev->mutex);
 ret = rte_flow_query(dev->port_id, rte_flow, actions, query, error);
-ovs_mutex_unlock(&dev->mutex);
 return ret;
 }
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v4 26/27] dpif-netdev: Use one or more offload threads

2021-06-09 Thread Gaetan Rivet
Read the user configuration in the netdev-offload module to modify the
number of threads used to manage hardware offload requests.

This allows insertions, deletions and modifications to be processed
concurrently.

The offload thread structure was modified to contain all needed
elements. This structure is replicated once per requested thread, and
each instance is used separately.
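
As a sketch of the access pattern this creates (illustrative only), every
access now goes through the slot of the managing thread instead of a single
global instance:

    unsigned int tid = netdev_offload_thread_id();
    struct dp_offload_thread *thread = &dp_offload_threads[tid];

    mpsc_queue_insert(&thread->queue, &offload->node);
    atomic_count_inc64(&thread->enqueued_item);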

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 304 +-
 lib/netdev-offload-dpdk.c |   7 +-
 2 files changed, 204 insertions(+), 107 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index cc7a979d7..73dec57c4 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -460,25 +460,47 @@ struct dp_offload_thread_item {
 };
 
 struct dp_offload_thread {
-struct mpsc_queue queue;
-atomic_uint64_t enqueued_item;
-struct cmap megaflow_to_mark;
-struct cmap mark_to_flow;
-struct mov_avg_cma cma;
-struct mov_avg_ema ema;
+PADDED_MEMBERS(CACHE_LINE_SIZE,
+struct mpsc_queue queue;
+atomic_uint64_t enqueued_item;
+struct cmap megaflow_to_mark;
+struct cmap mark_to_flow;
+struct mov_avg_cma cma;
+struct mov_avg_ema ema;
+);
 };
+static struct dp_offload_thread *dp_offload_threads;
+static void *dp_netdev_flow_offload_main(void *arg);
 
-static struct dp_offload_thread dp_offload_thread = {
-.queue = MPSC_QUEUE_INITIALIZER(&dp_offload_thread.queue),
-.megaflow_to_mark = CMAP_INITIALIZER,
-.mark_to_flow = CMAP_INITIALIZER,
-.enqueued_item = ATOMIC_VAR_INIT(0),
-.cma = MOV_AVG_CMA_INITIALIZER,
-.ema = MOV_AVG_EMA_INITIALIZER(100),
-};
+static void
+dp_netdev_offload_init(void)
+{
+static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+unsigned int nb_offload_thread = netdev_offload_thread_nb();
+unsigned int tid;
+
+if (!ovsthread_once_start(&once)) {
+return;
+}
+
+dp_offload_threads = xcalloc(nb_offload_thread,
+ sizeof *dp_offload_threads);
 
-static struct ovsthread_once offload_thread_once
-= OVSTHREAD_ONCE_INITIALIZER;
+for (tid = 0; tid < nb_offload_thread; tid++) {
+struct dp_offload_thread *thread;
+
+thread = &dp_offload_threads[tid];
+mpsc_queue_init(&thread->queue);
+cmap_init(&thread->megaflow_to_mark);
+cmap_init(&thread->mark_to_flow);
+atomic_init(&thread->enqueued_item, 0);
+mov_avg_cma_init(&thread->cma);
+mov_avg_ema_init(&thread->ema, 100);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, thread);
+}
+
+ovsthread_once_done(&once);
+}
 
 #define XPS_TIMEOUT 50LL/* In microseconds. */
 
@@ -2478,11 +2500,12 @@ megaflow_to_mark_associate(const ovs_u128 *mega_ufid, 
uint32_t mark)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data = xzalloc(sizeof(*data));
+unsigned int tid = netdev_offload_thread_id();
 
 data->mega_ufid = *mega_ufid;
 data->mark = mark;
 
-cmap_insert(&dp_offload_thread.megaflow_to_mark,
+cmap_insert(&dp_offload_threads[tid].megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 }
 
@@ -2492,11 +2515,12 @@ megaflow_to_mark_disassociate(const ovs_u128 *mega_ufid)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 CMAP_FOR_EACH_WITH_HASH (data, node, hash,
- &dp_offload_thread.megaflow_to_mark) {
+ &dp_offload_threads[tid].megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
-cmap_remove(&dp_offload_thread.megaflow_to_mark,
+cmap_remove(&dp_offload_threads[tid].megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 ovsrcu_postpone(free, data);
 return;
@@ -2512,9 +2536,10 @@ megaflow_to_mark_find(const ovs_u128 *mega_ufid)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 CMAP_FOR_EACH_WITH_HASH (data, node, hash,
- &dp_offload_thread.megaflow_to_mark) {
+ &dp_offload_threads[tid].megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
 return data->mark;
 }
@@ -2529,9 +2554,10 @@ megaflow_to_mark_find(const ovs_u128 *mega_ufid)
 static void
 mark_to_flow_associate(const uint32_t mark, struct dp_netdev_flow *flow)
 {
+unsigned int tid = netdev_offload_thread_id();
 dp_netdev_flow_ref(flow);
 
-cmap_insert(&dp_of

[ovs-dev] [PATCH v4 25/27] dpif-netdev: Replace port mutex by rwlock

2021-06-09 Thread Gaetan Rivet
The port mutex protects the netdev mapping, which can be changed by port
addition or port deletion. HW offload operations can be considered read
operations on the port mapping itself. Use a rwlock to differentiate
between read and write operations, allowing concurrent queries and
offload insertions.

Because offload queries, deletions, and reconfigure_datapath() calls all
take the rdlock, the deadlock fixed by [1] is still avoided, as the rdlock
side is recursive as prescribed by the POSIX standard. Executing
'reconfigure_datapath()' only requires the rdlock, but it is sometimes
executed in contexts where the wrlock is taken ('do_add_port()' and
'do_del_port()').

This means that the deadlock described in [2] is still possible and should
be mitigated. The rdlock is taken using 'tryrdlock()' during offload query,
keeping the current behavior.
[1]: 81e89d5c2645 ("dpif-netdev: Make datapath port mutex recursive.")

[2]: 12d0edd75eba ("dpif-netdev: Avoid deadlock with offloading during PMD
 thread deletion.").
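
A condensed sketch of the query-side pattern described above (illustrative
only, assuming the usual OVS rwlock wrappers; error handling trimmed):

    if (!ovs_rwlock_tryrdlock(&dp->port_rwlock)) {
        struct netdev *port = netdev_ports_get(in_port, dpif_type_str);

        /* ... query offload state on 'port' ... */
        ovs_rwlock_unlock(&dp->port_rwlock);
    }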

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 139 +++---
 lib/netdev-offload-dpdk.c |   4 +-
 2 files changed, 72 insertions(+), 71 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 8fe794557..cc7a979d7 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -333,8 +333,8 @@ struct dp_netdev {
 /* Ports.
  *
  * Any lookup into 'ports' or any access to the dp_netdev_ports found
- * through 'ports' requires taking 'port_mutex'. */
-struct ovs_mutex port_mutex;
+ * through 'ports' requires taking 'port_rwlock'. */
+struct ovs_rwlock port_rwlock;
 struct hmap ports;
 struct seq *port_seq;   /* Incremented whenever a port changes. */
 
@@ -410,7 +410,7 @@ static void meter_unlock(const struct dp_netdev *dp, 
uint32_t meter_id)
 
 static struct dp_netdev_port *dp_netdev_lookup_port(const struct dp_netdev *dp,
 odp_port_t)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_RDLOCK(dp->port_rwlock);
 
 enum rxq_cycles_counter_type {
 RXQ_CYCLES_PROC_CURR,   /* Cycles spent successfully polling and
@@ -851,17 +851,17 @@ struct dpif_netdev {
 
 static int get_port_by_number(struct dp_netdev *dp, odp_port_t port_no,
   struct dp_netdev_port **portp)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_RDLOCK(dp->port_rwlock);
 static int get_port_by_name(struct dp_netdev *dp, const char *devname,
 struct dp_netdev_port **portp)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_RDLOCK(dp->port_rwlock);
 static void dp_netdev_free(struct dp_netdev *)
 OVS_REQUIRES(dp_netdev_mutex);
 static int do_add_port(struct dp_netdev *dp, const char *devname,
const char *type, odp_port_t port_no)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_WRLOCK(dp->port_rwlock);
 static void do_del_port(struct dp_netdev *dp, struct dp_netdev_port *)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_WRLOCK(dp->port_rwlock);
 static int dpif_netdev_open(const struct dpif_class *, const char *name,
 bool create, struct dpif **);
 static void dp_netdev_execute_actions(struct dp_netdev_pmd_thread *pmd,
@@ -882,7 +882,7 @@ static void dp_netdev_configure_pmd(struct 
dp_netdev_pmd_thread *pmd,
 int numa_id);
 static void dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_set_nonpmd(struct dp_netdev *dp)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_WRLOCK(dp->port_rwlock);
 
 static void *pmd_thread_main(void *);
 static struct dp_netdev_pmd_thread *dp_netdev_get_pmd(struct dp_netdev *dp,
@@ -919,7 +919,7 @@ static void dp_netdev_offload_flush(struct dp_netdev *dp,
 struct dp_netdev_port *port);
 
 static void reconfigure_datapath(struct dp_netdev *dp)
-OVS_REQUIRES(dp->port_mutex);
+OVS_REQ_RDLOCK(dp->port_rwlock);
 static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
 static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
@@ -1425,8 +1425,8 @@ dpif_netdev_subtable_lookup_set(struct unixctl_conn 
*conn, int argc,
 struct dp_netdev_pmd_thread **pmd_list;
 sorted_poll_thread_list(dp, &pmd_list, &n);
 
-/* take port mutex as HMAP iters over them. */
-ovs_mutex_lock(&dp->port_mutex);
+/* take port rwlock as HMAP iters over them. */
+ovs_rwlock_rdlock(&dp->port_rwlock);
 
 for (size_t i = 0; i < n; i++) {
 struct dp_netdev_pmd_thread *pmd = pmd_list[i];
@@ -1449,8 +1449,8 @@ dpif_netdev_subtable_lookup_set(struct

[ovs-dev] [PATCH v4 22/27] netdev-offload-dpdk: Protect concurrent offload destroy/query

2021-06-09 Thread Gaetan Rivet
The rte_flow API in DPDK is now thread safe for insertion and deletion.
It is not, however, safe for concurrent queries while the offload is being
inserted or deleted.

Insertion is not an issue as the rte_flow handle will be published to
other threads only once it has been inserted in the hardware, so the
query will only be able to proceed once it is already available.

For the deletion path, however, offload status queries can be made while
an offload is being destroyed. This would create race conditions and
use-after-free issues if not properly protected.

As a pre-step before removing the OVS-level locks on the rte_flow API,
mutually exclude offload query and deletion from concurrent execution.
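
Condensed, the query-side guard added by this patch follows a
check / trylock / re-check shape (sketch only, return values simplified):

    if (!data || data->dead || ovs_mutex_trylock(&data->lock)) {
        return -1;                  /* Gone, dying, or being destroyed. */
    }
    if (data->dead) {               /* Re-check now that the lock is held. */
        ovs_mutex_unlock(&data->lock);
        return -1;
    }
    /* ... safe to query 'data->rte_flow' ... */
    ovs_mutex_unlock(&data->lock);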

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-dpdk.c | 39 ---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index 4459a0aa1..13e017ef8 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -58,6 +58,8 @@ struct ufid_to_rte_flow_data {
 struct rte_flow *rte_flow;
 bool actions_offloaded;
 struct dpif_flow_stats stats;
+struct ovs_mutex lock;
+bool dead;
 };
 
 struct netdev_offload_dpdk_data {
@@ -233,6 +235,7 @@ ufid_to_rte_flow_associate(struct netdev *netdev, const 
ovs_u128 *ufid,
 data->netdev = netdev_ref(netdev);
 data->rte_flow = rte_flow;
 data->actions_offloaded = actions_offloaded;
+ovs_mutex_init(&data->lock);
 
 cmap_insert(map, CONST_CAST(struct cmap_node *, &data->node), hash);
 
@@ -240,8 +243,16 @@ ufid_to_rte_flow_associate(struct netdev *netdev, const 
ovs_u128 *ufid,
 return data;
 }
 
+static void
+rte_flow_data_unref(struct ufid_to_rte_flow_data *data)
+{
+ovs_mutex_destroy(&data->lock);
+free(data);
+}
+
 static inline void
 ufid_to_rte_flow_disassociate(struct ufid_to_rte_flow_data *data)
+OVS_REQUIRES(data->lock)
 {
 size_t hash = hash_bytes(&data->ufid, sizeof data->ufid, 0);
 struct cmap *map = offload_data_map(data->netdev);
@@ -255,7 +266,7 @@ ufid_to_rte_flow_disassociate(struct ufid_to_rte_flow_data 
*data)
 offload_data_unlock(data->netdev);
 
 netdev_close(data->netdev);
-ovsrcu_postpone(free, data);
+ovsrcu_postpone(rte_flow_data_unref, data);
 }
 
 /*
@@ -1581,6 +1592,15 @@ netdev_offload_dpdk_flow_destroy(struct 
ufid_to_rte_flow_data *rte_flow_data)
 ovs_u128 *ufid;
 int ret;
 
+ovs_mutex_lock(&rte_flow_data->lock);
+
+if (rte_flow_data->dead) {
+ovs_mutex_unlock(&rte_flow_data->lock);
+return 0;
+}
+
+rte_flow_data->dead = true;
+
 rte_flow = rte_flow_data->rte_flow;
 netdev = rte_flow_data->netdev;
 ufid = &rte_flow_data->ufid;
@@ -1607,6 +1627,8 @@ netdev_offload_dpdk_flow_destroy(struct 
ufid_to_rte_flow_data *rte_flow_data)
  UUID_ARGS((struct uuid *) ufid));
 }
 
+ovs_mutex_unlock(&rte_flow_data->lock);
+
 return ret;
 }
 
@@ -1702,8 +1724,19 @@ netdev_offload_dpdk_flow_get(struct netdev *netdev,
 struct rte_flow_error error;
 int ret = 0;
 
+attrs->dp_extra_info = NULL;
+
 rte_flow_data = ufid_to_rte_flow_data_find(netdev, ufid, false);
-if (!rte_flow_data || !rte_flow_data->rte_flow) {
+if (!rte_flow_data || !rte_flow_data->rte_flow ||
+rte_flow_data->dead || ovs_mutex_trylock(&rte_flow_data->lock)) {
+return -1;
+}
+
+/* Check again whether the data is dead, as it could have been
+ * updated while the lock was not yet taken. The first check above
+ * was only to avoid unnecessary locking if possible.
+ */
+if (rte_flow_data->dead) {
 ret = -1;
 goto out;
 }
@@ -1730,7 +1763,7 @@ netdev_offload_dpdk_flow_get(struct netdev *netdev,
 }
 memcpy(stats, &rte_flow_data->stats, sizeof *stats);
 out:
-attrs->dp_extra_info = NULL;
+ovs_mutex_unlock(&rte_flow_data->lock);
 return ret;
 }
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v4 21/27] netdev-offload-dpdk: Lock rte_flow map access

2021-06-09 Thread Gaetan Rivet
Add a lock to access the ufid to rte_flow map.  This will protect it
from concurrent write accesses when multiple threads attempt them.

At this point, the reason for taking the lock is no longer to fulfill the
needs of the DPDK offload implementation. Rewrite the comments
to reflect this change. The lock is still needed to protect against
changes to the netdev port mapping.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c |  8 ++---
 lib/netdev-offload-dpdk.c | 61 ---
 2 files changed, 61 insertions(+), 8 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 82e55e60b..1daaecb1c 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2590,7 +2590,7 @@ mark_to_flow_disassociate(struct dp_netdev_pmd_thread 
*pmd,
 port = netdev_ports_get(in_port, dpif_type_str);
 if (port) {
 /* Taking a global 'port_mutex' to fulfill thread safety
- * restrictions for the netdev-offload-dpdk module. */
+ * restrictions regarding netdev port mapping. */
 ovs_mutex_lock(&pmd->dp->port_mutex);
 ret = netdev_flow_del(port, &flow->mega_ufid, NULL);
 ovs_mutex_unlock(&pmd->dp->port_mutex);
@@ -2770,8 +2770,8 @@ dp_netdev_flow_offload_put(struct dp_offload_flow_item 
*offload)
 netdev_close(port);
 goto err_free;
 }
-/* Taking a global 'port_mutex' to fulfill thread safety restrictions for
- * the netdev-offload-dpdk module. */
+/* Taking a global 'port_mutex' to fulfill thread safety
+ * restrictions regarding the netdev port mapping. */
 ovs_mutex_lock(&pmd->dp->port_mutex);
 ret = netdev_flow_put(port, &offload->match,
   CONST_CAST(struct nlattr *, offload->actions),
@@ -3573,7 +3573,7 @@ dpif_netdev_get_flow_offload_status(const struct 
dp_netdev *dp,
 }
 ofpbuf_use_stack(&buf, &act_buf, sizeof act_buf);
 /* Taking a global 'port_mutex' to fulfill thread safety
- * restrictions for the netdev-offload-dpdk module.
+ * restrictions regarding netdev port mapping.
  *
  * XXX: Main thread will try to pause/stop all revalidators during datapath
  *  reconfiguration via datapath purge callback (dp_purge_cb) while
diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index ecdc846e1..4459a0aa1 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -38,9 +38,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(100, 
5);
  *
  * Below API is NOT thread safe in following terms:
  *
- *  - The caller must be sure that none of these functions will be called
- *simultaneously.  Even for different 'netdev's.
- *
  *  - The caller must be sure that 'netdev' will not be destructed/deallocated.
  *
  *  - The caller must be sure that 'netdev' configuration will not be changed.
@@ -66,6 +63,7 @@ struct ufid_to_rte_flow_data {
 struct netdev_offload_dpdk_data {
 struct cmap ufid_to_rte_flow;
 uint64_t *rte_flow_counters;
+struct ovs_mutex map_lock;
 };
 
 static int
@@ -74,6 +72,7 @@ offload_data_init(struct netdev *netdev)
 struct netdev_offload_dpdk_data *data;
 
 data = xzalloc(sizeof *data);
+ovs_mutex_init(&data->map_lock);
 cmap_init(&data->ufid_to_rte_flow);
 data->rte_flow_counters = xcalloc(netdev_offload_thread_nb(),
   sizeof *data->rte_flow_counters);
@@ -86,6 +85,7 @@ offload_data_init(struct netdev *netdev)
 static void
 offload_data_destroy__(struct netdev_offload_dpdk_data *data)
 {
+ovs_mutex_destroy(&data->map_lock);
 free(data->rte_flow_counters);
 free(data);
 }
@@ -117,6 +117,34 @@ offload_data_destroy(struct netdev *netdev)
 ovsrcu_set(&netdev->hw_info.offload_data, NULL);
 }
 
+static void
+offload_data_lock(struct netdev *netdev)
+OVS_NO_THREAD_SAFETY_ANALYSIS
+{
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+if (!data) {
+return;
+}
+ovs_mutex_lock(&data->map_lock);
+}
+
+static void
+offload_data_unlock(struct netdev *netdev)
+OVS_NO_THREAD_SAFETY_ANALYSIS
+{
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+if (!data) {
+return;
+}
+ovs_mutex_unlock(&data->map_lock);
+}
+
 static struct cmap *
 offload_data_map(struct netdev *netdev)
 {
@@ -155,6 +183,24 @@ ufid_to_rte_flow_data_find(struct netdev *netdev,
 return NULL;
 }
 
+/* Find rte_flow with @ufid, lock-protected. */
+static struct ufid_to_rte_flow_data *
+ufid_to_rte_flow_dat

[ovs-dev] [PATCH v4 20/27] netdev-offload-dpdk: Use per-thread HW offload stats

2021-06-09 Thread Gaetan Rivet
The implementation of hardware offload counters is currently meant to be
managed by a single thread. Use the offload thread pool API to manage
one counter per thread.
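
A reader that wants a single total can then aggregate the per-thread
counters, each of which is only ever written by its own thread (sketch
only):

    uint64_t total = 0;

    for (unsigned int tid = 0; tid < netdev_offload_thread_nb(); tid++) {
        total += data->rte_flow_counters[tid];
    }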

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-dpdk.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index c43e8b968..ecdc846e1 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -65,7 +65,7 @@ struct ufid_to_rte_flow_data {
 
 struct netdev_offload_dpdk_data {
 struct cmap ufid_to_rte_flow;
-uint64_t rte_flow_counter;
+uint64_t *rte_flow_counters;
 };
 
 static int
@@ -75,6 +75,8 @@ offload_data_init(struct netdev *netdev)
 
 data = xzalloc(sizeof *data);
 cmap_init(&data->ufid_to_rte_flow);
+data->rte_flow_counters = xcalloc(netdev_offload_thread_nb(),
+  sizeof *data->rte_flow_counters);
 
 ovsrcu_set(&netdev->hw_info.offload_data, (void *) data);
 
@@ -84,6 +86,7 @@ offload_data_init(struct netdev *netdev)
 static void
 offload_data_destroy__(struct netdev_offload_dpdk_data *data)
 {
+free(data->rte_flow_counters);
 free(data);
 }
 
@@ -646,10 +649,11 @@ netdev_offload_dpdk_flow_create(struct netdev *netdev,
 flow = netdev_dpdk_rte_flow_create(netdev, attr, items, actions, error);
 if (flow) {
 struct netdev_offload_dpdk_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 data = (struct netdev_offload_dpdk_data *)
 ovsrcu_get(void *, &netdev->hw_info.offload_data);
-data->rte_flow_counter++;
+data->rte_flow_counters[tid]++;
 
 if (!VLOG_DROP_DBG(&rl)) {
 dump_flow(&s, &s_extra, attr, items, actions);
@@ -1532,10 +1536,11 @@ netdev_offload_dpdk_flow_destroy(struct 
ufid_to_rte_flow_data *rte_flow_data)
 
 if (ret == 0) {
 struct netdev_offload_dpdk_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 data = (struct netdev_offload_dpdk_data *)
 ovsrcu_get(void *, &netdev->hw_info.offload_data);
-data->rte_flow_counter--;
+data->rte_flow_counters[tid]--;
 
 ufid_to_rte_flow_disassociate(rte_flow_data);
 VLOG_DBG_RL(&rl, "%s: rte_flow 0x%"PRIxPTR
@@ -1698,6 +1703,7 @@ netdev_offload_dpdk_get_n_flows(struct netdev *netdev,
 uint64_t *n_flows)
 {
 struct netdev_offload_dpdk_data *data;
+unsigned int tid;
 
 data = (struct netdev_offload_dpdk_data *)
 ovsrcu_get(void *, &netdev->hw_info.offload_data);
@@ -1705,7 +1711,9 @@ netdev_offload_dpdk_get_n_flows(struct netdev *netdev,
 return -1;
 }
 
-*n_flows = data->rte_flow_counter;
+for (tid = 0; tid < netdev_offload_thread_nb(); tid++) {
+n_flows[tid] = data->rte_flow_counters[tid];
+}
 
 return 0;
 }
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v4 19/27] dpif-netdev: Execute flush from offload thread

2021-06-09 Thread Gaetan Rivet
When a port is deleted, its offloads must be flushed.  The operation
runs in the thread that initiated it.  Offload data is thus accessed
jointly by the port deletion thread(s) and the offload thread, which
complicates the data access model.

To simplify this model, as a pre-step toward introducing parallel
offloads, execute the flush operation in the offload thread.
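
The initiating thread synchronizes with the offload thread through a
barrier crossed twice, matching dp_offload_flush() below; a rough sketch of
the initiator side (setup simplified, illustrative only):

    struct ovs_barrier barrier;

    ovs_barrier_init(&barrier, 2);      /* Initiator + offload thread. */
    /* ... enqueue a DP_OFFLOAD_FLUSH item referencing 'barrier' ... */
    ovs_barrier_block(&barrier);        /* Wait for the flush to complete. */
    ovs_barrier_block(&barrier);        /* Let the offload thread resume. */
    ovs_barrier_destroy(&barrier);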

Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 126 --
 1 file changed, 122 insertions(+), 4 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 1d7e55d47..82e55e60b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -422,6 +422,7 @@ enum rxq_cycles_counter_type {
 
 enum dp_offload_type {
 DP_OFFLOAD_FLOW,
+DP_OFFLOAD_FLUSH,
 };
 
 enum {
@@ -439,8 +440,15 @@ struct dp_offload_flow_item {
 size_t actions_len;
 };
 
+struct dp_offload_flush_item {
+struct dp_netdev *dp;
+struct netdev *netdev;
+struct ovs_barrier *barrier;
+};
+
 union dp_offload_thread_data {
 struct dp_offload_flow_item flow;
+struct dp_offload_flush_item flush;
 };
 
 struct dp_offload_thread_item {
@@ -905,6 +913,9 @@ static void dp_netdev_del_bond_tx_from_pmd(struct 
dp_netdev_pmd_thread *pmd,
uint32_t bond_id)
 OVS_EXCLUDED(pmd->bond_mutex);
 
+static void dp_netdev_offload_flush(struct dp_netdev *dp,
+struct dp_netdev_port *port);
+
 static void reconfigure_datapath(struct dp_netdev *dp)
 OVS_REQUIRES(dp->port_mutex);
 static bool dp_netdev_pmd_try_ref(struct dp_netdev_pmd_thread *pmd);
@@ -2305,7 +2316,7 @@ static void
 do_del_port(struct dp_netdev *dp, struct dp_netdev_port *port)
 OVS_REQUIRES(dp->port_mutex)
 {
-netdev_flow_flush(port->netdev);
+dp_netdev_offload_flush(dp, port);
 netdev_uninit_flow_api(port->netdev);
 hmap_remove(&dp->ports, &port->node);
 seq_change(dp->port_seq);
@@ -2675,13 +2686,16 @@ dp_netdev_free_offload(struct dp_offload_thread_item 
*offload)
 case DP_OFFLOAD_FLOW:
 dp_netdev_free_flow_offload(offload);
 break;
+case DP_OFFLOAD_FLUSH:
+free(offload);
+break;
 default:
 OVS_NOT_REACHED();
 };
 }
 
 static void
-dp_netdev_append_flow_offload(struct dp_offload_thread_item *offload)
+dp_netdev_append_offload(struct dp_offload_thread_item *offload)
 {
 ovs_mutex_lock(&dp_offload_thread.mutex);
 ovs_list_push_back(&dp_offload_thread.list, &offload->node);
@@ -2814,6 +2828,23 @@ dp_offload_flow(struct dp_offload_thread_item *item)
  UUID_ARGS((struct uuid *) &flow_offload->flow->mega_ufid));
 }
 
+static void
+dp_offload_flush(struct dp_offload_thread_item *item)
+{
+struct dp_offload_flush_item *flush = &item->data->flush;
+
+ovs_mutex_lock(&flush->dp->port_mutex);
+netdev_flow_flush(flush->netdev);
+ovs_mutex_unlock(&flush->dp->port_mutex);
+
+ovs_barrier_block(flush->barrier);
+
+/* Allow the other thread to take again the port lock, before
+ * continuing offload operations in this thread.
+ */
+ovs_barrier_block(flush->barrier);
+}
+
 #define DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US (10 * 1000) /* 10 ms */
 
 static void *
@@ -2844,6 +2875,9 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 case DP_OFFLOAD_FLOW:
 dp_offload_flow(offload);
 break;
+case DP_OFFLOAD_FLUSH:
+dp_offload_flush(offload);
+break;
 default:
 OVS_NOT_REACHED();
 }
@@ -2881,7 +2915,7 @@ queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
 offload = dp_netdev_alloc_flow_offload(pmd, flow,
DP_NETDEV_FLOW_OFFLOAD_OP_DEL);
 offload->timestamp = pmd->ctx.now;
-dp_netdev_append_flow_offload(offload);
+dp_netdev_append_offload(offload);
 }
 
 static void
@@ -2916,7 +2950,7 @@ queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
 flow_offload->actions_len = actions_len;
 
 item->timestamp = pmd->ctx.now;
-dp_netdev_append_flow_offload(item);
+dp_netdev_append_offload(item);
 }
 
 static void
@@ -2940,6 +2974,90 @@ dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread 
*pmd,
 dp_netdev_flow_unref(flow);
 }
 
+static void
+dp_netdev_offload_flush_enqueue(struct dp_netdev *dp,
+struct netdev *netdev,
+struct ovs_barrier *barrier)
+{
+struct dp_offload_thread_item *item;
+struct dp_offload_flush_item *flush;
+
+if (ovsthread_once_start(&offload_thread_once)) {
+xpthread_cond_init(&dp_offload_thread.cond, NULL);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, NULL);
+ovsthread_once_done(&offload_thread_on

[ovs-dev] [PATCH v4 17/27] dpif-netdev: Use id-fpool for mark allocation

2021-06-09 Thread Gaetan Rivet
Use the netdev-offload multithread API to allow multiple threads to
allocate marks concurrently.

Initialize the pool only once in a multithreaded context by using
the ovsthread_once type.

Use the id-fpool module for faster concurrent ID allocation.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 75b289904..b8fd49f5d 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -49,6 +49,7 @@
 #include "fat-rwlock.h"
 #include "flow.h"
 #include "hmapx.h"
+#include "id-fpool.h"
 #include "id-pool.h"
 #include "ipf.h"
 #include "mov-avg.h"
@@ -2418,7 +2419,7 @@ struct megaflow_to_mark_data {
 struct flow_mark {
 struct cmap megaflow_to_mark;
 struct cmap mark_to_flow;
-struct id_pool *pool;
+struct id_fpool *pool;
 };
 
 static struct flow_mark flow_mark = {
@@ -2429,14 +2430,18 @@ static struct flow_mark flow_mark = {
 static uint32_t
 flow_mark_alloc(void)
 {
+static struct ovsthread_once pool_init = OVSTHREAD_ONCE_INITIALIZER;
+unsigned int tid = netdev_offload_thread_id();
 uint32_t mark;
 
-if (!flow_mark.pool) {
+if (ovsthread_once_start(&pool_init)) {
 /* Haven't initiated yet, do it here */
-flow_mark.pool = id_pool_create(1, MAX_FLOW_MARK);
+flow_mark.pool = id_fpool_create(netdev_offload_thread_nb(),
+ 1, MAX_FLOW_MARK);
+ovsthread_once_done(&pool_init);
 }
 
-if (id_pool_alloc_id(flow_mark.pool, &mark)) {
+if (id_fpool_new_id(flow_mark.pool, tid, &mark)) {
 return mark;
 }
 
@@ -2446,7 +2451,9 @@ flow_mark_alloc(void)
 static void
 flow_mark_free(uint32_t mark)
 {
-id_pool_free_id(flow_mark.pool, mark);
+unsigned int tid = netdev_offload_thread_id();
+
+id_fpool_free_id(flow_mark.pool, tid, mark);
 }
 
 /* associate megaflow with a mark, which is a 1:1 mapping */
-- 
2.31.1



[ovs-dev] [PATCH v4 18/27] dpif-netdev: Introduce tagged union of offload requests

2021-06-09 Thread Gaetan Rivet
Offload requests currently only support flow offloads.
As a pre-step before supporting an offload flush request,
modify the layout of an offload request item to become a tagged union.

Future offload types won't be forced to re-use the full flow offload
structure, which consumes a lot of memory.
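
As a rough illustration of the new layout (a sketch, not the code from the
diff below): the common header carries the type tag, and each request type
only pays for its own payload appended through the flexible array member.

    /* Allocate a flow request: header plus the flow payload only. */
    static struct dp_offload_thread_item *
    offload_item_alloc_sketch(void)
    {
        struct dp_offload_thread_item *item;

        item = xzalloc(sizeof *item + sizeof item->data->flow);
        item->type = DP_OFFLOAD_FLOW;
        return item;
    }

    /* Dispatch on the tag; item->data->flow is valid only for DP_OFFLOAD_FLOW. */
    static void
    offload_item_handle_sketch(struct dp_offload_thread_item *item)
    {
        switch (item->type) {
        case DP_OFFLOAD_FLOW:
            /* ... handle the flow request ... */
            break;
        default:
            OVS_NOT_REACHED();
        }
    }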

Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 128 --
 1 file changed, 89 insertions(+), 39 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index b8fd49f5d..1d7e55d47 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -420,22 +420,34 @@ enum rxq_cycles_counter_type {
 RXQ_N_CYCLES
 };
 
+enum dp_offload_type {
+DP_OFFLOAD_FLOW,
+};
+
 enum {
 DP_NETDEV_FLOW_OFFLOAD_OP_ADD,
 DP_NETDEV_FLOW_OFFLOAD_OP_MOD,
 DP_NETDEV_FLOW_OFFLOAD_OP_DEL,
 };
 
-struct dp_offload_thread_item {
+struct dp_offload_flow_item {
 struct dp_netdev_pmd_thread *pmd;
 struct dp_netdev_flow *flow;
 int op;
 struct match match;
 struct nlattr *actions;
 size_t actions_len;
-long long int timestamp;
+};
 
+union dp_offload_thread_data {
+struct dp_offload_flow_item flow;
+};
+
+struct dp_offload_thread_item {
 struct ovs_list node;
+enum dp_offload_type type;
+long long int timestamp;
+union dp_offload_thread_data data[0];
 };
 
 struct dp_offload_thread {
@@ -2619,34 +2631,55 @@ dp_netdev_alloc_flow_offload(struct 
dp_netdev_pmd_thread *pmd,
  struct dp_netdev_flow *flow,
  int op)
 {
-struct dp_offload_thread_item *offload;
+struct dp_offload_thread_item *item;
+struct dp_offload_flow_item *flow_offload;
+
+item = xzalloc(sizeof *item + sizeof *flow_offload);
+flow_offload = &item->data->flow;
 
-offload = xzalloc(sizeof(*offload));
-offload->pmd = pmd;
-offload->flow = flow;
-offload->op = op;
+item->type = DP_OFFLOAD_FLOW;
+
+flow_offload->pmd = pmd;
+flow_offload->flow = flow;
+flow_offload->op = op;
 
 dp_netdev_flow_ref(flow);
 dp_netdev_pmd_try_ref(pmd);
 
-return offload;
+return item;
 }
 
 static void
 dp_netdev_free_flow_offload__(struct dp_offload_thread_item *offload)
 {
-free(offload->actions);
+struct dp_offload_flow_item *flow_offload = &offload->data->flow;
+
+free(flow_offload->actions);
 free(offload);
 }
 
 static void
 dp_netdev_free_flow_offload(struct dp_offload_thread_item *offload)
 {
-dp_netdev_pmd_unref(offload->pmd);
-dp_netdev_flow_unref(offload->flow);
+struct dp_offload_flow_item *flow_offload = &offload->data->flow;
+
+dp_netdev_pmd_unref(flow_offload->pmd);
+dp_netdev_flow_unref(flow_offload->flow);
 ovsrcu_postpone(dp_netdev_free_flow_offload__, offload);
 }
 
+static void
+dp_netdev_free_offload(struct dp_offload_thread_item *offload)
+{
+switch (offload->type) {
+case DP_OFFLOAD_FLOW:
+dp_netdev_free_flow_offload(offload);
+break;
+default:
+OVS_NOT_REACHED();
+};
+}
+
 static void
 dp_netdev_append_flow_offload(struct dp_offload_thread_item *offload)
 {
@@ -2658,7 +2691,7 @@ dp_netdev_append_flow_offload(struct 
dp_offload_thread_item *offload)
 }
 
 static int
-dp_netdev_flow_offload_del(struct dp_offload_thread_item *offload)
+dp_netdev_flow_offload_del(struct dp_offload_flow_item *offload)
 {
 return mark_to_flow_disassociate(offload->pmd, offload->flow);
 }
@@ -2675,7 +2708,7 @@ dp_netdev_flow_offload_del(struct dp_offload_thread_item 
*offload)
  * valid, thus only item 2 needed.
  */
 static int
-dp_netdev_flow_offload_put(struct dp_offload_thread_item *offload)
+dp_netdev_flow_offload_put(struct dp_offload_flow_item *offload)
 {
 struct dp_netdev_pmd_thread *pmd = offload->pmd;
 struct dp_netdev_flow *flow = offload->flow;
@@ -2752,6 +2785,35 @@ err_free:
 return -1;
 }
 
+static void
+dp_offload_flow(struct dp_offload_thread_item *item)
+{
+struct dp_offload_flow_item *flow_offload = &item->data->flow;
+const char *op;
+int ret;
+
+switch (flow_offload->op) {
+case DP_NETDEV_FLOW_OFFLOAD_OP_ADD:
+op = "add";
+ret = dp_netdev_flow_offload_put(flow_offload);
+break;
+case DP_NETDEV_FLOW_OFFLOAD_OP_MOD:
+op = "modify";
+ret = dp_netdev_flow_offload_put(flow_offload);
+break;
+case DP_NETDEV_FLOW_OFFLOAD_OP_DEL:
+op = "delete";
+ret = dp_netdev_flow_offload_del(flow_offload);
+break;
+default:
+OVS_NOT_REACHED();
+}
+
+VLOG_DBG("%s to %s netdev flow "UUID_FMT,
+ ret == 0 ? "succeed" : "failed", op,
+ UUID_ARGS((struct uuid *) &flow_offload->flow->mega_ufid));
+}
+
 #define DP_NETDEV_OFFLOAD_QUIESCE

[ovs-dev] [PATCH v4 16/27] dpif-netdev: Postpone flow offload item freeing

2021-06-09 Thread Gaetan Rivet
Profiling the HW offload thread shows that freeing flow offload items takes
approximately 25% of the time. Most of this time is spent waiting on
the futex used by the libc free(), as it triggers a syscall and
reschedules the thread.

Avoid the syscall and its expensive context switch by batching the
freeing of offload messages through RCU.
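
The pattern, condensed into a sketch (the real change is the small diff
below): the offload thread hands the item to RCU instead of freeing it
inline, so the actual free() calls run batched in the RCU thread.

    /* Called by the RCU thread once all readers have quiesced. */
    static void
    offload_item_free_cb(struct dp_offload_thread_item *offload)
    {
        free(offload->actions);
        free(offload);
    }

    static void
    offload_item_release_sketch(struct dp_offload_thread_item *offload)
    {
        /* No direct free() here: defer it, avoiding the futex wait. */
        ovsrcu_postpone(offload_item_free_cb, offload);
    }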

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index e403e461a..75b289904 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2625,14 +2625,19 @@ dp_netdev_alloc_flow_offload(struct 
dp_netdev_pmd_thread *pmd,
 return offload;
 }
 
+static void
+dp_netdev_free_flow_offload__(struct dp_offload_thread_item *offload)
+{
+free(offload->actions);
+free(offload);
+}
+
 static void
 dp_netdev_free_flow_offload(struct dp_offload_thread_item *offload)
 {
 dp_netdev_pmd_unref(offload->pmd);
 dp_netdev_flow_unref(offload->flow);
-
-free(offload->actions);
-free(offload);
+ovsrcu_postpone(dp_netdev_free_flow_offload__, offload);
 }
 
 static void
-- 
2.31.1



[ovs-dev] [PATCH v4 15/27] dpif-netdev: Quiesce offload thread periodically

2021-06-09 Thread Gaetan Rivet
Similar to what was done for the PMD threads [1], reduce the performance
impact of quiescing too often in the offload thread.

After each processed offload, the offload thread currently quiesces and
syncs with RCU. This synchronization can be lengthy and makes the
thread unnecessarily slow.

Instead, attempt to quiesce at most every 10 ms. While the queue is
empty, the offload thread remains quiescent.

[1]: 81ac8b3b194c ("dpif-netdev: Do RCU synchronization at fixed interval
 in PMD main loop.")
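
The resulting loop structure, condensed into a sketch of the idea (details
of the queue handling are omitted):

    static void *
    offload_main_sketch(void *arg OVS_UNUSED)
    {
        long long int next_rcu;

        next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
        for (;;) {
            /* ... wait for and process one offload request ... */

            /* Quiesce at a fixed interval at most, not after every item. */
            if (time_usec() > next_rcu) {
                ovsrcu_quiesce();
                next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
            }
        }
        return NULL;
    }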

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/dpif-netdev.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index a20eeda4d..e403e461a 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2740,15 +2740,20 @@ err_free:
 return -1;
 }
 
+#define DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US (10 * 1000) /* 10 ms */
+
 static void *
 dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 {
 struct dp_offload_thread_item *offload;
 struct ovs_list *list;
 long long int latency_us;
+long long int next_rcu;
+long long int now;
 const char *op;
 int ret;
 
+next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
 for (;;) {
 ovs_mutex_lock(&dp_offload_thread.mutex);
 if (ovs_list_is_empty(&dp_offload_thread.list)) {
@@ -2756,6 +2761,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 ovs_mutex_cond_wait(&dp_offload_thread.cond,
 &dp_offload_thread.mutex);
 ovsrcu_quiesce_end();
+next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
 }
 list = ovs_list_pop_front(&dp_offload_thread.list);
 dp_offload_thread.enqueued_item--;
@@ -2779,7 +2785,9 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 OVS_NOT_REACHED();
 }
 
-latency_us = time_usec() - offload->timestamp;
+now = time_usec();
+
+latency_us = now - offload->timestamp;
 mov_avg_cma_update(&dp_offload_thread.cma, latency_us);
 mov_avg_ema_update(&dp_offload_thread.ema, latency_us);
 
@@ -2787,7 +2795,12 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
  ret == 0 ? "succeed" : "failed", op,
  UUID_ARGS((struct uuid *) &offload->flow->mega_ufid));
 dp_netdev_free_flow_offload(offload);
-ovsrcu_quiesce();
+
+/* Do RCU synchronization at fixed interval. */
+if (now > next_rcu) {
+ovsrcu_quiesce();
+next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
+}
 }
 
 return NULL;
-- 
2.31.1



[ovs-dev] [PATCH v4 13/27] id-fpool: Module for fast ID generation

2021-06-09 Thread Gaetan Rivet
The current id-pool module is slow to allocate the
next valid ID, and can be optimized when restricting
some properties of the pool.

Those restrictions are:

  * No ability to add a random ID to the pool.

  * A new ID is no longer the smallest possible ID.
It is however guaranteed to be in the range of

   [floor, last_alloc + nb_user * cache_size + 1],

where 'cache_size' is the number of IDs in each per-user
cache.  It is defined by 'ID_FPOOL_CACHE_SIZE' as 64.

  * A user should never free an ID that is not allocated.
No checks are done and doing so will duplicate the spurious
ID.  Refcounting or another memory management scheme should
be used to ensure an object and its ID are only freed once.

This allocator is designed to scale reasonably well in a multithreaded
setup.  As it is aimed at being a faster replacement for the current
id-pool, a benchmark has been implemented alongside unit tests.

The benchmark is composed of 4 rounds: 'new', 'del', 'mix', and 'rnd'.
Respectively:

  + 'new': only allocate IDs
  + 'del': only free IDs
  + 'mix': allocate, sequential free, then allocate ID.
  + 'rnd': allocate, random free, allocate ID.

Randomized freeing is done by swapping the latest allocated ID with any
ID from the range of currently allocated IDs, which is reminiscent of
the Fisher-Yates shuffle.  This evaluates freeing non-sequential IDs,
which is the more natural use case.
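
A sketch of that randomized round (only the swap-with-a-random-index idea
comes from the description above; the helper name is illustrative):

    /* Free 'n' allocated IDs in random order, Fisher-Yates style. */
    static void
    random_free_round_sketch(struct id_fpool *pool, unsigned int tid,
                             uint32_t *ids, size_t n)
    {
        size_t i, j;
        uint32_t tmp;

        for (i = n; i > 0; i--) {
            j = random_range(i);        /* 0 <= j < i. */
            tmp = ids[i - 1];
            ids[i - 1] = ids[j];
            ids[j] = tmp;
            id_fpool_free_id(pool, tid, ids[i - 1]);
        }
    }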

For this specific round, the id-pool performance is such that a timeout
of 10 seconds is added to the benchmark:

   $ ./tests/ovstest test-id-fpool benchmark 1 1
   Benchmarking n=1 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:       1      1 ms
   id-fpool del:       1      1 ms
   id-fpool mix:       2      2 ms
   id-fpool rnd:       2      2 ms
    id-pool new:       4      4 ms
    id-pool del:       2      2 ms
    id-pool mix:       6      6 ms
    id-pool rnd:     431    431 ms

   $ ./tests/ovstest test-id-fpool benchmark 10 1
   Benchmarking n=10 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:       2      2 ms
   id-fpool del:       2      2 ms
   id-fpool mix:       3      3 ms
   id-fpool rnd:       4      4 ms
    id-pool new:      12     12 ms
    id-pool del:       5      5 ms
    id-pool mix:      16     16 ms
    id-pool rnd:      1+     -1 ms

   $ ./tests/ovstest test-id-fpool benchmark 100 1
   Benchmarking n=100 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:      15     15 ms
   id-fpool del:      12     12 ms
   id-fpool mix:      34     34 ms
   id-fpool rnd:      48     48 ms
    id-pool new:     276    276 ms
    id-pool del:     286    286 ms
    id-pool mix:     448    448 ms
    id-pool rnd:      1+     -1 ms

Running only a performance test on the fast pool:

   $ ./tests/ovstest test-id-fpool perf 100 1
   Benchmarking n=100 on 1 thread.
    type\thread:       1    Avg
   id-fpool new:      15     15 ms
   id-fpool del:      12     12 ms
   id-fpool mix:      34     34 ms
   id-fpool rnd:      47     47 ms

   $ ./tests/ovstest test-id-fpool perf 100 2
   Benchmarking n=100 on 2 threads.
    type\thread:       1      2    Avg
   id-fpool new:      11     11     11 ms
   id-fpool del:      10     10     10 ms
   id-fpool mix:      24     24     24 ms
   id-fpool rnd:      30     30     30 ms

   $ ./tests/ovstest test-id-fpool perf 100 4
   Benchmarking n=100 on 4 threads.
    type\thread:       1      2      3      4    Avg
   id-fpool new:       9     11     11     10     10 ms
   id-fpool del:       5      6      6      5      5 ms
   id-fpool mix:      16     16     16     16     16 ms
   id-fpool rnd:      20     20     20     20     20 ms

Signed-off-by: Gaetan Rivet 
---
 lib/automake.mk   |   2 +
 lib/id-fpool.c| 279 +++
 lib/id-fpool.h|  66 +
 tests/automake.mk |   1 +
 tests/library.at  |   4 +
 tests/test-id-fpool.c | 615 ++
 6 files changed, 967 insertions(+)
 create mode 100644 lib/id-fpool.c
 create mode 100644 lib/id-fpool.h
 create mode 100644 tests/test-id-fpool.c

diff --git a/lib/automake.mk b/lib/automake.mk
index b45801852..afff2e09c 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -138,6 +138,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/hmap.c \
lib/hmapx.c \
lib/hmapx.h \
+   lib/id-fpool.c \
+   lib/id-fpool.h \
lib/id-pool.c \
lib/id-pool.h \
lib/if-notifier-manual.c \
diff --git a/lib/id-fpool.c b/lib/id-fpool.c
new file mode 100644
index 0..15cef5d00
--- /dev/null
+++ b/lib/id-fpool.c
@@ -0,0 +1,279 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obta

[ovs-dev] [PATCH v4 14/27] netdev-offload: Add multi-thread API

2021-06-09 Thread Gaetan Rivet
Expose functions reporting user configuration of offloading threads, as
well as utility functions for multithreading.

This only exposes the configuration knob to the user; no datapath
implements the multiple-thread request yet.

This will allow implementations to use this API for offload thread
management in relevant layers before enabling the actual dataplane
implementation.

The offload thread ID is lazily allocated and may therefore be assigned
in a different order than the offload thread start sequence.

The RCU thread will sometimes access hardware-offload objects from
a provider for reclamation purposes.  In that case, it gets a default
offload thread ID of 0.  Care must be taken that using this thread ID
concurrently with the offload threads remains safe.
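
For example, a datapath wanting to distribute offload requests could use
the new helpers like this (a sketch; 'struct offload_item' and the
per-thread queue helper are hypothetical):

    /* Route an offload request to the thread owning its UFID. */
    static void
    offload_enqueue_sketch(const ovs_u128 *ufid, struct offload_item *item)
    {
        unsigned int tid = netdev_offload_ufid_to_thread_id(*ufid);

        ovs_assert(tid < netdev_offload_thread_nb());
        per_thread_queue_push(tid, item);    /* Hypothetical per-thread queue. */
    }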

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-provider.h |  1 +
 lib/netdev-offload.c  | 88 ++-
 lib/netdev-offload.h  | 19 
 vswitchd/vswitch.xml  | 16 +++
 4 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/lib/netdev-offload-provider.h b/lib/netdev-offload-provider.h
index 2127599d3..e02330a43 100644
--- a/lib/netdev-offload-provider.h
+++ b/lib/netdev-offload-provider.h
@@ -84,6 +84,7 @@ struct netdev_flow_api {
 struct dpif_flow_stats *);
 
 /* Get the number of flows offloaded to netdev.
+ * 'n_flows' is an array of counters, one per offload thread.
  * Return 0 if successful, otherwise returns a positive errno value. */
 int (*flow_get_n_flows)(struct netdev *, uint64_t *n_flows);
 
diff --git a/lib/netdev-offload.c b/lib/netdev-offload.c
index deefefd63..087302fd3 100644
--- a/lib/netdev-offload.c
+++ b/lib/netdev-offload.c
@@ -60,6 +60,12 @@ VLOG_DEFINE_THIS_MODULE(netdev_offload);
 
 static bool netdev_flow_api_enabled = false;
 
+#define DEFAULT_OFFLOAD_THREAD_NB 1
+#define MAX_OFFLOAD_THREAD_NB 10
+
+static unsigned int offload_thread_nb = DEFAULT_OFFLOAD_THREAD_NB;
+DEFINE_EXTERN_PER_THREAD_DATA(netdev_offload_thread_id, OVSTHREAD_ID_UNSET);
+
 /* Protects 'netdev_flow_apis'.  */
 static struct ovs_mutex netdev_flow_api_provider_mutex = OVS_MUTEX_INITIALIZER;
 
@@ -436,6 +442,64 @@ netdev_is_flow_api_enabled(void)
 return netdev_flow_api_enabled;
 }
 
+unsigned int
+netdev_offload_thread_nb(void)
+{
+return offload_thread_nb;
+}
+
+unsigned int
+netdev_offload_ufid_to_thread_id(const ovs_u128 ufid)
+{
+uint32_t ufid_hash;
+
+if (netdev_offload_thread_nb() == 1) {
+return 0;
+}
+
+ufid_hash = hash_words64_inline(
+(const uint64_t [2]){ ufid.u64.lo,
+  ufid.u64.hi }, 2, 1);
+return ufid_hash % netdev_offload_thread_nb();
+}
+
+unsigned int
+netdev_offload_thread_init(void)
+{
+static atomic_count next_id = ATOMIC_COUNT_INIT(0);
+bool thread_is_hw_offload;
+bool thread_is_rcu;
+
+thread_is_hw_offload = !strncmp(get_subprogram_name(),
+"hw_offload", strlen("hw_offload"));
+thread_is_rcu = !strncmp(get_subprogram_name(), "urcu", strlen("urcu"));
+
+/* Panic if any other thread besides offload and RCU tries
+ * to initialize their thread ID. */
+ovs_assert(thread_is_hw_offload || thread_is_rcu);
+
+if (*netdev_offload_thread_id_get() == OVSTHREAD_ID_UNSET) {
+unsigned int id;
+
+if (thread_is_rcu) {
+/* RCU will compete with other threads for shared object access.
+ * Reclamation functions using a thread ID must be thread-safe.
+ * For that end, and because RCU must consider all potential shared
+ * objects anyway, its thread-id can be whichever, so return 0.
+ */
+id = 0;
+} else {
+/* Only the actual offload threads have their own ID. */
+id = atomic_count_inc(&next_id);
+}
+/* Panic if any offload thread is getting a spurious ID. */
+ovs_assert(id < netdev_offload_thread_nb());
+return *netdev_offload_thread_id_get() = id;
+} else {
+return *netdev_offload_thread_id_get();
+}
+}
+
 void
 netdev_ports_flow_flush(const char *dpif_type)
 {
@@ -627,7 +691,16 @@ netdev_ports_get_n_flows(const char *dpif_type, odp_port_t 
port_no,
 ovs_rwlock_rdlock(&netdev_hmap_rwlock);
 data = netdev_ports_lookup(port_no, dpif_type);
 if (data) {
-ret = netdev_flow_get_n_flows(data->netdev, n_flows);
+uint64_t thread_n_flows[MAX_OFFLOAD_THREAD_NB] = {0};
+unsigned int tid;
+
+ret = netdev_flow_get_n_flows(data->netdev, thread_n_flows);
+*n_flows = 0;
+if (!ret) {
+for (tid = 0; tid < netdev_offload_thread_nb(); tid++) {
+*n_flows += thread_n_flows[tid];
+}
+}
 }

[ovs-dev] [PATCH v4 12/27] mpsc-queue: Module for lock-free message passing

2021-06-09 Thread Gaetan Rivet
Add a lockless multi-producer/single-consumer (MPSC), linked-list based,
intrusive, unbounded queue that does not require deferred memory
management.

The queue is designed specifically for this MPSC setup.  A benchmark
accompanies the unit tests to measure the difference in this configuration.
A single reader thread polls the queue while N writers enqueue elements
as fast as possible.  The mpsc-queue is compared against the regular ovs-list
as well as the guarded list.  The latter usually offers a slight improvement
by batching the element removal; the mpsc-queue is however faster.
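
For orientation, a sketch of the producer-side insertion described in the
module documentation further down (field names are assumed; the actual
module also handles the momentarily broken chain on the consumer side):

    /* One atomic exchange on the head, then link the previous head. */
    static void
    mpsc_queue_insert_sketch(struct mpsc_queue *queue,
                             struct mpsc_queue_node *node)
    {
        struct mpsc_queue_node *prev;

        atomic_store_relaxed(&node->next, NULL);
        prev = atomic_exchange(&queue->head, node);
        /* Here the chain is momentarily broken: 'prev->next' is still NULL
         * while 'node' is already the head. */
        atomic_store_relaxed(&prev->next, node);
    }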

The 'Avg' column is the average of the producer threads' times:

   $ ./tests/ovstest test-mpsc-queue benchmark 300 1
   Benchmarking n=300 on 1 + 1 threads.
    type\thread:  Reader      1    Avg
      mpsc-queue:     167    167    167 ms
      list(spin):      89     80     80 ms
     list(mutex):     745    745    745 ms
    guarded list:     788    788    788 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 2
   Benchmarking n=300 on 1 + 2 threads.
    type\thread:  Reader      1      2    Avg
      mpsc-queue:      98     97     94     95 ms
      list(spin):     185    171    173    172 ms
     list(mutex):     203    199    203    201 ms
    guarded list:     269    269    188    228 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 3
   Benchmarking n=300 on 1 + 3 threads.
    type\thread:  Reader      1      2      3    Avg
      mpsc-queue:      76     76     65     76     72 ms
      list(spin):     246    110    240    238    196 ms
     list(mutex):     542    541    541    539    540 ms
    guarded list:     535    535    507    511    517 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 4
   Benchmarking n=300 on 1 + 4 threads.
    type\thread:  Reader      1      2      3      4    Avg
      mpsc-queue:      73     68     68     68     68     68 ms
      list(spin):     294    275    279    277    282    278 ms
     list(mutex):     346    309    287    345    302    310 ms
    guarded list:     378    319    334    378    351    345 ms

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/automake.mk |   2 +
 lib/mpsc-queue.c| 251 +
 lib/mpsc-queue.h| 190 ++
 tests/automake.mk   |   1 +
 tests/library.at|   5 +
 tests/test-mpsc-queue.c | 772 
 6 files changed, 1221 insertions(+)
 create mode 100644 lib/mpsc-queue.c
 create mode 100644 lib/mpsc-queue.h
 create mode 100644 tests/test-mpsc-queue.c

diff --git a/lib/automake.mk b/lib/automake.mk
index 79736..b45801852 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -167,6 +167,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/memory.h \
lib/meta-flow.c \
lib/mov-avg.h \
+   lib/mpsc-queue.c \
+   lib/mpsc-queue.h \
lib/multipath.c \
lib/multipath.h \
lib/namemap.c \
diff --git a/lib/mpsc-queue.c b/lib/mpsc-queue.c
new file mode 100644
index 0..ee762e1dc
--- /dev/null
+++ b/lib/mpsc-queue.c
@@ -0,0 +1,251 @@
+/*
+ * Copyright (c) 2020 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include "ovs-atomic.h"
+
+#include "mpsc-queue.h"
+
+/* Multi-producer, single-consumer queue
+ * =
+ *
+ * This is an implementation of the MPSC queue described by Dmitri Vyukov [1].
+ *
+ * One atomic exchange operation is done per insertion.  Removal in most cases
+ * will not require atomic operation and will use one atomic exchange to close
+ * the queue chain.
+ *
+ * Insertion
+ * =
+ *
+ * The queue is implemented using a linked-list.  Insertion is done at the
+ * back of the queue, by swapping the current end with the new node atomically,
+ * then pointing the previous end toward the new node.  To follow Vyukov
+ * nomenclature, the end-node of the chain is called head.  A producer will
+ * only manipulate the head.
+ *
+ * The head swap is atomic, however the link from the previous head to the new
+ * one is done in a separate operation.  This means that the chain is
+ * momentarily broken, when the previous head still points to NULL and the
+ * current head has been inserted.
+ *
+ * Considering a series of insertions, the queue state will remain consistent
+ * and the insertions order is compatible with their precedence, thus the
+ * queue is seri

[ovs-dev] [PATCH v4 11/27] ovs-atomic: Expose atomic exchange operation

2021-06-09 Thread Gaetan Rivet
The atomic exchange operation is a useful primitive that should be
available as well.  Most compilers already expose it or offer a way
to use it; only a single symbol needs to be defined.
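
A minimal usage example (illustrative only): swap a shared pointer and get
the previous value back in one operation, without a compare-and-swap loop.

    #include "ovs-atomic.h"

    struct node {
        struct node *next;
    };

    static ATOMIC(struct node *) shared_head;

    /* Publish a new head and return the old one to the caller. */
    static struct node *
    swap_head(struct node *new_head)
    {
        return atomic_exchange(&shared_head, new_head);
    }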

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/ovs-atomic-c++.h  |  3 +++
 lib/ovs-atomic-clang.h|  5 +
 lib/ovs-atomic-gcc4+.h|  5 +
 lib/ovs-atomic-gcc4.7+.h  |  5 +
 lib/ovs-atomic-i586.h |  5 +
 lib/ovs-atomic-locked.h   |  9 +
 lib/ovs-atomic-msvc.h | 22 ++
 lib/ovs-atomic-pthreads.h |  5 +
 lib/ovs-atomic-x86_64.h   |  5 +
 lib/ovs-atomic.h  |  8 +++-
 10 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/lib/ovs-atomic-c++.h b/lib/ovs-atomic-c++.h
index d47b8dd39..8605fa9d3 100644
--- a/lib/ovs-atomic-c++.h
+++ b/lib/ovs-atomic-c++.h
@@ -29,6 +29,9 @@ using std::atomic_compare_exchange_strong_explicit;
 using std::atomic_compare_exchange_weak;
 using std::atomic_compare_exchange_weak_explicit;
 
+using std::atomic_exchange;
+using std::atomic_exchange_explicit;
+
 #define atomic_read(SRC, DST) \
 atomic_read_explicit(SRC, DST, memory_order_seq_cst)
 #define atomic_read_explicit(SRC, DST, ORDER)   \
diff --git a/lib/ovs-atomic-clang.h b/lib/ovs-atomic-clang.h
index 34cc2faa7..cdf02a512 100644
--- a/lib/ovs-atomic-clang.h
+++ b/lib/ovs-atomic-clang.h
@@ -67,6 +67,11 @@ typedef enum {
 #define atomic_compare_exchange_weak_explicit(DST, EXP, SRC, ORD1, ORD2) \
 __c11_atomic_compare_exchange_weak(DST, EXP, SRC, ORD1, ORD2)
 
+#define atomic_exchange(RMW, ARG) \
+atomic_exchange_explicit(RMW, ARG, memory_order_seq_cst)
+#define atomic_exchange_explicit(RMW, ARG, ORDER) \
+__c11_atomic_exchange(RMW, ARG, ORDER)
+
 #define atomic_add(RMW, ARG, ORIG) \
 atomic_add_explicit(RMW, ARG, ORIG, memory_order_seq_cst)
 #define atomic_sub(RMW, ARG, ORIG) \
diff --git a/lib/ovs-atomic-gcc4+.h b/lib/ovs-atomic-gcc4+.h
index 25bcf20a0..f9accde1a 100644
--- a/lib/ovs-atomic-gcc4+.h
+++ b/lib/ovs-atomic-gcc4+.h
@@ -128,6 +128,11 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit   \
 atomic_compare_exchange_strong_explicit
 
+#define atomic_exchange_explicit(DST, SRC, ORDER) \
+__sync_lock_test_and_set(DST, SRC)
+#define atomic_exchange(DST, SRC) \
+atomic_exchange_explicit(DST, SRC, memory_order_seq_cst)
+
 #define atomic_op__(RMW, OP, ARG, ORIG) \
 ({  \
 typeof(RMW) rmw__ = (RMW);  \
diff --git a/lib/ovs-atomic-gcc4.7+.h b/lib/ovs-atomic-gcc4.7+.h
index 4c197ebe0..846e05775 100644
--- a/lib/ovs-atomic-gcc4.7+.h
+++ b/lib/ovs-atomic-gcc4.7+.h
@@ -61,6 +61,11 @@ typedef enum {
 #define atomic_compare_exchange_weak_explicit(DST, EXP, SRC, ORD1, ORD2) \
 __atomic_compare_exchange_n(DST, EXP, SRC, true, ORD1, ORD2)
 
+#define atomic_exchange_explicit(DST, SRC, ORDER) \
+__atomic_exchange_n(DST, SRC, ORDER)
+#define atomic_exchange(DST, SRC) \
+atomic_exchange_explicit(DST, SRC, memory_order_seq_cst)
+
 #define atomic_add(RMW, OPERAND, ORIG) \
 atomic_add_explicit(RMW, OPERAND, ORIG, memory_order_seq_cst)
 #define atomic_sub(RMW, OPERAND, ORIG) \
diff --git a/lib/ovs-atomic-i586.h b/lib/ovs-atomic-i586.h
index 9a385ce84..35a0959ff 100644
--- a/lib/ovs-atomic-i586.h
+++ b/lib/ovs-atomic-i586.h
@@ -400,6 +400,11 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit   \
 atomic_compare_exchange_strong_explicit
 
+#define atomic_exchange_explicit(RMW, ARG, ORDER) \
+atomic_exchange__(RMW, ARG, ORDER)
+#define atomic_exchange(RMW, ARG) \
+atomic_exchange_explicit(RMW, ARG, memory_order_seq_cst)
+
 #define atomic_add__(RMW, ARG, CLOB)\
 asm volatile("lock; xadd %0,%1 ; "  \
  "# atomic_add__ "  \
diff --git a/lib/ovs-atomic-locked.h b/lib/ovs-atomic-locked.h
index f8f0ba2a5..bf38c4a43 100644
--- a/lib/ovs-atomic-locked.h
+++ b/lib/ovs-atomic-locked.h
@@ -31,6 +31,15 @@ void atomic_unlock__(void *);
  atomic_unlock__(DST),  \
  false)))
 
+#define atomic_exchange_locked(DST, SRC) \
+({   \
+atomic_lock__(DST);  \
+typeof(*(DST)) __tmp = *(DST);   \
+*(DST) = SRC;\
+atomic_unlock__(DST);\
+__tmp;   \
+})
+
 #define atomic_op_locked_add +=
 #define atomic_op_locked_sub -=
 #define atomic_op_locked_or  |=
diff --git a/lib/ovs-atomic-msvc.h b/lib/ovs-atomic-msvc.h
index 9def887d3..ef8310269 100644
--- a/lib/ovs-atomic-msvc.h
+++ b/lib/ovs-atomic-msvc.h
@@ -345,6 +345,28 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit \
 atomic_compare_exchange_strong_explicit
 

[ovs-dev] [PATCH v4 10/27] dpif-netdev: Implement hardware offloads stats query

2021-06-09 Thread Gaetan Rivet
In the netdev datapath, keep track of the enqueued offloads between
the PMDs and the offload thread.  Additionally, query each netdev
for its hardware offload counters.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 90 ++-
 1 file changed, 89 insertions(+), 1 deletion(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index b666bc405..a20eeda4d 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -51,6 +51,7 @@
 #include "hmapx.h"
 #include "id-pool.h"
 #include "ipf.h"
+#include "mov-avg.h"
 #include "netdev.h"
 #include "netdev-offload.h"
 #include "netdev-provider.h"
@@ -431,6 +432,7 @@ struct dp_offload_thread_item {
 struct match match;
 struct nlattr *actions;
 size_t actions_len;
+long long int timestamp;
 
 struct ovs_list node;
 };
@@ -438,12 +440,18 @@ struct dp_offload_thread_item {
 struct dp_offload_thread {
 struct ovs_mutex mutex;
 struct ovs_list list;
+uint64_t enqueued_item;
+struct mov_avg_cma cma;
+struct mov_avg_ema ema;
 pthread_cond_t cond;
 };
 
 static struct dp_offload_thread dp_offload_thread = {
 .mutex = OVS_MUTEX_INITIALIZER,
 .list  = OVS_LIST_INITIALIZER(&dp_offload_thread.list),
+.enqueued_item = 0,
+.cma = MOV_AVG_CMA_INITIALIZER,
+.ema = MOV_AVG_EMA_INITIALIZER(100),
 };
 
 static struct ovsthread_once offload_thread_once
@@ -2632,6 +2640,7 @@ dp_netdev_append_flow_offload(struct 
dp_offload_thread_item *offload)
 {
 ovs_mutex_lock(&dp_offload_thread.mutex);
 ovs_list_push_back(&dp_offload_thread.list, &offload->node);
+dp_offload_thread.enqueued_item++;
 xpthread_cond_signal(&dp_offload_thread.cond);
 ovs_mutex_unlock(&dp_offload_thread.mutex);
 }
@@ -2736,6 +2745,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 {
 struct dp_offload_thread_item *offload;
 struct ovs_list *list;
+long long int latency_us;
 const char *op;
 int ret;
 
@@ -2748,6 +2758,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 ovsrcu_quiesce_end();
 }
 list = ovs_list_pop_front(&dp_offload_thread.list);
+dp_offload_thread.enqueued_item--;
 offload = CONTAINER_OF(list, struct dp_offload_thread_item, node);
 ovs_mutex_unlock(&dp_offload_thread.mutex);
 
@@ -2768,6 +2779,10 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 OVS_NOT_REACHED();
 }
 
+latency_us = time_usec() - offload->timestamp;
+mov_avg_cma_update(&dp_offload_thread.cma, latency_us);
+mov_avg_ema_update(&dp_offload_thread.ema, latency_us);
+
 VLOG_DBG("%s to %s netdev flow "UUID_FMT,
  ret == 0 ? "succeed" : "failed", op,
  UUID_ARGS((struct uuid *) &offload->flow->mega_ufid));
@@ -2792,6 +2807,7 @@ queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
 
 offload = dp_netdev_alloc_flow_offload(pmd, flow,
DP_NETDEV_FLOW_OFFLOAD_OP_DEL);
+offload->timestamp = pmd->ctx.now;
 dp_netdev_append_flow_offload(offload);
 }
 
@@ -2824,6 +2840,7 @@ queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
 memcpy(offload->actions, actions, actions_len);
 offload->actions_len = actions_len;
 
+offload->timestamp = pmd->ctx.now;
 dp_netdev_append_flow_offload(offload);
 }
 
@@ -4209,6 +4226,77 @@ dpif_netdev_operate(struct dpif *dpif, struct dpif_op 
**ops, size_t n_ops,
 }
 }
 
+static int
+dpif_netdev_offload_stats_get(struct dpif *dpif,
+  struct netdev_custom_stats *stats)
+{
+enum {
+DP_NETDEV_HW_OFFLOADS_STATS_ENQUEUED,
+DP_NETDEV_HW_OFFLOADS_STATS_INSERTED,
+DP_NETDEV_HW_OFFLOADS_STATS_LAT_CMA_MEAN,
+DP_NETDEV_HW_OFFLOADS_STATS_LAT_CMA_STDDEV,
+DP_NETDEV_HW_OFFLOADS_STATS_LAT_EMA_MEAN,
+DP_NETDEV_HW_OFFLOADS_STATS_LAT_EMA_STDDEV,
+};
+const char *names[] = {
+[DP_NETDEV_HW_OFFLOADS_STATS_ENQUEUED] =
+"Enqueued offloads",
+[DP_NETDEV_HW_OFFLOADS_STATS_INSERTED] =
+"Inserted offloads",
+[DP_NETDEV_HW_OFFLOADS_STATS_LAT_CMA_MEAN] =
+"  Cumulative Average latency (us)",
+[DP_NETDEV_HW_OFFLOADS_STATS_LAT_CMA_STDDEV] =
+"   Cumulative Latency stddev (us)",
+[DP_NETDEV_HW_OFFLOADS_STATS_LAT_EMA_MEAN] =
+" Exponential Average latency (us)",
+[DP_NETDEV_HW_OFFLOADS_STATS_LAT_EMA_STDDEV] =
+"  Exponential Latency stddev (us)",
+};
+struct dp_netdev *dp = get_dp_netdev(dpif);
+struct dp_netdev_port *port;
+uint64_t n

[ovs-dev] [PATCH v4 09/27] mov-avg: Add a moving average helper structure

2021-06-09 Thread Gaetan Rivet
Add a new module offering a helper to compute the Cumulative
Moving Average (CMA) and the Exponential Moving Average (EMA)
of a series of values.

Use the new helpers to add latency metrics in dpif-netdev.
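
Usage is straightforward; a small example built on the helpers defined
below (the latency samples are placeholders):

    #include "mov-avg.h"

    static struct mov_avg_cma cma = MOV_AVG_CMA_INITIALIZER;
    static struct mov_avg_ema ema = MOV_AVG_EMA_INITIALIZER(100); /* ~100 samples. */

    static void
    record_latency(double latency_us)
    {
        mov_avg_cma_update(&cma, latency_us);
        mov_avg_ema_update(&ema, latency_us);
    }

    /* When reporting:
     *   mean   = mov_avg_cma(&cma);
     *   stddev = mov_avg_cma_std_dev(&cma);
     * with matching accessors on the EMA side. */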

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/automake.mk |   1 +
 lib/mov-avg.h   | 171 
 2 files changed, 172 insertions(+)
 create mode 100644 lib/mov-avg.h

diff --git a/lib/automake.mk b/lib/automake.mk
index 39901bd6d..79736 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -166,6 +166,7 @@ lib_libopenvswitch_la_SOURCES = \
lib/memory.c \
lib/memory.h \
lib/meta-flow.c \
+   lib/mov-avg.h \
lib/multipath.c \
lib/multipath.h \
lib/namemap.c \
diff --git a/lib/mov-avg.h b/lib/mov-avg.h
new file mode 100644
index 0..4a7e62c18
--- /dev/null
+++ b/lib/mov-avg.h
@@ -0,0 +1,171 @@
+/*
+ * Copyright (c) 2020 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef _MOV_AVG_H
+#define _MOV_AVG_H 1
+
+#include 
+
+/* Moving average helpers. */
+
+/* Cumulative Moving Average.
+ *
+ * Computes the arithmetic mean over a whole series of values.
+ * Online equivalent of sum(V) / len(V).
+ *
+ * As all values have equal weight, this average will
+ * be slow to show recent changes in the series.
+ *
+ */
+
+struct mov_avg_cma {
+unsigned long long int count;
+double mean;
+double sum_dsquared;
+};
+
+#define MOV_AVG_CMA_INITIALIZER \
+{ .count = 0, .mean = .0, .sum_dsquared = .0 }
+
+static inline void
+mov_avg_cma_init(struct mov_avg_cma *cma)
+{
+*cma = (struct mov_avg_cma) MOV_AVG_CMA_INITIALIZER;
+}
+
+static inline void
+mov_avg_cma_update(struct mov_avg_cma *cma, double new_val)
+{
+double new_mean;
+
+cma->count++;
+new_mean = cma->mean + (new_val - cma->mean) / cma->count;
+
+cma->sum_dsquared += (new_val - new_mean) * (new_val - cma->mean);
+cma->mean = new_mean;
+}
+
+static inline double
+mov_avg_cma(struct mov_avg_cma *cma)
+{
+return cma->mean;
+}
+
+static inline double
+mov_avg_cma_std_dev(struct mov_avg_cma *cma)
+{
+double variance = 0.0;
+
+if (cma->count > 1) {
+variance = cma->sum_dsquared / (cma->count - 1);
+}
+
+return sqrt(variance);
+}
+
+/* Exponential Moving Average.
+ *
+ * Each value in the series has an exponentially decreasing weight,
+ * the older they get the less weight they have.
+ *
+ * The smoothing factor 'alpha' must be within 0 < alpha < 1.
+ * The closer this factor to zero, the more equal the weight between
+ * recent and older values. As it approaches one, the more recent values
+ * will have more weight.
+ *
+ * The EMA can be thought of as an estimator for the next value when measures
+ * are dependent. In this case, it can make sense to consider the mean square
+ * error of the prediction. An 'alpha' minimizing this error would be the
+ * better choice to improve the estimation.
+ *
+ * A common way to choose 'alpha' is to use the following formula:
+ *
+ *   a = 2 / (N + 1)
+ *
+ * With this 'alpha', the EMA will have the same 'center of mass' as an
+ * equivalent N-values Simple Moving Average.
+ *
+ * When using this factor, the N last values of the EMA will have a sum weight
+ * converging toward 0.8647, meaning that those values will account for 86% of
+ * the average[1].
+ *
+ * [1] https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average
+ */
+
+struct mov_avg_ema {
+double alpha; /* 'Smoothing' factor. */
+double mean;
+double variance;
+bool initialized;
+};
+
+/* Choose alpha explicitly. */
+#define MOV_AVG_EMA_INITIALIZER_ALPHA(a) { \
+.initialized = false, \
+.alpha = (a), .variance = 0.0, .mean = 0.0 \
+}
+
+/* Choose alpha to consider 'N' past periods as 86% of the EMA. */
+#define MOV_AVG_EMA_INITIALIZER(n_elem) \
+MOV_AVG_EMA_INITIALIZER_ALPHA(2.0 / ((double)(n_elem) + 1.0))
+
+static inline void
+mov_avg_ema_init_alpha(struct mov_avg_ema *ema,
+   double alpha)
+{
+*ema = (struct mov_avg_ema) MOV_AVG_EMA_INITIALIZER_ALPHA(alpha);
+}
+
+static inline void
+mov_avg_ema_init(struct mov_avg_ema *ema,
+ unsigned long long int n_elem)
+{
+*ema = (struct mov_avg_ema) 

[ovs-dev] [PATCH v4 08/27] dpif-netdev: Rename offload thread structure

2021-06-09 Thread Gaetan Rivet
The offload management in userspace is done through a separate thread.
The naming of the structure holding the objects used for synchronization
with the dataplane is generic and nondescript.

Clarify the object's function by renaming it.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 52 +++
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 7710417a7..b666bc405 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -424,7 +424,7 @@ enum {
 DP_NETDEV_FLOW_OFFLOAD_OP_DEL,
 };
 
-struct dp_flow_offload_item {
+struct dp_offload_thread_item {
 struct dp_netdev_pmd_thread *pmd;
 struct dp_netdev_flow *flow;
 int op;
@@ -435,15 +435,15 @@ struct dp_flow_offload_item {
 struct ovs_list node;
 };
 
-struct dp_flow_offload {
+struct dp_offload_thread {
 struct ovs_mutex mutex;
 struct ovs_list list;
 pthread_cond_t cond;
 };
 
-static struct dp_flow_offload dp_flow_offload = {
+static struct dp_offload_thread dp_offload_thread = {
 .mutex = OVS_MUTEX_INITIALIZER,
-.list  = OVS_LIST_INITIALIZER(&dp_flow_offload.list),
+.list  = OVS_LIST_INITIALIZER(&dp_offload_thread.list),
 };
 
 static struct ovsthread_once offload_thread_once
@@ -2599,12 +2599,12 @@ mark_to_flow_find(const struct dp_netdev_pmd_thread 
*pmd,
 return NULL;
 }
 
-static struct dp_flow_offload_item *
+static struct dp_offload_thread_item *
 dp_netdev_alloc_flow_offload(struct dp_netdev_pmd_thread *pmd,
  struct dp_netdev_flow *flow,
  int op)
 {
-struct dp_flow_offload_item *offload;
+struct dp_offload_thread_item *offload;
 
 offload = xzalloc(sizeof(*offload));
 offload->pmd = pmd;
@@ -2618,7 +2618,7 @@ dp_netdev_alloc_flow_offload(struct dp_netdev_pmd_thread 
*pmd,
 }
 
 static void
-dp_netdev_free_flow_offload(struct dp_flow_offload_item *offload)
+dp_netdev_free_flow_offload(struct dp_offload_thread_item *offload)
 {
 dp_netdev_pmd_unref(offload->pmd);
 dp_netdev_flow_unref(offload->flow);
@@ -2628,16 +2628,16 @@ dp_netdev_free_flow_offload(struct dp_flow_offload_item 
*offload)
 }
 
 static void
-dp_netdev_append_flow_offload(struct dp_flow_offload_item *offload)
+dp_netdev_append_flow_offload(struct dp_offload_thread_item *offload)
 {
-ovs_mutex_lock(&dp_flow_offload.mutex);
-ovs_list_push_back(&dp_flow_offload.list, &offload->node);
-xpthread_cond_signal(&dp_flow_offload.cond);
-ovs_mutex_unlock(&dp_flow_offload.mutex);
+ovs_mutex_lock(&dp_offload_thread.mutex);
+ovs_list_push_back(&dp_offload_thread.list, &offload->node);
+xpthread_cond_signal(&dp_offload_thread.cond);
+ovs_mutex_unlock(&dp_offload_thread.mutex);
 }
 
 static int
-dp_netdev_flow_offload_del(struct dp_flow_offload_item *offload)
+dp_netdev_flow_offload_del(struct dp_offload_thread_item *offload)
 {
 return mark_to_flow_disassociate(offload->pmd, offload->flow);
 }
@@ -2654,7 +2654,7 @@ dp_netdev_flow_offload_del(struct dp_flow_offload_item 
*offload)
  * valid, thus only item 2 needed.
  */
 static int
-dp_netdev_flow_offload_put(struct dp_flow_offload_item *offload)
+dp_netdev_flow_offload_put(struct dp_offload_thread_item *offload)
 {
 struct dp_netdev_pmd_thread *pmd = offload->pmd;
 struct dp_netdev_flow *flow = offload->flow;
@@ -2734,22 +2734,22 @@ err_free:
 static void *
 dp_netdev_flow_offload_main(void *data OVS_UNUSED)
 {
-struct dp_flow_offload_item *offload;
+struct dp_offload_thread_item *offload;
 struct ovs_list *list;
 const char *op;
 int ret;
 
 for (;;) {
-ovs_mutex_lock(&dp_flow_offload.mutex);
-if (ovs_list_is_empty(&dp_flow_offload.list)) {
+ovs_mutex_lock(&dp_offload_thread.mutex);
+if (ovs_list_is_empty(&dp_offload_thread.list)) {
 ovsrcu_quiesce_start();
-ovs_mutex_cond_wait(&dp_flow_offload.cond,
-&dp_flow_offload.mutex);
+ovs_mutex_cond_wait(&dp_offload_thread.cond,
+&dp_offload_thread.mutex);
 ovsrcu_quiesce_end();
 }
-list = ovs_list_pop_front(&dp_flow_offload.list);
-offload = CONTAINER_OF(list, struct dp_flow_offload_item, node);
-ovs_mutex_unlock(&dp_flow_offload.mutex);
+list = ovs_list_pop_front(&dp_offload_thread.list);
+offload = CONTAINER_OF(list, struct dp_offload_thread_item, node);
+ovs_mutex_unlock(&dp_offload_thread.mutex);
 
 switch (offload->op) {
 case DP_NETDEV_FLOW_OFFLOAD_OP_ADD:
@@ -2782,10 +2782,10 @@ static void
 queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
   struct dp_netdev_flow *fl

[ovs-dev] [PATCH v4 07/27] dpctl: Add function to read hardware offload statistics

2021-06-09 Thread Gaetan Rivet
Expose a function to query datapath offload statistics.
This function is separate from the current one in netdev-offload
as it exposes more detailed statistics from the datapath, instead of
only from the netdev-offload provider.

Each datapath is meant to use the custom counters as it sees fit for its
handling of hardware offloads.

Call the new API from dpctl.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpctl.c | 36 
 lib/dpif-netdev.c   |  1 +
 lib/dpif-netlink.c  |  1 +
 lib/dpif-provider.h |  7 +++
 lib/dpif.c  |  8 
 lib/dpif.h  |  9 +
 6 files changed, 62 insertions(+)

diff --git a/lib/dpctl.c b/lib/dpctl.c
index ef8ae7402..6ff73e2d9 100644
--- a/lib/dpctl.c
+++ b/lib/dpctl.c
@@ -1541,6 +1541,40 @@ dpctl_del_flows(int argc, const char *argv[], struct 
dpctl_params *dpctl_p)
 return error;
 }
 
+static int
+dpctl_offload_stats_show(int argc, const char *argv[],
+ struct dpctl_params *dpctl_p)
+{
+struct netdev_custom_stats stats;
+struct dpif *dpif;
+int error;
+size_t i;
+
+error = opt_dpif_open(argc, argv, dpctl_p, 2, &dpif);
+if (error) {
+return error;
+}
+
+memset(&stats, 0, sizeof(stats));
+error = dpif_offload_stats_get(dpif, &stats);
+if (error) {
+dpctl_error(dpctl_p, error, "retrieving offload statistics");
+goto close_dpif;
+}
+
+dpctl_print(dpctl_p, "HW Offload stats:\n");
+for (i = 0; i < stats.size; i++) {
+dpctl_print(dpctl_p, "   %s: %6" PRIu64 "\n",
+stats.counters[i].name, stats.counters[i].value);
+}
+
+netdev_free_custom_stats_counters(&stats);
+
+close_dpif:
+dpif_close(dpif);
+return error;
+}
+
 static int
 dpctl_help(int argc OVS_UNUSED, const char *argv[] OVS_UNUSED,
struct dpctl_params *dpctl_p)
@@ -2697,6 +2731,8 @@ static const struct dpctl_command all_commands[] = {
 { "add-flows", "[dp] file", 1, 2, dpctl_process_flows, DP_RW },
 { "mod-flows", "[dp] file", 1, 2, dpctl_process_flows, DP_RW },
 { "del-flows", "[dp] [file]", 0, 2, dpctl_del_flows, DP_RW },
+{ "offload-stats-show", "[dp]",
+  0, 1, dpctl_offload_stats_show, DP_RO },
 { "dump-conntrack", "[-m] [-s] [dp] [zone=N]",
   0, 4, dpctl_dump_conntrack, DP_RO },
 { "flush-conntrack", "[dp] [zone=N] [ct-tuple]", 0, 3,
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index d2c480529..7710417a7 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -8483,6 +8483,7 @@ const struct dpif_class dpif_netdev_class = {
 dpif_netdev_flow_dump_thread_destroy,
 dpif_netdev_flow_dump_next,
 dpif_netdev_operate,
+NULL,   /* offload_stats_get */
 NULL,   /* recv_set */
 NULL,   /* handlers_set */
 dpif_netdev_set_config,
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 50520f8c0..9dd580f63 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -3972,6 +3972,7 @@ const struct dpif_class dpif_netlink_class = {
 dpif_netlink_flow_dump_thread_destroy,
 dpif_netlink_flow_dump_next,
 dpif_netlink_operate,
+NULL,   /* offload_stats_get */
 dpif_netlink_recv_set,
 dpif_netlink_handlers_set,
 NULL,   /* set_config */
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index b817fceac..36dfa8e71 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -330,6 +330,13 @@ struct dpif_class {
 void (*operate)(struct dpif *dpif, struct dpif_op **ops, size_t n_ops,
 enum dpif_offload_type offload_type);
 
+/* Get hardware-offloads activity counters from a dataplane.
+ * Those counters are not offload statistics (which are accessible through
+ * netdev statistics), but a status of hardware offload management:
+ * how many offloads are currently waiting, inserted, etc. */
+int (*offload_stats_get)(struct dpif *dpif,
+ struct netdev_custom_stats *stats);
+
 /* Enables or disables receiving packets with dpif_recv() for 'dpif'.
  * Turning packet receive off and then back on is allowed to change Netlink
  * PID assignments (see ->port_get_pid()).  The client is responsible for
diff --git a/lib/dpif.c b/lib/dpif.c
index 26e8bfb7d..7d3e09d78 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1427,6 +1427,14 @@ dpif_operate(struct dpif *dpif, struct dpif_op **ops, 
size_t n_ops,
 }
 }
 
+int dpif_offload_stats_get(struct dpif *dpif,
+   struct netdev_custom_stats *stats)
+{
+return (dpif->dpif_class->offload_stats_get
+? dpif->dpif_class-

[ovs-dev] [PATCH v4 05/27] netdev-offload-dpdk: Use per-netdev offload metadata

2021-06-09 Thread Gaetan Rivet
Add a per-netdev offload data field as part of netdev hw_info structure.
Use this field in netdev-offload-dpdk to map offload metadata (ufid to
rte_flow). Use flow API deinit ops to destroy the per-netdev metadata
when deallocating a netdev. Use RCU primitives to ensure coherency
during port deletion.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-dpdk.c | 131 --
 lib/netdev-offload.h  |   2 +
 2 files changed, 114 insertions(+), 19 deletions(-)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index f2413f5be..01d09aca7 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -26,6 +26,7 @@
 #include "netdev-provider.h"
 #include "openvswitch/match.h"
 #include "openvswitch/vlog.h"
+#include "ovs-rcu.h"
 #include "packets.h"
 #include "uuid.h"
 
@@ -52,7 +53,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(100, 
5);
 /*
  * A mapping from ufid to dpdk rte_flow.
  */
-static struct cmap ufid_to_rte_flow = CMAP_INITIALIZER;
 
 struct ufid_to_rte_flow_data {
 struct cmap_node node;
@@ -63,14 +63,81 @@ struct ufid_to_rte_flow_data {
 struct dpif_flow_stats stats;
 };
 
+struct netdev_offload_dpdk_data {
+struct cmap ufid_to_rte_flow;
+};
+
+static int
+offload_data_init(struct netdev *netdev)
+{
+struct netdev_offload_dpdk_data *data;
+
+data = xzalloc(sizeof *data);
+cmap_init(&data->ufid_to_rte_flow);
+
+ovsrcu_set(&netdev->hw_info.offload_data, (void *) data);
+
+return 0;
+}
+
+static void
+offload_data_destroy__(struct netdev_offload_dpdk_data *data)
+{
+free(data);
+}
+
+static void
+offload_data_destroy(struct netdev *netdev)
+{
+struct netdev_offload_dpdk_data *data;
+struct ufid_to_rte_flow_data *node;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+if (data == NULL) {
+return;
+}
+
+if (!cmap_is_empty(&data->ufid_to_rte_flow)) {
+VLOG_ERR("Incomplete flush: %s contains rte_flow elements",
+ netdev_get_name(netdev));
+}
+
+CMAP_FOR_EACH (node, node, &data->ufid_to_rte_flow) {
+ovsrcu_postpone(free, node);
+}
+
+cmap_destroy(&data->ufid_to_rte_flow);
+ovsrcu_postpone(offload_data_destroy__, data);
+
+ovsrcu_set(&netdev->hw_info.offload_data, NULL);
+}
+
+static struct cmap *
+offload_data_map(struct netdev *netdev)
+{
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+
+return data ? &data->ufid_to_rte_flow : NULL;
+}
+
 /* Find rte_flow with @ufid. */
 static struct ufid_to_rte_flow_data *
-ufid_to_rte_flow_data_find(const ovs_u128 *ufid, bool warn)
+ufid_to_rte_flow_data_find(struct netdev *netdev,
+   const ovs_u128 *ufid, bool warn)
 {
 size_t hash = hash_bytes(ufid, sizeof *ufid, 0);
 struct ufid_to_rte_flow_data *data;
+struct cmap *map = offload_data_map(netdev);
+
+if (!map) {
+return NULL;
+}
 
-CMAP_FOR_EACH_WITH_HASH (data, node, hash, &ufid_to_rte_flow) {
+CMAP_FOR_EACH_WITH_HASH (data, node, hash, map) {
 if (ovs_u128_equals(*ufid, data->ufid)) {
 return data;
 }
@@ -85,12 +152,19 @@ ufid_to_rte_flow_data_find(const ovs_u128 *ufid, bool warn)
 }
 
 static inline struct ufid_to_rte_flow_data *
-ufid_to_rte_flow_associate(const ovs_u128 *ufid, struct netdev *netdev,
+ufid_to_rte_flow_associate(struct netdev *netdev, const ovs_u128 *ufid,
struct rte_flow *rte_flow, bool actions_offloaded)
 {
 size_t hash = hash_bytes(ufid, sizeof *ufid, 0);
-struct ufid_to_rte_flow_data *data = xzalloc(sizeof *data);
+struct cmap *map = offload_data_map(netdev);
 struct ufid_to_rte_flow_data *data_prev;
+struct ufid_to_rte_flow_data *data;
+
+if (!map) {
+return NULL;
+}
+
+data = xzalloc(sizeof *data);
 
 /*
  * We should not simply overwrite an existing rte flow.
@@ -98,7 +172,7 @@ ufid_to_rte_flow_associate(const ovs_u128 *ufid, struct 
netdev *netdev,
  * Thus, if following assert triggers, something is wrong:
  * the rte_flow is not destroyed.
  */
-data_prev = ufid_to_rte_flow_data_find(ufid, false);
+data_prev = ufid_to_rte_flow_data_find(netdev, ufid, false);
 if (data_prev) {
 ovs_assert(data_prev->rte_flow == NULL);
 }
@@ -108,8 +182,7 @@ ufid_to_rte_flow_associate(const ovs_u128 *ufid, struct 
netdev *netdev,
 data->rte_flow = rte_flow;
 data->actions_offloaded = actions_offloaded;
 
-cmap_insert(&ufid_to_rte_flow,
-CONST_CAST(struct cmap_node *, &data->node), hash);
+cmap_inse

[ovs-dev] [PATCH v4 04/27] netdev: Add flow API uninit function

2021-06-09 Thread Gaetan Rivet
Add a new operation for flow API providers to
uninitialize when the API is disassociated from a netdev.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-provider.h | 3 +++
 lib/netdev-offload.c  | 4 
 2 files changed, 7 insertions(+)

diff --git a/lib/netdev-offload-provider.h b/lib/netdev-offload-provider.h
index cf859d1b4..2127599d3 100644
--- a/lib/netdev-offload-provider.h
+++ b/lib/netdev-offload-provider.h
@@ -90,6 +90,9 @@ struct netdev_flow_api {
 /* Initializies the netdev flow api.
  * Return 0 if successful, otherwise returns a positive errno value. */
 int (*init_flow_api)(struct netdev *);
+
+/* Uninitializes the netdev flow api. */
+void (*uninit_flow_api)(struct netdev *);
 };
 
 int netdev_register_flow_api_provider(const struct netdev_flow_api *);
diff --git a/lib/netdev-offload.c b/lib/netdev-offload.c
index 6237667c3..deefefd63 100644
--- a/lib/netdev-offload.c
+++ b/lib/netdev-offload.c
@@ -320,6 +320,10 @@ netdev_uninit_flow_api(struct netdev *netdev)
 return;
 }
 
+if (flow_api->uninit_flow_api) {
+flow_api->uninit_flow_api(netdev);
+}
+
 ovsrcu_set(&netdev->flow_api, NULL);
 rfa = netdev_lookup_flow_api(flow_api->type);
 ovs_refcount_unref(&rfa->refcnt);
-- 
2.31.1



[ovs-dev] [PATCH v4 06/27] netdev-offload-dpdk: Implement hw-offload statistics read

2021-06-09 Thread Gaetan Rivet
In the DPDK offload provider, keep track of inserted rte_flow and report
it when queried.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/netdev-offload-dpdk.c | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index 01d09aca7..c43e8b968 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -65,6 +65,7 @@ struct ufid_to_rte_flow_data {
 
 struct netdev_offload_dpdk_data {
 struct cmap ufid_to_rte_flow;
+uint64_t rte_flow_counter;
 };
 
 static int
@@ -644,6 +645,12 @@ netdev_offload_dpdk_flow_create(struct netdev *netdev,
 
 flow = netdev_dpdk_rte_flow_create(netdev, attr, items, actions, error);
 if (flow) {
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+data->rte_flow_counter++;
+
 if (!VLOG_DROP_DBG(&rl)) {
 dump_flow(&s, &s_extra, attr, items, actions);
 extra_str = ds_cstr(&s_extra);
@@ -1524,6 +1531,12 @@ netdev_offload_dpdk_flow_destroy(struct 
ufid_to_rte_flow_data *rte_flow_data)
 ret = netdev_dpdk_rte_flow_destroy(netdev, rte_flow, &error);
 
 if (ret == 0) {
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+data->rte_flow_counter--;
+
 ufid_to_rte_flow_disassociate(rte_flow_data);
 VLOG_DBG_RL(&rl, "%s: rte_flow 0x%"PRIxPTR
 " flow destroy %d ufid " UUID_FMT,
@@ -1680,6 +1693,23 @@ netdev_offload_dpdk_flow_flush(struct netdev *netdev)
 return 0;
 }
 
+static int
+netdev_offload_dpdk_get_n_flows(struct netdev *netdev,
+uint64_t *n_flows)
+{
+struct netdev_offload_dpdk_data *data;
+
+data = (struct netdev_offload_dpdk_data *)
+ovsrcu_get(void *, &netdev->hw_info.offload_data);
+if (!data) {
+return -1;
+}
+
+*n_flows = data->rte_flow_counter;
+
+return 0;
+}
+
 const struct netdev_flow_api netdev_offload_dpdk = {
 .type = "dpdk_flow_api",
 .flow_put = netdev_offload_dpdk_flow_put,
@@ -1688,4 +1718,5 @@ const struct netdev_flow_api netdev_offload_dpdk = {
 .uninit_flow_api = netdev_offload_dpdk_uninit_flow_api,
 .flow_get = netdev_offload_dpdk_flow_get,
 .flow_flush = netdev_offload_dpdk_flow_flush,
+.flow_get_n_flows = netdev_offload_dpdk_get_n_flows,
 };
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v4 02/27] dpif-netdev: Rename flow offload thread

2021-06-09 Thread Gaetan Rivet
ovs_strlcpy silently fails to copy the thread name if it is too long.
Rename the flow offload thread to differentiate it from the main thread.

Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 650e67ab3..d2c480529 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2786,8 +2786,7 @@ queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
 
 if (ovsthread_once_start(&offload_thread_once)) {
 xpthread_cond_init(&dp_flow_offload.cond, NULL);
-ovs_thread_create("dp_netdev_flow_offload",
-  dp_netdev_flow_offload_main, NULL);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, NULL);
 ovsthread_once_done(&offload_thread_once);
 }
 
@@ -2810,8 +2809,7 @@ queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
 
 if (ovsthread_once_start(&offload_thread_once)) {
 xpthread_cond_init(&dp_flow_offload.cond, NULL);
-ovs_thread_create("dp_netdev_flow_offload",
-  dp_netdev_flow_offload_main, NULL);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, NULL);
 ovsthread_once_done(&offload_thread_once);
 }
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v4 03/27] tests: Add ovs-barrier unit test

2021-06-09 Thread Gaetan Rivet
No unit test currently exists for the ovs-barrier type.
It is however a crucial building block and should be verified to work
as expected.

Create a simple test verifying the basic function of ovs-barrier.
Integrate the test as part of the test suite.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 tests/automake.mk|   1 +
 tests/library.at |   5 +
 tests/test-barrier.c | 264 +++
 3 files changed, 270 insertions(+)
 create mode 100644 tests/test-barrier.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 1a528aa39..a32abd41c 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -448,6 +448,7 @@ tests_ovstest_SOURCES = \
tests/ovstest.h \
tests/test-aes128.c \
tests/test-atomic.c \
+   tests/test-barrier.c \
tests/test-bundle.c \
tests/test-byte-order.c \
tests/test-classifier.c \
diff --git a/tests/library.at b/tests/library.at
index 1702b7556..e572c22e3 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -246,6 +246,11 @@ AT_SETUP([ofpbuf module])
 AT_CHECK([ovstest test-ofpbuf], [0], [])
 AT_CLEANUP
 
+AT_SETUP([barrier module])
+AT_KEYWORDS([barrier])
+AT_CHECK([ovstest test-barrier], [0], [])
+AT_CLEANUP
+
 AT_SETUP([rcu])
 AT_CHECK([ovstest test-rcu-quiesce], [0], [])
 AT_CLEANUP
diff --git a/tests/test-barrier.c b/tests/test-barrier.c
new file mode 100644
index 0..3bc5291cc
--- /dev/null
+++ b/tests/test-barrier.c
@@ -0,0 +1,264 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include 
+
+#include "ovs-thread.h"
+#include "ovs-rcu.h"
+#include "ovstest.h"
+#include "random.h"
+#include "util.h"
+
+#define DEFAULT_N_THREADS 4
+#define NB_STEPS 4
+
+static bool verbose;
+static struct ovs_barrier barrier;
+
+struct blocker_aux {
+unsigned int tid;
+bool leader;
+int step;
+};
+
+static void *
+basic_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+ovs_barrier_block(&barrier);
+aux->step++;
+ovs_barrier_block(&barrier);
+}
+
+return NULL;
+}
+
+static void
+basic_block_check(struct blocker_aux *aux, size_t n, int expected)
+{
+size_t i;
+
+for (i = 0; i < n; i++) {
+if (verbose) {
+printf("aux[%" PRIuSIZE "]=%d == %d", i, aux[i].step, expected);
+if (aux[i].step != expected) {
+printf(" <--- X");
+}
+printf("\n");
+} else {
+ovs_assert(aux[i].step == expected);
+}
+}
+ovs_barrier_block(&barrier);
+ovs_barrier_block(&barrier);
+}
+
+/*
+ * Basic barrier test.
+ *
+ * N writers and 1 reader participate in the test.
+ * Each thread goes through M steps (=NB_STEPS).
+ * The main thread participates as the reader.
+ *
+ * A Step is divided in three parts:
+ *1. before
+ *  (barrier)
+ *2. during
+ *  (barrier)
+ *3. after
+ *
+ * Each writer updates a thread-local variable with the
+ * current step number within part 2 and waits.
+ *
+ * The reader checks all variables during part 3, expecting
+ * all variables to be equal. If any variable differs, it means
+ * its thread was not properly blocked by the barrier.
+ */
+static void
+test_barrier_basic(size_t n_threads)
+{
+struct blocker_aux *aux;
+pthread_t *threads;
+size_t i;
+
+ovs_barrier_init(&barrier, n_threads + 1);
+
+aux = xcalloc(n_threads, sizeof *aux);
+threads = xmalloc(n_threads * sizeof *threads);
+for (i = 0; i < n_threads; i++) {
+threads[i] = ovs_thread_create("ovs-barrier",
+   basic_blocker_main, &aux[i]);
+}
+
+for (i = 0; i < NB_STEPS; i++) {
+basic_block_check(aux, n_threads, i);
+}
+ovs_barrier_destroy(&barrier);
+
+for (i = 0; i < n_threads; i++) {
+xpthread_join(threads[i], NULL);
+}
+
+free(threads);
+free(aux);
+}
+
+static unsigned int *shared_mem;
+
+static void *
+lead_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+if (aux->leader) 

[ovs-dev] [PATCH v4 01/27] ovs-thread: Fix barrier use-after-free

2021-06-09 Thread Gaetan Rivet
When a thread is blocked on a barrier, there is no guarantee
regarding the moment it will resume, only that it will at some point in
the future.

One thread can resume first and proceed to destroy the barrier while
another thread has not yet awoken. When it finally does, the second
thread will attempt a seq_read() on the barrier seq that the first
thread has already destroyed, triggering a use-after-free.

Introduce an additional indirection layer within the barrier.
An internal barrier implementation holds all the elements necessary
for a thread to safely block and destroy. Whenever a barrier is
destroyed, the internal implementation is left available to any
threads still blocking on it. A reference counter is used to track
threads still using the implementation.

Note that current uses of ovs-barrier are not affected: RCU and
revalidators will not destroy their barrier immediately after blocking
on it.
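
To make the race concrete, here is a minimal usage sketch (not taken from
the tree) that can trigger the pre-patch use-after-free: the main thread
may resume from the barrier and destroy it while 'worker' is still inside
ovs_barrier_block():

  #include <config.h>
  #include "ovs-thread.h"

  static struct ovs_barrier barrier;

  static void *
  worker_main(void *aux OVS_UNUSED)
  {
      ovs_barrier_block(&barrier);   /* May still be reading the seq... */
      return NULL;
  }

  static void
  barrier_race_sketch(void)
  {
      pthread_t worker;

      ovs_barrier_init(&barrier, 2);
      worker = ovs_thread_create("worker", worker_main, NULL);
      ovs_barrier_block(&barrier);   /* The caller can resume first...   */
      ovs_barrier_destroy(&barrier); /* ...and free the seq under it.    */
      xpthread_join(worker, NULL);
  }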

Fixes: d8043da7182a ("ovs-thread: Implement OVS specific barrier.")
Signed-off-by: Gaetan Rivet 
Reviewed-by: Maxime Coquelin 
---
 lib/ovs-thread.c | 61 +++-
 lib/ovs-thread.h |  6 ++---
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index b686e4548..805cba622 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -299,21 +299,53 @@ ovs_spin_init(const struct ovs_spin *spin)
 }
 #endif
 
+struct ovs_barrier_impl {
+uint32_t size;/* Number of threads to wait. */
+atomic_count count;   /* Number of threads already hit the barrier. */
+struct seq *seq;
+struct ovs_refcount refcnt;
+};
+
+static void
+ovs_barrier_impl_ref(struct ovs_barrier_impl *impl)
+{
+ovs_refcount_ref(&impl->refcnt);
+}
+
+static void
+ovs_barrier_impl_unref(struct ovs_barrier_impl *impl)
+{
+if (ovs_refcount_unref(&impl->refcnt) == 1) {
+seq_destroy(impl->seq);
+free(impl);
+}
+}
+
 /* Initializes the 'barrier'.  'size' is the number of threads
  * expected to hit the barrier. */
 void
 ovs_barrier_init(struct ovs_barrier *barrier, uint32_t size)
 {
-barrier->size = size;
-atomic_count_init(&barrier->count, 0);
-barrier->seq = seq_create();
+struct ovs_barrier_impl *impl;
+
+impl = xmalloc(sizeof *impl);
+impl->size = size;
+atomic_count_init(&impl->count, 0);
+impl->seq = seq_create();
+ovs_refcount_init(&impl->refcnt);
+
+ovsrcu_set(&barrier->impl, impl);
 }
 
 /* Destroys the 'barrier'. */
 void
 ovs_barrier_destroy(struct ovs_barrier *barrier)
 {
-seq_destroy(barrier->seq);
+struct ovs_barrier_impl *impl;
+
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovsrcu_set(&barrier->impl, NULL);
+ovs_barrier_impl_unref(impl);
 }
 
 /* Makes the calling thread block on the 'barrier' until all
@@ -325,23 +357,30 @@ ovs_barrier_destroy(struct ovs_barrier *barrier)
 void
 ovs_barrier_block(struct ovs_barrier *barrier)
 {
-uint64_t seq = seq_read(barrier->seq);
+struct ovs_barrier_impl *impl;
 uint32_t orig;
+uint64_t seq;
 
-orig = atomic_count_inc(&barrier->count);
-if (orig + 1 == barrier->size) {
-atomic_count_set(&barrier->count, 0);
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovs_barrier_impl_ref(impl);
+
+seq = seq_read(impl->seq);
+orig = atomic_count_inc(&impl->count);
+if (orig + 1 == impl->size) {
+atomic_count_set(&impl->count, 0);
 /* seq_change() serves as a release barrier against the other threads,
  * so the zeroed count is visible to them as they continue. */
-seq_change(barrier->seq);
+seq_change(impl->seq);
 } else {
 /* To prevent thread from waking up by other event,
  * keeps waiting for the change of 'barrier->seq'. */
-while (seq == seq_read(barrier->seq)) {
-seq_wait(barrier->seq, seq);
+while (seq == seq_read(impl->seq)) {
+seq_wait(impl->seq, seq);
 poll_block();
 }
 }
+
+ovs_barrier_impl_unref(impl);
 }
 
 DEFINE_EXTERN_PER_THREAD_DATA(ovsthread_id, OVSTHREAD_ID_UNSET);
diff --git a/lib/ovs-thread.h b/lib/ovs-thread.h
index 7ee98bd4e..3b444ccdc 100644
--- a/lib/ovs-thread.h
+++ b/lib/ovs-thread.h
@@ -21,16 +21,16 @@
 #include 
 #include 
 #include "ovs-atomic.h"
+#include "ovs-rcu.h"
 #include "openvswitch/thread.h"
 #include "util.h"
 
 struct seq;
 
 /* Poll-block()-able barrier similar to pthread_barrier_t. */
+struct ovs_barrier_impl;
 struct ovs_barrier {
-uint32_t size;/* Number of threads to wait. */
-atomic_count count;   /* Number of threads already hit the barrier. */
-struct seq *s

[ovs-dev] [PATCH v4 00/27] dpif-netdev: Parallel offload processing

2021-06-09 Thread Gaetan Rivet
id-fpool mix: avg   43.0 | stdev   5.7 | max    56 | min    34
id-fpool rnd: avg   45.9 | stdev   5.6 | max    62 | min    36
seq-pool new: avg   39.6 | stdev   6.0 | max    49 | min    34
seq-pool del: avg   37.0 | stdev   5.3 | max    47 | min    33
seq-pool mix: avg  101.4 | stdev  16.3 | max   130 | min    89
seq-pool rnd: avg   81.6 | stdev  12.6 | max   105 | min    71
id-queue new: avg   20.9 | stdev   4.1 | max    32 | min    15
id-queue del: avg   17.2 | stdev   4.5 | max    28 | min    10
id-queue mix: avg   56.5 | stdev  10.9 | max    86 | min    38
id-queue rnd: avg   97.2 | stdev  15.7 | max   130 | min    64

$ pool-stats.sh 100 4
20 times './tests/ovstest test-id-fpool perf  100 4':
id-fpool new: avg   10.4 | stdev   2.8 | max    22 | min     7
id-fpool del: avg    8.5 | stdev   0.7 | max     9 | min     6
id-fpool mix: avg   19.6 | stdev   1.8 | max    22 | min    15
id-fpool rnd: avg   25.6 | stdev   2.4 | max    28 | min    20
seq-pool new: avg   47.7 | stdev   5.2 | max    52 | min    34
seq-pool del: avg   35.8 | stdev   3.3 | max    39 | min    28
seq-pool mix: avg  118.1 | stdev  14.8 | max   130 | min    81
seq-pool rnd: avg   89.3 | stdev   9.7 | max   101 | min    65
id-queue new: avg   83.2 | stdev  17.5 | max   126 | min    65
id-queue del: avg   81.8 | stdev  20.8 | max   128 | min    57
id-queue mix: avg  276.2 | stdev  57.1 | max   369 | min   171
id-queue rnd: avg  347.9 | stdev  44.1 | max   410 | min   236

Isolating the 'rnd' test:

1 thread:
-
id-fpool rnd: avg   55.2 | stdev   0.7 | max    56 | min    54
seq-pool rnd: avg   70.3 | stdev   1.2 | max    74 | min    69
id-queue rnd: avg   48.5 | stdev   0.5 | max    49 | min    48

2 threads:
--
id-fpool rnd: avg   45.9 | stdev   5.6 | max    62 | min    36
seq-pool rnd: avg   81.6 | stdev  12.6 | max   105 | min    71
id-queue rnd: avg   97.2 | stdev  15.7 | max   130 | min    64

4 threads:
--
id-fpool rnd: avg   25.6 | stdev   2.4 | max    28 | min    20
seq-pool rnd: avg   89.3 | stdev   9.7 | max   101 | min    65
id-queue rnd: avg  347.9 | stdev  44.1 | max   410 | min   236


Gaetan Rivet (27):
  ovs-thread: Fix barrier use-after-free
  dpif-netdev: Rename flow offload thread
  tests: Add ovs-barrier unit test
  netdev: Add flow API uninit function
  netdev-offload-dpdk: Use per-netdev offload metadata
  netdev-offload-dpdk: Implement hw-offload statistics read
  dpctl: Add function to read hardware offload statistics
  dpif-netdev: Rename offload thread structure
  mov-avg: Add a moving average helper structure
  dpif-netdev: Implement hardware offloads stats query
  ovs-atomic: Expose atomic exchange operation
  mpsc-queue: Module for lock-free message passing
  id-fpool: Module for fast ID generation
  netdev-offload: Add multi-thread API
  dpif-netdev: Quiesce offload thread periodically
  dpif-netdev: Postpone flow offload item freeing
  dpif-netdev: Use id-fpool for mark allocation
  dpif-netdev: Introduce tagged union of offload requests
  dpif-netdev: Execute flush from offload thread
  netdev-offload-dpdk: Use per-thread HW offload stats
  netdev-offload-dpdk: Lock rte_flow map access
  netdev-offload-dpdk: Protect concurrent offload destroy/query
  dpif-netdev: Use lockless queue to manage offloads
  dpif-netdev: Make megaflow and mark mappings thread objects
  dpif-netdev: Replace port mutex by rwlock
  dpif-netdev: Use one or more offload threads
  netdev-dpdk: Remove rte-flow API access locks

 lib/automake.mk   |   5 +
 lib/dpctl.c   |  36 ++
 lib/dpif-netdev.c | 741 
 lib/dpif-netlink.c|   1 +
 lib/dpif-provider.h   |   7 +
 lib/dpif.c|   8 +
 lib/dpif.h|   9 +
 lib/id-fpool.c| 279 
 lib/id-fpool.h|  66 +++
 lib/mov-avg.h | 171 
 lib/mpsc-queue.c  | 251 +++
 lib/mpsc-queue.h  | 190 +
 lib/netdev-dpdk.c |   6 -
 lib/netdev-offload-dpdk.c | 277 ++--
 lib/netdev-offload-provider.h |   4 +
 lib/netdev-offload.c  |  92 +++-
 lib/netdev-offload.h  |  21 +
 lib/ovs-atomic-c++.h  |   3 +
 lib/ovs-atomic-clang.h|   5 +
 lib/ovs-atomic-gcc4+.h|   5 +
 lib/ovs-atomic-gcc4.7+.h  |   5 +
 lib/ovs-atomic-i586.h |   5 +
 lib/ovs-atomic-locked.h   |   9 +
 lib/ovs-atomic-msvc.h |  22 +
 lib/ovs-atomic-pthreads.h |   5 +
 lib/ovs-atomic-x86_64.h   |   5 +
 lib/ovs-atomic.h  |   8 +-
 lib/ovs-thread.c  |  61 ++-
 lib/ovs-thread.h  |   6 +-
 tests/automake.mk |   3 +
 tests/library.at  |  14 +
 tests/test-barrier.c  | 264 
 tests/test-id-fpool.c | 615 +++
 tests/test-mpsc-queue.c   | 772 +

[ovs-dev] [PATCH v2 1/8] configure: add --enable-asan option

2021-05-20 Thread Gaetan Rivet
Add a configure option to enable ASAN in a simple way.
Also add an AC variable so the testsuite can check whether support was built in.

Signed-off-by: Gaetan Rivet 
---
 .ci/linux-build.sh |  4 ++--
 NEWS   |  1 +
 acinclude.m4   | 16 
 configure.ac   |  1 +
 tests/atlocal.in   |  1 +
 5 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 0210d6a77..19600a668 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -229,10 +229,10 @@ fi
 if [ "$ASAN" ]; then
 # This will override default option configured in tests/atlocal.in.
 export ASAN_OPTIONS='detect_leaks=1'
+EXTRA_OPTS="$EXTRA_OPTS --enable-asan"
 # -O2 generates few false-positive memory leak reports in test-ovsdb
 # application, so lowering optimizations to -O1 here.
-CLFAGS_ASAN="-O1 -fno-omit-frame-pointer -fno-common -fsanitize=address"
-CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} ${CLFAGS_ASAN}"
+CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} -O1"
 fi
 
 save_OPTS="${OPTS} $*"
diff --git a/NEWS b/NEWS
index 402ce5969..79e18b85b 100644
--- a/NEWS
+++ b/NEWS
@@ -12,6 +12,7 @@ Post-v2.15.0
- DPDK:
  * OVS validated with DPDK 20.11.1. It is recommended to use this version
until further releases.
+   - New configure option '--enable-asan' enables AddressSanitizer.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/acinclude.m4 b/acinclude.m4
index 15a54d636..615e7f962 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -58,6 +58,22 @@ AC_DEFUN([OVS_ENABLE_WERROR],
fi
AC_SUBST([SPARSE_WERROR])])
 
+dnl OVS_ENABLE_ASAN
+AC_DEFUN([OVS_ENABLE_ASAN],
+  [AC_ARG_ENABLE(
+[asan],
+[AC_HELP_STRING([--enable-asan],
+[Enable the Address Sanitizer])],
+[ASAN_ENABLED=yes], [ASAN_ENABLED=no])
+   AC_SUBST([ASAN_ENABLED])
+   AC_CONFIG_COMMANDS_PRE([
+ if test "$ASAN_ENABLED" = "yes"; then
+ OVS_CFLAGS="$OVS_CFLAGS -fno-omit-frame-pointer"
+ OVS_CFLAGS="$OVS_CFLAGS -fno-common -fsanitize=address"
+ fi
+   ])
+  ])
+
 dnl OVS_CHECK_LINUX
 dnl
 dnl Configure linux kernel source tree
diff --git a/configure.ac b/configure.ac
index c077034d4..eec5a9d1b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -182,6 +182,7 @@ OVS_CONDITIONAL_CC_OPTION([-Wno-unused-parameter], 
[HAVE_WNO_UNUSED_PARAMETER])
 OVS_CONDITIONAL_CC_OPTION([-mavx512f], [HAVE_AVX512F])
 OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS -DHAVE_AVX512F"])
 OVS_ENABLE_WERROR
+OVS_ENABLE_ASAN
 OVS_ENABLE_SPARSE
 OVS_CTAGS_IDENTIFIERS
 OVS_CHECK_DPCLS_AUTOVALIDATOR
diff --git a/tests/atlocal.in b/tests/atlocal.in
index cfca7e192..f61e752bf 100644
--- a/tests/atlocal.in
+++ b/tests/atlocal.in
@@ -220,6 +220,7 @@ export OVS_SYSLOG_METHOD
 OVS_CTL_TIMEOUT=30
 export OVS_CTL_TIMEOUT
 
+ASAN_ENABLED='@ASAN_ENABLED@'
 # Add some default flags to make the tests run better under Address
 # Sanitizer, if it was used for the build.
 #
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 8/8] ovs-rcu: Add blocking RCU mode

2021-05-20 Thread Gaetan Rivet
Add the configure option --enable-rcu-blocking, which modifies the RCU
library. When enabled, quiescing from other threads will block, waiting
on the RCU thread to execute the postponed jobs.

This mode forces the deferred memory reclamation to happen
deterministically, reducing the latency of the deferral and forcing memory
to be freed any time a thread goes through a quiescent state.

Some use-after-free errors that were hidden by deferred memory reclamation
may become observable as a result; previously, the RCU mechanism made them
harder to detect.
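
Conceptually, a blocking quiescent point remembers which grace period it
belongs to and waits until the RCU thread reports that the callbacks
postponed before it have run. The sketch below is illustrative only:
'postpone_wait' mirrors the seq added by this patch, while 'flushed_seqno'
is invented for the sketch; the actual patch tracks per-thread state instead:

  #include <config.h>
  #include "openvswitch/poll-loop.h"
  #include "ovs-atomic.h"
  #include "seq.h"

  static struct seq *postpone_wait;      /* Created with seq_create() at
                                          * init (omitted); changed after
                                          * each callback batch runs.     */
  static atomic_uint64_t flushed_seqno;  /* Last fully flushed period.    */

  static void
  blocking_quiesce_sketch(uint64_t my_seqno)
  {
      uint64_t flushed;

      atomic_read_relaxed(&flushed_seqno, &flushed);
      while (flushed < my_seqno) {
          uint64_t change_seq = seq_read(postpone_wait);

          seq_wait(postpone_wait, change_seq);
          poll_block();                  /* Sleep until the RCU thread
                                          * signals progress.             */
          atomic_read_relaxed(&flushed_seqno, &flushed);
      }
  }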

UAF detection tools should then be used in conjunction with this
compilation flag, e.g. (assuming llvm installed):

  ./configure --enable-rcu-blocking --enable-asan && make

  # Verify the tool works: should trigger a UAF
  ./tests/ovstest test-rcu-uaf quiesce
  ./tests/ovstest test-rcu-uaf try-quiesce
  ./tests/ovstest test-rcu-uaf quiesce-start-end

  # The testsuite can be used as well
  make check TESTSUITEFLAGS='-k rcu'

Signed-off-by: Gaetan Rivet 
---
 .ci/linux-build.sh   |  4 ++
 .github/workflows/build-and-test.yml |  7 +++
 NEWS |  1 +
 acinclude.m4 | 15 +
 configure.ac |  1 +
 lib/ovs-rcu.c| 87 
 lib/ovs-rcu.h| 31 ++
 tests/atlocal.in |  1 +
 tests/library.at |  3 +
 9 files changed, 150 insertions(+)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 19600a668..c9e04c090 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -235,6 +235,10 @@ if [ "$ASAN" ]; then
 CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} -O1"
 fi
 
+if [ "$RCU_BLOCK" ]; then
+EXTRA_OPTS="$EXTRA_OPTS --enable-rcu-blocking"
+fi
+
 save_OPTS="${OPTS} $*"
 OPTS="${EXTRA_OPTS} ${save_OPTS}"
 
diff --git a/.github/workflows/build-and-test.yml 
b/.github/workflows/build-and-test.yml
index e2350c6d9..56f6f42fc 100644
--- a/.github/workflows/build-and-test.yml
+++ b/.github/workflows/build-and-test.yml
@@ -23,6 +23,7 @@ jobs:
   M32: ${{ matrix.m32 }}
   OPTS:${{ matrix.opts }}
   TESTSUITE:   ${{ matrix.testsuite }}
+  RCU_BLOCK:   ${{ matrix.rcu_blocking }}
 
 name: linux ${{ join(matrix.*, ' ') }}
 runs-on: ubuntu-18.04
@@ -109,6 +110,12 @@ jobs:
   - compiler: gcc
 deb_package:  deb
 
+  - compiler: clang
+testsuite:test
+kernel:   3.16
+asan: asan
+rcu_blocking: rcu-blocking
+
 steps:
 - name: checkout
   uses: actions/checkout@v2
diff --git a/NEWS b/NEWS
index 79e18b85b..daafbe150 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,7 @@ Post-v2.15.0
  * OVS validated with DPDK 20.11.1. It is recommended to use this version
until further releases.
- New configure option '--enable-asan' enables AddressSanitizer.
+   - New configure option '--enable-rcu-blocking' to debug RCU usage.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/acinclude.m4 b/acinclude.m4
index 615e7f962..b01264373 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1386,6 +1386,21 @@ AC_DEFUN([OVS_ENABLE_SPARSE],
  [], [enable_sparse=no])
AM_CONDITIONAL([ENABLE_SPARSE_BY_DEFAULT], [test $enable_sparse = yes])])
 
+dnl OVS_ENABLE_RCU_BLOCKING
+AC_DEFUN([OVS_ENABLE_RCU_BLOCKING],
+  [AC_ARG_ENABLE(
+[rcu-blocking],
+[AC_HELP_STRING([--enable-rcu-blocking],
+[Enable the blocking RCU mode])],
+[RCU_BLOCKING=yes], [RCU_BLOCKING=no])
+   AC_SUBST([RCU_BLOCKING])
+   AC_CONFIG_COMMANDS_PRE([
+ if test "$RCU_BLOCKING" = "yes"; then
+ OVS_CFLAGS="$OVS_CFLAGS -DOVS_RCU_BLOCKING=1"
+ fi
+   ])
+  ])
+
 dnl OVS_CTAGS_IDENTIFIERS
 dnl
 dnl ctags ignores symbols with extras identifiers. This is a list of
diff --git a/configure.ac b/configure.ac
index eec5a9d1b..de11ff777 100644
--- a/configure.ac
+++ b/configure.ac
@@ -184,6 +184,7 @@ OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS 
-DHAVE_AVX512F"])
 OVS_ENABLE_WERROR
 OVS_ENABLE_ASAN
 OVS_ENABLE_SPARSE
+OVS_ENABLE_RCU_BLOCKING
 OVS_CTAGS_IDENTIFIERS
 OVS_CHECK_DPCLS_AUTOVALIDATOR
 OVS_CHECK_BINUTILS_AVX512
diff --git a/lib/ovs-rcu.c b/lib/ovs-rcu.c
index 1866bd308..295a56778 100644
--- a/lib/ovs-rcu.c
+++ b/lib/ovs-rcu.c
@@ -71,6 +71,84 @@ static void ovsrcu_unregister__(struct ovsrcu_perthread *);
 static bool ovsrcu_call_postponed(void);
 static void *ovsrcu_postpone_thread(void *arg OVS_UNUSED);
 
+#ifdef OVS_RCU_BLOCKING
+
+static struct seq *postpone_wait;
+DEFINE_STATIC_PER_THREAD_DATA(bool, need_wait, false);
+DEFINE_STATIC_PER_THREAD_DATA(uint64_t, quiescent_seqno, 0);
+
+static void
+ovsrcu_postpone_end(void)
+{
+if (single_threaded()) {
+return;
+}
+seq_change(postpo

[ovs-dev] [PATCH v2 0/8] RCU: Add blocking mode for debugging

2021-05-20 Thread Gaetan Rivet
This series adds a compilation option that changes the behavior of the RCU
module. Once enabled, RCU reclamation by user threads becomes blocking until
the RCU thread has executed the scheduled callbacks.

Tools such as AddressSanitizer are useful to detect memory errors, e.g.
use-after-free. Such tools can become ineffective if the RCU library is used
to defer memory reclamation. While this is the intended function of the RCU
lib, nothing protects developers from mistakes, i.e. keeping references to
memory scheduled for reclamation across quiescent periods.

Such errors, which should be detectable with ASAN, are made less likely to
occur due to RCU and thus harder to fix. However, if the RCU module is
modified so that user threads wait on the RCU thread to execute the scheduled
callbacks, those errors should be forced to happen.

Unit tests have been written that should trigger a use-after-free from ASAN.
They are however thwarted by the RCU, until the blocking mode is enabled.
In that case, they will always abort on the expected error.
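
The shape of the hidden error is condensed below from tests/test-rcu-uaf.c
(added in patch 4/8): the write after the quiescent point is a
use-after-free that ASAN can only report once the postponed free() has
actually run, which the blocking mode guarantees:

  #include <config.h>
  #include "ovs-rcu.h"
  #include "util.h"

  static void
  hidden_uaf_sketch(void)
  {
      char *p = xmalloc(2);

      p[0] = 'a';
      ovsrcu_postpone(free, p);
      ovsrcu_quiesce();   /* With blocking RCU, free(p) has executed here. */
      p[1] = 'b';         /* Use-after-free, now visible to ASAN.          */
  }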

The full test-suite can be passed with the blocking RCU mode enabled.
An entry in the CI matrix is created for it. No error has been observed.

v2:

  * Rebased on master
  * Added documentation in lib/ovs-rcu.h following Ben's suggestion.

  CI: https://github.com/grivet/ovs/actions/runs/860557554

Gaetan Rivet (8):
  configure: add --enable-asan option
  tests: Add ovs-barrier unit test
  tests: Add RCU postpone test
  tests: Add ASAN use-after-free validation with RCU
  ovs-thread: Fix barrier use-after-free
  ovs-thread: Quiesce when joining pthreads
  ovs-rcu: Remove unused perthread mutex
  ovs-rcu: Add blocking RCU mode

 .ci/linux-build.sh   |   8 +-
 .github/workflows/build-and-test.yml |   7 +
 NEWS |   2 +
 acinclude.m4 |  31 
 configure.ac |   2 +
 lib/ovs-rcu.c|  90 -
 lib/ovs-rcu.h|  31 
 lib/ovs-thread.c |  77 ++--
 lib/ovs-thread.h |   6 +-
 tests/atlocal.in |   2 +
 tests/automake.mk|   2 +
 tests/library.at |  49 -
 tests/test-barrier.c | 264 +++
 tests/test-rcu-uaf.c |  98 ++
 tests/test-rcu.c |  61 +++
 15 files changed, 708 insertions(+), 22 deletions(-)
 create mode 100644 tests/test-barrier.c
 create mode 100644 tests/test-rcu-uaf.c

--
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 4/8] tests: Add ASAN use-after-free validation with RCU

2021-05-20 Thread Gaetan Rivet
When using the RCU mechanism and deferring memory reclamation, potential
use-after-free errors due to incorrect use of RCU can be hidden.

Add a test triggering a UAF event. When the test suite is built with
AddressSanitizer support, verify that the event triggers and the tool is
usable with RCU.

Signed-off-by: Gaetan Rivet 
---
 tests/automake.mk|  1 +
 tests/library.at | 33 +++
 tests/test-rcu-uaf.c | 98 
 3 files changed, 132 insertions(+)
 create mode 100644 tests/test-rcu-uaf.c

diff --git a/tests/automake.mk b/tests/automake.mk
index a32abd41c..4420a3f7f 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -472,6 +472,7 @@ tests_ovstest_SOURCES = \
tests/test-packets.c \
tests/test-random.c \
tests/test-rcu.c \
+   tests/test-rcu-uaf.c \
tests/test-reconnect.c \
tests/test-rstp.c \
tests/test-sflow.c \
diff --git a/tests/library.at b/tests/library.at
index 6e8a154e5..4a549f77e 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -261,6 +261,39 @@ AT_KEYWORDS([rcu])
 AT_CHECK([ovstest test-rcu], [0], [])
 AT_CLEANUP
 
+AT_SETUP([rcu quiesce use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+# SIGABRT + 128
+exit_status=134
+AT_KEYWORDS([rcu asan])
+AT_CHECK([ovstest test-rcu-uaf quiesce], [$exit_status], [ignore], [ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
+AT_SETUP([rcu try-quiesce use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+# SIGABRT + 128
+exit_status=134
+AT_KEYWORDS([rcu asan])
+AT_CHECK([ovstest test-rcu-uaf try-quiesce], [$exit_status], [ignore], 
[ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
+AT_SETUP([rcu quiesce-start-end use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+AT_KEYWORDS([rcu asan])
+# SIGABRT + 128
+exit_status=134
+AT_CHECK([ovstest test-rcu-uaf quiesce-start-end], [$exit_status], [ignore], 
[ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
 AT_SETUP([stopwatch module])
 AT_CHECK([ovstest test-stopwatch], [0], [..
 ], [ignore])
diff --git a/tests/test-rcu-uaf.c b/tests/test-rcu-uaf.c
new file mode 100644
index 0..f97738795
--- /dev/null
+++ b/tests/test-rcu-uaf.c
@@ -0,0 +1,98 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include 
+
+#include "ovs-thread.h"
+#include "ovs-rcu.h"
+#include "ovstest.h"
+#include "util.h"
+
+enum ovsrcu_uaf_type {
+OVSRCU_UAF_QUIESCE,
+OVSRCU_UAF_TRY_QUIESCE,
+OVSRCU_UAF_QUIESCE_START_END,
+};
+
+static void *
+rcu_uaf_main(void *aux)
+{
+enum ovsrcu_uaf_type *type = aux;
+char *xx = xmalloc(2);
+
+xx[0] = 'a';
+ovsrcu_postpone(free, xx);
+switch (*type) {
+case OVSRCU_UAF_QUIESCE:
+ovsrcu_quiesce();
+break;
+case OVSRCU_UAF_TRY_QUIESCE:
+while (ovsrcu_try_quiesce()) {
+;
+}
+break;
+case OVSRCU_UAF_QUIESCE_START_END:
+ovsrcu_quiesce_start();
+ovsrcu_quiesce_end();
+break;
+default:
+OVS_NOT_REACHED();
+}
+xx[1] = 'b';
+
+return NULL;
+}
+
+static void
+usage(char *test_name)
+{
+fprintf(stderr, "Usage: %s \n",
+test_name);
+}
+
+static void
+test_rcu_uaf(int argc, char *argv[])
+{
+char **args = argv + optind - 1;
+enum ovsrcu_uaf_type type;
+pthread_t quiescer;
+
+if (argc - optind != 1) {
+usage(args[0]);
+return;
+}
+
+set_program_name(argv[0]);
+
+if (!strcmp(args[1], "quiesce")) {
+type = OVSRCU_UAF_QUIESCE;
+} else if (!strcmp(args[1], "try-quiesce")) {
+type = OVSRCU_UAF_TRY_QUIESCE;
+} else if (!strcmp(args[1], "quiesce-start-end")) {
+type = OVSRCU_UAF_QUIESCE_START_END;
+} else {
+usage(args[0]);
+return;
+}
+
+/* Need to create a separate thread, to support try-quiesce. */
+quiescer = ovs_thread_create("rcu-uaf", rcu_uaf_main, &type);
+xpthread_join(quiescer, NULL);
+}
+
+OVSTEST_REGISTER("test-rcu-uaf", test_rcu_uaf);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 7/8] ovs-rcu: Remove unused perthread mutex

2021-05-20 Thread Gaetan Rivet
A mutex in the perthread structure is allocated, initialized, and destroyed
without ever being used.

Signed-off-by: Gaetan Rivet 
---
 lib/ovs-rcu.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/lib/ovs-rcu.c b/lib/ovs-rcu.c
index cde1e925b..1866bd308 100644
--- a/lib/ovs-rcu.c
+++ b/lib/ovs-rcu.c
@@ -47,7 +47,6 @@ struct ovsrcu_cbset {
 struct ovsrcu_perthread {
 struct ovs_list list_node;  /* In global list. */
 
-struct ovs_mutex mutex;
 uint64_t seqno;
 struct ovsrcu_cbset *cbset;
 char name[16];  /* This thread's name. */
@@ -84,7 +83,6 @@ ovsrcu_perthread_get(void)
 const char *name = get_subprogram_name();
 
 perthread = xmalloc(sizeof *perthread);
-ovs_mutex_init(&perthread->mutex);
 perthread->seqno = seq_read(global_seqno);
 perthread->cbset = NULL;
 ovs_strlcpy(perthread->name, name[0] ? name : "main",
@@ -406,7 +404,6 @@ ovsrcu_unregister__(struct ovsrcu_perthread *perthread)
 ovs_list_remove(&perthread->list_node);
 ovs_mutex_unlock(&ovsrcu_threads_mutex);
 
-ovs_mutex_destroy(&perthread->mutex);
 free(perthread);
 
 seq_change(global_seqno);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 6/8] ovs-thread: Quiesce when joining pthreads

2021-05-20 Thread Gaetan Rivet
Joining a pthread blocks the caller, making it effectively quiescent, and it
should register as such: the joined thread may be waiting for an RCU callback
to execute before quitting, which would otherwise deadlock the caller.
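
A sketch of the hazard this prevents (illustrative, not from the tree): the
worker below only exits once a postponed callback has run, so if the
joining thread never quiesces inside xpthread_join(), the grace period
cannot complete and the join never returns:

  #include <config.h>
  #include "latch.h"
  #include "openvswitch/poll-loop.h"
  #include "ovs-rcu.h"
  #include "ovs-thread.h"

  static struct latch reclaimed;

  static void
  mark_reclaimed(void *latch_)
  {
      latch_set(latch_);
  }

  static void *
  worker_main(void *aux OVS_UNUSED)
  {
      ovsrcu_postpone(mark_reclaimed, &reclaimed);

      /* Wait, quiescently, for the postponed callback before quitting. */
      ovsrcu_quiesce_start();
      while (!latch_is_set(&reclaimed)) {
          latch_wait(&reclaimed);
          poll_block();
      }
      ovsrcu_quiesce_end();
      return NULL;
  }

  /* In the joining thread:
   *     latch_init(&reclaimed);
   *     worker = ovs_thread_create("worker", worker_main, NULL);
   *     xpthread_join(worker, NULL);
   * Without the joiner quiescing while blocked in pthread_join(), the
   * callback may never run and the worker, and thus the join, may hang. */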

Signed-off-by: Gaetan Rivet 
---
 lib/ovs-thread.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index 805cba622..bf58923f8 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -180,8 +180,6 @@ XPTHREAD_FUNC1(pthread_cond_destroy, pthread_cond_t *);
 XPTHREAD_FUNC1(pthread_cond_signal, pthread_cond_t *);
 XPTHREAD_FUNC1(pthread_cond_broadcast, pthread_cond_t *);
 
-XPTHREAD_FUNC2(pthread_join, pthread_t, void **);
-
 typedef void destructor_func(void *);
 XPTHREAD_FUNC2(pthread_key_create, pthread_key_t *, destructor_func *);
 XPTHREAD_FUNC1(pthread_key_delete, pthread_key_t);
@@ -191,6 +189,20 @@ XPTHREAD_FUNC2(pthread_setspecific, pthread_key_t, const 
void *);
 XPTHREAD_FUNC3(pthread_sigmask, int, const sigset_t *, sigset_t *);
 #endif
 
+void
+xpthread_join(pthread_t thread, void **retval)
+{
+int error;
+
+ovsrcu_quiesce_start();
+error = pthread_join(thread, retval);
+ovsrcu_quiesce_end();
+
+if (OVS_UNLIKELY(error)) {
+ovs_abort(error, "%s failed", __func__);
+}
+}
+
 static void
 ovs_mutex_init__(const struct ovs_mutex *l_, int type)
 {
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 3/8] tests: Add RCU postpone test

2021-05-20 Thread Gaetan Rivet
Add a simple postpone test verifying that RCU callbacks have executed and
that RCU exits in order. Add it as part of the library unit tests.

Signed-off-by: Gaetan Rivet 
---
 tests/library.at |  8 ++-
 tests/test-rcu.c | 61 
 2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/tests/library.at b/tests/library.at
index e572c22e3..6e8a154e5 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -251,10 +251,16 @@ AT_KEYWORDS([barrier])
 AT_CHECK([ovstest test-barrier], [0], [])
 AT_CLEANUP
 
-AT_SETUP([rcu])
+AT_SETUP([rcu quiescing])
+AT_KEYWORDS([rcu])
 AT_CHECK([ovstest test-rcu-quiesce], [0], [])
 AT_CLEANUP
 
+AT_SETUP([rcu postponing])
+AT_KEYWORDS([rcu])
+AT_CHECK([ovstest test-rcu], [0], [])
+AT_CLEANUP
+
 AT_SETUP([stopwatch module])
 AT_CHECK([ovstest test-stopwatch], [0], [..
 ], [ignore])
diff --git a/tests/test-rcu.c b/tests/test-rcu.c
index 965f3c49f..38e1cbb6f 100644
--- a/tests/test-rcu.c
+++ b/tests/test-rcu.c
@@ -49,3 +49,64 @@ test_rcu_quiesce(int argc OVS_UNUSED, char *argv[] 
OVS_UNUSED)
 }
 
 OVSTEST_REGISTER("test-rcu-quiesce", test_rcu_quiesce);
+
+struct rcu_user_aux {
+bool done;
+};
+
+static void
+rcu_user_deferred(struct rcu_user_aux *aux)
+{
+aux->done = true;
+}
+
+static void *
+rcu_user_main(void *aux_)
+{
+struct rcu_user_aux *aux = aux_;
+
+ovsrcu_quiesce();
+
+aux->done = false;
+ovsrcu_postpone(rcu_user_deferred, aux);
+
+ovsrcu_quiesce();
+
+return NULL;
+}
+
+#define N_THREAD 4
+
+static void
+test_rcu(int argc OVS_UNUSED, char *argv[] OVS_UNUSED)
+{
+struct rcu_user_aux main_aux = {0};
+struct rcu_user_aux aux[N_THREAD];
+pthread_t users[N_THREAD];
+size_t i;
+
+memset(aux, 0, sizeof aux);
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+users[i] = ovs_thread_create("user", rcu_user_main, &aux[i]);
+}
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+xpthread_join(users[i], NULL);
+}
+
+/* Register a last callback and verify that it will be properly executed
+ * even if the RCU lib is exited without this thread quiescing.
+ */
+ovsrcu_postpone(rcu_user_deferred, &main_aux);
+
+ovsrcu_exit();
+
+ovs_assert(main_aux.done);
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+ovs_assert(aux[i].done);
+}
+}
+
+OVSTEST_REGISTER("test-rcu", test_rcu);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 2/8] tests: Add ovs-barrier unit test

2021-05-20 Thread Gaetan Rivet
No unit test currently exists for the ovs-barrier type.
It is however a crucial building block and should be verified to work
as expected.

Create a simple test verifying the basic function of ovs-barrier.
Integrate the test as part of the test suite.

Signed-off-by: Gaetan Rivet 
---
 tests/automake.mk|   1 +
 tests/library.at |   5 +
 tests/test-barrier.c | 264 +++
 3 files changed, 270 insertions(+)
 create mode 100644 tests/test-barrier.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 1a528aa39..a32abd41c 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -448,6 +448,7 @@ tests_ovstest_SOURCES = \
tests/ovstest.h \
tests/test-aes128.c \
tests/test-atomic.c \
+   tests/test-barrier.c \
tests/test-bundle.c \
tests/test-byte-order.c \
tests/test-classifier.c \
diff --git a/tests/library.at b/tests/library.at
index 1702b7556..e572c22e3 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -246,6 +246,11 @@ AT_SETUP([ofpbuf module])
 AT_CHECK([ovstest test-ofpbuf], [0], [])
 AT_CLEANUP
 
+AT_SETUP([barrier module])
+AT_KEYWORDS([barrier])
+AT_CHECK([ovstest test-barrier], [0], [])
+AT_CLEANUP
+
 AT_SETUP([rcu])
 AT_CHECK([ovstest test-rcu-quiesce], [0], [])
 AT_CLEANUP
diff --git a/tests/test-barrier.c b/tests/test-barrier.c
new file mode 100644
index 0..3bc5291cc
--- /dev/null
+++ b/tests/test-barrier.c
@@ -0,0 +1,264 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include 
+
+#include "ovs-thread.h"
+#include "ovs-rcu.h"
+#include "ovstest.h"
+#include "random.h"
+#include "util.h"
+
+#define DEFAULT_N_THREADS 4
+#define NB_STEPS 4
+
+static bool verbose;
+static struct ovs_barrier barrier;
+
+struct blocker_aux {
+unsigned int tid;
+bool leader;
+int step;
+};
+
+static void *
+basic_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+ovs_barrier_block(&barrier);
+aux->step++;
+ovs_barrier_block(&barrier);
+}
+
+return NULL;
+}
+
+static void
+basic_block_check(struct blocker_aux *aux, size_t n, int expected)
+{
+size_t i;
+
+for (i = 0; i < n; i++) {
+if (verbose) {
+printf("aux[%" PRIuSIZE "]=%d == %d", i, aux[i].step, expected);
+if (aux[i].step != expected) {
+printf(" <--- X");
+}
+printf("\n");
+} else {
+ovs_assert(aux[i].step == expected);
+}
+}
+ovs_barrier_block(&barrier);
+ovs_barrier_block(&barrier);
+}
+
+/*
+ * Basic barrier test.
+ *
+ * N writers and 1 reader participate in the test.
+ * Each thread goes through M steps (=NB_STEPS).
+ * The main thread participates as the reader.
+ *
+ * A Step is divided in three parts:
+ *1. before
+ *  (barrier)
+ *2. during
+ *  (barrier)
+ *3. after
+ *
+ * Each writer updates a thread-local variable with the
+ * current step number within part 2 and waits.
+ *
+ * The reader checks all variables during part 3, expecting
+ * all variables to be equal. If any variable differs, it means
+ * its thread was not properly blocked by the barrier.
+ */
+static void
+test_barrier_basic(size_t n_threads)
+{
+struct blocker_aux *aux;
+pthread_t *threads;
+size_t i;
+
+ovs_barrier_init(&barrier, n_threads + 1);
+
+aux = xcalloc(n_threads, sizeof *aux);
+threads = xmalloc(n_threads * sizeof *threads);
+for (i = 0; i < n_threads; i++) {
+threads[i] = ovs_thread_create("ovs-barrier",
+   basic_blocker_main, &aux[i]);
+}
+
+for (i = 0; i < NB_STEPS; i++) {
+basic_block_check(aux, n_threads, i);
+}
+ovs_barrier_destroy(&barrier);
+
+for (i = 0; i < n_threads; i++) {
+xpthread_join(threads[i], NULL);
+}
+
+free(threads);
+free(aux);
+}
+
+static unsigned int *shared_mem;
+
+static void *
+lead_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+if (aux->leader) {
+shared_mem

[ovs-dev] [PATCH v2 5/8] ovs-thread: Fix barrier use-after-free

2021-05-20 Thread Gaetan Rivet
When a thread is blocked on a barrier, there is no guarantee
regarding the moment it will resume, only that it will at some point in
the future.

One thread can resume first and proceed to destroy the barrier while
another thread has not yet awoken. When it finally does, the second
thread will attempt a seq_read() on the barrier seq that the first
thread has already destroyed, triggering a use-after-free.

Introduce an additional indirection layer within the barrier.
An internal barrier implementation holds all the elements necessary
for a thread to safely block and destroy. Whenever a barrier is
destroyed, the internal implementation is left available to any
threads still blocking on it. A reference counter is used to track
threads still using the implementation.

Note that current uses of ovs-barrier are not affected: RCU and
revalidators will not destroy their barrier immediately after blocking
on it.

Fixes: d8043da7182a ("ovs-thread: Implement OVS specific barrier.")
Signed-off-by: Gaetan Rivet 
---
 lib/ovs-thread.c | 61 +++-
 lib/ovs-thread.h |  6 ++---
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index b686e4548..805cba622 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -299,21 +299,53 @@ ovs_spin_init(const struct ovs_spin *spin)
 }
 #endif
 
+struct ovs_barrier_impl {
+uint32_t size;/* Number of threads to wait. */
+atomic_count count;   /* Number of threads already hit the barrier. */
+struct seq *seq;
+struct ovs_refcount refcnt;
+};
+
+static void
+ovs_barrier_impl_ref(struct ovs_barrier_impl *impl)
+{
+ovs_refcount_ref(&impl->refcnt);
+}
+
+static void
+ovs_barrier_impl_unref(struct ovs_barrier_impl *impl)
+{
+if (ovs_refcount_unref(&impl->refcnt) == 1) {
+seq_destroy(impl->seq);
+free(impl);
+}
+}
+
 /* Initializes the 'barrier'.  'size' is the number of threads
  * expected to hit the barrier. */
 void
 ovs_barrier_init(struct ovs_barrier *barrier, uint32_t size)
 {
-barrier->size = size;
-atomic_count_init(&barrier->count, 0);
-barrier->seq = seq_create();
+struct ovs_barrier_impl *impl;
+
+impl = xmalloc(sizeof *impl);
+impl->size = size;
+atomic_count_init(&impl->count, 0);
+impl->seq = seq_create();
+ovs_refcount_init(&impl->refcnt);
+
+ovsrcu_set(&barrier->impl, impl);
 }
 
 /* Destroys the 'barrier'. */
 void
 ovs_barrier_destroy(struct ovs_barrier *barrier)
 {
-seq_destroy(barrier->seq);
+struct ovs_barrier_impl *impl;
+
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovsrcu_set(&barrier->impl, NULL);
+ovs_barrier_impl_unref(impl);
 }
 
 /* Makes the calling thread block on the 'barrier' until all
@@ -325,23 +357,30 @@ ovs_barrier_destroy(struct ovs_barrier *barrier)
 void
 ovs_barrier_block(struct ovs_barrier *barrier)
 {
-uint64_t seq = seq_read(barrier->seq);
+struct ovs_barrier_impl *impl;
 uint32_t orig;
+uint64_t seq;
 
-orig = atomic_count_inc(&barrier->count);
-if (orig + 1 == barrier->size) {
-atomic_count_set(&barrier->count, 0);
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovs_barrier_impl_ref(impl);
+
+seq = seq_read(impl->seq);
+orig = atomic_count_inc(&impl->count);
+if (orig + 1 == impl->size) {
+atomic_count_set(&impl->count, 0);
 /* seq_change() serves as a release barrier against the other threads,
  * so the zeroed count is visible to them as they continue. */
-seq_change(barrier->seq);
+seq_change(impl->seq);
 } else {
 /* To prevent thread from waking up by other event,
  * keeps waiting for the change of 'barrier->seq'. */
-while (seq == seq_read(barrier->seq)) {
-seq_wait(barrier->seq, seq);
+while (seq == seq_read(impl->seq)) {
+seq_wait(impl->seq, seq);
 poll_block();
 }
 }
+
+ovs_barrier_impl_unref(impl);
 }
 
 DEFINE_EXTERN_PER_THREAD_DATA(ovsthread_id, OVSTHREAD_ID_UNSET);
diff --git a/lib/ovs-thread.h b/lib/ovs-thread.h
index 7ee98bd4e..3b444ccdc 100644
--- a/lib/ovs-thread.h
+++ b/lib/ovs-thread.h
@@ -21,16 +21,16 @@
 #include 
 #include 
 #include "ovs-atomic.h"
+#include "ovs-rcu.h"
 #include "openvswitch/thread.h"
 #include "util.h"
 
 struct seq;
 
 /* Poll-block()-able barrier similar to pthread_barrier_t. */
+struct ovs_barrier_impl;
 struct ovs_barrier {
-uint32_t size;/* Number of threads to wait. */
-atomic_count count;   /* Number of threads already hit the barrier. */
-struct seq *seq;
+OVSRCU_TYPE(struct ovs_barrier_impl *) impl;
 };
 
 /* Wrappers for pthread_mutexattr_*() that abort the process on any error. */
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 7/8] ovs-rcu: Remove unused perthread mutex

2021-04-27 Thread Gaetan Rivet
A mutex in the perthread structure is allocated, initialized, and destroyed
without ever being used.

Signed-off-by: Gaetan Rivet 
---
 lib/ovs-rcu.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/lib/ovs-rcu.c b/lib/ovs-rcu.c
index cde1e925b..1866bd308 100644
--- a/lib/ovs-rcu.c
+++ b/lib/ovs-rcu.c
@@ -47,7 +47,6 @@ struct ovsrcu_cbset {
 struct ovsrcu_perthread {
 struct ovs_list list_node;  /* In global list. */
 
-struct ovs_mutex mutex;
 uint64_t seqno;
 struct ovsrcu_cbset *cbset;
 char name[16];  /* This thread's name. */
@@ -84,7 +83,6 @@ ovsrcu_perthread_get(void)
 const char *name = get_subprogram_name();
 
 perthread = xmalloc(sizeof *perthread);
-ovs_mutex_init(&perthread->mutex);
 perthread->seqno = seq_read(global_seqno);
 perthread->cbset = NULL;
 ovs_strlcpy(perthread->name, name[0] ? name : "main",
@@ -406,7 +404,6 @@ ovsrcu_unregister__(struct ovsrcu_perthread *perthread)
 ovs_list_remove(&perthread->list_node);
 ovs_mutex_unlock(&ovsrcu_threads_mutex);
 
-ovs_mutex_destroy(&perthread->mutex);
 free(perthread);
 
 seq_change(global_seqno);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 6/8] ovs-thread: Quiesce when joining pthreads

2021-04-27 Thread Gaetan Rivet
Joining a pthread blocks the caller, making it effectively quiescent, and it
should register as such: the joined thread may be waiting for an RCU callback
to execute before quitting, which would otherwise deadlock the caller.

Signed-off-by: Gaetan Rivet 
---
 lib/ovs-thread.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index 805cba622..bf58923f8 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -180,8 +180,6 @@ XPTHREAD_FUNC1(pthread_cond_destroy, pthread_cond_t *);
 XPTHREAD_FUNC1(pthread_cond_signal, pthread_cond_t *);
 XPTHREAD_FUNC1(pthread_cond_broadcast, pthread_cond_t *);
 
-XPTHREAD_FUNC2(pthread_join, pthread_t, void **);
-
 typedef void destructor_func(void *);
 XPTHREAD_FUNC2(pthread_key_create, pthread_key_t *, destructor_func *);
 XPTHREAD_FUNC1(pthread_key_delete, pthread_key_t);
@@ -191,6 +189,20 @@ XPTHREAD_FUNC2(pthread_setspecific, pthread_key_t, const 
void *);
 XPTHREAD_FUNC3(pthread_sigmask, int, const sigset_t *, sigset_t *);
 #endif
 
+void
+xpthread_join(pthread_t thread, void **retval)
+{
+int error;
+
+ovsrcu_quiesce_start();
+error = pthread_join(thread, retval);
+ovsrcu_quiesce_end();
+
+if (OVS_UNLIKELY(error)) {
+ovs_abort(error, "%s failed", __func__);
+}
+}
+
 static void
 ovs_mutex_init__(const struct ovs_mutex *l_, int type)
 {
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 5/8] ovs-thread: Fix barrier use-after-free

2021-04-27 Thread Gaetan Rivet
When a thread is blocked on a barrier, there is no guarantee
regarding the moment it will resume, only that it will at some point in
the future.

One thread can resume first and proceed to destroy the barrier while
another thread has not yet awoken. When it finally does, the second
thread will attempt a seq_read() on the barrier seq that the first
thread has already destroyed, triggering a use-after-free.

Introduce an additional indirection layer within the barrier.
An internal barrier implementation holds all the elements necessary
for a thread to safely block and destroy. Whenever a barrier is
destroyed, the internal implementation is left available to any
threads still blocking on it. A reference counter is used to track
threads still using the implementation.

Note that current uses of ovs-barrier are not affected: RCU and
revalidators will not destroy their barrier immediately after blocking
on it.

Fixes: d8043da7182a ("ovs-thread: Implement OVS specific barrier.")
Signed-off-by: Gaetan Rivet 
---
 lib/ovs-thread.c | 61 +++-
 lib/ovs-thread.h |  6 ++---
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index b686e4548..805cba622 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -299,21 +299,53 @@ ovs_spin_init(const struct ovs_spin *spin)
 }
 #endif
 
+struct ovs_barrier_impl {
+uint32_t size;/* Number of threads to wait. */
+atomic_count count;   /* Number of threads already hit the barrier. */
+struct seq *seq;
+struct ovs_refcount refcnt;
+};
+
+static void
+ovs_barrier_impl_ref(struct ovs_barrier_impl *impl)
+{
+ovs_refcount_ref(&impl->refcnt);
+}
+
+static void
+ovs_barrier_impl_unref(struct ovs_barrier_impl *impl)
+{
+if (ovs_refcount_unref(&impl->refcnt) == 1) {
+seq_destroy(impl->seq);
+free(impl);
+}
+}
+
 /* Initializes the 'barrier'.  'size' is the number of threads
  * expected to hit the barrier. */
 void
 ovs_barrier_init(struct ovs_barrier *barrier, uint32_t size)
 {
-barrier->size = size;
-atomic_count_init(&barrier->count, 0);
-barrier->seq = seq_create();
+struct ovs_barrier_impl *impl;
+
+impl = xmalloc(sizeof *impl);
+impl->size = size;
+atomic_count_init(&impl->count, 0);
+impl->seq = seq_create();
+ovs_refcount_init(&impl->refcnt);
+
+ovsrcu_set(&barrier->impl, impl);
 }
 
 /* Destroys the 'barrier'. */
 void
 ovs_barrier_destroy(struct ovs_barrier *barrier)
 {
-seq_destroy(barrier->seq);
+struct ovs_barrier_impl *impl;
+
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovsrcu_set(&barrier->impl, NULL);
+ovs_barrier_impl_unref(impl);
 }
 
 /* Makes the calling thread block on the 'barrier' until all
@@ -325,23 +357,30 @@ ovs_barrier_destroy(struct ovs_barrier *barrier)
 void
 ovs_barrier_block(struct ovs_barrier *barrier)
 {
-uint64_t seq = seq_read(barrier->seq);
+struct ovs_barrier_impl *impl;
 uint32_t orig;
+uint64_t seq;
 
-orig = atomic_count_inc(&barrier->count);
-if (orig + 1 == barrier->size) {
-atomic_count_set(&barrier->count, 0);
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovs_barrier_impl_ref(impl);
+
+seq = seq_read(impl->seq);
+orig = atomic_count_inc(&impl->count);
+if (orig + 1 == impl->size) {
+atomic_count_set(&impl->count, 0);
 /* seq_change() serves as a release barrier against the other threads,
  * so the zeroed count is visible to them as they continue. */
-seq_change(barrier->seq);
+seq_change(impl->seq);
 } else {
 /* To prevent thread from waking up by other event,
  * keeps waiting for the change of 'barrier->seq'. */
-while (seq == seq_read(barrier->seq)) {
-seq_wait(barrier->seq, seq);
+while (seq == seq_read(impl->seq)) {
+seq_wait(impl->seq, seq);
 poll_block();
 }
 }
+
+ovs_barrier_impl_unref(impl);
 }
 
 DEFINE_EXTERN_PER_THREAD_DATA(ovsthread_id, OVSTHREAD_ID_UNSET);
diff --git a/lib/ovs-thread.h b/lib/ovs-thread.h
index 7ee98bd4e..3b444ccdc 100644
--- a/lib/ovs-thread.h
+++ b/lib/ovs-thread.h
@@ -21,16 +21,16 @@
 #include 
 #include 
 #include "ovs-atomic.h"
+#include "ovs-rcu.h"
 #include "openvswitch/thread.h"
 #include "util.h"
 
 struct seq;
 
 /* Poll-block()-able barrier similar to pthread_barrier_t. */
+struct ovs_barrier_impl;
 struct ovs_barrier {
-uint32_t size;/* Number of threads to wait. */
-atomic_count count;   /* Number of threads already hit the barrier. */
-struct seq *seq;
+OVSRCU_TYPE(struct ovs_barrier_impl *) impl;
 };
 
 /* Wrappers for pthread_mutexattr_*() that abort the process on any error. */
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 0/8] RCU: Add blocking mode for debugging

2021-04-27 Thread Gaetan Rivet
This series adds a compilation option that changes the behavior of the RCU
module. Once enabled, RCU reclamation by user threads becomes blocking until
the RCU thread has executed the scheduled callbacks.

Tools such as AddressSanitizer are useful to detect memory errors, e.g.
use-after-free. Such tools can become ineffective if the RCU library is used
to defer memory reclamation. While this is the intended function of the RCU
lib, nothing protects developers from mistakes, i.e. keeping references to
memory scheduled for reclamation across quiescent periods.

Such errors, which should be detectable with ASAN, are made less likely to
occur due to RCU and thus harder to fix. However, if the RCU module is
modified so that user threads wait on the RCU thread to execute the scheduled
callbacks, those errors should be forced to happen.

Unit tests have been written that should trigger a use-after-free from ASAN.
They are however thwarted by the RCU, until the blocking mode is enabled.
In that case, they will always abort on the expected error.

The full test-suite can be passed with the blocking RCU mode enabled.
An entry in the CI matrix is created for it. No error has been observed.

Gaetan Rivet (8):
  configure: add --enable-asan option
  tests: Add ovs-barrier unit test
  tests: Add RCU postpone test
  tests: Add ASAN use-after-free validation with RCU
  ovs-thread: Fix barrier use-after-free
  ovs-thread: Quiesce when joining pthreads
  ovs-rcu: Remove unused perthread mutex
  ovs-rcu: Add blocking RCU mode

 .ci/linux-build.sh   |   8 +-
 .github/workflows/build-and-test.yml |   7 +
 NEWS |   2 +
 acinclude.m4 |  31 
 configure.ac |   2 +
 lib/ovs-rcu.c|  85 -
 lib/ovs-thread.c |  77 ++--
 lib/ovs-thread.h |   6 +-
 tests/atlocal.in |   2 +
 tests/automake.mk|   2 +
 tests/library.at |  49 -
 tests/test-barrier.c | 264 +++
 tests/test-rcu-uaf.c |  98 ++
 tests/test-rcu.c |  59 ++
 14 files changed, 670 insertions(+), 22 deletions(-)
 create mode 100644 tests/test-barrier.c
 create mode 100644 tests/test-rcu-uaf.c

--
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 8/8] ovs-rcu: Add blocking RCU mode

2021-04-27 Thread Gaetan Rivet
Add the configure option --enable-rcu-blocking, which modifies the RCU
library. When enabled, quiescing from other threads will block, waiting
on the RCU thread to execute the postponed jobs.

This mode forces the deferred memory reclamation to happen
deterministically, reducing the latency of the deferral and forcing memory
to be freed any time a thread goes through a quiescent state.

Some use-after-free errors that were hidden by deferred memory reclamation
may become observable as a result; previously, the RCU mechanism made them
harder to detect.

UAF detection tools should then be used in conjunction with this
compilation flag, e.g. (assuming llvm installed):

  ./configure --enable-rcu-blocking --enable-asan
  make

  # Verify the tool works: should trigger a UAF
  ./tests/ovstest test-rcu-uaf quiesce
  ./tests/ovstest test-rcu-uaf try-quiesce
  ./tests/ovstest test-rcu-uaf quiesce-start-end

  # The testsuite can be used as well
  make check TESTSUITEFLAGS='-k rcu'

Signed-off-by: Gaetan Rivet 
---
 .ci/linux-build.sh   |  4 ++
 .github/workflows/build-and-test.yml |  7 +++
 NEWS |  1 +
 acinclude.m4 | 15 +
 configure.ac |  1 +
 lib/ovs-rcu.c| 82 
 tests/atlocal.in |  1 +
 tests/library.at |  3 +
 8 files changed, 114 insertions(+)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 3c58637b4..e4cbe2024 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -235,6 +235,10 @@ if [ "$ASAN" ]; then
 CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} -O1"
 fi
 
+if [ "$RCU_BLOCK" ]; then
+EXTRA_OPTS="$EXTRA_OPTS --enable-rcu-blocking"
+fi
+
 save_OPTS="${OPTS} $*"
 OPTS="${EXTRA_OPTS} ${save_OPTS}"
 
diff --git a/.github/workflows/build-and-test.yml 
b/.github/workflows/build-and-test.yml
index ce98a9f98..655923325 100644
--- a/.github/workflows/build-and-test.yml
+++ b/.github/workflows/build-and-test.yml
@@ -23,6 +23,7 @@ jobs:
   M32: ${{ matrix.m32 }}
   OPTS:${{ matrix.opts }}
   TESTSUITE:   ${{ matrix.testsuite }}
+  RCU_BLOCK:   ${{ matrix.rcu_blocking }}
 
 name: linux ${{ join(matrix.*, ' ') }}
 runs-on: ubuntu-18.04
@@ -109,6 +110,12 @@ jobs:
   - compiler: gcc
 deb_package:  deb
 
+  - compiler: clang
+testsuite:test
+kernel:   3.16
+asan: asan
+rcu_blocking: rcu-blocking
+
 steps:
 - name: checkout
   uses: actions/checkout@v2
diff --git a/NEWS b/NEWS
index 57e1f041b..83fcfe1d0 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,7 @@ Post-v2.15.0
in ovsdb on startup.
  * New command 'record-hostname-if-not-set' to update hostname in ovsdb.
- New --enable-asan configure option enables AddressSanitizer.
+   - New --enable-rcu-blocking configure option to debug RCU usage.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/acinclude.m4 b/acinclude.m4
index 615e7f962..b01264373 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1386,6 +1386,21 @@ AC_DEFUN([OVS_ENABLE_SPARSE],
  [], [enable_sparse=no])
AM_CONDITIONAL([ENABLE_SPARSE_BY_DEFAULT], [test $enable_sparse = yes])])
 
+dnl OVS_ENABLE_RCU_BLOCKING
+AC_DEFUN([OVS_ENABLE_RCU_BLOCKING],
+  [AC_ARG_ENABLE(
+[rcu-blocking],
+[AC_HELP_STRING([--enable-rcu-blocking],
+[Enable the blocking RCU mode])],
+[RCU_BLOCKING=yes], [RCU_BLOCKING=no])
+   AC_SUBST([RCU_BLOCKING])
+   AC_CONFIG_COMMANDS_PRE([
+ if test "$RCU_BLOCKING" = "yes"; then
+ OVS_CFLAGS="$OVS_CFLAGS -DOVS_RCU_BLOCKING=1"
+ fi
+   ])
+  ])
+
 dnl OVS_CTAGS_IDENTIFIERS
 dnl
 dnl ctags ignores symbols with extras identifiers. This is a list of
diff --git a/configure.ac b/configure.ac
index eec5a9d1b..de11ff777 100644
--- a/configure.ac
+++ b/configure.ac
@@ -184,6 +184,7 @@ OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS -DHAVE_AVX512F"])
 OVS_ENABLE_WERROR
 OVS_ENABLE_ASAN
 OVS_ENABLE_SPARSE
+OVS_ENABLE_RCU_BLOCKING
 OVS_CTAGS_IDENTIFIERS
 OVS_CHECK_DPCLS_AUTOVALIDATOR
 OVS_CHECK_BINUTILS_AVX512
diff --git a/lib/ovs-rcu.c b/lib/ovs-rcu.c
index 1866bd308..cd8414973 100644
--- a/lib/ovs-rcu.c
+++ b/lib/ovs-rcu.c
@@ -71,6 +71,79 @@ static void ovsrcu_unregister__(struct ovsrcu_perthread *);
 static bool ovsrcu_call_postponed(void);
 static void *ovsrcu_postpone_thread(void *arg OVS_UNUSED);
 
+#ifdef OVS_RCU_BLOCKING
+
+static struct seq *postpone_wait;
+DEFINE_STATIC_PER_THREAD_DATA(bool, need_wait, false);
+DEFINE_STATIC_PER_THREAD_DATA(uint64_t, quiescent_seqno, 0);
+
+static void
+ovsrcu_postpone_end(void)
+{
+if (single_threaded()) {
+return;
+}
+seq_change(postpone_wait);
+}
+
+static bool
+ovsrcu_do_not_block(void)
+{
+/* Do not wait on

[ovs-dev] [PATCH v1 4/8] tests: Add ASAN use-after-free validation with RCU

2021-04-27 Thread Gaetan Rivet
When using the RCU mechanism and deferring memory reclamation, potential
use-after-free errors due to incorrect use of RCU can be hidden.

Add a test triggering a UAF event. When the test suite is built with
AddressSanitizer support, verify that the event is reported and that the
tool is usable with RCU.

Signed-off-by: Gaetan Rivet 
---
 tests/automake.mk|  1 +
 tests/library.at | 33 +++
 tests/test-rcu-uaf.c | 98 
 3 files changed, 132 insertions(+)
 create mode 100644 tests/test-rcu-uaf.c

diff --git a/tests/automake.mk b/tests/automake.mk
index a32abd41c..4420a3f7f 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -472,6 +472,7 @@ tests_ovstest_SOURCES = \
tests/test-packets.c \
tests/test-random.c \
tests/test-rcu.c \
+   tests/test-rcu-uaf.c \
tests/test-reconnect.c \
tests/test-rstp.c \
tests/test-sflow.c \
diff --git a/tests/library.at b/tests/library.at
index 6e8a154e5..4a549f77e 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -261,6 +261,39 @@ AT_KEYWORDS([rcu])
 AT_CHECK([ovstest test-rcu], [0], [])
 AT_CLEANUP
 
+AT_SETUP([rcu quiesce use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+# SIGABRT + 128
+exit_status=134
+AT_KEYWORDS([rcu asan])
+AT_CHECK([ovstest test-rcu-uaf quiesce], [$exit_status], [ignore], [ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
+AT_SETUP([rcu try-quiesce use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+# SIGABRT + 128
+exit_status=134
+AT_KEYWORDS([rcu asan])
AT_CHECK([ovstest test-rcu-uaf try-quiesce], [$exit_status], [ignore], [ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
+AT_SETUP([rcu quiesce-start-end use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+AT_KEYWORDS([rcu asan])
+# SIGABRT + 128
+exit_status=134
AT_CHECK([ovstest test-rcu-uaf quiesce-start-end], [$exit_status], [ignore], [ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
 AT_SETUP([stopwatch module])
 AT_CHECK([ovstest test-stopwatch], [0], [..
 ], [ignore])
diff --git a/tests/test-rcu-uaf.c b/tests/test-rcu-uaf.c
new file mode 100644
index 0..f97738795
--- /dev/null
+++ b/tests/test-rcu-uaf.c
@@ -0,0 +1,98 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include 
+
+#include "ovs-thread.h"
+#include "ovs-rcu.h"
+#include "ovstest.h"
+#include "util.h"
+
+enum ovsrcu_uaf_type {
+OVSRCU_UAF_QUIESCE,
+OVSRCU_UAF_TRY_QUIESCE,
+OVSRCU_UAF_QUIESCE_START_END,
+};
+
+static void *
+rcu_uaf_main(void *aux)
+{
+enum ovsrcu_uaf_type *type = aux;
+char *xx = xmalloc(2);
+
+xx[0] = 'a';
+ovsrcu_postpone(free, xx);
+switch (*type) {
+case OVSRCU_UAF_QUIESCE:
+ovsrcu_quiesce();
+break;
+case OVSRCU_UAF_TRY_QUIESCE:
+while (ovsrcu_try_quiesce()) {
+;
+}
+break;
+case OVSRCU_UAF_QUIESCE_START_END:
+ovsrcu_quiesce_start();
+ovsrcu_quiesce_end();
+break;
+default:
+OVS_NOT_REACHED();
+}
+xx[1] = 'b';
+
+return NULL;
+}
+
+static void
+usage(char *test_name)
+{
+fprintf(stderr, "Usage: %s \n",
+test_name);
+}
+
+static void
+test_rcu_uaf(int argc, char *argv[])
+{
+char **args = argv + optind - 1;
+enum ovsrcu_uaf_type type;
+pthread_t quiescer;
+
+if (argc - optind != 1) {
+usage(args[0]);
+return;
+}
+
+set_program_name(argv[0]);
+
+if (!strcmp(args[1], "quiesce")) {
+type = OVSRCU_UAF_QUIESCE;
+} else if (!strcmp(args[1], "try-quiesce")) {
+type = OVSRCU_UAF_TRY_QUIESCE;
+} else if (!strcmp(args[1], "quiesce-start-end")) {
+type = OVSRCU_UAF_QUIESCE_START_END;
+} else {
+usage(args[0]);
+return;
+}
+
+/* Need to create a separate thread, to support try-quiesce. */
+quiescer = ovs_thread_create("rcu-uaf", rcu_uaf_main, &type);
+xpthread_join(quiescer, NULL);
+}
+
+OVSTEST_REGISTER("test-rcu-uaf", test_rcu_uaf);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 3/8] tests: Add RCU postpone test

2021-04-27 Thread Gaetan Rivet
Add a simple postponing test verifying that RCU callbacks have executed
and that the RCU library exits in order. Add it to the library unit tests.

Signed-off-by: Gaetan Rivet 
---
 tests/library.at |  8 ++-
 tests/test-rcu.c | 59 
 2 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/tests/library.at b/tests/library.at
index e572c22e3..6e8a154e5 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -251,10 +251,16 @@ AT_KEYWORDS([barrier])
 AT_CHECK([ovstest test-barrier], [0], [])
 AT_CLEANUP
 
-AT_SETUP([rcu])
+AT_SETUP([rcu quiescing])
+AT_KEYWORDS([rcu])
 AT_CHECK([ovstest test-rcu-quiesce], [0], [])
 AT_CLEANUP
 
+AT_SETUP([rcu postponing])
+AT_KEYWORDS([rcu])
+AT_CHECK([ovstest test-rcu], [0], [])
+AT_CLEANUP
+
 AT_SETUP([stopwatch module])
 AT_CHECK([ovstest test-stopwatch], [0], [..
 ], [ignore])
diff --git a/tests/test-rcu.c b/tests/test-rcu.c
index 965f3c49f..88db04a45 100644
--- a/tests/test-rcu.c
+++ b/tests/test-rcu.c
@@ -49,3 +49,62 @@ test_rcu_quiesce(int argc OVS_UNUSED, char *argv[] OVS_UNUSED)
 }
 
 OVSTEST_REGISTER("test-rcu-quiesce", test_rcu_quiesce);
+
+struct rcu_user_aux {
+bool done;
+};
+
+static void
+rcu_user_deferred(struct rcu_user_aux *aux)
+{
+aux->done = true;
+}
+
+static void *
+rcu_user_main(void *aux_)
+{
+struct rcu_user_aux *aux = aux_;
+
+ovsrcu_quiesce();
+
+aux->done = false;
+ovsrcu_postpone(rcu_user_deferred, aux);
+
+ovsrcu_quiesce();
+
+return NULL;
+}
+
+#define N_THREAD 4
+
+static void
+test_rcu(int argc OVS_UNUSED, char *argv[] OVS_UNUSED)
+{
+struct rcu_user_aux aux[N_THREAD] = {0};
+struct rcu_user_aux main_aux = {0};
+pthread_t users[N_THREAD];
+size_t i;
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+users[i] = ovs_thread_create("user", rcu_user_main, &aux[i]);
+}
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+xpthread_join(users[i], NULL);
+}
+
+/* Register a last callback and verify that it will be properly executed
+ * even if the RCU lib is exited without this thread quiescing.
+ */
+ovsrcu_postpone(rcu_user_deferred, &main_aux);
+
+ovsrcu_exit();
+
+ovs_assert(main_aux.done);
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+ovs_assert(aux[i].done);
+}
+}
+
+OVSTEST_REGISTER("test-rcu", test_rcu);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 1/8] configure: add --enable-asan option

2021-04-27 Thread Gaetan Rivet
Add a configure option to enable AddressSanitizer (ASAN) in a simple way.
Also add an AC variable so the testsuite can check for ASAN support.

Signed-off-by: Gaetan Rivet 
---
 .ci/linux-build.sh |  4 ++--
 NEWS   |  1 +
 acinclude.m4   | 16 
 configure.ac   |  1 +
 tests/atlocal.in   |  1 +
 5 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 977449350..3c58637b4 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -229,10 +229,10 @@ fi
 if [ "$ASAN" ]; then
 # This will override default option configured in tests/atlocal.in.
 export ASAN_OPTIONS='detect_leaks=1'
+EXTRA_OPTS="$EXTRA_OPTS --enable-asan"
 # -O2 generates few false-positive memory leak reports in test-ovsdb
 # application, so lowering optimizations to -O1 here.
-CLFAGS_ASAN="-O1 -fno-omit-frame-pointer -fno-common -fsanitize=address"
-CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} ${CLFAGS_ASAN}"
+CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} -O1"
 fi
 
 save_OPTS="${OPTS} $*"
diff --git a/NEWS b/NEWS
index 95cf922aa..57e1f041b 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,7 @@ Post-v2.15.0
  * New option '--no-record-hostname' to disable hostname configuration
in ovsdb on startup.
  * New command 'record-hostname-if-not-set' to update hostname in ovsdb.
+   - New --enable-asan configure option enables AddressSanitizer.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/acinclude.m4 b/acinclude.m4
index 15a54d636..615e7f962 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -58,6 +58,22 @@ AC_DEFUN([OVS_ENABLE_WERROR],
fi
AC_SUBST([SPARSE_WERROR])])
 
+dnl OVS_ENABLE_ASAN
+AC_DEFUN([OVS_ENABLE_ASAN],
+  [AC_ARG_ENABLE(
+[asan],
+[AC_HELP_STRING([--enable-asan],
+[Enable the Address Sanitizer])],
+[ASAN_ENABLED=yes], [ASAN_ENABLED=no])
+   AC_SUBST([ASAN_ENABLED])
+   AC_CONFIG_COMMANDS_PRE([
+ if test "$ASAN_ENABLED" = "yes"; then
+ OVS_CFLAGS="$OVS_CFLAGS -fno-omit-frame-pointer"
+ OVS_CFLAGS="$OVS_CFLAGS -fno-common -fsanitize=address"
+ fi
+   ])
+  ])
+
 dnl OVS_CHECK_LINUX
 dnl
 dnl Configure linux kernel source tree
diff --git a/configure.ac b/configure.ac
index c077034d4..eec5a9d1b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -182,6 +182,7 @@ OVS_CONDITIONAL_CC_OPTION([-Wno-unused-parameter], [HAVE_WNO_UNUSED_PARAMETER])
 OVS_CONDITIONAL_CC_OPTION([-mavx512f], [HAVE_AVX512F])
 OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS -DHAVE_AVX512F"])
 OVS_ENABLE_WERROR
+OVS_ENABLE_ASAN
 OVS_ENABLE_SPARSE
 OVS_CTAGS_IDENTIFIERS
 OVS_CHECK_DPCLS_AUTOVALIDATOR
diff --git a/tests/atlocal.in b/tests/atlocal.in
index cfca7e192..f61e752bf 100644
--- a/tests/atlocal.in
+++ b/tests/atlocal.in
@@ -220,6 +220,7 @@ export OVS_SYSLOG_METHOD
 OVS_CTL_TIMEOUT=30
 export OVS_CTL_TIMEOUT
 
+ASAN_ENABLED='@ASAN_ENABLED@'
 # Add some default flags to make the tests run better under Address
 # Sanitizer, if it was used for the build.
 #
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 2/8] tests: Add ovs-barrier unit test

2021-04-27 Thread Gaetan Rivet
No unit test currently exists for the ovs-barrier type.
It is however a crucial building block and should be verified to work
as expected.

Create a simple test verifying the basic function of ovs-barrier.
Integrate the test as part of the test suite.

Signed-off-by: Gaetan Rivet 
---
 tests/automake.mk|   1 +
 tests/library.at |   5 +
 tests/test-barrier.c | 264 +++
 3 files changed, 270 insertions(+)
 create mode 100644 tests/test-barrier.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 1a528aa39..a32abd41c 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -448,6 +448,7 @@ tests_ovstest_SOURCES = \
tests/ovstest.h \
tests/test-aes128.c \
tests/test-atomic.c \
+   tests/test-barrier.c \
tests/test-bundle.c \
tests/test-byte-order.c \
tests/test-classifier.c \
diff --git a/tests/library.at b/tests/library.at
index 1702b7556..e572c22e3 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -246,6 +246,11 @@ AT_SETUP([ofpbuf module])
 AT_CHECK([ovstest test-ofpbuf], [0], [])
 AT_CLEANUP
 
+AT_SETUP([barrier module])
+AT_KEYWORDS([barrier])
+AT_CHECK([ovstest test-barrier], [0], [])
+AT_CLEANUP
+
 AT_SETUP([rcu])
 AT_CHECK([ovstest test-rcu-quiesce], [0], [])
 AT_CLEANUP
diff --git a/tests/test-barrier.c b/tests/test-barrier.c
new file mode 100644
index 0..3bc5291cc
--- /dev/null
+++ b/tests/test-barrier.c
@@ -0,0 +1,264 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include 
+
+#include "ovs-thread.h"
+#include "ovs-rcu.h"
+#include "ovstest.h"
+#include "random.h"
+#include "util.h"
+
+#define DEFAULT_N_THREADS 4
+#define NB_STEPS 4
+
+static bool verbose;
+static struct ovs_barrier barrier;
+
+struct blocker_aux {
+unsigned int tid;
+bool leader;
+int step;
+};
+
+static void *
+basic_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+ovs_barrier_block(&barrier);
+aux->step++;
+ovs_barrier_block(&barrier);
+}
+
+return NULL;
+}
+
+static void
+basic_block_check(struct blocker_aux *aux, size_t n, int expected)
+{
+size_t i;
+
+for (i = 0; i < n; i++) {
+if (verbose) {
+printf("aux[%" PRIuSIZE "]=%d == %d", i, aux[i].step, expected);
+if (aux[i].step != expected) {
+printf(" <--- X");
+}
+printf("\n");
+} else {
+ovs_assert(aux[i].step == expected);
+}
+}
+ovs_barrier_block(&barrier);
+ovs_barrier_block(&barrier);
+}
+
+/*
+ * Basic barrier test.
+ *
+ * N writers and 1 reader participate in the test.
+ * Each thread goes through M steps (=NB_STEPS).
+ * The main thread participates as the reader.
+ *
+ * A Step is divided in three parts:
+ *1. before
+ *  (barrier)
+ *2. during
+ *  (barrier)
+ *3. after
+ *
+ * Each writer updates a thread-local variable with the
+ * current step number within part 2 and waits.
+ *
+ * The reader checks all variables during part 3, expecting
+ * all variables to be equal. If any variable differs, it means
+ * its thread was not properly blocked by the barrier.
+ */
+static void
+test_barrier_basic(size_t n_threads)
+{
+struct blocker_aux *aux;
+pthread_t *threads;
+size_t i;
+
+ovs_barrier_init(&barrier, n_threads + 1);
+
+aux = xcalloc(n_threads, sizeof *aux);
+threads = xmalloc(n_threads * sizeof *threads);
+for (i = 0; i < n_threads; i++) {
+threads[i] = ovs_thread_create("ovs-barrier",
+   basic_blocker_main, &aux[i]);
+}
+
+for (i = 0; i < NB_STEPS; i++) {
+basic_block_check(aux, n_threads, i);
+}
+ovs_barrier_destroy(&barrier);
+
+for (i = 0; i < n_threads; i++) {
+xpthread_join(threads[i], NULL);
+}
+
+free(threads);
+free(aux);
+}
+
+static unsigned int *shared_mem;
+
+static void *
+lead_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+if (aux->leader) {
+shared_mem

[ovs-dev] [PATCH v3 27/28] dpif-netdev: Use one or more offload threads

2021-04-25 Thread Gaetan Rivet
Read the user configuration in the netdev-offload module to set the
number of threads used to manage hardware offload requests.

This allows insertion, deletion and modification requests to be
processed concurrently.

The offload thread structure is modified to contain all needed
elements. It is instantiated once per requested thread, and each
instance is used independently.
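
The dispatch of a request to one of the threads is not shown in this
excerpt. As a rough standalone sketch of the idea (made-up names, not the
patch's code), one can picture the flow's ufid being hashed into a thread
index, so that all operations on a given flow stay ordered on one queue
while distinct flows proceed in parallel:

#include <stdint.h>
#include <stdio.h>

#define N_OFFLOAD_THREADS 4     /* Stand-in for the user-configured count. */

struct ufid {                   /* Stand-in for ovs_u128. */
    uint64_t hi, lo;
};

/* Pick the offload thread that owns this flow. */
static unsigned int
offload_thread_for_ufid(const struct ufid *ufid)
{
    /* Any stable hash works; OVS would reuse its existing flow hashing. */
    uint64_t h = ufid->hi ^ (ufid->lo * UINT64_C(0x9e3779b97f4a7c15));

    return (unsigned int) ((h >> 32) % N_OFFLOAD_THREADS);
}

int
main(void)
{
    struct ufid flows[] = {
        { 0x1111, 0x2222 }, { 0xabcd, 0xef01 }, { 0x1234, 0x5678 },
    };

    for (size_t i = 0; i < sizeof flows / sizeof flows[0]; i++) {
        printf("flow %zu -> offload thread %u\n",
               i, offload_thread_for_ufid(&flows[i]));
    }
    return 0;
}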

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
Reviewed-by: Maxime Coquelin 
---
 lib/dpif-netdev.c | 304 +-
 lib/netdev-offload-dpdk.c |   7 +-
 2 files changed, 204 insertions(+), 107 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index becf41adb..c1dc62886 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -460,25 +460,47 @@ struct dp_offload_thread_item {
 };
 
 struct dp_offload_thread {
-struct mpsc_queue queue;
-atomic_uint64_t enqueued_item;
-struct cmap megaflow_to_mark;
-struct cmap mark_to_flow;
-struct mov_avg_cma cma;
-struct mov_avg_ema ema;
+PADDED_MEMBERS(CACHE_LINE_SIZE,
+struct mpsc_queue queue;
+atomic_uint64_t enqueued_item;
+struct cmap megaflow_to_mark;
+struct cmap mark_to_flow;
+struct mov_avg_cma cma;
+struct mov_avg_ema ema;
+);
 };
+static struct dp_offload_thread *dp_offload_threads;
+static void *dp_netdev_flow_offload_main(void *arg);
 
-static struct dp_offload_thread dp_offload_thread = {
-.queue = MPSC_QUEUE_INITIALIZER(&dp_offload_thread.queue),
-.megaflow_to_mark = CMAP_INITIALIZER,
-.mark_to_flow = CMAP_INITIALIZER,
-.enqueued_item = ATOMIC_VAR_INIT(0),
-.cma = MOV_AVG_CMA_INITIALIZER,
-.ema = MOV_AVG_EMA_INITIALIZER(100),
-};
+static void
+dp_netdev_offload_init(void)
+{
+static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
+unsigned int nb_offload_thread = netdev_offload_thread_nb();
+unsigned int tid;
+
+if (!ovsthread_once_start(&once)) {
+return;
+}
+
+dp_offload_threads = xcalloc(nb_offload_thread,
+ sizeof *dp_offload_threads);
 
-static struct ovsthread_once offload_thread_once
-= OVSTHREAD_ONCE_INITIALIZER;
+for (tid = 0; tid < nb_offload_thread; tid++) {
+struct dp_offload_thread *thread;
+
+thread = &dp_offload_threads[tid];
+mpsc_queue_init(&thread->queue);
+cmap_init(&thread->megaflow_to_mark);
+cmap_init(&thread->mark_to_flow);
+atomic_init(&thread->enqueued_item, 0);
+mov_avg_cma_init(&thread->cma);
+mov_avg_ema_init(&thread->ema, 100);
+ovs_thread_create("hw_offload", dp_netdev_flow_offload_main, thread);
+}
+
+ovsthread_once_done(&once);
+}
 
 #define XPS_TIMEOUT 50LL/* In microseconds. */
 
@@ -2478,11 +2500,12 @@ megaflow_to_mark_associate(const ovs_u128 *mega_ufid, uint32_t mark)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data = xzalloc(sizeof(*data));
+unsigned int tid = netdev_offload_thread_id();
 
 data->mega_ufid = *mega_ufid;
 data->mark = mark;
 
-cmap_insert(&dp_offload_thread.megaflow_to_mark,
+cmap_insert(&dp_offload_threads[tid].megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 }
 
@@ -2492,11 +2515,12 @@ megaflow_to_mark_disassociate(const ovs_u128 *mega_ufid)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 CMAP_FOR_EACH_WITH_HASH (data, node, hash,
- &dp_offload_thread.megaflow_to_mark) {
+ &dp_offload_threads[tid].megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
-cmap_remove(&dp_offload_thread.megaflow_to_mark,
+cmap_remove(&dp_offload_threads[tid].megaflow_to_mark,
 CONST_CAST(struct cmap_node *, &data->node), hash);
 ovsrcu_postpone(free, data);
 return;
@@ -2512,9 +2536,10 @@ megaflow_to_mark_find(const ovs_u128 *mega_ufid)
 {
 size_t hash = dp_netdev_flow_hash(mega_ufid);
 struct megaflow_to_mark_data *data;
+unsigned int tid = netdev_offload_thread_id();
 
 CMAP_FOR_EACH_WITH_HASH (data, node, hash,
- &dp_offload_thread.megaflow_to_mark) {
+ &dp_offload_threads[tid].megaflow_to_mark) {
 if (ovs_u128_equals(*mega_ufid, data->mega_ufid)) {
 return data->mark;
 }
@@ -2529,9 +2554,10 @@ megaflow_to_mark_find(const ovs_u128 *mega_ufid)
 static void
 mark_to_flow_associate(const uint32_t mark, struct dp_netdev_flow *flow)
 {
+unsigned int tid = netdev_offload_thread_id();
 dp_netdev_flow_ref(flow);
 
-cmap_insert(&dp_of
