[ovs-dev] [PATCH ovn] lflow.c: Use warn level log for a lflow that has neither dp nor dpg.

2021-04-21 Thread Han Zhou
This should not happen in normal cases, so at least should be a WARN.

Signed-off-by: Han Zhou 
---
 controller/lflow.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/controller/lflow.c b/controller/lflow.c
index 680b8cca1..b8424e1fb 100644
--- a/controller/lflow.c
+++ b/controller/lflow.c
@@ -916,8 +916,9 @@ consider_logical_flow(const struct sbrec_logical_flow 
*lflow,
 bool ret = true;
 
 if (!dp_group && !dp) {
-VLOG_DBG("lflow "UUID_FMT" has no datapath binding, skip",
- UUID_ARGS(&lflow->header_.uuid));
+static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 1);
+VLOG_WARN_RL(&rl, "lflow "UUID_FMT" has no datapath binding, skip",
+ UUID_ARGS(&lflow->header_.uuid));
 return true;
 }
 ovs_assert(!dp_group || !dp);
-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn] ovn-controller.c: Remove extra local_lports_changed setting.

2021-04-21 Thread Han Zhou
Signed-off-by: Han Zhou 
---
 controller/ovn-controller.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 6f7c9ea61..86b92debb 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -1400,7 +1400,6 @@ init_binding_ctx(struct engine_node *node,
 b_ctx_out->lbinding_data = &rt_data->lbinding_data;
 b_ctx_out->local_iface_ids = &rt_data->local_iface_ids;
 b_ctx_out->tracked_dp_bindings = NULL;
-b_ctx_out->local_lports_changed = NULL;
 }
 
 static void
-- 
2.28.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2] ovn-northd: bypass ct for allow ACLs

2021-04-21 Thread Ihar Hrachyshka
On Tue, Apr 13, 2021 at 4:02 AM Dumitru Ceara  wrote:
>
> On 4/13/21 2:26 AM, Ihar Hrachyshka wrote:
> > On Mon, Apr 12, 2021 at 5:00 AM Dumitru Ceara  wrote:
> >>
> >> On 4/6/21 6:35 PM, Ihar Hrachyshka wrote:
> >>> On Fri, Apr 2, 2021 at 7:23 AM Dumitru Ceara  wrote:
> 
>  On 3/24/21 3:40 PM, Ihar Hrachyshka wrote:
> > For allow ACLs, bypass connection tracking by avoiding setting ct
> > hints for matching traffic. Avoid sending all traffic to ct when a
> > stateful ACL is present. Before the patch, this unnecessarily hit
> > performance when mixed ACL action types were used for the same
> > datapath.
> >
> > ===
> >
> > For performance measurements, ovn-fake-multinode environment and qperf
> > were used. Performance measured between two virtual nodes, two ports
> > that belong to different LSs connected via router. Using qperf,
> > performance was measured for UDP, TCP, SCTP protocols (using
> > _lat and _bw tests). The qperf version used:
> > 0.4.9-16.fc31.x86_64.  Each test scenario was executed five times and
> > averages compared.
> >
> > Tests were executed with `allow` rules for the tested protocol and
> > `allow-related` for another protocol set for both ports, both
> > directions, e.g. for TCP scenario, the following ACLs were defined:
> >
> > ovn-nbctl acl-add sw0 to-lport 100 tcp allow
> > ovn-nbctl acl-add sw0 from-lport 100 tcp allow
> > ovn-nbctl acl-add sw1 to-lport 100 tcp allow
> > ovn-nbctl acl-add sw1 from-lport 100 tcp allow
> >
> > ovn-nbctl acl-add sw0 to-lport 100 sctp allow-related
> > ovn-nbctl acl-add sw0 from-lport 100 sctp allow-related
> > ovn-nbctl acl-add sw1 to-lport 100 sctp allow-related
> > ovn-nbctl acl-add sw1 from-lport 100 sctp allow-related
> >
> > In this particular environment, improvement was seen in send_bw,
> > latency, and msg_rate measurements, where applicable, for all three
> > protocols under test.
> >
> > for UDP, send_bw: 293.6 MB/sec => 313.2 MB/sec (+6.68%)
> >  latency: 16 us => 14.08 us (-12%)
> >  msg_rate: 62.56 K/sec => 71.06 K/sec (+13.59%)
> >
> > for TCP, latency: 18.6 us => 14.88 us (-20%)
> >  msg_rate: 53.8 K/sec => 67.28 K/sec (+25.06%)
> >
> > for SCTP, latency: 21.98 us => 19.42 us (-11.65%)
> >   msg_rate: 45.58 K/sec => 51.54 K/sec (+13.08%)
> >
> > Interestingly, some performance improvement was also seen for the same
> > scenarios with no ACLs set at all, albeit significantly more
> > negligible.
> >
> > for UDP, send_bw: 320.0 MB/sec => 338.6 MB/sec (+5.81%)
> >  latency: 13.74 us => 12.88 us (-6.68%)
> >  msg_rate: 73.02 K/sec => 77.84 K/sec (+6.6%)
> >
> > for TCP, latency: 15.62 us => 14.26 us (-9.54%)
> >  msg_rate: 64.02 K/sec => 70.26 K/sec (+9.75%)
> >
> > for SCTP, latency: 19.56 us => 18.16 us (-7.16%)
> >   msg_rate: 51.16 K/sec => 55.12 K/sec (+7.74%)
> >
> > Comparable numbers can be captured with iperf. It may be useful to run
> > more tests in a more elaborate (bare metal) environment.
> >
> > ===
> >
> > The patch takes inspiration from a now abandoned patch:
> >
> > "ovn-northd: Support mixing stateless/stateful ACLs with
> > Stateless_Filter." by Dumitru Ceara.
> >
> > Signed-off-by: Ihar Hrachyshka 
> >
> > ---
> 
>  Hi Ihar,
> 
>  Thanks for working on this and sorry for the delay in reviewing your 
>  patch.
> 
> >
> > v1: initial version.
> > v2: rebased after conflict.
> > ---
> >  NEWS|   1 +
> >  northd/ovn-northd.8.xml |   9 +-
> >  northd/ovn-northd.c | 166 +++
> >  tests/ovn-northd.at | 287 
> >  4 files changed, 400 insertions(+), 63 deletions(-)
> >
> > diff --git a/NEWS b/NEWS
> > index 530c5d42f..548a45fb7 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -7,6 +7,7 @@ Post-v21.03.0
> >  (This may take testing and tuning to be effective.)  This version 
> > of OVN
> >  requires DDLog 0.36.
> >- Introduce ovn-controller incremetal processing engine statistics
> > +  - Bypass connection tracking for ACL "allow" action processing.
> >
> >  OVN v21.03.0 - 12 Mar 2021
> >  -
> > diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> > index a62f5c057..f38d71682 100644
> > --- a/northd/ovn-northd.8.xml
> > +++ b/northd/ovn-northd.8.xml
> > @@ -419,7 +419,9 @@
> >before eventually advancing to ingress table ACLs. 
> > If
> >special ports such as route ports or localnet ports can't use 
> > ct(), a
> >priority-110 flow is added to skip over stateful ACLs. IPv6

Re: [ovs-dev] [PATCH v2 02/11] mpsc-queue: Module for lock-free message passing

2021-04-21 Thread 0-day Robot
Bleep bloop.  Greetings Gaëtan Rivet, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 97 characters long (recommended limit is 79)
#201 FILE: lib/mpsc-queue.c:107:
 * [1]: 
http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue

WARNING: Line is 84 characters long (recommended limit is 79)
#208 FILE: lib/mpsc-queue.c:114:
 * [3]: 
https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html

ERROR: Improper whitespace around control block
#528 FILE: lib/mpsc-queue.h:177:
#define MPSC_QUEUE_FOR_EACH(node, queue) \

ERROR: Improper whitespace around control block
#532 FILE: lib/mpsc-queue.h:181:
#define MPSC_QUEUE_FOR_EACH_POP(node, queue) \

Lines checked: 1303, Warnings: 2, Errors: 2


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v1 0/9] conntrack: improve multithread scalability

2021-04-21 Thread Gaëtan Rivet
On Wed, Feb 17, 2021, at 17:34, Gaetan Rivet wrote:
> Conntracks are executed within the datapath. Locks along this path are crucial
> and their critical section should be minimal. The global 'ct_lock' is 
> necessary
> before any action taken on connection states. This lock is needed for many
> operations on the conntrack, slowing down the datapath.
> 
> The cleanup thread 'ct_clean' will take it to do its job. As it can hold it a
> long time, the thread is limited in amount of connection cleaned per round,
> and calls are rate-limited.
> 
> * Timeout policies locking is contrived to avoid deadlock.
>   Anytime a connection state is updated, during its update it is unlocked,
>   'ct_lock' is taken, then the connection is locked again. Then the reverse
>   is done for unlock.
> 
> * Scalability is poor. The global ct_lock needs to be taken before applying
>   any change to a conn object. This is backward: local changes to smaller
>   objects should be independent, then the global lock should only be taken 
> once
>   the rest of the work is done, the goal being to have the smallest possible
>   critical section.
> 
> It can be improved. Using RCU-friendly structures for connections, zone limits
> and timeout policies, read-first workload is improved and the precedence of 
> the
> global 'ct_lock' and local 'conn->lock' can be inversed.
> 
> Running the conntrack benchmark we see these changes:
>   ./tests/ovstest test-conntrack benchmark  300 32
> 
> code \ N  1 2 4 8
>   Before   2310  2766  6117 19838  (ms)
>After   2072  2084  2653  4541  (ms)
> 
> One thread in the benchmark executes the task of a PMD, while the 
> 'ct_clean' thread
> runs in background as well.
> 
> Github actions: https://github.com/grivet/ovs/actions/runs/574446345
> 
> Gaetan Rivet (9):
>   conntrack: Use rcu-lists to store conn expirations
>   conntrack: Use a cmap to store zone limits
>   conntrack: Init hash basis first at creation
>   conntrack-tp: Use a cmap to store timeout policies
>   conntrack: Inverse conn and ct lock precedence
>   conntrack: Do not schedule zero ms timers
>   conntrack: Do not rate limit ct-sweep
>   conntrack: Do not log empty ct-sweep
>   conntrack: Use an atomic conn expiration value
> 
>  lib/conntrack-private.h |  96 +--
>  lib/conntrack-tp.c  | 137 +++---
>  lib/conntrack.c | 206 +---
>  lib/conntrack.h |   4 +-
>  lib/dpif-netdev.c   |   5 +-
>  5 files changed, 280 insertions(+), 168 deletions(-)
> 
> --
> 2.30.0
> 
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 

The v2 of this series has been sent: 
https://patchwork.ozlabs.org/project/openvswitch/list/?series=240241

-- 
Gaetan Rivet
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 10/11] conntrack: Do not log empty ct-sweep

2021-04-21 Thread Gaetan Rivet
Do not add noise to the DBG log for empty sweeps.
Only log time taken when some connections were cleaned.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/conntrack.c b/lib/conntrack.c
index 8a7538b7b..823fb060a 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -1647,8 +1647,10 @@ ct_sweep(struct conntrack *ct, long long now, size_t 
limit)
 }
 
 out:
-VLOG_DBG("conntrack cleanup %"PRIuSIZE" entries in %lld msec", count,
- time_msec() - now);
+if (count > 0) {
+VLOG_DBG("conntrack cleanup %"PRIuSIZE" entries in %lld msec", count,
+ time_msec() - now);
+}
 return min_expiration;
 }
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 09/11] conntrack: Do not rate limit ct-sweep

2021-04-21 Thread Gaetan Rivet
The current rate limit is set to allow other threads to update the
connections when applicable. This was valid when taking the 'ct_lock'
was needed with a global critical section.

Now that the size of the critical section for 'ct_lock' is reduced, it
is not necessary to rate limit calls to ct_sweep() anymore.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/lib/conntrack.c b/lib/conntrack.c
index ea2e5b63b..8a7538b7b 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -1675,20 +1675,12 @@ conntrack_clean(struct conntrack *ct, long long now)
  * there is an actual connection that expires, or because a new connection
  * might be created with the minimum timeout).
  *
- * The logic below has two goals:
- *
- * - We want to reduce the number of wakeups and batch connection cleanup
- *   when the load is not very high.  CT_CLEAN_INTERVAL ensures that if we
- *   are coping with the current cleanup tasks, then we wait at least
- *   5 seconds to do further cleanup.
- *
- * - We don't want to keep the map locked too long, as we might prevent
- *   traffic from flowing.  CT_CLEAN_MIN_INTERVAL ensures that if cleanup is
- *   behind, there is at least some 200ms blocks of time when the map will be
- *   left alone, so the datapath can operate unhindered.
+ * We want to reduce the number of wakeups and batch connection cleanup
+ * when the load is not very high.  CT_CLEAN_INTERVAL ensures that if we
+ * are coping with the current cleanup tasks, then we wait at least
+ * 5 seconds to do further cleanup.
  */
 #define CT_CLEAN_INTERVAL 5000 /* 5 seconds */
-#define CT_CLEAN_MIN_INTERVAL 200  /* 0.2 seconds */
 
 static void *
 clean_thread_main(void *f_)
@@ -1705,12 +1697,10 @@ clean_thread_main(void *f_)
 long long now = time_msec();
 next_wake = conntrack_clean(ct, now);
 
-if (next_wake < now) {
-poll_immediate_wake();
-} else if (next_wake < now + CT_CLEAN_MIN_INTERVAL) {
-poll_timer_wait_until(now + CT_CLEAN_MIN_INTERVAL);
+if (next_wake > now) {
+poll_timer_wait_until(MIN(next_wake, now + CT_CLEAN_INTERVAL));
 } else {
-poll_timer_wait_until(MAX(next_wake, now + CT_CLEAN_INTERVAL));
+poll_immediate_wake();
 }
 latch_wait(&ct->clean_thread_exit);
 poll_block();
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 11/11] conntrack: Use an atomic conn expiration value

2021-04-21 Thread Gaetan Rivet
A lock is taken during conn_lookup() to check whether a connection is
expired before returning it. This lock can have some contention.

Even though this lock ensures a consistent sequence of writes, it does
not imply a specific order. A ct_clean thread taking the lock first
could read a value that would be updated immediately after by a PMD
waiting on the same lock, just as well as the inverse order.

As such, the expiration time can be stale anytime it is read. In this
context, using an atomic will ensure the same guarantees for either
writes or reads, i.e. writes are consistent and reads are not undefined
behaviour. Reading an atomic is however less costly than taking and
releasing a lock.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack-tp.c  |  2 +-
 lib/conntrack.c | 27 +++
 3 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 51bec7272..553cb2163 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -144,7 +144,7 @@ struct conn {
 /* Mutable data. */
 struct ovs_mutex lock; /* Guards all mutable fields. */
 ovs_u128 label;
-long long expiration;
+atomic_llong expiration;
 uint32_t mark;
 int seq_skew;
 
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 22363e7fe..5bf2816ca 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -240,7 +240,7 @@ static void
 conn_schedule_expiration(struct conn *conn, enum ct_timeout tm, long long now,
  uint32_t tp_value)
 {
-conn->expiration = now + tp_value * 1000;
+atomic_store_relaxed(&conn->expiration, now + tp_value * 1000);
 conn->exp.tm = tm;
 ignore(atomic_flag_test_and_set(&conn->exp.reschedule));
 }
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 823fb060a..aa9adb601 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -99,6 +99,7 @@ static enum ct_update_res conn_update(struct conntrack *ct, 
struct conn *conn,
   struct dp_packet *pkt,
   struct conn_lookup_ctx *ctx,
   long long now);
+static long long int conn_expiration(const struct conn *);
 static bool conn_expired(struct conn *, long long now);
 static void set_mark(struct dp_packet *, struct conn *,
  uint32_t val, uint32_t mask);
@@ -1018,13 +1019,10 @@ un_nat_packet(struct dp_packet *pkt, const struct conn 
*conn,
 static void
 conn_seq_skew_set(struct conntrack *ct, const struct conn *conn_in,
   long long now, int seq_skew, bool seq_skew_dir)
-OVS_NO_THREAD_SAFETY_ANALYSIS
 {
 struct conn *conn;
-ovs_mutex_unlock(&conn_in->lock);
-conn_lookup(ct, &conn_in->key, now, &conn, NULL);
-ovs_mutex_lock(&conn_in->lock);
 
+conn_lookup(ct, &conn_in->key, now, &conn, NULL);
 if (conn && seq_skew) {
 conn->seq_skew = seq_skew;
 conn->seq_skew_dir = seq_skew_dir;
@@ -1596,9 +1594,7 @@ ct_sweep(struct conntrack *ct, long long now, size_t 
limit)
 continue;
 }
 
-ovs_mutex_lock(&conn->lock);
-expiration = conn->expiration;
-ovs_mutex_unlock(&conn->lock);
+expiration = conn_expiration(conn);
 
 if (conn == end_of_queue) {
 /* If we already re-enqueued this conn during this sweep,
@@ -2477,14 +2473,21 @@ conn_update(struct conntrack *ct, struct conn *conn, 
struct dp_packet *pkt,
 return update_res;
 }
 
+static long long int
+conn_expiration(const struct conn *conn)
+{
+long long int expiration;
+
+atomic_read_relaxed(&CONST_CAST(struct conn *, conn)->expiration,
+&expiration);
+return expiration;
+}
+
 static bool
 conn_expired(struct conn *conn, long long now)
 {
 if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-ovs_mutex_lock(&conn->lock);
-bool expired = now >= conn->expiration ? true : false;
-ovs_mutex_unlock(&conn->lock);
-return expired;
+return now >= conn_expiration(conn);
 }
 return false;
 }
@@ -2627,7 +2630,7 @@ conn_to_ct_dpif_entry(const struct conn *conn, struct 
ct_dpif_entry *entry,
 entry->mark = conn->mark;
 memcpy(&entry->labels, &conn->label, sizeof entry->labels);
 
-long long expiration = conn->expiration - now;
+long long expiration = conn_expiration(conn) - now;
 
 struct ct_l4_proto *class = l4_protos[conn->key.nw_proto];
 if (class->conn_get_protoinfo) {
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 07/11] conntrack: Inverse conn and ct lock precedence

2021-04-21 Thread Gaetan Rivet
The lock priority order is for the global 'ct_lock' to be taken first
and then 'conn->lock'. This is an issue, as multiple operations on
connections are thus blocked between threads contending on the
global 'ct_lock'.

This was previously necessary due to how the expiration lists, timeout
policies and zone limits were managed. They are now using RCU-friendly
structures that allow concurrent readers. The mutual exclusion now only
needs to happen during writes.

This allows reducing the 'ct_lock' precedence, and to only take it
when writing the relevant structures. This will reduce contention on
'ct_lock', which impairs scalability when the connection tracker is
used by many threads.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  7 --
 lib/conntrack-tp.c  | 30 +-
 lib/conntrack.c | 56 +
 3 files changed, 41 insertions(+), 52 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 23e4c9bb5..51bec7272 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -134,6 +134,9 @@ struct conn {
 struct nat_action_info_t *nat_info;
 char *alg;
 struct conn *nat_conn; /* The NAT 'conn' context, if there is one. */
+atomic_flag reclaimed; /* False during the lifetime of the connection,
+* True as soon as a thread has started freeing
+* its memory. */
 
 /* Inserted once by a PMD, then managed by the 'ct_clean' thread. */
 struct conn_expire exp;
@@ -196,8 +199,8 @@ struct conntrack {
 };
 
 /* Lock acquisition order:
- *1. 'ct_lock'
- *2. 'conn->lock'
+ *1. 'conn->lock'
+ *2. 'ct_lock'
  *3. 'resources_lock'
  */
 
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 592e10c6f..22363e7fe 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -245,58 +245,30 @@ conn_schedule_expiration(struct conn *conn, enum 
ct_timeout tm, long long now,
 ignore(atomic_flag_test_and_set(&conn->exp.reschedule));
 }
 
-static void
-conn_update_expiration__(struct conntrack *ct, struct conn *conn,
- enum ct_timeout tm, long long now,
- uint32_t tp_value)
-OVS_REQUIRES(conn->lock)
-{
-ovs_mutex_unlock(&conn->lock);
-
-ovs_mutex_lock(&ct->ct_lock);
-ovs_mutex_lock(&conn->lock);
-conn_schedule_expiration(conn, tm, now, tp_value);
-ovs_mutex_unlock(&conn->lock);
-ovs_mutex_unlock(&ct->ct_lock);
-
-ovs_mutex_lock(&conn->lock);
-}
-
 /* The conn entry lock must be held on entry and exit. */
 void
 conn_update_expiration(struct conntrack *ct, struct conn *conn,
enum ct_timeout tm, long long now)
-OVS_REQUIRES(conn->lock)
 {
 struct timeout_policy *tp;
 uint32_t val;
 
-ovs_mutex_unlock(&conn->lock);
-
-ovs_mutex_lock(&ct->ct_lock);
-ovs_mutex_lock(&conn->lock);
 tp = timeout_policy_lookup(ct, conn->tp_id);
 if (tp) {
 val = tp->policy.attrs[tm_to_ct_dpif_tp(tm)];
 } else {
 val = ct_dpif_netdev_tp_def[tm_to_ct_dpif_tp(tm)];
 }
-ovs_mutex_unlock(&conn->lock);
-ovs_mutex_unlock(&ct->ct_lock);
-
-ovs_mutex_lock(&conn->lock);
 VLOG_DBG_RL(&rl, "Update timeout %s zone=%u with policy id=%d "
 "val=%u sec.",
 ct_timeout_str[tm], conn->key.zone, conn->tp_id, val);
 
-conn_update_expiration__(ct, conn, tm, now, val);
+conn_schedule_expiration(conn, tm, now, val);
 }
 
-/* ct_lock must be held. */
 void
 conn_init_expiration(struct conntrack *ct, struct conn *conn,
  enum ct_timeout tm, long long now)
-OVS_REQUIRES(ct->ct_lock)
 {
 struct timeout_policy *tp;
 uint32_t val;
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 7e565953d..16ed8de15 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -465,7 +465,7 @@ zone_limit_delete(struct conntrack *ct, uint16_t zone)
 
 static void
 conn_clean_cmn(struct conntrack *ct, struct conn *conn)
-OVS_REQUIRES(ct->ct_lock)
+OVS_REQUIRES(conn->lock, ct->ct_lock)
 {
 if (conn->alg) {
 expectation_clean(ct, &conn->key);
@@ -495,18 +495,29 @@ conn_unref(struct conn *conn)
  * removes the associated nat 'conn' from the lookup datastructures. */
 static void
 conn_clean(struct conntrack *ct, struct conn *conn)
-OVS_REQUIRES(ct->ct_lock)
+OVS_EXCLUDED(conn->lock, ct->ct_lock)
 {
 ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
 
+if (atomic_flag_test_and_set(&conn->reclaimed)) {
+return;
+}
+
+ovs_mutex_lock(&conn->lock);
+
+ovs_mutex_lock(&ct->ct_lock);
 conn_clean_cmn(ct, conn);
 if (conn->nat_conn) {
 uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
 cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
 }
+ovs_mutex_unlock(&ct->ct_lock);
+
 conn->cleaned = true;
 conn_unref(conn);

[ovs-dev] [PATCH v2 03/11] conntrack: Use mpsc-queue to store conn expirations

2021-04-21 Thread Gaetan Rivet
Change the connection expiration lists from ovs_list to mpsc-queue.
This is a pre-step towards reducing the granularity of 'ct_lock'.

It simplifies the responsibilities toward updating the expiration queue.
The dataplane now appends the new conn for expiration once during
creation.  Any further update will only consist in writing the conn
expiration limit and marking the conn for expiration rescheduling.

The ageing thread 'ct_clean' is the only one consuming the expiration
lists.  If a conn was marked for rescheduling by a dataplane, it will
move the conn to the end of the queue.

Once the locks have been reworked, it means neither the dataplane
threads nor 'ct_clean' have to take a lock to update the expiration
lists (assuming the consumer lock is perpetually held by 'ct_clean');

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  84 +++-
 lib/conntrack-tp.c  |  28 +-
 lib/conntrack.c | 118 ++--
 3 files changed, 173 insertions(+), 57 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index e8332bdba..7db249fdb 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -29,6 +29,7 @@
 #include "openvswitch/list.h"
 #include "openvswitch/types.h"
 #include "packets.h"
+#include "mpsc-queue.h"
 #include "unaligned.h"
 #include "dp-packet.h"
 
@@ -86,22 +87,57 @@ struct alg_exp_node {
 bool nat_rpl_dst;
 };
 
+/* Timeouts: all the possible timeout states passed to update_expiration()
+ * are listed here. The name will be prefix by CT_TM_ and the value is in
+ * milliseconds */
+#define CT_TIMEOUTS \
+CT_TIMEOUT(TCP_FIRST_PACKET) \
+CT_TIMEOUT(TCP_OPENING) \
+CT_TIMEOUT(TCP_ESTABLISHED) \
+CT_TIMEOUT(TCP_CLOSING) \
+CT_TIMEOUT(TCP_FIN_WAIT) \
+CT_TIMEOUT(TCP_CLOSED) \
+CT_TIMEOUT(OTHER_FIRST) \
+CT_TIMEOUT(OTHER_MULTIPLE) \
+CT_TIMEOUT(OTHER_BIDIR) \
+CT_TIMEOUT(ICMP_FIRST) \
+CT_TIMEOUT(ICMP_REPLY)
+
+enum ct_timeout {
+#define CT_TIMEOUT(NAME) CT_TM_##NAME,
+CT_TIMEOUTS
+#undef CT_TIMEOUT
+N_CT_TM
+};
+
 enum OVS_PACKED_ENUM ct_conn_type {
 CT_CONN_TYPE_DEFAULT,
 CT_CONN_TYPE_UN_NAT,
 };
 
+struct conn_expire {
+struct mpsc_queue_node node;
+/* Timeout state of the connection.
+ * It follows the connection state updates.
+ */
+enum ct_timeout tm;
+atomic_flag reschedule;
+struct ovs_refcount refcount;
+};
+
 struct conn {
 /* Immutable data. */
 struct conn_key key;
 struct conn_key rev_key;
 struct conn_key parent_key; /* Only used for orig_tuple support. */
-struct ovs_list exp_node;
 struct cmap_node cm_node;
 struct nat_action_info_t *nat_info;
 char *alg;
 struct conn *nat_conn; /* The NAT 'conn' context, if there is one. */
 
+/* Inserted once by a PMD, then managed by the 'ct_clean' thread. */
+struct conn_expire exp;
+
 /* Mutable data. */
 struct ovs_mutex lock; /* Guards all mutable fields. */
 ovs_u128 label;
@@ -132,33 +168,10 @@ enum ct_update_res {
 CT_UPDATE_VALID_NEW,
 };
 
-/* Timeouts: all the possible timeout states passed to update_expiration()
- * are listed here. The name will be prefix by CT_TM_ and the value is in
- * milliseconds */
-#define CT_TIMEOUTS \
-CT_TIMEOUT(TCP_FIRST_PACKET) \
-CT_TIMEOUT(TCP_OPENING) \
-CT_TIMEOUT(TCP_ESTABLISHED) \
-CT_TIMEOUT(TCP_CLOSING) \
-CT_TIMEOUT(TCP_FIN_WAIT) \
-CT_TIMEOUT(TCP_CLOSED) \
-CT_TIMEOUT(OTHER_FIRST) \
-CT_TIMEOUT(OTHER_MULTIPLE) \
-CT_TIMEOUT(OTHER_BIDIR) \
-CT_TIMEOUT(ICMP_FIRST) \
-CT_TIMEOUT(ICMP_REPLY)
-
-enum ct_timeout {
-#define CT_TIMEOUT(NAME) CT_TM_##NAME,
-CT_TIMEOUTS
-#undef CT_TIMEOUT
-N_CT_TM
-};
-
 struct conntrack {
 struct ovs_mutex ct_lock; /* Protects 2 following fields. */
 struct cmap conns OVS_GUARDED;
-struct ovs_list exp_lists[N_CT_TM] OVS_GUARDED;
+struct mpsc_queue exp_lists[N_CT_TM];
 struct hmap zone_limits OVS_GUARDED;
 struct hmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
@@ -204,4 +217,25 @@ struct ct_l4_proto {
struct ct_dpif_protoinfo *);
 };
 
+static inline void
+conn_expire_append(struct conntrack *ct, struct conn *conn)
+{
+if (ovs_refcount_try_ref_rcu(&conn->exp.refcount)) {
+atomic_flag_clear(&conn->exp.reschedule);
+mpsc_queue_insert(&ct->exp_lists[conn->exp.tm], &conn->exp.node);
+}
+}
+
+static inline void
+conn_expire_prepend(struct conntrack *ct, struct conn *conn)
+OVS_REQUIRES(ct->exp_lists[conn->exp.tm].read_lock)
+{
+if (ovs_refcount_try_ref_rcu(&conn->exp.refcount)) {
+/* Do not change 'reschedule' state, if this expire node is put
+ * at the tail of the list, it will be re-examined next sweep.
+ */
+mpsc_queue_push_back(&ct->exp_lists[conn->exp.tm], &conn->exp

[ovs-dev] [PATCH v2 08/11] conntrack: Do not schedule zero ms timers

2021-04-21 Thread Gaetan Rivet
When ct_sweep() is far behind on its work, the 'next_wake' returned can
be before the moment it started. When it happens, the thread schedules a
zero ms timer that is logged as an error.

Instead, mark the thread for immediate wake in the next poll_block().

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/conntrack.c b/lib/conntrack.c
index 16ed8de15..ea2e5b63b 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -1706,6 +1706,8 @@ clean_thread_main(void *f_)
 next_wake = conntrack_clean(ct, now);
 
 if (next_wake < now) {
+poll_immediate_wake();
+} else if (next_wake < now + CT_CLEAN_MIN_INTERVAL) {
 poll_timer_wait_until(now + CT_CLEAN_MIN_INTERVAL);
 } else {
 poll_timer_wait_until(MAX(next_wake, now + CT_CLEAN_INTERVAL));
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 06/11] conntrack-tp: Use a cmap to store timeout policies

2021-04-21 Thread Gaetan Rivet
Multiple lookups are done to stored timeout policies, each time blocking
the global 'ct_lock'. This is usually not necessary and it should be
acceptable to get policy updates slightly delayed (by one RCU sync
at most). Using a CMAP reduces multiple lock taking and releasing in
the connection insertion path.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack-tp.c  | 54 +++--
 lib/conntrack.c |  9 ---
 lib/conntrack.h |  2 +-
 4 files changed, 38 insertions(+), 29 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 3a9583bce..23e4c9bb5 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -173,7 +173,7 @@ struct conntrack {
 struct cmap conns OVS_GUARDED;
 struct mpsc_queue exp_lists[N_CT_TM];
 struct cmap zone_limits OVS_GUARDED;
-struct hmap timeout_policies OVS_GUARDED;
+struct cmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
 pthread_t clean_thread; /* Periodically cleans up connection tracker. */
 struct latch clean_thread_exit; /* To destroy the 'clean_thread'. */
diff --git a/lib/conntrack-tp.c b/lib/conntrack-tp.c
index 6de2354c0..592e10c6f 100644
--- a/lib/conntrack-tp.c
+++ b/lib/conntrack-tp.c
@@ -47,14 +47,15 @@ static unsigned int ct_dpif_netdev_tp_def[] = {
 };
 
 static struct timeout_policy *
-timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
+timeout_policy_lookup_protected(struct conntrack *ct, int32_t tp_id)
 OVS_REQUIRES(ct->ct_lock)
 {
 struct timeout_policy *tp;
 uint32_t hash;
 
 hash = hash_int(tp_id, ct->hash_basis);
-HMAP_FOR_EACH_IN_BUCKET (tp, node, hash, &ct->timeout_policies) {
+CMAP_FOR_EACH_WITH_HASH_PROTECTED (tp, node, hash,
+   &ct->timeout_policies) {
 if (tp->policy.id == tp_id) {
 return tp;
 }
@@ -62,20 +63,25 @@ timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
 return NULL;
 }
 
-struct timeout_policy *
-timeout_policy_get(struct conntrack *ct, int32_t tp_id)
+static struct timeout_policy *
+timeout_policy_lookup(struct conntrack *ct, int32_t tp_id)
 {
 struct timeout_policy *tp;
+uint32_t hash;
 
-ovs_mutex_lock(&ct->ct_lock);
-tp = timeout_policy_lookup(ct, tp_id);
-if (!tp) {
-ovs_mutex_unlock(&ct->ct_lock);
-return NULL;
+hash = hash_int(tp_id, ct->hash_basis);
+CMAP_FOR_EACH_WITH_HASH (tp, node, hash, &ct->timeout_policies) {
+if (tp->policy.id == tp_id) {
+return tp;
+}
 }
+return NULL;
+}
 
-ovs_mutex_unlock(&ct->ct_lock);
-return tp;
+struct timeout_policy *
+timeout_policy_get(struct conntrack *ct, int32_t tp_id)
+{
+return timeout_policy_lookup(ct, tp_id);
 }
 
 static void
@@ -125,27 +131,30 @@ timeout_policy_create(struct conntrack *ct,
 init_default_tp(tp, tp_id);
 update_existing_tp(tp, new_tp);
 hash = hash_int(tp_id, ct->hash_basis);
-hmap_insert(&ct->timeout_policies, &tp->node, hash);
+cmap_insert(&ct->timeout_policies, &tp->node, hash);
 }
 
 static void
 timeout_policy_clean(struct conntrack *ct, struct timeout_policy *tp)
 OVS_REQUIRES(ct->ct_lock)
 {
-hmap_remove(&ct->timeout_policies, &tp->node);
-free(tp);
+uint32_t hash = hash_int(tp->policy.id, ct->hash_basis);
+cmap_remove(&ct->timeout_policies, &tp->node, hash);
+ovsrcu_postpone(free, tp);
 }
 
 static int
-timeout_policy_delete__(struct conntrack *ct, uint32_t tp_id)
+timeout_policy_delete__(struct conntrack *ct, uint32_t tp_id,
+bool warn_on_error)
 OVS_REQUIRES(ct->ct_lock)
 {
+struct timeout_policy *tp;
 int err = 0;
-struct timeout_policy *tp = timeout_policy_lookup(ct, tp_id);
 
+tp = timeout_policy_lookup_protected(ct, tp_id);
 if (tp) {
 timeout_policy_clean(ct, tp);
-} else {
+} else if (warn_on_error) {
 VLOG_WARN_RL(&rl, "Failed to delete a non-existent timeout "
  "policy: id=%d", tp_id);
 err = ENOENT;
@@ -159,7 +168,7 @@ timeout_policy_delete(struct conntrack *ct, uint32_t tp_id)
 int err;
 
 ovs_mutex_lock(&ct->ct_lock);
-err = timeout_policy_delete__(ct, tp_id);
+err = timeout_policy_delete__(ct, tp_id, true);
 ovs_mutex_unlock(&ct->ct_lock);
 return err;
 }
@@ -170,7 +179,7 @@ timeout_policy_init(struct conntrack *ct)
 {
 struct timeout_policy tp;
 
-hmap_init(&ct->timeout_policies);
+cmap_init(&ct->timeout_policies);
 
 /* Create default timeout policy. */
 memset(&tp, 0, sizeof tp);
@@ -182,14 +191,11 @@ int
 timeout_policy_update(struct conntrack *ct,
   struct timeout_policy *new_tp)
 {
-int err = 0;
 uint32_t tp_id = new_tp->policy.id;
+int err = 0;
 
 ovs_mutex_lock(&ct->ct_lock);
-struct timeout_policy 

[ovs-dev] [PATCH v2 04/11] conntrack: Use a cmap to store zone limits

2021-04-21 Thread Gaetan Rivet
Change the data structure from hmap to cmap for zone limits.
As they are shared amongst multiple conntrack users, multiple
readers want to check the current zone limit state before progressing in
their processing. Using a CMAP allows doing lookups without taking the
global 'ct_lock', thus reducing contention.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack-private.h |  2 +-
 lib/conntrack.c | 70 -
 lib/conntrack.h |  2 +-
 lib/dpif-netdev.c   |  5 +--
 4 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/lib/conntrack-private.h b/lib/conntrack-private.h
index 7db249fdb..3a9583bce 100644
--- a/lib/conntrack-private.h
+++ b/lib/conntrack-private.h
@@ -172,7 +172,7 @@ struct conntrack {
 struct ovs_mutex ct_lock; /* Protects 2 following fields. */
 struct cmap conns OVS_GUARDED;
 struct mpsc_queue exp_lists[N_CT_TM];
-struct hmap zone_limits OVS_GUARDED;
+struct cmap zone_limits OVS_GUARDED;
 struct hmap timeout_policies OVS_GUARDED;
 uint32_t hash_basis; /* Salt for hashing a connection key. */
 pthread_t clean_thread; /* Periodically cleans up connection tracker. */
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 4243541da..59a8c5342 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -79,7 +79,7 @@ enum ct_alg_ctl_type {
 };
 
 struct zone_limit {
-struct hmap_node node;
+struct cmap_node node;
 struct conntrack_zone_limit czl;
 };
 
@@ -303,7 +303,7 @@ conntrack_init(void)
 for (unsigned i = 0; i < ARRAY_SIZE(ct->exp_lists); i++) {
 mpsc_queue_init(&ct->exp_lists[i]);
 }
-hmap_init(&ct->zone_limits);
+cmap_init(&ct->zone_limits);
 ct->zone_limit_seq = 0;
 timeout_policy_init(ct);
 ovs_mutex_unlock(&ct->ct_lock);
@@ -339,12 +339,25 @@ zone_key_hash(int32_t zone, uint32_t basis)
 }
 
 static struct zone_limit *
-zone_limit_lookup(struct conntrack *ct, int32_t zone)
+zone_limit_lookup_protected(struct conntrack *ct, int32_t zone)
 OVS_REQUIRES(ct->ct_lock)
 {
 uint32_t hash = zone_key_hash(zone, ct->hash_basis);
 struct zone_limit *zl;
-HMAP_FOR_EACH_IN_BUCKET (zl, node, hash, &ct->zone_limits) {
+CMAP_FOR_EACH_WITH_HASH_PROTECTED (zl, node, hash, &ct->zone_limits) {
+if (zl->czl.zone == zone) {
+return zl;
+}
+}
+return NULL;
+}
+
+static struct zone_limit *
+zone_limit_lookup(struct conntrack *ct, int32_t zone)
+{
+uint32_t hash = zone_key_hash(zone, ct->hash_basis);
+struct zone_limit *zl;
+CMAP_FOR_EACH_WITH_HASH (zl, node, hash, &ct->zone_limits) {
 if (zl->czl.zone == zone) {
 return zl;
 }
@@ -354,7 +367,6 @@ zone_limit_lookup(struct conntrack *ct, int32_t zone)
 
 static struct zone_limit *
 zone_limit_lookup_or_default(struct conntrack *ct, int32_t zone)
-OVS_REQUIRES(ct->ct_lock)
 {
 struct zone_limit *zl = zone_limit_lookup(ct, zone);
 return zl ? zl : zone_limit_lookup(ct, DEFAULT_ZONE);
@@ -363,13 +375,16 @@ zone_limit_lookup_or_default(struct conntrack *ct, 
int32_t zone)
 struct conntrack_zone_limit
 zone_limit_get(struct conntrack *ct, int32_t zone)
 {
-ovs_mutex_lock(&ct->ct_lock);
-struct conntrack_zone_limit czl = {DEFAULT_ZONE, 0, 0, 0};
+struct conntrack_zone_limit czl = {
+.zone = DEFAULT_ZONE,
+.limit = 0,
+.count = ATOMIC_COUNT_INIT(0),
+.zone_limit_seq = 0,
+};
 struct zone_limit *zl = zone_limit_lookup_or_default(ct, zone);
 if (zl) {
 czl = zl->czl;
 }
-ovs_mutex_unlock(&ct->ct_lock);
 return czl;
 }
 
@@ -377,13 +392,19 @@ static int
 zone_limit_create(struct conntrack *ct, int32_t zone, uint32_t limit)
 OVS_REQUIRES(ct->ct_lock)
 {
+struct zone_limit *zl = zone_limit_lookup_protected(ct, zone);
+
+if (zl) {
+return 0;
+}
+
 if (zone >= DEFAULT_ZONE && zone <= MAX_ZONE) {
-struct zone_limit *zl = xzalloc(sizeof *zl);
+zl = xzalloc(sizeof *zl);
 zl->czl.limit = limit;
 zl->czl.zone = zone;
 zl->czl.zone_limit_seq = ct->zone_limit_seq++;
 uint32_t hash = zone_key_hash(zone, ct->hash_basis);
-hmap_insert(&ct->zone_limits, &zl->node, hash);
+cmap_insert(&ct->zone_limits, &zl->node, hash);
 return 0;
 } else {
 return EINVAL;
@@ -394,13 +415,14 @@ int
 zone_limit_update(struct conntrack *ct, int32_t zone, uint32_t limit)
 {
 int err = 0;
-ovs_mutex_lock(&ct->ct_lock);
 struct zone_limit *zl = zone_limit_lookup(ct, zone);
 if (zl) {
 zl->czl.limit = limit;
 VLOG_INFO("Changed zone limit of %u for zone %d", limit, zone);
 } else {
+ovs_mutex_lock(&ct->ct_lock);
 err = zone_limit_create(ct, zone, limit);
+ovs_mutex_unlock(&ct->ct_lock);
 if (!err) {
 VLOG_INFO("Created zone limit of %u for zone %d", limit, zone);
 } else {
@@ -

[ovs-dev] [PATCH v2 02/11] mpsc-queue: Module for lock-free message passing

2021-04-21 Thread Gaetan Rivet
Add a lockless multi-producer/single-consumer (MPSC), linked-list based,
intrusive, unbounded queue that does not require deferred memory
management.

The queue is an implementation of the structure described by Dmitri
Vyukov[1]. It adds a slightly more explicit API explaining the proper use
of the queue.
Alternatives were considered such as a Treiber Stack [2] or a
Michael-Scott queue [3], but this one is faster, simpler and scalable.

[1]: 
http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue

[2]: R. K. Treiber. Systems programming: Coping with parallelism.
 Technical Report RJ 5118, IBM Almaden Research Center, April 1986.

[3]: M. M. Michael, Simple, Fast, and Practical Non-Blocking and Blocking
 Concurrent Queue Algorithms.
 
https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html

The queue is designed to improve the specific MPSC setup.  A benchmark
accompanies the unit tests to measure the difference in this configuration.
A single reader thread polls the queue while N writers enqueue elements
as fast as possible.  The mpsc-queue is compared against the regular ovs-list
as well as the guarded list.  The latter usually offers a slight improvement
by batching the element removal, however the mpsc-queue is faster.

The average is of each producer threads time:

   $ ./tests/ovstest test-mpsc-queue benchmark 300 1
   Benchmarking n=300 on 1 + 1 threads.
type\thread:  Reader  1Avg
 mpsc-queue: 161161161 ms
   list: 803803803 ms
   guarded list: 665665665 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 2
   Benchmarking n=300 on 1 + 2 threads.
type\thread:  Reader  1  2Avg
 mpsc-queue: 102101 97 99 ms
   list: 246212246229 ms
   guarded list: 264263214238 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 3
   Benchmarking n=300 on 1 + 3 threads.
type\thread:  Reader  1  2  3Avg
 mpsc-queue:  92 91 92 91 91 ms
   list: 520517515520517 ms
   guarded list: 405395401404400 ms

   $ ./tests/ovstest test-mpsc-queue benchmark 300 4
   Benchmarking n=300 on 1 + 4 threads.
type\thread:  Reader  1  2  3  4Avg
 mpsc-queue:  77 73 73 77 75 74 ms
   list: 371359361287370344 ms
   guarded list: 389388359363357366 ms

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/automake.mk |   2 +
 lib/mpsc-queue.c| 251 ++
 lib/mpsc-queue.h| 189 +++
 tests/automake.mk   |   1 +
 tests/library.at|   5 +
 tests/test-mpsc-queue.c | 727 
 6 files changed, 1175 insertions(+)
 create mode 100644 lib/mpsc-queue.c
 create mode 100644 lib/mpsc-queue.h
 create mode 100644 tests/test-mpsc-queue.c

diff --git a/lib/automake.mk b/lib/automake.mk
index 39901bd6d..9c6836688 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -166,6 +166,8 @@ lib_libopenvswitch_la_SOURCES = \
lib/memory.c \
lib/memory.h \
lib/meta-flow.c \
+   lib/mpsc-queue.c \
+   lib/mpsc-queue.h \
lib/multipath.c \
lib/multipath.h \
lib/namemap.c \
diff --git a/lib/mpsc-queue.c b/lib/mpsc-queue.c
new file mode 100644
index 0..cd26ed207
--- /dev/null
+++ b/lib/mpsc-queue.c
@@ -0,0 +1,251 @@
+/*
+ * Copyright (c) 2020 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include "ovs-atomic.h"
+
+#include "mpsc-queue.h"
+
+/* Multi-producer, single-consumer queue
+ * =
+ *
+ * This an implementation of the MPSC queue described by Dmitri Vyukov [1].
+ *
+ * One atomic exchange operation is done per insertion.  Removal in most cases
+ * will not require atomic operation and will use one atomic exchange to close
+ * the queue chain.
+ *
+ * Insertion
+ * =
+ *
+ * The queue is implemented using a linked-list.  Insertion is done at the
+ * back of the queue, by swapping the current end with the new node atomically,
+ * then pointing the previous end toward the new node.  To follow Vyukov
+ * nomenclature, the end-node of the chain is called head.  A producer will
+ * only 

[ovs-dev] [PATCH v2 05/11] conntrack: Init hash basis first at creation

2021-04-21 Thread Gaetan Rivet
The 'hash_basis' field is used sometimes during sub-systems init
routine. It will be 0 by default before randomization. Sub-systems would
then init some nodes with incorrect hash values.

The timeout policies module is affected, making the default policy being
referenced using an incorrect hash value.

Fixes: 2078901a4c14 ("userspace: Add conntrack timeout policy support.")
Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/conntrack.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/conntrack.c b/lib/conntrack.c
index 59a8c5342..034f440c5 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -291,6 +291,11 @@ conntrack_init(void)
 static struct ovsthread_once setup_l4_once = OVSTHREAD_ONCE_INITIALIZER;
 struct conntrack *ct = xzalloc(sizeof *ct);
 
+/* This value can be used during init (e.g. timeout_policy_init()),
+ * set it first to ensure it is available.
+ */
+ct->hash_basis = random_uint32();
+
 ovs_rwlock_init(&ct->resources_lock);
 ovs_rwlock_wrlock(&ct->resources_lock);
 hmap_init(&ct->alg_expectations);
@@ -308,7 +313,6 @@ conntrack_init(void)
 timeout_policy_init(ct);
 ovs_mutex_unlock(&ct->ct_lock);
 
-ct->hash_basis = random_uint32();
 atomic_count_init(&ct->n_conn, 0);
 atomic_init(&ct->n_conn_limit, DEFAULT_N_CONN_LIMIT);
 atomic_init(&ct->tcp_seq_chk, true);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 01/11] ovs-atomic: Expose atomic exchange operation

2021-04-21 Thread Gaetan Rivet
The atomic exchange operation is a useful primitive that should be
available as well.  Most compiler already expose or offer a way
to use it, but a single symbol needs to be defined.

Signed-off-by: Gaetan Rivet 
Reviewed-by: Eli Britstein 
---
 lib/ovs-atomic-c++.h  |  3 +++
 lib/ovs-atomic-clang.h|  5 +
 lib/ovs-atomic-gcc4+.h|  5 +
 lib/ovs-atomic-gcc4.7+.h  |  5 +
 lib/ovs-atomic-i586.h |  5 +
 lib/ovs-atomic-locked.h   |  9 +
 lib/ovs-atomic-msvc.h | 22 ++
 lib/ovs-atomic-pthreads.h |  5 +
 lib/ovs-atomic-x86_64.h   |  5 +
 lib/ovs-atomic.h  |  8 +++-
 10 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/lib/ovs-atomic-c++.h b/lib/ovs-atomic-c++.h
index d47b8dd39..8605fa9d3 100644
--- a/lib/ovs-atomic-c++.h
+++ b/lib/ovs-atomic-c++.h
@@ -29,6 +29,9 @@ using std::atomic_compare_exchange_strong_explicit;
 using std::atomic_compare_exchange_weak;
 using std::atomic_compare_exchange_weak_explicit;
 
+using std::atomic_exchange;
+using std::atomic_exchange_explicit;
+
 #define atomic_read(SRC, DST) \
 atomic_read_explicit(SRC, DST, memory_order_seq_cst)
 #define atomic_read_explicit(SRC, DST, ORDER)   \
diff --git a/lib/ovs-atomic-clang.h b/lib/ovs-atomic-clang.h
index 34cc2faa7..cdf02a512 100644
--- a/lib/ovs-atomic-clang.h
+++ b/lib/ovs-atomic-clang.h
@@ -67,6 +67,11 @@ typedef enum {
 #define atomic_compare_exchange_weak_explicit(DST, EXP, SRC, ORD1, ORD2) \
 __c11_atomic_compare_exchange_weak(DST, EXP, SRC, ORD1, ORD2)
 
+#define atomic_exchange(RMW, ARG) \
+atomic_exchange_explicit(RMW, ARG, memory_order_seq_cst)
+#define atomic_exchange_explicit(RMW, ARG, ORDER) \
+__c11_atomic_exchange(RMW, ARG, ORDER)
+
 #define atomic_add(RMW, ARG, ORIG) \
 atomic_add_explicit(RMW, ARG, ORIG, memory_order_seq_cst)
 #define atomic_sub(RMW, ARG, ORIG) \
diff --git a/lib/ovs-atomic-gcc4+.h b/lib/ovs-atomic-gcc4+.h
index 25bcf20a0..f9accde1a 100644
--- a/lib/ovs-atomic-gcc4+.h
+++ b/lib/ovs-atomic-gcc4+.h
@@ -128,6 +128,11 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit   \
 atomic_compare_exchange_strong_explicit
 
+#define atomic_exchange_explicit(DST, SRC, ORDER) \
+__sync_lock_test_and_set(DST, SRC)
+#define atomic_exchange(DST, SRC) \
+atomic_exchange_explicit(DST, SRC, memory_order_seq_cst)
+
 #define atomic_op__(RMW, OP, ARG, ORIG) \
 ({  \
 typeof(RMW) rmw__ = (RMW);  \
diff --git a/lib/ovs-atomic-gcc4.7+.h b/lib/ovs-atomic-gcc4.7+.h
index 4c197ebe0..846e05775 100644
--- a/lib/ovs-atomic-gcc4.7+.h
+++ b/lib/ovs-atomic-gcc4.7+.h
@@ -61,6 +61,11 @@ typedef enum {
 #define atomic_compare_exchange_weak_explicit(DST, EXP, SRC, ORD1, ORD2) \
 __atomic_compare_exchange_n(DST, EXP, SRC, true, ORD1, ORD2)
 
+#define atomic_exchange_explicit(DST, SRC, ORDER) \
+__atomic_exchange_n(DST, SRC, ORDER)
+#define atomic_exchange(DST, SRC) \
+atomic_exchange_explicit(DST, SRC, memory_order_seq_cst)
+
 #define atomic_add(RMW, OPERAND, ORIG) \
 atomic_add_explicit(RMW, OPERAND, ORIG, memory_order_seq_cst)
 #define atomic_sub(RMW, OPERAND, ORIG) \
diff --git a/lib/ovs-atomic-i586.h b/lib/ovs-atomic-i586.h
index 9a385ce84..35a0959ff 100644
--- a/lib/ovs-atomic-i586.h
+++ b/lib/ovs-atomic-i586.h
@@ -400,6 +400,11 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit   \
 atomic_compare_exchange_strong_explicit
 
+#define atomic_exchange_explicit(RMW, ARG, ORDER) \
+atomic_exchange__(RMW, ARG, ORDER)
+#define atomic_exchange(RMW, ARG) \
+atomic_exchange_explicit(RMW, ARG, memory_order_seq_cst)
+
 #define atomic_add__(RMW, ARG, CLOB)\
 asm volatile("lock; xadd %0,%1 ; "  \
  "# atomic_add__ "  \
diff --git a/lib/ovs-atomic-locked.h b/lib/ovs-atomic-locked.h
index f8f0ba2a5..bf38c4a43 100644
--- a/lib/ovs-atomic-locked.h
+++ b/lib/ovs-atomic-locked.h
@@ -31,6 +31,15 @@ void atomic_unlock__(void *);
  atomic_unlock__(DST),  \
  false)))
 
+#define atomic_exchange_locked(DST, SRC) \
+({   \
+atomic_lock__(DST);  \
+typeof(*(DST)) __tmp = *(DST);   \
+*(DST) = SRC;\
+atomic_unlock__(DST);\
+__tmp;   \
+})
+
 #define atomic_op_locked_add +=
 #define atomic_op_locked_sub -=
 #define atomic_op_locked_or  |=
diff --git a/lib/ovs-atomic-msvc.h b/lib/ovs-atomic-msvc.h
index 9def887d3..ef8310269 100644
--- a/lib/ovs-atomic-msvc.h
+++ b/lib/ovs-atomic-msvc.h
@@ -345,6 +345,28 @@ atomic_signal_fence(memory_order order)
 #define atomic_compare_exchange_weak_explicit \
 atomic_compare_exchange_strong_explicit
 
+/* While intrinsics offering

[ovs-dev] [PATCH v2 00/11] conntrack: improve multithread scalability

2021-04-21 Thread Gaetan Rivet
Conntracks are executed within the datapath. Locks along this path are crucial
and their critical section should be minimal. The global 'ct_lock' is necessary
before any action taken on connection states. This lock is needed for many
operations on the conntrack, slowing down the datapath.

The cleanup thread 'ct_clean' will take it to do its job. As it can hold it a
long time, the thread is limited in amount of connection cleaned per round,
and calls are rate-limited.

* Timeout policies locking is contrived to avoid deadlock.
  Anytime a connection state is updated, during its update it is unlocked,
  'ct_lock' is taken, then the connection is locked again. Then the reverse
  is done for unlock.

* Scalability is poor. The global ct_lock needs to be taken before applying
  any change to a conn object. This is backward: local changes to smaller
  objects should be independent, then the global lock should only be taken once
  the rest of the work is done, the goal being to have the smallest possible
  critical section.

It can be improved. Using RCU-friendly structures for connections, zone limits
and timeout policies, read-first workload is improved and the precedence of the
global 'ct_lock' and local 'conn->lock' can be inversed.

Running the conntrack benchmark we see these changes:
  ./tests/ovstest test-conntrack benchmark  300 32

code \ N  1 2 4 8
  Before   2310  2766  6117 19838  (ms)
   After   2072  2084  2653  4541  (ms)

One thread in the benchmark executes the task of a PMD, while the 'ct_clean' 
thread
runs in background as well.

Github actions: https://github.com/grivet/ovs/actions/runs/574446345

v2:

An mpsc-queue is used instead of rculist to manage connection expirations lists.
PMDs and ct_clean all act as producers, while ct_clean is the sole consumer 
thread.
A PMD now needs to take the 'ct_lock' only when creating a new connection, and 
only
while inserting it in the conn CMAP. For any updates, only the conn lock is now 
required,
to properly change its state.

The mpsc-queue implementation is identical to the one from the parallel offload 
series [1].

CI: https://github.com/grivet/ovs/actions/runs/772118640

[1]: https://patchwork.ozlabs.org/project/openvswitch/list/?series=238779

Gaetan Rivet (11):
  ovs-atomic: Expose atomic exchange operation
  mpsc-queue: Module for lock-free message passing
  conntrack: Use mpsc-queue to store conn expirations
  conntrack: Use a cmap to store zone limits
  conntrack: Init hash basis first at creation
  conntrack-tp: Use a cmap to store timeout policies
  conntrack: Inverse conn and ct lock precedence
  conntrack: Do not schedule zero ms timers
  conntrack: Do not rate limit ct-sweep
  conntrack: Do not log empty ct-sweep
  conntrack: Use an atomic conn expiration value

 lib/automake.mk   |   2 +
 lib/conntrack-private.h   |  97 +++--
 lib/conntrack-tp.c| 100 ++
 lib/conntrack.c   | 306 +++-
 lib/conntrack.h   |   4 +-
 lib/dpif-netdev.c |   5 +-
 lib/mpsc-queue.c  | 251 +
 lib/mpsc-queue.h  | 189 ++
 lib/ovs-atomic-c++.h  |   3 +
 lib/ovs-atomic-clang.h|   5 +
 lib/ovs-atomic-gcc4+.h|   5 +
 lib/ovs-atomic-gcc4.7+.h  |   5 +
 lib/ovs-atomic-i586.h |   5 +
 lib/ovs-atomic-locked.h   |   9 +
 lib/ovs-atomic-msvc.h |  22 ++
 lib/ovs-atomic-pthreads.h |   5 +
 lib/ovs-atomic-x86_64.h   |   5 +
 lib/ovs-atomic.h  |   8 +-
 tests/automake.mk |   1 +
 tests/library.at  |   5 +
 tests/test-mpsc-queue.c   | 727 ++
 21 files changed, 1573 insertions(+), 186 deletions(-)
 create mode 100644 lib/mpsc-queue.c
 create mode 100644 lib/mpsc-queue.h
 create mode 100644 tests/test-mpsc-queue.c

--
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 14/28] dpif-netdev: Quiesce offload thread periodically

2021-04-21 Thread Maxime Coquelin



On 4/12/21 5:20 PM, Gaetan Rivet wrote:
> Similar to what was done for the PMD threads [1], reduce the performance
> impact of quiescing too often in the offload thread.
> 
> After each processed offload, the offload thread currently quiesce and
> will sync with RCU. This synchronization can be lengthy and make the
> thread unnecessary slow.
> 
> Instead attempt to quiesce every 10 ms at most. While the queue is
> empty, the offload thread remains quiescent.
> 
> [1]: 81ac8b3b194c ("dpif-netdev: Do RCU synchronization at fixed interval
>  in PMD main loop.")
> 
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/dpif-netdev.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d458bcb12..4e403a9e5 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -2740,15 +2740,20 @@ err_free:
>  return -1;
>  }
>  
> +#define DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US (10 * 1000) /* 10 ms */
> +
>  static void *
>  dp_netdev_flow_offload_main(void *data OVS_UNUSED)
>  {
>  struct dp_offload_thread_item *offload;
>  struct ovs_list *list;
>  long long int latency_us;
> +long long int next_rcu;
> +long long int now;
>  const char *op;
>  int ret;
>  
> +next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
>  for (;;) {
>  ovs_mutex_lock(&dp_offload_thread.mutex);
>  if (ovs_list_is_empty(&dp_offload_thread.list)) {
> @@ -2756,6 +2761,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
>  ovs_mutex_cond_wait(&dp_offload_thread.cond,
>  &dp_offload_thread.mutex);
>  ovsrcu_quiesce_end();
> +next_rcu = time_usec() + DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;
>  }
>  list = ovs_list_pop_front(&dp_offload_thread.list);
>  dp_offload_thread.enqueued_item--;
> @@ -2779,7 +2785,9 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
>  OVS_NOT_REACHED();
>  }
>  
> -latency_us = time_usec() - offload->timestamp;
> +now = time_usec();
> +
> +latency_us = now - offload->timestamp;
>  mov_avg_cma_update(&dp_offload_thread.cma, latency_us);
>  mov_avg_ema_update(&dp_offload_thread.ema, latency_us);
>  
> @@ -2787,7 +2795,13 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
>   ret == 0 ? "succeed" : "failed", op,
>   UUID_ARGS((struct uuid *) &offload->flow->mega_ufid));
>  dp_netdev_free_flow_offload(offload);
> -ovsrcu_quiesce();
> +
> +/* Do RCU synchronization at fixed interval. */
> +if (now > next_rcu) {
> +if (!ovsrcu_try_quiesce()) {
> +next_rcu += DP_NETDEV_OFFLOAD_QUIESCE_INTERVAL_US;

Commit 81ac8b3b194c do it a bit differently, by using current time as a
base for calculating the next deadline while you use current deadline
(which is in the past) to calculate the next one.

Under heavy load, won't it just end up doing the RCU synchronization at
every it iteration, as next_rcu would always run behind?

> +}
> +}
>  }
>  
>  return NULL;
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 08/28] dpif-netdev: Implement hardware offloads stats query

2021-04-21 Thread Maxime Coquelin



On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> In the netdev datapath, keep track of the enqueued offloads between
> the PMDs and the offload thread.  Additionally, query each netdev
> for their hardware offload counters.
> 
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/dpif-netdev.c | 90 ++-
>  1 file changed, 89 insertions(+), 1 deletion(-)
> 

Reviewed-by: Maxime Coquelin 

Thanks,
Maxime


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 07/28] mov-avg: Add a moving average helper structure

2021-04-21 Thread Maxime Coquelin



On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> Add a new module offering a helper to compute the Cumulative
> Moving Average (CMA) and the Exponential Moving Average (EMA)
> of a series of values.
> 
> Use the new helpers to add latency metrics in dpif-netdev.
> 
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/automake.mk |   1 +
>  lib/mov-avg.h   | 171 
>  2 files changed, 172 insertions(+)
>  create mode 100644 lib/mov-avg.h
> 

Reviewed-by: Maxime Coquelin 

Thanks,
Maxime

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 06/28] dpif-netdev: Rename offload thread structure

2021-04-21 Thread Maxime Coquelin



On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> The offload management in userspace is done through a separate thread.
> The naming of the structure holding the objects used for synchronization
> with the dataplane is generic and nondescript.
> 
> Clarify the object function by renaming it.
> 
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/dpif-netdev.c | 52 +++
>  1 file changed, 26 insertions(+), 26 deletions(-)
> 

Reviewed-by: Maxime Coquelin 

Thanks,
Maxime

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3 ovn] ovn-sbctl: Prevent core dump from ovn-sbctl lflow-list [datpath] 0xflow

2021-04-21 Thread Alexey Roytman
From: Alexey Roytman 

When ovn-sbctl lflow-list gets lflow argument with 0x prefix, e.g. 0x8131c8a8,
it prints correct output, but fails with coredump.
For example:
ovn-sbctl --uuid lflow-list sw1 0x8131c8a8
  
Datapath: "sw1" (4b1e53d8-9f0f-4768-b4a6-6cbc58a4bfda)  Pipeline: egress
uuid=0x8131c8a8, table=10(ls_out_port_sec_l2 ), priority=100  ,
match=(eth.mcast), action=(output;)
free(): invalid pointer
[2]616553 abort (core dumped)  ovn-sbctl --uuid dump-flows sw1 0x8131c8a8
 
 This patch fixes it.

Signed-off-by: Alexey Roytman 
---
 utilities/ovn-sbctl.c | 28 +---
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/utilities/ovn-sbctl.c b/utilities/ovn-sbctl.c
index e3aa7a68e..99c112358 100644
--- a/utilities/ovn-sbctl.c
+++ b/utilities/ovn-sbctl.c
@@ -764,23 +764,28 @@ sbctl_lflow_cmp(const void *a_, const void *b_)
 return cmp ? cmp : strcmp(a->actions, b->actions);
 }
 
-static char *
+static bool
+is_uuid_with_prefix(const char *uuid)
+{
+ return uuid[0] == '0' && (uuid[1] == 'x' || uuid[1] == 'X');
+}
+
+static bool
 parse_partial_uuid(char *s)
 {
 /* Accept a full or partial UUID. */
 if (uuid_is_partial_string(s)) {
-return s;
+return true;
 }
 
 /* Accept a full or partial UUID prefixed by 0x, since "ovs-ofctl
  * dump-flows" prints cookies prefixed by 0x. */
-if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')
-&& uuid_is_partial_string(s + 2)) {
-return s + 2;
+if (is_uuid_with_prefix(s) && uuid_is_partial_string(s + 2)) {
+return true;
 }
 
 /* Not a (partial) UUID. */
-return NULL;
+return false;
 }
 
 static const char *
@@ -799,8 +804,11 @@ is_partial_uuid_match(const struct uuid *uuid, const char 
*match)
  * from UUIDs, and cookie values are printed without leading zeros because
  * they're just numbers. */
 const char *s1 = strip_leading_zero(uuid_s);
-const char *s2 = strip_leading_zero(match);
-
+const char *s2 = match;
+if (is_uuid_with_prefix(s2)) {
+s2 = s2 + 2;
+}
+s2 = strip_leading_zero(s2);
 return !strncmp(s1, s2, strlen(s2));
 }
 
@@ -1134,12 +1142,10 @@ cmd_lflow_list(struct ctl_context *ctx)
 }
 
 for (size_t i = 1; i < ctx->argc; i++) {
-char *s = parse_partial_uuid(ctx->argv[i]);
-if (!s) {
+if (!parse_partial_uuid(ctx->argv[i])) {
 ctl_fatal("%s is not a UUID or the beginning of a UUID",
   ctx->argv[i]);
 }
-ctx->argv[i] = s;
 }
 
 struct vconn *vconn = sbctl_open_vconn(&ctx->options);
-- 
2.25.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v5 5/5] northd: Flood ARPs to routers for "unreachable" addresses.

2021-04-21 Thread Numan Siddique
On Tue, Apr 20, 2021 at 11:57 AM Mark Michelson  wrote:
>
> Previously, ARP TPAs were filtered down only to "reachable" addresses.
> Reachable addresses are all router interface addresses, as well as NAT
> external addresses and load balancer VIPs that are within the subnet
> handled by a router's port.
>
> However, it is possible that in some configurations, CMSes purposely
> configure NAT or load balancer addresses on a router that are outside
> the router's subnets, and they expect the router to respond to ARPs for
> those addresses.
>
> This commit adds a higher priority flow to logical switches that makes
> it so ARPs targeted at "unreachable" addresses are flooded to all ports.
> This way, the ARPs can reach the router appropriately and receive a
> response.
>
> Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1929901
>
> Signed-off-by: Mark Michelson 

Hi Mark,

I would suggest to add 2 things

1.  Test cases in ovn-northd.at which checks for the expected logical flows
 to make sure that both the northd versions generate the same logical flows.
This will check any future regressions.

2.  Previous versions of your patch series had a system test. Can you please
 add that test case ?


> ---
>  northd/ovn-northd.8.xml |   8 ++
>  northd/ovn-northd.c | 158 +++-
>  northd/ovn_northd.dl|  98 -
>  3 files changed, 194 insertions(+), 70 deletions(-)
>
> diff --git a/northd/ovn-northd.8.xml b/northd/ovn-northd.8.xml
> index 8197aa513..ed8c44940 100644
> --- a/northd/ovn-northd.8.xml
> +++ b/northd/ovn-northd.8.xml
> @@ -1535,6 +1535,14 @@ output;
>  logical ports.
>
>
> +  
> +Priority-80 flows for each IP address/VIP/NAT address configured
> +outside its owning router port's subnet. These flows match ARP
> +requests and ND packets for the specific IP addresses.  Matched 
> packets
> +are forwarded only to the MC_FLOOD multicast group which
> +contains all connected logical ports.
> +  
> +

In the code I see priority-90 flows are getting added or updated. Does
the documentation need
to be updated ?

Thanks
Numan


>
>  Priority-75 flows for each port connected to a logical router
>  matching self originated ARP request/ND packets.  These packets
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index d36a3c29b..bfacb7464 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -6480,44 +6480,51 @@ build_lswitch_rport_arp_req_self_orig_flow(struct 
> ovn_port *op,
>  ds_destroy(&match);
>  }
>
> -/*
> - * Ingress table 19: Flows that forward ARP/ND requests only to the routers
> - * that own the addresses. Other ARP/ND packets are still flooded in the
> - * switching domain as regular broadcast.
> - */
>  static void
> -build_lswitch_rport_arp_req_flow_for_ip(struct sset *ips,
> -int addr_family,
> -struct ovn_port *patch_op,
> -struct ovn_datapath *od,
> -uint32_t priority,
> -struct hmap *lflows,
> -const struct ovsdb_idl_row 
> *stage_hint)
> +arp_nd_ns_match(struct sset *ips, int addr_family, struct ds *match)
>  {
> -struct ds match   = DS_EMPTY_INITIALIZER;
> -struct ds actions = DS_EMPTY_INITIALIZER;
>
>  /* Packets received from VXLAN tunnels have already been through the
>   * router pipeline so we should skip them. Normally this is done by the
>   * multicast_group implementation (VXLAN packets skip table 32 which
>   * delivers to patch ports) but we're bypassing multicast_groups.
>   */
> -ds_put_cstr(&match, FLAGBIT_NOT_VXLAN " && ");
> +ds_put_cstr(match, FLAGBIT_NOT_VXLAN " && ");
>
>  if (addr_family == AF_INET) {
> -ds_put_cstr(&match, "arp.op == 1 && arp.tpa == { ");
> +ds_put_cstr(match, "arp.op == 1 && arp.tpa == { ");
>  } else {
> -ds_put_cstr(&match, "nd_ns && nd.target == { ");
> +ds_put_cstr(match, "nd_ns && nd.target == { ");
>  }
>
>  const char *ip_address;
>  SSET_FOR_EACH (ip_address, ips) {
> -ds_put_format(&match, "%s, ", ip_address);
> +ds_put_format(match, "%s, ", ip_address);
>  }
>
> -ds_chomp(&match, ' ');
> -ds_chomp(&match, ',');
> -ds_put_cstr(&match, "}");
> +ds_chomp(match, ' ');
> +ds_chomp(match, ',');
> +ds_put_cstr(match, "}");
> +}
> +
> +/*
> + * Ingress table 19: Flows that forward ARP/ND requests only to the routers
> + * that own the addresses. Other ARP/ND packets are still flooded in the
> + * switching domain as regular broadcast.
> + */
> +static void
> +build_lswitch_rport_arp_req_flow_for_reachable_ip(struct sset *ips,
> +  int addr_family,
> 

Re: [ovs-dev] [PATCH ovn v5 4/5] pinctrl: Add Chassis MAC_Bindings for all router addresses.

2021-04-21 Thread Numan Siddique
On Tue, Apr 20, 2021 at 11:57 AM Mark Michelson  wrote:
>
> Previous behavior of ovn-controller was to only create MAC bindings for
> the same addresses for which we would send gARPs. That is:
>
> * A logical switch has its "nat_addresses" configured.
> * That logical switch has a localnet port.
>
> This makes sense for gARPs: we only want to send gARPs for addresses
> accessible from outside of OVN, and we only want to send gARPs for
> addresses explicitly enabled by the administrator.
>
> For directly adding MAC bindings, though, this makes less sense. In the
> case where a VM or pod on the underlay network wishes to reach another
> VM or pod via a DNAT or load balancer address, an ARP was required to
> learn the MAC binding. When the information is all available in the
> database, this seems like an unnecessary step.
>
> This commit makes the following changes:
> * MAC_Binding now has a "chassis_name" column. This is used to identify
>   the chassis that added the MAC Binding IF the MAC Binding was auto-
>   generated by ovn-controller. This is also used to ensure that stale
>   bindings can be cleaned up.
> * The term "local garp" has been replaced with "chassis MAC binding".
>   Removing the word "garp" helps to remove any implication of packets
>   being transmitted. The word "chassis" precedes "MAC binding" to
>   indicate it is a MAC binding added by a chassis and with a
>   chassis_name in its records. MAC bindings with no chassis_name were
>   added dynamically due to reception of ARP/gARP packets.
> * Chassis MAC Binding code has been separated from gARP code since they
>   do not operate using the same sets of addresses.
> * Chassis MAC Bindings are added for all router_addresses on a port
>   binding. This means all NAT, Load Balancer, and router interface
>   addresses are added for routers that are residents of the chassis.
>
> Signed-off-by: Mark Michelson 

Hi Mark,

Please see below for just a  comment.

Thanks
Numan


> ---
>  controller/ovn-controller.c |   4 +
>  controller/pinctrl.c| 298 +++-
>  controller/pinctrl.h|   1 +
>  northd/ovn-northd.c |   2 +-
>  northd/ovn_northd.dl|   5 +-
>  ovn-sb.ovsschema|   5 +-
>  ovn-sb.xml  |   7 +
>  tests/ovn-controller.at | 179 ++
>  8 files changed, 427 insertions(+), 74 deletions(-)
>
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index 16c8ecb21..872d68799 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -2507,6 +2507,9 @@ main(int argc, char *argv[])
>  = ovsdb_idl_index_create2(ovnsb_idl_loop.idl,
>&sbrec_fdb_col_mac,
>&sbrec_fdb_col_dp_key);
> +struct ovsdb_idl_index *sbrec_mac_binding_by_chassis
> += ovsdb_idl_index_create1(ovnsb_idl_loop.idl,
> +  &sbrec_mac_binding_col_chassis_name);
>
>  ovsdb_idl_track_add_all(ovnsb_idl_loop.idl);
>  ovsdb_idl_omit_alert(ovnsb_idl_loop.idl,
> @@ -2931,6 +2934,7 @@ main(int argc, char *argv[])
>  sbrec_port_binding_by_key,
>  sbrec_port_binding_by_name,
>  sbrec_mac_binding_by_lport_ip,
> +sbrec_mac_binding_by_chassis,
>  sbrec_igmp_group,
>  sbrec_ip_multicast,
>  sbrec_fdb_by_dp_key_mac,
> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
> index 523a45b9a..51327f370 100644
> --- a/controller/pinctrl.c
> +++ b/controller/pinctrl.c
> @@ -205,6 +205,7 @@ static void send_garp_rarp_prepare(
>  struct ovsdb_idl_index *sbrec_port_binding_by_datapath,
>  struct ovsdb_idl_index *sbrec_port_binding_by_name,
>  struct ovsdb_idl_index *sbrec_mac_binding_by_lport_ip,
> +struct ovsdb_idl_index *sbrec_mac_binding_by_chassis,
>  const struct ovsrec_bridge *,
>  const struct sbrec_chassis *,
>  const struct hmap *local_datapaths,
> @@ -3318,6 +3319,7 @@ pinctrl_run(struct ovsdb_idl_txn *ovnsb_idl_txn,
>  struct ovsdb_idl_index *sbrec_port_binding_by_key,
>  struct ovsdb_idl_index *sbrec_port_binding_by_name,
>  struct ovsdb_idl_index *sbrec_mac_binding_by_lport_ip,
> +struct ovsdb_idl_index *sbrec_mac_binding_by_chassis,
>  struct ovsdb_idl_index *sbrec_igmp_groups,
>  struct ovsdb_idl_index *sbrec_ip_multicast_opts,
>  struct ovsdb_idl_index *sbrec_fdb_by_dp_key_mac,
> @@ -3339,7 +3341,8 @@ pinctrl_run(struct ovsdb_idl_txn *ovnsb_idl_txn,
> sbrec_port_binding_by_key, chassis);
>  send_garp_rarp_prepare(ovnsb_idl_txn, sbrec_port_binding_by_datapath,
> sbrec_port_binding_by_name,
> -

Re: [ovs-dev] [PATCH v2 05/28] dpif-netdev: Rename flow offload thread

2021-04-21 Thread Maxime Coquelin



On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> ovs_strlcpy silently fails to copy the thread name if it is too long.
> Rename the flow offload thread to differentiate it from the main thread.
> 
> Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
> Signed-off-by: Gaetan Rivet 
> ---
>  lib/dpif-netdev.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 

If a new revision is needed, maybe move this fix at the beginning of the
series to avoid apply conflicts.

Other than that it looks good to me:

Reviewed-by: Maxime Coquelin 

Thanks,
Maxime

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v5 3/5] northd: Save all router addresses in Port_Bindings

2021-04-21 Thread Numan Siddique
On Tue, Apr 20, 2021 at 11:57 AM Mark Michelson  wrote:
>
> northd currently stores logical switch port's configured "nat-addresses"
> in the southbound Port_Binding "nat_addresses" column. This allows for
> explicit configuration of which NAT addresses should be broadcast in
> gARP messages by OVN.
>
> This adds a similar column, "router_addresses" to the Port_Binding. The
> difference is that this column contains all NAT, load balancer, and
> router interface addresses, no matter the northbound configuration.
>
> This column will be used in an upcoming commit.
>
> Signed-off-by: Mark Michelson 

Hi Mark,

Please see below for few comments.


> ---
>  northd/ovn-northd.c  | 222 +--
>  northd/ovn_northd.dl |  35 +--
>  ovn-sb.ovsschema |   5 +-
>  ovn-sb.xml   |  20 
>  tests/ovn-northd.at  | 206 +++
>  5 files changed, 408 insertions(+), 80 deletions(-)
>
> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> index d8ee65a5f..78ac696a5 100644
> --- a/northd/ovn-northd.c
> +++ b/northd/ovn-northd.c
> @@ -2486,7 +2486,7 @@ get_nat_addresses(const struct ovn_port *op, size_t *n)
>  {
>  size_t n_nats = 0;
>  struct eth_addr mac;
> -if (!op->nbrp || !op->od || !op->od->nbr
> +if (!op || !op->nbrp || !op->od || !op->od->nbr
>  || (!op->od->nbr->n_nat && !op->od->nbr->n_load_balancer)
>  || !eth_addr_from_string(op->nbrp->mac, &mac)) {
>  *n = n_nats;
> @@ -2878,6 +2878,132 @@ ovn_update_ipv6_prefix(struct hmap *ports)
>  }
>  }
>
> +static void
> +ovn_port_set_garp_addresses(const struct ovn_port *op, const char **nats,
> +size_t n_nats, const char *router_networks)
> +{
> +bool peer_is_gateway = op->peer
> +   && op->peer->od
> +   && op->peer->od->nbr
> +   && smap_get(&op->peer->od->nbr->options,
> +   "chassis");
> +
> +bool peer_has_dist_gw_port = op->peer
> + && op->peer->od
> + && op->peer->od->l3redirect_port;
> +

I think caller of this function can pass the 'peer_is_gateway' and
'peer_has_dist_gw_port' bools
since these values are stored  before calling this function.


> +const char *nat_addresses = smap_get(&op->nbsp->options,
> + "nat-addresses");
> +size_t n_sources = 0;
> +const char **source = NULL;
> +if (nat_addresses && (peer_is_gateway || peer_has_dist_gw_port)) {
> +if (!strcmp(nat_addresses, "router")) {
> +/* We size this to n_nats + 1 since it can at most include
> + * all NAT addresses plus the router's networks.
> + */
> +source = nats;
> +n_sources = n_nats;
> +/* Only accept manual specification of ethernet address
> + * followed by IPv4 addresses on type "l3gateway" ports. */
> +} else if (peer_is_gateway) {
> +struct lport_addresses laddrs;
> +if (!extract_lsp_addresses(nat_addresses, &laddrs)) {
> +static struct vlog_rate_limit rl =
> +VLOG_RATE_LIMIT_INIT(1, 1);
> +VLOG_WARN_RL(&rl, "Error extracting nat-addresses.");
> +} else {
> +destroy_lport_addresses(&laddrs);
> +/* Allocate enough space for the nat_addresses plus
> + * the router networks
> + */
> +source = &nat_addresses;
> +n_sources = 1;
> +}
> +}
> +}
> +
> +size_t n_garps = 0;
> +/* The "garps" array contains the addresses for which
> + * ovn-controller should send GARPs. This is controlled
> + * by the "nat-addresses" option on the logical switch.
> + *
> + * Size the garps array to one more than the number of NAT
> + * sources so that we can fit the NATs and the router
> + * networks.
> + */
> +char **garps = xcalloc(n_sources + 1, sizeof *garps);
> +for (n_garps = 0; n_garps < n_sources; n_garps++) {
> +garps[n_garps] = xstrdup(source[n_garps]);
> +}
> +
> +/* Add the router mac and IPv4 addresses to
> + * Port_Binding.nat_addresses so that GARP is sent for these
> + * IPs by the ovn-controller on which the distributed gateway
> + * router port resides if:
> + *
> + * -  op->peer has 'reside-on-redirect-chassis' set and the
> + *the logical router datapath has distributed router port.
> + *
> + * -  op->peer is distributed gateway router port.
> + *
> + * -  op->peer's router is a gateway router and op has a localnet
> + *port.
> + *
> + * Note: Port_Binding.nat_addresses column is also used for
> + * sending the GARPs for the router port IPs.
> + * */
> +bool add_router_port_garp = false;

Re: [ovs-dev] [PATCH v2 04/28] dpctl: Add function to read hardware offload statistics

2021-04-21 Thread Maxime Coquelin



On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> Expose a function to query datapath offload statistics.
> This function is separate from the current one in netdev-offload
> as it exposes more detailed statistics from the datapath, instead of
> only from the netdev-offload provider.
> 
> Each datapath is meant to use the custom counters as it sees fit for its
> handling of hardware offloads.
> 
> Call the new API from dpctl.
> 
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/dpctl.c | 36 
>  lib/dpif-netdev.c   |  1 +
>  lib/dpif-netlink.c  |  1 +
>  lib/dpif-provider.h |  7 +++
>  lib/dpif.c  |  8 
>  lib/dpif.h  |  9 +
>  6 files changed, 62 insertions(+)
> 


Reviewed-by: Maxime Coquelin 

Thanks,
Maxime

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 ovn] ovn-sbctl: Prevent core dump from ovn-sbctl lflow-list [datpath] 0xflow

2021-04-21 Thread Mark Michelson

Hi Alexey, nice find! I have a comment down below:

On 4/20/21 6:22 PM, Alexey Roytman wrote:

From: Alexey Roytman 

When ovn-sbctl lflow-list gets lflow argument with 0x prefix, e.g. 0x8131c8a8,
it prints correct output, but fails with coredump.
For example:
ovn-sbctl --uuid lflow-list sw1 0x8131c8a8
 
Datapath: "sw1" (4b1e53d8-9f0f-4768-b4a6-6cbc58a4bfda)  Pipeline: egress

   uuid=0x8131c8a8, table=10(ls_out_port_sec_l2 ), priority=100  ,
match=(eth.mcast), action=(output;)
free(): invalid pointer
[2]616553 abort (core dumped)  ovn-sbctl --uuid dump-flows sw1 0x8131c8a8

This patch fixes it.

Signed-off-by: Alexey Roytman 
---
  utilities/ovn-sbctl.c | 17 -
  1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/utilities/ovn-sbctl.c b/utilities/ovn-sbctl.c
index e3aa7a68e..099d31f08 100644
--- a/utilities/ovn-sbctl.c
+++ b/utilities/ovn-sbctl.c
@@ -764,6 +764,12 @@ sbctl_lflow_cmp(const void *a_, const void *b_)
  return cmp ? cmp : strcmp(a->actions, b->actions);
  }
  
+static bool

+is_uuid_with_prefix(const char *uuid)
+{
+ return uuid[0] == '0' && (uuid[1] == 'x' || uuid[1] == 'X');
+}
+
  static char *
  parse_partial_uuid(char *s)
  {
@@ -774,8 +780,7 @@ parse_partial_uuid(char *s)
  
  /* Accept a full or partial UUID prefixed by 0x, since "ovs-ofctl

   * dump-flows" prints cookies prefixed by 0x. */
-if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')
-&& uuid_is_partial_string(s + 2)) {
+if (is_uuid_with_prefix(s) && uuid_is_partial_string(s + 2)) {
  return s + 2;
  }
  
@@ -799,8 +804,11 @@ is_partial_uuid_match(const struct uuid *uuid, const char *match)

   * from UUIDs, and cookie values are printed without leading zeros because
   * they're just numbers. */
  const char *s1 = strip_leading_zero(uuid_s);
-const char *s2 = strip_leading_zero(match);
-
+const char *s2 = match;
+if (is_uuid_with_prefix(s2)) {
+s2 = s2 + 2;
+}
+s2 = strip_leading_zero(s2);
  return !strncmp(s1, s2, strlen(s2));
  }
  
@@ -1139,7 +1147,6 @@ cmd_lflow_list(struct ctl_context *ctx)

  ctl_fatal("%s is not a UUID or the beginning of a UUID",
ctx->argv[i]);
  }
-ctx->argv[i] = s;


Since you no longer assign s, this section could be re-written to get 
rid of s:


if (!parse_partial_uuid(ctx->argv[i])) {
ctl_fatal("%s is not a UUID or the beginning of a UUID", 
ctx->argv[i];

}

This then snowballs to the parse_partial_uuid() function. The reason the 
crash was happening was because parse_partial_uuid() was returning a 
pointer to the middle of an allocated string. We could increase the 
safety of parse_partial_uuid() by having it return a boolean instead of 
a dangerous pointer. This way, there is no chance of misusing the string 
it returns since it no longer will return a string.



  }
  
  struct vconn *vconn = sbctl_open_vconn(&ctx->options);




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2] conntrack: document NULL SNAT behavior and add a test case

2021-04-21 Thread Ilya Maximets
On 3/30/21 2:25 PM, Eelco Chaudron wrote:
> Currently, conntrack in the kernel has an undocumented feature referred
> to as NULL SNAT. Basically, when a source port collision is detected
> during the commit, the source port will be translated to an ephemeral
> port. If there is no collision, no SNAT is performed.
> 
> This patchset documents this behavior and adds a self-test to verify
> it's not changing.
> 
> Signed-off-by: Eelco Chaudron 
> ---
> v2: Fixed NULL SNAT to only work in the -rpl state to be inline with
> OpenShift-SDN's behavior.
> 
>  lib/ovs-actions.xml  |   10 
>  tests/system-kmod-macros.at  |7 ++
>  tests/system-traffic.at  |   46 
> ++
>  tests/system-userspace-macros.at |   10 
>  4 files changed, 73 insertions(+)

As per comment from Ben for v2 of Paolo's patch, I think we need
to avoid usage of NULL in this patch too:

  
https://patchwork.ozlabs.org/project/openvswitch/patch/161721063438.355752.4375787531104430414.st...@fed.void/

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH net] openvswitch: meter: remove rate from the bucket size calculation

2021-04-21 Thread Ilya Maximets
Implementation of meters supposed to be a classic token bucket with 2
typical parameters: rate and burst size.

Burst size in this schema is the maximum number of bytes/packets that
could pass without being rate limited.

Recent changes to userspace datapath made meter implementation to be
in line with the kernel one, and this uncovered several issues.

The main problem is that maximum bucket size for unknown reason
accounts not only burst size, but also the numerical value of rate.
This creates a lot of confusion around behavior of meters.

For example, if rate is configured as 1000 pps and burst size set to 1,
this should mean that meter will tolerate bursts of 1 packet at most,
i.e. not a single packet above the rate should pass the meter.
However, current implementation calculates maximum bucket size as
(rate + burst size), so the effective bucket size will be 1001.  This
means that first 1000 packets will not be rate limited and average
rate might be twice as high as the configured rate.  This also makes
it practically impossible to configure meter that will have burst size
lower than the rate, which might be a desirable configuration if the
rate is high.

Inability to configure low values of a burst size and overall inability
for a user to predict what will be a maximum and average rate from the
configured parameters of a meter without looking at the OVS and kernel
code might be also classified as a security issue, because drop meters
are frequently used as a way of protection from DoS attacks.

This change removes rate from the calculation of a bucket size, making
it in line with the classic token bucket algorithm and essentially
making the rate and burst tolerance being predictable from a users'
perspective.

Same change proposed for the userspace implementation.

Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
Signed-off-by: Ilya Maximets 
---

The same patch for the userspace datapath:
  
https://patchwork.ozlabs.org/project/openvswitch/patch/20210421134816.311584-1-i.maxim...@ovn.org/

 net/openvswitch/meter.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
index 15424d26e85d..96b524ceabca 100644
--- a/net/openvswitch/meter.c
+++ b/net/openvswitch/meter.c
@@ -392,7 +392,7 @@ static struct dp_meter *dp_meter_create(struct nlattr **a)
 *
 * Start with a full bucket.
 */
-   band->bucket = (band->burst_size + band->rate) * 1000ULL;
+   band->bucket = band->burst_size * 1000ULL;
band_max_delta_t = div_u64(band->bucket, band->rate);
if (band_max_delta_t > meter->max_delta_t)
meter->max_delta_t = band_max_delta_t;
@@ -641,7 +641,7 @@ bool ovs_meter_execute(struct datapath *dp, struct sk_buff 
*skb,
long long int max_bucket_size;
 
band = &meter->bands[i];
-   max_bucket_size = (band->burst_size + band->rate) * 1000LL;
+   max_bucket_size = band->burst_size * 1000LL;
 
band->bucket += delta_ms * band->rate;
if (band->bucket > max_bucket_size)
-- 
2.26.3

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] dpif-netdev: Remove meter rate from the bucket size calculation.

2021-04-21 Thread Ilya Maximets
Implementation of meters supposed to be a classic token bucket with 2
typical parameters: rate and burst size.

Burst size in this schema is the maximum number of bytes/packets that
could pass without being rate limited.

Recent changes to userspace datapath made meter implementation to be
in line with the kernel one, and this uncovered several issues.

The main problem is that maximum bucket size for unknown reason
accounts not only burst size, but also the numerical value of rate.
This creates a lot of confusion around behavior of meters.

For example, if rate is configured as 1000 pps and burst size set to 1,
this should mean that meter will tolerate bursts of 1 packet at most,
i.e. not a single packet above the rate should pass the meter.
However, current implementation calculates maximum bucket size as
(rate + burst size), so the effective bucket size will be 1001.  This
means that first 1000 packets will not be rate limited and average
rate might be twice as high as the configured rate.  This also makes
it practically impossible to configure meter that will have burst size
lower than the rate, which might be a desirable configuration if the
rate is high.

Inability to configure low values of a burst size and overall inability
for a user to predict what will be a maximum and average rate from the
configured parameters of a meter without looking at the OVS and kernel
code might be also classified as a security issue, because drop meters
are frequently used as a way of protection from DoS attacks.

This change removes rate from the calculation of a bucket size, making
it in line with the classic token bucket algorithm and essentially
making the rate and burst tolerance being predictable from a users'
perspective.

Same change will be proposed for the kernel implementation.
Unit tests changed back to their correct version and enhanced.

Signed-off-by: Ilya Maximets 
---
 lib/dpif-netdev.c |  5 ++--
 tests/dpif-netdev.at  | 53 +++
 tests/ofproto-dpif.at |  2 +-
 3 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 251788b04..650e67ab3 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -6229,7 +6229,7 @@ dp_netdev_run_meter(struct dp_netdev *dp, struct 
dp_packet_batch *packets_,
 uint64_t max_bucket_size;
 
 band = &meter->bands[m];
-max_bucket_size = (band->rate + band->burst_size) * 1000ULL;
+max_bucket_size = band->burst_size * 1000ULL;
 /* Update band's bucket. */
 band->bucket += (uint64_t) delta_t * band->rate;
 if (band->bucket > max_bucket_size) {
@@ -6357,8 +6357,7 @@ dpif_netdev_meter_set(struct dpif *dpif, ofproto_meter_id 
meter_id,
 meter->bands[i].rate = config->bands[i].rate;
 meter->bands[i].burst_size = config->bands[i].burst_size;
 /* Start with a full bucket. */
-meter->bands[i].bucket =
-(meter->bands[i].burst_size + meter->bands[i].rate) * 1000ULL;
+meter->bands[i].bucket = meter->bands[i].burst_size * 1000ULL;
 
 /* Figure out max delta_t that is enough to fill any bucket. */
 band_max_delta_t
diff --git a/tests/dpif-netdev.at b/tests/dpif-netdev.at
index 57cae383f..16402ebae 100644
--- a/tests/dpif-netdev.at
+++ b/tests/dpif-netdev.at
@@ -314,21 +314,21 @@ done
 sleep 1  # wait for forwarders process packets
 
 # Meter 1 is measuring packets, allowing one packet per second with
-# bursts of one packet, so 3 out of 5 packets should hit the drop
-# band.
-# Meter 2 is measuring kbps, with burst size 2 (== 3000 bits). 6 packets
-# (360 bytes == 2880 bits) pass, but the last packet should hit the drop band.
+# bursts of one packet, so 4 out of 5 packets should hit the drop band.
+# Meter 2 is measuring kbps, with burst size 2 (== 2000 bits). 4 packets
+# (240 bytes == 1920 bits) pass, but the last three packets should hit the
+# drop band.  There should be 80 bits remaining for the next packets.
 AT_CHECK([ovs-ofctl -O OpenFlow13 meter-stats br0 | strip_timers], [0], [dnl
 OFPST_METER reply (OF1.3) (xid=0x2):
 meter:1 flow_count:1 packet_in_count:5 byte_in_count:300 duration:0.0s bands:
-0: packet_count:3 byte_count:180
+0: packet_count:4 byte_count:240
 
 meter:2 flow_count:1 packet_in_count:7 byte_in_count:420 duration:0.0s bands:
-0: packet_count:1 byte_count:60
+0: packet_count:3 byte_count:180
 ])
 
-# Advance time by 1/2 second
-ovs-appctl time/warp 500
+# Advance time by 870 ms
+ovs-appctl time/warp 870
 
 for i in `seq 1 5`; do
   AT_CHECK(
@@ -345,16 +345,41 @@ sleep 1  # wait for forwarders process packets
 # Meter 1 is measuring packets, allowing one packet per second with
 # bursts of one packet, so all 5 of the new packets should hit the drop
 # band.
-# Meter 2 is measuring kbps, with burst size 2 (== 3000 bits). After 500ms
-# there should be space for 120 + 500 bits, so one new 60 byte (480 bit) packet
-# should pass, remaining 4 should hit the

Re: [ovs-dev] [PATCH ovn] ovn-controller-vtep: Set chassis_name for newly created Encap.

2021-04-21 Thread Numan Siddique
On Wed, Apr 21, 2021 at 3:43 AM Dumitru Ceara  wrote:
>
> On 4/20/21 8:55 PM, Odintsov Vladislav wrote:
> > Hi,
> >
> > Thanks for the quick reaction!
> > I've tested your patch against 20.06.3 and it works well for me.
> > May I ask you to backport your patch down to supported branches?
> >
> > Tested-by: Vladislav Odintsov 
>
> Thanks for trying it out, Vladislav, and thanks for the review, Mark!
>
> As this is a bug fix and it applies cleanly to all supported branches,
> when it's merged I'm pretty sure the maintainers can backport it
> directly there.

Thanks Dumitru, Vladislav and Mark.

I applied this patch to the main branch and backported to branch-21.03
and all the way down to branch-20.03.

Thanks

Numan

>
> Regards,
> Dumitru
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v4] conntrack: handle SNAT with all-zero IP address

2021-04-21 Thread Aaron Conole
Paolo Valerio  writes:

> this patch introduces for the userspace datapath the handling
> of rules like the following:
>
> ct(commit,nat(src=0.0.0.0),...)
>
> Kernel datapath already handle this case that is particularly
> handy in scenarios like the following:
>
> Given A: 10.1.1.1, B: 192.168.2.100, C: 10.1.1.2
>
> A opens a connection toward B on port 80 selecting as source port 1.
> B's IP gets dnat'ed to C's IP (10.1.1.1:1 -> 192.168.2.100:80).
>
> This will result in:
>
> tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=1,dport=80),reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=1),protoinfo=(state=ESTABLISHED)
>
> A now tries to establish another connection with C using source port
> 1, this time using C's IP address (10.1.1.1:1 -> 10.1.1.2:80).
>
> This second connection, if processed by conntrack with no SNAT/DNAT
> involved, collides with the reverse tuple of the first connection,
> so the entry for this valid connection doesn't get created.
>
> With this commit, and adding a SNAT rule with 0.0.0.0 for
> 10.1.1.1:1 -> 10.1.1.2:80 will allow to create the conn entry:
>
> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=1,dport=80),reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=10001),protoinfo=(state=ESTABLISHED)
> tcp,orig=(src=10.1.1.1,dst=192.168.2.100,sport=1,dport=80),reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=1),protoinfo=(state=ESTABLISHED)
>
> The issue exists even in the opposite case (with A trying to connect
> to C using B's IP after establishing a direct connection from A to C).
>
> This commit refactors the relevant function in a way that both of the
> previously mentioned cases are handled as well.
>
> Suggested-by: Eelco Chaudron 
> Signed-off-by: Paolo Valerio 
> ---

I think this needs an update to ovs-actions.xml to remove:

Note that this is currently only implemented in the Linux kernel
datapath.

Maybe also a NEWS entry to document that support has been added for
ephemeral port SNAT.

> v2: enable NULL SNAT self-test also for userspace.
> v3: replace NULL with all-zero in the commit message
> v4: no code changes, just restore some removed new line
>
> Note for the maintainers:
> the patch depends on the following
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/161710710690.181407.5749135681436588686.stgit@ebuild/
>
>  lib/conntrack.c  |  335 
> --
>  lib/conntrack.h  |   15 ++
>  tests/system-userspace-macros.at |7 -
>  3 files changed, 228 insertions(+), 129 deletions(-)
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 99198a601..885c81cbf 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -108,8 +108,8 @@ static void set_label(struct dp_packet *, struct conn *,
>  static void *clean_thread_main(void *f_);
>  
>  static bool
> -nat_select_range_tuple(struct conntrack *ct, const struct conn *conn,
> -   struct conn *nat_conn);
> +nat_get_unique_tuple(struct conntrack *ct, const struct conn *conn,
> + struct conn *nat_conn);
>  
>  static uint8_t
>  reverse_icmp_type(uint8_t type);
> @@ -728,11 +728,11 @@ pat_packet(struct dp_packet *pkt, const struct conn 
> *conn)
>  }
>  } else if (conn->nat_info->nat_action & NAT_ACTION_DST) {
>  if (conn->key.nw_proto == IPPROTO_TCP) {
> -struct tcp_header *th = dp_packet_l4(pkt);
> -packet_set_tcp_port(pkt, th->tcp_src, conn->rev_key.src.port);
> +packet_set_tcp_port(pkt, conn->rev_key.dst.port,
> +conn->rev_key.src.port);
>  } else if (conn->key.nw_proto == IPPROTO_UDP) {
> -struct udp_header *uh = dp_packet_l4(pkt);
> -packet_set_udp_port(pkt, uh->udp_src, conn->rev_key.src.port);
> +packet_set_udp_port(pkt, conn->rev_key.dst.port,
> +conn->rev_key.src.port);
>  }
>  }
>  }
> @@ -786,11 +786,9 @@ un_pat_packet(struct dp_packet *pkt, const struct conn 
> *conn)
>  }
>  } else if (conn->nat_info->nat_action & NAT_ACTION_DST) {
>  if (conn->key.nw_proto == IPPROTO_TCP) {
> -struct tcp_header *th = dp_packet_l4(pkt);
> -packet_set_tcp_port(pkt, conn->key.dst.port, th->tcp_dst);
> +packet_set_tcp_port(pkt, conn->key.dst.port, conn->key.src.port);
>  } else if (conn->key.nw_proto == IPPROTO_UDP) {
> -struct udp_header *uh = dp_packet_l4(pkt);
> -packet_set_udp_port(pkt, conn->key.dst.port, uh->udp_dst);
> +packet_set_udp_port(pkt, conn->key.dst.port, conn->key.src.port);
>  }
>  }
>  }
> @@ -810,12 +808,10 @@ reverse_pat_packet(struct dp_packet *pkt, const struct 
> conn *conn)
>  }
>  } else if (conn->nat_info->nat_action & NAT_ACTION_DST) {
>  if (conn->key.nw_proto == IPPROTO_TCP) {
> -struct tcp_header *th_in = dp_packet_l4(pkt);
> -

Re: [ovs-dev] [PATCH v2] conntrack: document NULL SNAT behavior and add a test case

2021-04-21 Thread Aaron Conole
Eelco Chaudron  writes:

> Currently, conntrack in the kernel has an undocumented feature referred
> to as NULL SNAT. Basically, when a source port collision is detected
> during the commit, the source port will be translated to an ephemeral
> port. If there is no collision, no SNAT is performed.
>
> This patchset documents this behavior and adds a self-test to verify
> it's not changing.
>
> Signed-off-by: Eelco Chaudron 
> ---
> v2: Fixed NULL SNAT to only work in the -rpl state to be inline with
> OpenShift-SDN's behavior.

Acked-by: Aaron Conole 

>  lib/ovs-actions.xml  |   10 
>  tests/system-kmod-macros.at  |7 ++
>  tests/system-traffic.at  |   46 
> ++
>  tests/system-userspace-macros.at |   10 
>  4 files changed, 73 insertions(+)
>
> diff --git a/lib/ovs-actions.xml b/lib/ovs-actions.xml
> index a2778de4b..a0070e6c6 100644
> --- a/lib/ovs-actions.xml
> +++ b/lib/ovs-actions.xml
> @@ -1833,6 +1833,16 @@ for i in [1,n_members]:
>  connection, will behave the same as a bare nat.
>
>  
> +  
> +For SNAT, there is a special case when the src IP
> +address is configured as all 0's, i.e.,
> +nat(src=0.0.0.0). In this case, when a source port
> +collision is detected during the commit, the source port will be
> +translated to an ephemeral port. If there is no collision, no 
> SNAT
> +is performed. Note that this is currently only implemented in the
> +Linux kernel datapath.
> +  
> +
>
>  Open vSwitch 2.6 introduced nat.  Linux 4.6 was the
>  earliest upstream kernel that implemented ct 
> support for
> diff --git a/tests/system-kmod-macros.at b/tests/system-kmod-macros.at
> index 15628a7c6..38bb1c55c 100644
> --- a/tests/system-kmod-macros.at
> +++ b/tests/system-kmod-macros.at
> @@ -99,6 +99,13 @@ m4_define([CHECK_CONNTRACK_FRAG_OVERLAP],
>  #
>  m4_define([CHECK_CONNTRACK_NAT])
>  
> +# CHECK_CONNTRACK_NULL_SNAT()
> +#
> +# Perform requirements checks for running conntrack SNAT NULL tests.
> +# The kernel always supports NULL SNAT, so no check is needed.
> +#
> +m4_define([CHECK_CONNTRACK_NULL_SNAT])
> +
>  # CHECK_CONNTRACK_TIMEOUT()
>  #
>  # Perform requirements checks for running conntrack customized timeout tests.
> diff --git a/tests/system-traffic.at b/tests/system-traffic.at
> index fb5b9a36d..99fba3a6e 100644
> --- a/tests/system-traffic.at
> +++ b/tests/system-traffic.at
> @@ -4433,6 +4433,52 @@ 
> tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=,dport=),reply=(src=
>  OVS_TRAFFIC_VSWITCHD_STOP
>  AT_CLEANUP
>  
> +
> +AT_SETUP([conntrack - NULL SNAT])
> +AT_SKIP_IF([test $HAVE_NC = no])
> +CHECK_CONNTRACK()
> +CHECK_CONNTRACK_NULL_SNAT()
> +OVS_TRAFFIC_VSWITCHD_START()
> +
> +ADD_NAMESPACES(at_ns0, at_ns1)
> +ADD_VETH(p0, at_ns0, br0, "10.1.1.1/24")
> +ADD_VETH(p1, at_ns1, br0, "10.1.1.2/24")
> +NS_CHECK_EXEC([at_ns0], [ip route add 172.1.1.0/24 via 10.1.1.2])
> +
> +OVS_START_L7([at_ns1], [http])
> +
> +AT_DATA([flows.txt], [dnl
> +table=0,priority=30,ct_state=-trk,ip,action=ct(table=0)
> +table=0,priority=20,ct_state=-rpl,ip,nw_dst=10.1.1.0/24,actions=ct(commit,nat(src=0.0.0.0),table=10)
> +table=0,priority=20,ct_state=+rpl,ip,nw_dst=10.1.1.0/24,actions=resubmit(,10)
> +table=0,priority=20,ip,nw_dst=172.1.1.2,actions=ct(commit,nat(dst=10.1.1.2),table=10)
> +table=0,priority=10,arp,action=normal
> +table=0,priority=1,action=drop
> +table=10,priority=20,ct_state=+rpl,ip,nw_dst=10.1.1.0/24 
> actions=ct(table=20,nat)
> +table=10,priority=10,ip,nw_dst=10.1.1.0/24 actions=resubmit(,20)
> +table=20,priority=10,ip,nw_dst=10.1.1.1,action=1
> +table=20,priority=10,ip,nw_dst=10.1.1.2,action=2
> +])
> +AT_CHECK([ovs-ofctl --bundle add-flows br0 flows.txt])
> +
> +dnl - Test to make sure src nat is NOT done when not needed
> +NS_CHECK_EXEC([at_ns0], [echo "TEST" | nc -p 3 10.1.1.2 80 > nc-1.log])
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep "orig=.src=10\.1\.1\.1,"], 
> [0], [dnl
> +tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=3,dport=80),reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=3),protoinfo=(state=TIME_WAIT)
> +])
> +
> +dnl - Test to make sure src nat is done when needed
> +NS_CHECK_EXEC([at_ns0], [echo "TEST2" | nc -p 30001 172.1.1.2 80 > nc-2.log])
> +NS_CHECK_EXEC([at_ns0], [echo "TEST3" | nc -p 30001 10.1.1.2 80 > nc-3.log])
> +AT_CHECK([ovs-appctl dpctl/dump-conntrack | grep 30001 | grep 
> "orig=.src=10\.1\.1\.1," | sed -e 's/port=30001/port=/g' -e 
> 's/sport=80,dport=[[0-9]]\+/sport=80,dport=/g' | sort], [0], [dnl
> +tcp,orig=(src=10.1.1.1,dst=10.1.1.2,sport=,dport=80),reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=),protoinfo=(state=TIME_WAIT)
> +tcp,orig=(src=10.1.1.1,dst=172.1.1.2,sport=,dport=80),reply=(src=10.1.1.2,dst=10.1.1.1,sport=80,dport=),protoinfo=(state=TIME_WAIT)
> +])
> +
> +OVS_TRAFFIC_VSWITCHD_STOP
> +AT_CLEANUP

Re: [ovs-dev] [PATCH] ofproto/ofproto-dpif-sflow: Check sflow agent in case of race

2021-04-21 Thread Simon Horman
On Mon, Apr 19, 2021 at 10:26:04AM +0300, Gavin Li wrote:
> sFlow agent may not exist while calling dpif_sflow_received. The sFlow
> may be deleted in another thread, eg, by user, which will cause crash.
> 
> Issue: 2576607
> Change-Id: Iad8202be1d06cda7d096109a2719ef934d245248
> Signed-off-by: Gavin Li 
> Reviewed-by: Gavi Teitz 

Thanks, pushed to master, branch-2.15, branch-2.14 and branch-2.13.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 03/28] netdev-offload-dpdk: Implement hw-offload statistics read

2021-04-21 Thread Maxime Coquelin



On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> In the DPDK offload provider, keep track of inserted rte_flow and report
> it when queried.
> 
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/netdev-offload-dpdk.c | 31 +++
>  1 file changed, 31 insertions(+)


Reviewed-by: Maxime Coquelin 

Thanks,
Maxime

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 02/28] netdev-offload-dpdk: Use per-netdev offload metadata

2021-04-21 Thread Maxime Coquelin



On 4/12/21 5:19 PM, Gaetan Rivet wrote:
> Add a per-netdev offload data field as part of netdev hw_info structure.
> Use this field in netdev-offload-dpdk to map offload metadata (ufid to
> rte_flow). Use flow API deinit ops to destroy the per-netdev metadata
> when deallocating a netdev. Use RCU primitives to ensure coherency
> during port deletion.
> 
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/netdev-offload-dpdk.c | 131 --
>  lib/netdev-offload.h  |   2 +
>  2 files changed, 114 insertions(+), 19 deletions(-)


Reviewed-by: Maxime Coquelin 

Thanks,
Maxime

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] tests: Run tests with Datapath Groups enabled/disabled.

2021-04-21 Thread Numan Siddique
On Tue, Apr 20, 2021 at 2:47 PM Mark Michelson  wrote:
>
> Acked-by: Mark Michelson 
>

Thanks Dumitru (and Mark). I applied the patch to the main branch.

Numan

> On 4/14/21 6:09 AM, Dumitru Ceara wrote:
> > This doubles the size of the test matrix but should help protecting
> > against regressions.
> >
> > This change is inspired from Ilya's work below but instead of selectively
> > running tests with Datapath Groups if an external variable is set, it
> > always runs both sets of tests expanding the OVN_FOR_EACH_NORTHD
> > implementation:
> > https://github.com/igsilya/ovn/commit/a5eef74cc374f04bb6a12545ceb504108e724135
> >
> > Signed-off-by: Dumitru Ceara 
> > ---
> > Note:
> > - If accepted this should probably be pushed after the following
> > patch is accepted:
> > http://patchwork.ozlabs.org/project/ovn/patch/20210412124009.20468-1-dce...@redhat.com/
>
> This patch is merged.
>
> > - It might turn out during review that Ilya's original approach is
> >preferable.
> > - A sample CI run for reference wrt. additional runtime can be found
> >here (the failure is due to a known flaky test):
> >https://github.com/dceara/ovn/actions/runs/747763765
> > ---
> >   tests/ovn-macros.at | 10 ++
> >   tests/ovn-northd.at | 10 ++
> >   tests/ovs-macros.at |  4 +++-
> >   3 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/tests/ovn-macros.at b/tests/ovn-macros.at
> > index 25f3dbe34..3f432cef5 100644
> > --- a/tests/ovn-macros.at
> > +++ b/tests/ovn-macros.at
> > @@ -232,6 +232,10 @@ ovn_start () {
> >  --ic-nb-db=unix:"$ovs_base"/ovn-ic-nb/ovn-ic-nb.sock \
> >  --ic-sb-db=unix:"$ovs_base"/ovn-ic-sb/ovn-ic-sb.sock
> >   fi
> > +
> > +if test X$NORTHD_USE_DP_GROUPS = Xyes; then
> > +ovn-nbctl set NB_Global . options:use_logical_dp_groups=true
> > +fi
> >   }
> >
> >   # Interconnection networks.
> > @@ -493,9 +497,15 @@ m4_define([OVN_POPULATE_ARP], 
> > [AT_CHECK(ovn_populate_arp__, [0], [ignore])])
> >
> >   m4_define([OVN_FOR_EACH_NORTHD], [dnl
> >   m4_pushdef([NORTHD_TYPE], [ovn-northd])dnl
> > +m4_pushdef([NORTHD_USE_DP_GROUPS], [yes])dnl
> > +$1
> > +m4_popdef([NORTHD_USE_DP_GROUPS])dnl
> >   $1
> >   m4_popdef([NORTHD_TYPE])dnl
> >   m4_pushdef([NORTHD_TYPE], [ovn-northd-ddlog])dnl
> > +m4_pushdef([NORTHD_USE_DP_GROUPS], [yes])dnl
> > +$1
> > +m4_popdef([NORTHD_USE_DP_GROUPS])dnl
> >   $1
> >   m4_popdef([NORTHD_TYPE])dnl
> >   ])
> > diff --git a/tests/ovn-northd.at b/tests/ovn-northd.at
> > index eef360802..5548ae9fd 100644
> > --- a/tests/ovn-northd.at
> > +++ b/tests/ovn-northd.at
> > @@ -736,6 +736,11 @@ AT_CLEANUP
> >
> >   OVN_FOR_EACH_NORTHD([
> >   AT_SETUP([ovn -- northbound database reconnection])
> > +
> > +dnl This test doesn't take into account flows that are shared between
> > +dnl datapaths when datapath groups are enabled.
> > +AT_SKIP_IF([test NORTHD_USE_DP_GROUPS = yes])
> > +
> >   ovn_start --no-backup-northd
> >
> >   # Check that ovn-northd is active, by verifying that it creates and
> > @@ -771,6 +776,11 @@ AT_CLEANUP
> >
> >   OVN_FOR_EACH_NORTHD([
> >   AT_SETUP([ovn -- southbound database reconnection])
> > +
> > +dnl This test doesn't take into account flows that are shared between
> > +dnl datapaths when datapath groups are enabled.
> > +AT_SKIP_IF([test NORTHD_USE_DP_GROUPS = yes])
> > +
> >   ovn_start --no-backup-northd
> >
> >   # Check that ovn-northd is active, by verifying that it creates and
> > diff --git a/tests/ovs-macros.at b/tests/ovs-macros.at
> > index cd6d7488a..634c205a8 100644
> > --- a/tests/ovs-macros.at
> > +++ b/tests/ovs-macros.at
> > @@ -10,9 +10,11 @@ dnl   set it as a shell variable as well.
> >   dnl - Skip the test if it's for ovn-northd-ddlog but it didn't get built.
> >   m4_rename([AT_SETUP], [OVS_AT_SETUP])
> >   m4_define([AT_SETUP],
> > -  [OVS_AT_SETUP($@[]m4_ifdef([NORTHD_TYPE], [ -- NORTHD_TYPE]))
> > +  [OVS_AT_SETUP($@[]m4_ifdef([NORTHD_TYPE], [ -- 
> > NORTHD_TYPE])[]m4_ifdef([NORTHD_USE_DP_GROUPS], [ -- 
> > dp-groups=NORTHD_USE_DP_GROUPS]))
> >   m4_ifdef([NORTHD_TYPE], [[NORTHD_TYPE]=NORTHD_TYPE
> >   ])dnl
> > +m4_ifdef([NORTHD_USE_DP_GROUPS], 
> > [[NORTHD_USE_DP_GROUPS]=NORTHD_USE_DP_GROUPS
> > +])dnl
> >   m4_if(NORTHD_TYPE, [ovn-northd-ddlog], [dnl
> >   AT_SKIP_IF([test $TEST_DDLOG = no])
> >   ])dnl
> >
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2] ovn-nbctl: fails to delete and add lr-nat in trx

2021-04-21 Thread Aidan Shribman
yes. will add a test

On Wed, Apr 21, 2021, 10:51 Numan Siddique  wrote:

> On Thu, Apr 8, 2021 at 3:13 AM Aidan Shribman 
> wrote:
> >
> > The delete operation did not update the internal state cuasing the add
> > operation to wrongly believe that the lr-nat entry is still present and
> > thus can't be added.
> >
> > Below is the sequance is failed before and now passes:
> >
> > ***
> > echo "add switch"
> > ovn-nbctl ls-del ls0 2>/dev/null
> > ovn-nbctl ls-add ls0
> > ovn-nbctl ls-list
> > ovn-nbctl show ls0
> > ovn-nbctl lsp-del lsp0 2>/dev/null
> > ovn-nbctl lsp-add ls0 lsp0 00:00:00:00:00:01 10.0.0.0/24
> > ovn-nbctl lsp-list ls0
> >
> > echo "add router"
> > ovn-nbctl ls-del lr0 2>/dev/null
> > ovn-nbctl lr-add lr0
> > ovn-nbctl lr-list
> > ovn-nbctl show lr0
> >
> > echo "add nat"
> > ovn-nbctl lr-nat-del lr0 2>/dev/null
> > ovn-nbctl lr-nat-add lr0 dnat_and_snat 192.168.0.3 10.0.0.3 \
> > lsp0 f0:00:00:00:00:03
> > ovn-nbctl lr-nat-list lr0
> >
> > echo "FIXED: delete nat (all of type) + add nat"
> > ovn-nbctl lr-nat-del lr0 dnat_and_snat -- \
> > --may-exist lr-nat-add lr0 dnat_and_snat 192.168.0.3 10.0.0.3 \
> > lsp0 f0:00:00:00:00:03
> > ovn-nbctl lr-nat-list lr0
> >
> > echo "FIXED: delete nat (all of external_ip) + add"
> > ovn-nbctl lr-nat-del lr0 dnat_and_snat 192.168.0.3 -- \
> > --may-exist lr-nat-add lr0 dnat_and_snat 192.168.0.3 10.0.0.3 \
> > lsp0 f0:00:00:00:00:03
> > ovn-nbctl lr-nat-list lr0
> > ***
> >
> > Signed-off-by: Aidan Shribman 
> > Fixed: 1928226 (ovn-nbctl fails to delete/add lr-nat in trx)
>
> Hi Aidan,
>
> I haven't reviewed the code yet. But looking into the commit message,
> I think you can add a test case for this fix ? Can you please add the test
> case
> and submit v3 ?
>
> Thanks
> Numan
>
> > ---
> >  utilities/ovn-nbctl.c | 33 ++---
> >  1 file changed, 18 insertions(+), 15 deletions(-)
> >
> > diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
> > index 184058356..db1ac4f5b 100644
> > --- a/utilities/ovn-nbctl.c
> > +++ b/utilities/ovn-nbctl.c
> > @@ -4531,11 +4531,18 @@ nbctl_lr_nat_del(struct ctl_context *ctx)
> >
> >  if (ctx->argc == 3) {
> >  /*Deletes all NATs with the specified type. */
> > +struct nbrec_nat **keep_nat = xmalloc(lr->n_nat * sizeof
> *keep_nat);
> > +size_t n_keep_nat = 0;
> >  for (size_t i = 0; i < lr->n_nat; i++) {
> > -if (!strcmp(nat_type, lr->nat[i]->type)) {
> > -nbrec_logical_router_update_nat_delvalue(lr,
> lr->nat[i]);
> > +if (strcmp(nat_type, lr->nat[i]->type)) {
> > +keep_nat[n_keep_nat++] = lr->nat[i];
> >  }
> >  }
> > +if (n_keep_nat < lr->n_nat) {
> > +nbrec_logical_router_verify_nat(lr);
> > +nbrec_logical_router_set_nat(lr, keep_nat, n_keep_nat);
> > +}
> > +free(keep_nat);
> >  return;
> >  }
> >
> > @@ -4547,31 +4554,27 @@ nbctl_lr_nat_del(struct ctl_context *ctx)
> >
> >  int is_snat = !strcmp("snat", nat_type);
> >  /* Remove the matching NAT. */
> > +struct nbrec_nat **keep_nat = xmalloc(lr->n_nat * sizeof *keep_nat);
> > +size_t n_keep_nat = 0;
> >  for (size_t i = 0; i < lr->n_nat; i++) {
> >  struct nbrec_nat *nat = lr->nat[i];
> > -bool should_return = false;
> >  char *old_ip = normalize_prefix_str(is_snat
> >  ? nat->logical_ip
> >  : nat->external_ip);
> > -if (!old_ip) {
> > -continue;
> > -}
> > -if (!strcmp(nat_type, nat->type) && !strcmp(nat_ip, old_ip)) {
> > -nbrec_logical_router_update_nat_delvalue(lr, nat);
> > -should_return = true;
> > +if (!old_ip || strcmp(nat_type, nat->type) || strcmp(nat_ip,
> old_ip)) {
> > +keep_nat[n_keep_nat++] = lr->nat[i];
> >  }
> >  free(old_ip);
> > -if (should_return) {
> > -goto cleanup;
> > -}
> >  }
> >
> > -if (must_exist) {
> > +if (n_keep_nat < lr->n_nat) {
> > +nbrec_logical_router_verify_nat(lr);
> > +nbrec_logical_router_set_nat(lr, keep_nat, n_keep_nat);
> > +} else if (must_exist) {
> >  ctl_error(ctx, "no matching NAT with the type (%s) and %s (%s)",
> >nat_type, is_snat ? "logical_ip" : "external_ip",
> nat_ip);
> >  }
> > -
> > -cleanup:
> > +free(keep_nat);
> >  free(nat_ip);
> >  }
> >
> > --
> > 2.25.1
> >
> > ___
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] dpdk: Use DPDK 20.11.1 release

2021-04-21 Thread Hariprasad Govindharajan
Modify ci linux build script to use the latest DPDK stable release.
Modify Documentation to use the latest DPDK stable release 20.11.1
Update NEWS file to reflect the latest DPDK stable releases.
FAQ is updated to reflect the latest DPDK for each branch.

Signed-off-by: Hariprasad Govindharajan 
---
 .ci/linux-build.sh   | 2 +-
 Documentation/faq/releases.rst   | 6 +++---
 Documentation/intro/install/dpdk.rst | 8 
 NEWS | 3 +++
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 9774493..0210d6a 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -201,7 +201,7 @@ fi
 
 if [ "$DPDK" ] || [ "$DPDK_SHARED" ]; then
 if [ -z "$DPDK_VER" ]; then
-DPDK_VER="20.11"
+DPDK_VER="20.11.1"
 fi
 install_dpdk $DPDK_VER
 if [ "$CC" = "clang" ]; then
diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
index 6a5e414..3bc34c8 100644
--- a/Documentation/faq/releases.rst
+++ b/Documentation/faq/releases.rst
@@ -204,9 +204,9 @@ Q: What DPDK version does each Open vSwitch release work 
with?
 2.10.x   17.11.10
 2.11.x   18.11.9
 2.12.x   18.11.9
-2.13.x   19.11.2
-2.14.x   19.11.2
-2.15.x   20.11.0
+2.13.x   19.11.8
+2.14.x   19.11.8
+2.15.x   20.11.1
  
 
 Q: Are all the DPDK releases that OVS versions work with maintained?
diff --git a/Documentation/intro/install/dpdk.rst 
b/Documentation/intro/install/dpdk.rst
index 3a24e54..612f2fd 100644
--- a/Documentation/intro/install/dpdk.rst
+++ b/Documentation/intro/install/dpdk.rst
@@ -42,7 +42,7 @@ Build requirements
 In addition to the requirements described in :doc:`general`, building Open
 vSwitch with DPDK will require the following:
 
-- DPDK 20.11
+- DPDK 20.11.1
 
 - A `DPDK supported NIC`_
 
@@ -73,9 +73,9 @@ Install DPDK
 #. Download the `DPDK sources`_, extract the file and set ``DPDK_DIR``::
 
$ cd /usr/src/
-   $ wget https://fast.dpdk.org/rel/dpdk-20.11.tar.xz
-   $ tar xf dpdk-20.11.tar.xz
-   $ export DPDK_DIR=/usr/src/dpdk-20.11
+   $ wget https://fast.dpdk.org/rel/dpdk-20.11.1.tar.xz
+   $ tar xf dpdk-20.11.1.tar.xz
+   $ export DPDK_DIR=/usr/src/dpdk-stable-20.11.1
$ cd $DPDK_DIR
 
 #. Configure and install DPDK using Meson
diff --git a/NEWS b/NEWS
index 95cf922..402ce59 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,9 @@ Post-v2.15.0
  * New option '--no-record-hostname' to disable hostname configuration
in ovsdb on startup.
  * New command 'record-hostname-if-not-set' to update hostname in ovsdb.
+   - DPDK:
+ * OVS validated with DPDK 20.11.1. It is recommended to use this version
+   until further releases.
 
 
 v2.15.0 - 15 Feb 2021
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH branch-2.15] dpdk: Use DPDK 20.11.1 release

2021-04-21 Thread Hariprasad Govindharajan
Modify ci linux build script to use the latest DPDK stable release.
Modify Documentation to use the latest DPDK stable release 20.11.1
Update NEWS file to reflect the latest DPDK stable releases.
FAQ is updated to reflect the latest DPDK for each branch.

Signed-off-by: Hariprasad Govindharajan 
---
 .ci/linux-build.sh   | 2 +-
 Documentation/faq/releases.rst   | 6 +++---
 Documentation/intro/install/dpdk.rst | 8 
 NEWS | 3 +++
 4 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index dd29a41..1acf501 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -201,7 +201,7 @@ fi
 
 if [ "$DPDK" ] || [ "$DPDK_SHARED" ]; then
 if [ -z "$DPDK_VER" ]; then
-DPDK_VER="20.11"
+DPDK_VER="20.11.1"
 fi
 install_dpdk $DPDK_VER
 if [ "$CC" = "clang" ]; then
diff --git a/Documentation/faq/releases.rst b/Documentation/faq/releases.rst
index 6a5e414..3bc34c8 100644
--- a/Documentation/faq/releases.rst
+++ b/Documentation/faq/releases.rst
@@ -204,9 +204,9 @@ Q: What DPDK version does each Open vSwitch release work 
with?
 2.10.x   17.11.10
 2.11.x   18.11.9
 2.12.x   18.11.9
-2.13.x   19.11.2
-2.14.x   19.11.2
-2.15.x   20.11.0
+2.13.x   19.11.8
+2.14.x   19.11.8
+2.15.x   20.11.1
  
 
 Q: Are all the DPDK releases that OVS versions work with maintained?
diff --git a/Documentation/intro/install/dpdk.rst 
b/Documentation/intro/install/dpdk.rst
index 3a24e54..612f2fd 100644
--- a/Documentation/intro/install/dpdk.rst
+++ b/Documentation/intro/install/dpdk.rst
@@ -42,7 +42,7 @@ Build requirements
 In addition to the requirements described in :doc:`general`, building Open
 vSwitch with DPDK will require the following:
 
-- DPDK 20.11
+- DPDK 20.11.1
 
 - A `DPDK supported NIC`_
 
@@ -73,9 +73,9 @@ Install DPDK
 #. Download the `DPDK sources`_, extract the file and set ``DPDK_DIR``::
 
$ cd /usr/src/
-   $ wget https://fast.dpdk.org/rel/dpdk-20.11.tar.xz
-   $ tar xf dpdk-20.11.tar.xz
-   $ export DPDK_DIR=/usr/src/dpdk-20.11
+   $ wget https://fast.dpdk.org/rel/dpdk-20.11.1.tar.xz
+   $ tar xf dpdk-20.11.1.tar.xz
+   $ export DPDK_DIR=/usr/src/dpdk-stable-20.11.1
$ cd $DPDK_DIR
 
 #. Configure and install DPDK using Meson
diff --git a/NEWS b/NEWS
index 036d403..d6a9395 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,9 @@ v2.15.1 - xx xxx 
  * New option '--no-record-hostname' to disable hostname configuration
in ovsdb on startup.
  * New command 'record-hostname-if-not-set' to update hostname in ovsdb.
+   - DPDK:
+ * OVS validated with DPDK 20.11.1. It is recommended to use this version
+   until further releases.
 
 
 v2.15.0 - 15 Feb 2021
-- 
2.7.4

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2] controller: Monitor all logical flows that refer to datapath groups.

2021-04-21 Thread Ilya Maximets
On 4/21/21 9:40 AM, Dumitru Ceara wrote:
> On 4/20/21 8:54 PM, Han Zhou wrote:
>> On Tue, Apr 20, 2021 at 12:43 AM Dumitru Ceara  wrote:
>>>
>>> On 4/19/21 7:24 PM, Han Zhou wrote:
 On Mon, Apr 19, 2021 at 12:27 AM Dumitru Ceara 
>> wrote:
>
> On 4/16/21 7:16 PM, Mark Michelson wrote:
>> Thanks for the explanation Dumitru. It made good sense out of what had
>> to be a difficult thing to debug. I'm assuming you went with
>> unconditional monitoring since this will work with any change to a DP
>> group, and not just the addition new datapaths to the group.
>
> There's also a case I didn't mention in the commit log (but I can add
>> it
> if you think it's worth adding):
>
> If a logical flow LF1 is not shared between datapaths, so it's only
> applied to DP1, when DP2 is added that needs an identical flow,
> ovn-northd might generate a fresh new DPG = {.datapaths=[DP1, DP2]},
>> and
> create a new LF1' applied on datapath_group=DPG.
>
> Without unconditional monitoring for datapath groups there's no easy
>> way
> to get the notification about DPG and LF1' being added in the same
> jsonrpc update from SB with the delete for LF.
>
> Given that the number of datapath groups is expected to be relatively
> low compared to the total number of logical flows, unconditional
> monitoring seems like the best solution.
>
>>
>> Acked-by: Mark Michelson 
>
> Thanks!
>

 +1 (although late)

 I am catching up the DPG implementation, and here are some thoughts on
>> the
 use cases. For my understanding, except for the DPG that contains all
>> DPs,
 most DPGs are mapping to port-groups, which usually map to
 tenants/namespaces. The conditional monitoring feature (when
 ovn-monitor-all is disabled) is useful mainly when there are a large
>> number
 of small tenants and they don't need to talk to each other, so their
 workloads may reside on different HVs and DPs, so conditional monitoring
 can result in a smaller number of lflows on each HV in this scenario.
 However, when DPG related flows are unconditionally monitored, it makes
 conditional monitoring not as useful as it should be when there are a
>> lot
 of ACLs, because ACLs are the major user of the port-group based DPGs,
 which may contribute to a significant portion of lflows. In such
>> scenarios
 (when there are many small tenants), maybe it is better to disable DPG
>> (and
 disable ovn-monitor-all) to benefit more from the conditional monitoring
>>>
>>> I think DPGs may be useful in this case too (many small tenants) with
>>> conditional monitoring enabled.
>>>
>>> There's a reasonable number of "trivial" logical flows, e.g., the flows
>>> that advance to next table by default, that are shared between all
>>> logical datapaths and DPGs will optimize those.
>>>
>>> For example, on one such large downstream OpenStack deployment we saw
>>> IIRC a reduction of ~20% w.r.t. number of logical flows and SB size when
>>> enabling DPG.  It's not a huge factor but it's still a relevant
>>> reduction in number of objects in the SB that need to be sent to all
>>> ovn-controllers.
>>
>> It is true that DPG will still reduce SB size on the central nodes, but my
>> above point is about the lflows being conditionally monitored by each
>> individual HV, because usually central DB size is not a concern if
>> ovn-monitor-all is disabled. In the particular scenario of large amount of
>> small tenants, without DPG, a HV may be interested in only a small amount
>> of DPs which has a small amount of flows in total. With DPG enabled,
>> although sharing the common flows between the DPs would reduce some
>> redundant flows, but since the number of interested DPs is small for each
>> HV, this savings may be small. On the other hand, since all PG related ACL
>> flows for all tenants (all DPs) are now monitored, each HV could have a
>> huge increase of unnecessary ACL flows from unrelated tenants. That's why I
>> think in such a scenario DPG should be disabled.
> 
> I see, you're right, makes sense.  Another way to make this work in all
> scenarios is to add two-level deep conditional monitoring to the IDL and
> ovsdb-server, that is, allow ovn-controller to conditionally monitor
> "all logical flows that refer to DPG that contain datapaths equal to a
> specific value".

Another (probably, easier in implementation?) option is to avoid flow
updates if we have in-flight condition changes, i.e. if we do not have
a full view over logical flows.

> 
>>
>>>
 feature. In the contrary use cases when there are few but large tenants,
 each of them maps to a large port-group, crossing a large amount of
 datapaths, then enabling DPG may be a better choice, and at the same
>> time
 enabling ovn-monitor-all may be recommended.

 Is the above understanding correct?

>>>
>>> I think so.
>>>
>>> Maybe it m

Re: [ovs-dev] [PATCH ovn 0/2] ovn-ctl: stop databases with stop_ovn_daemon()

2021-04-21 Thread Numan Siddique
On Tue, Apr 20, 2021 at 1:06 PM Mark Michelson  wrote:
>
> Hi Dan,
>
> Acked-by: Mark Michelson 
>
> for both patches in the series.

Thanks Dan (and Mark).  I applied the patches to the main branch.

Numan

>
> On 4/12/21 10:58 AM, Dan Williams wrote:
> > Wait for the databases to actually exit before returning to
> > the caller of ovn-ctl.
> >
> > v1 -> v2
> > -
> >* No code changes
> >* Fixed stop_ovn_daemon() name in patch headers
> >* Added "ovn" to patch subjects for correct patchwork
> >
> > Dan Williams (2):
> >ovn-lib: harmonize stop_ovn_daemon() with ovs-lib
> >ovn-ctl: stop databases with stop_ovn_daemon()
> >
> >   utilities/ovn-ctl| 16 +---
> >   utilities/ovn-lib.in | 40 ++--
> >   2 files changed, 39 insertions(+), 17 deletions(-)
> >
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2] ovn-nbctl: fails to delete and add lr-nat in trx

2021-04-21 Thread Numan Siddique
On Thu, Apr 8, 2021 at 3:13 AM Aidan Shribman  wrote:
>
> The delete operation did not update the internal state cuasing the add
> operation to wrongly believe that the lr-nat entry is still present and
> thus can't be added.
>
> Below is the sequance is failed before and now passes:
>
> ***
> echo "add switch"
> ovn-nbctl ls-del ls0 2>/dev/null
> ovn-nbctl ls-add ls0
> ovn-nbctl ls-list
> ovn-nbctl show ls0
> ovn-nbctl lsp-del lsp0 2>/dev/null
> ovn-nbctl lsp-add ls0 lsp0 00:00:00:00:00:01 10.0.0.0/24
> ovn-nbctl lsp-list ls0
>
> echo "add router"
> ovn-nbctl ls-del lr0 2>/dev/null
> ovn-nbctl lr-add lr0
> ovn-nbctl lr-list
> ovn-nbctl show lr0
>
> echo "add nat"
> ovn-nbctl lr-nat-del lr0 2>/dev/null
> ovn-nbctl lr-nat-add lr0 dnat_and_snat 192.168.0.3 10.0.0.3 \
> lsp0 f0:00:00:00:00:03
> ovn-nbctl lr-nat-list lr0
>
> echo "FIXED: delete nat (all of type) + add nat"
> ovn-nbctl lr-nat-del lr0 dnat_and_snat -- \
> --may-exist lr-nat-add lr0 dnat_and_snat 192.168.0.3 10.0.0.3 \
> lsp0 f0:00:00:00:00:03
> ovn-nbctl lr-nat-list lr0
>
> echo "FIXED: delete nat (all of external_ip) + add"
> ovn-nbctl lr-nat-del lr0 dnat_and_snat 192.168.0.3 -- \
> --may-exist lr-nat-add lr0 dnat_and_snat 192.168.0.3 10.0.0.3 \
> lsp0 f0:00:00:00:00:03
> ovn-nbctl lr-nat-list lr0
> ***
>
> Signed-off-by: Aidan Shribman 
> Fixed: 1928226 (ovn-nbctl fails to delete/add lr-nat in trx)

Hi Aidan,

I haven't reviewed the code yet. But looking into the commit message,
I think you can add a test case for this fix ? Can you please add the test case
and submit v3 ?

Thanks
Numan

> ---
>  utilities/ovn-nbctl.c | 33 ++---
>  1 file changed, 18 insertions(+), 15 deletions(-)
>
> diff --git a/utilities/ovn-nbctl.c b/utilities/ovn-nbctl.c
> index 184058356..db1ac4f5b 100644
> --- a/utilities/ovn-nbctl.c
> +++ b/utilities/ovn-nbctl.c
> @@ -4531,11 +4531,18 @@ nbctl_lr_nat_del(struct ctl_context *ctx)
>
>  if (ctx->argc == 3) {
>  /*Deletes all NATs with the specified type. */
> +struct nbrec_nat **keep_nat = xmalloc(lr->n_nat * sizeof *keep_nat);
> +size_t n_keep_nat = 0;
>  for (size_t i = 0; i < lr->n_nat; i++) {
> -if (!strcmp(nat_type, lr->nat[i]->type)) {
> -nbrec_logical_router_update_nat_delvalue(lr, lr->nat[i]);
> +if (strcmp(nat_type, lr->nat[i]->type)) {
> +keep_nat[n_keep_nat++] = lr->nat[i];
>  }
>  }
> +if (n_keep_nat < lr->n_nat) {
> +nbrec_logical_router_verify_nat(lr);
> +nbrec_logical_router_set_nat(lr, keep_nat, n_keep_nat);
> +}
> +free(keep_nat);
>  return;
>  }
>
> @@ -4547,31 +4554,27 @@ nbctl_lr_nat_del(struct ctl_context *ctx)
>
>  int is_snat = !strcmp("snat", nat_type);
>  /* Remove the matching NAT. */
> +struct nbrec_nat **keep_nat = xmalloc(lr->n_nat * sizeof *keep_nat);
> +size_t n_keep_nat = 0;
>  for (size_t i = 0; i < lr->n_nat; i++) {
>  struct nbrec_nat *nat = lr->nat[i];
> -bool should_return = false;
>  char *old_ip = normalize_prefix_str(is_snat
>  ? nat->logical_ip
>  : nat->external_ip);
> -if (!old_ip) {
> -continue;
> -}
> -if (!strcmp(nat_type, nat->type) && !strcmp(nat_ip, old_ip)) {
> -nbrec_logical_router_update_nat_delvalue(lr, nat);
> -should_return = true;
> +if (!old_ip || strcmp(nat_type, nat->type) || strcmp(nat_ip, 
> old_ip)) {
> +keep_nat[n_keep_nat++] = lr->nat[i];
>  }
>  free(old_ip);
> -if (should_return) {
> -goto cleanup;
> -}
>  }
>
> -if (must_exist) {
> +if (n_keep_nat < lr->n_nat) {
> +nbrec_logical_router_verify_nat(lr);
> +nbrec_logical_router_set_nat(lr, keep_nat, n_keep_nat);
> +} else if (must_exist) {
>  ctl_error(ctx, "no matching NAT with the type (%s) and %s (%s)",
>nat_type, is_snat ? "logical_ip" : "external_ip", nat_ip);
>  }
> -
> -cleanup:
> +free(keep_nat);
>  free(nat_ip);
>  }
>
> --
> 2.25.1
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH 1/3] dpif-netdev: Do not flush PMD offloads on reload

2021-04-21 Thread Eli Britstein
Before flushing offloads of a removed port was supported by [1], it was
necessary to flush the 'marks'. In doing so, all offloads of the PMD are
removed, include the ones that are not related to the removed port and
that are not modified following this removal. As a result such flows are
evicted from being offloaded, and won't resume offloading.

As PMD offload flush is not necessary, avoid it.

[1] 62d1c28e9ce0 ("dpif-netdev: Flush offload rules upon port deletion.")

Signed-off-by: Eli Britstein 
Reviewed-by: Gaetan Rivet 
---
 lib/dpif-netdev.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 251788b04..a69b85842 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2570,18 +2570,6 @@ mark_to_flow_disassociate(struct dp_netdev_pmd_thread 
*pmd,
 return ret;
 }
 
-static void
-flow_mark_flush(struct dp_netdev_pmd_thread *pmd)
-{
-struct dp_netdev_flow *flow;
-
-CMAP_FOR_EACH (flow, mark_node, &flow_mark.mark_to_flow) {
-if (flow->pmd_id == pmd->core_id) {
-queue_netdev_flow_del(pmd, flow);
-}
-}
-}
-
 static struct dp_netdev_flow *
 mark_to_flow_find(const struct dp_netdev_pmd_thread *pmd,
   const uint32_t mark)
@@ -5165,7 +5153,6 @@ reload_affected_pmds(struct dp_netdev *dp)
 
 CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
 if (pmd->need_reload) {
-flow_mark_flush(pmd);
 dp_netdev_reload_pmd__(pmd);
 }
 }
-- 
2.26.2.1730.g385c171

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH 2/3] dpif-netdev: Fix offloads of modified flows

2021-04-21 Thread Eli Britstein
Association of a mark to a flow is done as part of its offload handling,
in the offloading thread. However, the PMD thread specifies whether an
offload request is an "add" or "modify" by the association of a mark to
the flow.
This is exposed to a race condition. A flow might be created with
actions that cannot be fully offloaded, for example flooding (before MAC
learning), and later modified to have actions that can be fully
offloaded. If the two requests are queued before the offload thread
handling, they are both marked as "add". When the offload thread handles
them, the first request is partially offloaded, and the second one is
ignored as the flow is already considered as offloaded.

Fix it by specifying add/modify of an offload request by the actual flow
state change, without relying on the mark.

Fixes: 3c7330ebf036 ("netdev-offload-dpdk: Support offload of output action.")
Signed-off-by: Eli Britstein 
Reviewed-by: Gaetan Rivet 
---
 lib/dpif-netdev.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index a69b85842..37216effc 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2787,7 +2787,8 @@ queue_netdev_flow_del(struct dp_netdev_pmd_thread *pmd,
 static void
 queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
   struct dp_netdev_flow *flow, struct match *match,
-  const struct nlattr *actions, size_t actions_len)
+  const struct nlattr *actions, size_t actions_len,
+  const struct dp_netdev_actions *old_actions)
 {
 struct dp_flow_offload_item *offload;
 int op;
@@ -2803,11 +2804,9 @@ queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
 ovsthread_once_done(&offload_thread_once);
 }
 
-if (flow->mark != INVALID_FLOW_MARK) {
-op = DP_NETDEV_FLOW_OFFLOAD_OP_MOD;
-} else {
-op = DP_NETDEV_FLOW_OFFLOAD_OP_ADD;
-}
+op = old_actions
+? DP_NETDEV_FLOW_OFFLOAD_OP_MOD
+: DP_NETDEV_FLOW_OFFLOAD_OP_ADD;
 offload = dp_netdev_alloc_flow_offload(pmd, flow, op);
 offload->match = *match;
 offload->actions = xmalloc(actions_len);
@@ -3680,7 +3679,7 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
 cmap_insert(&pmd->flow_table, CONST_CAST(struct cmap_node *, &flow->node),
 dp_netdev_flow_hash(&flow->ufid));
 
-queue_netdev_flow_put(pmd, flow, match, actions, actions_len);
+queue_netdev_flow_put(pmd, flow, match, actions, actions_len, NULL);
 
 if (OVS_UNLIKELY(!VLOG_DROP_DBG((&upcall_rl {
 struct ds ds = DS_EMPTY_INITIALIZER;
@@ -3767,7 +3766,7 @@ flow_put_on_pmd(struct dp_netdev_pmd_thread *pmd,
 ovsrcu_set(&netdev_flow->actions, new_actions);
 
 queue_netdev_flow_put(pmd, netdev_flow, match,
-  put->actions, put->actions_len);
+  put->actions, put->actions_len, old_actions);
 
 if (stats) {
 get_dpif_flow_status(pmd->dp, netdev_flow, stats, NULL);
-- 
2.26.2.1730.g385c171

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH 3/3] dpif-netdev: Log flow modification in debug level

2021-04-21 Thread Eli Britstein
Log flow modifications to help debugging.

Signed-off-by: Eli Britstein 
Reviewed-by: Gaetan Rivet 
---
 lib/dpif-netdev.c | 101 +-
 1 file changed, 55 insertions(+), 46 deletions(-)

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 37216effc..d8df9f428 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -2793,6 +2793,61 @@ queue_netdev_flow_put(struct dp_netdev_pmd_thread *pmd,
 struct dp_flow_offload_item *offload;
 int op;
 
+if (OVS_UNLIKELY(!VLOG_DROP_DBG((&upcall_rl {
+struct ds ds = DS_EMPTY_INITIALIZER;
+struct ofpbuf key_buf, mask_buf;
+struct odp_flow_key_parms odp_parms = {
+.flow = &match->flow,
+.mask = &match->wc.masks,
+.support = dp_netdev_support,
+};
+
+ofpbuf_init(&key_buf, 0);
+ofpbuf_init(&mask_buf, 0);
+
+odp_flow_key_from_flow(&odp_parms, &key_buf);
+odp_parms.key_buf = &key_buf;
+odp_flow_key_from_mask(&odp_parms, &mask_buf);
+
+if (old_actions) {
+ds_put_cstr(&ds, "flow_mod: ");
+} else {
+ds_put_cstr(&ds, "flow_add: ");
+}
+odp_format_ufid(&flow->ufid, &ds);
+ds_put_cstr(&ds, " mega_");
+odp_format_ufid(&flow->mega_ufid, &ds);
+ds_put_cstr(&ds, " ");
+odp_flow_format(key_buf.data, key_buf.size,
+mask_buf.data, mask_buf.size,
+NULL, &ds, false);
+if (old_actions) {
+ds_put_cstr(&ds, ", old_actions:");
+format_odp_actions(&ds, old_actions->actions, old_actions->size,
+   NULL);
+}
+ds_put_cstr(&ds, ", actions:");
+format_odp_actions(&ds, actions, actions_len, NULL);
+
+VLOG_DBG("%s", ds_cstr(&ds));
+
+ofpbuf_uninit(&key_buf);
+ofpbuf_uninit(&mask_buf);
+
+/* Add a printout of the actual match installed. */
+struct match m;
+ds_clear(&ds);
+ds_put_cstr(&ds, "flow match: ");
+miniflow_expand(&flow->cr.flow.mf, &m.flow);
+miniflow_expand(&flow->cr.mask->mf, &m.wc.masks);
+memset(&m.tun_md, 0, sizeof m.tun_md);
+match_format(&m, NULL, &ds, OFP_DEFAULT_PRIORITY);
+
+VLOG_DBG("%s", ds_cstr(&ds));
+
+ds_destroy(&ds);
+}
+
 if (!netdev_is_flow_api_enabled()) {
 return;
 }
@@ -3681,52 +3736,6 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
 
 queue_netdev_flow_put(pmd, flow, match, actions, actions_len, NULL);
 
-if (OVS_UNLIKELY(!VLOG_DROP_DBG((&upcall_rl {
-struct ds ds = DS_EMPTY_INITIALIZER;
-struct ofpbuf key_buf, mask_buf;
-struct odp_flow_key_parms odp_parms = {
-.flow = &match->flow,
-.mask = &match->wc.masks,
-.support = dp_netdev_support,
-};
-
-ofpbuf_init(&key_buf, 0);
-ofpbuf_init(&mask_buf, 0);
-
-odp_flow_key_from_flow(&odp_parms, &key_buf);
-odp_parms.key_buf = &key_buf;
-odp_flow_key_from_mask(&odp_parms, &mask_buf);
-
-ds_put_cstr(&ds, "flow_add: ");
-odp_format_ufid(ufid, &ds);
-ds_put_cstr(&ds, " mega_");
-odp_format_ufid(&flow->mega_ufid, &ds);
-ds_put_cstr(&ds, " ");
-odp_flow_format(key_buf.data, key_buf.size,
-mask_buf.data, mask_buf.size,
-NULL, &ds, false);
-ds_put_cstr(&ds, ", actions:");
-format_odp_actions(&ds, actions, actions_len, NULL);
-
-VLOG_DBG("%s", ds_cstr(&ds));
-
-ofpbuf_uninit(&key_buf);
-ofpbuf_uninit(&mask_buf);
-
-/* Add a printout of the actual match installed. */
-struct match m;
-ds_clear(&ds);
-ds_put_cstr(&ds, "flow match: ");
-miniflow_expand(&flow->cr.flow.mf, &m.flow);
-miniflow_expand(&flow->cr.mask->mf, &m.wc.masks);
-memset(&m.tun_md, 0, sizeof m.tun_md);
-match_format(&m, NULL, &ds, OFP_DEFAULT_PRIORITY);
-
-VLOG_DBG("%s", ds_cstr(&ds));
-
-ds_destroy(&ds);
-}
-
 return flow;
 }
 
-- 
2.26.2.1730.g385c171

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH 0/3] dpif-netdev offload transitions

2021-04-21 Thread Eli Britstein
This patch-set improves offloads transitions behavior.

Patch #1 avoids flushing PMD offloads unnecessarily.
Patch #2 fixes a race condition with flow modifications.
Patch #3 improves debuggability of flow modifications.

Travis:
v1: https://travis-ci.org/github/elibritstein/OVS/builds/767839987

GitHub Actions:
v1: https://github.com/elibritstein/OVS/actions/runs/769805954
- This run has encountered some internal GitHub problems.
- A previous good run, with the same code, only changed commit
  messages since:
https://github.com/elibritstein/OVS/actions/runs/70787

Eli Britstein (3):
  dpif-netdev: Do not flush PMD offloads on reload
  dpif-netdev: Fix offloads of modified flows
  dpif-netdev: Log flow modification in debug level

 lib/dpif-netdev.c | 129 ++
 1 file changed, 62 insertions(+), 67 deletions(-)

-- 
2.26.2.1730.g385c171

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH V6 00/13] Netdev vxlan-decap offload

2021-04-21 Thread Kovacevic, Marko
> On 4/7/2021 12:10 PM, Sriharsha Basavapatna wrote:
> >
> > On Sun, Apr 4, 2021 at 3:25 PM Eli Britstein  > > wrote:
> >
> > VXLAN decap in OVS-DPDK configuration consists of two flows:
> > F1: in_port(ens1f0),eth(),ipv4(),udp(),
> > actions:tnl_pop(vxlan_sys_4789)
> > F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0
> >
> > F1 is a classification flow. It has outer headers matches and it
> > classifies the packet as a VXLAN packet, and using tnl_pop action the
> > packet continues processing in F2.
> > F2 is a flow that has matches on tunnel metadata as well as on the
> > inner
> > packet headers (as any other flow).
> >
> > In order to fully offload VXLAN decap path, both F1 and F2 should be
> > offloaded. As there are more than one flow in HW, it is possible that
> > F1 is done by HW but F2 is not. Packet is received by SW, and
> > should be
> > processed starting from F2 as F1 was already done by HW.
> > Rte_flows are applicable only on physical port IDs. Keeping the
> > original
> > physical in port on which the packet is received on enables applying
> > vport flows (e.g. F2) on that physical port.
> >
> > This patch-set makes use of [1] introduced in DPDK 20.11, that
> > adds API
> > for tunnel offloads.
> >
> > Note that MLX5 PMD has a bug that the tnl_pop private actions must be
> > first. In OVS it is not.
> > Fixing this issue is scheduled for 21.05 (and stable 20.11.2).
> > Meanwhile, tests were done with a workaround for it [2].
> >
> > v2-v1:
> > - Tracking original in_port, and applying vport on that physical
> > port instead of all PFs.
> > v3-v2:
> > - Traversing ports using a new API instead of flow_dump.
> > - Refactor packet state recover logic, with bug fix for error
> > pop_header.
> > - One ref count for netdev in non-tunnel case.
> > - Rename variables, comments, rebase.
> > v4-v3:
> > - Extract orig_in_port from physdev for flow modify.
> > - Miss handling fixes.
> > v5-v4:
> > - Drop refactor offload rule creation commit.
> > - Comment about setting in_port in restore.
> > - Refactor vports flow offload commit.
> > v6-v5:
> > - Fixed duplicate netdev ref bug.
> >
> >
> > Can you provide some info on this bug ?  and what changes were done to
> > fix this ?
> 
> With v5, the 2 netdevs sent to ufid_to_rte_flow_associate are always
> non-NULL (was not like this previously), and there was this line:
> 
> +    data->netdev = vport ? netdev_ref(vport) : physdev;
> 
> As the "vport" was always non-null, even for non-tunnels, it took
> another ref of it, but in disassociate, only one close was done.
> 
> With v6 it is like this (changed arguments names a bit)
> 
> +    data->physdev = netdev != physdev ? netdev_ref(physdev) : physdev;
> 
> Checking the netdevs are different, not non-NULL.
> 
> > Thanks,
> > -Harsha
> >
> >
> > Travis:
> > v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552
> > v2
> > :
> > https://travis-ci.org/github/elibritstein/OVS/builds/758382963
> > v3
> > :
> > https://travis-ci.org/github/elibritstein/OVS/builds/761089087
> > v4
> > :
> > https://travis-ci.org/github/elibritstein/OVS/builds/763146966
> > v5
> > :
> > https://travis-ci.org/github/elibritstein/OVS/builds/765271879
> > v6
> > :
> > https://travis-ci.org/github/elibritstein/OVS/builds/765816800
> > 
> >
> > GitHub Actions:
> > v1: https://github.com/elibritstein/OVS/actions/runs/515334647
> > 
> > v2: https://github.com/elibritstein/OVS/actions/runs/554986007
> > 
> > v3: https://github.com/elibritstein/OVS/actions/runs/613226225
> > 
> > v4: https://github.com/elibritstein/OVS/actions/runs/658415274
> > 
> > v5: https://github.com/elibritstein/OVS/actions/runs/704357369
> > 
> > v6: https://github.com/elibritstein/OVS/actions/runs/716304028
> > 
> >
> > [1] https://mails.dpdk.org/archives/dev/2020-October/187314.html
> > 
> > [2] https:/

Re: [ovs-dev] [PATCH ovn] ovn-controller-vtep: Set chassis_name for newly created Encap.

2021-04-21 Thread Dumitru Ceara
On 4/20/21 8:55 PM, Odintsov Vladislav wrote:
> Hi,
> 
> Thanks for the quick reaction!
> I've tested your patch against 20.06.3 and it works well for me.
> May I ask you to backport your patch down to supported branches?
>  
> Tested-by: Vladislav Odintsov 

Thanks for trying it out, Vladislav, and thanks for the review, Mark!

As this is a bug fix and it applies cleanly to all supported branches,
when it's merged I'm pretty sure the maintainers can backport it
directly there.

Regards,
Dumitru

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2] controller: Monitor all logical flows that refer to datapath groups.

2021-04-21 Thread Dumitru Ceara
On 4/20/21 8:54 PM, Han Zhou wrote:
> On Tue, Apr 20, 2021 at 12:43 AM Dumitru Ceara  wrote:
>>
>> On 4/19/21 7:24 PM, Han Zhou wrote:
>>> On Mon, Apr 19, 2021 at 12:27 AM Dumitru Ceara 
> wrote:

 On 4/16/21 7:16 PM, Mark Michelson wrote:
> Thanks for the explanation Dumitru. It made good sense out of what had
> to be a difficult thing to debug. I'm assuming you went with
> unconditional monitoring since this will work with any change to a DP
> group, and not just the addition new datapaths to the group.

 There's also a case I didn't mention in the commit log (but I can add
> it
 if you think it's worth adding):

 If a logical flow LF1 is not shared between datapaths, so it's only
 applied to DP1, when DP2 is added that needs an identical flow,
 ovn-northd might generate a fresh new DPG = {.datapaths=[DP1, DP2]},
> and
 create a new LF1' applied on datapath_group=DPG.

 Without unconditional monitoring for datapath groups there's no easy
> way
 to get the notification about DPG and LF1' being added in the same
 jsonrpc update from SB with the delete for LF.

 Given that the number of datapath groups is expected to be relatively
 low compared to the total number of logical flows, unconditional
 monitoring seems like the best solution.

>
> Acked-by: Mark Michelson 

 Thanks!

>>>
>>> +1 (although late)
>>>
>>> I am catching up the DPG implementation, and here are some thoughts on
> the
>>> use cases. For my understanding, except for the DPG that contains all
> DPs,
>>> most DPGs are mapping to port-groups, which usually map to
>>> tenants/namespaces. The conditional monitoring feature (when
>>> ovn-monitor-all is disabled) is useful mainly when there are a large
> number
>>> of small tenants and they don't need to talk to each other, so their
>>> workloads may reside on different HVs and DPs, so conditional monitoring
>>> can result in a smaller number of lflows on each HV in this scenario.
>>> However, when DPG related flows are unconditionally monitored, it makes
>>> conditional monitoring not as useful as it should be when there are a
> lot
>>> of ACLs, because ACLs are the major user of the port-group based DPGs,
>>> which may contribute to a significant portion of lflows. In such
> scenarios
>>> (when there are many small tenants), maybe it is better to disable DPG
> (and
>>> disable ovn-monitor-all) to benefit more from the conditional monitoring
>>
>> I think DPGs may be useful in this case too (many small tenants) with
>> conditional monitoring enabled.
>>
>> There's a reasonable number of "trivial" logical flows, e.g., the flows
>> that advance to next table by default, that are shared between all
>> logical datapaths and DPGs will optimize those.
>>
>> For example, on one such large downstream OpenStack deployment we saw
>> IIRC a reduction of ~20% w.r.t. number of logical flows and SB size when
>> enabling DPG.  It's not a huge factor but it's still a relevant
>> reduction in number of objects in the SB that need to be sent to all
>> ovn-controllers.
> 
> It is true that DPG will still reduce SB size on the central nodes, but my
> above point is about the lflows being conditionally monitored by each
> individual HV, because usually central DB size is not a concern if
> ovn-monitor-all is disabled. In the particular scenario of large amount of
> small tenants, without DPG, a HV may be interested in only a small amount
> of DPs which has a small amount of flows in total. With DPG enabled,
> although sharing the common flows between the DPs would reduce some
> redundant flows, but since the number of interested DPs is small for each
> HV, this savings may be small. On the other hand, since all PG related ACL
> flows for all tenants (all DPs) are now monitored, each HV could have a
> huge increase of unnecessary ACL flows from unrelated tenants. That's why I
> think in such a scenario DPG should be disabled.

I see, you're right, makes sense.  Another way to make this work in all
scenarios is to add two-level deep conditional monitoring to the IDL and
ovsdb-server, that is, allow ovn-controller to conditionally monitor
"all logical flows that refer to DPG that contain datapaths equal to a
specific value".

> 
>>
>>> feature. In the contrary use cases when there are few but large tenants,
>>> each of them maps to a large port-group, crossing a large amount of
>>> datapaths, then enabling DPG may be a better choice, and at the same
> time
>>> enabling ovn-monitor-all may be recommended.
>>>
>>> Is the above understanding correct?
>>>
>>
>> I think so.
>>
>> Maybe it makes sense to try to add some guidelines about when to enable
>> the datapath groups feature and how to combine it with conditional
>> monitoring, maybe as part of the OVN documentation?  What do you think?
>>
> Yes, I agree we should document the guidelines.
> 
>> Thanks,
>> Dumitru
>>
>>
>