Re: [ovs-dev] [PATCH net] openvswitch: meter: remove rate from the bucket size calculation

2021-04-27 Thread Jean Tourrilhes
On Wed, Apr 28, 2021 at 02:24:10PM +0800, Tonghao Zhang wrote:
> Hi Ilya
> If we set the burst size too small, the meters of ovs don't work.

Most likely, you need to set the burst size larger.
A quick Google on finding a good burst size :
https://www.juniper.net/documentation/us/en/software/junos/routing-policy/topics/concept/policer-mx-m120-m320-burstsize-determining.html

Now, the interesting question, is the behaviour of OVS
different from a standard token bucket, such as a kernel policer ?
Here is how to set up a kernel policer :
--
# Create a dummy classful discipline to attach filter
tc qdisc del dev eth6 root
tc qdisc add dev eth6 root handle 1: prio bands 2 priomap  0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0
tc qdisc add dev eth6 parent 1:1 handle 10: pfifo limit 1000
tc qdisc add dev eth6 parent 1:2 handle 20: pfifo limit 1000
tc -s qdisc show dev eth6
tc -s class show dev eth6

# Filter to do hard rate limiting
tc filter del dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 
tc filter add dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 match 
u32 0 0 police rate 200mbit burst 20K mtu 1 drop
tc -s filter show dev eth6
tc filter change dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 
match u32 0 0 police rate 200mbit burst 50K mtu 1 drop
--

Regards,

Jean
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] openvswitch: meter: remove rate from the bucket size calculation

2021-04-27 Thread Tonghao Zhang
On Wed, Apr 21, 2021 at 9:57 PM Ilya Maximets  wrote:
>
> Implementation of meters supposed to be a classic token bucket with 2
> typical parameters: rate and burst size.
>
> Burst size in this schema is the maximum number of bytes/packets that
> could pass without being rate limited.
>
> Recent changes to userspace datapath made meter implementation to be
> in line with the kernel one, and this uncovered several issues.
>
> The main problem is that maximum bucket size for unknown reason
> accounts not only burst size, but also the numerical value of rate.
> This creates a lot of confusion around behavior of meters.
>
> For example, if rate is configured as 1000 pps and burst size set to 1,
> this should mean that meter will tolerate bursts of 1 packet at most,
> i.e. not a single packet above the rate should pass the meter.
> However, current implementation calculates maximum bucket size as
> (rate + burst size), so the effective bucket size will be 1001.  This
> means that first 1000 packets will not be rate limited and average
> rate might be twice as high as the configured rate.  This also makes
> it practically impossible to configure meter that will have burst size
> lower than the rate, which might be a desirable configuration if the
> rate is high.
>
> Inability to configure low values of a burst size and overall inability
> for a user to predict what will be a maximum and average rate from the
> configured parameters of a meter without looking at the OVS and kernel
> code might be also classified as a security issue, because drop meters
> are frequently used as a way of protection from DoS attacks.
>
> This change removes rate from the calculation of a bucket size, making
> it in line with the classic token bucket algorithm and essentially
> making the rate and burst tolerance being predictable from a users'
> perspective.
>
> Same change proposed for the userspace implementation.
Hi Ilya
If we set the burst size too small, the meters of ovs don't work.  For example,
ovs-ofctl -O OpenFlow13 add-meter br-int "meter=1 kbps stats burst
bands=type=drop rate=1 burst_size=12"
ovs-ofctl -O OpenFlow13 add-flow br-int "in_port=$P0 action=meter=1,output:$P1"
but the rate of port P1 was 5.61 Mbit/s
or
ovs-ofctl -O OpenFlow13 add-meter br-int "meter=1 kbps stats burst
bands=type=drop rate=1 burst_size=1"
but the rate of port P1 was 0.

the length of packets is 1400B.
I think we should check whether the band->burst_size >= band->burst_rate ?

I don't test the userspace meters.
> Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
> Signed-off-by: Ilya Maximets 
> ---
>
> The same patch for the userspace datapath:
>   
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210421134816.311584-1-i.maxim...@ovn.org/
>
>  net/openvswitch/meter.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
> index 15424d26e85d..96b524ceabca 100644
> --- a/net/openvswitch/meter.c
> +++ b/net/openvswitch/meter.c
> @@ -392,7 +392,7 @@ static struct dp_meter *dp_meter_create(struct nlattr **a)
>  *
>  * Start with a full bucket.
>  */
> -   band->bucket = (band->burst_size + band->rate) * 1000ULL;
> +   band->bucket = band->burst_size * 1000ULL;
> band_max_delta_t = div_u64(band->bucket, band->rate);
> if (band_max_delta_t > meter->max_delta_t)
> meter->max_delta_t = band_max_delta_t;
> @@ -641,7 +641,7 @@ bool ovs_meter_execute(struct datapath *dp, struct 
> sk_buff *skb,
> long long int max_bucket_size;
>
> band = &meter->bands[i];
> -   max_bucket_size = (band->burst_size + band->rate) * 1000LL;
> +   max_bucket_size = band->burst_size * 1000LL;
>
> band->bucket += delta_ms * band->rate;
> if (band->bucket > max_bucket_size)
> --
> 2.26.3
>


-- 
Best regards, Tonghao
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC PATCH] dpif-netdev: Support "port-forward" mode to avoid dp cache lookup

2021-04-27 Thread Eli Britstein



On 4/27/2021 6:57 PM, Sriharsha Basavapatna wrote:

On Tue, Apr 27, 2021 at 6:42 PM Eli Britstein  wrote:

On 4/27/2021 2:45 PM, Sriharsha Basavapatna wrote:

On Tue, Apr 27, 2021 at 4:26 PM Ilya Maximets  wrote:

On 4/27/21 11:56 AM, Sriharsha Basavapatna via dev wrote:

Hi Eli,

On Sun, Apr 25, 2021 at 6:22 PM Eli Britstein  wrote:

Hi Harsha,

On 4/20/2021 11:07 AM, Sriharsha Basavapatna wrote:

Sometimes a port might be configured with a single flow that just
forwards packets to another port. This would be useful in configs
where the bridge is just fowarding packets between two ports (for
example, between a vhost-user port and a physical port). A flow
that matches only on the in_port and with an action that forwards
to another port would be configured, to avoid learning or matching
on packet headers.

Example:
$ ovs-ofctl add-flow br0 in_port=1,actions=output:2
$ ovs-ofctl add-flow br0 in_port=2,actions=output:1

This translates to a datapath flow with the match fields wildcarded
for the packet headers. However, the datapath processing still involves

There are still several matches (not wildcards):

 - recirc_id
 - in_port
 - packet_type
 - dl_type
 - vlan_tci
 - nw_frag (for ip packets)

So there might be multiple flows for each such openflow rule.

In the past, I have tried to optimize such scenario, see:

https://mail.openvswitch.org/pipermail/ovs-dev/2019-April/357882.html

That was wrong as commented afterwards.

Another related patch-set was this (also not accepted):

https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/363948.html

Ilya wrote an alternative patch:

https://patchwork.ozlabs.org/patch/1105880/

AFAIR, it didn't improve performance either.

Would be good to have some performance numbers for it as there was
no test results published and I don't know if someone ever tested it.


Thanks for the above pointers. Ilya had also shared this patch
recently while discussing this topic at the ovs-dpdk community
meeting. I want to see if we can utilize part of the logic in that
patch to add some constraints, while still avoiding an additional
table/lookup.  The 'port-forward' mode implies that the user wants to
avoid any kind of lookup in the datapath (as indicated by the ofctl
rule + port-forward mode).

I don't see how to completely avoid lookups.

IIUC, in this patch there is a match and upcall for the first packet,
but there are no matches for subsequent packets.

That's right. Allow the first packet to go through match, upcall,
dp/cache insertion etc. For subsequent packets avoid lookup.


   This will work
only for flow actions that doesn't modify the packet.  If for some
reason the flow contains header modifications OVS will not do that
correctly because the header is not parsed.  Also, if the packet is
a bit different from the very first packet, we might attempt to
modify headers that doesn't exist.  All in all, this is very dangerous
and might lead to OVS crash.  We can't rely on the user to set specific
OF rules for this functionality and we should not have a feature that
might crash OVS if not used accurately.

The way to not parse the packet at all and to not perform any matches is
the way to completely ignore OF rules, but OVS is an OF switch and
such functionality just doesn't fit.

If I add a constraint to check that there is only one action and it's
an OUTPUT action (i.e don't enable port-forward mode if the DP flow
contains other actions like modify), like it is done in your patch,
that should handle this issue ?

Thanks,
-Harsha

In my change I minimized the lookup as possible to a single 64bit key.
And it will actually work with any OF rules and without enabling of
any special flags.  Would be great to see some performance numbers
for it as I didn't see any.


With pvp tests (vxlan config), we have
seen better performance both in pps: ~50% and cpp: ~35%, at a few
thousand flows. Similar improvement can be seen with simple
configurations (e.g testpmd in the vm in txonly fwd mode).


Besides, I've tried this patch. Maybe I did something wrong (I
configured port-forward=true on those ports and those openflow rules,
and pinged between those ports). I didn't see it worked (the coverage,
and also I added my own prints).

When you enable port-forward and start the traffic, you should see a
message like this:
"dpif_netdev(pmd-c02/id:74)|DBG|Setting port_forward_flow: port:
0x7f63400050b0 flow: 0x7f634000afb0"

I'm guessing the flow isn't getting added to the port; the insertion
is currently done when there's an emc hit. I should probably move the
insertion code to handle_packet_upcall(). As a workaround, can you
please update the emc insertion probability (ovs-vsctl --no-wait set
Open_vSwitch . other_config:emc-insert-inv-prob=1) and retry your test
?

Also, please disable normal mode in the bridge (ovs-ofctl del-flows
br0; and then add ofctl rules).  Let me know if you still see the
problem, I'll work with you offline.


With this proposed patch, 

Re: [ovs-dev] [PATCH v1 3/8] tests: Add RCU postpone test

2021-04-27 Thread 0-day Robot
Bleep bloop.  Greetings Gaetan Rivet, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


build:
gcc -std=gnu99 -DHAVE_CONFIG_H -I.-I ./include -I ./include -I ./lib -I 
./lib-Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith 
-Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast 
-Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes 
-Wmissing-field-initializers -fno-strict-aliasing -Wshadow -Werror -Werror   -g 
-O2 -DHAVE_LD_AVX512_GOOD -MT tests/test-odp.o -MD -MP -MF $depbase.Tpo -c -o 
tests/test-odp.o tests/test-odp.c &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo tests/test-ofpbuf.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
gcc -std=gnu99 -DHAVE_CONFIG_H -I.-I ./include -I ./include -I ./lib -I 
./lib-Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith 
-Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast 
-Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes 
-Wmissing-field-initializers -fno-strict-aliasing -Wshadow -Werror -Werror   -g 
-O2 -DHAVE_LD_AVX512_GOOD -MT tests/test-ofpbuf.o -MD -MP -MF $depbase.Tpo -c 
-o tests/test-ofpbuf.o tests/test-ofpbuf.c &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo tests/test-packets.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
gcc -std=gnu99 -DHAVE_CONFIG_H -I.-I ./include -I ./include -I ./lib -I 
./lib-Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith 
-Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast 
-Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes 
-Wmissing-field-initializers -fno-strict-aliasing -Wshadow -Werror -Werror   -g 
-O2 -DHAVE_LD_AVX512_GOOD -MT tests/test-packets.o -MD -MP -MF $depbase.Tpo -c 
-o tests/test-packets.o tests/test-packets.c &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo tests/test-random.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
gcc -std=gnu99 -DHAVE_CONFIG_H -I.-I ./include -I ./include -I ./lib -I 
./lib-Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith 
-Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast 
-Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes 
-Wmissing-field-initializers -fno-strict-aliasing -Wshadow -Werror -Werror   -g 
-O2 -DHAVE_LD_AVX512_GOOD -MT tests/test-random.o -MD -MP -MF $depbase.Tpo -c 
-o tests/test-random.o tests/test-random.c &&\
mv -f $depbase.Tpo $depbase.Po
depbase=`echo tests/test-rcu.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
gcc -std=gnu99 -DHAVE_CONFIG_H -I.-I ./include -I ./include -I ./lib -I 
./lib-Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith 
-Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast 
-Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes 
-Wmissing-field-initializers -fno-strict-aliasing -Wshadow -Werror -Werror   -g 
-O2 -DHAVE_LD_AVX512_GOOD -MT tests/test-rcu.o -MD -MP -MF $depbase.Tpo -c -o 
tests/test-rcu.o tests/test-rcu.c &&\
mv -f $depbase.Tpo $depbase.Po
tests/test-rcu.c: In function ‘test_rcu’:
tests/test-rcu.c:83:12: error: missing braces around initializer 
[-Werror=missing-braces]
 struct rcu_user_aux aux[N_THREAD] = {0};
^
tests/test-rcu.c:83:12: error: (near initialization for ‘aux[0]’) 
[-Werror=missing-braces]
cc1: all warnings being treated as errors
make[2]: *** [tests/test-rcu.o] Error 1
make[2]: Leaving directory 
`/var/lib/jenkins/jobs/0day_robot_upstream_build_from_pw/workspace'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/var/lib/jenkins/jobs/0day_robot_upstream_build_from_pw/workspace'
make: *** [all] Error 2


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 4/4] ovn-controller: Fix port group conjunction flow explosion problem.

2021-04-27 Thread Han Zhou
On Tue, Apr 27, 2021 at 10:29 AM Han Zhou  wrote:
>
>
>
> On Tue, Apr 27, 2021 at 8:43 AM Mark Gray  wrote:
> >
> > On 22/04/2021 21:14, Han Zhou wrote:
> > > For an ACL with match: outport == @PG && ip4.src == $PG_AS, given
below
> > > scale:
> > >
> > > P: PG size
> > > LP: number of local lports
> > > D: number of all datapaths (logical switches)
> > > LD: number of datapaths that contain local lports
> > >
> > > With current OVN implementation, the total number of OF flows is:
> > > LP + (P * D) + D
> > >
> > > The reason is, firstly, datapath is not part of the conjunction, so
for
> > > each datapath the lflow is reparsed.
> > >
> > > Secondly, although ovn-controller tries to filter out the flows that
are
> > > for non-local lports, with the conjunction match, the logic that
filters
> > > out non-local flows doesn't work for the conjunction part that doesn't
> > > have the lport in the match (the P * D part). When there is only one
> > > port on each LS it is fine, because no conjunction will be used
because
> > > SB port groups are splited per datapath, so each port group would have
> > suggest "split per datapath"
>
> Ack
>
> > > only one port. However, when more than one ports are on each LS the
flow
> > > explosion happens.
> > >
> > > This patch deal with the second reason above, by refining the SB port
> > > groups to store only locally bound lports: empty const sets will not
> > > generate any flows. This reduces the related flow number from
> > > LP + (P * D) + D to LP + (P * LD) + LD.
> > >
> > > Since LD is expected to be small, so even if it is a multiplier, the
> > > total number is larged reduced.  In particular, in ovn-k8s use cases
the
> > suggest "reduced significantly"
>
> Ack
>
> > > LD is always 1, so the formular above becomes LP + P + LD.
> > >
> > s/formular/formula
>
> Ack
>
> > > With a scale of 1k k8s nodes, each has 4 ports for the same PG: P =
4k,
> > > LP = 4, D = 1k, LD = 1. The current implementation generates ~4m
flows.
> > > With this patch it becomes only ~4k.
> > Cool!
> > >
> > > Reported-by: Girish Moodalbail 
> > > Reported-at:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-March/381082.html
> > > Reported-by: Dumitru Ceara 
> > > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1944098
> > > Signed-off-by: Han Zhou 
> >
> > I tested this as well and it seemed to work as expected.
>
> Thanks for the test!
>
> > > ---
> > > v1->v2: fix memory leaks found by address sanitizer
> > >
> > >  controller/binding.c|  20 
> > >  controller/binding.h|   9 ++
> > >  controller/ovn-controller.c | 217
++--
> > >  include/ovn/expr.h  |   2 +-
> > >  lib/expr.c  |   8 +-
> > >  tests/ovn.at|  53 +
> > >  tests/test-ovn.c|  12 +-
> > >  utilities/ovn-trace.c   |   4 +-
> > >  8 files changed, 283 insertions(+), 42 deletions(-)
> > >
> > > diff --git a/controller/binding.c b/controller/binding.c
> > > index 514f5f33f..5aca964cc 100644
> > > --- a/controller/binding.c
> > > +++ b/controller/binding.c
> > > @@ -2987,3 +2987,23 @@ cleanup:
> > >
> > >  return b_lport;
> > >  }
> > > +
> > > +struct sset *
> > > +binding_collect_local_binding_lports(struct local_binding_data
*lbinding_data)
> > > +{
> > > +struct sset *lports = xzalloc(sizeof *lports);
> > > +sset_init(lports);
> > > +struct shash_node *shash_node;
> > > +SHASH_FOR_EACH (shash_node, &lbinding_data->lports) {
> > > +struct binding_lport *b_lport = shash_node->data;
> > > +sset_add(lports, b_lport->name);
> > > +}
> > > +return lports;
> > > +}
> > > +
> > > +void
> > > +binding_destroy_local_binding_lports(struct sset *lports)
> > > +{
> > > +sset_destroy(lports);
> > > +free(lports);
> > > +}
> > > diff --git a/controller/binding.h b/controller/binding.h
> > > index 4fc9ef207..31f0352a0 100644
> > > --- a/controller/binding.h
> > > +++ b/controller/binding.h
> > > @@ -128,4 +128,13 @@ void binding_seqno_run(struct local_binding_data
*lbinding_data);
> > >  void binding_seqno_install(struct local_binding_data *lbinding_data);
> > >  void binding_seqno_flush(void);
> > >  void binding_dump_local_bindings(struct local_binding_data *, struct
ds *);
> > > +
> > > +/* Generates a sset of lport names from local_binding_data.
> > > + * Note: the caller is responsible for destroying and freeing the
returned
> > > + * sset, by calling binding_collect_local_binding_lports(). */
> > I think this^ should say binding_destroy_local_binding_lports()?
>
> Oops. My bad.
>
> > > +struct sset *binding_collect_local_binding_lports(struct
local_binding_data *);
> > > +
> > > +/* Destroy and free the lports sset returned by
> > > + * binding_collect_local_binding_lports(). */
> > > +void binding_destroy_local_binding_lports(struct sset *lports);
> > >  #endif /* controller/binding.h */
> > > diff --git a/controller/ovn-controller.c b/controller/ovn-controller

[ovs-dev] [PATCH ovn v3 5/5] ovn-controller: Fix port group conjunction flow explosion problem.

2021-04-27 Thread Han Zhou
For an ACL with match: outport == @PG && ip4.src == $PG_AS, given below
scale:

P: PG size
LP: number of local lports
D: number of all datapaths (logical switches)
LD: number of datapaths that contain local lports

With current OVN implementation, the total number of OF flows is:
LP + (P * D) + D

The reason is, firstly, datapath is not part of the conjunction, so for
each datapath the lflow is reparsed.

Secondly, although ovn-controller tries to filter out the flows that are
for non-local lports, with the conjunction match, the logic that filters
out non-local flows doesn't work for the conjunction part that doesn't
have the lport in the match (the P * D part). When there is only one
port on each LS it is fine, because no conjunction will be used because
SB port groups are split per datapath, so each port group would have
only one port. However, when more than one ports are on each LS the flow
explosion happens.

This patch deal with the second reason above, by refining the SB port
groups to store only locally bound lports: empty const sets will not
generate any flows. This reduces the related flow number from
LP + (P * D) + D to LP + (P * LD) + LD.

Since LD is expected to be small, so even if it is a multiplier, the
total number is reduced significantly. In particular, in ovn-k8s use
cases the LD is always 1, so the formula above becomes LP + P + LD.

With a scale of 1k k8s nodes, each has 4 ports for the same PG: P = 4k,
LP = 4, D = 1k, LD = 1. The current implementation generates ~4m flows.
With this patch it becomes only ~4k.

Reported-by: Girish Moodalbail 
Reported-at: 
https://mail.openvswitch.org/pipermail/ovs-dev/2021-March/381082.html
Reported-by: Dumitru Ceara 
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1944098
Tested-by: Zhen Wang 
Signed-off-by: Han Zhou 
---
 controller/binding.c|  20 
 controller/binding.h|   9 ++
 controller/ovn-controller.c | 212 +++-
 include/ovn/expr.h  |   3 +-
 lib/expr.c  |  12 +-
 tests/ovn.at|  55 ++
 tests/test-ovn.c|   4 +-
 utilities/ovn-trace.c   |   2 +-
 8 files changed, 281 insertions(+), 36 deletions(-)

diff --git a/controller/binding.c b/controller/binding.c
index 514f5f33f..5aca964cc 100644
--- a/controller/binding.c
+++ b/controller/binding.c
@@ -2987,3 +2987,23 @@ cleanup:
 
 return b_lport;
 }
+
+struct sset *
+binding_collect_local_binding_lports(struct local_binding_data *lbinding_data)
+{
+struct sset *lports = xzalloc(sizeof *lports);
+sset_init(lports);
+struct shash_node *shash_node;
+SHASH_FOR_EACH (shash_node, &lbinding_data->lports) {
+struct binding_lport *b_lport = shash_node->data;
+sset_add(lports, b_lport->name);
+}
+return lports;
+}
+
+void
+binding_destroy_local_binding_lports(struct sset *lports)
+{
+sset_destroy(lports);
+free(lports);
+}
diff --git a/controller/binding.h b/controller/binding.h
index 4fc9ef207..cd573dbbe 100644
--- a/controller/binding.h
+++ b/controller/binding.h
@@ -128,4 +128,13 @@ void binding_seqno_run(struct local_binding_data 
*lbinding_data);
 void binding_seqno_install(struct local_binding_data *lbinding_data);
 void binding_seqno_flush(void);
 void binding_dump_local_bindings(struct local_binding_data *, struct ds *);
+
+/* Generates a sset of lport names from local_binding_data.
+ * Note: the caller is responsible for destroying and freeing the returned
+ * sset, by calling binding_detroy_local_binding_lports(). */
+struct sset *binding_collect_local_binding_lports(struct local_binding_data *);
+
+/* Destroy and free the lports sset returned by
+ * binding_collect_local_binding_lports(). */
+void binding_destroy_local_binding_lports(struct sset *lports);
 #endif /* controller/binding.h */
diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 00ae49eb9..f86d780cc 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -1367,6 +1367,7 @@ addr_sets_update(const struct sbrec_address_set_table 
*address_set_table,
 }
 }
 }
+
 static void
 en_addr_sets_run(struct engine_node *node, void *data)
 {
@@ -1415,20 +1416,72 @@ addr_sets_sb_address_set_handler(struct engine_node 
*node, void *data)
 }
 
 struct ed_type_port_groups{
-struct shash port_groups;
+/* A copy of SB port_groups, each converted as a sset for efficient lport
+ * lookup. */
+struct shash port_group_ssets;
+
+/* Const sets containing local lports, used for expr parsing. */
+struct shash port_groups_cs_local;
+
 bool change_tracked;
 struct sset new;
 struct sset deleted;
 struct sset updated;
 };
 
+static void
+port_group_ssets_add_or_update(struct shash *port_group_ssets,
+   const struct sbrec_port_group *pg)
+{
+struct sset *lports = shash_find_data(port_group_ssets, pg->name);
+if (lports) {
+sset_clear(lports);
+} 

[ovs-dev] [PATCH ovn v3 4/5] expr.c: Split expr_const_sets_add.

2021-04-27 Thread Han Zhou
Split this function to two functions for integers (e.g. IP addresses) and
strings (e.g. port names), because they will become more different in
next patch.

Suggested-by: Mark Gray 
Signed-off-by: Han Zhou 
---
 controller/ovn-controller.c | 24 ++--
 include/ovn/expr.h  |  7 ++--
 lib/expr.c  | 78 ++---
 tests/test-ovn.c| 12 +++---
 utilities/ovn-trace.c   | 12 +++---
 5 files changed, 75 insertions(+), 58 deletions(-)

diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 7320bd56c..00ae49eb9 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -1339,9 +1339,9 @@ addr_sets_init(const struct sbrec_address_set_table 
*address_set_table,
 {
 const struct sbrec_address_set *as;
 SBREC_ADDRESS_SET_TABLE_FOR_EACH (as, address_set_table) {
-expr_const_sets_add(addr_sets, as->name,
-(const char *const *) as->addresses,
-as->n_addresses, true);
+expr_const_sets_add_integers(addr_sets, as->name,
+ (const char *const *) as->addresses,
+ as->n_addresses);
 }
 }
 
@@ -1356,9 +1356,9 @@ addr_sets_update(const struct sbrec_address_set_table 
*address_set_table,
 expr_const_sets_remove(addr_sets, as->name);
 sset_add(deleted, as->name);
 } else {
-expr_const_sets_add(addr_sets, as->name,
-(const char *const *) as->addresses,
-as->n_addresses, true);
+expr_const_sets_add_integers(addr_sets, as->name,
+ (const char *const *) as->addresses,
+ as->n_addresses);
 if (sbrec_address_set_is_new(as)) {
 sset_add(new, as->name);
 } else {
@@ -1455,9 +1455,9 @@ port_groups_init(const struct sbrec_port_group_table 
*port_group_table,
 {
 const struct sbrec_port_group *pg;
 SBREC_PORT_GROUP_TABLE_FOR_EACH (pg, port_group_table) {
-expr_const_sets_add(port_groups, pg->name,
-(const char *const *) pg->ports,
-pg->n_ports, false);
+expr_const_sets_add_strings(port_groups, pg->name,
+(const char *const *) pg->ports,
+pg->n_ports);
 }
 }
 
@@ -1472,9 +1472,9 @@ port_groups_update(const struct sbrec_port_group_table 
*port_group_table,
 expr_const_sets_remove(port_groups, pg->name);
 sset_add(deleted, pg->name);
 } else {
-expr_const_sets_add(port_groups, pg->name,
-(const char *const *) pg->ports,
-pg->n_ports, false);
+expr_const_sets_add_strings(port_groups, pg->name,
+(const char *const *) pg->ports,
+pg->n_ports);
 if (sbrec_port_group_is_new(pg)) {
 sset_add(new, pg->name);
 } else {
diff --git a/include/ovn/expr.h b/include/ovn/expr.h
index 032370058..96435038a 100644
--- a/include/ovn/expr.h
+++ b/include/ovn/expr.h
@@ -545,9 +545,10 @@ void expr_constant_set_destroy(struct expr_constant_set 
*cs);
  * are ignored.
  */
 
-void expr_const_sets_add(struct shash *const_sets, const char *name,
- const char * const *values, size_t n_values,
- bool convert_to_integer);
+void expr_const_sets_add_integers(struct shash *const_sets, const char *name,
+  const char * const *values, size_t n_values);
+void expr_const_sets_add_strings(struct shash *const_sets, const char *name,
+ const char * const *values, size_t n_values);
 void expr_const_sets_remove(struct shash *const_sets, const char *name);
 void expr_const_sets_destroy(struct shash *const_sets);
 
diff --git a/lib/expr.c b/lib/expr.c
index f061a8fbe..cfc1082e1 100644
--- a/lib/expr.c
+++ b/lib/expr.c
@@ -1061,11 +1061,12 @@ expr_constant_set_destroy(struct expr_constant_set *cs)
 }
 
 /* Adds an constant set named 'name' to 'const_sets', replacing any existing
- * constant set entry with the given name. */
+ * constant set entry with the given name. The 'values' must be strings that
+ * can be converted to integers or masked integers, such as IP addresses.
+ * Values that can't be converted are skipped. */
 void
-expr_const_sets_add(struct shash *const_sets, const char *name,
-const char *const *values, size_t n_values,
-bool convert_to_integer)
+expr_const_sets_add_integers(struct shash *const_sets, const char *name,
+ const char *const *values, size_t n_values)
 {
 /* Replace any existing entry for th

[ovs-dev] [PATCH ovn v3 3/5] ovn-controller.c: Reorder addrset and portgroup related functions.

2021-04-27 Thread Han Zhou
Move the logically related functions together, which would also make
reviewing the next patch much easier.

Signed-off-by: Han Zhou 
---
 controller/ovn-controller.c | 467 ++--
 1 file changed, 233 insertions(+), 234 deletions(-)

diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 13c03131c..7320bd56c 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -433,80 +433,6 @@ get_ovs_chassis_id(const struct ovsrec_open_vswitch_table 
*ovs_table)
 return chassis_id;
 }
 
-/* Iterate address sets in the southbound database.  Create and update the
- * corresponding symtab entries as necessary. */
-static void
-addr_sets_init(const struct sbrec_address_set_table *address_set_table,
-   struct shash *addr_sets)
-{
-const struct sbrec_address_set *as;
-SBREC_ADDRESS_SET_TABLE_FOR_EACH (as, address_set_table) {
-expr_const_sets_add(addr_sets, as->name,
-(const char *const *) as->addresses,
-as->n_addresses, true);
-}
-}
-
-static void
-addr_sets_update(const struct sbrec_address_set_table *address_set_table,
- struct shash *addr_sets, struct sset *new,
- struct sset *deleted, struct sset *updated)
-{
-const struct sbrec_address_set *as;
-SBREC_ADDRESS_SET_TABLE_FOR_EACH_TRACKED (as, address_set_table) {
-if (sbrec_address_set_is_deleted(as)) {
-expr_const_sets_remove(addr_sets, as->name);
-sset_add(deleted, as->name);
-} else {
-expr_const_sets_add(addr_sets, as->name,
-(const char *const *) as->addresses,
-as->n_addresses, true);
-if (sbrec_address_set_is_new(as)) {
-sset_add(new, as->name);
-} else {
-sset_add(updated, as->name);
-}
-}
-}
-}
-
-/* Iterate port groups in the southbound database.  Create and update the
- * corresponding symtab entries as necessary. */
- static void
-port_groups_init(const struct sbrec_port_group_table *port_group_table,
- struct shash *port_groups)
-{
-const struct sbrec_port_group *pg;
-SBREC_PORT_GROUP_TABLE_FOR_EACH (pg, port_group_table) {
-expr_const_sets_add(port_groups, pg->name,
-(const char *const *) pg->ports,
-pg->n_ports, false);
-}
-}
-
-static void
-port_groups_update(const struct sbrec_port_group_table *port_group_table,
-   struct shash *port_groups, struct sset *new,
-   struct sset *deleted, struct sset *updated)
-{
-const struct sbrec_port_group *pg;
-SBREC_PORT_GROUP_TABLE_FOR_EACH_TRACKED (pg, port_group_table) {
-if (sbrec_port_group_is_deleted(pg)) {
-expr_const_sets_remove(port_groups, pg->name);
-sset_add(deleted, pg->name);
-} else {
-expr_const_sets_add(port_groups, pg->name,
-(const char *const *) pg->ports,
-pg->n_ports, false);
-if (sbrec_port_group_is_new(pg)) {
-sset_add(new, pg->name);
-} else {
-sset_add(updated, pg->name);
-}
-}
-}
-}
-
 static void
 update_ssl_config(const struct ovsrec_ssl_table *ssl_table)
 {
@@ -1011,166 +937,6 @@ en_ofctrl_is_connected_run(struct engine_node *node, 
void *data)
 engine_set_node_state(node, EN_UNCHANGED);
 }
 
-struct ed_type_addr_sets {
-struct shash addr_sets;
-bool change_tracked;
-struct sset new;
-struct sset deleted;
-struct sset updated;
-};
-
-static void *
-en_addr_sets_init(struct engine_node *node OVS_UNUSED,
-  struct engine_arg *arg OVS_UNUSED)
-{
-struct ed_type_addr_sets *as = xzalloc(sizeof *as);
-
-shash_init(&as->addr_sets);
-as->change_tracked = false;
-sset_init(&as->new);
-sset_init(&as->deleted);
-sset_init(&as->updated);
-return as;
-}
-
-static void
-en_addr_sets_cleanup(void *data)
-{
-struct ed_type_addr_sets *as = data;
-expr_const_sets_destroy(&as->addr_sets);
-shash_destroy(&as->addr_sets);
-sset_destroy(&as->new);
-sset_destroy(&as->deleted);
-sset_destroy(&as->updated);
-}
-
-static void
-en_addr_sets_run(struct engine_node *node, void *data)
-{
-struct ed_type_addr_sets *as = data;
-
-sset_clear(&as->new);
-sset_clear(&as->deleted);
-sset_clear(&as->updated);
-expr_const_sets_destroy(&as->addr_sets);
-
-struct sbrec_address_set_table *as_table =
-(struct sbrec_address_set_table *)EN_OVSDB_GET(
-engine_get_input("SB_address_set", node));
-
-addr_sets_init(as_table, &as->addr_sets);
-
-as->change_tracked = false;
-engine_set_node_state(node, EN_UPDATED);
-}
-
-static bool
-addr_sets_sb_address_set_handler(struct e

[ovs-dev] [PATCH ovn v3 2/5] ovn.at: Improve "No ovn-controller assert when generating conjunction flows"

2021-04-27 Thread Han Zhou
This patch improves the test case by binding 2 VIFs on the HV instead of
one, to make sure conjunction is still used and the scenario is still
tested by this test case when a following patch optimizes conjunction flows.

Signed-off-by: Han Zhou 
---
 tests/ovn.at | 96 
 1 file changed, 51 insertions(+), 45 deletions(-)

diff --git a/tests/ovn.at b/tests/ovn.at
index 9f38ec6ec..e52bb55cd 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -25121,10 +25121,12 @@ ovs-vsctl add-br br-phys
 ovn_attach n1 br-phys 192.168.0.10
 
 as hv1
-ovs-vsctl \
--- add-port br-int vif1 \
--- set Interface vif1 external_ids:iface-id=sw0-p1 \
-ofport-request=1
+for i in 1 2; do
+ovs-vsctl \
+-- add-port br-int vif$i \
+-- set Interface vif$i external_ids:iface-id=sw0-p$i \
+ofport-request=$i
+done
 
 check as hv1
 ovs-vsctl set open . external_ids:ovn-monitor-all=true
@@ -25132,10 +25134,10 @@ ovs-vsctl set open . external_ids:ovn-monitor-all=true
 check ovn-nbctl ls-add sw0
 check ovn-nbctl pg-add pg1
 check ovn-nbctl pg-add pg2
-check ovn-nbctl lsp-add sw0 sw0-p2
-check ovn-nbctl lsp-set-addresses sw0-p2 "00:00:00:00:00:02 192.168.47.2"
 check ovn-nbctl lsp-add sw0 sw0-p3
 check ovn-nbctl lsp-set-addresses sw0-p3 "00:00:00:00:00:03 192.168.47.3"
+check ovn-nbctl lsp-add sw0 sw0-p4
+check ovn-nbctl lsp-set-addresses sw0-p4 "00:00:00:00:00:04 192.168.47.4"
 
 # Pause ovn-northd. When it is resumed, all the below NB updates
 # will be sent in one transaction.
@@ -25145,20 +25147,22 @@ check as northd-backup ovn-appctl -t NORTHD_TYPE pause
 
 check ovn-nbctl lsp-add sw0 sw0-p1
 check ovn-nbctl lsp-set-addresses sw0-p1 "00:00:00:00:00:01 192.168.47.1"
-check ovn-nbctl pg-set-ports pg1 sw0-p1 sw0-p2
-check ovn-nbctl pg-set-ports pg2 sw0-p3
+check ovn-nbctl lsp-add sw0 sw0-p2
+check ovn-nbctl lsp-set-addresses sw0-p2 "00:00:00:00:00:02 192.168.47.2"
+check ovn-nbctl pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3
+check ovn-nbctl pg-set-ports pg2 sw0-p4
 check ovn-nbctl acl-add pg1 to-lport 1002 "outport == @pg1 && ip4 && ip4.src 
== \$pg2_ip4 && udp && udp.dst >= 1 && udp.dst <= 65535" allow-related
 
 # resume ovn-northd now. This should result in a single update message
 # from SB ovsdb-server to ovn-controller for all the above NB updates.
 check as northd ovn-appctl -t NORTHD_TYPE resume
 
-AS_BOX([Wait for sw0-p1 to be up])
-wait_for_ports_up sw0-p1
+AS_BOX([Wait for sw0-p1 and sw0-p2 to be up])
+wait_for_ports_up sw0-p1 sw0-p2
 
 # When the port group pg1 is updated, it should not result in
 # any assert in ovn-controller.
-ovn-nbctl --wait=hv pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3
+ovn-nbctl --wait=hv pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3 sw0-p4
 AT_CHECK([kill -0 $(cat hv1/ovn-controller.pid)])
 check ovn-nbctl --wait=hv sync
 
@@ -25166,40 +25170,42 @@ check ovn-nbctl --wait=hv sync
 AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=45 | ofctl_strip_all | \
 grep "priority=2002" | grep conjunction | \
 sed 's/conjunction([[^)]]*)/conjunction()/g' | sort], [0], [dnl
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x10/0xfff0
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x100/0xff00
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x1000/0xf000
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x2/0xfffe
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x20/0xffe0
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x200/0xfe00
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x2000/0xe000
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x4/0xfffc
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x40/0xffc0
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x400/0xfc00
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x4000/0xc000
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x8/0xfff8
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x80/0xff80
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x800/0xf800
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x8000/0x8000
 actions=conjunction()
- table=45, 
priority=2002,udp,reg0=0x100/0x100,metadat

[ovs-dev] [PATCH ovn v3 1/5] inc-proc-eng: Call clear_tracked_data before recompute.

2021-04-27 Thread Han Zhou
Cleanup partially tracked data due to some of the change handler
executions before falling back to recompute. This is done already
in the en_runtime_data_run() implementation, but this patch makes
it a generic behavior of the I-P engine.

Signed-off-by: Han Zhou 
---
 controller/ovn-controller.c | 17 -
 lib/inc-proc-eng.c  |  6 ++
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 6f7c9ea61..13c03131c 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -1412,23 +1412,6 @@ en_runtime_data_run(struct engine_node *node, void *data)
 struct sset *local_lport_ids = &rt_data->local_lport_ids;
 struct sset *active_tunnels = &rt_data->active_tunnels;
 
-/* Clear the (stale) tracked data if any. Even though the tracked data
- * gets cleared in the beginning of engine_init_run(),
- * any of the runtime data handler might have set some tracked
- * data and later another runtime data handler might return false
- * resulting in full recompute of runtime engine and rendering the tracked
- * data stale.
- *
- * It's possible that engine framework can be enhanced to indicate
- * the node handlers (in this case flow_output_runtime_data_handler)
- * that its input node had a full recompute. However we would still
- * need to clear the tracked data, because we don't want the
- * stale tracked data to be accessed outside of the engine, since the
- * tracked data is cleared in the engine_init_run() and not at the
- * end of the engine run.
- * */
-en_runtime_data_clear_tracked_data(data);
-
 static bool first_run = true;
 if (first_run) {
 /* don't cleanup since there is no data yet */
diff --git a/lib/inc-proc-eng.c b/lib/inc-proc-eng.c
index a6337a1d9..c349efb22 100644
--- a/lib/inc-proc-eng.c
+++ b/lib/inc-proc-eng.c
@@ -326,6 +326,12 @@ engine_recompute(struct engine_node *node, bool forced, 
bool allowed)
 return;
 }
 
+/* Clear tracked data before calling run() so that partially tracked data
+ * from some of the change handler executions are cleared. */
+if (node->clear_tracked_data) {
+node->clear_tracked_data(node->data);
+}
+
 /* Run the node handler which might change state. */
 node->run(node, node->data);
 node->stats.recompute++;
-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v3 0/5] Fix port group conjunction flow explosion problem.

2021-04-27 Thread Han Zhou
v2 -> v3: Addresses Mark's comments.

Han Zhou (5):
  inc-proc-eng: Call clear_tracked_data before recompute.
  ovn.at: Improve "No ovn-controller assert when generating conjunction
flows"
  ovn-controller.c: Reorder addrset and portgroup related functions.
  expr.c: Split expr_const_sets_add.
  ovn-controller: Fix port group conjunction flow explosion problem.

 controller/binding.c|  20 ++
 controller/binding.h|   9 +
 controller/ovn-controller.c | 642 ++--
 include/ovn/expr.h  |   8 +-
 lib/expr.c  |  84 +++--
 lib/inc-proc-eng.c  |   6 +
 tests/ovn.at| 151 ++---
 tests/test-ovn.c|  12 +-
 utilities/ovn-trace.c   |  12 +-
 9 files changed, 600 insertions(+), 344 deletions(-)

-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 7/8] ovs-rcu: Remove unused perthread mutex

2021-04-27 Thread Gaetan Rivet
A mutex is allocated, initialized and destroyed, without being
used in the perthread structure.

Signed-off-by: Gaetan Rivet 
---
 lib/ovs-rcu.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/lib/ovs-rcu.c b/lib/ovs-rcu.c
index cde1e925b..1866bd308 100644
--- a/lib/ovs-rcu.c
+++ b/lib/ovs-rcu.c
@@ -47,7 +47,6 @@ struct ovsrcu_cbset {
 struct ovsrcu_perthread {
 struct ovs_list list_node;  /* In global list. */
 
-struct ovs_mutex mutex;
 uint64_t seqno;
 struct ovsrcu_cbset *cbset;
 char name[16];  /* This thread's name. */
@@ -84,7 +83,6 @@ ovsrcu_perthread_get(void)
 const char *name = get_subprogram_name();
 
 perthread = xmalloc(sizeof *perthread);
-ovs_mutex_init(&perthread->mutex);
 perthread->seqno = seq_read(global_seqno);
 perthread->cbset = NULL;
 ovs_strlcpy(perthread->name, name[0] ? name : "main",
@@ -406,7 +404,6 @@ ovsrcu_unregister__(struct ovsrcu_perthread *perthread)
 ovs_list_remove(&perthread->list_node);
 ovs_mutex_unlock(&ovsrcu_threads_mutex);
 
-ovs_mutex_destroy(&perthread->mutex);
 free(perthread);
 
 seq_change(global_seqno);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 6/8] ovs-thread: Quiesce when joining pthreads

2021-04-27 Thread Gaetan Rivet
Joining pthreads makes the caller quiescent. It should register as such,
as joined threads may wait on an RCU callback executing before quitting,
deadlocking the caller.

Signed-off-by: Gaetan Rivet 
---
 lib/ovs-thread.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index 805cba622..bf58923f8 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -180,8 +180,6 @@ XPTHREAD_FUNC1(pthread_cond_destroy, pthread_cond_t *);
 XPTHREAD_FUNC1(pthread_cond_signal, pthread_cond_t *);
 XPTHREAD_FUNC1(pthread_cond_broadcast, pthread_cond_t *);
 
-XPTHREAD_FUNC2(pthread_join, pthread_t, void **);
-
 typedef void destructor_func(void *);
 XPTHREAD_FUNC2(pthread_key_create, pthread_key_t *, destructor_func *);
 XPTHREAD_FUNC1(pthread_key_delete, pthread_key_t);
@@ -191,6 +189,20 @@ XPTHREAD_FUNC2(pthread_setspecific, pthread_key_t, const 
void *);
 XPTHREAD_FUNC3(pthread_sigmask, int, const sigset_t *, sigset_t *);
 #endif
 
+void
+xpthread_join(pthread_t thread, void **retval)
+{
+int error;
+
+ovsrcu_quiesce_start();
+error = pthread_join(thread, retval);
+ovsrcu_quiesce_end();
+
+if (OVS_UNLIKELY(error)) {
+ovs_abort(error, "%s failed", __func__);
+}
+}
+
 static void
 ovs_mutex_init__(const struct ovs_mutex *l_, int type)
 {
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 5/8] ovs-thread: Fix barrier use-after-free

2021-04-27 Thread Gaetan Rivet
When a thread is blocked on a barrier, there is no guarantee
regarding the moment it will resume, only that it will at some point in
the future.

One thread can resume first then proceed to destroy the barrier while
another thread has not yet awoken. When it finally happens, the second
thread will attempt a seq_read() on the barrier seq, while the first
thread have already destroyed it, triggering a use-after-free.

Introduce an additional indirection layer within the barrier.
A internal barrier implementation holds all the necessary elements
for a thread to safely block and destroy. Whenever a barrier is
destroyed, the internal implementation is left available to still
blocking threads if necessary. A reference counter is used to track
threads still using the implementation.

Note that current uses of ovs-barrier are not affected: RCU and
revalidators will not destroy their barrier immediately after blocking
on it.

Fixes: d8043da7182a ("ovs-thread: Implement OVS specific barrier.")
Signed-off-by: Gaetan Rivet 
---
 lib/ovs-thread.c | 61 +++-
 lib/ovs-thread.h |  6 ++---
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index b686e4548..805cba622 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -299,21 +299,53 @@ ovs_spin_init(const struct ovs_spin *spin)
 }
 #endif
 
+struct ovs_barrier_impl {
+uint32_t size;/* Number of threads to wait. */
+atomic_count count;   /* Number of threads already hit the barrier. */
+struct seq *seq;
+struct ovs_refcount refcnt;
+};
+
+static void
+ovs_barrier_impl_ref(struct ovs_barrier_impl *impl)
+{
+ovs_refcount_ref(&impl->refcnt);
+}
+
+static void
+ovs_barrier_impl_unref(struct ovs_barrier_impl *impl)
+{
+if (ovs_refcount_unref(&impl->refcnt) == 1) {
+seq_destroy(impl->seq);
+free(impl);
+}
+}
+
 /* Initializes the 'barrier'.  'size' is the number of threads
  * expected to hit the barrier. */
 void
 ovs_barrier_init(struct ovs_barrier *barrier, uint32_t size)
 {
-barrier->size = size;
-atomic_count_init(&barrier->count, 0);
-barrier->seq = seq_create();
+struct ovs_barrier_impl *impl;
+
+impl = xmalloc(sizeof *impl);
+impl->size = size;
+atomic_count_init(&impl->count, 0);
+impl->seq = seq_create();
+ovs_refcount_init(&impl->refcnt);
+
+ovsrcu_set(&barrier->impl, impl);
 }
 
 /* Destroys the 'barrier'. */
 void
 ovs_barrier_destroy(struct ovs_barrier *barrier)
 {
-seq_destroy(barrier->seq);
+struct ovs_barrier_impl *impl;
+
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovsrcu_set(&barrier->impl, NULL);
+ovs_barrier_impl_unref(impl);
 }
 
 /* Makes the calling thread block on the 'barrier' until all
@@ -325,23 +357,30 @@ ovs_barrier_destroy(struct ovs_barrier *barrier)
 void
 ovs_barrier_block(struct ovs_barrier *barrier)
 {
-uint64_t seq = seq_read(barrier->seq);
+struct ovs_barrier_impl *impl;
 uint32_t orig;
+uint64_t seq;
 
-orig = atomic_count_inc(&barrier->count);
-if (orig + 1 == barrier->size) {
-atomic_count_set(&barrier->count, 0);
+impl = ovsrcu_get(struct ovs_barrier_impl *, &barrier->impl);
+ovs_barrier_impl_ref(impl);
+
+seq = seq_read(impl->seq);
+orig = atomic_count_inc(&impl->count);
+if (orig + 1 == impl->size) {
+atomic_count_set(&impl->count, 0);
 /* seq_change() serves as a release barrier against the other threads,
  * so the zeroed count is visible to them as they continue. */
-seq_change(barrier->seq);
+seq_change(impl->seq);
 } else {
 /* To prevent thread from waking up by other event,
  * keeps waiting for the change of 'barrier->seq'. */
-while (seq == seq_read(barrier->seq)) {
-seq_wait(barrier->seq, seq);
+while (seq == seq_read(impl->seq)) {
+seq_wait(impl->seq, seq);
 poll_block();
 }
 }
+
+ovs_barrier_impl_unref(impl);
 }
 
 DEFINE_EXTERN_PER_THREAD_DATA(ovsthread_id, OVSTHREAD_ID_UNSET);
diff --git a/lib/ovs-thread.h b/lib/ovs-thread.h
index 7ee98bd4e..3b444ccdc 100644
--- a/lib/ovs-thread.h
+++ b/lib/ovs-thread.h
@@ -21,16 +21,16 @@
 #include 
 #include 
 #include "ovs-atomic.h"
+#include "ovs-rcu.h"
 #include "openvswitch/thread.h"
 #include "util.h"
 
 struct seq;
 
 /* Poll-block()-able barrier similar to pthread_barrier_t. */
+struct ovs_barrier_impl;
 struct ovs_barrier {
-uint32_t size;/* Number of threads to wait. */
-atomic_count count;   /* Number of threads already hit the barrier. */
-struct seq *seq;
+OVSRCU_TYPE(struct ovs_barrier_impl *) impl;
 };
 
 /* Wrappers for pthread_mutexattr_*() that abort the process on any error. */
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

[ovs-dev] [PATCH v1 0/8] RCU: Add blocking mode for debugging

2021-04-27 Thread Gaetan Rivet
This series adds a compilation option that changes the behavior of the RCU
module. Once enabled, RCU reclamation by user threads becomes blocking until
the RCU threads has executed the scheduled callbacks.

Tools such as AddressSanitizer are useful to detect memory errors e.g. 
user-after-free.
Such tool can become ineffective if the RCU library is used to defer memory 
reclamation.
While this is the intended function of the RCU lib, nothing protects developers
from mistakes i.e. keeping references to memory scheduled for reclamation 
accross
quiescent periods.

Such error that should be detectable with ASAN, are made less likely to occur
due to RCU and thus harder to fix. However, if the RCU is modified so that user
threads are waiting on the RCU thread to execute the scheduled callbacks, they
should be forced to happen.

Unit tests have been written that should trigger a use-after-free from ASAN.
They are however thwarted by the RCU, until the blocking mode is enabled.
In that case, they will always abort on the expected error.

The full test-suite can be passed with the blocking RCU mode enabled.
An entry in the CI matrix is created for it. No error has been observed.

Gaetan Rivet (8):
  configure: add --enable-asan option
  tests: Add ovs-barrier unit test
  tests: Add RCU postpone test
  tests: Add ASAN use-after-free validation with RCU
  ovs-thread: Fix barrier use-after-free
  ovs-thread: Quiesce when joining pthreads
  ovs-rcu: Remove unused perthread mutex
  ovs-rcu: Add blocking RCU mode

 .ci/linux-build.sh   |   8 +-
 .github/workflows/build-and-test.yml |   7 +
 NEWS |   2 +
 acinclude.m4 |  31 
 configure.ac |   2 +
 lib/ovs-rcu.c|  85 -
 lib/ovs-thread.c |  77 ++--
 lib/ovs-thread.h |   6 +-
 tests/atlocal.in |   2 +
 tests/automake.mk|   2 +
 tests/library.at |  49 -
 tests/test-barrier.c | 264 +++
 tests/test-rcu-uaf.c |  98 ++
 tests/test-rcu.c |  59 ++
 14 files changed, 670 insertions(+), 22 deletions(-)
 create mode 100644 tests/test-barrier.c
 create mode 100644 tests/test-rcu-uaf.c

--
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 8/8] ovs-rcu: Add blocking RCU mode

2021-04-27 Thread Gaetan Rivet
Add the configure option --enable-rcu-blocking, that modifies the RCU
library. When enabled, quiescing from other threads will block, waiting
on the RCU thread to execute the postponed jobs.

This mode forces the deferred memory reclamation to happen
deterministically, reducing the latency of the deferral and forcing memory
to be freed any time a thread goes through a quiescent state.

Some use-after-free that were hidden by deferred memory reclamation may
become observable as a result. Previously the RCU mechanism would make them
harder to detect.

UAF detection tools should then be used in conjunction with this
compilation flag, e.g. (assuming llvm installed):

  ./configure --enable-rcu-blocking --enable-asan
  make

  # Verify the tool works: should trigger a UAF
  ./tests/ovstest test-rcu-uaf quiesce
  ./tests/ovstest test-rcu-uaf try-quiesce
  ./tests/ovstest test-rcu-uaf quiesce-start-end

  # The testsuite can be used as well
  make check TESTSUITEFLAGS='-k rcu'

Signed-off-by: Gaetan Rivet 
---
 .ci/linux-build.sh   |  4 ++
 .github/workflows/build-and-test.yml |  7 +++
 NEWS |  1 +
 acinclude.m4 | 15 +
 configure.ac |  1 +
 lib/ovs-rcu.c| 82 
 tests/atlocal.in |  1 +
 tests/library.at |  3 +
 8 files changed, 114 insertions(+)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 3c58637b4..e4cbe2024 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -235,6 +235,10 @@ if [ "$ASAN" ]; then
 CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} -O1"
 fi
 
+if [ "$RCU_BLOCK" ]; then
+EXTRA_OPTS="$EXTRA_OPTS --enable-rcu-blocking"
+fi
+
 save_OPTS="${OPTS} $*"
 OPTS="${EXTRA_OPTS} ${save_OPTS}"
 
diff --git a/.github/workflows/build-and-test.yml 
b/.github/workflows/build-and-test.yml
index ce98a9f98..655923325 100644
--- a/.github/workflows/build-and-test.yml
+++ b/.github/workflows/build-and-test.yml
@@ -23,6 +23,7 @@ jobs:
   M32: ${{ matrix.m32 }}
   OPTS:${{ matrix.opts }}
   TESTSUITE:   ${{ matrix.testsuite }}
+  RCU_BLOCK:   ${{ matrix.rcu_blocking }}
 
 name: linux ${{ join(matrix.*, ' ') }}
 runs-on: ubuntu-18.04
@@ -109,6 +110,12 @@ jobs:
   - compiler: gcc
 deb_package:  deb
 
+  - compiler: clang
+testsuite:test
+kernel:   3.16
+asan: asan
+rcu_blocking: rcu-blocking
+
 steps:
 - name: checkout
   uses: actions/checkout@v2
diff --git a/NEWS b/NEWS
index 57e1f041b..83fcfe1d0 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,7 @@ Post-v2.15.0
in ovsdb on startup.
  * New command 'record-hostname-if-not-set' to update hostname in ovsdb.
- New --enable-asan configure option enables AddressSanitizer.
+   - New --enable-rcu-blocking configure option to debug RCU usage.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/acinclude.m4 b/acinclude.m4
index 615e7f962..b01264373 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -1386,6 +1386,21 @@ AC_DEFUN([OVS_ENABLE_SPARSE],
  [], [enable_sparse=no])
AM_CONDITIONAL([ENABLE_SPARSE_BY_DEFAULT], [test $enable_sparse = yes])])
 
+dnl OVS_ENABLE_RCU_BLOCKING
+AC_DEFUN([OVS_ENABLE_RCU_BLOCKING],
+  [AC_ARG_ENABLE(
+[rcu-blocking],
+[AC_HELP_STRING([--enable-rcu-blocking],
+[Enable the blocking RCU mode])],
+[RCU_BLOCKING=yes], [RCU_BLOCKING=no])
+   AC_SUBST([RCU_BLOCKING])
+   AC_CONFIG_COMMANDS_PRE([
+ if test "$RCU_BLOCKING" = "yes"; then
+ OVS_CFLAGS="$OVS_CFLAGS -DOVS_RCU_BLOCKING=1"
+ fi
+   ])
+  ])
+
 dnl OVS_CTAGS_IDENTIFIERS
 dnl
 dnl ctags ignores symbols with extras identifiers. This is a list of
diff --git a/configure.ac b/configure.ac
index eec5a9d1b..de11ff777 100644
--- a/configure.ac
+++ b/configure.ac
@@ -184,6 +184,7 @@ OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS 
-DHAVE_AVX512F"])
 OVS_ENABLE_WERROR
 OVS_ENABLE_ASAN
 OVS_ENABLE_SPARSE
+OVS_ENABLE_RCU_BLOCKING
 OVS_CTAGS_IDENTIFIERS
 OVS_CHECK_DPCLS_AUTOVALIDATOR
 OVS_CHECK_BINUTILS_AVX512
diff --git a/lib/ovs-rcu.c b/lib/ovs-rcu.c
index 1866bd308..cd8414973 100644
--- a/lib/ovs-rcu.c
+++ b/lib/ovs-rcu.c
@@ -71,6 +71,79 @@ static void ovsrcu_unregister__(struct ovsrcu_perthread *);
 static bool ovsrcu_call_postponed(void);
 static void *ovsrcu_postpone_thread(void *arg OVS_UNUSED);
 
+#ifdef OVS_RCU_BLOCKING
+
+static struct seq *postpone_wait;
+DEFINE_STATIC_PER_THREAD_DATA(bool, need_wait, false);
+DEFINE_STATIC_PER_THREAD_DATA(uint64_t, quiescent_seqno, 0);
+
+static void
+ovsrcu_postpone_end(void)
+{
+if (single_threaded()) {
+return;
+}
+seq_change(postpone_wait);
+}
+
+static bool
+ovsrcu_do_not_block(void)
+{
+/* Do not wait on the postpone thread if it has been cleared for exit. */
+return single_threaded() ||
+   (strncmp(get_subprogram_name(),

[ovs-dev] [PATCH v1 4/8] tests: Add ASAN use-after-free validation with RCU

2021-04-27 Thread Gaetan Rivet
When using the RCU mechanism and deferring memory reclamation, potential
use-after-free due to incorrect use of RCU can be hidden.

Add a test triggering a UAF event. When the test suite is built with
AddressSanitizer support, verify that the event triggers and the tool is
usable with RCU.

Signed-off-by: Gaetan Rivet 
---
 tests/automake.mk|  1 +
 tests/library.at | 33 +++
 tests/test-rcu-uaf.c | 98 
 3 files changed, 132 insertions(+)
 create mode 100644 tests/test-rcu-uaf.c

diff --git a/tests/automake.mk b/tests/automake.mk
index a32abd41c..4420a3f7f 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -472,6 +472,7 @@ tests_ovstest_SOURCES = \
tests/test-packets.c \
tests/test-random.c \
tests/test-rcu.c \
+   tests/test-rcu-uaf.c \
tests/test-reconnect.c \
tests/test-rstp.c \
tests/test-sflow.c \
diff --git a/tests/library.at b/tests/library.at
index 6e8a154e5..4a549f77e 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -261,6 +261,39 @@ AT_KEYWORDS([rcu])
 AT_CHECK([ovstest test-rcu], [0], [])
 AT_CLEANUP
 
+AT_SETUP([rcu quiesce use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+# SIGABRT + 128
+exit_status=134
+AT_KEYWORDS([rcu asan])
+AT_CHECK([ovstest test-rcu-uaf quiesce], [$exit_status], [ignore], [ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
+AT_SETUP([rcu try-quiesce use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+# SIGABRT + 128
+exit_status=134
+AT_KEYWORDS([rcu asan])
+AT_CHECK([ovstest test-rcu-uaf try-quiesce], [$exit_status], [ignore], 
[ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
+AT_SETUP([rcu quiesce-start-end use-after-free detection])
+AT_SKIP_IF([test "$IS_WIN32" = "yes"])
+AT_SKIP_IF([test "$ASAN_ENABLED" = "no"])
+AT_KEYWORDS([rcu asan])
+# SIGABRT + 128
+exit_status=134
+AT_CHECK([ovstest test-rcu-uaf quiesce-start-end], [$exit_status], [ignore], 
[ignore])
+# ASAN report is expected on success.
+rm asan.*
+AT_CLEANUP
+
 AT_SETUP([stopwatch module])
 AT_CHECK([ovstest test-stopwatch], [0], [..
 ], [ignore])
diff --git a/tests/test-rcu-uaf.c b/tests/test-rcu-uaf.c
new file mode 100644
index 0..f97738795
--- /dev/null
+++ b/tests/test-rcu-uaf.c
@@ -0,0 +1,98 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include 
+
+#include "ovs-thread.h"
+#include "ovs-rcu.h"
+#include "ovstest.h"
+#include "util.h"
+
+enum ovsrcu_uaf_type {
+OVSRCU_UAF_QUIESCE,
+OVSRCU_UAF_TRY_QUIESCE,
+OVSRCU_UAF_QUIESCE_START_END,
+};
+
+static void *
+rcu_uaf_main(void *aux)
+{
+enum ovsrcu_uaf_type *type = aux;
+char *xx = xmalloc(2);
+
+xx[0] = 'a';
+ovsrcu_postpone(free, xx);
+switch (*type) {
+case OVSRCU_UAF_QUIESCE:
+ovsrcu_quiesce();
+break;
+case OVSRCU_UAF_TRY_QUIESCE:
+while (ovsrcu_try_quiesce()) {
+;
+}
+break;
+case OVSRCU_UAF_QUIESCE_START_END:
+ovsrcu_quiesce_start();
+ovsrcu_quiesce_end();
+break;
+default:
+OVS_NOT_REACHED();
+}
+xx[1] = 'b';
+
+return NULL;
+}
+
+static void
+usage(char *test_name)
+{
+fprintf(stderr, "Usage: %s \n",
+test_name);
+}
+
+static void
+test_rcu_uaf(int argc, char *argv[])
+{
+char **args = argv + optind - 1;
+enum ovsrcu_uaf_type type;
+pthread_t quiescer;
+
+if (argc - optind != 1) {
+usage(args[0]);
+return;
+}
+
+set_program_name(argv[0]);
+
+if (!strcmp(args[1], "quiesce")) {
+type = OVSRCU_UAF_QUIESCE;
+} else if (!strcmp(args[1], "try-quiesce")) {
+type = OVSRCU_UAF_TRY_QUIESCE;
+} else if (!strcmp(args[1], "quiesce-start-end")) {
+type = OVSRCU_UAF_QUIESCE_START_END;
+} else {
+usage(args[0]);
+return;
+}
+
+/* Need to create a separate thread, to support try-quiesce. */
+quiescer = ovs_thread_create("rcu-uaf", rcu_uaf_main, &type);
+xpthread_join(quiescer, NULL);
+}
+
+OVSTEST_REGISTER("test-rcu-uaf", test_rcu_uaf);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 3/8] tests: Add RCU postpone test

2021-04-27 Thread Gaetan Rivet
Add a simple postponing test verifying RCU callbacks have executed and
RCU exits in order. Add as part of library unit-tests.

Signed-off-by: Gaetan Rivet 
---
 tests/library.at |  8 ++-
 tests/test-rcu.c | 59 
 2 files changed, 66 insertions(+), 1 deletion(-)

diff --git a/tests/library.at b/tests/library.at
index e572c22e3..6e8a154e5 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -251,10 +251,16 @@ AT_KEYWORDS([barrier])
 AT_CHECK([ovstest test-barrier], [0], [])
 AT_CLEANUP
 
-AT_SETUP([rcu])
+AT_SETUP([rcu quiescing])
+AT_KEYWORDS([rcu])
 AT_CHECK([ovstest test-rcu-quiesce], [0], [])
 AT_CLEANUP
 
+AT_SETUP([rcu postponing])
+AT_KEYWORDS([rcu])
+AT_CHECK([ovstest test-rcu], [0], [])
+AT_CLEANUP
+
 AT_SETUP([stopwatch module])
 AT_CHECK([ovstest test-stopwatch], [0], [..
 ], [ignore])
diff --git a/tests/test-rcu.c b/tests/test-rcu.c
index 965f3c49f..88db04a45 100644
--- a/tests/test-rcu.c
+++ b/tests/test-rcu.c
@@ -49,3 +49,62 @@ test_rcu_quiesce(int argc OVS_UNUSED, char *argv[] 
OVS_UNUSED)
 }
 
 OVSTEST_REGISTER("test-rcu-quiesce", test_rcu_quiesce);
+
+struct rcu_user_aux {
+bool done;
+};
+
+static void
+rcu_user_deferred(struct rcu_user_aux *aux)
+{
+aux->done = true;
+}
+
+static void *
+rcu_user_main(void *aux_)
+{
+struct rcu_user_aux *aux = aux_;
+
+ovsrcu_quiesce();
+
+aux->done = false;
+ovsrcu_postpone(rcu_user_deferred, aux);
+
+ovsrcu_quiesce();
+
+return NULL;
+}
+
+#define N_THREAD 4
+
+static void
+test_rcu(int argc OVS_UNUSED, char *argv[] OVS_UNUSED)
+{
+struct rcu_user_aux aux[N_THREAD] = {0};
+struct rcu_user_aux main_aux = {0};
+pthread_t users[N_THREAD];
+size_t i;
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+users[i] = ovs_thread_create("user", rcu_user_main, &aux[i]);
+}
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+xpthread_join(users[i], NULL);
+}
+
+/* Register a last callback and verify that it will be properly executed
+ * even if the RCU lib is exited without this thread quiescing.
+ */
+ovsrcu_postpone(rcu_user_deferred, &main_aux);
+
+ovsrcu_exit();
+
+ovs_assert(main_aux.done);
+
+for (i = 0; i < ARRAY_SIZE(users); i++) {
+ovs_assert(aux[i].done);
+}
+}
+
+OVSTEST_REGISTER("test-rcu", test_rcu);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 1/8] configure: add --enable-asan option

2021-04-27 Thread Gaetan Rivet
Add a configure option to enable ASAN in a simple way.
Adding an AC variable to allow checking for support in the testsuite.

Signed-off-by: Gaetan Rivet 
---
 .ci/linux-build.sh |  4 ++--
 NEWS   |  1 +
 acinclude.m4   | 16 
 configure.ac   |  1 +
 tests/atlocal.in   |  1 +
 5 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/.ci/linux-build.sh b/.ci/linux-build.sh
index 977449350..3c58637b4 100755
--- a/.ci/linux-build.sh
+++ b/.ci/linux-build.sh
@@ -229,10 +229,10 @@ fi
 if [ "$ASAN" ]; then
 # This will override default option configured in tests/atlocal.in.
 export ASAN_OPTIONS='detect_leaks=1'
+EXTRA_OPTS="$EXTRA_OPTS --enable-asan"
 # -O2 generates few false-positive memory leak reports in test-ovsdb
 # application, so lowering optimizations to -O1 here.
-CLFAGS_ASAN="-O1 -fno-omit-frame-pointer -fno-common -fsanitize=address"
-CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} ${CLFAGS_ASAN}"
+CFLAGS_FOR_OVS="${CFLAGS_FOR_OVS} -O1"
 fi
 
 save_OPTS="${OPTS} $*"
diff --git a/NEWS b/NEWS
index 95cf922aa..57e1f041b 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,7 @@ Post-v2.15.0
  * New option '--no-record-hostname' to disable hostname configuration
in ovsdb on startup.
  * New command 'record-hostname-if-not-set' to update hostname in ovsdb.
+   - New --enable-asan configure option enables AddressSanitizer.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/acinclude.m4 b/acinclude.m4
index 15a54d636..615e7f962 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -58,6 +58,22 @@ AC_DEFUN([OVS_ENABLE_WERROR],
fi
AC_SUBST([SPARSE_WERROR])])
 
+dnl OVS_ENABLE_ASAN
+AC_DEFUN([OVS_ENABLE_ASAN],
+  [AC_ARG_ENABLE(
+[asan],
+[AC_HELP_STRING([--enable-asan],
+[Enable the Address Sanitizer])],
+[ASAN_ENABLED=yes], [ASAN_ENABLED=no])
+   AC_SUBST([ASAN_ENABLED])
+   AC_CONFIG_COMMANDS_PRE([
+ if test "$ASAN_ENABLED" = "yes"; then
+ OVS_CFLAGS="$OVS_CFLAGS -fno-omit-frame-pointer"
+ OVS_CFLAGS="$OVS_CFLAGS -fno-common -fsanitize=address"
+ fi
+   ])
+  ])
+
 dnl OVS_CHECK_LINUX
 dnl
 dnl Configure linux kernel source tree
diff --git a/configure.ac b/configure.ac
index c077034d4..eec5a9d1b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -182,6 +182,7 @@ OVS_CONDITIONAL_CC_OPTION([-Wno-unused-parameter], 
[HAVE_WNO_UNUSED_PARAMETER])
 OVS_CONDITIONAL_CC_OPTION([-mavx512f], [HAVE_AVX512F])
 OVS_CHECK_CC_OPTION([-mavx512f], [CFLAGS="$CFLAGS -DHAVE_AVX512F"])
 OVS_ENABLE_WERROR
+OVS_ENABLE_ASAN
 OVS_ENABLE_SPARSE
 OVS_CTAGS_IDENTIFIERS
 OVS_CHECK_DPCLS_AUTOVALIDATOR
diff --git a/tests/atlocal.in b/tests/atlocal.in
index cfca7e192..f61e752bf 100644
--- a/tests/atlocal.in
+++ b/tests/atlocal.in
@@ -220,6 +220,7 @@ export OVS_SYSLOG_METHOD
 OVS_CTL_TIMEOUT=30
 export OVS_CTL_TIMEOUT
 
+ASAN_ENABLED='@ASAN_ENABLED@'
 # Add some default flags to make the tests run better under Address
 # Sanitizer, if it was used for the build.
 #
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v1 2/8] tests: Add ovs-barrier unit test

2021-04-27 Thread Gaetan Rivet
No unit test exist currently for the ovs-barrier type.
It is however crucial as a building block and should be verified to work
as expected.

Create a simple test verifying the basic function of ovs-barrier.
Integrate the test as part of the test suite.

Signed-off-by: Gaetan Rivet 
---
 tests/automake.mk|   1 +
 tests/library.at |   5 +
 tests/test-barrier.c | 264 +++
 3 files changed, 270 insertions(+)
 create mode 100644 tests/test-barrier.c

diff --git a/tests/automake.mk b/tests/automake.mk
index 1a528aa39..a32abd41c 100644
--- a/tests/automake.mk
+++ b/tests/automake.mk
@@ -448,6 +448,7 @@ tests_ovstest_SOURCES = \
tests/ovstest.h \
tests/test-aes128.c \
tests/test-atomic.c \
+   tests/test-barrier.c \
tests/test-bundle.c \
tests/test-byte-order.c \
tests/test-classifier.c \
diff --git a/tests/library.at b/tests/library.at
index 1702b7556..e572c22e3 100644
--- a/tests/library.at
+++ b/tests/library.at
@@ -246,6 +246,11 @@ AT_SETUP([ofpbuf module])
 AT_CHECK([ovstest test-ofpbuf], [0], [])
 AT_CLEANUP
 
+AT_SETUP([barrier module])
+AT_KEYWORDS([barrier])
+AT_CHECK([ovstest test-barrier], [0], [])
+AT_CLEANUP
+
 AT_SETUP([rcu])
 AT_CHECK([ovstest test-rcu-quiesce], [0], [])
 AT_CLEANUP
diff --git a/tests/test-barrier.c b/tests/test-barrier.c
new file mode 100644
index 0..3bc5291cc
--- /dev/null
+++ b/tests/test-barrier.c
@@ -0,0 +1,264 @@
+/*
+ * Copyright (c) 2021 NVIDIA Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include 
+
+#include "ovs-thread.h"
+#include "ovs-rcu.h"
+#include "ovstest.h"
+#include "random.h"
+#include "util.h"
+
+#define DEFAULT_N_THREADS 4
+#define NB_STEPS 4
+
+static bool verbose;
+static struct ovs_barrier barrier;
+
+struct blocker_aux {
+unsigned int tid;
+bool leader;
+int step;
+};
+
+static void *
+basic_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+ovs_barrier_block(&barrier);
+aux->step++;
+ovs_barrier_block(&barrier);
+}
+
+return NULL;
+}
+
+static void
+basic_block_check(struct blocker_aux *aux, size_t n, int expected)
+{
+size_t i;
+
+for (i = 0; i < n; i++) {
+if (verbose) {
+printf("aux[%" PRIuSIZE "]=%d == %d", i, aux[i].step, expected);
+if (aux[i].step != expected) {
+printf(" <--- X");
+}
+printf("\n");
+} else {
+ovs_assert(aux[i].step == expected);
+}
+}
+ovs_barrier_block(&barrier);
+ovs_barrier_block(&barrier);
+}
+
+/*
+ * Basic barrier test.
+ *
+ * N writers and 1 reader participate in the test.
+ * Each thread goes through M steps (=NB_STEPS).
+ * The main thread participates as the reader.
+ *
+ * A Step is divided in three parts:
+ *1. before
+ *  (barrier)
+ *2. during
+ *  (barrier)
+ *3. after
+ *
+ * Each writer updates a thread-local variable with the
+ * current step number within part 2 and waits.
+ *
+ * The reader checks all variables during part 3, expecting
+ * all variables to be equal. If any variable differs, it means
+ * its thread was not properly blocked by the barrier.
+ */
+static void
+test_barrier_basic(size_t n_threads)
+{
+struct blocker_aux *aux;
+pthread_t *threads;
+size_t i;
+
+ovs_barrier_init(&barrier, n_threads + 1);
+
+aux = xcalloc(n_threads, sizeof *aux);
+threads = xmalloc(n_threads * sizeof *threads);
+for (i = 0; i < n_threads; i++) {
+threads[i] = ovs_thread_create("ovs-barrier",
+   basic_blocker_main, &aux[i]);
+}
+
+for (i = 0; i < NB_STEPS; i++) {
+basic_block_check(aux, n_threads, i);
+}
+ovs_barrier_destroy(&barrier);
+
+for (i = 0; i < n_threads; i++) {
+xpthread_join(threads[i], NULL);
+}
+
+free(threads);
+free(aux);
+}
+
+static unsigned int *shared_mem;
+
+static void *
+lead_blocker_main(void *aux_)
+{
+struct blocker_aux *aux = aux_;
+size_t i;
+
+aux->step = 0;
+for (i = 0; i < NB_STEPS; i++) {
+if (aux->leader) {
+shared_mem = xmalloc(sizeof *shared_mem);
+if (verbose) {
+printf("*T1: allocated shmem\n");
+}
+}
+xnanosleep(random_range(100) * 1000);
+
+   

[ovs-dev] [PATCH] openvswitch: fix typo

2021-04-27 Thread qhjindev
change 'subsytem' to 'subsystem'

Signed-off-by: qhjindev 
---
 net/openvswitch/vport.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 1eb7495ac5b4..8a930ca6d6b1 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -20,7 +20,7 @@
 struct vport;
 struct vport_parms;
 
-/* The following definitions are for users of the vport subsytem: */
+/* The following definitions are for users of the vport subsystem: */
 
 int ovs_vport_init(void);
 void ovs_vport_exit(void);
-- 
2.17.1


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] tests: Fix inconsistent "ACL Conjunction" test.

2021-04-27 Thread Numan Siddique
On Tue, Apr 27, 2021 at 1:20 PM Mark Michelson  wrote:
>
> The ACL Conjunction test would occasionally fail during automated test
> runs. During the test, we send a packet on a netdev-dummy interface and
> check the associated pcap file to ensure the packet is sent where we
> expect and that it has the expected contents. Looking at logs from
> failed runs, it appeared that the pcap file was unpopulated. This likely
> was because we were attempting to dump the contents of the pcap file
> before the packet had been processed and added to the pcap file.
>
> This patch aims to fix the problem by blocking until the pcap file has
> been modified when sending the packet to the netdev-dummy interface.
> Since this could be a useful thing for other tests, this new method of
> blocking has been added to ovn-macros.at.
>
> Signed-off-by: Mark Michelson 

Acked-by: Numan Siddique 

Numan

> ---
>  tests/ovn-macros.at | 23 +++
>  tests/ovn.at|  8 
>  2 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/tests/ovn-macros.at b/tests/ovn-macros.at
> index bd227215a..94fba405e 100644
> --- a/tests/ovn-macros.at
> +++ b/tests/ovn-macros.at
> @@ -509,6 +509,29 @@ options:rxq_pcap=${pcap_file}-rx.pcap
>  OVS_WAIT_WHILE([test 24 = $(wc -c ${pcap_file}-tx.pcap | cut -d " " 
> -f1)])
>  }
>
> +# Receive a packet on a dummy netdev interface. If we expect packets to be
> +# recorded, then wait until the pcap file reflects the change.
> +netdev_dummy_receive() {
> +local interface="$1"
> +local packet="$2"
> +local hv="$3"
> +local pcap_file="$4"
> +
> +if test -n "pcap_file" ; then
> +ts_old=$(stat -c %y "$pcap_file")
> +fi
> +if test -n "$hv" ; then
> +as "$hv" ovs-appctl netdev-dummy/receive "$interface" "$packet"
> +else
> +ovs-appctl netdev-dummy/receive "$interface" "$packet"
> +fi
> +if test -n "$pcap_file" ; then
> +OVS_WAIT_WHILE(
> +  [ts_new=$(stat -c %y "$pcap_file")
> +   test "$ts_new" = "$ts_old"])
> +fi
> +}
> +
>  OVS_END_SHELL_HELPERS
>
>  m4_define([OVN_POPULATE_ARP], [AT_CHECK(ovn_populate_arp__, [0], [ignore])])
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 9f38ec6ec..c1158f5d0 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -14016,11 +14016,11 @@ check ovn-nbctl --wait=hv acl-add ls1 to-lport 1001 
> \
>  # port numbers, e.g. 11 for vif11.
>  test_ip() {
>  # This packet has bad checksums but logical L3 routing doesn't check.
> -local inport=$1 src_mac=$2 dst_mac=$3 src_ip=$4 dst_ip=$5
> +local inport=$1 src_mac=$2 dst_mac=$3 src_ip=$4 dst_ip=$5 pcap_file=$6
>  local packet=${dst_mac}${src_mac}0800451c4011${src_ip}\
>  ${dst_ip}00350008
> -shift; shift; shift; shift; shift
> -as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
> +shift; shift; shift; shift; shift; shift
> +netdev_dummy_receive hv1-vif1 $packet hv1 "$pcap_file"
>  for outport; do
>  echo $packet >> $outport.expected
>  done
> @@ -14040,7 +14040,7 @@ options:rxq_pcap=${pcap_file}-rx.pcap
>  sip=`ip_to_hex 10 0 0 4`
>  dip=`ip_to_hex 10 0 0 6`
>
> -test_ip 1 f001 f002 $sip $dip 2
> +test_ip 1 f001 f002 $sip $dip hv1/vif2-tx.pcap 2
>
>  cat 2.expected > expout
>  $PYTHON "$ovs_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap > 2.packets
> --
> 2.29.2
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 09/11] conntrack: Do not rate limit ct-sweep

2021-04-27 Thread Gaëtan Rivet
On Tue, Apr 27, 2021, at 21:18, Aaron Conole wrote:
> Gaetan Rivet  writes:
> 
> > The current rate limit is set to allow other threads to update the
> > connections when applicable. This was valid when taking the 'ct_lock'
> > was needed with a global critical section.
> >
> > Now that the size of the critical section for 'ct_lock' is reduced, it
> > is not necessary to rate limit calls to ct_sweep() anymore.
> >
> > Signed-off-by: Gaetan Rivet 
> > Reviewed-by: Eli Britstein 
> > ---
> 
> It's weird to see patch 8/11 and 9/11 set up this way.
> 
> Would it make sense to just squash them together?
> 
> >  lib/conntrack.c | 24 +++-
> >  1 file changed, 7 insertions(+), 17 deletions(-)
> >
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index ea2e5b63b..8a7538b7b 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -1675,20 +1675,12 @@ conntrack_clean(struct conntrack *ct, long long now)
> >   * there is an actual connection that expires, or because a new connection
> >   * might be created with the minimum timeout).
> >   *
> > - * The logic below has two goals:
> > - *
> > - * - We want to reduce the number of wakeups and batch connection cleanup
> > - *   when the load is not very high.  CT_CLEAN_INTERVAL ensures that if we
> > - *   are coping with the current cleanup tasks, then we wait at least
> > - *   5 seconds to do further cleanup.
> > - *
> > - * - We don't want to keep the map locked too long, as we might prevent
> > - *   traffic from flowing.  CT_CLEAN_MIN_INTERVAL ensures that if cleanup 
> > is
> > - *   behind, there is at least some 200ms blocks of time when the map will 
> > be
> > - *   left alone, so the datapath can operate unhindered.
> > + * We want to reduce the number of wakeups and batch connection cleanup
> > + * when the load is not very high.  CT_CLEAN_INTERVAL ensures that if we
> > + * are coping with the current cleanup tasks, then we wait at least
> > + * 5 seconds to do further cleanup.
> >   */
> >  #define CT_CLEAN_INTERVAL 5000 /* 5 seconds */
> > -#define CT_CLEAN_MIN_INTERVAL 200  /* 0.2 seconds */
> >  
> >  static void *
> >  clean_thread_main(void *f_)
> > @@ -1705,12 +1697,10 @@ clean_thread_main(void *f_)
> >  long long now = time_msec();
> >  next_wake = conntrack_clean(ct, now);
> >  
> > -if (next_wake < now) {
> > -poll_immediate_wake();
> > -} else if (next_wake < now + CT_CLEAN_MIN_INTERVAL) {
> > -poll_timer_wait_until(now + CT_CLEAN_MIN_INTERVAL);
> > +if (next_wake > now) {
> > +poll_timer_wait_until(MIN(next_wake, now + CT_CLEAN_INTERVAL));
> >  } else {
> > -poll_timer_wait_until(MAX(next_wake, now + CT_CLEAN_INTERVAL));
> > +poll_immediate_wake();
> >  }
> >  latch_wait(&ct->clean_thread_exit);
> >  poll_block();
> 
> 

Hello Aaron,

Indeed, the reasoning is that avoiding the 0-ms timers should be less 
controversial,
I expected this patch to be discussed and probably changed or dropped.
So to simplify this I preferred to split the patches.

It can be squashed if everyone agrees with the change.

William had some objection to it, and we did not reach a consensus during v1 
review.
If there are a lot of connections, removing the rate limit will make the 
ct_clean thread
more active. William said at the time that running a CPU at 100% to age 
connections
might be a problem. I think that if we have ageing work to do, it should be 
done as soon
as possible. Connections being freed will reduce memory usage and general 
stress on the
conntrack. Additionally, poll_block() will make the thread yield and the 
scheduler should deal with being fair.

This patch is not essential to the series. It is here because I wanted to bring 
attention to this
bit of logic that is not justified any more.

-- 
Gaetan Rivet
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 00/11] conntrack: improve multithread scalability

2021-04-27 Thread Aaron Conole
Gaetan Rivet  writes:

> Conntracks are executed within the datapath. Locks along this path are crucial
> and their critical section should be minimal. The global 'ct_lock' is 
> necessary
> before any action taken on connection states. This lock is needed for many
> operations on the conntrack, slowing down the datapath.
>
> The cleanup thread 'ct_clean' will take it to do its job. As it can hold it a
> long time, the thread is limited in amount of connection cleaned per round,
> and calls are rate-limited.
>
> * Timeout policies locking is contrived to avoid deadlock.
>   Anytime a connection state is updated, during its update it is unlocked,
>   'ct_lock' is taken, then the connection is locked again. Then the reverse
>   is done for unlock.
>
> * Scalability is poor. The global ct_lock needs to be taken before applying
>   any change to a conn object. This is backward: local changes to smaller
>   objects should be independent, then the global lock should only be taken 
> once
>   the rest of the work is done, the goal being to have the smallest possible
>   critical section.
>
> It can be improved. Using RCU-friendly structures for connections, zone limits
> and timeout policies, read-first workload is improved and the precedence of 
> the
> global 'ct_lock' and local 'conn->lock' can be inversed.
>
> Running the conntrack benchmark we see these changes:
>   ./tests/ovstest test-conntrack benchmark  300 32
>
> code \ N  1 2 4 8
>   Before   2310  2766  6117 19838  (ms)
>After   2072  2084  2653  4541  (ms)
>
> One thread in the benchmark executes the task of a PMD, while the 'ct_clean' 
> thread
> runs in background as well.
>
> Github actions: https://github.com/grivet/ovs/actions/runs/574446345
>
> v2:
>
> An mpsc-queue is used instead of rculist to manage connection expirations 
> lists.
> PMDs and ct_clean all act as producers, while ct_clean is the sole consumer 
> thread.
> A PMD now needs to take the 'ct_lock' only when creating a new connection, 
> and only
> while inserting it in the conn CMAP. For any updates, only the conn lock is 
> now required,
> to properly change its state.
>
> The mpsc-queue implementation is identical to the one from the parallel 
> offload series [1].

I guess some of these are reused between those two series (the atomic
exchange and mpsc patches), so maybe we should focus on them first?

Ilya, WDYT?  Maybe these overlap patches can go in separately since
there are two series that will use them.

> CI: https://github.com/grivet/ovs/actions/runs/772118640
>
> [1]: https://patchwork.ozlabs.org/project/openvswitch/list/?series=238779
>
> Gaetan Rivet (11):
>   ovs-atomic: Expose atomic exchange operation
>   mpsc-queue: Module for lock-free message passing
>   conntrack: Use mpsc-queue to store conn expirations
>   conntrack: Use a cmap to store zone limits
>   conntrack: Init hash basis first at creation
>   conntrack-tp: Use a cmap to store timeout policies
>   conntrack: Inverse conn and ct lock precedence
>   conntrack: Do not schedule zero ms timers
>   conntrack: Do not rate limit ct-sweep
>   conntrack: Do not log empty ct-sweep
>   conntrack: Use an atomic conn expiration value
>
>  lib/automake.mk   |   2 +
>  lib/conntrack-private.h   |  97 +++--
>  lib/conntrack-tp.c| 100 ++
>  lib/conntrack.c   | 306 +++-
>  lib/conntrack.h   |   4 +-
>  lib/dpif-netdev.c |   5 +-
>  lib/mpsc-queue.c  | 251 +
>  lib/mpsc-queue.h  | 189 ++
>  lib/ovs-atomic-c++.h  |   3 +
>  lib/ovs-atomic-clang.h|   5 +
>  lib/ovs-atomic-gcc4+.h|   5 +
>  lib/ovs-atomic-gcc4.7+.h  |   5 +
>  lib/ovs-atomic-i586.h |   5 +
>  lib/ovs-atomic-locked.h   |   9 +
>  lib/ovs-atomic-msvc.h |  22 ++
>  lib/ovs-atomic-pthreads.h |   5 +
>  lib/ovs-atomic-x86_64.h   |   5 +
>  lib/ovs-atomic.h  |   8 +-
>  tests/automake.mk |   1 +
>  tests/library.at  |   5 +
>  tests/test-mpsc-queue.c   | 727 ++
>  21 files changed, 1573 insertions(+), 186 deletions(-)
>  create mode 100644 lib/mpsc-queue.c
>  create mode 100644 lib/mpsc-queue.h
>  create mode 100644 tests/test-mpsc-queue.c
>
> --
> 2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] checkpatch: Re-enforce line length checks.

2021-04-27 Thread Numan Siddique
On Mon, Apr 26, 2021 at 1:18 PM Mark Gray  wrote:
>
> On 23/04/2021 14:08, Dumitru Ceara wrote:
> > This check was removed by accident, re-add it.  Also add a test for it.
> >
> > Fixes: 0e77b3bcbfe2 ("ovn-northd-ddlog: New implementation of ovn-northd 
> > based on ddlog.")
> > Signed-off-by: Dumitru Ceara 
> > ---
> >  tests/checkpatch.at | 12 
> >  utilities/checkpatch.py |  2 +-
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/tests/checkpatch.at b/tests/checkpatch.at
> > index 8f45beac9..0c724d3d6 100755
> > --- a/tests/checkpatch.at
> > +++ b/tests/checkpatch.at
> > @@ -329,3 +329,15 @@ try_checkpatch \
> >  "
> >
> >  AT_CLEANUP
> > +
> > +AT_SETUP([checkpatch - line too long])
> > +try_checkpatch \
> > +   "COMMON_PATCH_HEADER
> > ++/* This is a very long 
> > line.. */
> > +" \
> > +"WARNING: Line is 80 characters long (recommended limit is 79)
> > +#8 FILE: A.c:1:
> > +/* This is a very long 
> > line.. */
> > +"
> > +
> > +AT_CLEANUP
> > diff --git a/utilities/checkpatch.py b/utilities/checkpatch.py
> > index af7bcfc29..9e8d17653 100755
> > --- a/utilities/checkpatch.py
> > +++ b/utilities/checkpatch.py
> > @@ -184,7 +184,7 @@ skip_signoff_check = False
> >  #
> >  # Python isn't checked as flake8 performs these checks during build.
> >  line_length_blacklist = re.compile(
> > -r'\.(am|at|etc|in|m4|mk|patch|py|dl)|$|debian/rules')
> > +r'\.(am|at|etc|in|m4|mk|patch|py|dl)$|debian/rules')
> >
> >  # Don't enforce a requirement that leading whitespace be all spaces on
> >  # files that include these characters in their name, since these kinds
> >
>
> Good catch. I tested and it worked.
>
> Acked-by: Mark D. Gray 

Thanks. I applied this patch to the main branch.

Numan

>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v5] conntrack: handle SNAT with all-zero IP address

2021-04-27 Thread Paolo Valerio
Gaëtan Rivet  writes:

> (Adding back the mailing list + original CCes to the thread.)
>

something weird happened, cause patchwork got it:

https://patchwork.ozlabs.org/project/openvswitch/patch/161912956184.60963.6344811329504822258.st...@fed.void/#2672584

not a big deal, I'll check my side.

> On Mon, Apr 26, 2021, at 19:09, Paolo Valerio wrote:
>> Hello Gaetan,
>> 
>> thanks for the feedback
>> 
>> Gaëtan Rivet  writes:
>> 
>> > On Fri, Apr 23, 2021, at 00:28, Paolo Valerio wrote:
>> 
>> [...]
>> 
>> >> +
>> >> +int i, j, s_attempts, d_attempts;
>> >
>> > Why not use uint16_t here?
>> > {curr,min,max}_{d,s}port are uint16_t and {s,d}_attemps will be set to 
>> > values derived from them.
>> > i and j will then be compared against {s,d}_attempts, so it seems safer to 
>> > keep them all aligned.
>> >
>> 
>> ACK
>> 
>> > Additionally, it seems s,d_attempts are unnecessary.
>> > They are only used to know the number of NEXT_PORT_IN_RANGE() that should 
>> > be attempted.
>> > Their names are slightly misleading (if they are counts of attempts, 
>> > n_attempts would be clearer),
>> > but also the index could be initialized to the number of attempts 
>> > remaining, and decrease during the loop.
>> > As the indexes are not useful within the loop, it seems ok.
>> >
>> > Furthermore, if they are not useful, could the indexes be masked 
>> > completely? Would it be acceptable
>> > to declare them within the for() loop? I know it's should generally be 
>> > avoided, but I've seen a few places
>> > where in-line declaration were used. In that case I think it's justified 
>> > if it makes the macro safer to use and simpler
>> > to read.
>> >
>> 
>> Right, the indexes are not useful within the loop, and masking them
>> would make the macro simpler. OTOH, declaring them within the for()
>> and nesting the loops would lead to a warning (-Wshadow).
>> 
>> If I didn't miss anything, and if you are ok with it, I would change it,
>> based on your suggestions, like the following:
>> 
>> uint16_t i, j;
>> FOR_EACH_PORT_IN_RANGE(i, curr_dport, min_dport, max_dport) {
>> nat_conn->rev_key.src.port = htons(curr_dport);
>> FOR_EACH_PORT_IN_RANGE(j, curr_sport, min_sport, max_sport) {
>> [...]
>> }
>> }
>> 
>> #define FOR_EACH_PORT_IN_RANGE(idx, curr, min, max) \
>> for (INIT_N_PORT_ATTEMPTS(idx, curr, min, max); \
>>  idx > 0; idx--, NEXT_PORT_IN_RANGE(curr, min, max))
>> 
>> WDYT?
>> 
>
> To nest the loops, you can use the __COUNTER__ macro, like so:
>
> /* Generate a unique name with the __COUNTER__ macro to allow nesting loops. 
> */
> #define OVS_STR_(x,y) x##y
> #define OVS_STR(x, y) OVS_STR_(x,y)
> /* There is one such 'stringify' macro in cmap.h as well, maybe it could be 
> shared in a util.h or similar. */
>

Ok, I'll share the macro in util.h.
I would split the patch in this case.

> #define FOR_EACH_PORT_IN_RANGE__(curr, min, max, INAME) \
> for (uint16_t INAME = N_PORT_ATTEMPTS(curr, min, max); \
>   INAME > 0; INAME--, NEXT_PORT_IN_RANGE(curr, min, max))
>
> #define FOR_EACH_PORT_IN_RANGE(curr, min, max) \
> FOR_EACH_PORT_IN_RANGE__(curr, min, max, OVS_STR(idx, __COUNTER__))
>
>> >> +FOR_EACH_PORT_IN_RANGE(i, d_attempts, curr_dport, min_dport, 
>> >> max_dport) {
>> >> +nat_conn->rev_key.src.port = htons(curr_dport);
>> >> +FOR_EACH_PORT_IN_RANGE(j, s_attempts, curr_sport, min_sport, 
>> >> max_sport) {
>> >> +nat_conn->rev_key.dst.port = htons(curr_sport);
>> >> +if (!conn_lookup(ct, &nat_conn->rev_key,
>> >> + time_msec(), NULL, NULL)) {
>> >> +return true;
>> >>  }
>> >> -first_port = min_port;
>> >> -port = first_port;
>> >> -all_ports_tried = false;
>> >>  }
>> >>  }
>> >> -return false;
>> >> +
>> >> +/* Check if next IP is in range and respin. Otherwise, notify
>> >> + * exhaustion to the caller. */
>> >> +next_addr:
>> >> +if (next_addr_in_range_guarded(&curr_addr, &min_addr,
>> >> +   &max_addr, &guard_addr,
>> >> +   conn->key.dl_type == 
>> >> htons(ETH_TYPE_IP))) {
>> >> +return false;
>> >> +}
>> >> +
>> >> +goto another_round;
>> >>  }
>> >>  
>> >>  static enum ct_update_res
>> >> diff --git a/lib/conntrack.h b/lib/conntrack.h
>> >> index 9553b188a..c68a83ccd 100644
>> >> --- a/lib/conntrack.h
>> >> +++ b/lib/conntrack.h
>> >> @@ -77,6 +77,14 @@ enum nat_action_e {
>> >>  NAT_ACTION_DST_PORT = 1 << 3,
>> >>  };
>> >>  
>> >> +#define NAT_ACTION_SNAT_ALL (NAT_ACTION_SRC | NAT_ACTION_SRC_PORT)
>> >> +#define NAT_ACTION_DNAT_ALL (NAT_ACTION_DST | NAT_ACTION_DST_PORT)
>> >> +
>> >> +enum {
>> >> +MIN_NAT_EPHEMERAL_PORT = 1024,
>> >> +MAX_NAT_EPHEMERAL_PORT = 65535
>> >> +};
>> >> +
>> >>  struct nat_action_info_t {
>> >>  union ct_addr min_addr;
>> >>  union ct_addr max_addr;
>> >> @@ -85,6

Re: [ovs-dev] [PATCH ovn v6 0/5] ARP and Floating IP Fixes

2021-04-27 Thread Numan Siddique
On Fri, Apr 23, 2021 at 3:17 PM Mark Michelson  wrote:
>
> This patch series aims to fix issues seen in OpenStack deployments when
> floating IPs were assigned to routers, and those floating IPs were not
> part of any subnet configured on that router.
>
> Originally, this was a two patch series but it has bloomed into a 5
> patch series.
>
> Patch 1 fixes the scenario where a VM attempts to reach a floating IP on
> the directly connected router. This has been part of this patch series
> since v1.
>
> Patch 2 is an incidental fix that removes a redundant paragraph from
> documentation.
>
> Patches 3 and 4 work towards pre-allocating MAC_Bindings for known
> router addresses. Patch 3 is the northd side, placing all
> router_addresses in the connected logical switch port's Port_Binding
> record. Patch 4 is the ovn-controller side, adding the MAC_Bindings
> based on the Port_Binding's router_addresses.
>
> And Patch 5 addresses the situation for when the pre-allocated
> MAC_Bindings cannot be used. For this situation, we will flood the ARP
> request if the TPA is for a configured IP address that is outside the
> connected routers' subnets.
> ---
> v5 -> v6:
> * Patch 3 now only saves gateway router addresses to the connected
>   switch's router_addresses column. Previous versions saved all router
>   addresses to all connected switches' columns.
> * Patch 5 has two new tests added. One ensures that the priority 90
>   flows that flood ARP for unreachable addresses are present. The other
>   is a restored system test that ensures that a ping to a floating IP
>   outside of the router's subnet succeeds.
> * Patch 4 has a small change of types from int to size_t for a loop
>   index.
>
> v4 -> v5:
> Fixed memory leaks in patch 3 and patch 4. Patches 1, 2,  and 5 are the
> same as in v4.
> ---
>
> Mark Michelson (5):
>   northd: Swap src and dst eth addresses in router egress loop.
>   ovn-sb: Remove redundant "nat-addresses" information from
> Port_Binding.
>   northd: Save all router addresses in Port_Bindings
>   pinctrl: Add Chassis MAC_Bindings for all router addresses.
>   northd: Flood ARPs to routers for "unreachable" addresses.
>
>  controller/ovn-controller.c |   4 +
>  controller/pinctrl.c| 300 +---
>  controller/pinctrl.h|   1 +
>  northd/ovn-northd.8.xml |   8 +
>  northd/ovn-northd.c | 377 
>  northd/ovn_northd.dl| 146 ++
>  ovn-sb.ovsschema|   8 +-
>  ovn-sb.xml  |  37 +++-
>  tests/ovn-controller.at | 179 +
>  tests/ovn-northd.at | 305 +
>  tests/system-ovn.at | 218 +
>  11 files changed, 1348 insertions(+), 235 deletions(-)

Hi Mark,

Thanks for the v6.

I see 2 issues with this version.

1. I still see that the port_Binding (of a logical switch port has
router_addresses) set even if its peer is not
a gateway router port.  Is it required that we need to store this
information ?

Please check this out - http://paste.openstack.org/show/804814/
I was expecting that router_addressses need not to be set for
port_bindings sw0-lr0 and sw1-lr0.

2. There are many system tests which are failing.  I see the below
warning message in the test suite logs

--- /dev/null   2021-04-26 15:23:24.901985828 -0400
+++ 
/home/nusiddiq/workspace_cpp/ovn-org/ovn/_gcc_system/tests/system-kmod-testsuite.dir/at-groups/1/stdout
2021-04-27 15:11:09.277572760 -0400
@@ -0,0 +1 @@
+2021-04-27T19:11:06.035Z|00020|ovsdb_idl|WARN|transaction error:
{"details":"Transaction causes multiple rows in \"MAC_Binding\" table
to have identical values (R1_join and \"20.0.0.2\") for index on
columns \"logical_port\" and \"ip\".  First row, with UUID
a426a9eb-121b-4b1b-91d1-8f2ba095e3b7, existed in the database before
this transaction and was not modified by the transaction.  Second row,
with UUID 92af4be3-1cb6-4287-bae7-b823479476ed, was inserted by this
transaction.","error":"constraint violation"}
ovsdb-server.log:

Looks like all the system tests are running as expected, but in the
end the test fails because ovn-controller.log has the above warning
message.

It needs to be seen if this warning message can be whitelisted or not.

Thanks
Numan


>
> --
> 2.29.2
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 10/11] conntrack: Do not log empty ct-sweep

2021-04-27 Thread Aaron Conole
Gaetan Rivet  writes:

> Do not add noise to the DBG log for empty sweeps.
> Only log time taken when some connections were cleaned.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---
>  lib/conntrack.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 8a7538b7b..823fb060a 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -1647,8 +1647,10 @@ ct_sweep(struct conntrack *ct, long long now, size_t 
> limit)
>  }
>  
>  out:
> -VLOG_DBG("conntrack cleanup %"PRIuSIZE" entries in %lld msec", count,
> - time_msec() - now);
> +if (count > 0) {

I think:

+if (count) {

Since it's unsigned, if it is set it will only be positive.

> +VLOG_DBG("conntrack cleanup %"PRIuSIZE" entries in %lld msec", count,
> + time_msec() - now);
> +}
>  return min_expiration;
>  }

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 09/11] conntrack: Do not rate limit ct-sweep

2021-04-27 Thread Aaron Conole
Gaetan Rivet  writes:

> The current rate limit is set to allow other threads to update the
> connections when applicable. This was valid when taking the 'ct_lock'
> was needed with a global critical section.
>
> Now that the size of the critical section for 'ct_lock' is reduced, it
> is not necessary to rate limit calls to ct_sweep() anymore.
>
> Signed-off-by: Gaetan Rivet 
> Reviewed-by: Eli Britstein 
> ---

It's weird to see patch 8/11 and 9/11 set up this way.

Would it make sense to just squash them together?

>  lib/conntrack.c | 24 +++-
>  1 file changed, 7 insertions(+), 17 deletions(-)
>
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index ea2e5b63b..8a7538b7b 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -1675,20 +1675,12 @@ conntrack_clean(struct conntrack *ct, long long now)
>   * there is an actual connection that expires, or because a new connection
>   * might be created with the minimum timeout).
>   *
> - * The logic below has two goals:
> - *
> - * - We want to reduce the number of wakeups and batch connection cleanup
> - *   when the load is not very high.  CT_CLEAN_INTERVAL ensures that if we
> - *   are coping with the current cleanup tasks, then we wait at least
> - *   5 seconds to do further cleanup.
> - *
> - * - We don't want to keep the map locked too long, as we might prevent
> - *   traffic from flowing.  CT_CLEAN_MIN_INTERVAL ensures that if cleanup is
> - *   behind, there is at least some 200ms blocks of time when the map will be
> - *   left alone, so the datapath can operate unhindered.
> + * We want to reduce the number of wakeups and batch connection cleanup
> + * when the load is not very high.  CT_CLEAN_INTERVAL ensures that if we
> + * are coping with the current cleanup tasks, then we wait at least
> + * 5 seconds to do further cleanup.
>   */
>  #define CT_CLEAN_INTERVAL 5000 /* 5 seconds */
> -#define CT_CLEAN_MIN_INTERVAL 200  /* 0.2 seconds */
>  
>  static void *
>  clean_thread_main(void *f_)
> @@ -1705,12 +1697,10 @@ clean_thread_main(void *f_)
>  long long now = time_msec();
>  next_wake = conntrack_clean(ct, now);
>  
> -if (next_wake < now) {
> -poll_immediate_wake();
> -} else if (next_wake < now + CT_CLEAN_MIN_INTERVAL) {
> -poll_timer_wait_until(now + CT_CLEAN_MIN_INTERVAL);
> +if (next_wake > now) {
> +poll_timer_wait_until(MIN(next_wake, now + CT_CLEAN_INTERVAL));
>  } else {
> -poll_timer_wait_until(MAX(next_wake, now + CT_CLEAN_INTERVAL));
> +poll_immediate_wake();
>  }
>  latch_wait(&ct->clean_thread_exit);
>  poll_block();

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 ovn] ovn-sbctl: Prevent core dump from ovn-sbctl lflow-list [datpath] 0xflow

2021-04-27 Thread Mark Michelson

Excellent! Looks good to me!

Acked-by: Mark Michelson 

On 4/21/21 12:17 PM, Alexey Roytman wrote:

From: Alexey Roytman 

When ovn-sbctl lflow-list gets lflow argument with 0x prefix, e.g. 0x8131c8a8,
it prints correct output, but fails with coredump.
For example:
ovn-sbctl --uuid lflow-list sw1 0x8131c8a8
   
Datapath: "sw1" (4b1e53d8-9f0f-4768-b4a6-6cbc58a4bfda)  Pipeline: egress

 uuid=0x8131c8a8, table=10(ls_out_port_sec_l2 ), priority=100  ,
match=(eth.mcast), action=(output;)
free(): invalid pointer
[2]616553 abort (core dumped)  ovn-sbctl --uuid dump-flows sw1 0x8131c8a8
  
  This patch fixes it.


Signed-off-by: Alexey Roytman 
---
  utilities/ovn-sbctl.c | 28 +---
  1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/utilities/ovn-sbctl.c b/utilities/ovn-sbctl.c
index e3aa7a68e..99c112358 100644
--- a/utilities/ovn-sbctl.c
+++ b/utilities/ovn-sbctl.c
@@ -764,23 +764,28 @@ sbctl_lflow_cmp(const void *a_, const void *b_)
  return cmp ? cmp : strcmp(a->actions, b->actions);
  }
  
-static char *

+static bool
+is_uuid_with_prefix(const char *uuid)
+{
+ return uuid[0] == '0' && (uuid[1] == 'x' || uuid[1] == 'X');
+}
+
+static bool
  parse_partial_uuid(char *s)
  {
  /* Accept a full or partial UUID. */
  if (uuid_is_partial_string(s)) {
-return s;
+return true;
  }
  
  /* Accept a full or partial UUID prefixed by 0x, since "ovs-ofctl

   * dump-flows" prints cookies prefixed by 0x. */
-if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')
-&& uuid_is_partial_string(s + 2)) {
-return s + 2;
+if (is_uuid_with_prefix(s) && uuid_is_partial_string(s + 2)) {
+return true;
  }
  
  /* Not a (partial) UUID. */

-return NULL;
+return false;
  }
  
  static const char *

@@ -799,8 +804,11 @@ is_partial_uuid_match(const struct uuid *uuid, const char 
*match)
   * from UUIDs, and cookie values are printed without leading zeros because
   * they're just numbers. */
  const char *s1 = strip_leading_zero(uuid_s);
-const char *s2 = strip_leading_zero(match);
-
+const char *s2 = match;
+if (is_uuid_with_prefix(s2)) {
+s2 = s2 + 2;
+}
+s2 = strip_leading_zero(s2);
  return !strncmp(s1, s2, strlen(s2));
  }
  
@@ -1134,12 +1142,10 @@ cmd_lflow_list(struct ctl_context *ctx)

  }
  
  for (size_t i = 1; i < ctx->argc; i++) {

-char *s = parse_partial_uuid(ctx->argv[i]);
-if (!s) {
+if (!parse_partial_uuid(ctx->argv[i])) {
  ctl_fatal("%s is not a UUID or the beginning of a UUID",
ctx->argv[i]);
  }
-ctx->argv[i] = s;
  }
  
  struct vconn *vconn = sbctl_open_vconn(&ctx->options);




___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH] tests: Fix inconsistent "ACL Conjunction" test.

2021-04-27 Thread 0-day Robot
Bleep bloop.  Greetings Mark Michelson, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


git-am:
error: sha1 information is lacking or useless (tests/ovn-macros.at).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0001 tests: Fix inconsistent "ACL Conjunction" test.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 4/4] ovn-controller: Fix port group conjunction flow explosion problem.

2021-04-27 Thread Han Zhou
On Tue, Apr 27, 2021 at 8:43 AM Mark Gray  wrote:
>
> On 22/04/2021 21:14, Han Zhou wrote:
> > For an ACL with match: outport == @PG && ip4.src == $PG_AS, given below
> > scale:
> >
> > P: PG size
> > LP: number of local lports
> > D: number of all datapaths (logical switches)
> > LD: number of datapaths that contain local lports
> >
> > With current OVN implementation, the total number of OF flows is:
> > LP + (P * D) + D
> >
> > The reason is, firstly, datapath is not part of the conjunction, so for
> > each datapath the lflow is reparsed.
> >
> > Secondly, although ovn-controller tries to filter out the flows that are
> > for non-local lports, with the conjunction match, the logic that filters
> > out non-local flows doesn't work for the conjunction part that doesn't
> > have the lport in the match (the P * D part). When there is only one
> > port on each LS it is fine, because no conjunction will be used because
> > SB port groups are splited per datapath, so each port group would have
> suggest "split per datapath"

Ack

> > only one port. However, when more than one ports are on each LS the flow
> > explosion happens.
> >
> > This patch deal with the second reason above, by refining the SB port
> > groups to store only locally bound lports: empty const sets will not
> > generate any flows. This reduces the related flow number from
> > LP + (P * D) + D to LP + (P * LD) + LD.
> >
> > Since LD is expected to be small, so even if it is a multiplier, the
> > total number is larged reduced.  In particular, in ovn-k8s use cases the
> suggest "reduced significantly"

Ack

> > LD is always 1, so the formular above becomes LP + P + LD.
> >
> s/formular/formula

Ack

> > With a scale of 1k k8s nodes, each has 4 ports for the same PG: P = 4k,
> > LP = 4, D = 1k, LD = 1. The current implementation generates ~4m flows.
> > With this patch it becomes only ~4k.
> Cool!
> >
> > Reported-by: Girish Moodalbail 
> > Reported-at:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-March/381082.html
> > Reported-by: Dumitru Ceara 
> > Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1944098
> > Signed-off-by: Han Zhou 
>
> I tested this as well and it seemed to work as expected.

Thanks for the test!

> > ---
> > v1->v2: fix memory leaks found by address sanitizer
> >
> >  controller/binding.c|  20 
> >  controller/binding.h|   9 ++
> >  controller/ovn-controller.c | 217 ++--
> >  include/ovn/expr.h  |   2 +-
> >  lib/expr.c  |   8 +-
> >  tests/ovn.at|  53 +
> >  tests/test-ovn.c|  12 +-
> >  utilities/ovn-trace.c   |   4 +-
> >  8 files changed, 283 insertions(+), 42 deletions(-)
> >
> > diff --git a/controller/binding.c b/controller/binding.c
> > index 514f5f33f..5aca964cc 100644
> > --- a/controller/binding.c
> > +++ b/controller/binding.c
> > @@ -2987,3 +2987,23 @@ cleanup:
> >
> >  return b_lport;
> >  }
> > +
> > +struct sset *
> > +binding_collect_local_binding_lports(struct local_binding_data
*lbinding_data)
> > +{
> > +struct sset *lports = xzalloc(sizeof *lports);
> > +sset_init(lports);
> > +struct shash_node *shash_node;
> > +SHASH_FOR_EACH (shash_node, &lbinding_data->lports) {
> > +struct binding_lport *b_lport = shash_node->data;
> > +sset_add(lports, b_lport->name);
> > +}
> > +return lports;
> > +}
> > +
> > +void
> > +binding_destroy_local_binding_lports(struct sset *lports)
> > +{
> > +sset_destroy(lports);
> > +free(lports);
> > +}
> > diff --git a/controller/binding.h b/controller/binding.h
> > index 4fc9ef207..31f0352a0 100644
> > --- a/controller/binding.h
> > +++ b/controller/binding.h
> > @@ -128,4 +128,13 @@ void binding_seqno_run(struct local_binding_data
*lbinding_data);
> >  void binding_seqno_install(struct local_binding_data *lbinding_data);
> >  void binding_seqno_flush(void);
> >  void binding_dump_local_bindings(struct local_binding_data *, struct
ds *);
> > +
> > +/* Generates a sset of lport names from local_binding_data.
> > + * Note: the caller is responsible for destroying and freeing the
returned
> > + * sset, by calling binding_collect_local_binding_lports(). */
> I think this^ should say binding_destroy_local_binding_lports()?

Oops. My bad.

> > +struct sset *binding_collect_local_binding_lports(struct
local_binding_data *);
> > +
> > +/* Destroy and free the lports sset returned by
> > + * binding_collect_local_binding_lports(). */
> > +void binding_destroy_local_binding_lports(struct sset *lports);
> >  #endif /* controller/binding.h */
> > diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> > index 7320bd56c..c6ba9ff88 100644
> > --- a/controller/ovn-controller.c
> > +++ b/controller/ovn-controller.c
> > @@ -1341,7 +1341,7 @@ addr_sets_init(const struct
sbrec_address_set_table *address_set_table,
> >  SBREC_ADDRESS_SET_TABLE_FOR_EACH (as, address_set_table) {
> >  ex

Re: [ovs-dev] [PATCH ovn v2 1/4] inc-proc-eng: Call clear_tracked_data before recompute.

2021-04-27 Thread Zhen Wang (SW-CLOUD)


From: Han Zhou 
Sent: Tuesday, April 27, 2021 9:53 AM
To: Mark Gray 
Cc: Han Zhou ; ovs dev ; Zhen Wang 
(SW-CLOUD) ; Girish Moodalbail 
Subject: Re: [ovs-dev] [PATCH ovn v2 1/4] inc-proc-eng: Call clear_tracked_data 
before recompute.

External email: Use caution opening links or attachments



On Tue, Apr 27, 2021 at 8:39 AM Mark Gray 
mailto:mark.d.g...@redhat.com>> wrote:
>
> Hi Han,
>
> Thanks for fixing this. I reviewed this series but I am not an expert on
> the code. Please have a look at my suggestions but I suggest also
> waiting for an ack from Girish or Krzystof as they will probably test it.
>

Thanks Mark for the review.

+ Winson who verified for the same environment where Girish was reporting the 
issue. (Both of them are now my colleagues :) )
Winson would you add your Tested-by to this series?

In our 600+  node k8s cluster,  scale up 1000 pods in namespace with 
NetworkPolicy  “deny-from-other-namespaces”

Without Han’s Patch:
K8s node br-int OF increased  around 800K.

With Han’s patch:
K8s node br-int OF increased  around 6K

Regards,
Winson

> Mark
>
> On 22/04/2021 21:14, Han Zhou wrote:
> > Cleanup particially tracked data due to some of the change handler
> s/particially/partially?

Ack

> > executions before falling back to recompute. This is done already
> > in the en_runtime_data_run() implementation, but this patch makes
> > it a generic behavior of the I-P engine.
> >
> > Signed-off-by: Han Zhou mailto:hz...@ovn.org>>
> > ---
> > v1->v2: no change
> >
> >  controller/ovn-controller.c | 17 -
> >  lib/inc-proc-eng.c  |  5 +
> >  2 files changed, 5 insertions(+), 17 deletions(-)
> >
> > diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> > index 6f7c9ea61..13c03131c 100644
> > --- a/controller/ovn-controller.c
> > +++ b/controller/ovn-controller.c
> > @@ -1412,23 +1412,6 @@ en_runtime_data_run(struct engine_node *node, void 
> > *data)
> >  struct sset *local_lport_ids = &rt_data->local_lport_ids;
> >  struct sset *active_tunnels = &rt_data->active_tunnels;
> >
> > -/* Clear the (stale) tracked data if any. Even though the tracked data
> > - * gets cleared in the beginning of engine_init_run(),
> > - * any of the runtime data handler might have set some tracked
> > - * data and later another runtime data handler might return false
> > - * resulting in full recompute of runtime engine and rendering the 
> > tracked
> > - * data stale.
> > - *
> > - * It's possible that engine framework can be enhanced to indicate
> > - * the node handlers (in this case flow_output_runtime_data_handler)
> > - * that its input node had a full recompute. However we would still
> > - * need to clear the tracked data, because we don't want the
> > - * stale tracked data to be accessed outside of the engine, since the
> > - * tracked data is cleared in the engine_init_run() and not at the
> > - * end of the engine run.
> > - * */
> > -en_runtime_data_clear_tracked_data(data);
> > -
> >  static bool first_run = true;
> >  if (first_run) {
> >  /* don't cleanup since there is no data yet */
> > diff --git a/lib/inc-proc-eng.c b/lib/inc-proc-eng.c
> > index a6337a1d9..161327404 100644
> > --- a/lib/inc-proc-eng.c
> > +++ b/lib/inc-proc-eng.c
> > @@ -327,6 +327,11 @@ engine_recompute(struct engine_node *node, bool 
> > forced, bool allowed)
> >  }
> >
> >  /* Run the node handler which might change state. */
> Can you move this^ comment down to above the run function as I think it
> is relevant to that code?

Well, I added it this way because the major step here is to "run()", i.e. 
recompute, and I added a minor/sub step which is clearing tracked data first. 
Is this reasonable? I can change it if you think the other way is better.

> > +/* Clear tracked data before calling run() so that partially tracked 
> > data
> > + * from some of the change handler executions are cleared. */
> > +if (node->clear_tracked_data) {
> > +node->clear_tracked_data(node->data);
> > +}
> >  node->run(node, node->data);
> >  node->stats.recompute++;
> >  }
> >
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] tests: Fix inconsistent "ACL Conjunction" test.

2021-04-27 Thread Mark Michelson
The ACL Conjunction test would occasionally fail during automated test
runs. During the test, we send a packet on a netdev-dummy interface and
check the associated pcap file to ensure the packet is sent where we
expect and that it has the expected contents. Looking at logs from
failed runs, it appeared that the pcap file was unpopulated. This likely
was because we were attempting to dump the contents of the pcap file
before the packet had been processed and added to the pcap file.

This patch aims to fix the problem by blocking until the pcap file has
been modified when sending the packet to the netdev-dummy interface.
Since this could be a useful thing for other tests, this new method of
blocking has been added to ovn-macros.at.

Signed-off-by: Mark Michelson 
---
 tests/ovn-macros.at | 23 +++
 tests/ovn.at|  8 
 2 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/tests/ovn-macros.at b/tests/ovn-macros.at
index bd227215a..94fba405e 100644
--- a/tests/ovn-macros.at
+++ b/tests/ovn-macros.at
@@ -509,6 +509,29 @@ options:rxq_pcap=${pcap_file}-rx.pcap
 OVS_WAIT_WHILE([test 24 = $(wc -c ${pcap_file}-tx.pcap | cut -d " " -f1)])
 }
 
+# Receive a packet on a dummy netdev interface. If we expect packets to be
+# recorded, then wait until the pcap file reflects the change.
+netdev_dummy_receive() {
+local interface="$1"
+local packet="$2"
+local hv="$3"
+local pcap_file="$4"
+
+if test -n "pcap_file" ; then
+ts_old=$(stat -c %y "$pcap_file")
+fi
+if test -n "$hv" ; then
+as "$hv" ovs-appctl netdev-dummy/receive "$interface" "$packet"
+else
+ovs-appctl netdev-dummy/receive "$interface" "$packet"
+fi
+if test -n "$pcap_file" ; then
+OVS_WAIT_WHILE(
+  [ts_new=$(stat -c %y "$pcap_file")
+   test "$ts_new" = "$ts_old"])
+fi
+}
+
 OVS_END_SHELL_HELPERS
 
 m4_define([OVN_POPULATE_ARP], [AT_CHECK(ovn_populate_arp__, [0], [ignore])])
diff --git a/tests/ovn.at b/tests/ovn.at
index 9f38ec6ec..c1158f5d0 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -14016,11 +14016,11 @@ check ovn-nbctl --wait=hv acl-add ls1 to-lport 1001 \
 # port numbers, e.g. 11 for vif11.
 test_ip() {
 # This packet has bad checksums but logical L3 routing doesn't check.
-local inport=$1 src_mac=$2 dst_mac=$3 src_ip=$4 dst_ip=$5
+local inport=$1 src_mac=$2 dst_mac=$3 src_ip=$4 dst_ip=$5 pcap_file=$6
 local packet=${dst_mac}${src_mac}0800451c4011${src_ip}\
 ${dst_ip}00350008
-shift; shift; shift; shift; shift
-as hv1 ovs-appctl netdev-dummy/receive hv1-vif1 $packet
+shift; shift; shift; shift; shift; shift
+netdev_dummy_receive hv1-vif1 $packet hv1 "$pcap_file"
 for outport; do
 echo $packet >> $outport.expected
 done
@@ -14040,7 +14040,7 @@ options:rxq_pcap=${pcap_file}-rx.pcap
 sip=`ip_to_hex 10 0 0 4`
 dip=`ip_to_hex 10 0 0 6`
 
-test_ip 1 f001 f002 $sip $dip 2
+test_ip 1 f001 f002 $sip $dip hv1/vif2-tx.pcap 2
 
 cat 2.expected > expout
 $PYTHON "$ovs_srcdir/utilities/ovs-pcap.in" hv1/vif2-tx.pcap > 2.packets
-- 
2.29.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [v11 01/15] dpif-netdev: Refactor to multiple header files.

2021-04-27 Thread 0-day Robot
Bleep bloop.  Greetings Cian Ferriter, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Improper whitespace around control block
#231 FILE: lib/dpif-netdev-private-dfc.h:111:
#define EMC_FOR_EACH_POS_WITH_HASH(EMC, CURRENT_ENTRY, HASH) \

ERROR: Improper whitespace around control block
#347 FILE: lib/dpif-netdev-private-dfc.h:227:
EMC_FOR_EACH_POS_WITH_HASH(cache, current_entry, key->hash) {

ERROR: Improper whitespace around control block
#465 FILE: lib/dpif-netdev-private-dpcls.h:95:
#define NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP(VALUE, KEY, FLOWMAP)   \

ERROR: Inappropriate bracing around statement
#466 FILE: lib/dpif-netdev-private-dpcls.h:96:
MINIFLOW_FOR_EACH_IN_FLOWMAP (VALUE, &(KEY)->mf, FLOWMAP)

Lines checked: 1624, Warnings: 0, Errors: 4


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 2/4] ovn.at: Improve "No ovn-controller assert when generating conjunction flows"

2021-04-27 Thread Han Zhou
On Tue, Apr 27, 2021 at 8:40 AM Mark Gray  wrote:
>
> On 22/04/2021 21:14, Han Zhou wrote:
> > This patch improves the test case by binding 2 VIFs on the HV instead of
> > one, to make sure conjunction is still used and the scenario is still
> > tested by this test case when a following patch optimizes conjunction
flows.
> >
> > Signed-off-by: Han Zhou 
> > ---
> > v1->v2: no change
> >
> >  tests/ovn.at | 94 
> >  1 file changed, 50 insertions(+), 44 deletions(-)
> >
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index 3d0a7f63f..55444bbd7 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -25111,10 +25111,12 @@ ovs-vsctl add-br br-phys
> >  ovn_attach n1 br-phys 192.168.0.10
> >
> >  as hv1
> > -ovs-vsctl \
> > --- add-port br-int vif1 \
> > --- set Interface vif1 external_ids:iface-id=sw0-p1 \
> > -ofport-request=1
> > +for i in 1 2; do
> > +ovs-vsctl \
> > +-- add-port br-int vif$i \
> > +-- set Interface vif$i external_ids:iface-id=sw0-p$i \
> > +ofport-request=$i
> > +done
> >
> >  check as hv1
> >  ovs-vsctl set open . external_ids:ovn-monitor-all=true
> > @@ -25122,10 +25124,10 @@ ovs-vsctl set open .
external_ids:ovn-monitor-all=true
> >  check ovn-nbctl ls-add sw0
> >  check ovn-nbctl pg-add pg1
> >  check ovn-nbctl pg-add pg2
> > -check ovn-nbctl lsp-add sw0 sw0-p2
> > -check ovn-nbctl lsp-set-addresses sw0-p2 "00:00:00:00:00:02
192.168.47.2"
> >  check ovn-nbctl lsp-add sw0 sw0-p3
> >  check ovn-nbctl lsp-set-addresses sw0-p3 "00:00:00:00:00:03
192.168.47.3"
> > +check ovn-nbctl lsp-add sw0 sw0-p4
> > +check ovn-nbctl lsp-set-addresses sw0-p4 "00:00:00:00:00:04
192.168.47.4"
> >
> >  # Pause ovn-northd. When it is resumed, all the below NB updates
> >  # will be sent in one transaction.
> > @@ -25135,8 +25137,10 @@ check as northd-backup ovn-appctl -t
NORTHD_TYPE pause
> >
> >  check ovn-nbctl lsp-add sw0 sw0-p1
> >  check ovn-nbctl lsp-set-addresses sw0-p1 "00:00:00:00:00:01
192.168.47.1"
> > -check ovn-nbctl pg-set-ports pg1 sw0-p1 sw0-p2
> > -check ovn-nbctl pg-set-ports pg2 sw0-p3
> > +check ovn-nbctl lsp-add sw0 sw0-p2
> > +check ovn-nbctl lsp-set-addresses sw0-p2 "00:00:00:00:00:02
192.168.47.2"
> > +check ovn-nbctl pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3
> > +check ovn-nbctl pg-set-ports pg2 sw0-p4
> >  check ovn-nbctl acl-add pg1 to-lport 1002 "outport == @pg1 && ip4 &&
ip4.src == \$pg2_ip4 && udp && udp.dst >= 1 && udp.dst <= 65535"
allow-related
> >
> >  # resume ovn-northd now. This should result in a single update message
> > @@ -25144,11 +25148,11 @@ check ovn-nbctl acl-add pg1 to-lport 1002
"outport == @pg1 && ip4 && ip4.src ==
> >  check as northd ovn-appctl -t NORTHD_TYPE resume
> >
> >  AS_BOX([Wait for sw0-p1 to be up])
> Update AS_BOX above^ as it does not match the code anymore.

Ack. Thanks!

> > -wait_for_ports_up sw0-p1
> > +wait_for_ports_up sw0-p1 sw0-p2
> >
> >  # When the port group pg1 is updated, it should not result in
> >  # any assert in ovn-controller.
> > -ovn-nbctl --wait=hv pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3
> > +ovn-nbctl --wait=hv pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3 sw0-p4
> >  AT_CHECK([kill -0 $(cat hv1/ovn-controller.pid)])
> >  check ovn-nbctl --wait=hv sync
> >
> > @@ -25156,40 +25160,42 @@ check ovn-nbctl --wait=hv sync
> >  AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=45 |
ofctl_strip_all | \
> >  grep "priority=2002" | grep conjunction | \
> >  sed 's/conjunction([[^)]]*)/conjunction()/g' | sort], [0], [dnl
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x10/0xfff0
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x100/0xff00
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x1000/0xf000
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x2/0xfffe
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x20/0xffe0
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x200/0xfe00
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x2000/0xe000
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x4/0xfffc
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x40/0xffc0
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x400/0xfc00
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x4000/0xc000
actions=conjunction()
> > - table=45,
priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src

Re: [ovs-dev] [PATCH ovn v2 1/4] inc-proc-eng: Call clear_tracked_data before recompute.

2021-04-27 Thread Han Zhou
On Tue, Apr 27, 2021 at 8:39 AM Mark Gray  wrote:
>
> Hi Han,
>
> Thanks for fixing this. I reviewed this series but I am not an expert on
> the code. Please have a look at my suggestions but I suggest also
> waiting for an ack from Girish or Krzystof as they will probably test it.
>

Thanks Mark for the review.

+ Winson who verified for the same environment where Girish was reporting
the issue. (Both of them are now my colleagues :) )
Winson would you add your Tested-by to this series?

> Mark
>
> On 22/04/2021 21:14, Han Zhou wrote:
> > Cleanup particially tracked data due to some of the change handler
> s/particially/partially?

Ack

> > executions before falling back to recompute. This is done already
> > in the en_runtime_data_run() implementation, but this patch makes
> > it a generic behavior of the I-P engine.
> >
> > Signed-off-by: Han Zhou 
> > ---
> > v1->v2: no change
> >
> >  controller/ovn-controller.c | 17 -
> >  lib/inc-proc-eng.c  |  5 +
> >  2 files changed, 5 insertions(+), 17 deletions(-)
> >
> > diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> > index 6f7c9ea61..13c03131c 100644
> > --- a/controller/ovn-controller.c
> > +++ b/controller/ovn-controller.c
> > @@ -1412,23 +1412,6 @@ en_runtime_data_run(struct engine_node *node,
void *data)
> >  struct sset *local_lport_ids = &rt_data->local_lport_ids;
> >  struct sset *active_tunnels = &rt_data->active_tunnels;
> >
> > -/* Clear the (stale) tracked data if any. Even though the tracked
data
> > - * gets cleared in the beginning of engine_init_run(),
> > - * any of the runtime data handler might have set some tracked
> > - * data and later another runtime data handler might return false
> > - * resulting in full recompute of runtime engine and rendering the
tracked
> > - * data stale.
> > - *
> > - * It's possible that engine framework can be enhanced to indicate
> > - * the node handlers (in this case
flow_output_runtime_data_handler)
> > - * that its input node had a full recompute. However we would still
> > - * need to clear the tracked data, because we don't want the
> > - * stale tracked data to be accessed outside of the engine, since
the
> > - * tracked data is cleared in the engine_init_run() and not at the
> > - * end of the engine run.
> > - * */
> > -en_runtime_data_clear_tracked_data(data);
> > -
> >  static bool first_run = true;
> >  if (first_run) {
> >  /* don't cleanup since there is no data yet */
> > diff --git a/lib/inc-proc-eng.c b/lib/inc-proc-eng.c
> > index a6337a1d9..161327404 100644
> > --- a/lib/inc-proc-eng.c
> > +++ b/lib/inc-proc-eng.c
> > @@ -327,6 +327,11 @@ engine_recompute(struct engine_node *node, bool
forced, bool allowed)
> >  }
> >
> >  /* Run the node handler which might change state. */
> Can you move this^ comment down to above the run function as I think it
> is relevant to that code?

Well, I added it this way because the major step here is to "run()", i.e.
recompute, and I added a minor/sub step which is clearing tracked data
first. Is this reasonable? I can change it if you think the other way is
better.

> > +/* Clear tracked data before calling run() so that partially
tracked data
> > + * from some of the change handler executions are cleared. */
> > +if (node->clear_tracked_data) {
> > +node->clear_tracked_data(node->data);
> > +}
> >  node->run(node, node->data);
> >  node->stats.recompute++;
> >  }
> >
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 15/15] netdev: Optimize netdev_send_prepare_batch.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

Optimize for the best case here where all packets will be compatible
with 'netdev_flags'.

Signed-off-by: Harry van Haaren 
Co-authored-by: Cian Ferriter 
Signed-off-by: Cian Ferriter 
---
 NEWS |  2 ++
 lib/netdev.c | 31 ++-
 2 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/NEWS b/NEWS
index 0fa195acd..34a42250a 100644
--- a/NEWS
+++ b/NEWS
@@ -20,6 +20,8 @@ Post-v2.15.0
CPU supports it. This enhances performance by using the native vpopcount
instructions, instead of the emulated version of vpopcount.
  * Optimize dp_netdev_output by enhancing compiler optimization potential.
+ * Optimize netdev sending by assuming the happy case, and using fallback
+   for if the netdev doesnt meet the required HWOL needs of a packet.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/lib/netdev.c b/lib/netdev.c
index 91e91955c..29a5f1aa9 100644
--- a/lib/netdev.c
+++ b/lib/netdev.c
@@ -837,20 +837,33 @@ static void
 netdev_send_prepare_batch(const struct netdev *netdev,
   struct dp_packet_batch *batch)
 {
-struct dp_packet *packet;
-size_t i, size = dp_packet_batch_size(batch);
+struct dp_packet *p;
+uint32_t i, size = dp_packet_batch_size(batch);
+char *err_msg = NULL;
 
-DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
-char *errormsg = NULL;
+for (i = 0; i < size; i++) {
+p = batch->packets[i];
+int pkt_ok = netdev_send_prepare_packet(netdev->ol_flags, p, &err_msg);
 
-if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
-dp_packet_batch_refill(batch, packet, i);
+if (OVS_UNLIKELY(!pkt_ok)) {
+goto refill_loop;
+}
+}
+
+return;
+
+refill_loop:
+/* Loop through packets from the start of the batch again. This is the
+ * exceptional case where packets aren't compatible with 'netdev_flags'. */
+DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, p, batch) {
+if (netdev_send_prepare_packet(netdev->ol_flags, p, &err_msg)) {
+dp_packet_batch_refill(batch, p, i);
 } else {
-dp_packet_delete(packet);
+dp_packet_delete(p);
 COVERAGE_INC(netdev_send_prepare_drops);
 VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
- netdev_get_name(netdev), errormsg);
-free(errormsg);
+ netdev_get_name(netdev), err_msg);
+free(err_msg);
 }
 }
 }
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 12/15] dpdk: Cache result of CPU ISA checks.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

As a small optimization, this patch caches the result of a CPU ISA
check from DPDK. Particularly in the case of running the DPCLS
autovalidator (which repeatedly probes subtables) this reduces
the amount of CPU ISA lookups from the DPDK level.

By caching them at the OVS/dpdk.c level, the ISA checks remain
runtime for the CPU where they are executed, but subsequent checks
for the same ISA feature become much cheaper.

Signed-off-by: Harry van Haaren 
Co-authored-by: Cian Ferriter 
Signed-off-by: Cian Ferriter 

---

v8: Add NEWS entry.
---
 NEWS   |  1 +
 lib/dpdk.c | 28 
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/NEWS b/NEWS
index 31def36b3..c3102427e 100644
--- a/NEWS
+++ b/NEWS
@@ -40,6 +40,7 @@ v2.15.0 - 15 Feb 2021
- DPDK:
  * Removed support for vhost-user dequeue zero-copy.
  * Add support for DPDK 20.11.
+ * Cache results for CPU ISA checks, reduces overhead on repeated lookups.
- Userspace datapath:
  * Add the 'pmd' option to "ovs-appctl dpctl/dump-flows", which
restricts a flow dump to a single PMD thread if set.
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 319540394..c883a4b8b 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -614,13 +614,33 @@ print_dpdk_version(void)
 puts(rte_version());
 }
 
+/* Avoid calling rte_cpu_get_flag_enabled() excessively, by caching the
+ * result of the call for each CPU flag in a static variable. To avoid
+ * allocating large numbers of static variables, use a uint8 as a bitfield.
+ * Note the macro must only return if the ISA check is done and available.
+ */
+#define ISA_CHECK_DONE_BIT (1 << 0)
+#define ISA_AVAILABLE_BIT  (1 << 1)
+
 #define CHECK_CPU_FEATURE(feature, name_str, RTE_CPUFLAG)   \
 do {\
 if (strncmp(feature, name_str, strlen(name_str)) == 0) {\
-int has_isa = rte_cpu_get_flag_enabled(RTE_CPUFLAG);\
-VLOG_DBG("CPU flag %s, available %s\n", name_str,   \
-  has_isa ? "yes" : "no");  \
-return true;\
+static uint8_t isa_check_##RTE_CPUFLAG; \
+int check = isa_check_##RTE_CPUFLAG & ISA_CHECK_DONE_BIT;   \
+if (OVS_UNLIKELY(!check)) { \
+int has_isa = rte_cpu_get_flag_enabled(RTE_CPUFLAG);\
+VLOG_DBG("CPU flag %s, available %s\n", \
+ name_str, has_isa ? "yes" : "no"); \
+isa_check_##RTE_CPUFLAG = ISA_CHECK_DONE_BIT;   \
+if (has_isa) {  \
+isa_check_##RTE_CPUFLAG |= ISA_AVAILABLE_BIT;   \
+}   \
+}   \
+if (isa_check_##RTE_CPUFLAG & ISA_AVAILABLE_BIT) {  \
+return true;\
+} else {\
+return false;   \
+}   \
 }   \
 } while (0)
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 14/15] dpif-netdev: Optimize dp output action.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit optimizes the output action, by enabling the compiler to
optimize the code better through reducing code complexity.

The core concept of this optimization is that the array-length checks
have already been performed above the copying code, so can be removed.
Removing of the per-packet length checks allows the compiler to auto-vectorize
the stores using SIMD registers.

Signed-off-by: Harry van Haaren 

---

v8: Add NEWS entry.
---
 NEWS  |  1 +
 lib/dpif-netdev.c | 23 ++-
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/NEWS b/NEWS
index 61f34ffc1..0fa195acd 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,7 @@ Post-v2.15.0
  * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if the
CPU supports it. This enhances performance by using the native vpopcount
instructions, instead of the emulated version of vpopcount.
+ * Optimize dp_netdev_output by enhancing compiler optimization potential.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2db04d63b..3d05a23a4 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -7285,12 +7285,25 @@ dp_execute_output_action(struct dp_netdev_pmd_thread 
*pmd,
 pmd->n_output_batches++;
 }
 
-struct dp_packet *packet;
-DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
-p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
-pmd->ctx.last_rxq;
-dp_packet_batch_add(&p->output_pkts, packet);
+/* The above checks ensure that there is enough space in the output batch.
+ * Using dp_packet_batch_add() has a branch to check if the batch is full.
+ * This branch reduces the compiler's ability to optimize efficiently. The
+ * below code implements packet movement between batches without checks,
+ * with the required semantics of output batch perhaps containing packets.
+ */
+int batch_size = dp_packet_batch_size(packets_);
+int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
+struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
+struct dp_packet_batch *output_batch = &p->output_pkts;
+
+for (int i = 0; i < batch_size; i++) {
+struct dp_packet *packet = packets_->packets[i];
+p->output_pkts_rxqs[out_batch_idx] = rxq;
+output_batch->packets[out_batch_idx] = packet;
+out_batch_idx++;
 }
+output_batch->count += batch_size;
+
 return true;
 }
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 13/15] dpcls-avx512: Enable avx512 vector popcount instruction.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit enables the AVX512-VPOPCNTDQ Vector Popcount
instruction. This instruction is not available on every CPU
that supports the AVX512-F Foundation ISA, hence it is enabled
only when the additional VPOPCNTDQ ISA check is passed.

The vector popcount instruction is used instead of the AVX512
popcount emulation code present in the avx512 optimized DPCLS today.
It provides higher performance in the SIMD miniflow processing
as that requires the popcount to calculate the miniflow block indexes.

Signed-off-by: Harry van Haaren 

---

v8: Add NEWS entry.
---
 NEWS   |  3 +
 lib/dpdk.c |  1 +
 lib/dpif-netdev-lookup-avx512-gather.c | 84 --
 3 files changed, 70 insertions(+), 18 deletions(-)

diff --git a/NEWS b/NEWS
index c3102427e..61f34ffc1 100644
--- a/NEWS
+++ b/NEWS
@@ -16,6 +16,9 @@ Post-v2.15.0
  * Enable AVX512 optimized DPCLS to search subtables with larger miniflows.
  * Add more specialized DPCLS subtables to cover common rules, enhancing
the lookup performance.
+ * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if the
+   CPU supports it. This enhances performance by using the native vpopcount
+   instructions, instead of the emulated version of vpopcount.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/lib/dpdk.c b/lib/dpdk.c
index c883a4b8b..a9494a40f 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -655,6 +655,7 @@ dpdk_get_cpu_has_isa(const char *arch, const char *feature)
 #if __x86_64__
 /* CPU flags only defined for the architecture that support it. */
 CHECK_CPU_FEATURE(feature, "avx512f", RTE_CPUFLAG_AVX512F);
+CHECK_CPU_FEATURE(feature, "avx512vpopcntdq", RTE_CPUFLAG_AVX512VPOPCNTDQ);
 CHECK_CPU_FEATURE(feature, "bmi2", RTE_CPUFLAG_BMI2);
 #endif
 
diff --git a/lib/dpif-netdev-lookup-avx512-gather.c 
b/lib/dpif-netdev-lookup-avx512-gather.c
index 7adf29914..c338c2fcd 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -53,6 +53,15 @@
 
 VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather);
 
+
+/* Wrapper function required to enable ISA. */
+static inline __m512i
+__attribute__((__target__("avx512vpopcntdq")))
+_mm512_popcnt_epi64_wrapper(__m512i v_in)
+{
+return _mm512_popcnt_epi64(v_in);
+}
+
 static inline __m512i
 _mm512_popcnt_epi64_manual(__m512i v_in)
 {
@@ -126,7 +135,8 @@ avx512_blocks_gather(__m512i v_u0, /* reg of u64 of all u0 
bits */
  __mmask64 u1_bcast_msk,  /* mask of u1 lanes */
  const uint64_t pkt_mf_u0_pop, /* num bits in u0 of pkt */
  __mmask64 zero_mask, /* maskz if pkt not have mf bit */
- __mmask64 u64_lanes_mask) /* total lane count to use */
+ __mmask64 u64_lanes_mask, /* total lane count to use */
+ const uint32_t use_vpop)  /* use AVX512 vpopcntdq */
 {
 /* Suggest to compiler to load tbl blocks ahead of gather(). */
 __m512i v_tbl_blocks = _mm512_maskz_loadu_epi64(u64_lanes_mask,
@@ -140,8 +150,15 @@ avx512_blocks_gather(__m512i v_u0, /* reg of u64 of all u0 
bits */
   tbl_mf_masks);
 __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_tbl_masks);
 
-/* Manual AVX512 popcount for u64 lanes. */
-__m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
+/* Calculate AVX512 popcount for u64 lanes using the native instruction
+ * if available, or using emulation if not available.
+ */
+__m512i v_popcnts;
+if (use_vpop) {
+v_popcnts = _mm512_popcnt_epi64_wrapper(v_masks);
+} else {
+v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
+}
 
 /* Add popcounts and offset for u1 bits. */
 __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_msk,
@@ -166,7 +183,8 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
const struct netdev_flow_key *keys[],
struct dpcls_rule **rules,
const uint32_t bit_count_u0,
-   const uint32_t bit_count_u1)
+   const uint32_t bit_count_u1,
+   const uint32_t use_vpop)
 {
 OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[BLOCKS_CACHE_SIZE];
 uint32_t hashes[NETDEV_MAX_BURST];
@@ -218,7 +236,8 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
 u1_bcast_mask,
 pkt_mf_u0_pop,
 zero_mask,
-bit_count_total_mask);
+bit_count_total_mask,
+use_vpop);
 _mm512_storeu_si512(&block_cache[i * MF_BLOCKS_PER_PACKET], v_bl

[ovs-dev] [v11 11/15] dpif-netdev/dpcls: Specialize more subtable signatures.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit adds more subtables to be specialized. The traffic
pattern here being matched is VXLAN traffic subtables, which commonly
have (5,3), (9,1) and (9,4) subtable fingerprints.

Signed-off-by: Harry van Haaren 

---

v8: Add NEWS entry.
---
 NEWS   | 2 ++
 lib/dpif-netdev-lookup-avx512-gather.c | 6 ++
 lib/dpif-netdev-lookup-generic.c   | 6 ++
 3 files changed, 14 insertions(+)

diff --git a/NEWS b/NEWS
index 26cfae908..31def36b3 100644
--- a/NEWS
+++ b/NEWS
@@ -14,6 +14,8 @@ Post-v2.15.0
packets. It supports partial HWOL, EMC, SMC and DPCLS lookups.
  * Add commands to get and set the dpif implementations.
  * Enable AVX512 optimized DPCLS to search subtables with larger miniflows.
+ * Add more specialized DPCLS subtables to cover common rules, enhancing
+   the lookup performance.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/lib/dpif-netdev-lookup-avx512-gather.c 
b/lib/dpif-netdev-lookup-avx512-gather.c
index 1bfabdcb1..7adf29914 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -299,6 +299,9 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
 return avx512_lookup_impl(subtable, keys_map, keys, rules, U0, U1);   \
 } \
 
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 4)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 1)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 3)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0)
@@ -331,6 +334,9 @@ dpcls_subtable_avx512_gather_probe(uint32_t u0_bits, 
uint32_t u1_bits)
 return NULL;
 }
 
+CHECK_LOOKUP_FUNCTION(9, 4);
+CHECK_LOOKUP_FUNCTION(9, 1);
+CHECK_LOOKUP_FUNCTION(5, 3);
 CHECK_LOOKUP_FUNCTION(5, 1);
 CHECK_LOOKUP_FUNCTION(4, 1);
 CHECK_LOOKUP_FUNCTION(4, 0);
diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index e3b6be4b6..6c74ac3a1 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -282,6 +282,9 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable 
*subtable,
 return lookup_generic_impl(subtable, keys_map, keys, rules, U0, U1);  \
 } \
 
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 4)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(9, 1)
+DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 3)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1)
 DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0)
@@ -303,6 +306,9 @@ dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t 
u1_bits)
 {
 dpcls_subtable_lookup_func f = NULL;
 
+CHECK_LOOKUP_FUNCTION(9, 4);
+CHECK_LOOKUP_FUNCTION(9, 1);
+CHECK_LOOKUP_FUNCTION(5, 3);
 CHECK_LOOKUP_FUNCTION(5, 1);
 CHECK_LOOKUP_FUNCTION(4, 1);
 CHECK_LOOKUP_FUNCTION(4, 0);
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 10/15] dpif-netdev/dpcls-avx512: Enable 16 block processing.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit implements larger subtable searches in avx512. A limitation
of the previous implementation was that up to 8 blocks of miniflow
data could be matched on (so a subtable with 8 blocks was handled
in avx, but 9 blocks or more would fall back to scalar/generic).
This limitation is removed in this patch, where up to 16 blocks
of subtable can be matched on.

>From an implementation perspective, the key to enabling 16 blocks
over 8 blocks was to do bitmask calculation up front, and then use
the pre-calculated bitmasks for 2x passes of the "blocks gather"
routine. The bitmasks need to be shifted for k-mask usage in the
upper (8-15) block range, but it is relatively trivial. This also
helps in case expanding to 24 blocks is desired in future.

The implementation of the 2nd iteration to handle > 8 blocks is
behind a conditional branch which checks the total number of bits.
This helps the specialized versions of the function that have a
miniflow fingerprint of less-than-or-equal 8 blocks, as the code
can be statically stripped out of those functions. Specialized
functions that do require more than 8 blocks will have the branch
removed and unconditionally execute the 2nd blocks gather routine.

Lastly, the _any() flavour will have the conditional branch, and
the branch predictor may mispredict a bit, but per burst will
likely get most packets correct (particularly towards the middle
and end of a burst).

The code has been run with unit tests under autovalidation and
passes all cases, and unit test coverage has been checked to
ensure the 16 block code paths are executing.

Signed-off-by: Harry van Haaren 

---

v9: Fixup post 2.15 rebase on NEWS
v8: Add NEWS entry
---
 NEWS   |   1 +
 lib/dpif-netdev-lookup-avx512-gather.c | 203 ++---
 2 files changed, 147 insertions(+), 57 deletions(-)

diff --git a/NEWS b/NEWS
index 71e7b9047..26cfae908 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,7 @@ Post-v2.15.0
  * Add avx512 implementation of dpif which can process non recirculated
packets. It supports partial HWOL, EMC, SMC and DPCLS lookups.
  * Add commands to get and set the dpif implementations.
+ * Enable AVX512 optimized DPCLS to search subtables with larger miniflows.
 
 
 v2.15.0 - 15 Feb 2021
diff --git a/lib/dpif-netdev-lookup-avx512-gather.c 
b/lib/dpif-netdev-lookup-avx512-gather.c
index 8fc1cdfa5..1bfabdcb1 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -34,7 +34,21 @@
  * AVX512 code at a time.
  */
 #define NUM_U64_IN_ZMM_REG (8)
-#define BLOCKS_CACHE_SIZE (NETDEV_MAX_BURST * NUM_U64_IN_ZMM_REG)
+
+/* This implementation of AVX512 gather allows up to 16 blocks of MF data to be
+ * present in the blocks_cache, hence the multiply by 2 in the blocks count.
+ */
+#define MF_BLOCKS_PER_PACKET (NUM_U64_IN_ZMM_REG * 2)
+
+/* Blocks cache size is the maximum number of miniflow blocks that this
+ * implementation of lookup can handle.
+ */
+#define BLOCKS_CACHE_SIZE (NETDEV_MAX_BURST * MF_BLOCKS_PER_PACKET)
+
+/* The gather instruction can handle a scale for the size of the items to
+ * gather. For uint64_t data, this scale is 8.
+ */
+#define GATHER_SCALE_8 (8)
 
 
 VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather);
@@ -69,22 +83,83 @@ netdev_rule_matches_key(const struct dpcls_rule *rule,
 {
 const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
 const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
-const uint32_t lane_mask = (1 << mf_bits_total) - 1;
+const uint32_t lane_mask = (1ULL << mf_bits_total) - 1;
 
 /* Always load a full cache line from blocks_cache. Other loads must be
  * trimmed to the amount of data required for mf_bits_total blocks.
  */
-__m512i v_blocks = _mm512_loadu_si512(&block_cache[0]);
-__m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask, &maskp[0]);
-__m512i v_key= _mm512_maskz_loadu_epi64(lane_mask, &keyp[0]);
+uint32_t res_mask;
+
+{
+__m512i v_blocks = _mm512_loadu_si512(&block_cache[0]);
+__m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask, &maskp[0]);
+__m512i v_key= _mm512_maskz_loadu_epi64(lane_mask, &keyp[0]);
+__m512i v_data = _mm512_and_si512(v_blocks, v_mask);
+res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask, v_data, v_key);
+}
 
-__m512i v_data = _mm512_and_si512(v_blocks, v_mask);
-uint32_t res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask, v_data, v_key);
+if (mf_bits_total > 8) {
+uint32_t lane_mask_gt8 = lane_mask >> 8;
+__m512i v_blocks = _mm512_loadu_si512(&block_cache[8]);
+__m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask_gt8, &maskp[8]);
+__m512i v_key= _mm512_maskz_loadu_epi64(lane_mask_gt8, &keyp[8]);
+__m512i v_data = _mm512_and_si512(v_blocks, v_mask);
+uint32_t c = _mm512_mask_cmpeq_epi64_mask(lane_mask_gt8, v_data,
+   

[ovs-dev] [v11 09/15] dpif-netdev/dpcls: Refactor function names to dpcls.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit refactors the function names from netdev_*
namespace to the dpcls_* namespace, as they are only used
by dpcls code. With the name change, it becomes more obvious
that the functions belong to dpcls functionality, and in the
dpif-netdev-private-dpcls.h header file.

Signed-off-by: Harry van Haaren 
---
 lib/dpif-netdev-private-dpcls.h |  6 ++
 lib/dpif-netdev.c   | 21 ++---
 2 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/lib/dpif-netdev-private-dpcls.h b/lib/dpif-netdev-private-dpcls.h
index f223a93e4..28c6a10ff 100644
--- a/lib/dpif-netdev-private-dpcls.h
+++ b/lib/dpif-netdev-private-dpcls.h
@@ -97,10 +97,8 @@ struct dpcls_subtable {
 
 /* Generates a mask for each bit set in the subtable's miniflow. */
 void
-netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl,
-  uint64_t *mf_masks,
-  const uint32_t mf_bits_u0,
-  const uint32_t mf_bits_u1);
+dpcls_flow_key_gen_masks(const struct netdev_flow_key *tbl, uint64_t *mf_masks,
+ const uint32_t mf_bits_u0, const uint32_t mf_bits_u1);
 
 /* Matches a dpcls rule against the incoming packet in 'target' */
 bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index b35ccbe3b..2db04d63b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -8309,7 +8309,7 @@ dpcls_create_subtable(struct dpcls *cls, const struct 
netdev_flow_key *mask)
 subtable->mf_bits_set_unit0 = unit0;
 subtable->mf_bits_set_unit1 = unit1;
 subtable->mf_masks = xmalloc(sizeof(uint64_t) * (unit0 + unit1));
-netdev_flow_key_gen_masks(mask, subtable->mf_masks, unit0, unit1);
+dpcls_flow_key_gen_masks(mask, subtable->mf_masks, unit0, unit1);
 
 /* Get the preferred subtable search function for this (u0,u1) subtable.
  * The function is guaranteed to always return a valid implementation, and
@@ -8484,11 +8484,10 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
 }
 }
 
-/* Inner loop for mask generation of a unit, see netdev_flow_key_gen_masks. */
+/* Inner loop for mask generation of a unit, see dpcls_flow_key_gen_masks. */
 static inline void
-netdev_flow_key_gen_mask_unit(uint64_t iter,
-  const uint64_t count,
-  uint64_t *mf_masks)
+dpcls_flow_key_gen_mask_unit(uint64_t iter, const uint64_t count,
+ uint64_t *mf_masks)
 {
 int i;
 for (i = 0; i < count; i++) {
@@ -8509,16 +8508,16 @@ netdev_flow_key_gen_mask_unit(uint64_t iter,
  * @param mf_bits_unit0 Number of bits set in unit0 of the miniflow
  */
 void
-netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl,
-  uint64_t *mf_masks,
-  const uint32_t mf_bits_u0,
-  const uint32_t mf_bits_u1)
+dpcls_flow_key_gen_masks(const struct netdev_flow_key *tbl,
+ uint64_t *mf_masks,
+ const uint32_t mf_bits_u0,
+ const uint32_t mf_bits_u1)
 {
 uint64_t iter_u0 = tbl->mf.map.bits[0];
 uint64_t iter_u1 = tbl->mf.map.bits[1];
 
-netdev_flow_key_gen_mask_unit(iter_u0, mf_bits_u0, &mf_masks[0]);
-netdev_flow_key_gen_mask_unit(iter_u1, mf_bits_u1, &mf_masks[mf_bits_u0]);
+dpcls_flow_key_gen_mask_unit(iter_u0, mf_bits_u0, &mf_masks[0]);
+dpcls_flow_key_gen_mask_unit(iter_u1, mf_bits_u1, &mf_masks[mf_bits_u0]);
 }
 
 /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 04/15] dpif-avx512: Add ISA implementation of dpif.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit adds the AVX512 implementation of DPIF functionality,
specifically the dp_netdev_input_outer_avx512 function. This function
only handles outer (no re-circulations), and is optimized to use the
AVX512 ISA for packet batching and other DPIF work.

Sparse is not able to handle the AVX512 intrinsics, causing compile
time failures, so it is disabled for this file.

Signed-off-by: Harry van Haaren 
Co-authored-by: Cian Ferriter 
Signed-off-by: Cian Ferriter 

---

v8:
- Fixup AVX512 mask to uint32_t conversion compilation warning.
---
 lib/automake.mk  |   5 +-
 lib/dpif-netdev-avx512.c | 265 +++
 lib/dpif-netdev-private-dfc.h|   8 +
 lib/dpif-netdev-private-dpif.h   |  32 
 lib/dpif-netdev-private-thread.h |  11 +-
 lib/dpif-netdev-private.h|  25 +++
 lib/dpif-netdev.c|  70 ++--
 7 files changed, 400 insertions(+), 16 deletions(-)
 create mode 100644 lib/dpif-netdev-avx512.c
 create mode 100644 lib/dpif-netdev-private-dpif.h

diff --git a/lib/automake.mk b/lib/automake.mk
index 0bef0cc69..5fab8ba4f 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -33,11 +33,13 @@ lib_libopenvswitchavx512_la_CFLAGS = \
-mavx512f \
-mavx512bw \
-mavx512dq \
+   -mbmi \
-mbmi2 \
-fPIC \
$(AM_CFLAGS)
 lib_libopenvswitchavx512_la_SOURCES = \
-   lib/dpif-netdev-lookup-avx512-gather.c
+   lib/dpif-netdev-lookup-avx512-gather.c \
+   lib/dpif-netdev-avx512.c
 lib_libopenvswitchavx512_la_LDFLAGS = \
-static
 endif
@@ -113,6 +115,7 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpif-netdev.h \
lib/dpif-netdev-private-dfc.h \
lib/dpif-netdev-private-dpcls.h \
+   lib/dpif-netdev-private-dpif.h \
lib/dpif-netdev-private-flow.h \
lib/dpif-netdev-private-hwol.h \
lib/dpif-netdev-private-thread.h \
diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c
new file mode 100644
index 0..91f51c479
--- /dev/null
+++ b/lib/dpif-netdev-avx512.c
@@ -0,0 +1,265 @@
+/*
+ * Copyright (c) 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifdef __x86_64__
+/* Sparse cannot handle the AVX512 instructions. */
+#if !defined(__CHECKER__)
+
+#include 
+
+#include "dpif-netdev.h"
+#include "dpif-netdev-perf.h"
+
+#include "dpif-netdev-private.h"
+#include "dpif-netdev-private-dpcls.h"
+#include "dpif-netdev-private-flow.h"
+#include "dpif-netdev-private-thread.h"
+
+#include "dp-packet.h"
+#include "netdev.h"
+
+#include "immintrin.h"
+
+/* Structure to contain per-packet metadata that must be attributed to the
+ * dp netdev flow. This is unfortunate to have to track per packet, however
+ * it's a bit awkward to maintain them in a performant way. This structure
+ * helps to keep two variables on a single cache line per packet.
+ */
+struct pkt_flow_meta {
+uint16_t bytes;
+uint16_t tcp_flags;
+};
+
+/* Structure of heap allocated memory for DPIF internals. */
+struct dpif_userdata {
+OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
+struct netdev_flow_key keys[NETDEV_MAX_BURST];
+OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
+struct netdev_flow_key *key_ptrs[NETDEV_MAX_BURST];
+OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
+struct pkt_flow_meta pkt_meta[NETDEV_MAX_BURST];
+};
+
+int32_t
+dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
+ struct dp_packet_batch *packets,
+ odp_port_t in_port)
+{
+/* Allocate DPIF userdata. */
+if (OVS_UNLIKELY(!pmd->netdev_input_func_userdata)) {
+pmd->netdev_input_func_userdata =
+xmalloc_pagealign(sizeof(struct dpif_userdata));
+}
+
+struct dpif_userdata *ud = pmd->netdev_input_func_userdata;
+struct netdev_flow_key *keys = ud->keys;
+struct netdev_flow_key **key_ptrs = ud->key_ptrs;
+struct pkt_flow_meta *pkt_meta = ud->pkt_meta;
+
+/* The AVX512 DPIF implementation handles rules in a way that is optimized
+ * for reducing data-movement between HWOL/EMC/SMC and DPCLS. This is
+ * achieved by separating the rule arrays. Bitmasks are kept for each
+ * packet, indicating if it matched in the HWOL/EMC/SMC array or DPCLS
+ * array. Later the two arrays are merged by AVX-512 expand instructions.
+ */
+
+/* Stores the computed output: a rule pointer for each packet. */
+struct

[ovs-dev] [v11 06/15] dpif-netdev: Add command to switch dpif implementation.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit adds a new command to allow the user to switch
the active DPIF implementation at runtime. A probe function
is executed before switching the DPIF implementation, to ensure
the CPU is capable of running the ISA required. For example, the
below code will switch to the AVX512 enabled DPIF assuming
that the runtime CPU is capable of running AVX512 instructions:

 $ ovs-appctl dpif-netdev/dpif-set dpif_avx512

A new configuration flag is added to allow selection of the
default DPIF. This is useful for running the unit-tests against
the available DPIF implementations, without modifying each unit test.

The design of the testing & validation for ISA optimized DPIF
implementations is based around the work already upstream for DPCLS.
Note however that a DPCLS lookup has no state or side-effects, allowing
the auto-validator implementation to perform multiple lookups and
provide consistent statistic counters.

The DPIF component does have state, so running two implementations in
parallel and comparing output is not a valid testing method, as there
are changes in DPIF statistic counters (side effects). As a result, the
DPIF is tested directly against the unit-tests.

Signed-off-by: Harry van Haaren 
Co-authored-by: Cian Ferriter 
Signed-off-by: Cian Ferriter 

---

v11:
- Improve the dp_netdev_impl_get_default() function so PMD threads created
  after running "dpif-set" command will use the DPIF implementation that was
  set.
---
 acinclude.m4 |  15 +
 configure.ac |   1 +
 lib/automake.mk  |   1 +
 lib/dpif-netdev-avx512.c |  14 +
 lib/dpif-netdev-private-dpif.c   | 103 +++
 lib/dpif-netdev-private-dpif.h   |  47 +-
 lib/dpif-netdev-private-thread.h |  12 +---
 lib/dpif-netdev.c|  89 --
 8 files changed, 266 insertions(+), 16 deletions(-)
 create mode 100644 lib/dpif-netdev-private-dpif.c

diff --git a/acinclude.m4 b/acinclude.m4
index 15a54d636..5fbcd9872 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -30,6 +30,21 @@ AC_DEFUN([OVS_CHECK_DPCLS_AUTOVALIDATOR], [
   fi
 ])
 
+dnl Set OVS DPIF default implementation at configure time for running the unit
+dnl tests on the whole codebase without modifying tests per DPIF impl
+AC_DEFUN([OVS_CHECK_DPIF_AVX512_DEFAULT], [
+  AC_ARG_ENABLE([dpif-default-avx512],
+[AC_HELP_STRING([--enable-dpif-default-avx512], [Enable DPIF 
AVX512 implementation as default.])],
+[dpifavx512=yes],[dpifavx512=no])
+  AC_MSG_CHECKING([whether DPIF AVX512 is default implementation])
+  if test "$dpifavx512" != yes; then
+AC_MSG_RESULT([no])
+  else
+OVS_CFLAGS="$OVS_CFLAGS -DDPIF_AVX512_DEFAULT"
+AC_MSG_RESULT([yes])
+  fi
+])
+
 dnl OVS_ENABLE_WERROR
 AC_DEFUN([OVS_ENABLE_WERROR],
   [AC_ARG_ENABLE(
diff --git a/configure.ac b/configure.ac
index c077034d4..e45685a6c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -185,6 +185,7 @@ OVS_ENABLE_WERROR
 OVS_ENABLE_SPARSE
 OVS_CTAGS_IDENTIFIERS
 OVS_CHECK_DPCLS_AUTOVALIDATOR
+OVS_CHECK_DPIF_AVX512_DEFAULT
 OVS_CHECK_BINUTILS_AVX512
 
 AC_ARG_VAR(KARCH, [Kernel Architecture String])
diff --git a/lib/automake.mk b/lib/automake.mk
index 5fab8ba4f..6279662f8 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -115,6 +115,7 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpif-netdev.h \
lib/dpif-netdev-private-dfc.h \
lib/dpif-netdev-private-dpcls.h \
+   lib/dpif-netdev-private-dpif.c \
lib/dpif-netdev-private-dpif.h \
lib/dpif-netdev-private-flow.h \
lib/dpif-netdev-private-hwol.h \
diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c
index ed79df255..c23ac0f82 100644
--- a/lib/dpif-netdev-avx512.c
+++ b/lib/dpif-netdev-avx512.c
@@ -19,6 +19,7 @@
 #if !defined(__CHECKER__)
 
 #include 
+#include 
 
 #include "dpif-netdev.h"
 #include "dpif-netdev-perf.h"
@@ -54,6 +55,19 @@ struct dpif_userdata {
 struct pkt_flow_meta pkt_meta[NETDEV_MAX_BURST];
 };
 
+int32_t
+dp_netdev_input_outer_avx512_probe(void)
+{
+int avx512f_available = dpdk_get_cpu_has_isa("x86_64", "avx512f");
+int bmi2_available = dpdk_get_cpu_has_isa("x86_64", "bmi2");
+
+if (!avx512f_available || !bmi2_available) {
+return -ENOTSUP;
+}
+
+return 0;
+}
+
 int32_t
 dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
  struct dp_packet_batch *packets,
diff --git a/lib/dpif-netdev-private-dpif.c b/lib/dpif-netdev-private-dpif.c
new file mode 100644
index 0..e417fa86d
--- /dev/null
+++ b/lib/dpif-netdev-private-dpif.c
@@ -0,0 +1,103 @@
+/*
+ * Copyright (c) 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Un

[ovs-dev] [v11 07/15] dpif-netdev: Add command to get dpif implementations.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit adds a new command to retrieve the list of available
DPIF implementations. This can be used by to check what implementations
of the DPIF are available in any given OVS binary.

Usage:
 $ ovs-appctl dpif-netdev/dpif-get

Signed-off-by: Harry van Haaren 
---
 lib/dpif-netdev-private-dpif.c |  8 
 lib/dpif-netdev-private-dpif.h |  6 ++
 lib/dpif-netdev.c  | 24 
 3 files changed, 38 insertions(+)

diff --git a/lib/dpif-netdev-private-dpif.c b/lib/dpif-netdev-private-dpif.c
index e417fa86d..9ea038748 100644
--- a/lib/dpif-netdev-private-dpif.c
+++ b/lib/dpif-netdev-private-dpif.c
@@ -73,6 +73,14 @@ dp_netdev_impl_set_default(dp_netdev_input_func func)
 default_dpif_func = func;
 }
 
+uint32_t
+dp_netdev_impl_get(const struct dpif_netdev_impl_info_t **out_impls)
+{
+ovs_assert(out_impls);
+*out_impls = dpif_impls;
+return ARRAY_SIZE(dpif_impls);
+}
+
 /* This function checks all available DPIF implementations, and selects the
  * returns the function pointer to the one requested by "name".
  */
diff --git a/lib/dpif-netdev-private-dpif.h b/lib/dpif-netdev-private-dpif.h
index fb5380d2c..3bd67cbcd 100644
--- a/lib/dpif-netdev-private-dpif.h
+++ b/lib/dpif-netdev-private-dpif.h
@@ -47,6 +47,12 @@ struct dpif_netdev_impl_info_t {
 const char *name;
 };
 
+/* This function returns all available implementations to the caller. The
+ * quantity of implementations is returned by the int return value.
+ */
+uint32_t
+dp_netdev_impl_get(const struct dpif_netdev_impl_info_t **out_impls);
+
 /* This function checks all available DPIF implementations, and selects the
  * returns the function pointer to the one requested by "name".
  */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 2dfe9003e..b35ccbe3b 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -993,6 +993,27 @@ dpif_netdev_subtable_lookup_set(struct unixctl_conn *conn, 
int argc,
 ds_destroy(&reply);
 }
 
+static void
+dpif_netdev_impl_get(struct unixctl_conn *conn, int argc OVS_UNUSED,
+ const char *argv[] OVS_UNUSED, void *aux OVS_UNUSED)
+{
+const struct dpif_netdev_impl_info_t *dpif_impls;
+uint32_t count = dp_netdev_impl_get(&dpif_impls);
+if (count == 0) {
+unixctl_command_reply_error(conn, "error getting dpif names");
+return;
+}
+
+/* Add all dpif functions to reply string. */
+struct ds reply = DS_EMPTY_INITIALIZER;
+ds_put_cstr(&reply, "Available DPIF implementations:\n");
+for (uint32_t i = 0; i < count; i++) {
+ds_put_format(&reply, "  %s\n", dpif_impls[i].name);
+}
+unixctl_command_reply(conn, ds_cstr(&reply));
+ds_destroy(&reply);
+}
+
 static void
 dpif_netdev_impl_set(struct unixctl_conn *conn, int argc,
  const char *argv[], void *aux OVS_UNUSED)
@@ -1294,6 +1315,9 @@ dpif_netdev_init(void)
  "dpif_implementation_name [dp]",
  1, 2, dpif_netdev_impl_set,
  NULL);
+unixctl_command_register("dpif-netdev/dpif-get", "",
+ 0, 0, dpif_netdev_impl_get,
+ NULL);
 return 0;
 }
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 08/15] docs/dpdk/bridge: Add dpif performance section.

2021-04-27 Thread Cian Ferriter
This section details how two new commands can be used to list and select
the different dpif implementations. It also details how a non default
dpif implementation can be tested with the OVS unit test suite.

Add NEWS updates for the dpif-netdev.c refactor and the new dpif
implementations/commands.

Signed-off-by: Cian Ferriter 

---

v8:
- Merge NEWS file items into one Userspace Datapath: heading
---
 Documentation/topics/dpdk/bridge.rst | 37 
 NEWS |  4 +++
 2 files changed, 41 insertions(+)

diff --git a/Documentation/topics/dpdk/bridge.rst 
b/Documentation/topics/dpdk/bridge.rst
index 526d5c959..ca90d7bdb 100644
--- a/Documentation/topics/dpdk/bridge.rst
+++ b/Documentation/topics/dpdk/bridge.rst
@@ -214,3 +214,40 @@ implementation ::
 
 Compile OVS in debug mode to have `ovs_assert` statements error out if
 there is a mis-match in the DPCLS lookup implementation.
+
+Datapath Interface Performance
+--
+
+The datapath interface (DPIF) or dp_netdev_input() is responsible for taking
+packets through the major components of the userspace datapath; such as
+miniflow_extract, EMC, SMC and DPCLS lookups, and a lot of the performance
+stats associated with the datapath.
+
+Just like with the SIMD DPCLS work above, SIMD can be applied to the DPIF to
+improve performance.
+
+OVS provides multiple implementations of the DPIF. These can be listed with the
+following command ::
+
+$ ovs-appctl dpif-netdev/dpif-get
+Available DPIF implementations:
+  dpif_scalar
+  dpif_avx512
+
+By default, dpif_scalar is used. The DPIF implementation can be selected by
+name ::
+
+$ ovs-appctl dpif-netdev/dpif-set dpif_avx512
+DPIF implementation set to dpif_avx512.
+
+$ ovs-appctl dpif-netdev/dpif-set dpif_scalar
+DPIF implementation set to dpif_scalar.
+
+Running Unit Tests with AVX512 DPIF
+~~~
+
+Since the AVX512 DPIF is disabled by default, a compile time option is
+available in order to test it with the OVS unit test suite. When building with
+a CPU that supports AVX512, use the following configure option ::
+
+$ ./configure --enable-dpif-default-avx512
diff --git a/NEWS b/NEWS
index 95cf922aa..71e7b9047 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,10 @@ Post-v2.15.0
  * New option '--no-record-hostname' to disable hostname configuration
in ovsdb on startup.
  * New command 'record-hostname-if-not-set' to update hostname in ovsdb.
+ * Refactor lib/dpif-netdev.c to multiple header files.
+ * Add avx512 implementation of dpif which can process non recirculated
+   packets. It supports partial HWOL, EMC, SMC and DPCLS lookups.
+ * Add commands to get and set the dpif implementations.
 
 
 v2.15.0 - 15 Feb 2021
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 05/15] dpif-avx512: Add HWOL support to avx512 dpif.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

Partial hardware offload is implemented in a very similar way to the
scalar dpif.

Signed-off-by: Harry van Haaren 
---
 lib/dpif-netdev-avx512.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c
index 91f51c479..ed79df255 100644
--- a/lib/dpif-netdev-avx512.c
+++ b/lib/dpif-netdev-avx512.c
@@ -27,6 +27,7 @@
 #include "dpif-netdev-private-dpcls.h"
 #include "dpif-netdev-private-flow.h"
 #include "dpif-netdev-private-thread.h"
+#include "dpif-netdev-private-hwol.h"
 
 #include "dp-packet.h"
 #include "netdev.h"
@@ -112,9 +113,32 @@ dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread 
*pmd,
 uint32_t i = __builtin_ctz(iter);
 iter = _blsr_u64(iter);
 
-/* Initialize packet md and do miniflow extract. */
+/* Get packet pointer from bitmask and packet md. */
 struct dp_packet *packet = packets->packets[i];
 pkt_metadata_init(&packet->md, in_port);
+
+struct dp_netdev_flow *f = NULL;
+
+/* Check for partial hardware offload mark */
+uint32_t mark;
+if (dp_packet_has_flow_mark(packet, &mark)) {
+f = mark_to_flow_find(pmd, mark);
+if (f) {
+rules[i] = &f->cr;
+
+/* This is nasty - instead of using the HWOL provided flow,
+ * parse the packet data anyway to find the location of the TCP
+ * header to extract the TCP flags for the rule.
+ */
+pkt_meta[i].tcp_flags = parse_tcp_flags(packet);
+
+pkt_meta[i].bytes = dp_packet_size(packet);
+hwol_emc_smc_hitmask |= (1 << i);
+continue;
+}
+}
+
+/* Do miniflow extract into keys */
 struct netdev_flow_key *key = &keys[i];
 miniflow_extract(packet, &key->mf);
 
@@ -125,8 +149,6 @@ dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread 
*pmd,
 key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
 key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet, &key->mf);
 
-struct dp_netdev_flow *f = NULL;
-
 if (emc_enabled) {
 f = emc_lookup(&cache->emc_cache, key);
 
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 03/15] dpif-netdev: Add function pointer for netdev input.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit adds a function pointer to the pmd thread data structure,
giving the pmd thread flexibility in its dpif-input function choice.
This allows choosing of the implementation based on ISA capabilities
of the runtime CPU, leading to optimizations and higher performance.

Signed-off-by: Harry van Haaren 
---
 lib/dpif-netdev-private-thread.h | 12 
 lib/dpif-netdev.c|  7 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-thread.h
index 5e5308b96..01a28a681 100644
--- a/lib/dpif-netdev-private-thread.h
+++ b/lib/dpif-netdev-private-thread.h
@@ -47,6 +47,13 @@ struct dp_netdev_pmd_thread_ctx {
 uint32_t emc_insert_min;
 };
 
+/* Forward declaration for typedef. */
+struct dp_netdev_pmd_thread;
+
+typedef void (*dp_netdev_input_func)(struct dp_netdev_pmd_thread *pmd,
+ struct dp_packet_batch *packets,
+ odp_port_t port_no);
+
 /* PMD: Poll modes drivers.  PMD accesses devices via polling to eliminate
  * the performance overhead of interrupt processing.  Therefore netdev can
  * not implement rx-wait for these devices.  dpif-netdev needs to poll
@@ -101,6 +108,11 @@ struct dp_netdev_pmd_thread {
 /* Current context of the PMD thread. */
 struct dp_netdev_pmd_thread_ctx ctx;
 
+/* Function pointer to call for dp_netdev_input() functionality. */
+dp_netdev_input_func netdev_input_func;
+/* Pointer for per-DPIF implementation scratch space. */
+void *netdev_input_func_userdata;
+
 struct seq *reload_seq;
 uint64_t last_reload_seq;
 
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 88f37c505..bec984643 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4234,8 +4234,9 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread 
*pmd,
 }
 }
 }
+
 /* Process packet batch. */
-dp_netdev_input(pmd, &batch, port_no);
+pmd->netdev_input_func(pmd, &batch, port_no);
 
 /* Assign processing cycles to rx queue. */
 cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
@@ -6033,6 +6034,10 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread 
*pmd, struct dp_netdev *dp,
 hmap_init(&pmd->tnl_port_cache);
 hmap_init(&pmd->send_port_cache);
 cmap_init(&pmd->tx_bonds);
+
+/* Initialize the DPIF function pointer to the default scalar version. */
+pmd->netdev_input_func = dp_netdev_input;
+
 /* init the 'flow_cache' since there is no
  * actual thread created for NON_PMD_CORE_ID. */
 if (core_id == NON_PMD_CORE_ID) {
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [v11 02/15] dpif-netdev: Split HWOL out to own header file.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

This commit moves the datapath lookup functions required for
hardware offload to a seperate file. This allows other DPIF
implementations to access the lookup functions, encouraging
code reuse.

Signed-off-by: Harry van Haaren 
---
 lib/automake.mk|  1 +
 lib/dpif-netdev-private-hwol.h | 63 ++
 lib/dpif-netdev.c  | 39 ++---
 3 files changed, 67 insertions(+), 36 deletions(-)
 create mode 100644 lib/dpif-netdev-private-hwol.h

diff --git a/lib/automake.mk b/lib/automake.mk
index 9fa8712c3..0bef0cc69 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -114,6 +114,7 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpif-netdev-private-dfc.h \
lib/dpif-netdev-private-dpcls.h \
lib/dpif-netdev-private-flow.h \
+   lib/dpif-netdev-private-hwol.h \
lib/dpif-netdev-private-thread.h \
lib/dpif-netdev-private.h \
lib/dpif-netdev-perf.c \
diff --git a/lib/dpif-netdev-private-hwol.h b/lib/dpif-netdev-private-hwol.h
new file mode 100644
index 0..b93297a74
--- /dev/null
+++ b/lib/dpif-netdev-private-hwol.h
@@ -0,0 +1,63 @@
+/*
+ * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2015 Nicira, Inc.
+ * Copyright (c) 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef DPIF_NETDEV_PRIVATE_HWOL_H
+#define DPIF_NETDEV_PRIVATE_HWOL_H 1
+
+#include "dpif-netdev-private-flow.h"
+
+#define MAX_FLOW_MARK   (UINT32_MAX - 1)
+#define INVALID_FLOW_MARK   0
+/* Zero flow mark is used to indicate the HW to remove the mark. A packet
+ * marked with zero mark is received in SW without a mark at all, so it
+ * cannot be used as a valid mark.
+ */
+
+struct megaflow_to_mark_data {
+const struct cmap_node node;
+ovs_u128 mega_ufid;
+uint32_t mark;
+};
+
+struct flow_mark {
+struct cmap megaflow_to_mark;
+struct cmap mark_to_flow;
+struct id_pool *pool;
+};
+
+/* allocated in dpif-netdev.c */
+extern struct flow_mark flow_mark;
+
+static inline struct dp_netdev_flow *
+mark_to_flow_find(const struct dp_netdev_pmd_thread *pmd,
+  const uint32_t mark)
+{
+struct dp_netdev_flow *flow;
+
+CMAP_FOR_EACH_WITH_HASH (flow, mark_node, hash_int(mark, 0),
+ &flow_mark.mark_to_flow) {
+if (flow->mark == mark && flow->pmd_id == pmd->core_id &&
+flow->dead == false) {
+return flow;
+}
+}
+
+return NULL;
+}
+
+
+#endif /* dpif-netdev-private-hwol.h */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 298bfe444..88f37c505 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -84,6 +84,8 @@
 #include "util.h"
 #include "uuid.h"
 
+#include "dpif-netdev-private-hwol.h"
+
 VLOG_DEFINE_THIS_MODULE(dpif_netdev);
 
 /* Auto Load Balancing Defaults */
@@ -1954,26 +1956,8 @@ dp_netdev_pmd_find_dpcls(struct dp_netdev_pmd_thread 
*pmd,
 return cls;
 }
 
-#define MAX_FLOW_MARK   (UINT32_MAX - 1)
-#define INVALID_FLOW_MARK   0
-/* Zero flow mark is used to indicate the HW to remove the mark. A packet
- * marked with zero mark is received in SW without a mark at all, so it
- * cannot be used as a valid mark.
- */
-
-struct megaflow_to_mark_data {
-const struct cmap_node node;
-ovs_u128 mega_ufid;
-uint32_t mark;
-};
-
-struct flow_mark {
-struct cmap megaflow_to_mark;
-struct cmap mark_to_flow;
-struct id_pool *pool;
-};
 
-static struct flow_mark flow_mark = {
+struct flow_mark flow_mark = {
 .megaflow_to_mark = CMAP_INITIALIZER,
 .mark_to_flow = CMAP_INITIALIZER,
 };
@@ -2142,23 +2126,6 @@ flow_mark_flush(struct dp_netdev_pmd_thread *pmd)
 }
 }
 
-static struct dp_netdev_flow *
-mark_to_flow_find(const struct dp_netdev_pmd_thread *pmd,
-  const uint32_t mark)
-{
-struct dp_netdev_flow *flow;
-
-CMAP_FOR_EACH_WITH_HASH (flow, mark_node, hash_int(mark, 0),
- &flow_mark.mark_to_flow) {
-if (flow->mark == mark && flow->pmd_id == pmd->core_id &&
-flow->dead == false) {
-return flow;
-}
-}
-
-return NULL;
-}
-
 static struct dp_flow_offload_item *
 dp_netdev_alloc_flow_offload(struct dp_netdev_pmd_thread *pmd,
  struct dp_netdev_flow *flow,
-- 
2.31.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mail

[ovs-dev] [v11 01/15] dpif-netdev: Refactor to multiple header files.

2021-04-27 Thread Cian Ferriter
From: Harry van Haaren 

Split the very large file dpif-netdev.c and the datastructures
it contains into multiple header files. Each header file is
responsible for the datastructures of that component.

This logical split allows better reuse and modularity of the code,
and reduces the very large file dpif-netdev.c to be more managable.

Due to dependencies between components, it is not possible to
move component in smaller granularities than this patch.

To explain the dependencies better, eg:

DPCLS has no deps (from dpif-netdev.c file)
FLOW depends on DPCLS (struct dpcls_rule)
DFC depends on DPCLS (netdev_flow_key) and FLOW (netdev_flow_key)
THREAD depends on DFC (struct dfc_cache)

DFC_PROC depends on THREAD (struct pmd_thread)

DPCLS lookup.h/c require only DPCLS
DPCLS implementations require only dpif-netdev-lookup.h.
- This change was made in 2.12 release with function pointers
- This commit only refactors the name to "private-dpcls.h"

Signed-off-by: Harry van Haaren 
Co-authored-by: Cian Ferriter 
Signed-off-by: Cian Ferriter 
---
 lib/automake.mk|   4 +
 lib/dpif-netdev-lookup-autovalidator.c |   1 -
 lib/dpif-netdev-lookup-avx512-gather.c |   1 -
 lib/dpif-netdev-lookup-generic.c   |   1 -
 lib/dpif-netdev-lookup.h   |   2 +-
 lib/dpif-netdev-private-dfc.h  | 244 
 lib/dpif-netdev-private-dpcls.h| 129 ++
 lib/dpif-netdev-private-flow.h | 162 
 lib/dpif-netdev-private-thread.h   | 206 ++
 lib/dpif-netdev-private.h  | 100 +
 lib/dpif-netdev.c  | 519 +
 11 files changed, 760 insertions(+), 609 deletions(-)
 create mode 100644 lib/dpif-netdev-private-dfc.h
 create mode 100644 lib/dpif-netdev-private-dpcls.h
 create mode 100644 lib/dpif-netdev-private-flow.h
 create mode 100644 lib/dpif-netdev-private-thread.h

diff --git a/lib/automake.mk b/lib/automake.mk
index 39901bd6d..9fa8712c3 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -111,6 +111,10 @@ lib_libopenvswitch_la_SOURCES = \
lib/dpif-netdev-lookup-generic.c \
lib/dpif-netdev.c \
lib/dpif-netdev.h \
+   lib/dpif-netdev-private-dfc.h \
+   lib/dpif-netdev-private-dpcls.h \
+   lib/dpif-netdev-private-flow.h \
+   lib/dpif-netdev-private-thread.h \
lib/dpif-netdev-private.h \
lib/dpif-netdev-perf.c \
lib/dpif-netdev-perf.h \
diff --git a/lib/dpif-netdev-lookup-autovalidator.c 
b/lib/dpif-netdev-lookup-autovalidator.c
index 97b59fdd0..475e1ab1e 100644
--- a/lib/dpif-netdev-lookup-autovalidator.c
+++ b/lib/dpif-netdev-lookup-autovalidator.c
@@ -17,7 +17,6 @@
 #include 
 #include "dpif-netdev.h"
 #include "dpif-netdev-lookup.h"
-#include "dpif-netdev-private.h"
 #include "openvswitch/vlog.h"
 
 VLOG_DEFINE_THIS_MODULE(dpif_lookup_autovalidator);
diff --git a/lib/dpif-netdev-lookup-avx512-gather.c 
b/lib/dpif-netdev-lookup-avx512-gather.c
index 5e3634249..8fc1cdfa5 100644
--- a/lib/dpif-netdev-lookup-avx512-gather.c
+++ b/lib/dpif-netdev-lookup-avx512-gather.c
@@ -21,7 +21,6 @@
 
 #include "dpif-netdev.h"
 #include "dpif-netdev-lookup.h"
-#include "dpif-netdev-private.h"
 #include "cmap.h"
 #include "flow.h"
 #include "pvector.h"
diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index b1a0cfc36..e3b6be4b6 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -17,7 +17,6 @@
 
 #include 
 #include "dpif-netdev.h"
-#include "dpif-netdev-private.h"
 #include "dpif-netdev-lookup.h"
 
 #include "bitmap.h"
diff --git a/lib/dpif-netdev-lookup.h b/lib/dpif-netdev-lookup.h
index bd72aa29b..59f51faa0 100644
--- a/lib/dpif-netdev-lookup.h
+++ b/lib/dpif-netdev-lookup.h
@@ -19,7 +19,7 @@
 
 #include 
 #include "dpif-netdev.h"
-#include "dpif-netdev-private.h"
+#include "dpif-netdev-private-dpcls.h"
 
 /* Function to perform a probe for the subtable bit fingerprint.
  * Returns NULL if not valid, or a valid function pointer to call for this
diff --git a/lib/dpif-netdev-private-dfc.h b/lib/dpif-netdev-private-dfc.h
new file mode 100644
index 0..52349a3fc
--- /dev/null
+++ b/lib/dpif-netdev-private-dfc.h
@@ -0,0 +1,244 @@
+/*
+ * Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2015 Nicira, Inc.
+ * Copyright (c) 2019, 2020, 2021 Intel Corporation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef DPIF_NETDEV_PRIVATE_DFC_H
+#define DPIF_NETDE

[ovs-dev] [v11 00/15] DPIF Framework + Optimizations

2021-04-27 Thread Cian Ferriter
v11 Summary:
- Improve the dp_netdev_impl_get_default() function so PMD threads created
  after running "dpif-set" command will use the DPIF implementation that was
  set.
- Fix small comment formatting issues.

v10 Summary:
- Removed AVX512 POC work for DPIF and MFEX which was added in v9
-- MFEX patches will be sent separately
- Rebase additions to NEWS entries
- Update copyright notices

v9 Summary:
- Added AVX512 POC work for DPIF and MFEX in single patch at end
-- Note that the AVX512 MFEX is for Ether()/IP()/UDP() traffic.
-- A significant performance boost is possible with these optimizations.

v8 Summary:
- Added NEWS entries for significant changes
- Added scalar optimizations for datapath TX
- Patchset is now ready for merge in my opinion.

v7 summary:
- OVS Conference included DPIF overview, youtube link:
--- https://youtu.be/5dWyPxiXEhg
- Rebased and tested on the DPDK 20.11 v4 patch
--- Link: https://patchwork.ozlabs.org/project/openvswitch/list/?series=220645
--- Tested this series for shared/static builds
--- Tested this series with/without -march=
- Minor code improvements in DPIF component (see commits for details)
- Improved CPU ISA checks, caching results
- Commit message improvements (.'s etc)
- Added performance data of patchset
--- Note that the benchmark below does not utilize the AVX512-vpopcntdq
--- optimizations, and performance is expected to improve when used.
--- Further optimizations are planned that continue.

Benchmark Details & Results
===

Intel® Xeon® Gold 6230 CPU @2.10GHz
OVS*-DPDK* Phy-Phy Performance 4x 25G Ports - Total 1 million flows
1C1T-4P, 64-byte frame size, performance in mpps:

Results Table:
---
DPIF  | Scalar | Scalar | AVX512 | AVX512 |
DPCLS | Scalar | AVX512 | Scalar | AVX512 |
---
mpps  |  6.955 |  7.530 |  7.530 |  7.962 |

By enabling both AVX512 DPIF and DPCLS, packet forwarding
is  7.962 / 6.955 = 1.1447x faster, aka 14% speedup.



v6 summary:
- Rebase to DPDK 20.11 enabling patch
--- This creates a dependency, expect CI build failures on the last
patch in this series if it is not applied!
- Small improvements to DPIF layer
--- EMC/SMC enabling in AVX512 DPIF cleanups
- CPU ISA flags are cached, lowering overhead
- Wilcard Classifier DPCLS
--- Refactor and cleanups for function names
--- Enable more subtable specializations
--- Enable AVX512 vpopcount instruction


v5 summary:
- Dropped MFEX optimizations, re-targetting to a later release
--- This allows focus of community reviews & development on DPIF
--- Note OVS Conference talk still introduces both DPIF and MFEX topics
- DPIF improvements
--- Better EMC/SMC handling
--- HWOL is enabled in the avx512 DPIF
--- Documentation & NEWS items added
--- Various smaller improvements

v4 summary:
- Updated and improve DPIF component
--- SMC now implemented
--- EMC handling improved
--- Novel batching method using AVX512 implemented
--- see commits for details
- Updated Miniflow Extract component
--- Improved AVX512 code path performance
--- Implemented multiple TODO item's in v3
--- Add "disable" implementation to return to scalar miniflow only
--- More fixes planned for v5/future revisions:
 Rename command to better reflect usage
 Improve dynamicness of patterns
 Add more demo protocols to show usage
- Future work
--- Documentation/NEWS items
--- Statistics for optimized MFEX
- Note that this patchset will be discussed/presented at OvsConf soon :)

v3 update summary:
(Cian Ferriter helping with rebases, review and code cleanups)
- Split out partially related changes (these will be sent separately)
--- netdev output action optimization
--- avx512 dpcls 16-block support optimization
- Squash commit which moves netdev struct flow into the refactor commit:
--- Squash dpif-netdev: move netdev flow struct to header
--- Into dpif-netdev: Refactor to multiple header files
- Implement Miniflow extract for AVX-512 DPIF
--- A generic method of matching patterns and packets is implemented,
providing traffic-pattern specific miniflow-extract acceleration.
--- The patterns today are hard-coded, however in a future patchset it
is intended to make these runtime configurable, allowing users to
optimize the SIMD miniflow extract for active traffic types.
- Notes:
--- 32 bit builds will be fixed in next release by adding flexible
miniflow extract optimization selection.
--- AVX-512 VBMI ISA is not yet supported in OVS due to requiring the
DPDK 20.11 update for RTE_CPUFLAG_*. Once on a newer DPDK this will
be added.

v2 updates:
- Includes DPIF command switching at runtime
- Includes AVX512 DPIF implementation
- Includes some partially related changes (can be split out of set?)
--- netdev output action optimization
--- avx512 dpcls 16-block support optimization


This patchset is a v7 for making the DPIF components of the
userspace datapath more flexible. It has been refactored to b

Re: [ovs-dev] [RFC PATCH] dpif-netdev: Support "port-forward" mode to avoid dp cache lookup

2021-04-27 Thread Sriharsha Basavapatna via dev
On Tue, Apr 27, 2021 at 6:42 PM Eli Britstein  wrote:
>
>
> On 4/27/2021 2:45 PM, Sriharsha Basavapatna wrote:
> > On Tue, Apr 27, 2021 at 4:26 PM Ilya Maximets  wrote:
> >> On 4/27/21 11:56 AM, Sriharsha Basavapatna via dev wrote:
> >>> Hi Eli,
> >>>
> >>> On Sun, Apr 25, 2021 at 6:22 PM Eli Britstein  wrote:
>  Hi Harsha,
> 
>  On 4/20/2021 11:07 AM, Sriharsha Basavapatna wrote:
> > Sometimes a port might be configured with a single flow that just
> > forwards packets to another port. This would be useful in configs
> > where the bridge is just fowarding packets between two ports (for
> > example, between a vhost-user port and a physical port). A flow
> > that matches only on the in_port and with an action that forwards
> > to another port would be configured, to avoid learning or matching
> > on packet headers.
> >
> > Example:
> > $ ovs-ofctl add-flow br0 in_port=1,actions=output:2
> > $ ovs-ofctl add-flow br0 in_port=2,actions=output:1
> >
> > This translates to a datapath flow with the match fields wildcarded
> > for the packet headers. However, the datapath processing still involves
>  There are still several matches (not wildcards):
> 
>  - recirc_id
>  - in_port
>  - packet_type
>  - dl_type
>  - vlan_tci
>  - nw_frag (for ip packets)
> 
>  So there might be multiple flows for each such openflow rule.
> 
>  In the past, I have tried to optimize such scenario, see:
> 
>  https://mail.openvswitch.org/pipermail/ovs-dev/2019-April/357882.html
> 
>  That was wrong as commented afterwards.
> 
>  Another related patch-set was this (also not accepted):
> 
>  https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/363948.html
> 
>  Ilya wrote an alternative patch:
> 
>  https://patchwork.ozlabs.org/patch/1105880/
> 
>  AFAIR, it didn't improve performance either.
> >> Would be good to have some performance numbers for it as there was
> >> no test results published and I don't know if someone ever tested it.
> >>
> >>> Thanks for the above pointers. Ilya had also shared this patch
> >>> recently while discussing this topic at the ovs-dpdk community
> >>> meeting. I want to see if we can utilize part of the logic in that
> >>> patch to add some constraints, while still avoiding an additional
> >>> table/lookup.  The 'port-forward' mode implies that the user wants to
> >>> avoid any kind of lookup in the datapath (as indicated by the ofctl
> >>> rule + port-forward mode).
> >> I don't see how to completely avoid lookups.
> >>
> >> IIUC, in this patch there is a match and upcall for the first packet,
> >> but there are no matches for subsequent packets.
> > That's right. Allow the first packet to go through match, upcall,
> > dp/cache insertion etc. For subsequent packets avoid lookup.
> >
> >>   This will work
> >> only for flow actions that doesn't modify the packet.  If for some
> >> reason the flow contains header modifications OVS will not do that
> >> correctly because the header is not parsed.  Also, if the packet is
> >> a bit different from the very first packet, we might attempt to
> >> modify headers that doesn't exist.  All in all, this is very dangerous
> >> and might lead to OVS crash.  We can't rely on the user to set specific
> >> OF rules for this functionality and we should not have a feature that
> >> might crash OVS if not used accurately.
> >>
> >> The way to not parse the packet at all and to not perform any matches is
> >> the way to completely ignore OF rules, but OVS is an OF switch and
> >> such functionality just doesn't fit.
> > If I add a constraint to check that there is only one action and it's
> > an OUTPUT action (i.e don't enable port-forward mode if the DP flow
> > contains other actions like modify), like it is done in your patch,
> > that should handle this issue ?
> >
> > Thanks,
> > -Harsha
> >> In my change I minimized the lookup as possible to a single 64bit key.
> >> And it will actually work with any OF rules and without enabling of
> >> any special flags.  Would be great to see some performance numbers
> >> for it as I didn't see any.
> >>
> >>> With pvp tests (vxlan config), we have
> >>> seen better performance both in pps: ~50% and cpp: ~35%, at a few
> >>> thousand flows. Similar improvement can be seen with simple
> >>> configurations (e.g testpmd in the vm in txonly fwd mode).
> >>>
>  Besides, I've tried this patch. Maybe I did something wrong (I
>  configured port-forward=true on those ports and those openflow rules,
>  and pinged between those ports). I didn't see it worked (the coverage,
>  and also I added my own prints).
> >>> When you enable port-forward and start the traffic, you should see a
> >>> message like this:
> >>> "dpif_netdev(pmd-c02/id:74)|DBG|Setting port_forward_flow: port:
> >>> 0x7f63400050b0 flow: 0x7f634000afb0"
> >>>
> >>> 

Re: [ovs-dev] [PATCH ovn v2 4/4] ovn-controller: Fix port group conjunction flow explosion problem.

2021-04-27 Thread Mark Gray
On 22/04/2021 21:14, Han Zhou wrote:
> For an ACL with match: outport == @PG && ip4.src == $PG_AS, given below
> scale:
> 
> P: PG size
> LP: number of local lports
> D: number of all datapaths (logical switches)
> LD: number of datapaths that contain local lports
> 
> With current OVN implementation, the total number of OF flows is:
> LP + (P * D) + D
> 
> The reason is, firstly, datapath is not part of the conjunction, so for
> each datapath the lflow is reparsed.
> 
> Secondly, although ovn-controller tries to filter out the flows that are
> for non-local lports, with the conjunction match, the logic that filters
> out non-local flows doesn't work for the conjunction part that doesn't
> have the lport in the match (the P * D part). When there is only one
> port on each LS it is fine, because no conjunction will be used because
> SB port groups are splited per datapath, so each port group would have
suggest "split per datapath"
> only one port. However, when more than one ports are on each LS the flow
> explosion happens.
> 
> This patch deal with the second reason above, by refining the SB port
> groups to store only locally bound lports: empty const sets will not
> generate any flows. This reduces the related flow number from
> LP + (P * D) + D to LP + (P * LD) + LD.
> 
> Since LD is expected to be small, so even if it is a multiplier, the
> total number is larged reduced.  In particular, in ovn-k8s use cases the
suggest "reduced significantly"
> LD is always 1, so the formular above becomes LP + P + LD.
> 
s/formular/formula
> With a scale of 1k k8s nodes, each has 4 ports for the same PG: P = 4k,
> LP = 4, D = 1k, LD = 1. The current implementation generates ~4m flows.
> With this patch it becomes only ~4k.
Cool!
> 
> Reported-by: Girish Moodalbail 
> Reported-at: 
> https://mail.openvswitch.org/pipermail/ovs-dev/2021-March/381082.html
> Reported-by: Dumitru Ceara 
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1944098
> Signed-off-by: Han Zhou 

I tested this as well and it seemed to work as expected.
> ---
> v1->v2: fix memory leaks found by address sanitizer
> 
>  controller/binding.c|  20 
>  controller/binding.h|   9 ++
>  controller/ovn-controller.c | 217 ++--
>  include/ovn/expr.h  |   2 +-
>  lib/expr.c  |   8 +-
>  tests/ovn.at|  53 +
>  tests/test-ovn.c|  12 +-
>  utilities/ovn-trace.c   |   4 +-
>  8 files changed, 283 insertions(+), 42 deletions(-)
> 
> diff --git a/controller/binding.c b/controller/binding.c
> index 514f5f33f..5aca964cc 100644
> --- a/controller/binding.c
> +++ b/controller/binding.c
> @@ -2987,3 +2987,23 @@ cleanup:
>  
>  return b_lport;
>  }
> +
> +struct sset *
> +binding_collect_local_binding_lports(struct local_binding_data 
> *lbinding_data)
> +{
> +struct sset *lports = xzalloc(sizeof *lports);
> +sset_init(lports);
> +struct shash_node *shash_node;
> +SHASH_FOR_EACH (shash_node, &lbinding_data->lports) {
> +struct binding_lport *b_lport = shash_node->data;
> +sset_add(lports, b_lport->name);
> +}
> +return lports;
> +}
> +
> +void
> +binding_destroy_local_binding_lports(struct sset *lports)
> +{
> +sset_destroy(lports);
> +free(lports);
> +}
> diff --git a/controller/binding.h b/controller/binding.h
> index 4fc9ef207..31f0352a0 100644
> --- a/controller/binding.h
> +++ b/controller/binding.h
> @@ -128,4 +128,13 @@ void binding_seqno_run(struct local_binding_data 
> *lbinding_data);
>  void binding_seqno_install(struct local_binding_data *lbinding_data);
>  void binding_seqno_flush(void);
>  void binding_dump_local_bindings(struct local_binding_data *, struct ds *);
> +
> +/* Generates a sset of lport names from local_binding_data.
> + * Note: the caller is responsible for destroying and freeing the returned
> + * sset, by calling binding_collect_local_binding_lports(). */
I think this^ should say binding_destroy_local_binding_lports()?
> +struct sset *binding_collect_local_binding_lports(struct local_binding_data 
> *);
> +
> +/* Destroy and free the lports sset returned by
> + * binding_collect_local_binding_lports(). */
> +void binding_destroy_local_binding_lports(struct sset *lports);
>  #endif /* controller/binding.h */
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index 7320bd56c..c6ba9ff88 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -1341,7 +1341,7 @@ addr_sets_init(const struct sbrec_address_set_table 
> *address_set_table,
>  SBREC_ADDRESS_SET_TABLE_FOR_EACH (as, address_set_table) {
>  expr_const_sets_add(addr_sets, as->name,
>  (const char *const *) as->addresses,
> -as->n_addresses, true);
> +as->n_addresses, true, NULL);
>  }
>  }
>  
> @@ -1358,7 +1358,7 @@ addr_sets_update(const struct sbrec_address_se

Re: [ovs-dev] [PATCH ovn v2 2/4] ovn.at: Improve "No ovn-controller assert when generating conjunction flows"

2021-04-27 Thread Mark Gray
On 22/04/2021 21:14, Han Zhou wrote:
> This patch improves the test case by binding 2 VIFs on the HV instead of
> one, to make sure conjunction is still used and the scenario is still
> tested by this test case when a following patch optimizes conjunction flows.
> 
> Signed-off-by: Han Zhou 
> ---
> v1->v2: no change
> 
>  tests/ovn.at | 94 
>  1 file changed, 50 insertions(+), 44 deletions(-)
> 
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 3d0a7f63f..55444bbd7 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -25111,10 +25111,12 @@ ovs-vsctl add-br br-phys
>  ovn_attach n1 br-phys 192.168.0.10
>  
>  as hv1
> -ovs-vsctl \
> --- add-port br-int vif1 \
> --- set Interface vif1 external_ids:iface-id=sw0-p1 \
> -ofport-request=1
> +for i in 1 2; do
> +ovs-vsctl \
> +-- add-port br-int vif$i \
> +-- set Interface vif$i external_ids:iface-id=sw0-p$i \
> +ofport-request=$i
> +done
>  
>  check as hv1
>  ovs-vsctl set open . external_ids:ovn-monitor-all=true
> @@ -25122,10 +25124,10 @@ ovs-vsctl set open . 
> external_ids:ovn-monitor-all=true
>  check ovn-nbctl ls-add sw0
>  check ovn-nbctl pg-add pg1
>  check ovn-nbctl pg-add pg2
> -check ovn-nbctl lsp-add sw0 sw0-p2
> -check ovn-nbctl lsp-set-addresses sw0-p2 "00:00:00:00:00:02 192.168.47.2"
>  check ovn-nbctl lsp-add sw0 sw0-p3
>  check ovn-nbctl lsp-set-addresses sw0-p3 "00:00:00:00:00:03 192.168.47.3"
> +check ovn-nbctl lsp-add sw0 sw0-p4
> +check ovn-nbctl lsp-set-addresses sw0-p4 "00:00:00:00:00:04 192.168.47.4"
>  
>  # Pause ovn-northd. When it is resumed, all the below NB updates
>  # will be sent in one transaction.
> @@ -25135,8 +25137,10 @@ check as northd-backup ovn-appctl -t NORTHD_TYPE 
> pause
>  
>  check ovn-nbctl lsp-add sw0 sw0-p1
>  check ovn-nbctl lsp-set-addresses sw0-p1 "00:00:00:00:00:01 192.168.47.1"
> -check ovn-nbctl pg-set-ports pg1 sw0-p1 sw0-p2
> -check ovn-nbctl pg-set-ports pg2 sw0-p3
> +check ovn-nbctl lsp-add sw0 sw0-p2
> +check ovn-nbctl lsp-set-addresses sw0-p2 "00:00:00:00:00:02 192.168.47.2"
> +check ovn-nbctl pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3
> +check ovn-nbctl pg-set-ports pg2 sw0-p4
>  check ovn-nbctl acl-add pg1 to-lport 1002 "outport == @pg1 && ip4 && ip4.src 
> == \$pg2_ip4 && udp && udp.dst >= 1 && udp.dst <= 65535" allow-related
>  
>  # resume ovn-northd now. This should result in a single update message
> @@ -25144,11 +25148,11 @@ check ovn-nbctl acl-add pg1 to-lport 1002 "outport 
> == @pg1 && ip4 && ip4.src ==
>  check as northd ovn-appctl -t NORTHD_TYPE resume
>  
>  AS_BOX([Wait for sw0-p1 to be up])
Update AS_BOX above^ as it does not match the code anymore.
> -wait_for_ports_up sw0-p1
> +wait_for_ports_up sw0-p1 sw0-p2
>  
>  # When the port group pg1 is updated, it should not result in
>  # any assert in ovn-controller.
> -ovn-nbctl --wait=hv pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3
> +ovn-nbctl --wait=hv pg-set-ports pg1 sw0-p1 sw0-p2 sw0-p3 sw0-p4
>  AT_CHECK([kill -0 $(cat hv1/ovn-controller.pid)])
>  check ovn-nbctl --wait=hv sync
>  
> @@ -25156,40 +25160,42 @@ check ovn-nbctl --wait=hv sync
>  AT_CHECK([as hv1 ovs-ofctl dump-flows br-int table=45 | ofctl_strip_all | \
>  grep "priority=2002" | grep conjunction | \
>  sed 's/conjunction([[^)]]*)/conjunction()/g' | sort], [0], [dnl
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x10/0xfff0
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x100/0xff00
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x1000/0xf000
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x2/0xfffe
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x20/0xffe0
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x200/0xfe00
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x2000/0xe000
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x4/0xfffc
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x40/0xffc0
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x400/0xfc00
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x4000/0xc000
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_dst=0x8/0xfff8
>  actions=conjunction()
> - table=45, 
> priority=2002,udp,reg0=0x100/0x100,metadata=0x1,nw_src=192.168.47.3,tp_ds

Re: [ovs-dev] [PATCH ovn v2 1/4] inc-proc-eng: Call clear_tracked_data before recompute.

2021-04-27 Thread Mark Gray
Hi Han,

Thanks for fixing this. I reviewed this series but I am not an expert on
the code. Please have a look at my suggestions but I suggest also
waiting for an ack from Girish or Krzystof as they will probably test it.

Mark

On 22/04/2021 21:14, Han Zhou wrote:
> Cleanup particially tracked data due to some of the change handler
s/particially/partially?
> executions before falling back to recompute. This is done already
> in the en_runtime_data_run() implementation, but this patch makes
> it a generic behavior of the I-P engine.
> 
> Signed-off-by: Han Zhou 
> ---
> v1->v2: no change
> 
>  controller/ovn-controller.c | 17 -
>  lib/inc-proc-eng.c  |  5 +
>  2 files changed, 5 insertions(+), 17 deletions(-)
> 
> diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
> index 6f7c9ea61..13c03131c 100644
> --- a/controller/ovn-controller.c
> +++ b/controller/ovn-controller.c
> @@ -1412,23 +1412,6 @@ en_runtime_data_run(struct engine_node *node, void 
> *data)
>  struct sset *local_lport_ids = &rt_data->local_lport_ids;
>  struct sset *active_tunnels = &rt_data->active_tunnels;
>  
> -/* Clear the (stale) tracked data if any. Even though the tracked data
> - * gets cleared in the beginning of engine_init_run(),
> - * any of the runtime data handler might have set some tracked
> - * data and later another runtime data handler might return false
> - * resulting in full recompute of runtime engine and rendering the 
> tracked
> - * data stale.
> - *
> - * It's possible that engine framework can be enhanced to indicate
> - * the node handlers (in this case flow_output_runtime_data_handler)
> - * that its input node had a full recompute. However we would still
> - * need to clear the tracked data, because we don't want the
> - * stale tracked data to be accessed outside of the engine, since the
> - * tracked data is cleared in the engine_init_run() and not at the
> - * end of the engine run.
> - * */
> -en_runtime_data_clear_tracked_data(data);
> -
>  static bool first_run = true;
>  if (first_run) {
>  /* don't cleanup since there is no data yet */
> diff --git a/lib/inc-proc-eng.c b/lib/inc-proc-eng.c
> index a6337a1d9..161327404 100644
> --- a/lib/inc-proc-eng.c
> +++ b/lib/inc-proc-eng.c
> @@ -327,6 +327,11 @@ engine_recompute(struct engine_node *node, bool forced, 
> bool allowed)
>  }
>  
>  /* Run the node handler which might change state. */
Can you move this^ comment down to above the run function as I think it
is relevant to that code?
> +/* Clear tracked data before calling run() so that partially tracked data
> + * from some of the change handler executions are cleared. */
> +if (node->clear_tracked_data) {
> +node->clear_tracked_data(node->data);
> +}
>  node->run(node, node->data);
>  node->stats.recompute++;
>  }
> 

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC PATCH] dpif-netdev: Support "port-forward" mode to avoid dp cache lookup

2021-04-27 Thread Eli Britstein



On 4/27/2021 2:45 PM, Sriharsha Basavapatna wrote:

On Tue, Apr 27, 2021 at 4:26 PM Ilya Maximets  wrote:

On 4/27/21 11:56 AM, Sriharsha Basavapatna via dev wrote:

Hi Eli,

On Sun, Apr 25, 2021 at 6:22 PM Eli Britstein  wrote:

Hi Harsha,

On 4/20/2021 11:07 AM, Sriharsha Basavapatna wrote:

Sometimes a port might be configured with a single flow that just
forwards packets to another port. This would be useful in configs
where the bridge is just fowarding packets between two ports (for
example, between a vhost-user port and a physical port). A flow
that matches only on the in_port and with an action that forwards
to another port would be configured, to avoid learning or matching
on packet headers.

Example:
$ ovs-ofctl add-flow br0 in_port=1,actions=output:2
$ ovs-ofctl add-flow br0 in_port=2,actions=output:1

This translates to a datapath flow with the match fields wildcarded
for the packet headers. However, the datapath processing still involves

There are still several matches (not wildcards):

- recirc_id
- in_port
- packet_type
- dl_type
- vlan_tci
- nw_frag (for ip packets)

So there might be multiple flows for each such openflow rule.

In the past, I have tried to optimize such scenario, see:

https://mail.openvswitch.org/pipermail/ovs-dev/2019-April/357882.html

That was wrong as commented afterwards.

Another related patch-set was this (also not accepted):

https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/363948.html

Ilya wrote an alternative patch:

https://patchwork.ozlabs.org/patch/1105880/

AFAIR, it didn't improve performance either.

Would be good to have some performance numbers for it as there was
no test results published and I don't know if someone ever tested it.


Thanks for the above pointers. Ilya had also shared this patch
recently while discussing this topic at the ovs-dpdk community
meeting. I want to see if we can utilize part of the logic in that
patch to add some constraints, while still avoiding an additional
table/lookup.  The 'port-forward' mode implies that the user wants to
avoid any kind of lookup in the datapath (as indicated by the ofctl
rule + port-forward mode).

I don't see how to completely avoid lookups.

IIUC, in this patch there is a match and upcall for the first packet,
but there are no matches for subsequent packets.

That's right. Allow the first packet to go through match, upcall,
dp/cache insertion etc. For subsequent packets avoid lookup.


  This will work
only for flow actions that doesn't modify the packet.  If for some
reason the flow contains header modifications OVS will not do that
correctly because the header is not parsed.  Also, if the packet is
a bit different from the very first packet, we might attempt to
modify headers that doesn't exist.  All in all, this is very dangerous
and might lead to OVS crash.  We can't rely on the user to set specific
OF rules for this functionality and we should not have a feature that
might crash OVS if not used accurately.

The way to not parse the packet at all and to not perform any matches is
the way to completely ignore OF rules, but OVS is an OF switch and
such functionality just doesn't fit.

If I add a constraint to check that there is only one action and it's
an OUTPUT action (i.e don't enable port-forward mode if the DP flow
contains other actions like modify), like it is done in your patch,
that should handle this issue ?

Thanks,
-Harsha

In my change I minimized the lookup as possible to a single 64bit key.
And it will actually work with any OF rules and without enabling of
any special flags.  Would be great to see some performance numbers
for it as I didn't see any.


With pvp tests (vxlan config), we have
seen better performance both in pps: ~50% and cpp: ~35%, at a few
thousand flows. Similar improvement can be seen with simple
configurations (e.g testpmd in the vm in txonly fwd mode).


Besides, I've tried this patch. Maybe I did something wrong (I
configured port-forward=true on those ports and those openflow rules,
and pinged between those ports). I didn't see it worked (the coverage,
and also I added my own prints).

When you enable port-forward and start the traffic, you should see a
message like this:
"dpif_netdev(pmd-c02/id:74)|DBG|Setting port_forward_flow: port:
0x7f63400050b0 flow: 0x7f634000afb0"

I'm guessing the flow isn't getting added to the port; the insertion
is currently done when there's an emc hit. I should probably move the
insertion code to handle_packet_upcall(). As a workaround, can you
please update the emc insertion probability (ovs-vsctl --no-wait set
Open_vSwitch . other_config:emc-insert-inv-prob=1) and retry your test
?

Also, please disable normal mode in the bridge (ovs-ofctl del-flows
br0; and then add ofctl rules).  Let me know if you still see the
problem, I'll work with you offline.


With this proposed patch, what will be the behavior in case there are
multiple DP flows for that single openflow rule?

Right now I'm thinki

[ovs-dev] [PATCH] Fix redundant datapath set ethernet action with NSH Decap

2021-04-27 Thread Martin Varghese
From: Martin Varghese 

When a decap action is applied on NSH header encapsulatiing a
ethernet packet a redundant set mac address action is programmed
to the datapath.

Fixes: f839892a206a ("OF support and translation of generic encap and decap")
Signed-off-by: Martin Varghese 
---
 lib/odp-util.c   | 3 ++-
 ofproto/ofproto-dpif-xlate.c | 2 ++
 tests/nsh.at | 8 
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/lib/odp-util.c b/lib/odp-util.c
index e1199d1da..9d558082f 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -7830,7 +7830,8 @@ commit_set_ether_action(const struct flow *flow, struct 
flow *base_flow,
 struct offsetof_sizeof ovs_key_ethernet_offsetof_sizeof_arr[] =
 OVS_KEY_ETHERNET_OFFSETOF_SIZEOF_ARR;
 
-if (flow->packet_type != htonl(PT_ETH)) {
+if ((flow->packet_type != htonl(PT_ETH)) ||
+(base_flow->packet_type != htonl(PT_ETH))) {
 return;
 }
 
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 7108c8a30..a6f4ea334 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -6549,6 +6549,8 @@ xlate_generic_decap_action(struct xlate_ctx *ctx,
  * Delay generating pop_eth to the next commit. */
 flow->packet_type = htonl(PACKET_TYPE(OFPHTN_ETHERTYPE,
   ntohs(flow->dl_type)));
+flow->dl_src = eth_addr_zero;
+flow->dl_dst = eth_addr_zero;
 ctx->wc->masks.dl_type = OVS_BE16_MAX;
 }
 return false;
diff --git a/tests/nsh.at b/tests/nsh.at
index d5c772ff0..e84134e42 100644
--- a/tests/nsh.at
+++ b/tests/nsh.at
@@ -105,7 +105,7 @@ bridge("br0")
 
 Final flow: 
in_port=1,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_dst=11:22:33:44:55:66,dl_type=0x894f,nsh_flags=0,nsh_ttl=63,nsh_mdtype=1,nsh_np=3,nsh_spi=0x1234,nsh_si=255,nsh_c1=0x11223344,nsh_c2=0x0,nsh_c3=0x0,nsh_c4=0x0,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0
 Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
-Datapath actions: 
push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x1)
+Datapath actions: 
push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),recirc(0x1)
 ])
 
 AT_CHECK([
@@ -139,7 +139,7 @@ ovs-appctl time/warp 1000
 AT_CHECK([
 ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v ipv6 | 
sort
 ], [0], [flow-dump from the main thread:
-recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(frag=no),
 packets:1, bytes:98, used:0.0s, 
actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x3)
+recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(frag=no),
 packets:1, bytes:98, used:0.0s, 
actions:push_nsh(flags=0,ttl=63,mdtype=1,np=3,spi=0x1234,si=255,c1=0x11223344,c2=0x0,c3=0x0,c4=0x0),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),recirc(0x3)
 
recirc_id(0x3),in_port(1),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
 packets:1, bytes:98, used:0.0s, actions:2
 ])
 
@@ -232,7 +232,7 @@ bridge("br0")
 
 Final flow: 
in_port=1,vlan_tci=0x,dl_src=00:00:00:00:00:00,dl_dst=11:22:33:44:55:66,dl_type=0x894f,nsh_flags=0,nsh_ttl=63,nsh_mdtype=2,nsh_np=3,nsh_spi=0x1234,nsh_si=255,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=0
 Megaflow: recirc_id=0,eth,ip,in_port=1,dl_dst=66:77:88:99:aa:bb,nw_frag=no
-Datapath actions: 
push_nsh(flags=0,ttl=63,mdtype=2,np=3,spi=0x1234,si=255,md2=0x1a041234567820001408fedcba9876543210),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x1)
+Datapath actions: 
push_nsh(flags=0,ttl=63,mdtype=2,np=3,spi=0x1234,si=255,md2=0x1a041234567820001408fedcba9876543210),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),recirc(0x1)
 ])
 
 AT_CHECK([
@@ -266,7 +266,7 @@ ovs-appctl time/warp 1000
 AT_CHECK([
 ovs-appctl dpctl/dump-flows dummy@ovs-dummy | strip_used | grep -v ipv6 | 
sort
 ], [0], [flow-dump from the main thread:
-recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(frag=no),
 packets:1, bytes:98, used:0.0s, 
actions:push_nsh(flags=0,ttl=63,mdtype=2,np=3,spi=0x1234,si=255,md2=0x1a041234567820001408fedcba9876543210),push_eth(src=00:00:00:00:00:00,dst=11:22:33:44:55:66),pop_eth,pop_nsh(),set(eth(dst=11:22:33:44:55:66)),recirc(0x3)
+recirc_id(0),in_port(1),packet_type(ns=0,id=0),eth(dst=1e:2c:e9:2a:66:9e),eth_type(0x0800),ipv4(f

Re: [ovs-dev] [RFC PATCH] dpif-netdev: Support "port-forward" mode to avoid dp cache lookup

2021-04-27 Thread Sriharsha Basavapatna via dev
On Tue, Apr 27, 2021 at 4:26 PM Ilya Maximets  wrote:
>
> On 4/27/21 11:56 AM, Sriharsha Basavapatna via dev wrote:
> > Hi Eli,
> >
> > On Sun, Apr 25, 2021 at 6:22 PM Eli Britstein  wrote:
> >>
> >> Hi Harsha,
> >>
> >> On 4/20/2021 11:07 AM, Sriharsha Basavapatna wrote:
> >>> Sometimes a port might be configured with a single flow that just
> >>> forwards packets to another port. This would be useful in configs
> >>> where the bridge is just fowarding packets between two ports (for
> >>> example, between a vhost-user port and a physical port). A flow
> >>> that matches only on the in_port and with an action that forwards
> >>> to another port would be configured, to avoid learning or matching
> >>> on packet headers.
> >>>
> >>> Example:
> >>> $ ovs-ofctl add-flow br0 in_port=1,actions=output:2
> >>> $ ovs-ofctl add-flow br0 in_port=2,actions=output:1
> >>>
> >>> This translates to a datapath flow with the match fields wildcarded
> >>> for the packet headers. However, the datapath processing still involves
> >>
> >> There are still several matches (not wildcards):
> >>
> >>- recirc_id
> >>- in_port
> >>- packet_type
> >>- dl_type
> >>- vlan_tci
> >>- nw_frag (for ip packets)
> >>
> >> So there might be multiple flows for each such openflow rule.
> >>
> >> In the past, I have tried to optimize such scenario, see:
> >>
> >> https://mail.openvswitch.org/pipermail/ovs-dev/2019-April/357882.html
> >>
> >> That was wrong as commented afterwards.
> >>
> >> Another related patch-set was this (also not accepted):
> >>
> >> https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/363948.html
> >>
> >> Ilya wrote an alternative patch:
> >>
> >> https://patchwork.ozlabs.org/patch/1105880/
> >>
> >> AFAIR, it didn't improve performance either.
>
> Would be good to have some performance numbers for it as there was
> no test results published and I don't know if someone ever tested it.
>
> >
> > Thanks for the above pointers. Ilya had also shared this patch
> > recently while discussing this topic at the ovs-dpdk community
> > meeting. I want to see if we can utilize part of the logic in that
> > patch to add some constraints, while still avoiding an additional
> > table/lookup.  The 'port-forward' mode implies that the user wants to
> > avoid any kind of lookup in the datapath (as indicated by the ofctl
> > rule + port-forward mode).
>
> I don't see how to completely avoid lookups.
>
> IIUC, in this patch there is a match and upcall for the first packet,
> but there are no matches for subsequent packets.

That's right. Allow the first packet to go through match, upcall,
dp/cache insertion etc. For subsequent packets avoid lookup.

>  This will work
> only for flow actions that doesn't modify the packet.  If for some
> reason the flow contains header modifications OVS will not do that
> correctly because the header is not parsed.  Also, if the packet is
> a bit different from the very first packet, we might attempt to
> modify headers that doesn't exist.  All in all, this is very dangerous
> and might lead to OVS crash.  We can't rely on the user to set specific
> OF rules for this functionality and we should not have a feature that
> might crash OVS if not used accurately.
>
> The way to not parse the packet at all and to not perform any matches is
> the way to completely ignore OF rules, but OVS is an OF switch and
> such functionality just doesn't fit.

If I add a constraint to check that there is only one action and it's
an OUTPUT action (i.e don't enable port-forward mode if the DP flow
contains other actions like modify), like it is done in your patch,
that should handle this issue ?

Thanks,
-Harsha
>
> In my change I minimized the lookup as possible to a single 64bit key.
> And it will actually work with any OF rules and without enabling of
> any special flags.  Would be great to see some performance numbers
> for it as I didn't see any.
>
> > With pvp tests (vxlan config), we have
> > seen better performance both in pps: ~50% and cpp: ~35%, at a few
> > thousand flows. Similar improvement can be seen with simple
> > configurations (e.g testpmd in the vm in txonly fwd mode).
> >
> >>
> >> Besides, I've tried this patch. Maybe I did something wrong (I
> >> configured port-forward=true on those ports and those openflow rules,
> >> and pinged between those ports). I didn't see it worked (the coverage,
> >> and also I added my own prints).
> >
> > When you enable port-forward and start the traffic, you should see a
> > message like this:
> > "dpif_netdev(pmd-c02/id:74)|DBG|Setting port_forward_flow: port:
> > 0x7f63400050b0 flow: 0x7f634000afb0"
> >
> > I'm guessing the flow isn't getting added to the port; the insertion
> > is currently done when there's an emc hit. I should probably move the
> > insertion code to handle_packet_upcall(). As a workaround, can you
> > please update the emc insertion probability (ovs-vsctl --no-wait set
> > Open_vSwitch . other_config:emc-insert-inv-prob=1)

Re: [ovs-dev] [PATCH] ofp_actions: fix typo in set_mpls_tc formatting

2021-04-27 Thread Ilya Maximets
On 4/27/21 1:00 PM, Adrian Moreno wrote:
> I guess this was a cut-and-paste error from set_mpls_ttl
> 
> Signed-off-by: Adrian Moreno 
> ---
>  lib/ofp-actions.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c
> index 0342a228b..91d810860 100644
> --- a/lib/ofp-actions.c
> +++ b/lib/ofp-actions.c
> @@ -3850,7 +3850,7 @@ static void
>  format_SET_MPLS_TC(const struct ofpact_mpls_tc *a,
> const struct ofpact_format_params *fp)
>  {
> -ds_put_format(fp->s, "%sset_mpls_ttl(%s%"PRIu8"%s)%s",
> +ds_put_format(fp->s, "%sset_mpls_tc(%s%"PRIu8"%s)%s",
>colors.paren, colors.end, a->tc,
>colors.paren, colors.end);
>  }
> 

Hi.  Thanks for the fix!

Could you, please, add a unit test for this issue?
You can use following commit as a reference:

  0062a04d8701 ("tests: Add parse-flow tests for MPLS fields.")

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] ofp_actions: fix typo in set_mpls_tc formatting

2021-04-27 Thread Adrian Moreno
I guess this was a cut-and-paste error from set_mpls_ttl

Signed-off-by: Adrian Moreno 
---
 lib/ofp-actions.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/ofp-actions.c b/lib/ofp-actions.c
index 0342a228b..91d810860 100644
--- a/lib/ofp-actions.c
+++ b/lib/ofp-actions.c
@@ -3850,7 +3850,7 @@ static void
 format_SET_MPLS_TC(const struct ofpact_mpls_tc *a,
const struct ofpact_format_params *fp)
 {
-ds_put_format(fp->s, "%sset_mpls_ttl(%s%"PRIu8"%s)%s",
+ds_put_format(fp->s, "%sset_mpls_tc(%s%"PRIu8"%s)%s",
   colors.paren, colors.end, a->tc,
   colors.paren, colors.end);
 }
-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [RFC PATCH] dpif-netdev: Support "port-forward" mode to avoid dp cache lookup

2021-04-27 Thread Ilya Maximets
On 4/27/21 11:56 AM, Sriharsha Basavapatna via dev wrote:
> Hi Eli,
> 
> On Sun, Apr 25, 2021 at 6:22 PM Eli Britstein  wrote:
>>
>> Hi Harsha,
>>
>> On 4/20/2021 11:07 AM, Sriharsha Basavapatna wrote:
>>> Sometimes a port might be configured with a single flow that just
>>> forwards packets to another port. This would be useful in configs
>>> where the bridge is just fowarding packets between two ports (for
>>> example, between a vhost-user port and a physical port). A flow
>>> that matches only on the in_port and with an action that forwards
>>> to another port would be configured, to avoid learning or matching
>>> on packet headers.
>>>
>>> Example:
>>> $ ovs-ofctl add-flow br0 in_port=1,actions=output:2
>>> $ ovs-ofctl add-flow br0 in_port=2,actions=output:1
>>>
>>> This translates to a datapath flow with the match fields wildcarded
>>> for the packet headers. However, the datapath processing still involves
>>
>> There are still several matches (not wildcards):
>>
>>- recirc_id
>>- in_port
>>- packet_type
>>- dl_type
>>- vlan_tci
>>- nw_frag (for ip packets)
>>
>> So there might be multiple flows for each such openflow rule.
>>
>> In the past, I have tried to optimize such scenario, see:
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2019-April/357882.html
>>
>> That was wrong as commented afterwards.
>>
>> Another related patch-set was this (also not accepted):
>>
>> https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/363948.html
>>
>> Ilya wrote an alternative patch:
>>
>> https://patchwork.ozlabs.org/patch/1105880/
>>
>> AFAIR, it didn't improve performance either.

Would be good to have some performance numbers for it as there was
no test results published and I don't know if someone ever tested it.

> 
> Thanks for the above pointers. Ilya had also shared this patch
> recently while discussing this topic at the ovs-dpdk community
> meeting. I want to see if we can utilize part of the logic in that
> patch to add some constraints, while still avoiding an additional
> table/lookup.  The 'port-forward' mode implies that the user wants to
> avoid any kind of lookup in the datapath (as indicated by the ofctl
> rule + port-forward mode).

I don't see how to completely avoid lookups.

IIUC, in this patch there is a match and upcall for the first packet,
but there are no matches for subsequent packets.  This will work
only for flow actions that doesn't modify the packet.  If for some
reason the flow contains header modifications OVS will not do that
correctly because the header is not parsed.  Also, if the packet is
a bit different from the very first packet, we might attempt to
modify headers that doesn't exist.  All in all, this is very dangerous
and might lead to OVS crash.  We can't rely on the user to set specific
OF rules for this functionality and we should not have a feature that
might crash OVS if not used accurately.

The way to not parse the packet at all and to not perform any matches is
the way to completely ignore OF rules, but OVS is an OF switch and
such functionality just doesn't fit.

In my change I minimized the lookup as possible to a single 64bit key.
And it will actually work with any OF rules and without enabling of
any special flags.  Would be great to see some performance numbers
for it as I didn't see any.

> With pvp tests (vxlan config), we have
> seen better performance both in pps: ~50% and cpp: ~35%, at a few
> thousand flows. Similar improvement can be seen with simple
> configurations (e.g testpmd in the vm in txonly fwd mode).
> 
>>
>> Besides, I've tried this patch. Maybe I did something wrong (I
>> configured port-forward=true on those ports and those openflow rules,
>> and pinged between those ports). I didn't see it worked (the coverage,
>> and also I added my own prints).
> 
> When you enable port-forward and start the traffic, you should see a
> message like this:
> "dpif_netdev(pmd-c02/id:74)|DBG|Setting port_forward_flow: port:
> 0x7f63400050b0 flow: 0x7f634000afb0"
> 
> I'm guessing the flow isn't getting added to the port; the insertion
> is currently done when there's an emc hit. I should probably move the
> insertion code to handle_packet_upcall(). As a workaround, can you
> please update the emc insertion probability (ovs-vsctl --no-wait set
> Open_vSwitch . other_config:emc-insert-inv-prob=1) and retry your test
> ?
> 
> Also, please disable normal mode in the bridge (ovs-ofctl del-flows
> br0; and then add ofctl rules).  Let me know if you still see the
> problem, I'll work with you offline.
> 
>>
>> With this proposed patch, what will be the behavior in case there are
>> multiple DP flows for that single openflow rule?
> 
> Right now I'm thinking that the ofctl rule takes precedence since the
> user just wants to forward to another port. If there are multiple DP
> flows, then the first one will act as the default flow.  What do you
> think ?
> 
> Thanks,
> -Harsha
> 
> 
>>
>> Thanks,
>> Eli
>>
>>> flow cache (EMC/SMC) 

Re: [ovs-dev] [RFC PATCH] dpif-netdev: Support "port-forward" mode to avoid dp cache lookup

2021-04-27 Thread Sriharsha Basavapatna via dev
Hi Eli,

On Sun, Apr 25, 2021 at 6:22 PM Eli Britstein  wrote:
>
> Hi Harsha,
>
> On 4/20/2021 11:07 AM, Sriharsha Basavapatna wrote:
> > Sometimes a port might be configured with a single flow that just
> > forwards packets to another port. This would be useful in configs
> > where the bridge is just fowarding packets between two ports (for
> > example, between a vhost-user port and a physical port). A flow
> > that matches only on the in_port and with an action that forwards
> > to another port would be configured, to avoid learning or matching
> > on packet headers.
> >
> > Example:
> > $ ovs-ofctl add-flow br0 in_port=1,actions=output:2
> > $ ovs-ofctl add-flow br0 in_port=2,actions=output:1
> >
> > This translates to a datapath flow with the match fields wildcarded
> > for the packet headers. However, the datapath processing still involves
>
> There are still several matches (not wildcards):
>
>- recirc_id
>- in_port
>- packet_type
>- dl_type
>- vlan_tci
>- nw_frag (for ip packets)
>
> So there might be multiple flows for each such openflow rule.
>
> In the past, I have tried to optimize such scenario, see:
>
> https://mail.openvswitch.org/pipermail/ovs-dev/2019-April/357882.html
>
> That was wrong as commented afterwards.
>
> Another related patch-set was this (also not accepted):
>
> https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/363948.html
>
> Ilya wrote an alternative patch:
>
> https://patchwork.ozlabs.org/patch/1105880/
>
> AFAIR, it didn't improve performance either.

Thanks for the above pointers. Ilya had also shared this patch
recently while discussing this topic at the ovs-dpdk community
meeting. I want to see if we can utilize part of the logic in that
patch to add some constraints, while still avoiding an additional
table/lookup.  The 'port-forward' mode implies that the user wants to
avoid any kind of lookup in the datapath (as indicated by the ofctl
rule + port-forward mode).  With pvp tests (vxlan config), we have
seen better performance both in pps: ~50% and cpp: ~35%, at a few
thousand flows. Similar improvement can be seen with simple
configurations (e.g testpmd in the vm in txonly fwd mode).

>
> Besides, I've tried this patch. Maybe I did something wrong (I
> configured port-forward=true on those ports and those openflow rules,
> and pinged between those ports). I didn't see it worked (the coverage,
> and also I added my own prints).

When you enable port-forward and start the traffic, you should see a
message like this:
"dpif_netdev(pmd-c02/id:74)|DBG|Setting port_forward_flow: port:
0x7f63400050b0 flow: 0x7f634000afb0"

I'm guessing the flow isn't getting added to the port; the insertion
is currently done when there's an emc hit. I should probably move the
insertion code to handle_packet_upcall(). As a workaround, can you
please update the emc insertion probability (ovs-vsctl --no-wait set
Open_vSwitch . other_config:emc-insert-inv-prob=1) and retry your test
?

Also, please disable normal mode in the bridge (ovs-ofctl del-flows
br0; and then add ofctl rules).  Let me know if you still see the
problem, I'll work with you offline.

>
> With this proposed patch, what will be the behavior in case there are
> multiple DP flows for that single openflow rule?

Right now I'm thinking that the ofctl rule takes precedence since the
user just wants to forward to another port. If there are multiple DP
flows, then the first one will act as the default flow.  What do you
think ?

Thanks,
-Harsha


>
> Thanks,
> Eli
>
> > flow cache (EMC/SMC) lookup and with a large number of flows it also
> > results in dpcls lookup due to cache miss. Avoiding cache lookup in
> > such configurations results in better performance (pps and cpp).
> >
> > This patch provides a new interface config parameter - "port-forward",
> > to avoid datapath cache lookup. When this is enabled, the datapath flow
> > is saved in the port when there is a cache hit for the initial packet.
> > For subsequent packets, the flow is readily found in the port structure,
> > thus avoiding cache and dpcls lookup.
> >
> > Example:
> > $ ovs-vsctl add-port br0 dpdk0 \
> > -- set Interface dpdk0 other_config:port-forward=true
> >
> > A coverage counter has also been added to track packets processed in
> > port-forward mode.
> >
> > $ ovs-appctl coverage/show   | grep datapath_port_forward_packet
> >
> > Signed-off-by: Sriharsha Basavapatna 
> > ---
> >   lib/dpif-netdev.c| 79 ++--
> >   vswitchd/vswitch.xml | 11 ++
> >   2 files changed, 72 insertions(+), 18 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 251788b04..133ed7c1e 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -114,6 +114,7 @@ COVERAGE_DEFINE(datapath_drop_invalid_port);
> >   COVERAGE_DEFINE(datapath_drop_invalid_bond);
> >   COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
> >   COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
> > +C

Re: [ovs-dev] [PATCH v5] conntrack: handle SNAT with all-zero IP address

2021-04-27 Thread Gaëtan Rivet
(Adding back the mailing list + original CCes to the thread.)

On Mon, Apr 26, 2021, at 19:09, Paolo Valerio wrote:
> Hello Gaetan,
> 
> thanks for the feedback
> 
> Gaëtan Rivet  writes:
> 
> > On Fri, Apr 23, 2021, at 00:28, Paolo Valerio wrote:
> 
> [...]
> 
> >> +
> >> +int i, j, s_attempts, d_attempts;
> >
> > Why not use uint16_t here?
> > {curr,min,max}_{d,s}port are uint16_t and {s,d}_attemps will be set to 
> > values derived from them.
> > i and j will then be compared against {s,d}_attempts, so it seems safer to 
> > keep them all aligned.
> >
> 
> ACK
> 
> > Additionally, it seems s,d_attempts are unnecessary.
> > They are only used to know the number of NEXT_PORT_IN_RANGE() that should 
> > be attempted.
> > Their names are slightly misleading (if they are counts of attempts, 
> > n_attempts would be clearer),
> > but also the index could be initialized to the number of attempts 
> > remaining, and decrease during the loop.
> > As the indexes are not useful within the loop, it seems ok.
> >
> > Furthermore, if they are not useful, could the indexes be masked 
> > completely? Would it be acceptable
> > to declare them within the for() loop? I know it's should generally be 
> > avoided, but I've seen a few places
> > where in-line declaration were used. In that case I think it's justified if 
> > it makes the macro safer to use and simpler
> > to read.
> >
> 
> Right, the indexes are not useful within the loop, and masking them
> would make the macro simpler. OTOH, declaring them within the for()
> and nesting the loops would lead to a warning (-Wshadow).
> 
> If I didn't miss anything, and if you are ok with it, I would change it,
> based on your suggestions, like the following:
> 
> uint16_t i, j;
> FOR_EACH_PORT_IN_RANGE(i, curr_dport, min_dport, max_dport) {
> nat_conn->rev_key.src.port = htons(curr_dport);
> FOR_EACH_PORT_IN_RANGE(j, curr_sport, min_sport, max_sport) {
> [...]
> }
> }
> 
> #define FOR_EACH_PORT_IN_RANGE(idx, curr, min, max) \
> for (INIT_N_PORT_ATTEMPTS(idx, curr, min, max); \
>  idx > 0; idx--, NEXT_PORT_IN_RANGE(curr, min, max))
> 
> WDYT?
> 

To nest the loops, you can use the __COUNTER__ macro, like so:

/* Generate a unique name with the __COUNTER__ macro to allow nesting loops. */
#define OVS_STR_(x,y) x##y
#define OVS_STR(x, y) OVS_STR_(x,y)
/* There is one such 'stringify' macro in cmap.h as well, maybe it could be 
shared in a util.h or similar. */

#define FOR_EACH_PORT_IN_RANGE__(curr, min, max, INAME) \
for (uint16_t INAME = N_PORT_ATTEMPTS(curr, min, max); \
  INAME > 0; INAME--, NEXT_PORT_IN_RANGE(curr, min, max))

#define FOR_EACH_PORT_IN_RANGE(curr, min, max) \
FOR_EACH_PORT_IN_RANGE__(curr, min, max, OVS_STR(idx, __COUNTER__))

> >> +FOR_EACH_PORT_IN_RANGE(i, d_attempts, curr_dport, min_dport, 
> >> max_dport) {
> >> +nat_conn->rev_key.src.port = htons(curr_dport);
> >> +FOR_EACH_PORT_IN_RANGE(j, s_attempts, curr_sport, min_sport, 
> >> max_sport) {
> >> +nat_conn->rev_key.dst.port = htons(curr_sport);
> >> +if (!conn_lookup(ct, &nat_conn->rev_key,
> >> + time_msec(), NULL, NULL)) {
> >> +return true;
> >>  }
> >> -first_port = min_port;
> >> -port = first_port;
> >> -all_ports_tried = false;
> >>  }
> >>  }
> >> -return false;
> >> +
> >> +/* Check if next IP is in range and respin. Otherwise, notify
> >> + * exhaustion to the caller. */
> >> +next_addr:
> >> +if (next_addr_in_range_guarded(&curr_addr, &min_addr,
> >> +   &max_addr, &guard_addr,
> >> +   conn->key.dl_type == 
> >> htons(ETH_TYPE_IP))) {
> >> +return false;
> >> +}
> >> +
> >> +goto another_round;
> >>  }
> >>  
> >>  static enum ct_update_res
> >> diff --git a/lib/conntrack.h b/lib/conntrack.h
> >> index 9553b188a..c68a83ccd 100644
> >> --- a/lib/conntrack.h
> >> +++ b/lib/conntrack.h
> >> @@ -77,6 +77,14 @@ enum nat_action_e {
> >>  NAT_ACTION_DST_PORT = 1 << 3,
> >>  };
> >>  
> >> +#define NAT_ACTION_SNAT_ALL (NAT_ACTION_SRC | NAT_ACTION_SRC_PORT)
> >> +#define NAT_ACTION_DNAT_ALL (NAT_ACTION_DST | NAT_ACTION_DST_PORT)
> >> +
> >> +enum {
> >> +MIN_NAT_EPHEMERAL_PORT = 1024,
> >> +MAX_NAT_EPHEMERAL_PORT = 65535
> >> +};
> >> +
> >>  struct nat_action_info_t {
> >>  union ct_addr min_addr;
> >>  union ct_addr max_addr;
> >> @@ -85,6 +93,28 @@ struct nat_action_info_t {
> >>  uint16_t nat_action;
> >>  };
> >>  
> >> +#define IN_RANGE(curr, min, max) \
> >> +(curr >= min && curr <= max)
> >> +
> >> +#define NEXT_PORT_IN_RANGE(curr, min, max) \
> >> +curr = (!IN_RANGE(curr, min, max) || curr == max) ? min : curr + 1
> >> +
> >> +/* if the current port is out of range increase the attempts by
> >> + * one so that in the worst case scenario the current out of
> >> + * r