Re: [ovs-dev] [PATCH ovn v3] controller: avoid recomputes triggered by SBDB Port_Binding updates.

2022-07-20 Thread Han Zhou
On Wed, Jul 20, 2022 at 10:29 AM Numan Siddique  wrote:
>
> On Wed, Jul 13, 2022 at 4:29 AM Dumitru Ceara  wrote:
> >
> > On 7/13/22 11:08, Dumitru Ceara wrote:
> > > On 7/13/22 09:40, Xavier Simonart wrote:
> > >> Hi Han, Dumitru
> > >>
> > >
> > > Hi Han, Xavier,
> > >
> > > Sorry, I had already replied to the previous message and only then
> > > noticed this one.
> > >
> > >> I think that we should, as much as possible, try to achieve both
goals:
> > >> - have an accurate ovn-installed
> > >> - do not increase latency in large scale deployments
> > >>
> > >
> > > +1
> > >
> > >> The fact that ovn-installed is sent too early for mc flows is
already an
> > >> issue today, independent of this patch.
> > >> Fixing ovn-installed related to mc flows by delaying the state
change (for
> > >> all cases, included when no mc groups) might be seen as a performance
> > >> regression.
> > >>
> > >
> > > I think it will, and I'm not sure we can convince the CMS that this is
> > > "just metrics".
> > >
> > >> I agree that we should fix this ovn-installed issue, but it is not a
> > >> regression added by this patch. We should enter a BZ for it.
> > >> Per my understanding, the mc flows are updated when the
SB_multicast_group
> > >> is seen as updated by ovn-controller, due to its references to port
binding.
> > >> Other flows related to port binding are installed earlier, i.e. when
> > >> ovn-controller writes port_binding->chassis (i.e. before it receives
SB
> > >> confirmation). So, while sending the mc flows earlier than what we
do today
> > >> might be more complex, I think it makes some kind of sense (we would
send
> > >> all those flows within the same loop).
> > >
> > > I'm inclining towards leaving it as it is today if this is the only
flow
> > > we're missing.  It's a guess without testing things out, but I think
> > > it's for the MC_FLOOD_L2 multicast group which is used only for
> > > forwarding ARP packets originated by OVN routers or destined to a
> > > specific OVN router.  Losing some of those packets is not a big deal.
> > >
> > > But it might be good to confirm that this is the MC group we install
the
> > > flow for.
> > >
> >
> > Oh, well, the port is also part of the MC_FLOOD group.  This is however
> > only used for BUM traffic.  So losing some packets here is also not
> > terrible, I think.
> >
>
> When a logical port is claimed,  we process the logical flows related
> to the logical port (i.e with inport ==  or outport == )
> and install
> the corresponding openflows. All the generic logical flows (i.e
> without inport or outport match) would have already been programmed
> (if the datapath already part of local_datapaths).
> These processed logical flows (lflow_handle_flows_for_lport() in
> lflow.c) will be most likely part of the same openflow bundle. And
> once the sequence number
> for this bundle is acknowledged we set "ovn-installed=true".  When CMS
> notices "ovn-installed=true" , I think it can fairly assume that the
> flows for the lport are
> programmed.
>
> I think the only flows pertaining to the logical port  which we would
> be missing are the multicast related flows and the logical flows which
> ovn-northd would generate after the logical port
> is claimed (presently it is the arp responder flows) and I don't think
> we can wait for these logical flows to be programmed by ovn-northd
> before setting "ovn-installed=true".

The missing flows is just a side-effect. I am more concerned with the
clearness of the state-machine.
To my understanding it would be very clear to define the "CLAIMED" state's
job as claiming the port in SB-DB. If SB commit fails, the retry should
happen at this state. If we see the update notification (i.e. we see the
PB.chassis matches the desired chassis), we move to the next state
"INSTALL_FLOWS". Now if we move the state forward without confirming the
PB.chassis is updated in SB, we would need to perform the task in all the
following states. The only benefit we get from this is that ovn-installed
can be set to true a little bit earlier (to save a SB round trip), at the
cost of more complexity (even more so if ovn-monitor-all needs to be
considered in this logic) and less clarity of the state machine.

Is it possible to address it with the simpler/clear approach first and see
if it really causes obvious performance regression, then we can consider
the "short-cuts"? I am not sure if it is some kind of premature
optimization at this point.

Thanks,
Han
>
> Delaying setting the ovn-installed=true would definitely result in
> latency.  It would not be easy for ovn-controller to keep track of
> openflows already programmed for a logical port
> In other words,  I don't think ovn-controller can accurately keep
> track of all the openflows related to a logical port are programmed or
> not  unless all these flows are grouped in one bundle.
>
> Also since the present ovn main already sets ovn-installed=true a bit
> early i.e. even before the multicast and arp responder f

Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-20 Thread Han Zhou
On Wed, Jul 20, 2022 at 12:13 PM Numan Siddique  wrote:
>
> On Wed, Jul 20, 2022 at 6:18 AM Dumitru Ceara  wrote:
> >
> >
> > Hi Han,
> >
> > On 7/7/22 19:02, Dumitru Ceara wrote:
> > > On 7/7/22 18:21, Han Zhou wrote:
> > >> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara 
wrote:
> > >>>
> > >>> On 7/7/22 13:45, Dumitru Ceara wrote:
> >  On 7/7/22 00:08, Han Zhou wrote:
> > > On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara 
wrote:
> > >>
> > >> Hi Han,
> > >>
> > >> On 7/6/22 00:41, Han Zhou wrote:
> > >>> The ls_in_pre_stateful priority 120 flow that saves dst IP and
Port
> > >> to
> > >>> registers is causing a critical dataplane performance impact to
> > >>> short-lived connections, because it unwildcards megaflows with
exact
> > >>> match on dst IP and L4 ports. Any new connections with a
different
> > >>> client side L4 port will encounter datapath flow miss and
upcall to
> > >>> ovs-vswitchd.
> > >>>
> > >>> These fields (dst IP and port) were saved to registers to solve
a
> > >>> problem of LB hairpin use case when different VIPs are sharing
> > >>> overlapping backend+port [0]. The change [0] might not have as
wide
> > >>> performance impact as it is now because at that time one of the
match
> > >>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for
established
> > >> and
> > >>> natted traffic, while now the impact is more obvious because
> > >>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> > >>> configured on the LS) since commit [1], after several other
> > >> indirectly
> > >>> related optimizations and refactors.
> > >>>
> > >>> Since the changes that introduced the performance problem had
their
> > >>> own values (fixes problems or optimizes performance), so we
don't
> > >> want
> > >>> to revert any of the changes (and it is also not
straightforward to
> > >>> revert any of them because there have been lots of changes and
> > >> refactors
> > >>> on top of them).
> > >>>
> > >>> Change [0] itself has added an alternative way to solve the
> > >> overlapping
> > >>> backends problem, which utilizes ct fields instead of saving
dst IP
> > >> and
> > >>> port to registers. This patch forces to that approach and
removes the
> > >>> flows/actions that saves the dst IP and port to avoid the
dataplane
> > >>> performance problem for short-lived connections.
> > >>>
> > >>> (With this approach, the ct_state DNAT is not HW offload
friendly,
> > >> so it
> > >>> may result in those flows not being offloaded, which is
supposed to
> > >> be
> > >>> solved in a follow-up patch)
> >
> > While we're waiting for more input from ovn-k8s on this, I have a
> > slightly different question.
> >
> > Aren't we hitting a similar problem in the router pipeline, due to
> > REG_ORIG_TP_DPORT_ROUTER?
> >
> > Thanks,
> > Dumitru
> >
> > >>>
> > >>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> > >> shared
> > > backends.")
> > >>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> > >> traffic.")
> > >>>
> > >>> Signed-off-by: Han Zhou 
> > >>> ---
> > >>
> > >> I think the main concern I have is that this forces us to choose
> > >> between:
> > >> a. non hwol friendly flows (reduced performance)
> > >> b. less functionality (with the knob in patch 3/3 set to false).
> > >>
> > > Thanks Dumitru for the comments! I agree the solution is not
ideal,
> > >> but if
> > > we look at it from a different angle, even with a), for most
> > >> pod->service
> > > traffic the performance is still much better than how it is today
(not
> > > offloaded kernel datapath is still much better than userspace
> > >> slowpath).
> > > And *hopefully* b) is ok for most use cases to get HW-offload
> > >> capability.
> > >
> > >>>
> > >>> Just a note on this item.  I'm a bit confused about why all traffic
> > >>> would be slowpath-ed?  It's just the first packet that goes to
vswitchd
> > >>> as an upcall, right?
> > >>>
> > >>
> > >> It is about all traffic for *short-lived* connections. Any clients ->
> > >> service traffic with the pattern:
> > >> 1. TCP connection setup
> > >> 2. Set API request, receives response
> > >> 3. Close TCP connection
> > >> It can be tested with netperf TCP_CRR. Every time the client side
TCP port
> > >> is different, but since the server -> client DP flow includes the
client
> > >> TCP port, for each such transaction there is going to be at least a
DP flow
> > >> miss and goes to userspace. Such application latency would be very
high. In
> > >> addition, it causes the OVS handler CPU spikes very high which would
> > >> further impact the dataplane performance of the system.
> > >>
> > >>> Once the megaflow (even if it's more specific than ideal) is
installed
> > >>> all following traffic in that session should be forwarded in fast
path
> > >>> (k

[ovs-dev] [PATCH ovn] pinctrl: fix monitor state when using udp(icmp) to check healthy state

2022-07-20 Thread wangchuanlei
Hi,
On my enviroment, i have a load-balancer, witch uses udp(icmp) protocol to 
check connection status between vip and backend.
In normal cirsumstances, vip send request packet to backends , and backends 
send reply to vip, so the result of health check is online status. Then, i set 
the nic of backends vm to down state through command "ip link set eth0 down", 
in this case, the backends vm would't send reply to vip, but vip still send 
request to backends.In function svc_monitors_run, after sending request packet, 
state of vip would change to waiting, after 3 seconds, due to no reply packet, 
svc_mon->state change to SVC_MON_S_ONLINE, so, even no reply state keep on 
line. This is unreasonable, it should be offline.
So this patch is to fix this bug!

Signed-off-by: wangchuanlei 
---
 controller/pinctrl.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/controller/pinctrl.c b/controller/pinctrl.c
index 38e8590af..1a21c7704 100644
--- a/controller/pinctrl.c
+++ b/controller/pinctrl.c
@@ -7526,13 +7526,9 @@ svc_monitors_run(struct rconn *swconn,
 
 case SVC_MON_S_WAITING:
 if (current_time > svc_mon->wait_time) {
-if (svc_mon->protocol ==  SVC_MON_PROTO_TCP) {
-svc_mon->n_failures++;
-svc_mon->state = SVC_MON_S_OFFLINE;
-} else {
-svc_mon->n_success++;
-svc_mon->state = SVC_MON_S_ONLINE;
-}
+svc_mon->n_failures++;
+svc_mon->state = SVC_MON_S_OFFLINE;
+
 svc_mon->next_send_time = current_time + svc_mon->interval;
 next_run_time = svc_mon->next_send_time;
 } else {
-- 
2.27.0

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v2] Fix compilation issue in fedora 37/rawhide.

2022-07-20 Thread numans
From: Numan Siddique 

Compilation is failing with the error:

File "Documentation/conf.py", line 61, in 
with open(filename, 'rU') as f:
 
ValueError: invalid mode: 'rU'

The python 3 documentation [1] says:

There is an additional mode character permitted, 'U', which no
longer has any effect, and is considered deprecated. It
previously enabled universal newlines in text mode, which
became the default behaviour in Python 3.0. Refer to the
documentation of the newline parameter for further details.

[1] - https://docs.python.org/3.9/library/functions.html#open

This patch fixes this issue.

Signed-off-by: Numan Siddique 
---
 Documentation/conf.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/conf.py b/Documentation/conf.py
index d89c64e775..f7eceaec83 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -58,7 +58,7 @@ author = u'The Open Virtual Network (OVN) Development 
Community'
 # The full version, including alpha/beta/rc tags.
 release = None
 filename = "../configure.ac"
-with open(filename, 'rU') as f:
+with open(filename, 'r') as f:
 for line in f:
 if 'AC_INIT' in line:
 # Parse "AC_INIT(openvswitch, 2.7.90, b...@openvswitch.org)":
-- 
2.36.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] Fix compilation issue in fedora 37/rawhide.

2022-07-20 Thread 0-day Robot
Bleep bloop.  Greetings Numan Siddique, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
ERROR: Author Numan Siddique  needs to sign off.
Lines checked: 27, Warnings: 0, Errors: 1


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn] Fix compilation issue in fedora 37/rawhide.

2022-07-20 Thread numans
From: Numan Siddique 

Compilation is failing with the error:

---
File "Documentation/conf.py", line 61, in 
with open(filename, 'rU') as f:
 
ValueError: invalid mode: 'rU'
---

The python 3 documentation [1] says:

There is an additional mode character permitted, 'U', which no
longer has any effect, and is considered deprecated. It
previously enabled universal newlines in text mode, which
became the default behaviour in Python 3.0. Refer to the
documentation of the newline parameter for further details.

[1] - https://docs.python.org/3.9/library/functions.html#open

This patch fixes this issue.

Signed-off-by: Numan Siddique 
---
 Documentation/conf.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/conf.py b/Documentation/conf.py
index d89c64e775..f7eceaec83 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -58,7 +58,7 @@ author = u'The Open Virtual Network (OVN) Development 
Community'
 # The full version, including alpha/beta/rc tags.
 release = None
 filename = "../configure.ac"
-with open(filename, 'rU') as f:
+with open(filename, 'r') as f:
 for line in f:
 if 'AC_INIT' in line:
 # Parse "AC_INIT(openvswitch, 2.7.90, b...@openvswitch.org)":
-- 
2.36.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v3 4/6] northd: Add MAC binding aging mechanism

2022-07-20 Thread 0-day Robot
Bleep bloop.  Greetings Ales Musil, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 84 characters long (recommended limit is 79)
#311 FILE: ovn-nb.xml:2397:
  type='{"type": "integer", "minInteger": 0, "maxInteger": 
4294967295}'>

Lines checked: 321, Warnings: 1, Errors: 0


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v3 5/6] ovn.at: Add test case covering the MAC binding aging

2022-07-20 Thread Mark Michelson

Hi Ales, I have one finding below.

On 7/20/22 04:48, Ales Musil wrote:

Reported-at: https://bugzilla.redhat.com/2084668
Signed-off-by: Ales Musil 
---
v3: Rebase on top of current main.
 Update according to the final conclusion.
---
  tests/ovn.at | 109 +++
  1 file changed, 109 insertions(+)

diff --git a/tests/ovn.at b/tests/ovn.at
index 7d10610fd..cbacc52c5 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -32505,3 +32505,112 @@ AT_CHECK([test $(ovn-sbctl list fdb | grep -c 
"00:00:00:00:10:30") = 0])
  OVN_CLEANUP([hv1])
  AT_CLEANUP
  ])
+
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([MAC binding aging])
+ovn_start
+
+net_add n1
+
+AT_CHECK([ovn-nbctl ls-add public])
+AT_CHECK([ovn-nbctl ls-add internal])
+
+AT_CHECK([ovn-nbctl lsp-add public ln_port])
+AT_CHECK([ovn-nbctl lsp-set-addresses ln_port unknown])
+AT_CHECK([ovn-nbctl lsp-set-type ln_port localnet])
+AT_CHECK([ovn-nbctl lsp-set-options ln_port network_name=physnet1])
+
+AT_CHECK([ovn-nbctl lsp-add public public-gw])
+AT_CHECK([ovn-nbctl lsp-set-type public-gw router])
+AT_CHECK([ovn-nbctl lsp-set-addresses public-gw 00:00:00:00:10:00 router])
+AT_CHECK([ovn-nbctl lsp-set-options public-gw router-port=gw-public])
+
+AT_CHECK([ovn-nbctl lsp-add internal internal-gw])
+AT_CHECK([ovn-nbctl lsp-set-type internal-gw router])
+AT_CHECK([ovn-nbctl lsp-set-addresses internal-gw 00:00:00:00:20:00 router])
+AT_CHECK([ovn-nbctl lsp-set-options internal-gw router-port=gw-internal])
+
+AT_CHECK([ovn-nbctl lsp-add internal vif1])
+AT_CHECK([ovn-nbctl lsp-set-addresses vif1 "00:00:00:00:20:10 192.168.20.10"])
+
+AT_CHECK([ovn-nbctl lsp-add internal vif2])
+AT_CHECK([ovn-nbctl lsp-set-addresses vif2 "00:00:00:00:20:20 192.168.20.20"])
+
+AT_CHECK([ovn-nbctl lr-add gw])
+AT_CHECK([ovn-nbctl lrp-add gw gw-public 00:00:00:00:10:00 192.168.10.1/24])
+AT_CHECK([ovn-nbctl lrp-add gw gw-internal 00:00:00:00:20:00 192.168.20.1/24])
+
+sim_add hv1
+as hv1
+ovs-vsctl add-br br-underlay
+ovn_attach n1 br-underlay 192.168.0.1
+ovs-vsctl add-br br-phys
+ovs-vsctl -- add-port br-int vif1 -- \
+set interface vif1 external-ids:iface-id=vif1 \
+options:tx_pcap=hv1/vif1-tx.pcap \
+options:rxq_pcap=hv1/vif1-rx.pcap \
+ofport-request=1
+ovs-vsctl -- add-port br-phys ext1 -- \
+set interface ext1 \
+options:tx_pcap=hv1/ext1-tx.pcap \
+options:rxq_pcap=hv1/ext1-rx.pcap \
+ofport-request=2
+ovs-vsctl set open . external_ids:ovn-bridge-mappings=physnet1:br-phys
+
+sim_add hv2
+as hv2
+ovs-vsctl add-br br-underlay
+ovn_attach n1 br-underlay 192.168.0.2
+ovs-vsctl add-br br-phys
+ovs-vsctl -- add-port br-int vif2 -- \
+set interface vif2 external-ids:iface-id=vif2 \
+options:tx_pcap=hv2/vif2-tx.pcap \
+options:rxq_pcap=hv2/vif2-rx.pcap \
+ofport-request=1
+ovs-vsctl -- add-port br-phys ext2 -- \
+set interface ext2 \
+options:tx_pcap=hv2/ext2-tx.pcap \
+options:rxq_pcap=hv2/ext2-rx.pcap \
+ofport-request=2
+ovs-vsctl set open . external_ids:ovn-bridge-mappings=physnet1:br-phys
+
+OVN_POPULATE_ARP
+
+send_garp() {
+hv=$1
+dev=$2
+mac_byte=$3
+ip_byte=${4-$3}
+
+mac="10$mac_byte"
+ip=`ip_to_hex 192 168 10 $ip_byte`
+packet=${mac}08060001080006040002${mac}${ip}${mac}${ip}
+as $hv ovs-appctl netdev-dummy/receive $dev $packet
+}
+
+# Check if the option is not present by default
+AT_CHECK([fetch_column nb:logical_router options name="gw" | grep -q 
mac_binding_age_threshold], [1])
+
+# Set the MAC binding aging threshold
+AT_CHECK([ovn-nbctl set logical_router gw options:mac_binding_age_threshold=1])
+AT_CHECK([fetch_column nb:logical_router options | grep -q 
mac_binding_age_threshold=1])
+AT_CHECK([ovn-nbctl --wait=sb sync])
+
+# Send GARP to populate MAC binding table records
+send_garp hv1 ext1 10
+send_garp hv2 ext2 20
+
+OVS_WAIT_UNTIL([ovn-sbctl list mac_binding | grep -q "192.168.10.10"])
+OVS_WAIT_UNTIL([ovn-sbctl list mac_binding | grep -q "192.168.10.20"])


I worry that on a slower system, the MAC binding could be added and aged 
out before you reach the OVS_WAIT_UNTIL() commands above, causing the 
test to fail.


What might be better is to create the MAC bindings, ensure they are 
present, then set the mac_binding_age_threshold to 1 and make sure they 
go away.



+
+# Check if the records are removed after 1 sec of inactivity
+OVS_WAIT_UNTIL([
+test "0" = "$(ovn-sbctl list mac_binding | grep -c '192.168.10.10')"
+])
+OVS_WAIT_UNTIL([
+test "0" = "$(ovn-sbctl list mac_binding | grep -c '192.168.10.20')"
+])
+
+OVN_CLEANUP([hv1], [hv2])
+AT_CLEANUP
+])


___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v3 2/6] controller: Add mac-binding-index.c/.h files

2022-07-20 Thread Mark Michelson

Hi Ales, I have one minor finding below.

On 7/20/22 04:48, Ales Musil wrote:

Add helper source file for creating index
over MAC binding table.

Reported-at: https://bugzilla.redhat.com/2084668
Signed-off-by: Ales Musil 
---
v3: Rebase on top of current main.
 Update according to the final conclusion.
---
  controller/ovn-controller.c |  8 +++-
  lib/automake.mk |  2 ++
  lib/mac-binding-index.c | 35 +++
  lib/mac-binding-index.h | 26 ++
  4 files changed, 66 insertions(+), 5 deletions(-)
  create mode 100644 lib/mac-binding-index.c
  create mode 100644 lib/mac-binding-index.h

diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 2e9138036..31d7f4566 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -54,6 +54,7 @@
  #include "lib/extend-table.h"
  #include "lib/ip-mcast-index.h"
  #include "lib/mcast-group-index.h"
+#include "lib/mac-binding-index.h"
  #include "lib/ovn-sb-idl.h"
  #include "lib/ovn-util.h"
  #include "patch.h"
@@ -3454,9 +3455,7 @@ main(int argc, char *argv[])
  = ovsdb_idl_index_create1(ovnsb_idl_loop.idl,
&sbrec_datapath_binding_col_tunnel_key);
  struct ovsdb_idl_index *sbrec_mac_binding_by_lport_ip
-= ovsdb_idl_index_create2(ovnsb_idl_loop.idl,
-  &sbrec_mac_binding_col_logical_port,
-  &sbrec_mac_binding_col_ip);
+= mac_binding_by_lport_ip_index_create(ovnsb_idl_loop.idl);
  struct ovsdb_idl_index *sbrec_ip_multicast
  = ip_mcast_index_create(ovnsb_idl_loop.idl);
  struct ovsdb_idl_index *sbrec_igmp_group
@@ -3469,8 +3468,7 @@ main(int argc, char *argv[])
&sbrec_fdb_col_mac,
&sbrec_fdb_col_dp_key);
  struct ovsdb_idl_index *sbrec_mac_binding_by_datapath
-= ovsdb_idl_index_create1(ovnsb_idl_loop.idl,
-  &sbrec_mac_binding_col_datapath);
+= mac_binding_by_datapath_index_create(ovnsb_idl_loop.idl);
  struct ovsdb_idl_index *sbrec_static_mac_binding_by_datapath
  = ovsdb_idl_index_create1(ovnsb_idl_loop.idl,
&sbrec_static_mac_binding_col_datapath);
diff --git a/lib/automake.mk b/lib/automake.mk
index 3a2da1fe4..0c0b95f4e 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -26,6 +26,8 @@ lib_libovn_la_SOURCES = \
lib/ovn-parallel-hmap.c \
lib/ip-mcast-index.c \
lib/ip-mcast-index.h \
+   lib/mac-binding-index.c \
+lib/mac-binding-index.h \


The second line you added is indented with four spaces instead of a tab.


lib/mcast-group-index.c \
lib/mcast-group-index.h \
lib/lex.c \
diff --git a/lib/mac-binding-index.c b/lib/mac-binding-index.c
new file mode 100644
index 0..9a8b7deb8
--- /dev/null
+++ b/lib/mac-binding-index.c
@@ -0,0 +1,35 @@
+/* Copyright (c) 2022, Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include "lib/mac-binding-index.h"
+#include "lib/ovn-sb-idl.h"
+
+struct ovsdb_idl_index *
+mac_binding_by_datapath_index_create(struct ovsdb_idl
+ *idl)
+{
+return ovsdb_idl_index_create1(idl, &sbrec_mac_binding_col_datapath);
+}
+
+struct ovsdb_idl_index *
+mac_binding_by_lport_ip_index_create(struct ovsdb_idl
+ *idl)
+{
+return ovsdb_idl_index_create2(idl,
+   &sbrec_mac_binding_col_logical_port,
+   &sbrec_mac_binding_col_ip);
+}
diff --git a/lib/mac-binding-index.h b/lib/mac-binding-index.h
new file mode 100644
index 0..f2853a7ca
--- /dev/null
+++ b/lib/mac-binding-index.h
@@ -0,0 +1,26 @@
+/* Copyright (c) 2022, Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations und

Re: [ovs-dev] [RFC ovn] Multi-chassis port + MTU behavior

2022-07-20 Thread Numan Siddique
On Mon, Jul 18, 2022 at 7:44 PM Ihar Hrachyshka  wrote:
>
> Hi folks,
>
> looking for some advices on MTU / IP behavior for multi-chassis ports.
>
> 22.06 got the new multichassis port concept introduced where the same
> port may be present on multiple chassis, which can be used as a
> performance optimization for VM live migration, among other things.
> Migration with sub-0.1ms network downtime is achieved by cloning all
> traffic directed towards a multichassis port to all its locations,
> making all chassis hosting it receive relevant traffic at any given
> moment until migration is complete.
>
> For logical switches with a localnet port where traffic usually goes
> through the localnet port and not tunnels, it means enforcement of
> tunneling of egress and ingress traffic for a multichassis port. (The
> rest of the traffic between other, non-multichassis, ports keeps going
> through localnet port.) Tunneling enforcement is done because traffic
> sent through localnet port won't generally be cloned to all port binding
> locations by the upstream switch.
>
> A problem comes down when ingress or egress traffic of a multichassis
> port, being redirected through a tunnel, gets lost because of geneve
> header overhead or because the interface used for tunneling has a
> different MTU from the physical bridge backing up the localnet port.
>
> This is happening when:
>   - tunnel_iface_mtu < localnet_mtu + geneve_overhead
>
> This makes for an unfortunate situation where, for a multichassis port,
> SOME traffic (e.g. regular ICMP requests) pass through without any
> problems, while OTHER traffic (e.g. produced with 'ping -s) doesn't.
> (Test scenario demonstrating it is included below.)
>
> Here are ideas I came up with on how this could be resolved or at least
> mitigated:
>
> a) pass "oversized" traffic through localnet, and the rest through
> tunnels. Apart from confusing pattern on the wire where packets that
> belong to the same TCP session may go through two paths (arguably not a
> big problem and should be expected by other network hardware), this gets
> us back to the problem of upstream switch not delivering packets to all
> binding chassis.
>
> b) fragmentation needed. We could send ICMP Fragmentation Needed ICMP
> errors on attempts to send oversized packets to or from a multichassis
> port. Then TCP sessions could adjust their properties to reflect the new
> recommended MTU. We already have a similar mechanism for router ports.
> There are several caveats and limitations with fragmentation needed (b):
>
> - AFAIU this works for some protocols but not others. It also depends on
>   the client network stack to reflect the change in network path. TCP
>   should probably work.
>
> - It may be confusing for users that some L2 paths between ports of the
>   same switch have reduced MTU while others have the regular, depending
>   on the type of a port (single- or multi-chassis).
>
> c) implement fragmentation inside L2 domain. I actually don't know if
> that's even compliant with RFCs. Usually packet fragmentation is
> implemented on L2 domain boundary, by a router. In this scenario, a peer
> port on the same switch would receive fragments for a packet that was
> sent as a single piece.
>
> I currently lean towards (b) though it's not a universal fix since it
> requires collaboration of the underlying network stack. But (a) leaves
> cloning broken, and (c) is even more invasive, taking action on port's
> packets (fragmenting them) without the port's owner knowledge.
>
> Perhaps this is all unnecessary and there's a way to make OVN
> transparently split and reassemble packets as needed, though I doubt it
> since it doesn't encapsulate tunneled traffic into another application
> layer. But if there's a way to achieve transparent re-assemble, or there
> are other alternatives beyond (a)-(c), let me know. Please let me know
> what you think of (a)-(c) regardless.
>
> Thanks,
> Ihar

I lean towards (b) too.   I suppose we only need this special handling
until the migration is complete and the multi chassis option is
cleared by CMS.

For (a) how would OVN decide if the packet is oversized ?  Using the
check_pkt_larger OVS action ?

I've no idea how (c) can be done.

Thanks
Numan

> ---
>  tests/ovn.at | 92 
>  1 file changed, 92 insertions(+)
>
> diff --git a/tests/ovn.at b/tests/ovn.at
> index c346975e6..4ec1d37c3 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -14966,6 +14966,98 @@ OVN_CLEANUP([hv1],[hv2],[hv3])
>  AT_CLEANUP
>  ])
>
> +OVN_FOR_EACH_NORTHD([
> +AT_SETUP([localnet connectivity with multiple requested-chassis, max mtu])
> +AT_KEYWORDS([multi-chassis])
> +ovn_start
> +
> +net_add n1
> +for i in 1 2; do
> +sim_add hv$i
> +as hv$i
> +check ovs-vsctl add-br br-phys
> +ovn_attach n1 br-phys 192.168.0.$i
> +check ovs-vsctl set open . external-ids:ovn-bridge-mappings=phys:br-phys
> +done
> +
> +check ovn-nbctl ls-add ls0
> +

Re: [ovs-dev] [PATCH ovn] tests: fixed '1 LR with distributed router gateway port' fluke

2022-07-20 Thread Numan Siddique
On Tue, Jul 12, 2022 at 11:57 AM Ihar Hrachyshka  wrote:
>
> The test sometimes failed because of duplicate periodic bcast ARPs
> captured. Use _UNIQ version of packet check procedure to pass.
>
> Signed-off-by: Ihar Hrachyshka 

Thanks.  Applied to the main branch.

Numan

> ---
>  tests/ovn.at | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tests/ovn.at b/tests/ovn.at
> index d65a0ac39..71a7c55a2 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -11696,7 +11696,7 @@ as hv3 ovs-ofctl dump-flows br-int
>  echo ""
>
>  echo $expected >> hv3-vif1.expected
> -OVN_CHECK_PACKETS([hv3/vif1-tx.pcap], [hv3-vif1.expected])
> +OVN_CHECK_PACKETS_UNIQ([hv3/vif1-tx.pcap], [hv3-vif1.expected])
>
>  #Check ovn-trace over "chassisredirect" port
>  AT_CAPTURE_FILE([trace])
> @@ -11753,7 +11753,7 @@ dst_ip=`ip_to_hex 192 168 1 3`
>  
> expected=${dst_mac}${src_mac}0800451c3f110100${src_ip}${dst_ip}00350008
>
>  echo $expected >> hv2-vif1.expected
> -OVN_CHECK_PACKETS([hv2/vif1-tx.pcap], [hv2-vif1.expected])
> +OVN_CHECK_PACKETS_UNIQ([hv2/vif1-tx.pcap], [hv2-vif1.expected])
>
>  check_row_count Port_Binding 1 logical_port=cr-alice
>
> --
> 2.34.1
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-20 Thread Numan Siddique
On Wed, Jul 20, 2022 at 6:18 AM Dumitru Ceara  wrote:
>
>
> Hi Han,
>
> On 7/7/22 19:02, Dumitru Ceara wrote:
> > On 7/7/22 18:21, Han Zhou wrote:
> >> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara  wrote:
> >>>
> >>> On 7/7/22 13:45, Dumitru Ceara wrote:
>  On 7/7/22 00:08, Han Zhou wrote:
> > On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
> >>
> >> Hi Han,
> >>
> >> On 7/6/22 00:41, Han Zhou wrote:
> >>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
> >> to
> >>> registers is causing a critical dataplane performance impact to
> >>> short-lived connections, because it unwildcards megaflows with exact
> >>> match on dst IP and L4 ports. Any new connections with a different
> >>> client side L4 port will encounter datapath flow miss and upcall to
> >>> ovs-vswitchd.
> >>>
> >>> These fields (dst IP and port) were saved to registers to solve a
> >>> problem of LB hairpin use case when different VIPs are sharing
> >>> overlapping backend+port [0]. The change [0] might not have as wide
> >>> performance impact as it is now because at that time one of the match
> >>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
> >> and
> >>> natted traffic, while now the impact is more obvious because
> >>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
> >>> configured on the LS) since commit [1], after several other
> >> indirectly
> >>> related optimizations and refactors.
> >>>
> >>> Since the changes that introduced the performance problem had their
> >>> own values (fixes problems or optimizes performance), so we don't
> >> want
> >>> to revert any of the changes (and it is also not straightforward to
> >>> revert any of them because there have been lots of changes and
> >> refactors
> >>> on top of them).
> >>>
> >>> Change [0] itself has added an alternative way to solve the
> >> overlapping
> >>> backends problem, which utilizes ct fields instead of saving dst IP
> >> and
> >>> port to registers. This patch forces to that approach and removes the
> >>> flows/actions that saves the dst IP and port to avoid the dataplane
> >>> performance problem for short-lived connections.
> >>>
> >>> (With this approach, the ct_state DNAT is not HW offload friendly,
> >> so it
> >>> may result in those flows not being offloaded, which is supposed to
> >> be
> >>> solved in a follow-up patch)
>
> While we're waiting for more input from ovn-k8s on this, I have a
> slightly different question.
>
> Aren't we hitting a similar problem in the router pipeline, due to
> REG_ORIG_TP_DPORT_ROUTER?
>
> Thanks,
> Dumitru
>
> >>>
> >>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
> >> shared
> > backends.")
> >>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
> >> traffic.")
> >>>
> >>> Signed-off-by: Han Zhou 
> >>> ---
> >>
> >> I think the main concern I have is that this forces us to choose
> >> between:
> >> a. non hwol friendly flows (reduced performance)
> >> b. less functionality (with the knob in patch 3/3 set to false).
> >>
> > Thanks Dumitru for the comments! I agree the solution is not ideal,
> >> but if
> > we look at it from a different angle, even with a), for most
> >> pod->service
> > traffic the performance is still much better than how it is today (not
> > offloaded kernel datapath is still much better than userspace
> >> slowpath).
> > And *hopefully* b) is ok for most use cases to get HW-offload
> >> capability.
> >
> >>>
> >>> Just a note on this item.  I'm a bit confused about why all traffic
> >>> would be slowpath-ed?  It's just the first packet that goes to vswitchd
> >>> as an upcall, right?
> >>>
> >>
> >> It is about all traffic for *short-lived* connections. Any clients ->
> >> service traffic with the pattern:
> >> 1. TCP connection setup
> >> 2. Set API request, receives response
> >> 3. Close TCP connection
> >> It can be tested with netperf TCP_CRR. Every time the client side TCP port
> >> is different, but since the server -> client DP flow includes the client
> >> TCP port, for each such transaction there is going to be at least a DP flow
> >> miss and goes to userspace. Such application latency would be very high. In
> >> addition, it causes the OVS handler CPU spikes very high which would
> >> further impact the dataplane performance of the system.
> >>
> >>> Once the megaflow (even if it's more specific than ideal) is installed
> >>> all following traffic in that session should be forwarded in fast path
> >>> (kernel).
> >>>
> >>> Also, I'm not sure I follow why the same behavior wouldn't happen with
> >>> your changes too for pod->service.  The datapath flow includes the
> >>> dp_hash() match, and that's likely different for different connections.
> >>>
> >>
> >> With the change it is not 

Re: [ovs-dev] [PATCH ovn] controller: fix typo in get_lport_type_str()

2022-07-20 Thread Numan Siddique
On Tue, Jul 19, 2022 at 10:13 AM Vladislav Odintsov  wrote:
>
> Signed-off-by: Vladislav Odintsov 

Thanks.  Applied to the main branch.

Numan

> ---
>  controller/binding.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/controller/binding.c b/controller/binding.c
> index 9faef02dd..ba803ae3e 100644
> --- a/controller/binding.c
> +++ b/controller/binding.c
> @@ -881,7 +881,7 @@ get_lport_type_str(enum en_lport_type lport_type)
>  case LP_CHASSISREDIRECT:
>  return "CHASSISREDIRECT";
>  case LP_L3GATEWAY:
> -return "L3GATEWAT";
> +return "L3GATEWAY";
>  case LP_LOCALNET:
>  return "PATCH";
>  case LP_LOCALPORT:
> --
> 2.26.3
>
> ___
> dev mailing list
> d...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn] ci: ovn-kubernetes: Align CI jobs with recent ovn-kubernetes upstream.

2022-07-20 Thread Numan Siddique
On Wed, Jul 20, 2022 at 12:22 AM Ales Musil  wrote:
>
> Hi Dumitru,
>
> this is a great improvement, thank you.
>
> Acked-by: Ales Musil 

Thanks.  Applied to the main branch.

Numan

>
> On Tue, Jul 19, 2022 at 10:20 PM Dumitru Ceara  wrote:
>
> > This has the benefit that we now use names that have more meaning
> > instead of 'true/false'.  The change is inspired from the current
> > contents of the upstream ovn-kubernetes e2e GHA test specification:
> >
> > https://github.com/ovn-org/ovn-kubernetes/blob/69d8a995d79b280c4bb46f079fbe26f4d5550b16/.github/workflows/test.yml#L349
> >
> > E.g., a new job will be called:
> > e2e (control-plane, HA, shared, ipv4, noSnatGW)
> >
> > Instead of:
> > e2e (control-plane, true, true, true, ipv4, IPv4, true, false)
> >
> > Signed-off-by: Dumitru Ceara 
> > ---
> > Note: one of the jobs is still failing but that's a test issue and is
> > being tracked upstream by:
> > https://github.com/ovn-org/ovn-kubernetes/issues/3078
> > ---
> >  .github/workflows/ovn-kubernetes.yml | 59 +++-
> >  1 file changed, 23 insertions(+), 36 deletions(-)
> >
> > diff --git a/.github/workflows/ovn-kubernetes.yml
> > b/.github/workflows/ovn-kubernetes.yml
> > index 431e47660..03f35d7a3 100644
> > --- a/.github/workflows/ovn-kubernetes.yml
> > +++ b/.github/workflows/ovn-kubernetes.yml
> > @@ -55,44 +55,31 @@ jobs:
> >  strategy:
> >fail-fast: false
> >matrix:
> > -target:
> > -  - shard: shard-conformance
> > -hybrid-overlay: false
> > -multicast-enable: false
> > -emptylb-enable: false
> > -  - shard: control-plane
> > -hybrid-overlay: true
> > -multicast-enable: true
> > -emptylb-enable: true
> > -ipfamily:
> > -  - ip: ipv4
> > -name: "IPv4"
> > -ipv4: true
> > -ipv6: false
> > -  - ip: ipv6
> > -name: "IPv6"
> > -ipv4: false
> > -ipv6: true
> > -  - ip: dualstack
> > -name: "Dualstack"
> > -ipv4: true
> > -ipv6: true
> > -# Example of how to exclude a fully qualified test:
> > -# - {"ipfamily": {"ip": ipv4}, "ha": {"enabled": "false"},
> > "gateway-mode": shared, "target": {"shard": shard-n-other}}
> > -exclude:
> > - # Not currently supported but needs to be.
> > - - {"ipfamily": {"ip": dualstack}, "target": {"shard":
> > control-plane}}
> > - - {"ipfamily": {"ip": ipv6}, "target": {"shard": control-plane}}
> > +# Valid options are:
> > +# target: ["shard-conformance", "control-plane" ]
> > +# shard-conformance: hybrid-overlay = multicast-enable =
> > emptylb-enable = false
> > +# control-plane: hybrid-overlay = multicast-enable =
> > emptylb-enable = true
> > +# gateway-mode: ["local", "shared"]
> > +# ipfamily: ["ipv4", "ipv6", "dualstack"]
> > +# disable-snat-multiple-gws: ["noSnatGW", "snatGW"]
> > +include:
> > +  - {"target": "shard-conformance", "ha": "HA",   "gateway-mode":
> > "local",  "ipfamily": "ipv6",  "disable-snat-multiple-gws": "snatGW"}
> > +  - {"target": "shard-conformance", "ha": "HA",   "gateway-mode":
> > "local",  "ipfamily": "dualstack", "disable-snat-multiple-gws": "snatGW"}
> > +  - {"target": "shard-conformance", "ha": "HA",   "gateway-mode":
> > "shared", "ipfamily": "ipv4",  "disable-snat-multiple-gws": "snatGW"}
> > +  - {"target": "shard-conformance", "ha": "HA",   "gateway-mode":
> > "shared", "ipfamily": "ipv6",  "disable-snat-multiple-gws": "snatGW"}
> > +  - {"target": "control-plane", "ha": "HA",   "gateway-mode":
> > "shared", "ipfamily": "ipv4",  "disable-snat-multiple-gws": "noSnatGW"}
> > +  - {"target": "control-plane", "ha": "HA",   "gateway-mode":
> > "shared", "ipfamily": "ipv4",  "disable-snat-multiple-gws": "snatGW"}
> >  needs: [build]
> >  env:
> > -  JOB_NAME: "${{ matrix.target.shard }}-${{ matrix.ipfamily.name }}"
> > +  JOB_NAME: "${{ matrix.target }}-${{ matrix.ha }}-${{
> > matrix.gateway-mode }}-${{ matrix.ipfamily }}-${{
> > matrix.disable-snat-multiple-gws }}-${{ matrix.second-bridge }}"
> > +  OVN_HYBRID_OVERLAY_ENABLE: "${{ matrix.target == 'control-plane' }}"
> > +  OVN_MULTICAST_ENABLE:  "${{ matrix.target == 'control-plane' }}"
> > +  OVN_EMPTY_LB_EVENTS: "${{ matrix.target == 'control-plane' }}"
> >OVN_HA: "true"
> > -  KIND_IPV4_SUPPORT: "${{ matrix.ipfamily.ipv4 }}"
> > -  KIND_IPV6_SUPPORT: "${{ matrix.ipfamily.ipv6 }}"
> > -  OVN_HYBRID_OVERLAY_ENABLE: "${{ matrix.target.hybrid-overlay }}"
> > -  OVN_GATEWAY_MODE: "shared"
> > -  OVN_MULTICAST_ENABLE:  "${{ matrix.target.multicast-enable }}"
> > -  OVN_EMPTY_LB_EVENTS: "${{ matrix.target.emptylb-enable }}"
> > +  OVN_DISABLE_SNAT_MU

Re: [ovs-dev] [PATCH ovn v2 1/2] tests: Factor out reset_pcap_file() helper.

2022-07-20 Thread Numan Siddique
On Wed, Jul 20, 2022 at 3:00 AM Ales Musil  wrote:
>
> Hi Dumitru,
>
> nice to see it unified, thank you.
>
> Acked-by: Ales Musil 

Thanks.  I applied this patch to the main branch.

Numan

>
>
> On Tue, Jul 19, 2022 at 10:43 PM Dumitru Ceara  wrote:
>
> > Signed-off-by: Dumitru Ceara 
> > ---
> > V2: Rebase and fix up new calls in test "localnet connectivity with
> > multiple requested-chassis"
> > ---
> >  tests/ovn-macros.at |   20 +++
> >  tests/ovn.at|  364
> > ---
> >  2 files changed, 46 insertions(+), 338 deletions(-)
> >
> > diff --git a/tests/ovn-macros.at b/tests/ovn-macros.at
> > index 427b7b669..0ad6b58c4 100644
> > --- a/tests/ovn-macros.at
> > +++ b/tests/ovn-macros.at
> > @@ -511,9 +511,13 @@ wait_for_ports_up() {
> >  fi
> >  }
> >
> > -# reset_pcap_file iface pcap_file
> > +# reset_iface_pcap_file iface pcap_file
> >  # Resets the pcap file associates with OVS interface.  should be used
> >  # with dummy datapath.
> > +#
> > +# XXX: This should actually replace reset_pcap_file() as they do almost
> > +# exactly the same thing but the "wait while the pcap file has the size
> > +# of the PCAP header" check causes tests to fail.
> >  reset_iface_pcap_file() {
> >  local iface=$1
> >  local pcap_file=$2
> > @@ -528,6 +532,20 @@ options:rxq_pcap=${pcap_file}-rx.pcap
> >  OVS_WAIT_WHILE([test 24 = $(wc -c ${pcap_file}-tx.pcap | cut -d " "
> > -f1)])
> >  }
> >
> > +# reset_pcap_file iface pcap_file
> > +# Resets the pcap file associates with OVS interface.  should be used
> > +# with dummy datapath.
> > +reset_pcap_file() {
> > +local iface=$1
> > +local pcap_file=$2
> > +check rm -f dummy-*.pcap
> > +check ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap
> > \
> > +options:rxq_pcap=dummy-rx.pcap
> > +check rm -f ${pcap_file}*.pcap
> > +check ovs-vsctl -- set Interface $iface
> > options:tx_pcap=${pcap_file}-tx.pcap \
> > +options:rxq_pcap=${pcap_file}-rx.pcap
> > +}
> > +
> >  # Receive a packet on a dummy netdev interface. If we expect packets to be
> >  # recorded, then wait until the pcap file reflects the change.
> >  netdev_dummy_receive() {
> > diff --git a/tests/ovn.at b/tests/ovn.at
> > index 7d10610fd..a3e8a7758 100644
> > --- a/tests/ovn.at
> > +++ b/tests/ovn.at
> > @@ -6487,16 +6487,6 @@ compare_dhcp_packets() {
> >  fi
> >  }
> >
> > -reset_pcap_file() {
> > -local iface=$1
> > -local pcap_file=$2
> > -check ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap
> > \
> > -options:rxq_pcap=dummy-rx.pcap
> > -rm -f ${pcap_file}*.pcap
> > -check ovs-vsctl -- set Interface $iface
> > options:tx_pcap=${pcap_file}-tx.pcap \
> > -options:rxq_pcap=${pcap_file}-rx.pcap
> > -}
> > -
> >  AT_CAPTURE_FILE([sbflows])
> >  ovn-sbctl dump-flows > sbflows
> >
> > @@ -7082,16 +7072,6 @@ test_dhcpv6() {
> >  as hv1 ovs-appctl netdev-dummy/receive hv1-vif$inport $request
> >  }
> >
> > -reset_pcap_file() {
> > -local iface=$1
> > -local pcap_file=$2
> > -ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap \
> > -options:rxq_pcap=dummy-rx.pcap
> > -rm -f ${pcap_file}*.pcap
> > -ovs-vsctl -- set Interface $iface
> > options:tx_pcap=${pcap_file}-tx.pcap \
> > -options:rxq_pcap=${pcap_file}-rx.pcap
> > -}
> > -
> >  AT_CAPTURE_FILE([ofctl_monitor0.log])
> >  as hv1 ovs-ofctl monitor br-int resume --detach --no-chdir \
> >  --pidfile=ovs-ofctl0.pid 2> ofctl_monitor0.log
> > @@ -8789,16 +8769,6 @@ OVS_WAIT_UNTIL([
> >  ])
> >  ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0
> > nat-addresses="f0:00:00:00:00:03 192.168.0.3"
> >
> > -reset_pcap_file() {
> > -local iface=$1
> > -local pcap_file=$2
> > -ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap \
> > -options:rxq_pcap=dummy-rx.pcap
> > -rm -f ${pcap_file}*.pcap
> > -ovs-vsctl -- set Interface $iface
> > options:tx_pcap=${pcap_file}-tx.pcap \
> > -options:rxq_pcap=${pcap_file}-rx.pcap
> > -}
> > -
> >  reset_pcap_file snoopvif hv1/snoopvif
> >  OVS_WAIT_UNTIL([test `wc -c < "hv1/snoopvif-tx.pcap"` -ge 140])
> >
> > @@ -8873,20 +8843,8 @@ OVN_WAIT_PATCH_PORT_FLOWS(["ln_port"], ["hv2"])
> >  # Otherwise a garp might be sent after pcap have been reset but before
> > chassis is removed
> >  AT_CHECK([ovn-nbctl --wait=hv remove logical_router lr0 options chassis])
> >
> > -reset_pcap_file() {
> > -local hv=$1
> > -local iface=$2
> > -local pcap_file=$3
> > -as $hv
> > -ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap \
> > -options:rxq_pcap=dummy-rx.pcap
> > -rm -f ${pcap_file}*.pcap
> > -ovs-vsctl -- set Interface $iface
> > options:tx_pcap=${pcap_file}-tx.pcap \
> > -options:rxq_pcap=${pcap_file}-rx.pcap
> > -}
> > -
> > -reset_pcap_file hv1 snoopvif hv1/snoopvif
> > -reset_pcap_file hv2 snoopvif hv2/snoopvif
> > +as hv1 reset_pcap_file snoopvif hv1/snoopvif
> > +as hv2 reset_pcap_file snoopvif hv2

Re: [ovs-dev] [PATCH v3 3/3] netdev-offload-dpdk: use flow transfer proxy mechanism

2022-07-20 Thread 0-day Robot
Bleep bloop.  Greetings Ivan Malov, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 80 characters long (recommended limit is 79)
#55 FILE: lib/netdev-dpdk.c:1207:
   &dev->flow_transfer_proxy_port_id, NULL);

WARNING: Line is 80 characters long (recommended limit is 79)
#105 FILE: lib/netdev-dpdk.c:3891:
response = xasprintf("Device '%s' can not be detached (flow proxy)",

ERROR: Inappropriate bracing around statement
#112 FILE: lib/netdev-dpdk.c:3898:
if (dev_self != NULL)

WARNING: Line is 80 characters long (recommended limit is 79)
#134 FILE: lib/netdev-dpdk.c:5282:
ret = flow_transfer_proxy ? dev->flow_transfer_proxy_port_id : dev->port_id;

WARNING: Line is 80 characters long (recommended limit is 79)
#225 FILE: lib/netdev-dpdk.c:5469:
ret = rte_flow_tunnel_action_decap_release(dev->flow_transfer_proxy_port_id,

Lines checked: 436, Warnings: 4, Errors: 1


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v2 3/3] netdev-offload-dpdk: use flow transfer proxy mechanism

2022-07-20 Thread 0-day Robot
Bleep bloop.  Greetings Ivan Malov, I am a robot and I have tried out your 
patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Line is 80 characters long (recommended limit is 79)
#55 FILE: lib/netdev-dpdk.c:1207:
   &dev->flow_transfer_proxy_port_id, NULL);

WARNING: Line is 80 characters long (recommended limit is 79)
#105 FILE: lib/netdev-dpdk.c:3891:
response = xasprintf("Device '%s' can not be detached (flow proxy)",

ERROR: Inappropriate bracing around statement
#112 FILE: lib/netdev-dpdk.c:3898:
if (dev_self != NULL)

WARNING: Line is 80 characters long (recommended limit is 79)
#134 FILE: lib/netdev-dpdk.c:5282:
ret = flow_transfer_proxy ? dev->flow_transfer_proxy_port_id : dev->port_id;

WARNING: Line is 80 characters long (recommended limit is 79)
#225 FILE: lib/netdev-dpdk.c:5469:
ret = rte_flow_tunnel_action_decap_release(dev->flow_transfer_proxy_port_id,

Lines checked: 436, Warnings: 4, Errors: 1


Please check this out.  If you feel there has been an error, please email 
acon...@redhat.com

Thanks,
0-day Robot
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2] odp-execute: Avoid unnecessary logging for action implementations.

2022-07-20 Thread Ilya Maximets
There is no need to log if the implementation didn't change.
Scalar one is default, any change will be logged.  And availability
is not really important to log at INFO level.  Moving these logs
to DBG level to avoid littering the log file and confusing users.
We do the same for miniflow_extract and datapath interface
implementations.

Additionally text of the log message made more readable and uniform
with the one used for miniflow_extract.

Fixes: 95e4a35b0a1d ("odp-execute: Add function pointers to odp-execute for 
different action implementations.")
Signed-off-by: Ilya Maximets 
---

Version 2:
  - Dropped the change in test macros, filtering still needed for
the autovalidator build.

 lib/odp-execute-private.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/lib/odp-execute-private.c b/lib/odp-execute-private.c
index bec49206e..f80ae5a23 100644
--- a/lib/odp-execute-private.c
+++ b/lib/odp-execute-private.c
@@ -97,9 +97,10 @@ odp_execute_action_set(const char *name)
 for (int i = 0; i < ACTION_IMPL_MAX; i++) {
 /* String compare, and set ptrs atomically. */
 if (!strcmp(action_impls[i].name, name)) {
-active_action_impl_index = i;
-
-VLOG_INFO("Action implementation set to %s", name);
+if (i != active_action_impl_index) {
+active_action_impl_index = i;
+VLOG_INFO("Action implementation set to %s", name);
+}
 return &action_impls[i];
 }
 }
@@ -142,8 +143,8 @@ odp_execute_action_init(void)
 
 action_impls[i].available = avail;
 
-VLOG_INFO("Action implementation %s (available: %s)",
-  action_impls[i].name, avail ? "Yes" : "No");
+VLOG_DBG("Actions implementation '%s' %s available.",
+ action_impls[i].name, avail ? "is" : "is not");
 
 /* The following is a run-time check to make sure a scalar
  * implementation exists for the given ISA implementation. This is to
-- 
2.34.3

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v3] controller: avoid recomputes triggered by SBDB Port_Binding updates.

2022-07-20 Thread Numan Siddique
On Wed, Jul 13, 2022 at 4:29 AM Dumitru Ceara  wrote:
>
> On 7/13/22 11:08, Dumitru Ceara wrote:
> > On 7/13/22 09:40, Xavier Simonart wrote:
> >> Hi Han, Dumitru
> >>
> >
> > Hi Han, Xavier,
> >
> > Sorry, I had already replied to the previous message and only then
> > noticed this one.
> >
> >> I think that we should, as much as possible, try to achieve both goals:
> >> - have an accurate ovn-installed
> >> - do not increase latency in large scale deployments
> >>
> >
> > +1
> >
> >> The fact that ovn-installed is sent too early for mc flows is already an
> >> issue today, independent of this patch.
> >> Fixing ovn-installed related to mc flows by delaying the state change (for
> >> all cases, included when no mc groups) might be seen as a performance
> >> regression.
> >>
> >
> > I think it will, and I'm not sure we can convince the CMS that this is
> > "just metrics".
> >
> >> I agree that we should fix this ovn-installed issue, but it is not a
> >> regression added by this patch. We should enter a BZ for it.
> >> Per my understanding, the mc flows are updated when the SB_multicast_group
> >> is seen as updated by ovn-controller, due to its references to port 
> >> binding.
> >> Other flows related to port binding are installed earlier, i.e. when
> >> ovn-controller writes port_binding->chassis (i.e. before it receives SB
> >> confirmation). So, while sending the mc flows earlier than what we do today
> >> might be more complex, I think it makes some kind of sense (we would send
> >> all those flows within the same loop).
> >
> > I'm inclining towards leaving it as it is today if this is the only flow
> > we're missing.  It's a guess without testing things out, but I think
> > it's for the MC_FLOOD_L2 multicast group which is used only for
> > forwarding ARP packets originated by OVN routers or destined to a
> > specific OVN router.  Losing some of those packets is not a big deal.
> >
> > But it might be good to confirm that this is the MC group we install the
> > flow for.
> >
>
> Oh, well, the port is also part of the MC_FLOOD group.  This is however
> only used for BUM traffic.  So losing some packets here is also not
> terrible, I think.
>

When a logical port is claimed,  we process the logical flows related
to the logical port (i.e with inport ==  or outport == )
and install
the corresponding openflows. All the generic logical flows (i.e
without inport or outport match) would have already been programmed
(if the datapath already part of local_datapaths).
These processed logical flows (lflow_handle_flows_for_lport() in
lflow.c) will be most likely part of the same openflow bundle. And
once the sequence number
for this bundle is acknowledged we set "ovn-installed=true".  When CMS
notices "ovn-installed=true" , I think it can fairly assume that the
flows for the lport are
programmed.

I think the only flows pertaining to the logical port  which we would
be missing are the multicast related flows and the logical flows which
ovn-northd would generate after the logical port
is claimed (presently it is the arp responder flows) and I don't think
we can wait for these logical flows to be programmed by ovn-northd
before setting "ovn-installed=true".

Delaying setting the ovn-installed=true would definitely result in
latency.  It would not be easy for ovn-controller to keep track of
openflows already programmed for a logical port
In other words,  I don't think ovn-controller can accurately keep
track of all the openflows related to a logical port are programmed or
not  unless all these flows are grouped in one bundle.

Also since the present ovn main already sets ovn-installed=true a bit
early i.e. even before the multicast and arp responder flows are
programmed, I think it is out of this patch's scope to address it.

So I think the patch is fine with me once Xavier addresses (1) i.e
remove the 'chassis_update_required'.

Thanks
Numan


> > Thanks,
> > Dumitru
> >
> >>
> >> Thanks
> >> Xavier
> >>
> >>
> >>
> >> On Wed, Jul 13, 2022 at 8:28 AM Han Zhou  wrote:
> >>
> >>>
> >>>
> >>> On Tue, Jul 12, 2022 at 12:35 AM Dumitru Ceara  wrote:
> 
>  On 7/12/22 08:52, Han Zhou wrote:
> > On Mon, Jul 11, 2022 at 4:55 AM Dumitru Ceara 
> >>> wrote:
> >>
> >> On 7/11/22 13:31, Xavier Simonart wrote:
> >>> Hi Han
> >>>
> >>> Thanks for your review.
> >>>
> >>> Let me try to understand your two main concerns and the proper way to
> > fix
> >>> it.
> >>> 1) We only try once to write pb->chassis. If the commit fails,
> > pb->chassis
> >>> is not written. As commit fails, we will recompute, but as the
> >>> update_required flag is not set anymore, we might end up with no
> >>> pb->chassis.
> >>> => I'll remove the flag and try to update until it's confirmed.
> >
> > Thank you!
> >
> >>> 2) The state machine, and when we move to INSTALL_FLOWS. Serializing
> >>> the
> >>> state machine, by waiting for confirmation to be received befor

[ovs-dev] [PATCH] system-dpdk: Add testpmd clean up in MTU unit tests.

2022-07-20 Thread Michael Phelan
The MTU vport unit tests do not clean up testpmd after use which causes them to 
fail randomly.
This commit amends the MTU vport unit tests to clean up testpmd after running.

Fixes: bf47829116a8feb54fe795aa19915f6e6283af93 ("tests: Add OVS-DPDK MTU unit 
tests.")
Signed-off-by: Michael Phelan 
---
 tests/system-dpdk.at | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
index 3beccda44..15f97097a 100644
--- a/tests/system-dpdk.at
+++ b/tests/system-dpdk.at
@@ -669,6 +669,9 @@ AT_CHECK([ovs-vsctl set Interface dpdkvhostuserclient0 
mtu_request=9000])
 AT_CHECK([ovs-appctl dpctl/show], [], [stdout])
 AT_CHECK([egrep 'mtu=9000' stdout], [], [stdout])
 
+dnl Clean up the testpmd now
+pkill -f -x -9 'tail -f /dev/null'
+
 dnl Clean up
 AT_CHECK([ovs-vsctl del-port br10 dpdkvhostuserclient0], [], [stdout], 
[stderr])
 OVS_VSWITCHD_STOP("m4_join([], [SYSTEM_DPDK_ALLOWED_LOGS], [
@@ -723,6 +726,9 @@ AT_CHECK([ovs-vsctl set Interface dpdkvhostuserclient0 
mtu_request=2000])
 AT_CHECK([ovs-appctl dpctl/show], [], [stdout])
 AT_CHECK([egrep 'mtu=2000' stdout], [], [stdout])
 
+dnl Clean up the testpmd now
+pkill -f -x -9 'tail -f /dev/null'
+
 dnl Clean up
 AT_CHECK([ovs-vsctl del-port br10 dpdkvhostuserclient0], [], [stdout], 
[stderr])
 OVS_VSWITCHD_STOP("m4_join([], [SYSTEM_DPDK_ALLOWED_LOGS], [
@@ -852,6 +858,9 @@ dnl Set MTU value above upper bound and check for error
 AT_CHECK([ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=9711])
 AT_CHECK([grep "dpdkvhostuserclient0: unsupported MTU 9711" ovs-vswitchd.log], 
[], [stdout])
 
+dnl Clean up the testpmd now
+pkill -f -x -9 'tail -f /dev/null'
+
 dnl Clean up
 AT_CHECK([ovs-vsctl del-port br10 dpdkvhostuserclient0], [], [stdout], 
[stderr])
 OVS_VSWITCHD_STOP("m4_join([], [SYSTEM_DPDK_ALLOWED_LOGS], [
@@ -906,6 +915,8 @@ dnl Set MTU value below lower bound and check for error
 AT_CHECK([ovs-vsctl set Interface dpdkvhostuserclient0 mtu_request=67])
 AT_CHECK([grep "dpdkvhostuserclient0: unsupported MTU 67" ovs-vswitchd.log], 
[], [stdout])
 
+dnl Clean up the testpmd now
+pkill -f -x -9 'tail -f /dev/null'
 
 dnl Clean up
 AT_CHECK([ovs-vsctl del-port br10 dpdkvhostuserclient0], [], [stdout], 
[stderr])
-- 
2.25.1

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 0/3] Rework the usage of DPDK transfer flow offloads

2022-07-20 Thread Ilya Maximets
On 7/20/22 14:18, Ivan Malov wrote:
> DPDK has got support for offloads involving assignment of per-packet
> Rx metadata (flag, mark, tunnel ID and the likes). However, delivery
> of such metadata from the NIC to the PMD might need to be negotiated
> in advance. API [1] addresses that problem. Make OvS invoke this API.
> 
> Another problem is how flow rules with attribute "transfer" refer to
> embedded switch ports by means of their representors. Action PORT_ID
> has proved ambiguous: it refers to a DPDK ethdev, but in fact steers
> packets to the entity represented by the ethdev, in example, to a VF.
> The problem is addressed by [2]. Use the solution in OvS accordingly.
> 
> In addition, [2] and [3] address filtering traffic by input ports of
> the embedded switch. In the suggested approach, "transfer" rules are
> managed via the only ethdev with sufficient privileges, whilst match
> criteria include an explicit item to indicate the desired input port.
> Revisit OvS support for "transfer" rules to follow the said approach.
> 
> The following tests have been considered so far:
> - build check with the current dpdk-next-net;
> - running "make check" for every patch;
> - tunnel offload demo with net/sfc PMD.
> 
> [1] http://mails.dpdk.org/archives/dev/2021-October/224291.html
> [2] http://mails.dpdk.org/archives/dev/2021-October/224620.html
> [3] http://mails.dpdk.org/archives/dev/2021-October/225081.html
> ---
>  v1 -> v2: amendments to care about proxy detach and port ID logging
>  v2 -> v3: a minor adjustment in the cover letter's subject line
> 
> Ivan Malov (3):
>   netdev-dpdk: negotiate delivery of per-packet Rx metadata
>   netdev-offload-dpdk: replace action PORT_ID with REPRESENTED_PORT
>   netdev-offload-dpdk: use flow transfer proxy mechanism
> 
>  lib/netdev-dpdk.c | 141 +-
>  lib/netdev-dpdk.h |   4 +-
>  lib/netdev-offload-dpdk.c |  91 +---
>  3 files changed, 209 insertions(+), 27 deletions(-)
> 

Hi.  Since this patch set contains use of DPDK's experimental API,
it needs to be based on dpdk-latest branch and have 'dpdk-latest'
in the subject prefix.  OVS doesn't normally use experimental API.
The tunnel offload support was an exception, because it required
significant re-work of the generic code that wasn't feasible to
carry in the dpdk-latest branch.

CC: Ian, dpdk-latest branch might need some rebase.

Also, please, don't send new versions in reply to previous one.
It's only required for dpdk-dev list for some reason, but prohibited
or frown upon in every other list that I know of.  It's very
inconvenient to work this way.

For the patch set itself, a brief look over current state of DPDK
shows that not all drivers that support PORT_ID support the
REPRESENTED_PORT action as well.  I see at least dpaa2 and cnxk
that doesn't.  So, this patch set essentially breaks offloading
for these NICs, IIUC.  Yet another rte_flow inconsistency.
Any plans for addressing that?

Best regards, Ilya Maximets.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH] odp-execute: Avoid unnecessary logging for action implementations.

2022-07-20 Thread Ilya Maximets
There is no need to log if the implementation didn't change.
Scalar one is default, any change will be logged.  And availability
is not really important to log at INFO level.  Moving these logs
to DBG level to avoid littering the log file and confusing users.
We do the same for miniflow_extract and datapath interface
implementations.

Additionally, text of the log message made more readable and uniform
with the one used for miniflow_extract.

Fixes: 95e4a35b0a1d ("odp-execute: Add function pointers to odp-execute for 
different action implementations.")
Signed-off-by: Ilya Maximets 
---
 lib/odp-execute-private.c | 11 ++-
 tests/ofproto-macros.at   |  1 -
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/lib/odp-execute-private.c b/lib/odp-execute-private.c
index bec49206e..f80ae5a23 100644
--- a/lib/odp-execute-private.c
+++ b/lib/odp-execute-private.c
@@ -97,9 +97,10 @@ odp_execute_action_set(const char *name)
 for (int i = 0; i < ACTION_IMPL_MAX; i++) {
 /* String compare, and set ptrs atomically. */
 if (!strcmp(action_impls[i].name, name)) {
-active_action_impl_index = i;
-
-VLOG_INFO("Action implementation set to %s", name);
+if (i != active_action_impl_index) {
+active_action_impl_index = i;
+VLOG_INFO("Action implementation set to %s", name);
+}
 return &action_impls[i];
 }
 }
@@ -142,8 +143,8 @@ odp_execute_action_init(void)
 
 action_impls[i].available = avail;
 
-VLOG_INFO("Action implementation %s (available: %s)",
-  action_impls[i].name, avail ? "Yes" : "No");
+VLOG_DBG("Actions implementation '%s' %s available.",
+ action_impls[i].name, avail ? "is" : "is not");
 
 /* The following is a run-time check to make sure a scalar
  * implementation exists for the given ISA implementation. This is to
diff --git a/tests/ofproto-macros.at b/tests/ofproto-macros.at
index 676d55aa9..84f07c108 100644
--- a/tests/ofproto-macros.at
+++ b/tests/ofproto-macros.at
@@ -182,7 +182,6 @@ m4_define([_OVS_VSWITCHD_START],
on_exit "kill_ovs_vswitchd `cat ovs-vswitchd.pid`"
AT_CHECK([[sed < stderr '
 /ovs_numa|INFO|Discovered /d
-/odp_execute_impl|INFO|Action implementation /d
 /vlog|INFO|opened log file/d
 /vswitchd|INFO|ovs-vswitchd (Open vSwitch)/d
 /reconnect|INFO|/d
-- 
2.34.3

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH v3 3/3] netdev-offload-dpdk: use flow transfer proxy mechanism

2022-07-20 Thread Ivan Malov

Hi Eli,

Thanks a lot for a prompt review. PSB.

On Wed, 20 Jul 2022, Eli Britstein wrote:





-Original Message-
From: Ivan Malov 
Sent: Wednesday, July 20, 2022 3:18 PM
To: d...@openvswitch.org
Cc: Eli Britstein ; Stephen Hemminger
; Ilya Maximets ; Ori
Kam ; Maxime Coquelin
; David Marchand
; Andrew Rybchenko

Subject: [PATCH v3 3/3] netdev-offload-dpdk: use flow transfer proxy
mechanism

External email: Use caution opening links or attachments


Manage "transfer" flows via the corresponding mechanism.
Doing so requires that the traffic source be specified explicitly, via the
corresponding pattern item.

Signed-off-by: Ivan Malov 
Acked-by: Andrew Rybchenko 
---
lib/netdev-dpdk.c | 99 ---
lib/netdev-dpdk.h |  4 +-
lib/netdev-offload-dpdk.c | 61 
3 files changed, 135 insertions(+), 29 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
45e5d26d2..01fb40255 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -420,6 +420,7 @@ enum dpdk_hw_ol_features {

struct netdev_dpdk {
PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
+dpdk_port_t flow_transfer_proxy_port_id;
dpdk_port_t port_id;

/* If true, device was attached by rte_eth_dev_attach(). */ @@ -1130,8
+1131,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
uint32_t rx_chksm_offload_capa = RTE_ETH_RX_OFFLOAD_UDP_CKSUM |
 RTE_ETH_RX_OFFLOAD_TCP_CKSUM |
 RTE_ETH_RX_OFFLOAD_IPV4_CKSUM;
-
#ifdef ALLOW_EXPERIMENTAL_API

There are all over the patches this ifdef, even in cases it's harmless to do 
without. It makes the code less readable and might cause future cherry-picking 
issues. Try to minimize it only to places of a must.


I see your point. But I'm a rookie here, in OvS, and from the existing
code it's not clear which kinds of places would be harmless to stay
without if-defs and which would not. May I kindly request that
you point out at least one example in my patch = a place where
it would be safe to remove the checks. - ?


+int ret;
+
/*
 * Full tunnel offload requires that tunnel ID metadata be
 * delivered with "miss" packets from the hardware to the @@ -1141,6
+1143,27 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
 * Request delivery of such metadata.
 */
dpdk_eth_dev_init_rx_metadata(dev);
+
+/*
+ * Managing "transfer" flows requires that the user communicate them
+ * via a port which has the privilege to control the embedded switch.
+ * For some vendors, all ports in a given switching domain have
+ * this privilege. For other vendors, it's only one port.
+ *
+ * Get the proxy port ID and remember it for later use.
+ */
+ret = rte_flow_pick_transfer_proxy(dev->port_id,
+   &dev->flow_transfer_proxy_port_id, 
NULL);
+if (ret != 0) {
+/*
+ * The PMD does not indicate the proxy port.
+ * Assume the proxy is unneeded.
+ */
+dev->flow_transfer_proxy_port_id = dev->port_id;
+}
+#else /* ! ALLOW_EXPERIMENTAL_API */
+/* No API to get transfer proxy; assume the proxy is unneeded. */
+dev->flow_transfer_proxy_port_id = dev->port_id;
#endif /* ALLOW_EXPERIMENTAL_API */

rte_eth_dev_info_get(dev->port_id, &info); @@ -3762,8 +3785,12 @@
netdev_dpdk_detach(struct unixctl_conn *conn, int argc OVS_UNUSED,
   const char *argv[], void *aux OVS_UNUSED)  {
struct ds used_interfaces = DS_EMPTY_INITIALIZER;
+#ifdef ALLOW_EXPERIMENTAL_API
+struct netdev_dpdk *dev_self = NULL; #endif /*
+ALLOW_EXPERIMENTAL_API */
struct rte_eth_dev_info dev_info;
dpdk_port_t sibling_port_id;
+struct netdev_dpdk *dev;
dpdk_port_t port_id;
bool used = false;
char *response;
@@ -3781,8 +3808,6 @@ netdev_dpdk_detach(struct unixctl_conn *conn, int
argc OVS_UNUSED,
  argv[1]);

RTE_ETH_FOREACH_DEV_SIBLING (sibling_port_id, port_id) {
-struct netdev_dpdk *dev;
-
LIST_FOR_EACH (dev, list_node, &dpdk_list) {
if (dev->port_id != sibling_port_id) {
continue;
@@ -3802,6 +3827,27 @@ netdev_dpdk_detach(struct unixctl_conn *conn,
int argc OVS_UNUSED,
}
ds_destroy(&used_interfaces);

+#ifdef ALLOW_EXPERIMENTAL_API
+/*
+ * The device being detached may happen to be a flow proxy port
+ * for another device (still attached). If so, do not allow to
+ * detach. Devices dependent on this one must be detached first.

I don't think this is acceptable to deny the port from being detached, or to 
enforce such ordering. For example, ports are being detached upon shutdown, 
with unknown order.


I see your point. However, as far as I can understand, this specific
function is not the one that unplugs devices upon the "global" OvS
shutdown. It might be the one which is invoked when the user asks
to unplu

Re: [ovs-dev] [PATCH v3 3/3] netdev-offload-dpdk: use flow transfer proxy mechanism

2022-07-20 Thread Eli Britstein via dev



>-Original Message-
>From: Ivan Malov 
>Sent: Wednesday, July 20, 2022 3:18 PM
>To: d...@openvswitch.org
>Cc: Eli Britstein ; Stephen Hemminger
>; Ilya Maximets ; Ori
>Kam ; Maxime Coquelin
>; David Marchand
>; Andrew Rybchenko
>
>Subject: [PATCH v3 3/3] netdev-offload-dpdk: use flow transfer proxy
>mechanism
>
>External email: Use caution opening links or attachments
>
>
>Manage "transfer" flows via the corresponding mechanism.
>Doing so requires that the traffic source be specified explicitly, via the
>corresponding pattern item.
>
>Signed-off-by: Ivan Malov 
>Acked-by: Andrew Rybchenko 
>---
> lib/netdev-dpdk.c | 99 ---
> lib/netdev-dpdk.h |  4 +-
> lib/netdev-offload-dpdk.c | 61 
> 3 files changed, 135 insertions(+), 29 deletions(-)
>
>diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c index
>45e5d26d2..01fb40255 100644
>--- a/lib/netdev-dpdk.c
>+++ b/lib/netdev-dpdk.c
>@@ -420,6 +420,7 @@ enum dpdk_hw_ol_features {
>
> struct netdev_dpdk {
> PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>+dpdk_port_t flow_transfer_proxy_port_id;
> dpdk_port_t port_id;
>
> /* If true, device was attached by rte_eth_dev_attach(). */ @@ -1130,8
>+1131,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> uint32_t rx_chksm_offload_capa = RTE_ETH_RX_OFFLOAD_UDP_CKSUM |
>  RTE_ETH_RX_OFFLOAD_TCP_CKSUM |
>  RTE_ETH_RX_OFFLOAD_IPV4_CKSUM;
>-
> #ifdef ALLOW_EXPERIMENTAL_API
There are all over the patches this ifdef, even in cases it's harmless to do 
without. It makes the code less readable and might cause future cherry-picking 
issues. Try to minimize it only to places of a must.
>+int ret;
>+
> /*
>  * Full tunnel offload requires that tunnel ID metadata be
>  * delivered with "miss" packets from the hardware to the @@ -1141,6
>+1143,27 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>  * Request delivery of such metadata.
>  */
> dpdk_eth_dev_init_rx_metadata(dev);
>+
>+/*
>+ * Managing "transfer" flows requires that the user communicate them
>+ * via a port which has the privilege to control the embedded switch.
>+ * For some vendors, all ports in a given switching domain have
>+ * this privilege. For other vendors, it's only one port.
>+ *
>+ * Get the proxy port ID and remember it for later use.
>+ */
>+ret = rte_flow_pick_transfer_proxy(dev->port_id,
>+   &dev->flow_transfer_proxy_port_id, 
>NULL);
>+if (ret != 0) {
>+/*
>+ * The PMD does not indicate the proxy port.
>+ * Assume the proxy is unneeded.
>+ */
>+dev->flow_transfer_proxy_port_id = dev->port_id;
>+}
>+#else /* ! ALLOW_EXPERIMENTAL_API */
>+/* No API to get transfer proxy; assume the proxy is unneeded. */
>+dev->flow_transfer_proxy_port_id = dev->port_id;
> #endif /* ALLOW_EXPERIMENTAL_API */
>
> rte_eth_dev_info_get(dev->port_id, &info); @@ -3762,8 +3785,12 @@
>netdev_dpdk_detach(struct unixctl_conn *conn, int argc OVS_UNUSED,
>const char *argv[], void *aux OVS_UNUSED)  {
> struct ds used_interfaces = DS_EMPTY_INITIALIZER;
>+#ifdef ALLOW_EXPERIMENTAL_API
>+struct netdev_dpdk *dev_self = NULL; #endif /*
>+ALLOW_EXPERIMENTAL_API */
> struct rte_eth_dev_info dev_info;
> dpdk_port_t sibling_port_id;
>+struct netdev_dpdk *dev;
> dpdk_port_t port_id;
> bool used = false;
> char *response;
>@@ -3781,8 +3808,6 @@ netdev_dpdk_detach(struct unixctl_conn *conn, int
>argc OVS_UNUSED,
>   argv[1]);
>
> RTE_ETH_FOREACH_DEV_SIBLING (sibling_port_id, port_id) {
>-struct netdev_dpdk *dev;
>-
> LIST_FOR_EACH (dev, list_node, &dpdk_list) {
> if (dev->port_id != sibling_port_id) {
> continue;
>@@ -3802,6 +3827,27 @@ netdev_dpdk_detach(struct unixctl_conn *conn,
>int argc OVS_UNUSED,
> }
> ds_destroy(&used_interfaces);
>
>+#ifdef ALLOW_EXPERIMENTAL_API
>+/*
>+ * The device being detached may happen to be a flow proxy port
>+ * for another device (still attached). If so, do not allow to
>+ * detach. Devices dependent on this one must be detached first.
I don't think this is acceptable to deny the port from being detached, or to 
enforce such ordering. For example, ports are being detached upon shutdown, 
with unknown order.
Suppose A is the proxy port for ports B,C. When port A is going to be detached, 
flush the offloads of ports B and C (in addition to flushing the offloads of 
port A). Then, it is safe to detach it.
>+ */
>+LIST_FOR_EACH (dev, list_node, &dpdk_list) {
>+if (dev->port_id == port_id) {
>+dev_self = dev;
>+} else if (dev->flow_transfer_proxy_port_id == port_id) {
>+response = xasprintf("Device '%s' can not be detached (flow 

Re: [ovs-dev] [PATCH 3/3] netdev-offload-dpdk: use flow transfer proxy mechanism

2022-07-20 Thread Ivan Malov

Hi Eli,

Thank you for your review efforts. Please find the new version:

http://patchwork.ozlabs.org/project/openvswitch/patch/20220720121823.2497727-4-ivan.ma...@oktetlabs.ru/

It cares about the detach scenario we've been talking about.
And it cares about port ID logging for transfer flows, too.

On Wed, 8 Jun 2022, Eli Britstein wrote:


Hi Ivan,


-Original Message-
From: Ivan Malov 
Sent: Wednesday, June 8, 2022 10:02 PM
To: Eli Britstein 
Cc: d...@openvswitch.org; Andrew Rybchenko
; Ilya Maximets ;
Ori Kam ; NBU-Contact-Thomas Monjalon (EXTERNAL)
; Stephen Hemminger
; David Marchand
; Gaetan Rivet ; Maxime
Coquelin 
Subject: RE: [PATCH 3/3] netdev-offload-dpdk: use flow transfer proxy
mechanism

External email: Use caution opening links or attachments


Hi Eli,

On Wed, 8 Jun 2022, Eli Britstein wrote:


Hi Ivan,


-Original Message-
From: Ivan Malov 
Sent: Wednesday, June 8, 2022 5:46 PM
To: Eli Britstein 
Cc: d...@openvswitch.org; Andrew Rybchenko
; Ilya Maximets ;
Ori Kam ; NBU-Contact-Thomas Monjalon (EXTERNAL)
; Stephen Hemminger
; David Marchand
; Gaetan Rivet ; Maxime
Coquelin 
Subject: RE: [PATCH 3/3] netdev-offload-dpdk: use flow transfer proxy
mechanism

External email: Use caution opening links or attachments


Hi Eli,

On Wed, 8 Jun 2022, Eli Britstein wrote:


Hi Ivan,


-Original Message-
From: Ivan Malov 
Sent: Tuesday, June 7, 2022 11:56 PM
To: Eli Britstein 
Cc: d...@openvswitch.org; Andrew Rybchenko
; Ilya Maximets
; Ori Kam ;
NBU-Contact-Thomas Monjalon (EXTERNAL) ;
Stephen Hemminger ; David Marchand
; Gaetan Rivet ;

Maxime

Coquelin 
Subject: RE: [PATCH 3/3] netdev-offload-dpdk: use flow transfer
proxy mechanism

External email: Use caution opening links or attachments


Hi Eli,

On Wed, 1 Jun 2022, Eli Britstein wrote:


- Missing proper handling of the testpmd syntax logging. It
changes the used

port according to "transfer", but the log still uses

netdev_dpdk_get_port_id().


Thanks for noticing. I will see to it in the next version.


- The usage of the "proxy" port for rte-flow implies that this
proxy port is

attached to OVS, otherwise it is not "started" and creation of
flows will

fail.


That's the way it is. If there is no proxy for a given port, then
the original port value will be used for managing flows. For
vendors that don't need the proxy, this will work. For others, it won't.

That's OK.



I don't really understand why this can't be done inside dpdk domain
(if there

is a proxy, and it is up, use it, otherwise don't).

That's *currently* the way it is. I understand that if dpdk works
like this OVS

should align, but maybe you or someone else here knows why dpdk works
like this? (not too late to change, this is experimental...).


Regardless of DPDK, on some NICs, it is possible to insert rules via
unprivileged PFs or VFs, but there are also NICs which cannot do it.

In DPDK, this contradiction has to be resolved somehow.
In example, for NICs that can only manage flows via privileged ports,
two possible solutions exist:

1. Route flow management requests from unprivileged ethdevs
   to the privileged one implicitly, inside the PMD. This
   is transparent to users, but, at the same time, it is
   tricky because the application does not realise that
   flows it manages via an ethdev "B" are in fact
   communicated to the NIC via an ethdev "A".

   Unbeknownst of the implicit scheme, the application may
   detach the privileged ethdev "A" in-between. And, when
   time comes to remove flows, doing so via ethdev "B"
   will fail. This scheme breaks in-app housekeeping.

2. Expose the "proxy" port existence to the application.
   If it knows the truth about the real ethdev that
   handles the transfer flows, it won't attempt to
   detach it in-between. The housekeeping is fine.

Outing the existence of the "proxy" port to users seems like the most
reasonable approach. This is why it was implemented in DPDK like this.
Currently, it's indeed an experimental feature. DPDK PMDs which need
it, are supposed to switch to it during the transition phase.

Thanks very much for the explanation, though IMHO relevant PMDs could

still hide it and not do this "outing" of their internals.


Sort of yes, they could hide it. But that would mean doing additional record-
keeping internally, in order to return EBUSY when the app asks to detach the
privileged port which still has active flows on it that have been originally
requested via an unprivileged port. Might be quite error prone. Also, given the
fact that quite a few vendors might need this, isn't it better to make the
feature generic?

Discussing such scenario, this patch does not handle it. Suppose A is the 
privileged port, served as a proxy for port B.
Now, there are flows applied on port B (but actually on A). Nothing prevents 
OVS to detach port A. The flows applied on port B remain ghosted in OVS. When 
they are being removed, the behavior is unknown, and can also crash.
Am I wrong?





However, I 

[ovs-dev] [PATCH v3 3/3] netdev-offload-dpdk: use flow transfer proxy mechanism

2022-07-20 Thread Ivan Malov
Manage "transfer" flows via the corresponding mechanism.
Doing so requires that the traffic source be specified
explicitly, via the corresponding pattern item.

Signed-off-by: Ivan Malov 
Acked-by: Andrew Rybchenko 
---
 lib/netdev-dpdk.c | 99 ---
 lib/netdev-dpdk.h |  4 +-
 lib/netdev-offload-dpdk.c | 61 
 3 files changed, 135 insertions(+), 29 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 45e5d26d2..01fb40255 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -420,6 +420,7 @@ enum dpdk_hw_ol_features {
 
 struct netdev_dpdk {
 PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
+dpdk_port_t flow_transfer_proxy_port_id;
 dpdk_port_t port_id;
 
 /* If true, device was attached by rte_eth_dev_attach(). */
@@ -1130,8 +1131,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
 uint32_t rx_chksm_offload_capa = RTE_ETH_RX_OFFLOAD_UDP_CKSUM |
  RTE_ETH_RX_OFFLOAD_TCP_CKSUM |
  RTE_ETH_RX_OFFLOAD_IPV4_CKSUM;
-
 #ifdef ALLOW_EXPERIMENTAL_API
+int ret;
+
 /*
  * Full tunnel offload requires that tunnel ID metadata be
  * delivered with "miss" packets from the hardware to the
@@ -1141,6 +1143,27 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
  * Request delivery of such metadata.
  */
 dpdk_eth_dev_init_rx_metadata(dev);
+
+/*
+ * Managing "transfer" flows requires that the user communicate them
+ * via a port which has the privilege to control the embedded switch.
+ * For some vendors, all ports in a given switching domain have
+ * this privilege. For other vendors, it's only one port.
+ *
+ * Get the proxy port ID and remember it for later use.
+ */
+ret = rte_flow_pick_transfer_proxy(dev->port_id,
+   &dev->flow_transfer_proxy_port_id, 
NULL);
+if (ret != 0) {
+/*
+ * The PMD does not indicate the proxy port.
+ * Assume the proxy is unneeded.
+ */
+dev->flow_transfer_proxy_port_id = dev->port_id;
+}
+#else /* ! ALLOW_EXPERIMENTAL_API */
+/* No API to get transfer proxy; assume the proxy is unneeded. */
+dev->flow_transfer_proxy_port_id = dev->port_id;
 #endif /* ALLOW_EXPERIMENTAL_API */
 
 rte_eth_dev_info_get(dev->port_id, &info);
@@ -3762,8 +3785,12 @@ netdev_dpdk_detach(struct unixctl_conn *conn, int argc 
OVS_UNUSED,
const char *argv[], void *aux OVS_UNUSED)
 {
 struct ds used_interfaces = DS_EMPTY_INITIALIZER;
+#ifdef ALLOW_EXPERIMENTAL_API
+struct netdev_dpdk *dev_self = NULL;
+#endif /* ALLOW_EXPERIMENTAL_API */
 struct rte_eth_dev_info dev_info;
 dpdk_port_t sibling_port_id;
+struct netdev_dpdk *dev;
 dpdk_port_t port_id;
 bool used = false;
 char *response;
@@ -3781,8 +3808,6 @@ netdev_dpdk_detach(struct unixctl_conn *conn, int argc 
OVS_UNUSED,
   argv[1]);
 
 RTE_ETH_FOREACH_DEV_SIBLING (sibling_port_id, port_id) {
-struct netdev_dpdk *dev;
-
 LIST_FOR_EACH (dev, list_node, &dpdk_list) {
 if (dev->port_id != sibling_port_id) {
 continue;
@@ -3802,6 +3827,27 @@ netdev_dpdk_detach(struct unixctl_conn *conn, int argc 
OVS_UNUSED,
 }
 ds_destroy(&used_interfaces);
 
+#ifdef ALLOW_EXPERIMENTAL_API
+/*
+ * The device being detached may happen to be a flow proxy port
+ * for another device (still attached). If so, do not allow to
+ * detach. Devices dependent on this one must be detached first.
+ */
+LIST_FOR_EACH (dev, list_node, &dpdk_list) {
+if (dev->port_id == port_id) {
+dev_self = dev;
+} else if (dev->flow_transfer_proxy_port_id == port_id) {
+response = xasprintf("Device '%s' can not be detached (flow 
proxy)",
+ argv[1]);
+goto error;
+}
+}
+
+/* Indicate that the device being detached no longer needs a flow proxy. */
+if (dev_self != NULL)
+dev_self->flow_transfer_proxy_port_id = port_id;
+#endif /* ALLOW_EXPERIMENTAL_API */
+
 rte_eth_dev_info_get(port_id, &dev_info);
 rte_eth_dev_close(port_id);
 if (rte_dev_remove(dev_info.device) < 0) {
@@ -5167,7 +5213,8 @@ unlock:
 }
 
 int
-netdev_dpdk_get_port_id(struct netdev *netdev)
+netdev_dpdk_get_port_id(struct netdev *netdev,
+bool flow_transfer_proxy)
 {
 struct netdev_dpdk *dev;
 int ret = -1;
@@ -5178,7 +5225,7 @@ netdev_dpdk_get_port_id(struct netdev *netdev)
 
 dev = netdev_dpdk_cast(netdev);
 ovs_mutex_lock(&dev->mutex);
-ret = dev->port_id;
+ret = flow_transfer_proxy ? dev->flow_transfer_proxy_port_id : 
dev->port_id;
 ovs_mutex_unlock(&dev->mutex);
 out:
 return ret;
@@ -5214,13 +5261,15 @@ out:
 
 int
 netdev_dpdk_rte_flow_destroy(struct netdev *netde

[ovs-dev] [PATCH v3 2/3] netdev-offload-dpdk: replace action PORT_ID with REPRESENTED_PORT

2022-07-20 Thread Ivan Malov
Action PORT_ID has been deprecated. Use REPRESENTED_PORT instead.

Signed-off-by: Ivan Malov 
Acked-by: Andrew Rybchenko 
---
 lib/netdev-offload-dpdk.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index 12d299603..9cd5a0159 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -733,6 +733,14 @@ dump_flow_action(struct ds *s, struct ds *s_extra,
 ds_put_cstr(s, "rss / ");
 } else if (actions->type == RTE_FLOW_ACTION_TYPE_COUNT) {
 ds_put_cstr(s, "count / ");
+#ifdef ALLOW_EXPERIMENTAL_API
+} else if (actions->type == RTE_FLOW_ACTION_TYPE_REPRESENTED_PORT) {
+const struct rte_flow_action_ethdev *ethdev = actions->conf;
+
+ds_put_cstr(s, "represented_port ");
+if (ethdev) {
+ds_put_format(s, "ethdev_port_id %d ", ethdev->port_id);
+#else /* ! ALLOW_EXPERIMENTAL_API */
 } else if (actions->type == RTE_FLOW_ACTION_TYPE_PORT_ID) {
 const struct rte_flow_action_port_id *port_id = actions->conf;
 
@@ -740,6 +748,7 @@ dump_flow_action(struct ds *s, struct ds *s_extra,
 if (port_id) {
 ds_put_format(s, "original %d id %d ",
   port_id->original, port_id->id);
+#endif /* ALLOW_EXPERIMENTAL_API */
 }
 ds_put_cstr(s, "/ ");
 } else if (actions->type == RTE_FLOW_ACTION_TYPE_DROP) {
@@ -1767,6 +1776,24 @@ add_count_action(struct flow_actions *actions)
 add_flow_action(actions, RTE_FLOW_ACTION_TYPE_COUNT, count);
 }
 
+#ifdef ALLOW_EXPERIMENTAL_API
+static int
+add_represented_port_action(struct flow_actions *actions,
+struct netdev *outdev)
+{
+struct rte_flow_action_ethdev *ethdev;
+int outdev_id;
+
+outdev_id = netdev_dpdk_get_port_id(outdev);
+if (outdev_id < 0) {
+return -1;
+}
+ethdev = xzalloc(sizeof *ethdev);
+ethdev->port_id = outdev_id;
+add_flow_action(actions, RTE_FLOW_ACTION_TYPE_REPRESENTED_PORT, ethdev);
+return 0;
+}
+#else /* ! ALLOW_EXPERIMENTAL_API */
 static int
 add_port_id_action(struct flow_actions *actions,
struct netdev *outdev)
@@ -1783,6 +1810,7 @@ add_port_id_action(struct flow_actions *actions,
 add_flow_action(actions, RTE_FLOW_ACTION_TYPE_PORT_ID, port_id);
 return 0;
 }
+#endif /* ALLOW_EXPERIMENTAL_API */
 
 static int
 add_output_action(struct netdev *netdev,
@@ -1800,7 +1828,11 @@ add_output_action(struct netdev *netdev,
 return -1;
 }
 if (!netdev_flow_api_equals(netdev, outdev) ||
+#ifdef ALLOW_EXPERIMENTAL_API
+add_represented_port_action(actions, outdev)) {
+#else /* ! ALLOW_EXPERIMENTAL_API */
 add_port_id_action(actions, outdev)) {
+#endif /* ALLOW_EXPERIMENTAL_API */
 VLOG_DBG_RL(&rl, "%s: Output to port \'%s\' cannot be offloaded.",
 netdev_get_name(netdev), netdev_get_name(outdev));
 ret = -1;
-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3 1/3] netdev-dpdk: negotiate delivery of per-packet Rx metadata

2022-07-20 Thread Ivan Malov
This may be required by some PMDs in offload scenarios.

Signed-off-by: Ivan Malov 
Acked-by: Andrew Rybchenko 
---
 lib/netdev-dpdk.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index f9535bfb4..45e5d26d2 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -1085,6 +1085,38 @@ dpdk_eth_flow_ctrl_setup(struct netdev_dpdk *dev) 
OVS_REQUIRES(dev->mutex)
 }
 }
 
+#ifdef ALLOW_EXPERIMENTAL_API
+static void
+dpdk_eth_dev_init_rx_metadata(struct netdev_dpdk *dev)
+{
+uint64_t rx_metadata = 0;
+int ret;
+
+/* For the fallback offload (non-"transfer" rules) */
+rx_metadata |= RTE_ETH_RX_METADATA_USER_MARK;
+/* For the full offload ("transfer" rules) */
+rx_metadata |= RTE_ETH_RX_METADATA_TUNNEL_ID;
+
+ret = rte_eth_rx_metadata_negotiate(dev->port_id, &rx_metadata);
+if (ret == 0) {
+if (!(rx_metadata & RTE_ETH_RX_METADATA_USER_MARK)) {
+VLOG_DBG("The NIC will not provide per-packet USER_MARK on port "
+ DPDK_PORT_ID_FMT, dev->port_id);
+}
+if (!(rx_metadata & RTE_ETH_RX_METADATA_TUNNEL_ID)) {
+VLOG_DBG("The NIC will not provide per-packet TUNNEL_ID on port "
+ DPDK_PORT_ID_FMT, dev->port_id);
+}
+} else if (ret == -ENOTSUP) {
+VLOG_DBG("Rx metadata negotiate procedure is not supported on port "
+ DPDK_PORT_ID_FMT, dev->port_id);
+} else {
+VLOG_WARN("Cannot negotiate Rx metadata on port "
+  DPDK_PORT_ID_FMT, dev->port_id);
+}
+}
+#endif /* ALLOW_EXPERIMENTAL_API */
+
 static int
 dpdk_eth_dev_init(struct netdev_dpdk *dev)
 OVS_REQUIRES(dev->mutex)
@@ -1099,6 +1131,18 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
  RTE_ETH_RX_OFFLOAD_TCP_CKSUM |
  RTE_ETH_RX_OFFLOAD_IPV4_CKSUM;
 
+#ifdef ALLOW_EXPERIMENTAL_API
+/*
+ * Full tunnel offload requires that tunnel ID metadata be
+ * delivered with "miss" packets from the hardware to the
+ * PMD. The same goes for megaflow mark metadata which is
+ * used in MARK + RSS offload scenario.
+ *
+ * Request delivery of such metadata.
+ */
+dpdk_eth_dev_init_rx_metadata(dev);
+#endif /* ALLOW_EXPERIMENTAL_API */
+
 rte_eth_dev_info_get(dev->port_id, &info);
 
 if (strstr(info.driver_name, "vf") != NULL) {
-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v3 0/3] Rework the usage of DPDK transfer flow offloads

2022-07-20 Thread Ivan Malov
DPDK has got support for offloads involving assignment of per-packet
Rx metadata (flag, mark, tunnel ID and the likes). However, delivery
of such metadata from the NIC to the PMD might need to be negotiated
in advance. API [1] addresses that problem. Make OvS invoke this API.

Another problem is how flow rules with attribute "transfer" refer to
embedded switch ports by means of their representors. Action PORT_ID
has proved ambiguous: it refers to a DPDK ethdev, but in fact steers
packets to the entity represented by the ethdev, in example, to a VF.
The problem is addressed by [2]. Use the solution in OvS accordingly.

In addition, [2] and [3] address filtering traffic by input ports of
the embedded switch. In the suggested approach, "transfer" rules are
managed via the only ethdev with sufficient privileges, whilst match
criteria include an explicit item to indicate the desired input port.
Revisit OvS support for "transfer" rules to follow the said approach.

The following tests have been considered so far:
- build check with the current dpdk-next-net;
- running "make check" for every patch;
- tunnel offload demo with net/sfc PMD.

[1] http://mails.dpdk.org/archives/dev/2021-October/224291.html
[2] http://mails.dpdk.org/archives/dev/2021-October/224620.html
[3] http://mails.dpdk.org/archives/dev/2021-October/225081.html
---
 v1 -> v2: amendments to care about proxy detach and port ID logging
 v2 -> v3: a minor adjustment in the cover letter's subject line

Ivan Malov (3):
  netdev-dpdk: negotiate delivery of per-packet Rx metadata
  netdev-offload-dpdk: replace action PORT_ID with REPRESENTED_PORT
  netdev-offload-dpdk: use flow transfer proxy mechanism

 lib/netdev-dpdk.c | 141 +-
 lib/netdev-dpdk.h |   4 +-
 lib/netdev-offload-dpdk.c |  91 +---
 3 files changed, 209 insertions(+), 27 deletions(-)

-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 3/3] netdev-offload-dpdk: use flow transfer proxy mechanism

2022-07-20 Thread Ivan Malov
Manage "transfer" flows via the corresponding mechanism.
Doing so requires that the traffic source be specified
explicitly, via the corresponding pattern item.

Signed-off-by: Ivan Malov 
Acked-by: Andrew Rybchenko 
---
 lib/netdev-dpdk.c | 99 ---
 lib/netdev-dpdk.h |  4 +-
 lib/netdev-offload-dpdk.c | 61 
 3 files changed, 135 insertions(+), 29 deletions(-)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 45e5d26d2..01fb40255 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -420,6 +420,7 @@ enum dpdk_hw_ol_features {
 
 struct netdev_dpdk {
 PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
+dpdk_port_t flow_transfer_proxy_port_id;
 dpdk_port_t port_id;
 
 /* If true, device was attached by rte_eth_dev_attach(). */
@@ -1130,8 +1131,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
 uint32_t rx_chksm_offload_capa = RTE_ETH_RX_OFFLOAD_UDP_CKSUM |
  RTE_ETH_RX_OFFLOAD_TCP_CKSUM |
  RTE_ETH_RX_OFFLOAD_IPV4_CKSUM;
-
 #ifdef ALLOW_EXPERIMENTAL_API
+int ret;
+
 /*
  * Full tunnel offload requires that tunnel ID metadata be
  * delivered with "miss" packets from the hardware to the
@@ -1141,6 +1143,27 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
  * Request delivery of such metadata.
  */
 dpdk_eth_dev_init_rx_metadata(dev);
+
+/*
+ * Managing "transfer" flows requires that the user communicate them
+ * via a port which has the privilege to control the embedded switch.
+ * For some vendors, all ports in a given switching domain have
+ * this privilege. For other vendors, it's only one port.
+ *
+ * Get the proxy port ID and remember it for later use.
+ */
+ret = rte_flow_pick_transfer_proxy(dev->port_id,
+   &dev->flow_transfer_proxy_port_id, 
NULL);
+if (ret != 0) {
+/*
+ * The PMD does not indicate the proxy port.
+ * Assume the proxy is unneeded.
+ */
+dev->flow_transfer_proxy_port_id = dev->port_id;
+}
+#else /* ! ALLOW_EXPERIMENTAL_API */
+/* No API to get transfer proxy; assume the proxy is unneeded. */
+dev->flow_transfer_proxy_port_id = dev->port_id;
 #endif /* ALLOW_EXPERIMENTAL_API */
 
 rte_eth_dev_info_get(dev->port_id, &info);
@@ -3762,8 +3785,12 @@ netdev_dpdk_detach(struct unixctl_conn *conn, int argc 
OVS_UNUSED,
const char *argv[], void *aux OVS_UNUSED)
 {
 struct ds used_interfaces = DS_EMPTY_INITIALIZER;
+#ifdef ALLOW_EXPERIMENTAL_API
+struct netdev_dpdk *dev_self = NULL;
+#endif /* ALLOW_EXPERIMENTAL_API */
 struct rte_eth_dev_info dev_info;
 dpdk_port_t sibling_port_id;
+struct netdev_dpdk *dev;
 dpdk_port_t port_id;
 bool used = false;
 char *response;
@@ -3781,8 +3808,6 @@ netdev_dpdk_detach(struct unixctl_conn *conn, int argc 
OVS_UNUSED,
   argv[1]);
 
 RTE_ETH_FOREACH_DEV_SIBLING (sibling_port_id, port_id) {
-struct netdev_dpdk *dev;
-
 LIST_FOR_EACH (dev, list_node, &dpdk_list) {
 if (dev->port_id != sibling_port_id) {
 continue;
@@ -3802,6 +3827,27 @@ netdev_dpdk_detach(struct unixctl_conn *conn, int argc 
OVS_UNUSED,
 }
 ds_destroy(&used_interfaces);
 
+#ifdef ALLOW_EXPERIMENTAL_API
+/*
+ * The device being detached may happen to be a flow proxy port
+ * for another device (still attached). If so, do not allow to
+ * detach. Devices dependent on this one must be detached first.
+ */
+LIST_FOR_EACH (dev, list_node, &dpdk_list) {
+if (dev->port_id == port_id) {
+dev_self = dev;
+} else if (dev->flow_transfer_proxy_port_id == port_id) {
+response = xasprintf("Device '%s' can not be detached (flow 
proxy)",
+ argv[1]);
+goto error;
+}
+}
+
+/* Indicate that the device being detached no longer needs a flow proxy. */
+if (dev_self != NULL)
+dev_self->flow_transfer_proxy_port_id = port_id;
+#endif /* ALLOW_EXPERIMENTAL_API */
+
 rte_eth_dev_info_get(port_id, &dev_info);
 rte_eth_dev_close(port_id);
 if (rte_dev_remove(dev_info.device) < 0) {
@@ -5167,7 +5213,8 @@ unlock:
 }
 
 int
-netdev_dpdk_get_port_id(struct netdev *netdev)
+netdev_dpdk_get_port_id(struct netdev *netdev,
+bool flow_transfer_proxy)
 {
 struct netdev_dpdk *dev;
 int ret = -1;
@@ -5178,7 +5225,7 @@ netdev_dpdk_get_port_id(struct netdev *netdev)
 
 dev = netdev_dpdk_cast(netdev);
 ovs_mutex_lock(&dev->mutex);
-ret = dev->port_id;
+ret = flow_transfer_proxy ? dev->flow_transfer_proxy_port_id : 
dev->port_id;
 ovs_mutex_unlock(&dev->mutex);
 out:
 return ret;
@@ -5214,13 +5261,15 @@ out:
 
 int
 netdev_dpdk_rte_flow_destroy(struct netdev *netde

[ovs-dev] [PATCH v2 2/3] netdev-offload-dpdk: replace action PORT_ID with REPRESENTED_PORT

2022-07-20 Thread Ivan Malov
Action PORT_ID has been deprecated. Use REPRESENTED_PORT instead.

Signed-off-by: Ivan Malov 
Acked-by: Andrew Rybchenko 
---
 lib/netdev-offload-dpdk.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/lib/netdev-offload-dpdk.c b/lib/netdev-offload-dpdk.c
index 12d299603..9cd5a0159 100644
--- a/lib/netdev-offload-dpdk.c
+++ b/lib/netdev-offload-dpdk.c
@@ -733,6 +733,14 @@ dump_flow_action(struct ds *s, struct ds *s_extra,
 ds_put_cstr(s, "rss / ");
 } else if (actions->type == RTE_FLOW_ACTION_TYPE_COUNT) {
 ds_put_cstr(s, "count / ");
+#ifdef ALLOW_EXPERIMENTAL_API
+} else if (actions->type == RTE_FLOW_ACTION_TYPE_REPRESENTED_PORT) {
+const struct rte_flow_action_ethdev *ethdev = actions->conf;
+
+ds_put_cstr(s, "represented_port ");
+if (ethdev) {
+ds_put_format(s, "ethdev_port_id %d ", ethdev->port_id);
+#else /* ! ALLOW_EXPERIMENTAL_API */
 } else if (actions->type == RTE_FLOW_ACTION_TYPE_PORT_ID) {
 const struct rte_flow_action_port_id *port_id = actions->conf;
 
@@ -740,6 +748,7 @@ dump_flow_action(struct ds *s, struct ds *s_extra,
 if (port_id) {
 ds_put_format(s, "original %d id %d ",
   port_id->original, port_id->id);
+#endif /* ALLOW_EXPERIMENTAL_API */
 }
 ds_put_cstr(s, "/ ");
 } else if (actions->type == RTE_FLOW_ACTION_TYPE_DROP) {
@@ -1767,6 +1776,24 @@ add_count_action(struct flow_actions *actions)
 add_flow_action(actions, RTE_FLOW_ACTION_TYPE_COUNT, count);
 }
 
+#ifdef ALLOW_EXPERIMENTAL_API
+static int
+add_represented_port_action(struct flow_actions *actions,
+struct netdev *outdev)
+{
+struct rte_flow_action_ethdev *ethdev;
+int outdev_id;
+
+outdev_id = netdev_dpdk_get_port_id(outdev);
+if (outdev_id < 0) {
+return -1;
+}
+ethdev = xzalloc(sizeof *ethdev);
+ethdev->port_id = outdev_id;
+add_flow_action(actions, RTE_FLOW_ACTION_TYPE_REPRESENTED_PORT, ethdev);
+return 0;
+}
+#else /* ! ALLOW_EXPERIMENTAL_API */
 static int
 add_port_id_action(struct flow_actions *actions,
struct netdev *outdev)
@@ -1783,6 +1810,7 @@ add_port_id_action(struct flow_actions *actions,
 add_flow_action(actions, RTE_FLOW_ACTION_TYPE_PORT_ID, port_id);
 return 0;
 }
+#endif /* ALLOW_EXPERIMENTAL_API */
 
 static int
 add_output_action(struct netdev *netdev,
@@ -1800,7 +1828,11 @@ add_output_action(struct netdev *netdev,
 return -1;
 }
 if (!netdev_flow_api_equals(netdev, outdev) ||
+#ifdef ALLOW_EXPERIMENTAL_API
+add_represented_port_action(actions, outdev)) {
+#else /* ! ALLOW_EXPERIMENTAL_API */
 add_port_id_action(actions, outdev)) {
+#endif /* ALLOW_EXPERIMENTAL_API */
 VLOG_DBG_RL(&rl, "%s: Output to port \'%s\' cannot be offloaded.",
 netdev_get_name(netdev), netdev_get_name(outdev));
 ret = -1;
-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 0/3] netdev-dpdk: negotiate delivery of per-packet Rx metadata

2022-07-20 Thread Ivan Malov
DPDK has got support for offloads involving assignment of per-packet
Rx metadata (flag, mark, tunnel ID and the likes). However, delivery
of such metadata from the NIC to the PMD might need to be negotiated
in advance. API [1] addresses that problem. Make OvS invoke this API.

Another problem is how flow rules with attribute "transfer" refer to
embedded switch ports by means of their representors. Action PORT_ID
has proved ambiguous: it refers to a DPDK ethdev, but in fact steers
packets to the entity represented by the ethdev, in example, to a VF.
The problem is addressed by [2]. Use the solution in OvS accordingly.

In addition, [2] and [3] address filtering traffic by input ports of
the embedded switch. In the suggested approach, "transfer" rules are
managed via the only ethdev with sufficient privileges, whilst match
criteria include an explicit item to indicate the desired input port.
Revisit OvS support for "transfer" rules to follow the said approach.

The following tests have been considered so far:
- build check with the current dpdk-next-net;
- running "make check" for every patch;
- tunnel offload demo with net/sfc PMD.

[1] http://mails.dpdk.org/archives/dev/2021-October/224291.html
[2] http://mails.dpdk.org/archives/dev/2021-October/224620.html
[3] http://mails.dpdk.org/archives/dev/2021-October/225081.html
---
 v1 -> v2: amendments to care about proxy detach and port ID logging

Ivan Malov (3):
  netdev-dpdk: negotiate delivery of per-packet Rx metadata
  netdev-offload-dpdk: replace action PORT_ID with REPRESENTED_PORT
  netdev-offload-dpdk: use flow transfer proxy mechanism

 lib/netdev-dpdk.c | 141 +-
 lib/netdev-dpdk.h |   4 +-
 lib/netdev-offload-dpdk.c |  91 +---
 3 files changed, 209 insertions(+), 27 deletions(-)

-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH v2 1/3] netdev-dpdk: negotiate delivery of per-packet Rx metadata

2022-07-20 Thread Ivan Malov
This may be required by some PMDs in offload scenarios.

Signed-off-by: Ivan Malov 
Acked-by: Andrew Rybchenko 
---
 lib/netdev-dpdk.c | 44 
 1 file changed, 44 insertions(+)

diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index f9535bfb4..45e5d26d2 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -1085,6 +1085,38 @@ dpdk_eth_flow_ctrl_setup(struct netdev_dpdk *dev) 
OVS_REQUIRES(dev->mutex)
 }
 }
 
+#ifdef ALLOW_EXPERIMENTAL_API
+static void
+dpdk_eth_dev_init_rx_metadata(struct netdev_dpdk *dev)
+{
+uint64_t rx_metadata = 0;
+int ret;
+
+/* For the fallback offload (non-"transfer" rules) */
+rx_metadata |= RTE_ETH_RX_METADATA_USER_MARK;
+/* For the full offload ("transfer" rules) */
+rx_metadata |= RTE_ETH_RX_METADATA_TUNNEL_ID;
+
+ret = rte_eth_rx_metadata_negotiate(dev->port_id, &rx_metadata);
+if (ret == 0) {
+if (!(rx_metadata & RTE_ETH_RX_METADATA_USER_MARK)) {
+VLOG_DBG("The NIC will not provide per-packet USER_MARK on port "
+ DPDK_PORT_ID_FMT, dev->port_id);
+}
+if (!(rx_metadata & RTE_ETH_RX_METADATA_TUNNEL_ID)) {
+VLOG_DBG("The NIC will not provide per-packet TUNNEL_ID on port "
+ DPDK_PORT_ID_FMT, dev->port_id);
+}
+} else if (ret == -ENOTSUP) {
+VLOG_DBG("Rx metadata negotiate procedure is not supported on port "
+ DPDK_PORT_ID_FMT, dev->port_id);
+} else {
+VLOG_WARN("Cannot negotiate Rx metadata on port "
+  DPDK_PORT_ID_FMT, dev->port_id);
+}
+}
+#endif /* ALLOW_EXPERIMENTAL_API */
+
 static int
 dpdk_eth_dev_init(struct netdev_dpdk *dev)
 OVS_REQUIRES(dev->mutex)
@@ -1099,6 +1131,18 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
  RTE_ETH_RX_OFFLOAD_TCP_CKSUM |
  RTE_ETH_RX_OFFLOAD_IPV4_CKSUM;
 
+#ifdef ALLOW_EXPERIMENTAL_API
+/*
+ * Full tunnel offload requires that tunnel ID metadata be
+ * delivered with "miss" packets from the hardware to the
+ * PMD. The same goes for megaflow mark metadata which is
+ * used in MARK + RSS offload scenario.
+ *
+ * Request delivery of such metadata.
+ */
+dpdk_eth_dev_init_rx_metadata(dev);
+#endif /* ALLOW_EXPERIMENTAL_API */
+
 rte_eth_dev_info_get(dev->port_id, &info);
 
 if (strstr(info.driver_name, "vf") != NULL) {
-- 
2.30.2

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn 1/3] Don't save original dst IP and Port to avoid megaflow unwildcarding.

2022-07-20 Thread Dumitru Ceara


Hi Han,

On 7/7/22 19:02, Dumitru Ceara wrote:
> On 7/7/22 18:21, Han Zhou wrote:
>> On Thu, Jul 7, 2022 at 8:55 AM Dumitru Ceara  wrote:
>>>
>>> On 7/7/22 13:45, Dumitru Ceara wrote:
 On 7/7/22 00:08, Han Zhou wrote:
> On Wed, Jul 6, 2022 at 8:45 AM Dumitru Ceara  wrote:
>>
>> Hi Han,
>>
>> On 7/6/22 00:41, Han Zhou wrote:
>>> The ls_in_pre_stateful priority 120 flow that saves dst IP and Port
>> to
>>> registers is causing a critical dataplane performance impact to
>>> short-lived connections, because it unwildcards megaflows with exact
>>> match on dst IP and L4 ports. Any new connections with a different
>>> client side L4 port will encounter datapath flow miss and upcall to
>>> ovs-vswitchd.
>>>
>>> These fields (dst IP and port) were saved to registers to solve a
>>> problem of LB hairpin use case when different VIPs are sharing
>>> overlapping backend+port [0]. The change [0] might not have as wide
>>> performance impact as it is now because at that time one of the match
>>> condition "REGBIT_CONNTRACK_NAT == 1" was set only for established
>> and
>>> natted traffic, while now the impact is more obvious because
>>> REGBIT_CONNTRACK_NAT is now set for all IP traffic (if any VIP
>>> configured on the LS) since commit [1], after several other
>> indirectly
>>> related optimizations and refactors.
>>>
>>> Since the changes that introduced the performance problem had their
>>> own values (fixes problems or optimizes performance), so we don't
>> want
>>> to revert any of the changes (and it is also not straightforward to
>>> revert any of them because there have been lots of changes and
>> refactors
>>> on top of them).
>>>
>>> Change [0] itself has added an alternative way to solve the
>> overlapping
>>> backends problem, which utilizes ct fields instead of saving dst IP
>> and
>>> port to registers. This patch forces to that approach and removes the
>>> flows/actions that saves the dst IP and port to avoid the dataplane
>>> performance problem for short-lived connections.
>>>
>>> (With this approach, the ct_state DNAT is not HW offload friendly,
>> so it
>>> may result in those flows not being offloaded, which is supposed to
>> be
>>> solved in a follow-up patch)

While we're waiting for more input from ovn-k8s on this, I have a
slightly different question.

Aren't we hitting a similar problem in the router pipeline, due to
REG_ORIG_TP_DPORT_ROUTER?

Thanks,
Dumitru

>>>
>>> [0] ce0ef8d59850 ("Properly handle hairpin traffic for VIPs with
>> shared
> backends.")
>>> [1] 0038579d1928 ("northd: Optimize ct nat for load balancer
>> traffic.")
>>>
>>> Signed-off-by: Han Zhou 
>>> ---
>>
>> I think the main concern I have is that this forces us to choose
>> between:
>> a. non hwol friendly flows (reduced performance)
>> b. less functionality (with the knob in patch 3/3 set to false).
>>
> Thanks Dumitru for the comments! I agree the solution is not ideal,
>> but if
> we look at it from a different angle, even with a), for most
>> pod->service
> traffic the performance is still much better than how it is today (not
> offloaded kernel datapath is still much better than userspace
>> slowpath).
> And *hopefully* b) is ok for most use cases to get HW-offload
>> capability.
>
>>>
>>> Just a note on this item.  I'm a bit confused about why all traffic
>>> would be slowpath-ed?  It's just the first packet that goes to vswitchd
>>> as an upcall, right?
>>>
>>
>> It is about all traffic for *short-lived* connections. Any clients ->
>> service traffic with the pattern:
>> 1. TCP connection setup
>> 2. Set API request, receives response
>> 3. Close TCP connection
>> It can be tested with netperf TCP_CRR. Every time the client side TCP port
>> is different, but since the server -> client DP flow includes the client
>> TCP port, for each such transaction there is going to be at least a DP flow
>> miss and goes to userspace. Such application latency would be very high. In
>> addition, it causes the OVS handler CPU spikes very high which would
>> further impact the dataplane performance of the system.
>>
>>> Once the megaflow (even if it's more specific than ideal) is installed
>>> all following traffic in that session should be forwarded in fast path
>>> (kernel).
>>>
>>> Also, I'm not sure I follow why the same behavior wouldn't happen with
>>> your changes too for pod->service.  The datapath flow includes the
>>> dp_hash() match, and that's likely different for different connections.
>>>
>>
>> With the change it is not going to happen, because the match is for server
>> side port only.
>> For dp_hash(), for what I remembered, there are as many as the number of
>> megaflows as the number of buckets (the masked hash value) at most.
>>
> 
> OK, that explains it, thanks.
> 
>>> Or am I missing something

[ovs-dev] [PATCH ovn v3 6/6] northd: Limit bulk removal of MAC binding aging

2022-07-20 Thread Ales Musil
Because the transaction is limited, in terms of how
many operations it can do, we should not allow
more than 15 MAC bindings to be removed at once every
10 ms. This has a downside that in theory we could
never finish the removal, however in practice it is
unlikely considering that not all routers will have aging
enabled and the enabled will be with reasonable threshold.

Reported-at: https://bugzilla.redhat.com/2084668
Signed-off-by: Ales Musil 
---
v3: Rebase on top of current main.
Update according to the final conclusion.
---
 northd/mac-binding-aging.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/northd/mac-binding-aging.c b/northd/mac-binding-aging.c
index 8af477ff1..32beef38b 100644
--- a/northd/mac-binding-aging.c
+++ b/northd/mac-binding-aging.c
@@ -28,6 +28,9 @@
 
 VLOG_DEFINE_THIS_MODULE(mac_binding_aging);
 
+#define MAC_BINDING_BULK_REMOVAL_LIMIT 15
+#define MAC_BINDING_BULK_REMOVAL_DELAY_MSEC 10
+
 struct mac_binding_waker {
 bool should_schedule;
 long long next_wake_msec;
@@ -39,7 +42,8 @@ static void
 mac_binding_aging_run_for_datapath(const struct sbrec_datapath_binding *dp,
const struct nbrec_logical_router *nbr,
struct ovsdb_idl_index *mb_by_datapath,
-   int64_t now, int64_t *wake_delay)
+   int64_t now, int64_t *wake_delay,
+   uint8_t *removed_n)
 {
 uint64_t threshold = smap_get_uint(&nbr->options,
  "mac_binding_age_threshold",
@@ -60,6 +64,10 @@ mac_binding_aging_run_for_datapath(const struct 
sbrec_datapath_binding *dp,
 return;
 } else if (elapsed >= threshold) {
 sbrec_mac_binding_delete(mb);
+(*removed_n)++;
+if (*removed_n == MAC_BINDING_BULK_REMOVAL_LIMIT) {
+break;
+}
 } else {
 *wake_delay = MIN(*wake_delay, threshold - elapsed);
 }
@@ -78,6 +86,7 @@ en_mac_binding_aging_run(struct engine_node *node, void *data 
OVS_UNUSED)
 
 int64_t next_expire_msec = INT64_MAX;
 int64_t now = time_wall_msec();
+uint8_t removed_n = 0;
 struct northd_data *northd_data = engine_get_input_data("northd", node);
 struct ovsdb_idl_index *sbrec_mac_binding_by_datapath =
 engine_ovsdb_node_get_index(engine_get_input("SB_mac_binding", node),
@@ -88,7 +97,13 @@ en_mac_binding_aging_run(struct engine_node *node, void 
*data OVS_UNUSED)
 if (od->sb && od->nbr) {
 mac_binding_aging_run_for_datapath(od->sb, od->nbr,
sbrec_mac_binding_by_datapath,
-   now, &next_expire_msec);
+   now, &next_expire_msec,
+   &removed_n);
+if (removed_n == MAC_BINDING_BULK_REMOVAL_LIMIT) {
+/* Schedule the next run after specified delay. */
+next_expire_msec = MAC_BINDING_BULK_REMOVAL_DELAY_MSEC;
+break;
+}
 }
 }
 
-- 
2.35.3

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v3 2/6] controller: Add mac-binding-index.c/.h files

2022-07-20 Thread Ales Musil
Add helper source file for creating index
over MAC binding table.

Reported-at: https://bugzilla.redhat.com/2084668
Signed-off-by: Ales Musil 
---
v3: Rebase on top of current main.
Update according to the final conclusion.
---
 controller/ovn-controller.c |  8 +++-
 lib/automake.mk |  2 ++
 lib/mac-binding-index.c | 35 +++
 lib/mac-binding-index.h | 26 ++
 4 files changed, 66 insertions(+), 5 deletions(-)
 create mode 100644 lib/mac-binding-index.c
 create mode 100644 lib/mac-binding-index.h

diff --git a/controller/ovn-controller.c b/controller/ovn-controller.c
index 2e9138036..31d7f4566 100644
--- a/controller/ovn-controller.c
+++ b/controller/ovn-controller.c
@@ -54,6 +54,7 @@
 #include "lib/extend-table.h"
 #include "lib/ip-mcast-index.h"
 #include "lib/mcast-group-index.h"
+#include "lib/mac-binding-index.h"
 #include "lib/ovn-sb-idl.h"
 #include "lib/ovn-util.h"
 #include "patch.h"
@@ -3454,9 +3455,7 @@ main(int argc, char *argv[])
 = ovsdb_idl_index_create1(ovnsb_idl_loop.idl,
   &sbrec_datapath_binding_col_tunnel_key);
 struct ovsdb_idl_index *sbrec_mac_binding_by_lport_ip
-= ovsdb_idl_index_create2(ovnsb_idl_loop.idl,
-  &sbrec_mac_binding_col_logical_port,
-  &sbrec_mac_binding_col_ip);
+= mac_binding_by_lport_ip_index_create(ovnsb_idl_loop.idl);
 struct ovsdb_idl_index *sbrec_ip_multicast
 = ip_mcast_index_create(ovnsb_idl_loop.idl);
 struct ovsdb_idl_index *sbrec_igmp_group
@@ -3469,8 +3468,7 @@ main(int argc, char *argv[])
   &sbrec_fdb_col_mac,
   &sbrec_fdb_col_dp_key);
 struct ovsdb_idl_index *sbrec_mac_binding_by_datapath
-= ovsdb_idl_index_create1(ovnsb_idl_loop.idl,
-  &sbrec_mac_binding_col_datapath);
+= mac_binding_by_datapath_index_create(ovnsb_idl_loop.idl);
 struct ovsdb_idl_index *sbrec_static_mac_binding_by_datapath
 = ovsdb_idl_index_create1(ovnsb_idl_loop.idl,
   &sbrec_static_mac_binding_col_datapath);
diff --git a/lib/automake.mk b/lib/automake.mk
index 3a2da1fe4..0c0b95f4e 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -26,6 +26,8 @@ lib_libovn_la_SOURCES = \
lib/ovn-parallel-hmap.c \
lib/ip-mcast-index.c \
lib/ip-mcast-index.h \
+   lib/mac-binding-index.c \
+lib/mac-binding-index.h \
lib/mcast-group-index.c \
lib/mcast-group-index.h \
lib/lex.c \
diff --git a/lib/mac-binding-index.c b/lib/mac-binding-index.c
new file mode 100644
index 0..9a8b7deb8
--- /dev/null
+++ b/lib/mac-binding-index.c
@@ -0,0 +1,35 @@
+/* Copyright (c) 2022, Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include 
+
+#include "lib/mac-binding-index.h"
+#include "lib/ovn-sb-idl.h"
+
+struct ovsdb_idl_index *
+mac_binding_by_datapath_index_create(struct ovsdb_idl
+ *idl)
+{
+return ovsdb_idl_index_create1(idl, &sbrec_mac_binding_col_datapath);
+}
+
+struct ovsdb_idl_index *
+mac_binding_by_lport_ip_index_create(struct ovsdb_idl
+ *idl)
+{
+return ovsdb_idl_index_create2(idl,
+   &sbrec_mac_binding_col_logical_port,
+   &sbrec_mac_binding_col_ip);
+}
diff --git a/lib/mac-binding-index.h b/lib/mac-binding-index.h
new file mode 100644
index 0..f2853a7ca
--- /dev/null
+++ b/lib/mac-binding-index.h
@@ -0,0 +1,26 @@
+/* Copyright (c) 2022, Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#ifndef OVN_MAC_BINDING_INDEX_H
+#define OVN_MAC_BINDING_INDEX_H 1
+
+#include "lib/ovn-sb-idl.h"
+
+struct ovsdb_idl_index *mac_binding_by_datapath_index_crea

[ovs-dev] [PATCH ovn v3 4/6] northd: Add MAC binding aging mechanism

2022-07-20 Thread Ales Musil
Add MAC binding aging mechanism, that utilizes
the timestamp column of MAC_Binding table.
When the MAC binding exceeds the threshold it is
removed from SB DB, this is postponed only in case
we receive update ARP with update to MAC address.

The threshold is configurable via option
"mac_binding_age_threshold" that can be specified
for each logical router. The option is defaulting to
0 which means that by default the aging is disabled
and the MAC binding rows will be persisted the same
way as before.

Reported-at: https://bugzilla.redhat.com/2084668
Signed-off-by: Ales Musil 
---
v3: Rebase on top of current main.
Update according to the final conclusion.
---
 northd/automake.mk |   2 +
 northd/inc-proc-northd.c   |  13 
 northd/mac-binding-aging.c | 151 +
 northd/mac-binding-aging.h |  33 
 ovn-nb.xml |   7 ++
 5 files changed, 206 insertions(+)
 create mode 100644 northd/mac-binding-aging.c
 create mode 100644 northd/mac-binding-aging.h

diff --git a/northd/automake.mk b/northd/automake.mk
index 4862ec7b7..81582867d 100644
--- a/northd/automake.mk
+++ b/northd/automake.mk
@@ -1,6 +1,8 @@
 # ovn-northd
 bin_PROGRAMS += northd/ovn-northd
 northd_ovn_northd_SOURCES = \
+   northd/mac-binding-aging.c \
+   northd/mac-binding-aging.h \
northd/northd.c \
northd/northd.h \
northd/ovn-northd.c \
diff --git a/northd/inc-proc-northd.c b/northd/inc-proc-northd.c
index 43093cb5a..4a3699106 100644
--- a/northd/inc-proc-northd.c
+++ b/northd/inc-proc-northd.c
@@ -22,9 +22,11 @@
 #include "ip-mcast-index.h"
 #include "static-mac-binding-index.h"
 #include "lib/inc-proc-eng.h"
+#include "lib/mac-binding-index.h"
 #include "lib/ovn-nb-idl.h"
 #include "lib/ovn-sb-idl.h"
 #include "mcast-group-index.h"
+#include "northd/mac-binding-aging.h"
 #include "openvswitch/poll-loop.h"
 #include "openvswitch/vlog.h"
 #include "inc-proc-northd.h"
@@ -149,6 +151,8 @@ enum sb_engine_node {
  * avoid sparse errors. */
 static ENGINE_NODE(northd, "northd");
 static ENGINE_NODE(lflow, "lflow");
+static ENGINE_NODE(mac_binding_aging, "mac_binding_aging");
+static ENGINE_NODE(mac_binding_aging_waker, "mac_binding_aging_waker");
 
 void inc_proc_northd_init(struct ovsdb_idl_loop *nb,
   struct ovsdb_idl_loop *sb)
@@ -211,12 +215,16 @@ void inc_proc_northd_init(struct ovsdb_idl_loop *nb,
 engine_add_input(&en_northd, &en_sb_load_balancer, NULL);
 engine_add_input(&en_northd, &en_sb_fdb, NULL);
 engine_add_input(&en_northd, &en_sb_static_mac_binding, NULL);
+engine_add_input(&en_mac_binding_aging, &en_sb_mac_binding, NULL);
+engine_add_input(&en_mac_binding_aging, &en_northd, NULL);
+engine_add_input(&en_mac_binding_aging, &en_mac_binding_aging_waker, NULL);
 engine_add_input(&en_lflow, &en_nb_bfd, NULL);
 engine_add_input(&en_lflow, &en_sb_bfd, NULL);
 engine_add_input(&en_lflow, &en_sb_logical_flow, NULL);
 engine_add_input(&en_lflow, &en_sb_multicast_group, NULL);
 engine_add_input(&en_lflow, &en_sb_igmp_group, NULL);
 engine_add_input(&en_lflow, &en_northd, NULL);
+engine_add_input(&en_lflow, &en_mac_binding_aging, NULL);
 
 struct engine_arg engine_arg = {
 .nb_idl = nb->idl,
@@ -235,6 +243,8 @@ void inc_proc_northd_init(struct ovsdb_idl_loop *nb,
 chassis_hostname_index_create(sb->idl);
 struct ovsdb_idl_index *sbrec_static_mac_binding_by_lport_ip
 = static_mac_binding_index_create(sb->idl);
+struct ovsdb_idl_index *sbrec_mac_binding_by_datapath
+= mac_binding_by_datapath_index_create(sb->idl);
 
 engine_init(&en_lflow, &engine_arg);
 
@@ -256,6 +266,9 @@ void inc_proc_northd_init(struct ovsdb_idl_loop *nb,
 engine_ovsdb_node_add_index(&en_sb_static_mac_binding,
 "sbrec_static_mac_binding_by_lport_ip",
 sbrec_static_mac_binding_by_lport_ip);
+engine_ovsdb_node_add_index(&en_sb_mac_binding,
+"sbrec_mac_binding_by_datapath",
+sbrec_mac_binding_by_datapath);
 }
 
 void inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn,
diff --git a/northd/mac-binding-aging.c b/northd/mac-binding-aging.c
new file mode 100644
index 0..8af477ff1
--- /dev/null
+++ b/northd/mac-binding-aging.c
@@ -0,0 +1,151 @@
+/* Copyright (c) 2022, Red Hat, Inc.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at:
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations un

[ovs-dev] [PATCH ovn v3 5/6] ovn.at: Add test case covering the MAC binding aging

2022-07-20 Thread Ales Musil
Reported-at: https://bugzilla.redhat.com/2084668
Signed-off-by: Ales Musil 
---
v3: Rebase on top of current main.
Update according to the final conclusion.
---
 tests/ovn.at | 109 +++
 1 file changed, 109 insertions(+)

diff --git a/tests/ovn.at b/tests/ovn.at
index 7d10610fd..cbacc52c5 100644
--- a/tests/ovn.at
+++ b/tests/ovn.at
@@ -32505,3 +32505,112 @@ AT_CHECK([test $(ovn-sbctl list fdb | grep -c 
"00:00:00:00:10:30") = 0])
 OVN_CLEANUP([hv1])
 AT_CLEANUP
 ])
+
+OVN_FOR_EACH_NORTHD([
+AT_SETUP([MAC binding aging])
+ovn_start
+
+net_add n1
+
+AT_CHECK([ovn-nbctl ls-add public])
+AT_CHECK([ovn-nbctl ls-add internal])
+
+AT_CHECK([ovn-nbctl lsp-add public ln_port])
+AT_CHECK([ovn-nbctl lsp-set-addresses ln_port unknown])
+AT_CHECK([ovn-nbctl lsp-set-type ln_port localnet])
+AT_CHECK([ovn-nbctl lsp-set-options ln_port network_name=physnet1])
+
+AT_CHECK([ovn-nbctl lsp-add public public-gw])
+AT_CHECK([ovn-nbctl lsp-set-type public-gw router])
+AT_CHECK([ovn-nbctl lsp-set-addresses public-gw 00:00:00:00:10:00 router])
+AT_CHECK([ovn-nbctl lsp-set-options public-gw router-port=gw-public])
+
+AT_CHECK([ovn-nbctl lsp-add internal internal-gw])
+AT_CHECK([ovn-nbctl lsp-set-type internal-gw router])
+AT_CHECK([ovn-nbctl lsp-set-addresses internal-gw 00:00:00:00:20:00 router])
+AT_CHECK([ovn-nbctl lsp-set-options internal-gw router-port=gw-internal])
+
+AT_CHECK([ovn-nbctl lsp-add internal vif1])
+AT_CHECK([ovn-nbctl lsp-set-addresses vif1 "00:00:00:00:20:10 192.168.20.10"])
+
+AT_CHECK([ovn-nbctl lsp-add internal vif2])
+AT_CHECK([ovn-nbctl lsp-set-addresses vif2 "00:00:00:00:20:20 192.168.20.20"])
+
+AT_CHECK([ovn-nbctl lr-add gw])
+AT_CHECK([ovn-nbctl lrp-add gw gw-public 00:00:00:00:10:00 192.168.10.1/24])
+AT_CHECK([ovn-nbctl lrp-add gw gw-internal 00:00:00:00:20:00 192.168.20.1/24])
+
+sim_add hv1
+as hv1
+ovs-vsctl add-br br-underlay
+ovn_attach n1 br-underlay 192.168.0.1
+ovs-vsctl add-br br-phys
+ovs-vsctl -- add-port br-int vif1 -- \
+set interface vif1 external-ids:iface-id=vif1 \
+options:tx_pcap=hv1/vif1-tx.pcap \
+options:rxq_pcap=hv1/vif1-rx.pcap \
+ofport-request=1
+ovs-vsctl -- add-port br-phys ext1 -- \
+set interface ext1 \
+options:tx_pcap=hv1/ext1-tx.pcap \
+options:rxq_pcap=hv1/ext1-rx.pcap \
+ofport-request=2
+ovs-vsctl set open . external_ids:ovn-bridge-mappings=physnet1:br-phys
+
+sim_add hv2
+as hv2
+ovs-vsctl add-br br-underlay
+ovn_attach n1 br-underlay 192.168.0.2
+ovs-vsctl add-br br-phys
+ovs-vsctl -- add-port br-int vif2 -- \
+set interface vif2 external-ids:iface-id=vif2 \
+options:tx_pcap=hv2/vif2-tx.pcap \
+options:rxq_pcap=hv2/vif2-rx.pcap \
+ofport-request=1
+ovs-vsctl -- add-port br-phys ext2 -- \
+set interface ext2 \
+options:tx_pcap=hv2/ext2-tx.pcap \
+options:rxq_pcap=hv2/ext2-rx.pcap \
+ofport-request=2
+ovs-vsctl set open . external_ids:ovn-bridge-mappings=physnet1:br-phys
+
+OVN_POPULATE_ARP
+
+send_garp() {
+hv=$1
+dev=$2
+mac_byte=$3
+ip_byte=${4-$3}
+
+mac="10$mac_byte"
+ip=`ip_to_hex 192 168 10 $ip_byte`
+packet=${mac}08060001080006040002${mac}${ip}${mac}${ip}
+as $hv ovs-appctl netdev-dummy/receive $dev $packet
+}
+
+# Check if the option is not present by default
+AT_CHECK([fetch_column nb:logical_router options name="gw" | grep -q 
mac_binding_age_threshold], [1])
+
+# Set the MAC binding aging threshold
+AT_CHECK([ovn-nbctl set logical_router gw options:mac_binding_age_threshold=1])
+AT_CHECK([fetch_column nb:logical_router options | grep -q 
mac_binding_age_threshold=1])
+AT_CHECK([ovn-nbctl --wait=sb sync])
+
+# Send GARP to populate MAC binding table records
+send_garp hv1 ext1 10
+send_garp hv2 ext2 20
+
+OVS_WAIT_UNTIL([ovn-sbctl list mac_binding | grep -q "192.168.10.10"])
+OVS_WAIT_UNTIL([ovn-sbctl list mac_binding | grep -q "192.168.10.20"])
+
+# Check if the records are removed after 1 sec of inactivity
+OVS_WAIT_UNTIL([
+test "0" = "$(ovn-sbctl list mac_binding | grep -c '192.168.10.10')"
+])
+OVS_WAIT_UNTIL([
+test "0" = "$(ovn-sbctl list mac_binding | grep -c '192.168.10.20')"
+])
+
+OVN_CLEANUP([hv1], [hv2])
+AT_CLEANUP
+])
-- 
2.35.3

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v3 3/6] northd: Move struct ovn_datapath and related structs to northd.h

2022-07-20 Thread Ales Musil
The struct ovn_datapath could not be used outside the northd.c
move it to northd.h that it can be used by other .c files later on.

Reported-at: https://bugzilla.redhat.com/2084668
Signed-off-by: Ales Musil 
---
v3: Rebase on top of current main.
Update according to the final conclusion.
---
 northd/northd.c | 154 --
 northd/northd.h | 158 
 2 files changed, 158 insertions(+), 154 deletions(-)

diff --git a/northd/northd.c b/northd/northd.c
index d31cb1688..db672be17 100644
--- a/northd/northd.c
+++ b/northd/northd.c
@@ -514,74 +514,6 @@ port_has_qos_params(const struct smap *opts)
 }
 
 
-/*
- * Multicast snooping and querier per datapath configuration.
- */
-struct mcast_switch_info {
-
-bool enabled;   /* True if snooping enabled. */
-bool querier;   /* True if querier enabled. */
-bool flood_unregistered;/* True if unregistered multicast should be
- * flooded.
- */
-bool flood_relay;   /* True if the switch is connected to a
- * multicast router and unregistered multicast
- * should be flooded to the mrouter. Only
- * applicable if flood_unregistered == false.
- */
-bool flood_reports; /* True if the switch has at least one port
- * configured to flood reports.
- */
-bool flood_static;  /* True if the switch has at least one port
- * configured to flood traffic.
- */
-int64_t table_size; /* Max number of IP multicast groups. */
-int64_t idle_timeout;   /* Timeout after which an idle group is
- * flushed.
- */
-int64_t query_interval; /* Interval between multicast queries. */
-char *eth_src;  /* ETH src address of the queries. */
-char *ipv4_src; /* IPv4 src address of the queries. */
-char *ipv6_src; /* IPv6 src address of the queries. */
-
-int64_t query_max_response; /* Expected time after which reports should
- * be received for queries that were sent out.
- */
-
-atomic_uint64_t active_v4_flows;   /* Current number of active IPv4 
multicast
- * flows.
- */
-atomic_uint64_t active_v6_flows;   /* Current number of active IPv6 
multicast
- * flows.
- */
-};
-
-struct mcast_router_info {
-bool relay;/* True if the router should relay IP multicast. */
-bool flood_static; /* True if the router has at least one port configured
-* to flood traffic.
-*/
-};
-
-struct mcast_info {
-
-struct hmap group_tnlids;  /* Group tunnel IDs in use on this DP. */
-uint32_t group_tnlid_hint; /* Hint for allocating next group tunnel ID. */
-struct ovs_list groups;/* List of groups learnt on this DP. */
-
-union {
-struct mcast_switch_info sw;  /* Switch specific multicast info. */
-struct mcast_router_info rtr; /* Router specific multicast info. */
-};
-};
-
-struct mcast_port_info {
-bool flood; /* True if the port should flood IP multicast traffic
- * regardless if it's registered or not. */
-bool flood_reports; /* True if the port should flood IP multicast reports
- * (e.g., IGMP join/leave). */
-};
-
 static void
 init_mcast_port_info(struct mcast_port_info *mcast_info,
  const struct nbrec_logical_switch_port *nbsp,
@@ -611,92 +543,6 @@ ovn_mcast_group_allocate_key(struct mcast_info *mcast_info)
   &mcast_info->group_tnlid_hint);
 }
 
-/* The 'key' comes from nbs->header_.uuid or nbr->header_.uuid or
- * sb->external_ids:logical-switch. */
-struct ovn_datapath {
-struct hmap_node key_node;  /* Index on 'key'. */
-struct uuid key;/* (nbs/nbr)->header_.uuid. */
-
-const struct nbrec_logical_switch *nbs;  /* May be NULL. */
-const struct nbrec_logical_router *nbr;  /* May be NULL. */
-const struct sbrec_datapath_binding *sb; /* May be NULL. */
-
-struct ovs_list list;   /* In list of similar records. */
-
-uint32_t tunnel_key;
-
-/* Logical switch data. */
-struct ovn_port **router_ports;
-size_t n_router_ports;
-size_t n_allocated_router_ports;
-
-struct hmap port_tnlids;
-uint32_t port_key_hint;
-
-bool has_stateful_acl;
-bool has_lb_vip;
-bool has_unknown;
-bool has_acls;
-
-/* IPAM data. */
-s

[ovs-dev] [PATCH ovn v3 0/6] MAC binding aging mechanism

2022-07-20 Thread Ales Musil
Add MAC binding aging mechanism, that
should take care of stale MAC bindings.

The mechanism is configurable per logical
router, with option called
"mac_binding_age_threshold", which is 0 by
default. The zero also indicates that the
aging mechanism is disabled.

The aging is based on timestamp in MAC_binding
table, rows that exceed threshold specified
by the LR option will be deleted. The row timestamp
is refresh when we receive ARP that changes
MAC address of corresponding row.

In order to prevent big pressure on SB DB
limit how many MAC bindings can be removed
at once. The limit is currently set to 15
rows with delay of 10 msec before next batch.

The test case is present as separate patch of the series.

Ales Musil (6):
  northd, controller: Add timestamp column to MAC_Binding table
  controller: Add mac-binding-index.c/.h files
  northd: Move struct ovn_datapath and related structs to northd.h
  northd: Add MAC binding aging mechanism
  ovn.at: Add test case covering the MAC binding aging
  northd: Limit bulk removal of MAC binding aging

 controller/ovn-controller.c |   8 +-
 controller/pinctrl.c|   2 +
 lib/automake.mk |   2 +
 lib/mac-binding-index.c |  35 
 lib/mac-binding-index.h |  26 ++
 northd/automake.mk  |   2 +
 northd/inc-proc-northd.c|  13 +++
 northd/mac-binding-aging.c  | 166 
 northd/mac-binding-aging.h  |  33 +++
 northd/northd.c | 154 -
 northd/northd.h | 158 ++
 northd/ovn-northd.c |   2 +-
 ovn-nb.xml  |   7 ++
 ovn-sb.ovsschema|   5 +-
 ovn-sb.xml  |   6 ++
 tests/ovn.at| 109 +++
 16 files changed, 566 insertions(+), 162 deletions(-)
 create mode 100644 lib/mac-binding-index.c
 create mode 100644 lib/mac-binding-index.h
 create mode 100644 northd/mac-binding-aging.c
 create mode 100644 northd/mac-binding-aging.h

-- 
2.35.3

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


[ovs-dev] [PATCH ovn v3 1/6] northd, controller: Add timestamp column to MAC_Binding table

2022-07-20 Thread Ales Musil
The new timestamp column in MAC_Binding is
populated with current time whenever the row is
created or the MAC address is updated.
This can be utilized by MAC binding aging mechanism,
when we can check if enough time has passed since the
creation/update.

Reported-at: https://bugzilla.redhat.com/2084668
Signed-off-by: Ales Musil 
---
v3: Rebase on top of current main.
Update according to the final conclusion.
---
 controller/pinctrl.c | 2 ++
 northd/ovn-northd.c  | 2 +-
 ovn-sb.ovsschema | 5 +++--
 ovn-sb.xml   | 6 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/controller/pinctrl.c b/controller/pinctrl.c
index 38e8590af..e3e2e9ac0 100644
--- a/controller/pinctrl.c
+++ b/controller/pinctrl.c
@@ -4209,8 +4209,10 @@ mac_binding_add_to_sb(struct ovsdb_idl_txn 
*ovnsb_idl_txn,
 sbrec_mac_binding_set_ip(b, ip);
 sbrec_mac_binding_set_mac(b, mac_string);
 sbrec_mac_binding_set_datapath(b, dp);
+sbrec_mac_binding_set_timestamp(b, time_wall_msec());
 } else if (strcmp(b->mac, mac_string)) {
 sbrec_mac_binding_set_mac(b, mac_string);
+sbrec_mac_binding_set_timestamp(b, time_wall_msec());
 }
 }
 
diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
index ab28756af..bd35802ed 100644
--- a/northd/ovn-northd.c
+++ b/northd/ovn-northd.c
@@ -115,7 +115,7 @@ static const char *rbac_port_binding_update[] =
 static const char *rbac_mac_binding_auth[] =
 {""};
 static const char *rbac_mac_binding_update[] =
-{"logical_port", "ip", "mac", "datapath"};
+{"logical_port", "ip", "mac", "datapath", "timestamp"};
 
 static const char *rbac_svc_monitor_auth[] =
 {""};
diff --git a/ovn-sb.ovsschema b/ovn-sb.ovsschema
index 3b78ea6f6..bca552b5e 100644
--- a/ovn-sb.ovsschema
+++ b/ovn-sb.ovsschema
@@ -1,7 +1,7 @@
 {
 "name": "OVN_Southbound",
-"version": "20.23.0",
-"cksum": "4045988377 28575",
+"version": "20.24.0",
+"cksum": "4165157445 28634",
 "tables": {
 "SB_Global": {
 "columns": {
@@ -260,6 +260,7 @@
 "logical_port": {"type": "string"},
 "ip": {"type": "string"},
 "mac": {"type": "string"},
+"timestamp": {"type": {"key": "integer"}},
 "datapath": {"type": {"key": {"type": "uuid",
   "refTable": 
"Datapath_Binding",
 "indexes": [["logical_port", "ip"]],
diff --git a/ovn-sb.xml b/ovn-sb.xml
index 59ad3aa2d..c5e6c9cc1 100644
--- a/ovn-sb.xml
+++ b/ovn-sb.xml
@@ -3621,6 +3621,12 @@ tcp.flags = RST;
 
   The Ethernet address to which the IP is bound.
 
+
+
+  The timestamp in msec when the MAC binding was added or updated.
+  Records that existed before this column will have 0.
+
+
 
   The logical datapath to which the logical port belongs.
 
-- 
2.35.3

___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH ovn v2 2/2] multicast: Properly flood IGMP queries and reports.

2022-07-20 Thread Ales Musil
Hi Dumitru,

the code looks good to me, with one small nit see below.

On Tue, Jul 19, 2022 at 10:43 PM Dumitru Ceara  wrote:

> RFC 4541 "Considerations for Internet Group Management Protocol (IGMP)
> and Multicast Listener Discovery (MLD) Snooping Switches" [0] states:
>
> For IGMP packets:
> 2.1.1.  IGMP Forwarding Rules
>1) A snooping switch should forward IGMP Membership Reports only to
>   those ports where multicast routers are attached.
>
>   Alternatively stated: a snooping switch should not forward IGMP
>   Membership Reports to ports on which only hosts are attached.  An
>   administrative control may be provided to override this
>   restriction, allowing the report messages to be flooded to other
>   ports.
>
>   This is the main IGMP snooping functionality for the control
> path.
>
>   Sending membership reports to other hosts can result, for IGMPv1
>   and IGMPv2, in unintentionally preventing a host from joining a
>   specific multicast group.
>
>   When an IGMPv1 or IGMPv2 host receives a membership report for a
>   group address that it intends to join, the host will suppress its
>   own membership report for the same group.  This join or message
>   suppression is a requirement for IGMPv1 and IGMPv2 hosts.
>   However, if a switch does not receive a membership report from
> the
>   host it will not forward multicast data to it.
>
> In OVN this translates to forwarding reports only on:
> a. ports where mrouters have been learned (IGMP queries were received on
>those ports)
> b. ports connecting a multicast enabled logical switch to a multicast
>relay enabled logical router (OVN mrouter)
> c. ports configured with mcast_flood_reports=true
>
> 2.1.2.  Data Forwarding Rules
>
>1) Packets with a destination IP address outside 224.0.0.X which are
>   not IGMP should be forwarded according to group-based port
>   membership tables and must also be forwarded on router ports.
>
> In OVN this translates to forwarding traffic for a group G to:
> a. all ports where G was learned
> b. all ports where an (external or OVN) mrouter was learned.
> c. all ports configured with mcast_flood=true
>
> IGMP queries are already snooped by ovn-controller.  Just propagate the
> information about where mrouters are to the Southbound DB through means
> of a custom IGMP_Group (name="mrouters").
>
> Snooped queries are then re-injected in the OVS pipeline with outport
> set to MC_FLOOD_L2 (only flood them in the L2 domain).
>
> Snooped reports are re-injected in the OVS pipeline with outport set to
> MC_MROUTER_FLOOD (flood them to snooped mrouters and OVN mrouters).
>
> The same behavior applies to IPv6 too (MLD).
>
> [0] https://datatracker.ietf.org/doc/html/rfc4541
>
> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1933990
> Reported-at: https://github.com/ovn-org/ovn/issues/126
> Acked-by: Numan Siddique 
> Signed-off-by: Dumitru Ceara 
> ---
> V2:
> - Added Numan's ack.
> ---
>  controller/ip-mcast.c   |  129 -
>  controller/ip-mcast.h   |   14 
>  controller/pinctrl.c|  129 +
>  lib/ip-mcast-index.h|5 +
>  lib/mcast-group-index.h |4 -
>  northd/northd.c |   54 +++
>  northd/ovn-northd.8.xml |6 --
>  tests/ovn-macros.at |   25 +++
>  tests/ovn.at|  164
> ++-
>  9 files changed, 472 insertions(+), 58 deletions(-)
>
> diff --git a/controller/ip-mcast.c b/controller/ip-mcast.c
> index 9b0b4465a..a870fb29e 100644
> --- a/controller/ip-mcast.c
> +++ b/controller/ip-mcast.c
> @@ -16,6 +16,7 @@
>  #include 
>
>  #include "ip-mcast.h"
> +#include "ip-mcast-index.h"
>  #include "lport.h"
>  #include "lib/ovn-sb-idl.h"
>
> @@ -27,6 +28,18 @@ struct igmp_group_port {
>  const struct sbrec_port_binding *port;
>  };
>
> +static const struct sbrec_igmp_group *
> +igmp_group_lookup_(struct ovsdb_idl_index *igmp_groups,
> +   const char *addr_str,
> +   const struct sbrec_datapath_binding *datapath,
> +   const struct sbrec_chassis *chassis);
> +
> +static struct sbrec_igmp_group *
> +igmp_group_create_(struct ovsdb_idl_txn *idl_txn,
> +   const char *addr_str,
> +   const struct sbrec_datapath_binding *datapath,
> +   const struct sbrec_chassis *chassis);
> +
>  struct ovsdb_idl_index *
>  igmp_group_index_create(struct ovsdb_idl *idl)
>  {
> @@ -54,17 +67,16 @@ igmp_group_lookup(struct ovsdb_idl_index *igmp_groups,
>  return NULL;
>  }
>
> -struct sbrec_igmp_group *target =
> -sbrec_igmp_group_index_init_row(igmp_groups);
> -
> -sbrec_igmp_group_index_set_address(target, addr_str);
> -sbrec_igmp_group_index_set_datapath(target, datapath);

Re: [ovs-dev] [PATCH ovn v2 1/2] tests: Factor out reset_pcap_file() helper.

2022-07-20 Thread Ales Musil
Hi Dumitru,

nice to see it unified, thank you.

Acked-by: Ales Musil 


On Tue, Jul 19, 2022 at 10:43 PM Dumitru Ceara  wrote:

> Signed-off-by: Dumitru Ceara 
> ---
> V2: Rebase and fix up new calls in test "localnet connectivity with
> multiple requested-chassis"
> ---
>  tests/ovn-macros.at |   20 +++
>  tests/ovn.at|  364
> ---
>  2 files changed, 46 insertions(+), 338 deletions(-)
>
> diff --git a/tests/ovn-macros.at b/tests/ovn-macros.at
> index 427b7b669..0ad6b58c4 100644
> --- a/tests/ovn-macros.at
> +++ b/tests/ovn-macros.at
> @@ -511,9 +511,13 @@ wait_for_ports_up() {
>  fi
>  }
>
> -# reset_pcap_file iface pcap_file
> +# reset_iface_pcap_file iface pcap_file
>  # Resets the pcap file associates with OVS interface.  should be used
>  # with dummy datapath.
> +#
> +# XXX: This should actually replace reset_pcap_file() as they do almost
> +# exactly the same thing but the "wait while the pcap file has the size
> +# of the PCAP header" check causes tests to fail.
>  reset_iface_pcap_file() {
>  local iface=$1
>  local pcap_file=$2
> @@ -528,6 +532,20 @@ options:rxq_pcap=${pcap_file}-rx.pcap
>  OVS_WAIT_WHILE([test 24 = $(wc -c ${pcap_file}-tx.pcap | cut -d " "
> -f1)])
>  }
>
> +# reset_pcap_file iface pcap_file
> +# Resets the pcap file associates with OVS interface.  should be used
> +# with dummy datapath.
> +reset_pcap_file() {
> +local iface=$1
> +local pcap_file=$2
> +check rm -f dummy-*.pcap
> +check ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap
> \
> +options:rxq_pcap=dummy-rx.pcap
> +check rm -f ${pcap_file}*.pcap
> +check ovs-vsctl -- set Interface $iface
> options:tx_pcap=${pcap_file}-tx.pcap \
> +options:rxq_pcap=${pcap_file}-rx.pcap
> +}
> +
>  # Receive a packet on a dummy netdev interface. If we expect packets to be
>  # recorded, then wait until the pcap file reflects the change.
>  netdev_dummy_receive() {
> diff --git a/tests/ovn.at b/tests/ovn.at
> index 7d10610fd..a3e8a7758 100644
> --- a/tests/ovn.at
> +++ b/tests/ovn.at
> @@ -6487,16 +6487,6 @@ compare_dhcp_packets() {
>  fi
>  }
>
> -reset_pcap_file() {
> -local iface=$1
> -local pcap_file=$2
> -check ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap
> \
> -options:rxq_pcap=dummy-rx.pcap
> -rm -f ${pcap_file}*.pcap
> -check ovs-vsctl -- set Interface $iface
> options:tx_pcap=${pcap_file}-tx.pcap \
> -options:rxq_pcap=${pcap_file}-rx.pcap
> -}
> -
>  AT_CAPTURE_FILE([sbflows])
>  ovn-sbctl dump-flows > sbflows
>
> @@ -7082,16 +7072,6 @@ test_dhcpv6() {
>  as hv1 ovs-appctl netdev-dummy/receive hv1-vif$inport $request
>  }
>
> -reset_pcap_file() {
> -local iface=$1
> -local pcap_file=$2
> -ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap \
> -options:rxq_pcap=dummy-rx.pcap
> -rm -f ${pcap_file}*.pcap
> -ovs-vsctl -- set Interface $iface
> options:tx_pcap=${pcap_file}-tx.pcap \
> -options:rxq_pcap=${pcap_file}-rx.pcap
> -}
> -
>  AT_CAPTURE_FILE([ofctl_monitor0.log])
>  as hv1 ovs-ofctl monitor br-int resume --detach --no-chdir \
>  --pidfile=ovs-ofctl0.pid 2> ofctl_monitor0.log
> @@ -8789,16 +8769,6 @@ OVS_WAIT_UNTIL([
>  ])
>  ovn-nbctl lsp-set-options lrp0-rp router-port=lrp0
> nat-addresses="f0:00:00:00:00:03 192.168.0.3"
>
> -reset_pcap_file() {
> -local iface=$1
> -local pcap_file=$2
> -ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap \
> -options:rxq_pcap=dummy-rx.pcap
> -rm -f ${pcap_file}*.pcap
> -ovs-vsctl -- set Interface $iface
> options:tx_pcap=${pcap_file}-tx.pcap \
> -options:rxq_pcap=${pcap_file}-rx.pcap
> -}
> -
>  reset_pcap_file snoopvif hv1/snoopvif
>  OVS_WAIT_UNTIL([test `wc -c < "hv1/snoopvif-tx.pcap"` -ge 140])
>
> @@ -8873,20 +8843,8 @@ OVN_WAIT_PATCH_PORT_FLOWS(["ln_port"], ["hv2"])
>  # Otherwise a garp might be sent after pcap have been reset but before
> chassis is removed
>  AT_CHECK([ovn-nbctl --wait=hv remove logical_router lr0 options chassis])
>
> -reset_pcap_file() {
> -local hv=$1
> -local iface=$2
> -local pcap_file=$3
> -as $hv
> -ovs-vsctl -- set Interface $iface options:tx_pcap=dummy-tx.pcap \
> -options:rxq_pcap=dummy-rx.pcap
> -rm -f ${pcap_file}*.pcap
> -ovs-vsctl -- set Interface $iface
> options:tx_pcap=${pcap_file}-tx.pcap \
> -options:rxq_pcap=${pcap_file}-rx.pcap
> -}
> -
> -reset_pcap_file hv1 snoopvif hv1/snoopvif
> -reset_pcap_file hv2 snoopvif hv2/snoopvif
> +as hv1 reset_pcap_file snoopvif hv1/snoopvif
> +as hv2 reset_pcap_file snoopvif hv2/snoopvif
>
>  hv1_uuid=$(ovn-sbctl --bare --columns _uuid list chassis hv1)
>  AT_CHECK([ovn-nbctl set logical_router lr0 options:chassis=hv1])
> @@ -8902,8 +8860,8 @@ OVN_CHECK_PACKETS([hv2/snoopvif-tx.pcap],
> [empty_expected])
>  # Temporarily remove lr0 chassis
>  AT_CHECK([ovn-nbctl remove logical_router lr0 options chassis])
>
> -reset_pcap_file hv1 snoopvif hv1/snoopvif
> -reset_pcap_