[ovs-dev] [PATCH ovn] pinctrl: fix monitor state when using udp(icmp) to check healthy state.

2022-07-21 Thread wangchuanlei
When the backend VM does not send a reply packet, the monitor state
should change from waiting to offline.  The current code sets it to
online instead; this patch fixes that bug.

Signed-off-by: wangchuanlei 
---
 controller/pinctrl.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/controller/pinctrl.c b/controller/pinctrl.c
index 38e8590af..1a21c7704 100644
--- a/controller/pinctrl.c
+++ b/controller/pinctrl.c
@@ -7526,13 +7526,9 @@ svc_monitors_run(struct rconn *swconn,
 
 case SVC_MON_S_WAITING:
 if (current_time > svc_mon->wait_time) {
-if (svc_mon->protocol ==  SVC_MON_PROTO_TCP) {
-svc_mon->n_failures++;
-svc_mon->state = SVC_MON_S_OFFLINE;
-} else {
-svc_mon->n_success++;
-svc_mon->state = SVC_MON_S_ONLINE;
-}
+svc_mon->n_failures++;
+svc_mon->state = SVC_MON_S_OFFLINE;
+
 svc_mon->next_send_time = current_time + svc_mon->interval;
 next_run_time = svc_mon->next_send_time;
 } else {
-- 
2.27.0
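
For reference, a minimal sketch in C of the behavior after this change
(simplified, hypothetical types and time handling; only the enum and field
names are taken from the patch): once the wait deadline passes without a
reply, the monitor records a failure and goes offline for every probe
protocol, not only TCP.

/* Hypothetical, simplified illustration of the fixed SVC_MON_S_WAITING
 * handling; enum and field names follow the patch, everything else is
 * assumed for the sketch. */
enum svc_mon_state { SVC_MON_S_WAITING, SVC_MON_S_ONLINE, SVC_MON_S_OFFLINE };

struct svc_mon_sketch {
    enum svc_mon_state state;
    long long wait_time;       /* Deadline for receiving a reply. */
    long long next_send_time;  /* When to send the next probe. */
    long long interval;        /* Probe interval. */
    int n_failures;
    int n_success;
};

static void
svc_mon_handle_waiting(struct svc_mon_sketch *mon, long long current_time)
{
    if (current_time > mon->wait_time) {
        /* No reply arrived in time: count a failure and go offline,
         * regardless of the probe protocol (TCP, UDP or ICMP). */
        mon->n_failures++;
        mon->state = SVC_MON_S_OFFLINE;
        mon->next_send_time = current_time + mon->interval;
    }
    /* A reply handler elsewhere bumps n_success and sets
     * SVC_MON_S_ONLINE only when an actual reply is received. */
}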



Re: [ovs-dev] [PATCH ovn v2] Fix compilation issue in fedora 37/rawhide.

2022-07-21 Thread Numan Siddique
On Thu, Jul 21, 2022 at 3:44 PM Dumitru Ceara  wrote:
>
> On 7/21/22 03:14, num...@ovn.org wrote:
> > From: Numan Siddique 
> >
> > Compilation is failing with the error:
> >
> > File "Documentation/conf.py", line 61, in 
> > with open(filename, 'rU') as f:
> >  
> > ValueError: invalid mode: 'rU'
> >
> > The python 3 documentation [1] says:
> >
> > There is an additional mode character permitted, 'U', which no
> > longer has any effect, and is considered deprecated. It
> > previously enabled universal newlines in text mode, which
> > became the default behaviour in Python 3.0. Refer to the
> > documentation of the newline parameter for further details.
> >
> > [1] - https://docs.python.org/3.9/library/functions.html#open
> >
> > This patch fixes this issue.
> >
> > Signed-off-by: Numan Siddique 
> > ---
>
> Looks good to me, thanks!
>
> Acked-by: Dumitru Ceara 

Thanks.  I applied this patch to main and backported to branch-22.06
and branch-22.03.

Numan

>


Re: [ovs-dev] [PATCH ovn v2] Fix compilation issue in fedora 37/rawhide.

2022-07-21 Thread Dumitru Ceara
On 7/21/22 03:14, num...@ovn.org wrote:
> From: Numan Siddique 
> 
> Compilation is failing with the error:
> 
> File "Documentation/conf.py", line 61, in 
> with open(filename, 'rU') as f:
>  
> ValueError: invalid mode: 'rU'
> 
> The python 3 documentation [1] says:
> 
> There is an additional mode character permitted, 'U', which no
> longer has any effect, and is considered deprecated. It
> previously enabled universal newlines in text mode, which
> became the default behaviour in Python 3.0. Refer to the
> documentation of the newline parameter for further details.
> 
> [1] - https://docs.python.org/3.9/library/functions.html#open
> 
> This patch fixes this issue.
> 
> Signed-off-by: Numan Siddique 
> ---

Looks good to me, thanks!

Acked-by: Dumitru Ceara 



Re: [ovs-dev] [PATCH ovn v3] controller: avoid recomputes triggered by SBDB Port_Binding updates.

2022-07-21 Thread Han Zhou
On Thu, Jul 21, 2022 at 3:48 AM Dumitru Ceara  wrote:
>
> On 7/21/22 08:29, Han Zhou wrote:
> > On Wed, Jul 20, 2022 at 10:29 AM Numan Siddique  wrote:
> >>
> >> On Wed, Jul 13, 2022 at 4:29 AM Dumitru Ceara 
wrote:
> >>>
> >>> On 7/13/22 11:08, Dumitru Ceara wrote:
>  On 7/13/22 09:40, Xavier Simonart wrote:
> > Hi Han, Dumitru
> >
> 
>  Hi Han, Xavier,
> 
>  Sorry, I had already replied to the previous message and only then
>  noticed this one.
> 
> > I think that we should, as much as possible, try to achieve both
> > goals:
> > - have an accurate ovn-installed
> > - do not increase latency in large scale deployments
> >
> 
>  +1
> 
> > The fact that ovn-installed is sent too early for mc flows is
> > already an
> > issue today, independent of this patch.
> > Fixing ovn-installed related to mc flows by delaying the state
> > change (for
> > all cases, included when no mc groups) might be seen as a
performance
> > regression.
> >
> 
>  I think it will, and I'm not sure we can convince the CMS that this
is
>  "just metrics".
> 
> > I agree that we should fix this ovn-installed issue, but it is not a
> > regression added by this patch. We should enter a BZ for it.
> > Per my understanding, the mc flows are updated when the
> > SB_multicast_group
> > is seen as updated by ovn-controller, due to its references to port
> > binding.
> > Other flows related to port binding are installed earlier, i.e. when
> > ovn-controller writes port_binding->chassis (i.e. before it receives
> > SB
> > confirmation). So, while sending the mc flows earlier than what we
> > do today
> > might be more complex, I think it makes some kind of sense (we would
> > send
> > all those flows within the same loop).
> 
>  I'm inclining towards leaving it as it is today if this is the only
> > flow
>  we're missing.  It's a guess without testing things out, but I think
>  it's for the MC_FLOOD_L2 multicast group which is used only for
>  forwarding ARP packets originated by OVN routers or destined to a
>  specific OVN router.  Losing some of those packets is not a big deal.
> 
>  But it might be good to confirm that this is the MC group we install
> > the
>  flow for.
> 
> >>>
> >>> Oh, well, the port is also part of the MC_FLOOD group.  This is
however
> >>> only used for BUM traffic.  So losing some packets here is also not
> >>> terrible, I think.
> >>>
> >>
> >> When a logical port is claimed,  we process the logical flows related
> >> to the logical port (i.e with inport ==  or outport == )
> >> and install
> >> the corresponding openflows. All the generic logical flows (i.e
> >> without inport or outport match) would have already been programmed
> >> (if the datapath already part of local_datapaths).
> >> These processed logical flows (lflow_handle_flows_for_lport() in
> >> lflow.c) will be most likely part of the same openflow bundle. And
> >> once the sequence number
> >> for this bundle is acknowledged we set "ovn-installed=true".  When CMS
> >> notices "ovn-installed=true" , I think it can fairly assume that the
> >> flows for the lport are
> >> programmed.
> >>
> >> I think the only flows pertaining to the logical port  which we would
> >> be missing are the multicast related flows and the logical flows which
> >> ovn-northd would generate after the logical port
> >> is claimed (presently it is the arp responder flows) and I don't think
> >> we can wait for these logical flows to be programmed by ovn-northd
> >> before setting "ovn-installed=true".
> >
> > The missing flows is just a side-effect. I am more concerned with the
> > clearness of the state-machine.
> > To my understanding it would be very clear to define the "CLAIMED"
state's
> > job as claiming the port in SB-DB. If SB commit fails, the retry should
> > happen at this state. If we see the update notification (i.e. we see the
> > PB.chassis matches the desired chassis), we move to the next state
> > "INSTALL_FLOWS". Now if we move the state forward without confirming the
> > PB.chassis is updated in SB, we would need to perform the task in all
the
> > following states. The only benefit we get from this is that
ovn-installed
> > can be set to true a little bit earlier (to save a SB round trip), at
the
> > cost of more complexity (even more so if ovn-monitor-all needs to be
> > considered in this logic) and less clarity of the state machine.
> >
> > Is it possible to address it with the simpler/clear approach first and
see
> > if it really causes obvious performance regression, then we can consider
> > the "short-cuts"? I am not sure if it is some kind of premature
> > optimization at this point.
> >
>
> It's not exactly measured with the same code but I think this is still
> relevant.  A while ago we had tried to see the impact of delaying
> ovn-installed as long as the SB is readonly (last 

Re: [ovs-dev] [PATCH ovn] pinctrl: fix monitor state when using udp(icmp) to check healthy state

2022-07-21 Thread Numan Siddique
On Thu, Jul 21, 2022 at 12:40 AM wangchuanlei 
wrote:

> Hi,
> In my environment, I have a load balancer which uses the udp (icmp)
> protocol to check the connection status between the VIP and its backends.
> Under normal circumstances, the VIP sends request packets to the backends,
> and the backends send replies to the VIP, so the health check result is
> the online status.
> Then I set the NIC of the backend VM to the down state with the command
> "ip link set eth0 down".  In this case the backend VM wouldn't send
> replies to the VIP, but the VIP still sends requests to the backends.  In
> svc_monitors_run(), after sending the request packet, the state changes to
> waiting; after 3 seconds, because no reply arrives, svc_mon->state changes
> to SVC_MON_S_ONLINE, so the state stays online even with no reply.  This
> is unreasonable; it should be offline.
> So this patch is to fix this bug!
>
> Signed-off-by: wangchuanlei 
>

Thanks for the patch.

Can you please format the commit message as per the guidelines here -
https://github.com/ovn-org/ovn/blob/main/Documentation/internals/contributing/submitting-patches.rst

Can you also please drop "Hi" from the commit message.

Thanks
Numan


> ---
>  controller/pinctrl.c | 10 +++---
>  1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/controller/pinctrl.c b/controller/pinctrl.c
> index 38e8590af..1a21c7704 100644
> --- a/controller/pinctrl.c
> +++ b/controller/pinctrl.c
> @@ -7526,13 +7526,9 @@ svc_monitors_run(struct rconn *swconn,
>
>  case SVC_MON_S_WAITING:
>  if (current_time > svc_mon->wait_time) {
> -if (svc_mon->protocol ==  SVC_MON_PROTO_TCP) {
> -svc_mon->n_failures++;
> -svc_mon->state = SVC_MON_S_OFFLINE;
> -} else {
> -svc_mon->n_success++;
> -svc_mon->state = SVC_MON_S_ONLINE;
> -}
> +svc_mon->n_failures++;
> +svc_mon->state = SVC_MON_S_OFFLINE;
> +
>  svc_mon->next_send_time = current_time + svc_mon->interval;
>  next_run_time = svc_mon->next_send_time;
>  } else {
> --
> 2.27.0
>


Re: [ovs-dev] [RFC ovn] Multi-chassis port + MTU behavior

2022-07-21 Thread Ihar Hrachyshka
On Wed, Jul 20, 2022 at 3:45 PM Numan Siddique  wrote:
>
> On Mon, Jul 18, 2022 at 7:44 PM Ihar Hrachyshka  wrote:
> >
> > Hi folks,
> >
> > looking for some advice on MTU / IP behavior for multi-chassis ports.
> >
> > 22.06 introduced the new multichassis port concept, where the same
> > port may be present on multiple chassis; this can be used as a
> > performance optimization for VM live migration, among other things.
> > Migration with sub-0.1ms network downtime is achieved by cloning all
> > traffic directed towards a multichassis port to all its locations,
> > making all chassis hosting it receive relevant traffic at any given
> > moment until migration is complete.
> >
> > For logical switches with a localnet port where traffic usually goes
> > through the localnet port and not tunnels, it means enforcement of
> > tunneling of egress and ingress traffic for a multichassis port. (The
> > rest of the traffic between other, non-multichassis, ports keeps going
> > through localnet port.) Tunneling enforcement is done because traffic
> > sent through localnet port won't generally be cloned to all port binding
> > locations by the upstream switch.
> >
> > A problem comes down when ingress or egress traffic of a multichassis
> > port, being redirected through a tunnel, gets lost because of geneve
> > header overhead or because the interface used for tunneling has a
> > different MTU from the physical bridge backing up the localnet port.
> >
> > This is happening when:
> >   - tunnel_iface_mtu < localnet_mtu + geneve_overhead
> >
> > This makes for an unfortunate situation where, for a multichassis port,
> > SOME traffic (e.g. regular ICMP requests) passes through without any
> > problems, while OTHER traffic (e.g. produced with 'ping -s') doesn't.
> > (A test scenario demonstrating this is included below.)
> >
> > Here are ideas I came up with on how this could be resolved or at least
> > mitigated:
> >
> > a) pass "oversized" traffic through localnet, and the rest through
> > tunnels. Apart from confusing pattern on the wire where packets that
> > belong to the same TCP session may go through two paths (arguably not a
> > big problem and should be expected by other network hardware), this gets
> > us back to the problem of upstream switch not delivering packets to all
> > binding chassis.
> >
> > b) fragmentation needed. We could send ICMP Fragmentation Needed
> > errors on attempts to send oversized packets to or from a multichassis
> > port. Then TCP sessions could adjust their properties to reflect the new
> > recommended MTU. We already have a similar mechanism for router ports.
> > There are several caveats and limitations with fragmentation needed (b):
> >
> > - AFAIU this works for some protocols but not others. It also depends on
> >   the client network stack to reflect the change in network path. TCP
> >   should probably work.
> >
> > - It may be confusing for users that some L2 paths between ports of the
> >   same switch have reduced MTU while others have the regular, depending
> >   on the type of a port (single- or multi-chassis).
> >
> > c) implement fragmentation inside L2 domain. I actually don't know if
> > that's even compliant with RFCs. Usually packet fragmentation is
> > implemented on L2 domain boundary, by a router. In this scenario, a peer
> > port on the same switch would receive fragments for a packet that was
> > sent as a single piece.
> >
> > I currently lean towards (b) though it's not a universal fix since it
> > requires collaboration of the underlying network stack. But (a) leaves
> > cloning broken, and (c) is even more invasive, taking action on the port's
> > packets (fragmenting them) without the port owner's knowledge.
> >
> > Perhaps this is all unnecessary and there's a way to make OVN
> > transparently split and reassemble packets as needed, though I doubt it
> > since it doesn't encapsulate tunneled traffic into another application
> > layer. But if there's a way to achieve transparent re-assemble, or there
> > are other alternatives beyond (a)-(c), let me know. Please let me know
> > what you think of (a)-(c) regardless.
> >
> > Thanks,
> > Ihar
>
> I lean towards (b) too.  I suppose we only need this special handling
> until the migration is complete and the multi-chassis option is
> cleared by the CMS.

Yes, in the case of short-term use of the feature (e.g. for live
migration). If we are talking about persistent multi-chassis ports
(e.g. to clone traffic), this may be more of a problem. I'd expect
that in this case, MTUs will be carefully aligned between tunneling
interfaces and physical networks so that any packet directed towards a
physical network would fit through the tunneling interface. (This
implication is already documented in ovn-nb.xml, so what is discussed
here is more of an optimization for environments that don't abide by
expected MTU configuration.)

>
> For (a) how would OVN decide if the packet is oversized ?  Using the
> check_pkt_larger OVS action ?
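
For illustration, a minimal sketch (hypothetical helper, not OVN code) of the
arithmetic behind the oversize condition described above,
tunnel_iface_mtu < localnet_mtu + geneve_overhead, expressed as a per-packet
check against the tunnel path:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper illustrating the condition discussed above: a
 * packet that fits the localnet MTU may still not fit the tunnel once
 * the Geneve encapsulation overhead is added. */
static bool
pkt_too_big_for_tunnel(uint32_t inner_pkt_len,
                       uint32_t tunnel_iface_mtu,
                       uint32_t geneve_overhead)
{
    /* Largest inner packet the tunnel can carry without fragmenting
     * the outer (encapsulated) packet. */
    uint32_t max_inner = tunnel_iface_mtu - geneve_overhead;

    return inner_pkt_len > max_inner;
}

/* Under option (b) above, a multichassis port would answer such a packet
 * with ICMP "fragmentation needed" (or ICMPv6 "packet too big"),
 * advertising max_inner as the suggested MTU.  This is only a sketch of
 * the idea, not the proposed implementation. */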

Re: [ovs-dev] [PATCH v2] odp-execute: Avoid unnecessary logging for action implementations.

2022-07-21 Thread Eelco Chaudron



On 20 Jul 2022, at 20:00, Ilya Maximets wrote:

> There is no need to log if the implementation didn't change.
> Scalar one is default, any change will be logged.  And availability
> is not really important to log at INFO level.  Moving these logs
> to DBG level to avoid littering the log file and confusing users.
> We do the same for miniflow_extract and datapath interface
> implementations.
>
> Additionally, the text of the log message was made more readable and
> uniform with the one used for miniflow_extract.
>
> Fixes: 95e4a35b0a1d ("odp-execute: Add function pointers to odp-execute for 
> different action implementations.")
> Signed-off-by: Ilya Maximets 

Changes look good to me.

Acked-by: Eelco Chaudron 



Re: [ovs-dev] [PATCH ovn v3] controller: avoid recomputes triggered by SBDB Port_Binding updates.

2022-07-21 Thread Dumitru Ceara
On 7/21/22 08:29, Han Zhou wrote:
> On Wed, Jul 20, 2022 at 10:29 AM Numan Siddique  wrote:
>>
>> On Wed, Jul 13, 2022 at 4:29 AM Dumitru Ceara  wrote:
>>>
>>> On 7/13/22 11:08, Dumitru Ceara wrote:
 On 7/13/22 09:40, Xavier Simonart wrote:
> Hi Han, Dumitru
>

 Hi Han, Xavier,

 Sorry, I had already replied to the previous message and only then
 noticed this one.

> I think that we should, as much as possible, try to achieve both
> goals:
> - have an accurate ovn-installed
> - do not increase latency in large scale deployments
>

 +1

> The fact that ovn-installed is sent too early for mc flows is
> already an
> issue today, independent of this patch.
> Fixing ovn-installed related to mc flows by delaying the state
> change (for
> all cases, included when no mc groups) might be seen as a performance
> regression.
>

 I think it will, and I'm not sure we can convince the CMS that this is
 "just metrics".

> I agree that we should fix this ovn-installed issue, but it is not a
> regression added by this patch. We should enter a BZ for it.
> Per my understanding, the mc flows are updated when the
> SB_multicast_group
> is seen as updated by ovn-controller, due to its references to port
> binding.
> Other flows related to port binding are installed earlier, i.e. when
> ovn-controller writes port_binding->chassis (i.e. before it receives
> SB
> confirmation). So, while sending the mc flows earlier than what we
> do today
> might be more complex, I think it makes some kind of sense (we would
> send
> all those flows within the same loop).

 I'm inclining towards leaving it as it is today if this is the only
> flow
 we're missing.  It's a guess without testing things out, but I think
 it's for the MC_FLOOD_L2 multicast group which is used only for
 forwarding ARP packets originated by OVN routers or destined to a
 specific OVN router.  Losing some of those packets is not a big deal.

 But it might be good to confirm that this is the MC group we install
> the
 flow for.

>>>
>>> Oh, well, the port is also part of the MC_FLOOD group.  This is however
>>> only used for BUM traffic.  So losing some packets here is also not
>>> terrible, I think.
>>>
>>
>> When a logical port is claimed,  we process the logical flows related
>> to the logical port (i.e with inport ==  or outport == )
>> and install
>> the corresponding openflows. All the generic logical flows (i.e
>> without inport or outport match) would have already been programmed
>> (if the datapath already part of local_datapaths).
>> These processed logical flows (lflow_handle_flows_for_lport() in
>> lflow.c) will be most likely part of the same openflow bundle. And
>> once the sequence number
>> for this bundle is acknowledged we set "ovn-installed=true".  When CMS
>> notices "ovn-installed=true" , I think it can fairly assume that the
>> flows for the lport are
>> programmed.
>>
>> I think the only flows pertaining to the logical port  which we would
>> be missing are the multicast related flows and the logical flows which
>> ovn-northd would generate after the logical port
>> is claimed (presently it is the arp responder flows) and I don't think
>> we can wait for these logical flows to be programmed by ovn-northd
>> before setting "ovn-installed=true".
> 
> The missing flows is just a side-effect. I am more concerned with the
> clearness of the state-machine.
> To my understanding it would be very clear to define the "CLAIMED" state's
> job as claiming the port in SB-DB. If SB commit fails, the retry should
> happen at this state. If we see the update notification (i.e. we see the
> PB.chassis matches the desired chassis), we move to the next state
> "INSTALL_FLOWS". Now if we move the state forward without confirming the
> PB.chassis is updated in SB, we would need to perform the task in all the
> following states. The only benefit we get from this is that ovn-installed
> can be set to true a little bit earlier (to save a SB round trip), at the
> cost of more complexity (even more so if ovn-monitor-all needs to be
> considered in this logic) and less clarity of the state machine.
> 
> Is it possible to address it with the simpler/clear approach first and see
> if it really causes obvious performance regression, then we can consider
> the "short-cuts"? I am not sure if it is some kind of premature
> optimization at this point.
> 

It's not exactly measured with the same code but I think this is still
relevant.  A while ago we had tried to see the impact of delaying
ovn-installed as long as the SB is readonly (last transaction still in
progress).  The impact was measurable, up to 500msec on the 120 node
cluster we tested on:

https://mail.openvswitch.org/pipermail/ovs-dev/2022-April/393319.html

This was with a real OpenShift 120 node cluster running a scale test
used for measuring 
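
A minimal sketch of the state machine Han describes above (only the CLAIMED
and INSTALL_FLOWS names come from the thread; the other names, fields, and
helpers are assumptions for illustration): the port stays in CLAIMED,
retrying the SB update if needed, until the Port_Binding chassis column is
seen to match, and only then moves on to flow installation.

#include <stdbool.h>

/* Hypothetical sketch of the per-lport state machine discussed above. */
enum lport_state {
    LPORT_CLAIMED,        /* Claim the port: write Port_Binding.chassis. */
    LPORT_INSTALL_FLOWS,  /* Program flows, wait for the OpenFlow ack. */
    LPORT_INSTALLED,      /* Flows acked: set ovn-installed=true. */
};

struct lport_sm {
    enum lport_state state;
    bool sb_chassis_matches;   /* Updated from SB change notifications. */
    bool flows_acked;          /* Updated when the flow bundle is acked. */
};

static void
lport_sm_run(struct lport_sm *sm)
{
    switch (sm->state) {
    case LPORT_CLAIMED:
        /* Retry the SB commit from this state if it failed; advance only
         * once the SB notification shows PB.chassis matching the desired
         * chassis. */
        if (sm->sb_chassis_matches) {
            sm->state = LPORT_INSTALL_FLOWS;
        }
        break;
    case LPORT_INSTALL_FLOWS:
        if (sm->flows_acked) {
            sm->state = LPORT_INSTALLED;
        }
        break;
    case LPORT_INSTALLED:
    default:
        break;
    }
}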

Re: [ovs-dev] [PATCH v2] odp-execute: Avoid unnecessary logging for action implementations.

2022-07-21 Thread Finn, Emma



> -Original Message-
> From: Ilya Maximets 
> Sent: Wednesday 20 July 2022 19:01
> To: ovs-dev@openvswitch.org
> Cc: Stokes, Ian ; Eelco Chaudron
> ; Finn, Emma ; Ilya
> Maximets 
> Subject: [PATCH v2] odp-execute: Avoid unnecessary logging for action
> implementations.
> 
> There is no need to log if the implementation didn't change.
> Scalar one is default, any change will be logged.  And availability is not 
> really
> important to log at INFO level.  Moving these logs to DBG level to avoid
> littering the log file and confusing users.
> We do the same for miniflow_extract and datapath interface
> implementations.
> 
> Additionally, the text of the log message was made more readable and uniform
> with the one used for miniflow_extract.
> 
> Fixes: 95e4a35b0a1d ("odp-execute: Add function pointers to odp-execute
> for different action implementations.")
> Signed-off-by: Ilya Maximets 
> ---
> 
> Version 2:
>   - Dropped the change in test macros, filtering still needed for
> the autovalidator build.
> 
>  lib/odp-execute-private.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 

Changes look good to me. 
Acked-by: Emma Finn 

> diff --git a/lib/odp-execute-private.c b/lib/odp-execute-private.c index
> bec49206e..f80ae5a23 100644
> --- a/lib/odp-execute-private.c
> +++ b/lib/odp-execute-private.c
> @@ -97,9 +97,10 @@ odp_execute_action_set(const char *name)
>  for (int i = 0; i < ACTION_IMPL_MAX; i++) {
>  /* String compare, and set ptrs atomically. */
>  if (!strcmp(action_impls[i].name, name)) {
> -active_action_impl_index = i;
> -
> -VLOG_INFO("Action implementation set to %s", name);
> +if (i != active_action_impl_index) {
> +active_action_impl_index = i;
> +VLOG_INFO("Action implementation set to %s", name);
> +}
>  return &action_impls[i];
>  }
>  }
> @@ -142,8 +143,8 @@ odp_execute_action_init(void)
> 
>  action_impls[i].available = avail;
> 
> -VLOG_INFO("Action implementation %s (available: %s)",
> -  action_impls[i].name, avail ? "Yes" : "No");
> +VLOG_DBG("Actions implementation '%s' %s available.",
> + action_impls[i].name, avail ? "is" : "is not");
> 
>  /* The following is a run-time check to make sure a scalar
>   * implementation exists for the given ISA implementation. This is to
> --
> 2.34.3



[ovs-dev] [PATCH] conntrack: narrow the scope of ct_lock in new conn setup

2022-07-21 Thread wenxu
From: wenxu 

Currently the ct_lock is held over a large scope during new conn
setup.  It should be held only for the conns map insert and the
expire rculist insert.

Signed-off-by: wenxu 
---
 lib/conntrack.c | 42 ++
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/lib/conntrack.c b/lib/conntrack.c
index 145409f..3288cc7 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -48,6 +48,8 @@ COVERAGE_DEFINE(conntrack_full);
 COVERAGE_DEFINE(conntrack_l3csum_err);
 COVERAGE_DEFINE(conntrack_l4csum_err);
 COVERAGE_DEFINE(conntrack_lookup_natted_miss);
+COVERAGE_DEFINE(conntrack_duplicate);
+COVERAGE_DEFINE(conntrack_nat_duplicate);
 
 struct conn_lookup_ctx {
 struct conn_key key;
@@ -1010,10 +1012,10 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
const struct nat_action_info_t *nat_action_info,
const char *helper, const struct alg_exp_node *alg_exp,
enum ct_alg_ctl_type ct_alg_ctl, uint32_t tp_id)
-OVS_REQUIRES(ct->ct_lock)
 {
 struct conn *nc = NULL;
 struct conn *nat_conn = NULL;
+uint32_t nat_hash;
 
 if (!valid_new(pkt, &ctx->key)) {
 pkt->md.ct_state = CS_INVALID;
@@ -1089,16 +1091,35 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
 nat_conn->nat_action = 0;
 nat_conn->alg = NULL;
 nat_conn->nat_conn = NULL;
-uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->hash_basis);
+nat_hash = conn_key_hash(&nat_conn->key, ct->hash_basis);
+ovs_mutex_lock(&ct->ct_lock);
+if (conn_key_lookup(ct, &nat_conn->key, nat_hash, now,
+NULL, NULL)) {
+ovs_mutex_unlock(&ct->ct_lock);
+COVERAGE_INC(conntrack_nat_duplicate);
+goto out;
+}
 cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+ovs_mutex_unlock(&ct->ct_lock);
 }
 
 nc->nat_conn = nat_conn;
-ovs_mutex_init_adaptive(&nc->lock);
 nc->conn_type = CT_CONN_TYPE_DEFAULT;
 atomic_flag_clear(&nc->reclaimed);
+ovs_mutex_lock(&ct->ct_lock);
+if (conn_key_lookup(ct, &ctx->key, ctx->hash, now, NULL, NULL)) {
+if (nat_conn) {
+cmap_remove(&ct->conns, &nat_conn->cm_node, nat_hash);
+}
+
+ovs_mutex_unlock(&ct->ct_lock);
+COVERAGE_INC(conntrack_duplicate);
+goto out;
+}
 cmap_insert(&ct->conns, &nc->cm_node, ctx->hash);
 conn_expire_push_front(ct, nc);
+ovs_mutex_unlock(&ct->ct_lock);
+ovs_mutex_init_adaptive(&nc->lock);
 atomic_count_inc(&ct->n_conn);
 ctx->conn = nc; /* For completeness. */
 if (zl) {
@@ -1117,12 +1138,13 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
  * ephemeral ports.  A DOS attack should be protected against with
  * firewall rules or a separate firewall.  Also using zone partitioning
  * can limit DoS impact. */
-nat_res_exhaustion:
-free(nat_conn);
-delete_conn__(nc);
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
+nat_res_exhaustion:
 VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
  "if DoS attack, use firewalling and/or zone partitioning.");
+out:
+free(nat_conn);
+delete_conn__(nc);
 return NULL;
 }
 
@@ -1437,12 +1459,8 @@ process_one(struct conntrack *ct, struct dp_packet *pkt,
 }
 ovs_rwlock_unlock(&ct->resources_lock);
 
-ovs_mutex_lock(&ct->ct_lock);
-if (!conn_lookup(ct, &ctx->key, now, NULL, NULL)) {
-conn = conn_not_found(ct, pkt, ctx, commit, now, nat_action_info,
-  helper, alg_exp, ct_alg_ctl, tp_id);
-}
-ovs_mutex_unlock(&ct->ct_lock);
+conn = conn_not_found(ct, pkt, ctx, commit, now, nat_action_info,
+  helper, alg_exp, ct_alg_ctl, tp_id);
 }
 
 write_ct_md(pkt, zone, conn, &ctx->key, alg_exp);
-- 
1.8.3.1
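
A generic sketch of the lock-narrowing pattern the patch applies
(illustrative names only, not the OVS conntrack API): build the new entry
without holding the table lock, then take the lock only to re-check for a
concurrently inserted duplicate and to perform the insert.

#include <stdbool.h>
#include <pthread.h>

/* Generic illustration: the entry is allocated and initialized without
 * holding the table lock; the lock protects only the duplicate re-check
 * and the insertion itself. */
struct table {
    pthread_mutex_t lock;
    /* ... hash map of entries, protected by 'lock' for writers ... */
};

struct entry;

/* Assumed helpers; lookup/insert must be called with the lock held. */
bool table_lookup__(struct table *t, const struct entry *e);
void table_insert__(struct table *t, struct entry *e);
struct entry *entry_create(void);    /* Expensive setup, done unlocked. */
void entry_destroy(struct entry *e);

static struct entry *
table_add(struct table *t)
{
    struct entry *e = entry_create();       /* No lock held here. */

    pthread_mutex_lock(&t->lock);
    if (table_lookup__(t, e)) {
        /* Another thread raced us and inserted the same key first,
         * mirroring the new conntrack_duplicate counters in the patch. */
        pthread_mutex_unlock(&t->lock);
        entry_destroy(e);
        return NULL;
    }
    table_insert__(t, e);
    pthread_mutex_unlock(&t->lock);
    return e;
}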



Re: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on selected ports

2022-07-21 Thread Jan Scheurich via dev
Hi,

True, the cost of polling a packet from a physical port on a remote NUMA node 
is slightly higher than from the local NUMA node. Hence the cross-NUMA polling 
of rx queues has some overhead. However, the packet processing cost is much 
more influenced by the location of the target vhostuser ports. If the majority 
of the rx queue traffic is going to a VM on the other NUMA node, it is actually 
*better* to poll the packets in a PMD on the VM's NUMA node.

Long story short, OVS doesn't have sufficient data to be able to correctly 
predict the actual rxq load when assigned to another PMD in a different queue 
configuration. The rxq processing cycles measured on the current PMD is the 
best estimate we have for balancing the overall load on the PMDs. We need to 
live with the inevitable inaccuracies.

My main point is: these inaccuracies don't matter. The purpose of balancing the 
load over PMDs is *not* to minimize the total cycles spent by PMDs on 
processing packets. The PMDs run in a busy loop anyhow and burn all the cycles 
of the CPU. The purpose is to prevent some PMD from unnecessarily getting 
congested (i.e. load > 95%) while others have a lot of spare capacity and could 
take over some rxqs.

Cross-NUMA polling of physical port rxqs has proven to be an extremely valuable 
tool to help OVS's cycle-based rxq-balancing algorithm to do its job, and I 
strongly suggest we allow the proposed per-port opt-in option.

BR, Jan

From: Anurag Agarwal 
Sent: Thursday, 21 July 2022 07:15
To: lic...@chinatelecom.cn
Cc: Jan Scheurich 
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hello Cheng,
With cross-numa enabled, we flatten the PMD list across NUMAs and select 
the least loaded PMD. Thus I would not like to consider the case below.

Regards,
Anurag
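
A rough sketch of what "flatten the PMD list across NUMAs and select the
least loaded PMD" could look like (illustrative only, not the dpif-netdev
code; the struct and field names are assumptions):

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: pick the least-loaded PMD over all NUMA nodes,
 * ignoring NUMA affinity, as described above. */
struct pmd_load {
    int numa_id;
    uint64_t assigned_cycles;   /* Sum of measured cycles of assigned rxqs. */
};

static struct pmd_load *
select_least_loaded_pmd(struct pmd_load *pmds, size_t n_pmds)
{
    struct pmd_load *best = NULL;

    for (size_t i = 0; i < n_pmds; i++) {
        if (!best || pmds[i].assigned_cycles < best->assigned_cycles) {
            best = &pmds[i];
        }
    }
    return best;   /* May live on a remote NUMA node for opted-in ports. */
}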

From: lic...@chinatelecom.cn 
mailto:lic...@chinatelecom.cn>>
Sent: Thursday, July 21, 2022 8:19 AM
To: Anurag Agarwal 
mailto:anurag.agar...@ericsson.com>>
Cc: Jan Scheurich 
mailto:jan.scheur...@ericsson.com>>
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hi Anurag,

"If local numa has bandwidth for rxq, we are not supposed to assign a rxq to 
remote pmd."
Would you like to consider this case? If not, I think we don't have to resolve 
the cycles measurement issue for the cross-numa case.


李成

From: Anurag Agarwal
Date: 2022-07-21 10:21
To: lic...@chinatelecom.cn
CC: Jan Scheurich
Subject: RE: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports
+ Jan

Hello Cheng,
  Thanks for your insightful comments. Please find my inputs inline.

Regards,
Anurag

From: lic...@chinatelecom.cn 
mailto:lic...@chinatelecom.cn>>
Sent: Monday, July 11, 2022 7:51 AM
To: Anurag Agarwal 
mailto:anurag.agar...@ericsson.com>>
Subject: Re: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
selected ports

Hi Anurag,

Sorry for the late reply, I was busy on a task for the last two weeks.

I think your proposal can cover the case I reported. It looks good to me.
>> Thanks for your review and positive feedback

However, to enable cross-numa rxq polling, we may have another problem to 
address.

From my test, cross-numa polling has worse performance than numa-affinity 
polling (at least 10%).
So if the local numa has bandwidth for an rxq, we are not supposed to assign 
the rxq to a remote pmd.
Unfortunately, we can't tell whether a pmd is out of bandwidth from its 
assigned rxq cycles.
Because the rx batch size impacts the measured rxq cycles a lot in my test:

rx batch    cycles per pkt
   1.00     5738
   5.00     2353
  12.15     1770
  32.00     1533

The faster packets arrive, the larger the rx batch size; and the more rxqs a 
pmd is assigned, the larger the rx batch size.
Imagine that pmd pA has only one rxq assigned and packets arrive at 1.00 pkt 
per 5738 cycles: the rxq rx batch size is 1.00.
Now pA has 2 rxqs assigned, each with packets arriving at 1.00 pkt per 5738 
cycles.
The pmd spends 5738 cycles processing the first rxq, and then the second.
After the second rxq is processed, the pmd comes back to the first rxq, which 
now has 2 pkts ready (because 2*5738 cycles have passed).
The rxq batch size becomes 2.
>> Ok. Do you think it is a more generic problem with cycles measurement and 
>> PMD utilization, not specific to the cross-numa feature?

So it's hard to tell from the rxq cycles whether a pmd is overloaded.
Lastly, I think the cross-numa feature is very nice. I will put effort into 
this as well to cover cases in our company.
Let's keep in sync on progress :)
>> Thanks
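
To make the batch-size effect above concrete, here is a small illustrative
sketch (not OVS code; it assumes, purely for this example, that servicing one
rxq visit costs about the same 5738 cycles as one packet arrival period)
reproducing the worked example of how the rx batch size grows with the number
of rxqs a pmd polls:

#include <stdio.h>

/* Illustrative arithmetic only, reproducing the worked example above:
 * each rxq receives one packet every ARRIVAL_CYCLES cycles, and one
 * visit of an rxq is assumed to cost roughly PROCESS_CYCLES cycles, so
 * with more rxqs on the same pmd, more packets accumulate on each rxq
 * between two visits. */
#define ARRIVAL_CYCLES 5738.0   /* 1.00 pkt per 5738 cycles, from the table. */
#define PROCESS_CYCLES 5738.0   /* Assumed cost of servicing one rxq visit. */

int
main(void)
{
    for (int n_rxqs = 1; n_rxqs <= 4; n_rxqs++) {
        /* Cycles that pass between two consecutive visits of one rxq. */
        double revisit_cycles = n_rxqs * PROCESS_CYCLES;
        /* Packets waiting on that rxq at the next visit, i.e. the rx
         * batch size. */
        double batch = revisit_cycles / ARRIVAL_CYCLES;
        printf("rxqs=%d -> approx. rx batch size %.2f\n", n_rxqs, batch);
    }
    return 0;
}

Since larger batches amortize the per-visit overhead, the measured cycles per
packet shrink as the batch grows (as the table shows), which is why rxq cycles
alone are a weak signal of how congested a pmd really is.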



李成

From: Anurag Agarwal
Date: 2022-06-29 14:12
To: lic...@chinatelecom.cn
Subject: RE: [ovs-dev] [PATCH v5] dpif-netdev: Allow cross-NUMA polling on 
sel