Re: [ovs-discuss] ovsdb relay server active connection probe interval do not work

2023-03-16 Thread Jake Yip via discuss

Hi all,

Apologies for jumping into this thread. We are seeing the same issue, and 
it's nice to find someone with similar problems :)


On 8/3/2023 3:43 am, Ilya Maximets via discuss wrote:


We see failures on the OVSDB Relay side:

2023-03-06T22:19:32.966Z|00099|reconnect|ERR|ssl:xxx:16642: no response to inactivity probe after 5 seconds, disconnecting
2023-03-06T22:19:32.966Z|00100|reconnect|INFO|ssl:xxx:16642: connection dropped
2023-03-06T22:19:40.989Z|00101|reconnect|INFO|ssl:xxx:16642: connected
2023-03-06T22:19:50.997Z|00102|reconnect|ERR|ssl:xxx:16642: no response to inactivity probe after 5 seconds, disconnecting
2023-03-06T22:19:50.997Z|00103|reconnect|INFO|ssl:xxx:16642: connection dropped
2023-03-06T22:19:59.022Z|00104|reconnect|INFO|ssl:xxx:16642: connected
2023-03-06T22:20:09.026Z|00105|reconnect|ERR|ssl:xxx:16642: no response to inactivity probe after 5 seconds, disconnecting
2023-03-06T22:20:09.026Z|00106|reconnect|INFO|ssl:xxx:16642: connection dropped
2023-03-06T22:20:17.052Z|00107|reconnect|INFO|ssl:xxx:16642: connected
2023-03-06T22:20:27.056Z|00108|reconnect|ERR|ssl:xxx:16642: no response to inactivity probe after 5 seconds, disconnecting
2023-03-06T22:20:27.056Z|00109|reconnect|INFO|ssl:xxx:16642: connection dropped
2023-03-06T22:20:35.111Z|00110|reconnect|INFO|ssl:xxx:16642: connected

On the DB cluster this looks like:

2023-03-06T22:19:04.208Z|00451|stream_ssl|WARN|SSL_read: unexpected SSL connection close
2023-03-06T22:19:04.211Z|00452|reconnect|WARN|ssl:xxx:52590: connection dropped (Protocol error)


OK.  These are symptoms.  The cause must be something like
'Unreasonably long MANY ms poll interval' on the DB cluster side,
i.e. the reason why the main DB cluster didn't reply to the
probes sent from the relay.  As soon as the server receives
the probe, it replies right back.  If it didn't reply, it was
doing something else for an extended period of time.  "MANY" here
means more than 5 seconds.



We are seeing the same issue here after moving to OVN relay.

- On the relay: "no response to inactivity probe after 5 seconds"
- On the OVSDB cluster:
  - "Unreasonably long 1726ms poll interval"
  - "connection dropped (Input/output error)"
  - "SSL_write: system error (Broken pipe)"
  - 100% CPU on the northd process

Is there anything we could look for on the OVSDB side to narrow down 
what may be causing the load on the cluster side?
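
For context, this is the kind of data I can easily gather on our side, if 
useful. A sketch only; the log and socket paths below are assumptions for a 
typical OVN central node, so adjust them to your deployment:

    # Slow main-loop iterations on the cluster members; anything above
    # the relay's 5-second probe timeout shows up as these disconnects.
    grep 'Unreasonably long' /var/log/ovn/ovsdb-server-sb.log

    # Internal counters of the southbound ovsdb-server, to see what it
    # has been busy doing between poll iterations.
    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl coverage/show

    # Memory usage and database size; a large database makes snapshots
    # and monitor updates expensive.
    ovs-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show

    # What the pegged ovn-northd process is actually doing.
    perf top -p "$(pidof ovn-northd)"

One thing we are considering as a partial mitigation is raising the 
server-to-client probe interval, so the cluster itself does not drop busy 
clients. As far as I understand, this does not change the relay's own 
5-second probe towards the source, which is the one firing in the logs above:

    # Value is in milliseconds; assumes a single row in the Connection table.
    ovn-sbctl set connection . inactivity_probe=60000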


A brief history - we are migrating an OpenStack cloud from MidoNet to 
OVN. This cloud has roughly:


- 400 neutron networks / ovn logical switches
- 300 neutron routers
- 14000 neutron ports / ovn logical switchports
- 28000 neutron security groups / ovn port groups
- 8 neutron secgroup rules / acls

We populated the OVN DB using the OpenStack/Neutron OVN DB sync script.
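
For completeness, the sync was an invocation along these lines (the config 
file paths here are assumptions for a typical deployment):

    neutron-ovn-db-sync-util \
        --config-file /etc/neutron/neutron.conf \
        --config-file /etc/neutron/plugins/ml2/ml2_conf.ini \
        --ovn-neutron_sync_mode repair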

We attempted the migration twice previously (2021 and 2022) but failed 
due to load issues. We've reported those issues and have seen lots of 
performance improvements over the last two years. A BIG thank you to the 
dev teams!


We are now on the following versions:

- OVS 2.17
- OVN 22.03

We are exploring an upgrade as an option, but I am concerned that there 
is something fundamentally wrong with the data/config we have that is 
causing the high load, and I would like to rule that out first. Please 
let me know if you need more information; I will be happy to start a new 
thread too.


Regards,
Jake


Re: [ovs-discuss] OVN interconnection and NAT

2023-03-16 Thread Tiago Pires via discuss
Hi,

With the backports for multiple DGPs applied and --gateway-port in use, the
traffic between AZs is routed without NAT, as expected, and the traffic to
the Internet works fine and is NATted.
But when I do not use --gateway-port, the traffic to the Internet still
works fine, while the traffic to a remote AZ that should simply be routed
does not work. So the following statement may not be correct, since with
multiple DGPs the traffic that is supposed to be NATted works even without
--gateway-port:

"--gateway-port option allows specifying the distributed
  gateway port of router where the NAT rule needs to be
  applied. GATEWAY_PORT should reference a
  Logical_Router_Port row that is a distributed gateway port
  of router. When router has multiple distributed gateway
  ports and the gateway port for this NAT can’t be inferred
  from the external_ip, it is an error to not specify the
  GATEWAY_PORT."
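
For reference, the NAT rules in my test were created along these lines 
(the router name, port name, and addresses here are purely illustrative):

    # SNAT rule pinned to a specific distributed gateway port:
    ovn-nbctl --gateway-port=lr0-dgp1 lr-nat-add lr0 snat 172.24.4.10 10.0.1.0/24

    # The same rule without --gateway-port; with multiple DGPs the
    # gateway port should be inferred from the external IP, or the
    # command should fail if it cannot be inferred.
    ovn-nbctl lr-nat-add lr0 snat 172.24.4.10 10.0.1.0/24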

When checking the traffic between AZs, in the local AZ where --gateway-port
is not in use, the traffic from a remote AZ is dropped:

recirc_id(0),tunnel(tun_id=0xff0002,src=192.168.40.50,dst=192.168.40.221,geneve({class=0x102,type=0x80,len=4,0x30002/0x7fff}),flags(-df+csum+key)),in_port(2),ct_label(0/0x2),eth(src=aa:aa:aa:aa:aa:10,dst=aa:aa:aa:aa:aa:20),eth_type(0x0800),ipv4(src=8.0.0.0/248.0.0.0,dst=20.0.1.2,ttl=63,frag=no), packets:2, bytes:196, used:0.345s, actions:set(eth(src=00:de:ad:fe:00:01,dst=00:de:ad:01:00:01)),set(ipv4(ttl=62)),3
recirc_id(0),in_port(3),ct_label(0/0x2),eth(src=00:de:ad:01:00:01,dst=00:de:ad:fe:00:01),eth_type(0x0800),ipv4(src=20.0.1.2/255.255.255.254,dst=10.0.1.0/255.255.255.0,ttl=64,frag=no), packets:2, bytes:196, used:0.345s, actions:drop

But in a local AZ where --gateway-port is in use, the traffic from a
remote AZ works fine:

recirc_id(0),in_port(3),ct_label(0/0x2),eth(src=00:de:ad:01:00:01,dst=00:de:ad:fe:00:01),eth_type(0x0800),ipv4(src=30.0.1.2/255.255.255.254,dst=10.0.1.0/255.255.255.0,tos=0/0x3,ttl=64,frag=no), packets:2, bytes:196, used:0.596s, actions:set(tunnel(tun_id=0xff0002,dst=192.168.40.50,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x10003}),flags(df|csum|key))),set(eth(src=aa:aa:aa:aa:aa:30,dst=aa:aa:aa:aa:aa:10)),set(ipv4(ttl=63)),2
recirc_id(0),tunnel(tun_id=0xff0002,src=192.168.40.50,dst=192.168.40.247,geneve({class=0x102,type=0x80,len=4,0x30001/0x7fff}),flags(-df+csum+key)),in_port(2),ct_label(0/0x2),eth(src=aa:aa:aa:aa:aa:10,dst=aa:aa:aa:aa:aa:30),eth_type(0x0800),ipv4(src=8.0.0.0/248.0.0.0,dst=30.0.1.2,ttl=63,frag=no), packets:2, bytes:196, used:0.596s, actions:set(eth(src=00:de:ad:fe:00:01,dst=00:de:ad:01:00:01)),set(ipv4(ttl=62)),3

It seems I have to use --gateway-port when using OVN Interconnect, even
though the two features are not related. So it seems there is a bug.

Tiago Pires


On Wed, Mar 15, 2023 at 7:48 PM Han Zhou  wrote:

>
>
> On Wed, Mar 15, 2023 at 1:00 PM Tiago Pires wrote:
> >
> > Hi Vladislav,
> >
> > It seems the gateway_port option was added in 22.09, according to this
> > commit:
> > https://github.com/ovn-org/ovn/commit/4f93381d7d38aa21f56fb3ff4ec00490fca12614
> > It is what I need in order to make my use case work; let me try it.
>
> Thanks for reporting this issue. It would be good to try the gateway_port
> option, but there also seems to be a bug somewhere if it behaves as you
> described in the example, because:
>
> "When a logical router has multiple distributed gateway ports and this
> column is not set for a NAT rule, then the rule will be applied at the
> distributed gateway port which is in the same network as the external_ip
> of the NAT rule."
>
> We need to check more on this.
>
> Regards,
> Han
>
> >
> > Thank you
> >
> > Tiago Pires
> >
> >
> >
> > On Wed, Mar 15, 2023 at 2:10 PM Vladislav Odintsov wrote:
> >>
> >> I’m sorry, of course I meant gateway_port instead of logical_port:
> >>
> >>    gateway_port: optional weak reference to Logical_Router_Port
> >>        A distributed gateway port in the Logical_Router_Port table
> >>        where the NAT rule needs to be applied.
> >>
> >>        When multiple distributed gateway ports are configured on a
> >>        Logical_Router, applying a NAT rule at each of the distributed
> >>        gateway ports might not be desired. Consider the case where a
> >>        logical router has 2 distributed gateway ports, one with
> >>        networks 50.0.0.10/24 and the other with networks 60.0.0.10/24.
> >>        If the logical router has a NAT rule of type snat, logical_ip
> >>        10.1.1.0/24 and external_ip 50.1.1.20/24, the rule needs to be
> >>        selectively applied on matching packets entering/leaving
> >>        through the distributed gateway port with networks 50.0.0.10/24.
> >>
> >>        When a logical router has multiple distributed gateway ports
> >>        and this column is not set for a NAT rule, then the rule will
> >>        be applied at the distributed gateway port which is in the same
> >>        network as the external_ip of the NAT rule.