Re: [ovs-discuss] OVN scale

Tony Liu Mon, 27 Jul 2020 10:17:06 -0700

Hi Han,

Just some updates here.

I tried 4K networks on a single router. Configuration completed without any
issues. I checked both nb-db and sb-db, and they both look good. It's just
that the router configuration is huge (in the Neutron DB, nb-db, and the flow
table in sb-db), because it contains all 4K ports. Also, the pipeline of the
router datapath in sb-db is quite big.

I see that the ovn-northd master and the sb-db leader are busy, taking 90+%
CPU. There are only 3 compute nodes and 2 gateway nodes. Does the monitor
setting "ovn-monitor-all" matter in such a case? Any idea what they are busy
with, given that there are no configuration updates from OpenStack? The nb-db
is not busy though.

Probably because sb-db is busy, ovn-controller can't stay connected to it. It
keeps disconnecting and reconnecting. Restarting ovn-controller seems to help.
I am able to launch a few VMs on different networks and they are connected via
the router.

Now, I have a problem with external access. The router is set as the gateway
to a provider/underlay network on an interface on the gateway node. The router
is allocated an underlay address from that provider network. My understanding
is that the br-ex on the gateway node holding the active router will broadcast
ARP to announce that router underlay address in case of failover. It will also
respond to ARP requests for that router underlay address. But when I run
tcpdump on that underlay interface on the gateway node, I see ARP requests
coming in but no ARP responses going out. I checked the flow table in sb-db
and it seems OK. I also checked the flows on br-ex with "ovs-ofctl dump-flows
br-ex", and I don't see anything about ARP there. How should I look into it?

Again, the goal is to support 4K networks with external access (security
groups are disabled). The options are 4K routers (one for each network), 50
routers (one for every 80 networks), or 1 router (for all 4K networks)...
All networks are isolated by ACLs on the logical router. Which option should
work better? Any comment is appreciated.
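
To make the three options easier to compare side by side, here is a minimal
Python sketch of the arithmetic. It is only an illustration: "4K" is taken as
4000, every router is assumed to have exactly one external gateway port on the
provider network, and the function name is made up for this example. It says
nothing about which option OVN handles better.

```python
NETWORKS = 4000  # "4K" in the text, taken here as exactly 4000 (assumption)

def topology(routers, networks=NETWORKS):
    """Split `networks` evenly over `routers` and count the ports involved."""
    if networks % routers != 0:
        raise ValueError("networks must divide evenly across routers")
    return {
        "routers": routers,
        # internal router ports per logical router, one per attached network
        "networks_per_router": networks // routers,
        # assumed: one external gateway port per router on the provider network
        "gateway_ports_on_provider_net": routers,
    }

options = [topology(r) for r in (4000, 50, 1)]
for opt in options:
    print(opt)
```

The trade-off it shows is the one described above: fewer routers means fewer
ports on the provider network but a much larger configuration per router.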

Thanks!

Tony

________________________________
From: discuss <ovs-discuss-boun...@openvswitch.org> on behalf of Tony Liu 
<tonyliu0...@hotmail.com>
Sent: July 21, 2020 09:09 PM
To: Daniel Alvarez <dalva...@redhat.com>
Cc: ovs-discuss@openvswitch.org <ovs-discuss@openvswitch.org>
Subject: Re: [ovs-discuss] OVN scale

[root@ovn-db-2 ~]# ovn-nbctl list nb_global
_uuid               : b7b3aa05-f7ed-4dbc-979f-10445ac325b8
connections         : []
external_ids        : {"neutron:liveness_check_at"="2020-07-22 04:03:17.726917+00:00"}
hv_cfg              : 312
ipsec               : false
name                : ""
nb_cfg              : 2636
options             : {mac_prefix="ca:e8:07", svc_monitor_mac="4e:d0:3a:80:d4:b7"}
sb_cfg              : 2005
ssl                 : []

[root@ovn-db-2 ~]# ovn-sbctl list sb_global
_uuid               : 3720bc1d-b0da-47ce-85ca-96fa8d398489
connections         : []
external_ids        : {}
ipsec               : false
nb_cfg              : 312
options             : {mac_prefix="ca:e8:07", svc_monitor_mac="4e:d0:3a:80:d4:b7"}
ssl                 : []

The NBDB and SBDB are definitely out of sync. Is there any way to force
ovn-northd to sync them?
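
To put numbers on the gap, here is a minimal Python sketch (not an OVN tool)
that reads the three counters from the nb_global output pasted above and
reports how far they trail nb_cfg. It relies on the usual reading of these
columns: sb_cfg is the nb_cfg value ovn-northd has finished applying to the
southbound DB, and hv_cfg is the smallest value reported by the chassis. The
helper names are made up for this example.

```python
import re

# The three counters from the `ovn-nbctl list nb_global` output above.
NB_GLOBAL = """\
hv_cfg              : 312
nb_cfg              : 2636
sb_cfg              : 2005
"""

def parse_counters(text):
    """Return {column: int} for every 'name : <integer>' line."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+)\s*:\s*(-?\d+)\s*$", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

def sync_lag(counters):
    """How far ovn-northd (sb_cfg) and the chassis (hv_cfg) trail nb_cfg."""
    nb = counters["nb_cfg"]
    return {
        "northd_behind": nb - counters["sb_cfg"],
        "hypervisors_behind": nb - counters["hv_cfg"],
    }

lag = sync_lag(parse_counters(NB_GLOBAL))
print(lag)  # {'northd_behind': 631, 'hypervisors_behind': 2324}
```

Running the same comparison a few minutes apart would show whether the gap is
shrinking (ovn-northd is catching up) or stuck.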

Thanks!

Tony

________________________________
From: Tony Liu <tonyliu0...@hotmail.com>
Sent: July 21, 2020 08:39 PM
To: Daniel Alvarez <dalva...@redhat.com>
Cc: Cory Hawkless <c...@hawkless.id.au>; ovs-discuss@openvswitch.org 
<ovs-discuss@openvswitch.org>; Dumitru Ceara <dce...@redhat.com>
Subject: Re: [ovs-discuss] OVN scale

When creating a network (and subnet) on OpenStack, a GW port and a service
port (for DHCP and metadata) are also created. They are created in Neutron and
ovn-nb-db by the ML2 driver. Then ovn-northd translates such updates from NBDB
to SBDB. My question here is: with 20.03, is this translation incremental?

After successfully creating 4000 networks on OpenStack, I see 4000 logical
switches and 8000 LS ports in NBDB. But in SBDB, there are only 1567 port
bindings. The break happened when translating the 1568th port. If ovn-northd
recompiles the whole DB for every update, this problem can be explained: the
DB is too big for ovn-northd to compile in time, so all the subsequent updates
are lost. Does that make sense?

I recall that DB updates are coordinated by some "version": when some changes
happen in NBDB, the version bumps up; ovn-northd updates SBDB and bumps up its
version as well, so they match. So, if the NBDB version bumps up more than once
while ovn-northd is updating SBDB, is that still going to work? If yes, then
it's just a matter of time: no matter how fast updates happen in NBDB,
ovn-northd will catch up with them eventually. Am I right about that?
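
The "catch up eventually" idea can be sketched with a toy model in Python. This
is not ovn-northd code. It assumes that every pass works from a snapshot of the
whole NB state taken when the pass starts (the full-recompile premise above),
that more NB updates land while the pass runs, and that the SB version is set
to the snapshot's version when the pass finishes. The function and parameter
names are made up for this example.

```python
def run(nb_updates, updates_per_pass):
    """Toy model of a snapshot-based sync loop.

    nb_updates: total number of NB changes (each bumps the NB version by 1).
    updates_per_pass: how many of them land while one sync pass is running.
    Returns the (nb_version, sb_version) pair after every pass.
    """
    nb_version = 0
    sb_version = 0
    pending = nb_updates
    history = []
    while sb_version < nb_version or pending > 0:
        snapshot = nb_version                 # pass starts from current NB state
        arrived = min(updates_per_pass, pending)
        nb_version += arrived                 # NB keeps changing during the pass
        pending -= arrived
        sb_version = snapshot                 # pass ends: SB matches the snapshot
        history.append((nb_version, sb_version))
    return history

history = run(nb_updates=5, updates_per_pass=2)
print(history)  # [(2, 0), (4, 2), (5, 4), (5, 5)]
```

In this model the SB version skips intermediate NB versions without losing
anything, because each pass reflects the full state at its snapshot. The two
versions match one pass after the NB updates stop, however many bumps happened
in between.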

Any comment is welcome.

Thanks!

Tony

________________________________
From: Tony Liu <tonyliu0...@hotmail.com>
Sent: July 21, 2020 10:22 AM
To: Daniel Alvarez <dalva...@redhat.com>
Cc: Cory Hawkless <c...@hawkless.id.au>; ovs-discuss@openvswitch.org 
<ovs-discuss@openvswitch.org>; Dumitru Ceara <dce...@redhat.com>
Subject: Re: [ovs-discuss] OVN scale

Hi Daniel, all

4000 networks and 50 routers, 200 networks on each router, they are all created.
CPU usage of Neutron server, ovn-nb-db, ovn-northd, ovn-sb-db, ovn-controller
and ovs-vswitchd is OK: not consistently at 100%, though it still spikes to
that level at times.

Now, when creating a VM, I get a "waiting for vif-plugged-in timeout" error.
This brings up another question: it used to be the neutron-agent that notified
Neutron server of port status changes; with OVN, who does it?
How should I look into this?

Please see my other comments inline...

Thanks!

Tony
________________________________
From: Daniel Alvarez <dalva...@redhat.com>
Sent: July 21, 2020 12:06 AM
To: Tony Liu <tonyliu0...@hotmail.com>
Cc: Cory Hawkless <c...@hawkless.id.au>; ovs-discuss@openvswitch.org 
<ovs-discuss@openvswitch.org>; Dumitru Ceara <dce...@redhat.com>
Subject: Re: [ovs-discuss] OVN scale

Hi Tony, all

On 21 Jul 2020, at 07:53, Tony Liu <tonyliu0...@hotmail.com> wrote:

Hi Cory,

With 4000 networks all connecting to one router with an external GW, all
networks and the router are created and connected. I launched a few VMs on
some networks; they are connected and all have external connectivity. When
running ping on a VM, there is one slow ping (a few seconds) for every 10+
normal pings (< 1ms). When checking CPU usage, I see Neutron server, OVN DB,
OVN controller and ovs-vswitchd all take almost 100% CPU. It's been like that
for hours already. Since everything is created and some of the networks work
fine (I didn't validate all of them), I'm not sure what those services are
busy with. I checked the log: ovn-controller keeps switching between ovn-sb-db
servers because of heartbeat timeouts.

How are you deploying OpenStack and in particular the OVN dbs? Is it RAFT 
cluster?

> Kolla Ansible. I see cluster-local-address and remote address (to the first
> node) are specified for all 3 nodes. I assume clustering is enabled.
> Is there a different type of cluster?

What’s your current value for ovn-remote-probe-interval? If it’s too low, this 
can be triggering reconnections all the time and creating a snowball effect.

> external_ids        : {ovn-encap-ip="10.6.30.22", ovn-encap-type=geneve, 
> ovn-remote="tcp:10.6.20.84:6642,tcp:10.6.20.85:6642,tcp:10.6.20.86:6642", 
> ovn-remote-probe-interval="60000", system-id="compute-3"}

You can bump the probe interval timeout like this:

ovs-vsctl set open . external_ids:ovn-remote-probe-interval=<TIME IN MS>
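
As a quick way to read the current value out of the external_ids line quoted
above, here is a minimal Python sketch. The map parser is a simplification
written for this example (it only handles the flat key=value form shown here),
and the 120000 ms value in the generated command is an arbitrary illustration,
not a recommended setting.

```python
import re

# external_ids as pasted above, joined into one line.
EXTERNAL_IDS = (
    '{ovn-encap-ip="10.6.30.22", ovn-encap-type=geneve, '
    'ovn-remote="tcp:10.6.20.84:6642,tcp:10.6.20.85:6642,tcp:10.6.20.86:6642", '
    'ovn-remote-probe-interval="60000", system-id="compute-3"}'
)

def parse_map(text):
    """Parse a flat {key=value, key="value"} map into a dict of strings."""
    pairs = re.findall(r'([\w-]+)=("([^"]*)"|[^,}\s]+)', text)
    return {key: (quoted if raw.startswith('"') else raw)
            for key, raw, quoted in pairs}

def set_probe_command(interval_ms):
    """Build the ovs-vsctl command shown above for a given interval in ms."""
    return ("ovs-vsctl set open . "
            f"external_ids:ovn-remote-probe-interval={interval_ms}")

ids = parse_map(EXTERNAL_IDS)
probe_seconds = int(ids["ovn-remote-probe-interval"]) / 1000
remotes = ids["ovn-remote"].split(",")

print(probe_seconds)               # 60.0
print(len(remotes))                # 3
print(set_probe_command(120000))   # example value only
```

So the chassis above already uses a 60 second probe interval towards three
southbound DB servers.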

I'd like to know if that's expected, or whether it's something I can tune to
fix the problem. If that's expected, I can't think of anything other than
building multiple clusters to support that kind of scale.

I am running a test with 4000 networks and 50 routers, 80 networks on each
router. I wonder if that's going to help.

Reducing the number of routers should help. Also, there are some improvements
in the 20.06 release that reduce the number of logical flows, from a series of
patches by Han. I will post the links later, sorry.

Also there is a big improvement around large Port Groups, as they are now split
by datapath, dramatically reducing the calculations in ovn-controller,
especially in scenarios with a large number of networks like yours.
However, you seem to have no security groups and hence no Port Groups in the NB
database. Is this correct?

> Yes. For now, I want to avoid scale impact from SG, so I disable it.

Is there any chance you can re run the initial scenario but with 20.06?

> Is there a container for 20.06? Or where can I get the packages for 20.06?
> I should be able to upgrade 20.03 to 20.06 by upgrading packages.

The goal is to have thousands of networks connected to the external network.
I'd like to know what scale the current OVN is expected to support.

+Dumitru as we know that there is a limit of 3000 on the number of
resubmissions. So having 3K routers connected to the public logical switch may
hit this limitation. Please @Dumitru correct me if I’m wrong.

Any comment is welcome.

Thanks!

Tony

________________________________
From: Cory Hawkless <c...@hawkless.id.au>
Sent: July 20, 2020 10:04 PM
To: Tony Liu <tonyliu0...@hotmail.com>; ovs-discuss@openvswitch.org 
<ovs-discuss@openvswitch.org>
Subject: RE: OVN scale

I would expect to see 100% cpu utilisation on anything involved in the process 
of creating 4000 networks and routers but the question is for how long do you 
see high utilisation? Does it last for seconds, minutes, hours?

Do the resources actually get created after some period of time or is the 
process failing?

From: discuss [mailto:ovs-discuss-boun...@openvswitch.org] On Behalf Of Tony Liu
Sent: Tuesday, 21 July 2020 1:53 PM
To: ovs-discuss@openvswitch.org
Subject: [ovs-discuss] OVN scale

Hi folks,

This is my first email here. Please let me know if there is any rule or
convention I need to follow. I don't want to break it.

I started with OpenStack Ussuri and OVN 20.03.0 recently and am currently
running some scaling tests. I searched around for scaling info and noticed
some improvements already presented, which is pretty cool. Is the
"incremental" processing by DDlog implemented yet?

With a 3-node OVN DB cluster and 3 compute nodes (with OVN controller), I
created 4000 networks from OpenStack and 4000 logical routers with external
GW, and added one network to each LR. Port security is disabled on all
networks. Then I see ovn-northd, ovn-controller and ovs-vswitchd all take
almost 100% CPU. Is this expected?

I revised the setup and am running a test with 4000 networks and 20 LRs, 200
networks on each LR. I will see if this makes any difference.

Is there any scaling and performance report for the latest OVN release that I
can use as a reference?

Thanks!

Tony

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
