Re: [ovs-discuss] OVN scale

2020-08-04 Thread Tony Liu
Hi,

Continuing this thread with some updates.

I finally got 4096 networks and 256 routers created, with 16 networks
connected to each router. All routers have an external gateway set.

On underlay, those 256 gateway addresses on the provider network are
reachable. Ping is steady.

I launched 10 VMs on one compute node. One of them failed because network
allocation failed. I haven't looked into it yet.

When pinging from the underlay to a VM, it's bumpy. There is a 1s or 2s
delay about every 10 pings.

I can't launch any more VMs. It always fails.

One of the Neutron nodes is very busy. From the INFO-level logs,
it just keeps connecting to OVN.

The active ovn-northd is busy, but all ovn-nb-db and ovn-sb-db are not.

On compute node, ovn-controller is very busy. It keeps saying
"commit failed".

2020-08-05T02:44:23.927Z|04125|reconnect|INFO|tcp:10.6.20.84:6642: connected
2020-08-05T02:44:23.936Z|04126|main|INFO|OVNSB commit failed, force recompute next time.
2020-08-05T02:44:23.938Z|04127|ovsdb_idl|INFO|tcp:10.6.20.84:6642: clustered database server is disconnected from cluster; trying another server
2020-08-05T02:44:23.939Z|04128|reconnect|INFO|tcp:10.6.20.84:6642: connection attempt timed out
2020-08-05T02:44:23.939Z|04129|reconnect|INFO|tcp:10.6.20.84:6642: waiting 2 seconds before reconnect


The connection to the local OVSDB keeps being dropped because there is no
probe response. The probe interval is already set to 30s.

2020-08-05T02:47:15.437Z|04351|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (10.6.20.22:42362<->10.6.20.86:6642) at lib/stream-fd.c:157 (100% CPU usage)
2020-08-05T02:47:15.438Z|04352|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)
2020-08-05T02:47:15.438Z|04353|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-08-05T02:47:15.449Z|04354|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected


There is also an error about the localnet port.

2020-08-05T02:47:15.403Z|04345|patch|ERR|bridge not found for localnet port 'provnet-006baf64-409d-434d-b95b-017a77969b55' with network name 'physnet1'
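One way to get a first overview of logs like the excerpts above is to tally records by module and severity. The following is only a sketch: it assumes the usual OVS/OVN log layout of `timestamp|sequence|module|LEVEL|message`, and the names `parse_vlog` and `tally` are made up for illustration. Lines that do not start a new record are treated as wrapped continuations of the previous message.

```python
import re
from collections import Counter

# Assumed record layout: timestamp|sequence|module|LEVEL|message
VLOG_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T[\d:.]+Z)\|(\d+)\|([^|]+)\|([A-Z]+)\|(.*)$"
)


def parse_vlog(text):
    """Return a list of (timestamp, module, level, message) tuples."""
    records = []
    for line in text.splitlines():
        match = VLOG_RE.match(line)
        if match:
            ts, _seq, module, level, msg = match.groups()
            records.append([ts, module, level, msg])
        elif records and line.strip():
            # Wrapped continuation line: glue it onto the previous message.
            records[-1][3] = records[-1][3].rstrip() + " " + line.strip()
    return [tuple(r) for r in records]


def tally(records):
    """Count records per (module, level) pair."""
    return Counter((module, level) for _ts, module, level, _msg in records)
```

Feeding the ovn-controller log through `tally(parse_vlog(text)).most_common()` shows quickly whether the noise comes mostly from `reconnect`, `ovsdb_idl` or `patch`.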


First of all, this kind of scale should work fine, right?

Any advice on how to look into it?


Thanks!

Tony

> -Original Message-
> From: dev  On Behalf Of Tony Liu
> Sent: Monday, July 27, 2020 10:16 AM
> To: Han Zhou 
> Cc: ovs-...@openvswitch.org; ovs-discuss@openvswitch.org
> Subject: Re: [ovs-dev] [ovs-discuss] OVN scale
> 
> Hi Han,
> 
> Just some updates here.
> 
> I tried with 4K networks on a single router. Configuration was done
> without any issues. I checked both nb-db and sb-db, and they look good.
> It's just that the router configuration is huge (in the Neutron DB, nb-db
> and the flow table in sb-db), because it contains all 4K ports. Also, the
> pipeline of the router datapath in sb-db is quite big.
> 
> I see the ovn-northd master and the sb-db leader are busy, taking 90+% CPU.
> There are only 3 compute nodes and 2 gateway nodes. Does the monitor
> setting "ovn-monitor-all" matter in such a case? Any idea what they are
> busy with, without any configuration updates from OpenStack? The nb-db
> is not busy though.
> 
> Probably because sb-db is busy, ovn-controller can't connect to it
> consistently. It keeps being disconnected and reconnecting. Restarting
> ovn-controller seems to help. I am able to launch a few VMs on different
> networks and they are connected via the router.
> 
> Now, I have a problem with external access. The router is set as gateway to
> a provider/underlay network on an interface on the gateway node. The
> router is allocated an underlay address from that provider network. My
> understanding is that the br-ex on the gateway node holding the active
> router will broadcast ARP to announce that router underlay address in
> case of failover. Also, it will respond to ARP requests for that router
> underlay address. But when I run tcpdump on that underlay interface on
> the gateway node, I see ARP requests coming in, but no ARP responses going
> out. I checked the flow table in sb-db, and it seems OK. I also checked the
> flows on br-ex with "ovs-ofctl dump-flows br-ex", and I don't see anything
> about ARP there.
> How should I look into it?
> 
> Again, the goal is to support 4K networks with external access (security
> groups are disabled). The options are 4K routers (one for each network),
> 50 routers (one for 80 networks), or 1 router (for all 4K networks)...
> All networks are isolated by ACL on the logical router. Which option
> should work better?
> Any comment is appreciated.
> 
> 
> Thanks!
> 
> Tony
> 
> 
> 
> From: discuss  on behalf of Tony
> Liu 
> Sent: July 21, 2020 09:09 PM
> To: Daniel Alvarez 
> Cc: ovs-discuss@openvswitch.org 
> Subject: Re: [ovs-discuss] OVN scale
> 
> [root@ovn-db-2 ~]# ovn-nbctl list nb_global
> _uuid   : b7b3aa05-f7ed-4dbc-979f-10445ac325b8
> connections : []
> external_ids: {"neutron:liveness_check_at"="2020-07-22
> 04:03:17.726917+00:00"}
> hv_cfg 

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-04 Thread Tony Liu
In that case, I can use set-connection to set one row.


Thanks!

Tony

> -Original Message-
> From: Han Zhou 
> Sent: Tuesday, August 4, 2020 4:44 PM
> To: Tony Liu 
> Cc: Numan Siddique ; Han Zhou ; ovs-
> discuss ; ovs-dev 
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> 
> 
> 
> On Tue, Aug 4, 2020 at 2:50 PM Tony Liu   > wrote:
> 
> 
>   Hi,
> 
>   Since I have 3 OVN DB nodes, should I add 3 rows in connection table
>   for the inactivity_probe? Or put 3 addresses into one row?
>
>   "set-connection" set one row only, and there is no "add-connection".
>   How should I add 3 rows into the table connection?
> 
> 
> 
> 
> You only need to set one row. Try this command:
> 
> ovn-nbctl -- --id=@conn_uuid create Connection
> target="ptcp\:6641\:0.0.0.0" inactivity_probe=0 -- set NB_Global .
> connections=@conn_uuid
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -Original Message-
>   > From: Numan Siddique mailto:num...@ovn.org> >
>   > Sent: Tuesday, August 4, 2020 12:36 AM
>   > To: Tony Liu   >
>   > Cc: ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >; ovs-dev> d...@openvswitch.org  >
>   > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
>   >
>   >
>   >
>   > On Tue, Aug 4, 2020 at 9:12 AM Tony Liu  
>   >   > > wrote:
>   >
>   >
>   >   In my deployment, on each Neutron server, there are 13 Neutron
>   >   server processes.
>   >   I see 12 of them (monitor, maintenance, RPC, API) connect to both
>   >   ovn-nb-db and ovn-sb-db. With 3 Neutron server nodes, that's 36
>   >   OVSDB clients.
>   >   Is so many clients OK?
>   >
>   >   Any suggestions how to figure out which side doesn't respond the
>   >   probe, if it's bi-directional? I don't see any activities from
>   >   logging, other than connect/drop and reconnect...
>   >
>   >   BTW, please let me know if this is not the right place to discuss
>   >   Neutron OVN ML2 driver.
>   >
>   >
>   >   Thanks!
>   >
>   >   Tony
>   >
>   >   > -Original Message-
>   >   > From: dev mailto:ovs-
> dev-boun...@openvswitch.org>  
>   > boun...@openvswitch.org  > > On
> Behalf Of Tony Liu
>   >   > Sent: Monday, August 3, 2020 7:45 PM
>   >   > To: ovs-discuss mailto:ovs-
> disc...@openvswitch.org>  
>   > disc...@openvswitch.org  > >;
> ovs-dev>   > d...@openvswitch.org 
>  > >
>   >   > Subject: [ovs-dev] [OVN] no response to inactivity probe
>   >   >
>   >   > Hi,
>   >   >
>   >   > Neutron OVN ML2 driver was disconnected by ovn-nb-db.
> There are
>   > many
>   >   > error messages from ovn-nb-db leader.
>   >   > 
>   >   > 2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620: no
>   >   > response to inactivity probe after 5 seconds, disconnecting
>   >   > 2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300: no
>   >   > response to inactivity probe after 5 seconds, disconnecting
>   >   > 2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582: no
>   >   > response to inactivity probe after 5 seconds, disconnecting
>   >   > 2020-08-04T02:31:53.057Z|03141|reconnect|ERR|tcp:10.6.20.83:42626: no
>   >   > response to inactivity probe after 5 seconds, disconnecting
>   >   > 2020-08-04T02:31:53.058Z|03142|reconnect|ERR|tcp:10.6.20.82:45412: no
>   >   > response to inactivity probe after 5 seconds, disconnecting
>   >   > 2020-08-04T02:31:54.067Z|03143|reconnect|ERR|tcp:10.6.20.81:59416: no
>   >   > response to inactivity probe after 5 seconds, disconnecting
>   >   > 2020-08-04T02:31:54.809Z|03144|reconnect|ERR|tcp:10.6.20.81:60004: no
>   >   > response to inactivity probe after 5 seconds,
> 

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-04 Thread Han Zhou
On Tue, Aug 4, 2020 at 2:50 PM Tony Liu  wrote:

> Hi,
>
> Since I have 3 OVN DB nodes, should I add 3 rows in connection table
> for the inactivity_probe? Or put 3 addresses into one row?
>
> "set-connection" set one row only, and there is no "add-connection".
> How should I add 3 rows into the table connection?
>
>
You only need to set one row. Try this command:

ovn-nbctl -- --id=@conn_uuid create Connection target="ptcp\:6641\:0.0.0.0"
inactivity_probe=0 -- set NB_Global . connections=@conn_uuid
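The probe settings discussed in this thread do not all use the same unit, which is easy to get wrong. As far as I know, the `inactivity_probe` column of the Connection table, `northd_probe_interval` and `ovn-remote-probe-interval` are in milliseconds, while `ovn-openflow-probe-interval` is in seconds; please check the man pages for your release. The sketch below only builds the command lines for a given number of seconds and does not run anything. The helper name `probe_commands` is hypothetical, and `set connection .` assumes the Connection table holds exactly one row.

```python
def probe_commands(seconds):
    """Return argv lists for the probe settings named in this thread.

    Nothing is executed here; each list could be handed to
    subprocess.run() on the appropriate node.
    """
    ms = str(int(seconds * 1000))
    return [
        # NB ovsdb-server -> its clients (assumed unit: milliseconds)
        ["ovn-nbctl", "set", "connection", ".", f"inactivity_probe={ms}"],
        # SB ovsdb-server -> its clients (assumed unit: milliseconds)
        ["ovn-sbctl", "set", "connection", ".", f"inactivity_probe={ms}"],
        # ovn-northd -> NB/SB databases (assumed unit: milliseconds)
        ["ovn-nbctl", "set", "NB_Global", ".",
         f"options:northd_probe_interval={ms}"],
        # ovn-controller -> SB database (assumed unit: milliseconds)
        ["ovs-vsctl", "set", "open", ".",
         f"external_ids:ovn-remote-probe-interval={ms}"],
        # ovn-controller -> ovs-vswitchd over OpenFlow (assumed unit: seconds)
        ["ovs-vsctl", "set", "open", ".",
         f"external_ids:ovn-openflow-probe-interval={int(seconds)}"],
    ]
```

For example, `" ".join(probe_commands(30)[0])` gives the server-side command for a 30-second probe on the NB database.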


> Thanks!
>
> Tony
>
> > -Original Message-
> > From: Numan Siddique 
> > Sent: Tuesday, August 4, 2020 12:36 AM
> > To: Tony Liu 
> > Cc: ovs-discuss ; ovs-dev  > d...@openvswitch.org>
> > Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> >
> >
> >
> > On Tue, Aug 4, 2020 at 9:12 AM Tony Liu  >  > wrote:
> >
> >
> >   In my deployment, on each Neutron server, there are 13 Neutron
> >   server processes.
> >   I see 12 of them (monitor, maintenance, RPC, API) connect to both
> >   ovn-nb-db and ovn-sb-db. With 3 Neutron server nodes, that's 36
> >   OVSDB clients.
> >   Is so many clients OK?
> >
> >   Any suggestions how to figure out which side doesn't respond the
> >   probe, if it's bi-directional? I don't see any activities from
> >   logging, other than connect/drop and reconnect...
> >
> >   BTW, please let me know if this is not the right place to discuss
> >   Neutron OVN ML2 driver.
> >
> >
> >   Thanks!
> >
> >   Tony
> >
> >   > -Original Message-
> >   > From: dev mailto:ovs-dev-
> > boun...@openvswitch.org> > On Behalf Of Tony Liu
> >   > Sent: Monday, August 3, 2020 7:45 PM
> >   > To: ovs-discuss mailto:ovs-
> > disc...@openvswitch.org> >; ovs-dev  >   > d...@openvswitch.org  >
> >   > Subject: [ovs-dev] [OVN] no response to inactivity probe
> >   >
> >   > Hi,
> >   >
> >   > Neutron OVN ML2 driver was disconnected by ovn-nb-db. There are
> > many
> >   > error messages from ovn-nb-db leader.
> >   > 
> >   > 2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620: no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300: no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582: no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:53.057Z|03141|reconnect|ERR|tcp:10.6.20.83:42626: no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:53.058Z|03142|reconnect|ERR|tcp:10.6.20.82:45412: no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:54.067Z|03143|reconnect|ERR|tcp:10.6.20.81:59416: no
> >   > response to inactivity probe after 5 seconds, disconnecting
> >   > 2020-08-04T02:31:54.809Z|03144|reconnect|ERR|tcp:10.6.20.81:60004: no
> >   > response to inactivity probe after 5 seconds, disconnecting
> > 
> >   >
> >   > Could anyone share a bit details how this inactivity probe works?
> >
> >
> >
> > The inactivity probe is sent by both the server and clients
> > independently.
> > Meaning ovsdb-server will send an inactivity probe every 'x' configured
> > seconds to all its connected clients and if it doesn't get a reply from
> > the client within some time, it disconnects the connection.
> >
> > The inactivity probe from the server side can be configured. Run
> > "ovn-nbctl list connection" and you will see the inactivity_probe column.
> > You can set this column to the desired value (in milliseconds), e.g.
> > ovn-nbctl set connection . inactivity_probe=30000 (for 30 seconds)
> >
> > The same applies to the SB ovsdb-server.
> >
> > Similarly, each client (ovn-northd, ovn-controller, neutron server) sends
> > an inactivity probe every 'y' seconds, and if the client doesn't get any
> > reply from ovsdb-server it will drop the connection and reconnect.
> >
> > For ovn-northd you can configure this with:
> > ovn-nbctl set NB_Global . options:northd_probe_interval=30000
> >
> > For ovn-controllers:
> > ovs-vsctl set open . external_ids:ovn-remote-probe-interval=30000
> >
> > There is also a probe interval for the OpenFlow connection from
> > ovn-controller to ovs-vswitchd, which you can configure with
> > ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=30
> > (this is in seconds)
> >
> >
> > Regarding the neutron server I think it is set to 60 seconds. Please see
> > this -
> > 

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Tony Liu
Hi Han,

Sounds good. I am looking forward to incremental-processing,
and will go from there.

BTW, it would be great if you could let me know how to set the probe
interval for a 3-node cluster, here or in another thread.


Thanks!

Tony

> -Original Message-
> From: Han Zhou 
> Sent: Tuesday, August 4, 2020 4:02 PM
> To: Tony Liu 
> Cc: Han Zhou ; Numan Siddique ; Ben Pfaff
> ; Leonid Ryzhyk ; ovs-dev  d...@openvswitch.org>; ovs-discuss 
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> Hi Tony,
> 
> I am glad it is clearer now. Your concern about one round of computing
> taking too much time is valid, but I guess it is not directly related to
> the IDLE probe any more, right?
> The OVSDB IDL in fact already does some of the capping and buffering
> you proposed. The IDL reads a limited number of messages to process in
> each round (the remaining messages stay buffered in the stream).
> However, sometimes a single notification message can contain a huge
> amount of data. It is hard to split the data from one single
> notification, because the data are internally dependent on each other.
>
> Without incremental-processing, the size of the data change doesn't
> matter much because all data is recomputed anyway. I'd suggest seeing
> what the outcome of incremental-processing is, and then whether any
> further improvement is still needed for handling big transactions.
>
> In my opinion, the special cases of a big data change triggered by
> scenarios such as data restore can be handled by operational approaches
> instead of implementation. For example, you could adjust the probe
> interval before doing a data restore, and change it back afterwards. But
> of course, if there are good ways to implement this we should definitely
> consider them.
> 
> 
> Thanks,
> Han
> 
> 
> On Tue, Aug 4, 2020 at 2:00 PM Tony Liu   > wrote:
> 
> 
>   Hi Han,
> 
>   Thanks for the clarifications! It's crystal clear.
>
>   My concern, in general, is blocking. For ovn-northd, or any OVSDB client
>   (I assume all OVSDB clients use the same library for connection,
>   probe, etc.?), when handling the current event, it won't be interrupted
>   to handle any incoming event, right? How long does it take to handle a
>   computing event for a big chunk of data? How much data can be buffered
>   to be computed? Is there an estimated maximum time for handling that
>   much data?
>
>   In case it takes more than 5s to process an event, the peer will
>   drop the connection because of the probe timeout.
>
>   With incremental processing, if I restore the DB, that could still be a
>   huge increment, unless the increment size is controlled. That's
>   probably why you recommend restoring to an existing cluster, to avoid
>   a huge increment from restoring to a fresh cluster. Am I right?
>
>   What I used to do is chop big data into pieces to be handled by
>   multiple event loop iterations. That way, other events have a chance
>   to get processed, so a big chunk of data won't cause blocking.
>
>   Enlarging the probe interval will sort of resolve the issue, but it
>   defeats the point of probing. It's just like the election timer:
>   enlarging the timer avoids frequent failovers, but it also increases
>   the failover time when a real problem happens. And yes, I agree that
>   it's on the control plane and doesn't break the data plane, but just
>   like in the networking world, routing convergence is very important.
>
>   I am thinking that, in your incremental processing, if the time for
>   each event loop iteration could be capped or controlled, that would be
>   very helpful. The side effect of that option is memory consumption,
>   since you will need to buffer more data. But today, it's a lot easier
>   to increase memory to boost performance.
> 
>   Thanks!
> 
>   Tony
> 
>   > -Original Message-
>   > From: Han Zhou mailto:hz...@ovn.org> >
>   > Sent: Tuesday, August 4, 2020 12:34 PM
>   > To: Tony Liu   >
>   > Cc: Han Zhou mailto:hz...@ovn.org> >; Numan
> Siddique mailto:num...@ovn.org> >; Ben Pfaff
>   > mailto:b...@ovn.org> >; Leonid Ryzhyk
> mailto:lryz...@vmware.com> >; ovs-dev> d...@openvswitch.org  >; ovs-discuss
> mailto:ovs-discuss@openvswitch.org> >
>   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when
> no
>   > configuration update
>   >
>   >
>   >
>   > On Tue, Aug 4, 2020 at 11:40 AM Tony Liu  
>   >   > > wrote:
>   >
>   >
>   >   Inline...
>   >
>   >   Thanks!
>   >
>   >   Tony
>   >   > -Original Message-
>   >  

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-04 Thread Tony Liu
Hi,

Since I have 3 OVN DB nodes, should I add 3 rows to the connection table
for the inactivity_probe? Or put 3 addresses into one row?

"set-connection" sets only one row, and there is no "add-connection".
How should I add 3 rows to the connection table?


Thanks!

Tony

> -Original Message-
> From: Numan Siddique 
> Sent: Tuesday, August 4, 2020 12:36 AM
> To: Tony Liu 
> Cc: ovs-discuss ; ovs-dev  d...@openvswitch.org>
> Subject: Re: [ovs-discuss] [OVN] no response to inactivity probe
> 
> 
> 
> On Tue, Aug 4, 2020 at 9:12 AM Tony Liu   > wrote:
> 
> 
>   In my deployment, on each Neutron server, there are 13 Neutron
>   server processes.
>   I see 12 of them (monitor, maintenance, RPC, API) connect to both
>   ovn-nb-db and ovn-sb-db. With 3 Neutron server nodes, that's 36
>   OVSDB clients.
>   Is so many clients OK?
>
>   Any suggestions how to figure out which side doesn't respond the
>   probe, if it's bi-directional? I don't see any activities from
>   logging, other than connect/drop and reconnect...
>
>   BTW, please let me know if this is not the right place to discuss
>   Neutron OVN ML2 driver.
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -Original Message-
>   > From: dev mailto:ovs-dev-
> boun...@openvswitch.org> > On Behalf Of Tony Liu
>   > Sent: Monday, August 3, 2020 7:45 PM
>   > To: ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >; ovs-dev> d...@openvswitch.org  >
>   > Subject: [ovs-dev] [OVN] no response to inactivity probe
>   >
>   > Hi,
>   >
>   > Neutron OVN ML2 driver was disconnected by ovn-nb-db. There are
> many
>   > error messages from ovn-nb-db leader.
>   > 
>   > 2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620: no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300: no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582: no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:53.057Z|03141|reconnect|ERR|tcp:10.6.20.83:42626: no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:53.058Z|03142|reconnect|ERR|tcp:10.6.20.82:45412: no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:54.067Z|03143|reconnect|ERR|tcp:10.6.20.81:59416: no
>   > response to inactivity probe after 5 seconds, disconnecting
>   > 2020-08-04T02:31:54.809Z|03144|reconnect|ERR|tcp:10.6.20.81:60004: no
>   > response to inactivity probe after 5 seconds, disconnecting
> 
>   >
>   > Could anyone share a bit details how this inactivity probe works?
> 
> 
> 
> The inactivity probe is sent by both the server and clients
> independently.
> Meaning ovsdb-server will send an inactivity probe every 'x' configured
> seconds to all its connected clients and if it doesn't get a reply from
> the client within some time, it disconnects the connection.
> 
> The inactivity probe from the server side can be configured. Run
> "ovn-nbctl list connection" and you will see the inactivity_probe column.
> You can set this column to the desired value (in milliseconds), e.g.
> ovn-nbctl set connection . inactivity_probe=30000 (for 30 seconds)
>
> The same applies to the SB ovsdb-server.
>
> Similarly, each client (ovn-northd, ovn-controller, neutron server) sends
> an inactivity probe every 'y' seconds, and if the client doesn't get any
> reply from ovsdb-server it will drop the connection and reconnect.
>
> For ovn-northd you can configure this with:
> ovn-nbctl set NB_Global . options:northd_probe_interval=30000
>
> For ovn-controllers:
> ovs-vsctl set open . external_ids:ovn-remote-probe-interval=30000
>
> There is also a probe interval for the OpenFlow connection from
> ovn-controller to ovs-vswitchd, which you can configure with
> ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=30
> (this is in seconds)
> 
> 
> Regarding the neutron server I think it is set to 60 seconds. Please see
> this -
> https://github.com/openstack/neutron/blob/master/neutron/conf/plugins/ml
> 2/drivers/ovn/ovn_conf.py#L80
> 
> From the logs you shared, it looks like ovsdb-server is not getting the
> probe reply from neutron server after 5 seconds and hence it is
> disconnecting. Not sure what's happening though.
> 
> You can try increasing the inactivity probe interval on the ovsdb-server
> side with the first command I shared.
> Note: If "ovn-nbctl list connection" returns empty, you need to create a
> 

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Tony Liu
Hi Han,

Thanks for the clarifications! It's crystal clear.

My concern, in general, is blocking. For ovn-northd, or any OVSDB client
(I assume all OVSDB clients use the same library for connection,
probe, etc.?), when handling the current event, it won't be interrupted to
handle any incoming event, right? How long does it take to handle a
computing event for a big chunk of data? How much data can be buffered
to be computed? Is there an estimated maximum time for handling that much
data?

In case it takes more than 5s to process an event, the peer will
drop the connection because of the probe timeout.

With incremental processing, if I restore the DB, that could still be a
huge increment, unless the increment size is controlled. That's
probably why you recommend restoring to an existing cluster, to avoid
a huge increment from restoring to a fresh cluster. Am I right?

What I used to do is chop big data into pieces to be handled by
multiple event loop iterations. That way, other events have a chance to get
processed, so a big chunk of data won't cause blocking.
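The chunking idea can be sketched with a toy single-threaded loop. This is an illustration of the general technique only, not how ovn-northd or the OVSDB IDL works; `run_loop`, the event kinds and `chunk_size` are all made-up names. A large job is handled a slice at a time and the remainder is put back at the end of the queue, so a small event (such as a probe) queued behind it is served in between.

```python
from collections import deque


def run_loop(events, chunk_size):
    """Process (kind, payload) events, splitting "bulk" payloads into chunks.

    Returns the list of (kind, payload) items in the order they were handled.
    """
    queue = deque(events)
    handled = []
    while queue:
        kind, payload = queue.popleft()
        if kind == "bulk":
            head, rest = payload[:chunk_size], payload[chunk_size:]
            handled.append(("bulk", head))
            if rest:
                # Requeue the remainder behind whatever is already waiting.
                queue.append(("bulk", rest))
        else:
            handled.append((kind, payload))
    return handled
```

With `events = [("bulk", [0, 1, 2, 3, 4]), ("probe", None)]` and `chunk_size=2`, the probe is handled right after the first two items rather than after all five. As Han points out elsewhere in the thread, this only helps when the work can actually be split; a single notification whose rows depend on each other is hard to cut this way.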

Enlarging the probe interval will sort of resolve the issue, but it defeats
the point of probing. It's just like the election timer: enlarging the timer
avoids frequent failovers, but it also increases the failover time when a
real problem happens. And yes, I agree that it's on the control plane and
doesn't break the data plane, but just like in the networking world, routing
convergence is very important.

I am thinking that, in your incremental processing, if the time for each
event loop iteration could be capped or controlled, that would be very
helpful. The side effect of that option is memory consumption, since you
will need to buffer more data. But today, it's a lot easier to increase
memory to boost performance.


Thanks!

Tony

> -Original Message-
> From: Han Zhou 
> Sent: Tuesday, August 4, 2020 12:34 PM
> To: Tony Liu 
> Cc: Han Zhou ; Numan Siddique ; Ben Pfaff
> ; Leonid Ryzhyk ; ovs-dev  d...@openvswitch.org>; ovs-discuss 
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> 
> 
> On Tue, Aug 4, 2020 at 11:40 AM Tony Liu   > wrote:
> 
> 
>   Inline...
> 
>   Thanks!
> 
>   Tony
>   > -Original Message-
>   > From: Han Zhou mailto:hz...@ovn.org> >
>   > Sent: Tuesday, August 4, 2020 11:01 AM
>   > To: Numan Siddique mailto:num...@ovn.org> >; Ben
> Pfaff mailto:b...@ovn.org> >; Leonid
>   > Ryzhyk mailto:lryz...@vmware.com> >
>   > Cc: Tony Liu   >; Han Zhou   >; ovs-
>   > dev mailto:ovs-...@openvswitch.org> >;
> ovs-discuss mailto:ovs-
> disc...@openvswitch.org> >
>   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when
> no
>   > configuration update
>   >
>   >
>   >
>   > On Tue, Aug 4, 2020 at 12:38 AM Numan Siddique  
>   >  > > wrote:
>   >
>   >
>   >
>   >
>   >   On Tue, Aug 4, 2020 at 9:02 AM Tony Liu
> mailto:tonyliu0...@hotmail.com>
>   >   > > wrote:
>   >
>   >
>   >   The probe awakes recomputing?
>   >   There is probe every 5 seconds. Without any connection
>   >   up/down or failover, ovn-northd will recompute everything
>   >   every 5 seconds, no matter what?
>   >   Really?
>   >
>   >   Anyways, I will increase the probe interval for now, see if
>   >   that helps.
>   >
>   >
>   >
>   >   I think we should optimise this case. I am planning to look
>   >   into this.
>   >
>   >   Thanks
>   >   Numan
>   >
>   >
>   > Thanks Numan.
>   > I'd like to discuss more on this before we move forward to change
>   > anything.
>   >
>   > 1) Regarding the problem itself, the CPU cost triggered by the OVSDB IDLE
>   > probe when there is no configuration change to compute, I don't think it
>   > matters that much in real production. It simply wastes CPU cycles when
>   > there is nothing to do, so what harm would it do here? For ovn-northd,
>   > since it is the centralized component, we would always ensure there is
>   > enough CPU available for ovn-northd when computing is needed, and this
>   > reservation will be wasted anyway when there is no change to compute. So,
>   > I'd avoid making any change specifically only to address this issue. I
>   > could be wrong, though. I'd like to hear what would be the real concern
>   > if this is not addressed.
> 
>   Will more vCPUs help here? Is ovn-northd multi-threaded?
> 
> 
> 
> 
> ovn-northd is single threaded. It can be changed to have a separate
> thread for the probe handling, but I don't see any obvious benefit.
> 

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Han Zhou
On Tue, Aug 4, 2020 at 11:40 AM Tony Liu  wrote:

> Inline...
>
> Thanks!
>
> Tony
> > -Original Message-
> > From: Han Zhou 
> > Sent: Tuesday, August 4, 2020 11:01 AM
> > To: Numan Siddique ; Ben Pfaff ; Leonid
> > Ryzhyk 
> > Cc: Tony Liu ; Han Zhou ; ovs-
> > dev ; ovs-discuss 
> > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> > configuration update
> >
> >
> >
> > On Tue, Aug 4, 2020 at 12:38 AM Numan Siddique  >  > wrote:
> >
> >
> >
> >
> >   On Tue, Aug 4, 2020 at 9:02 AM Tony Liu  >  > wrote:
> >
> >
> >   The probe awakes recomputing?
> >   There is probe every 5 seconds. Without any connection
> > up/down or failover,
> >   ovn-northd will recompute everything every 5 seconds, no
> > matter what?
> >   Really?
> >
> >   Anyways, I will increase the probe interval for now, see if
> > that helps.
> >
> >
> >
> >   I think we should optimise this case. I am planning to look into
> > this.
> >
> >   Thanks
> >   Numan
> >
> >
> > Thanks Numan.
> > I'd like to discuss more on this before we move forward to change
> > anything.
> >
> > 1) Regarding the problem itself, the CPU cost triggered by OVSDB IDLE
> > probe when there is no configuration change to compute, I don't think it
> > matters that much in real production. It simply wastes CPU cycles when
> > there is nothing to do, so what harm would it do here? For ovn-northd,
> > since it is the centralized component, we would always ensure there is
> > enough CPU available for ovn-northd when computing is needed, and this
> > reservation will be wasted anyway when there is no change to compute. So,
> > I'd avoid making any change specifically only to address this issue. I
> > could be wrong, though. I'd like to hear what would be the real concern
> > if this is not addressed.
>
> Will more vCPUs help here? Is ovn-northd multi-threaded?
>
>
ovn-northd is single threaded. It can be changed to have a separate thread
for the probe handling, but I don't see any obvious benefit.


> I am probably still missing something here. The probe is there all times,
> every 5s.


The probe is sent only if there is no activity on the OVSDB connection
during the interval, that's why it is called "IDLE" probe. If there is
already interaction during the past interval, no probe will be sent.
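To make the idle-probe timing concrete, here is a small model of one side of a connection. It reflects my understanding of the behaviour described in this thread (a probe is sent only after a full interval without receiving anything, and the connection is dropped if nothing arrives for another interval after the probe); it is not taken from the OVS source. The function name and the `traffic` and `busy` parameters are invented for illustration, and the model assumes a busy peer sends nothing until its busy window ends.

```python
def simulate_idle_probe(interval, until, traffic=(), busy=()):
    """Model the prober's side of one connection.

    interval: probe interval in seconds (must be > 0)
    until:    stop simulating once the next probe would be later than this
    traffic:  times at which ordinary messages arrive from the peer
    busy:     (start, end) windows in which the peer cannot answer a probe;
              it answers at `end` instead

    Returns (probe_times, disconnect_time_or_None).
    """
    traffic = sorted(traffic)
    probes, last_rx, i = [], 0.0, 0
    while True:
        probe_at = last_rx + interval
        # Any message received before the idle deadline resets the timer.
        if i < len(traffic) and traffic[i] <= probe_at:
            last_rx = max(last_rx, traffic[i])
            i += 1
            continue
        if probe_at > until:
            return probes, None
        probes.append(probe_at)
        # The peer replies immediately unless it is inside a busy window.
        reply_at = probe_at
        for start, end in busy:
            if start <= probe_at < end:
                reply_at = end
        if reply_at > probe_at + interval:
            return probes, probe_at + interval
        last_rx = reply_at
```

Under these assumptions, with a 5-second interval, a peer that sends something every 3 seconds is never probed at all, while a peer that is stuck computing from t=4 to t=16 is probed at t=5 and dropped at t=10.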


> If ovn-northd is in the middle of a computing, is a probe going
> to make ovn-northd restart the computing?


No, it won't. Firstly, it is unlikely that a probe is received during
computing, unless the probe interval is set too short. Secondly, even when
it happens, the current computing will complete and all needed changes will
be enforced to SB DB regardless of the probe received during the computing.
The probe will be handled in the next round of the main loop, and it will
trigger another round of computing which is useless but not harmful either.
There is probably one case I can think of that causes a little latency -
when another NB DB change (say, change2) comes during the computing
triggered by the probe, then the handling for the change2 will be delayed a
little until the computing triggered by the probe completes. But the chance
is rather low, especially if the probe interval is enlarged, and in the
unlucky case, the impact is just a little delay in the change handling.


> Or the probe only triggers
> computing when ovn-northd is idle? Even with the latter case, what's the
> intention to trigger computing by probe?
>
>
It is not triggered intentionally for the probe. It is just because
ovn-northd doesn't distinguish whether it was woken up by a probe only or
whether there are changes that need to be processed. Many events can wake up
ovn-northd, and once it is woken up it computes everything. I agree it
can be optimized (we already optimized this for ovn-controller). I am just
wondering whether it is worth optimizing specifically, or whether we just get
it for free as a byproduct of implementing incremental-processing, which is
already on the roadmap.
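The difference between "recompute on every wakeup" and "recompute only when something changed" can be illustrated with a toy loop. This is a sketch under assumptions, not the actual ovn-northd or ovn-controller logic: the change counter `seqno` and the function `count_recomputes` are hypothetical stand-ins for whatever mechanism tells the daemon that database content really changed.

```python
def count_recomputes(wakeups, skip_if_unchanged):
    """Count full recomputes over a sequence of main-loop wakeups.

    wakeups: iterable of "probe" or "change", one entry per loop iteration
    skip_if_unchanged: when True, a wakeup that brought no database change
                       since the last recompute does not trigger another one
    """
    seqno = 0              # bumped whenever database content changes
    last_computed = None   # seqno seen by the most recent recompute
    recomputes = 0
    for reason in wakeups:
        if reason == "change":
            seqno += 1
        if skip_if_unchanged and seqno == last_computed:
            continue       # woken up (e.g. by a probe) with nothing new
        recomputes += 1
        last_computed = seqno
    return recomputes
```

For the wakeup sequence change, probe, probe, change, probe, the unconditional loop recomputes five times and the checking loop twice. With full incremental processing the cost per change would shrink as well, which is why the probe case may come for free.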

Does this clarify a little?

Thanks,
Han


> >
> > 2) ovn-northd incremental processing would avoid this CPU problem
> > naturally. So let's discuss how to move forward for incremental
> > processing, which is much more important because it also solves the CPU
> > efficiency when handling the changes, and the IDLE probe problem is just
> > a byproduct. I believe the DDlog branch would have solved this problem.
> > However, it seems we are not sure about the current status of DDlog. As
> > you proposed at the last OVN meeting, an alternative is to implement
> > partial incremental-processing using the I-P engine like ovn-controller.
> > While I have no objection to this, we'd better check with Ben and Leonid
> > on the plan to avoid overlapping and waste of work. @Ben @Leonid, would
> > you mind sharing the status here since you were not at the meeting last
> > week?
>
> 

Re: [ovs-discuss] [ovs-dev] [OVN] stale data complained by ovn-controller after db restore

2020-08-04 Thread Han Zhou
On Tue, Aug 4, 2020 at 11:31 AM Tony Liu  wrote:

> Is there any difference to restore DB on existing cluster vs. fresh
> cluster,
> in terms of performance?
>
> If I don't have to restore on fresh cluster, which is recommended?
>
I would suggest restoring directly on top of the existing cluster instead of
creating a fresh cluster.


> For now, since ovn-northd always recomputes the whole DB, I guess not much
> difference?
>
> With incremental-process, would restoring to a fresh cluster be better?
>
No.


> Is it necessary to stop or restart ovn-northd during DB restore?
>
No.


>
> Thanks!
>
> Tony
>
> > -----Original Message-----
> > From: Han Zhou
> > Sent: Tuesday, August 4, 2020 11:13 AM
> > To: Tony Liu
> > Cc: ovs-discuss; ovs-dev <ovs-d...@openvswitch.org>
> > Subject: Re: [ovs-dev] [OVN] stale data complained by ovn-controller
> > after db restore
> >
> >
> >
> > On Tue, Aug 4, 2020 at 10:30 AM Tony Liu wrote:
> >
> >
> >   Hi,
> >
> >   Here is how I restore OVN DB.
> >   * Stop all ovn-nb-db, ovn-sb-db and ovn-northd services.
> >   * Clean up all DB files.
> >   * Start all DB services. Fresh ovn-nb-db and ovn-sb-db clusters are
> > up and
> > running.
> >   * Set DB election timer to 10s.
> >   * Restore DB to ovn-nb-db by ovsdb-client.
> >   * Start all ovn-northd services.
> >
> >   A few minutes after, ovn-sb-db is fully synced with ovn-nb-db.
> >
> >   Now, the client of ovn-sb-db, ovn-controller and nova-compute
> > complaint about
> >   "stale data". The chassis node is not getting updated.
> >   
> >   2020-08-04 09:07:45.892 26 INFO ovsdbapp.backend.ovs_idl.vlog [-]
> > tcp:10.6.20.84:6642  : connected
> >   2020-08-04 09:07:45.895 26 WARNING ovsdbapp.backend.ovs_idl.vlog
> [-]
> > tcp:10.6.20.84:6642  : clustered database server
> > has stale data; trying another server
> >   
> >
> >   Restarting ovn-controller and nova-compute resolve the issue.
> >
> >   Is this expected? As part of the DB restore process, should I
> > restart
> >   ovn-controller and nova-compute on all chassis node?
> >
> >
> >
> >
> > Yes, this is expected if you freshly start a new cluster. (It wouldn't
> > happen if you simply restore the old data on the existing cluster.
> > However, I understand that the scenario of restoring data on a freshly
> > created cluster is a valid use case).
> > For this case, you could either restart ovn-controller, or trigger a
> > client side raft index reset by:
> > ovn-appctl -t ovn-controller sb-cluster-state-reset
> >
> > Similarly for ovn-northd:
> > ovn-appctl -t ovn-northd nb-cluster-state-reset
> > ovn-appctl -t ovn-northd sb-cluster-state-reset
> >
> > To use this command, you will need at least 20.06 of OVN and OVS master.
> >
> >
> > Thanks,
> > Han
> >
> >
> >
> >
> >   Thanks!
> >
> >   Tony
> >
> >   ___
> >   dev mailing list
> >   d...@openvswitch.org 
> >   https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
>
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Tony Liu
Inline...

Thanks!

Tony
> -----Original Message-----
> From: Han Zhou
> Sent: Tuesday, August 4, 2020 11:01 AM
> To: Numan Siddique; Ben Pfaff; Leonid Ryzhyk
> Cc: Tony Liu; Han Zhou; ovs-dev; ovs-discuss
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> 
> 
> On Tue, Aug 4, 2020 at 12:38 AM Numan Siddique wrote:
> 
> 
> 
> 
>   On Tue, Aug 4, 2020 at 9:02 AM Tony Liu wrote:
> 
> 
>   The probe awakes recomputing?
>   There is probe every 5 seconds. Without any connection up/down or failover,
>   ovn-northd will recompute everything every 5 seconds, no matter what?
>   Really?
> 
>   Anyways, I will increase the probe interval for now, see if that helps.
> 
> 
> 
>   I think we should optimise this case. I am planning to look into this.
> 
>   Thanks
>   Numan
> 
> 
> Thanks Numan.
> I'd like to discuss more on this before we move forward to change
> anything.
> 
> 1) Regarding the problem itself, the CPU cost triggered by OVSDB IDLE
> probe when there is no configuration change to compute, I don't think it
> matters that much in real production. It simply wastes CPU cycles when
> there is nothing to do, so what harm would it do here? For ovn-northd,
> since it is the centralized component, we would always ensure there is
> enough CPU available for ovn-northd when computing is needed, and this
> reservation will be wasted anyway when there is no change to compute. So,
> I'd avoid making any change specifically only to address this issue. I
> could be wrong, though. I'd like to hear what would be the real concern
> if this is not addressed.

Are more vCPUs going to help here? Is ovn-northd multi-threaded?

I am probably still missing something here. The probe is always there,
every 5s. If ovn-northd is in the middle of a computation, is a probe going
to make ovn-northd restart the computation? Or does the probe only trigger
a computation when ovn-northd is idle? Even in the latter case, what is the
intention of triggering a computation by a probe?

> 
> 2) ovn-northd incremental processing would avoid this CPU problem
> naturally. So let's discuss how to move forward for incremental
> processing, which is much more important because it also solves the CPU
> efficiency when handling the changes, and the IDLE probe problem is just
> a byproduct. I believe the DDlog branch would have solved this problem.
> However, it seems we are not sure about the current status of DDlog. As
> you proposed at the last OVN meeting, an alternative is to implement
> partial incremental-processing using the I-P engine like ovn-controller.
> While I have no objection to this, we'd better check with Ben and Leonid
> on the plan to avoid overlapping and waste of work. @Ben @Leonid, would
> you mind sharing the status here since you were not at the meeting last
> week?

My point is that a probe is not supposed to trigger any computation, whether
full or incremental.

> 
> 
> 
> Thanks,
> Han
> 
> 
> 
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -----Original Message-----
>   > From: Han Zhou <hz...@ovn.org>
>   > Sent: Monday, August 3, 2020 8:22 PM
>   > To: Tony Liu
>   > Cc: Han Zhou <hz...@ovn.org>; ovs-discuss <ovs-disc...@openvswitch.org>;
>   > ovs-dev <ovs-d...@openvswitch.org>
>   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
>   > configuration update
>   >
>   > Sorry that I didn't make it clear enough. The OVSDB probe itself doesn't
>   > take much CPU, but the probe awakes ovn-northd main loop, which
>   > recompute everything, which is why you see CPU spike.
>   > It will be solved by incremental-processing, when only delta is
>   > processed, and in case of probe handling, there is no change in
>   > configuration, so the delta is zero.
>   > For now, please follow the steps to adjust probe interval, if the CPU of
>   > ovn-northd (when there is no configuration change) is a concern for you.
>   > But please remember that this has no impact to the real CPU usage for
>   > handling configuration changes.
>   >
>   >
>   > Thanks,
>   > Han
>   >
>   >
>   > On Mon, Aug 3, 2020 at 8:11 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
>   >
>   >
>   >   Health check (5 sec interval) taking 30%-100% CPU is definitely not
>   >   acceptable,
> 

Re: [ovs-discuss] [ovs-dev] [OVN] stale data complained by ovn-controller after db restore

2020-08-04 Thread Tony Liu
Is there any difference, in terms of performance, between restoring the DB
on an existing cluster and restoring it on a fresh cluster?

If I don't have to restore on a fresh cluster, which is recommended?

For now, since ovn-northd always recomputes the whole DB, I guess there is
not much difference?

With incremental processing, would restoring to a fresh cluster be better?

Is it necessary to stop or restart ovn-northd during DB restore?


Thanks!

Tony

> -----Original Message-----
> From: Han Zhou
> Sent: Tuesday, August 4, 2020 11:13 AM
> To: Tony Liu
> Cc: ovs-discuss; ovs-dev <ovs-d...@openvswitch.org>
> Subject: Re: [ovs-dev] [OVN] stale data complained by ovn-controller
> after db restore
> 
> 
> 
> On Tue, Aug 4, 2020 at 10:30 AM Tony Liu wrote:
> 
> 
>   Hi,
> 
>   Here is how I restore OVN DB.
>   * Stop all ovn-nb-db, ovn-sb-db and ovn-northd services.
>   * Clean up all DB files.
>   * Start all DB services. Fresh ovn-nb-db and ovn-sb-db clusters are
> up and
> running.
>   * Set DB election timer to 10s.
>   * Restore DB to ovn-nb-db by ovsdb-client.
>   * Start all ovn-northd services.
> 
>   A few minutes after, ovn-sb-db is fully synced with ovn-nb-db.
> 
>   Now, the client of ovn-sb-db, ovn-controller and nova-compute
> complaint about
>   "stale data". The chassis node is not getting updated.
>   
>   2020-08-04 09:07:45.892 26 INFO ovsdbapp.backend.ovs_idl.vlog [-]
> tcp:10.6.20.84:6642  : connected
>   2020-08-04 09:07:45.895 26 WARNING ovsdbapp.backend.ovs_idl.vlog [-]
> tcp:10.6.20.84:6642  : clustered database server
> has stale data; trying another server
>   
> 
>   Restarting ovn-controller and nova-compute resolve the issue.
> 
>   Is this expected? As part of the DB restore process, should I
> restart
>   ovn-controller and nova-compute on all chassis node?
> 
> 
> 
> 
> Yes, this is expected if you freshly start a new cluster. (It wouldn't
> happen if you simply restore the old data on the existing cluster.
> However, I understand that the scenario of restoring data on a freshly
> created cluster is a valid use case).
> For this case, you could either restart ovn-controller, or trigger a
> client side raft index reset by:
> ovn-appctl -t ovn-controller sb-cluster-state-reset
> 
> Similarly for ovn-northd:
> ovn-appctl -t ovn-northd nb-cluster-state-reset
> ovn-appctl -t ovn-northd sb-cluster-state-reset
> 
> To use this command, you will need at least 20.06 of OVN and OVS master.
> 
> 
> Thanks,
> Han
> 
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
> 



Re: [ovs-discuss] [ovs-dev] [OVN] stale data complained by ovn-controller after db restore

2020-08-04 Thread Han Zhou
On Tue, Aug 4, 2020 at 10:30 AM Tony Liu  wrote:

> Hi,
>
> Here is how I restore OVN DB.
> * Stop all ovn-nb-db, ovn-sb-db and ovn-northd services.
> * Clean up all DB files.
> * Start all DB services. Fresh ovn-nb-db and ovn-sb-db clusters are up and
>   running.
> * Set DB election timer to 10s.
> * Restore DB to ovn-nb-db by ovsdb-client.
> * Start all ovn-northd services.
>
> A few minutes after, ovn-sb-db is fully synced with ovn-nb-db.
>
> Now, the client of ovn-sb-db, ovn-controller and nova-compute complaint
> about
> "stale data". The chassis node is not getting updated.
> 
> 2020-08-04 09:07:45.892 26 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:
> 10.6.20.84:6642: connected
> 2020-08-04 09:07:45.895 26 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:
> 10.6.20.84:6642: clustered database server has stale data; trying another
> server
> 
>
> Restarting ovn-controller and nova-compute resolve the issue.
>
> Is this expected? As part of the DB restore process, should I restart
> ovn-controller and nova-compute on all chassis node?
>
>
Yes, this is expected if you freshly start a new cluster. (It wouldn't
happen if you simply restore the old data on the existing cluster. However,
I understand that the scenario of restoring data on a freshly created
cluster is a valid use case).
For this case, you could either restart ovn-controller, or trigger a client
side raft index reset by:
ovn-appctl -t ovn-controller sb-cluster-state-reset

Similarly for ovn-northd:
ovn-appctl -t ovn-northd nb-cluster-state-reset
ovn-appctl -t ovn-northd sb-cluster-state-reset

To use these commands, you will need at least OVN 20.06 and OVS master.
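
As a rough illustration of why a fresh cluster looks "stale" to an existing
client, here is a toy model of the client-side check. It assumes the client
remembers the highest raft index it has seen and rejects servers that report
a lower one. The class name, method names, and index values (5000, 12) are
made up for illustration and are not the real IDL code.

```python
class ClusterClientModel:
    """Toy model of a clustered-OVSDB client; not the real IDL code."""

    def __init__(self) -> None:
        # Highest raft log index this client has seen from any server.
        self.min_index = 0

    def accept_server(self, server_index: int) -> bool:
        """Return True if the server is usable, False if it looks stale."""
        if server_index < self.min_index:
            # Corresponds to "clustered database server has stale data".
            return False
        self.min_index = server_index
        return True

    def cluster_state_reset(self) -> None:
        # Models the effect described for sb-cluster-state-reset:
        # forget the remembered index so a fresh cluster is accepted.
        self.min_index = 0


client = ClusterClientModel()
client.accept_server(5000)        # old cluster with a long history
print(client.accept_server(12))   # fresh cluster looks stale: False
client.cluster_state_reset()
print(client.accept_server(12))   # accepted after the reset: True
```

Restarting the client has the same effect in this model, because the
remembered index starts from zero again.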

Thanks,
Han



> Thanks!
>
> Tony
>
>


[ovs-discuss] [OVN] stale data complained by ovn-controller after db restore

2020-08-04 Thread Tony Liu
Hi,

Here is how I restore OVN DB.
* Stop all ovn-nb-db, ovn-sb-db and ovn-northd services.
* Clean up all DB files.
* Start all DB services. Fresh ovn-nb-db and ovn-sb-db clusters are up and
  running.
* Set DB election timer to 10s.
* Restore DB to ovn-nb-db by ovsdb-client.
* Start all ovn-northd services.

A few minutes later, ovn-sb-db is fully synced with ovn-nb-db.

Now the clients of ovn-sb-db, ovn-controller and nova-compute, complain about
"stale data". The chassis node is not getting updated.

2020-08-04 09:07:45.892 26 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:10.6.20.84:6642: connected
2020-08-04 09:07:45.895 26 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:10.6.20.84:6642: clustered database server has stale data; trying another server


Restarting ovn-controller and nova-compute resolves the issue.

Is this expected? As part of the DB restore process, should I restart
ovn-controller and nova-compute on all chassis nodes?


Thanks!

Tony



Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Tony Liu
Thanks Numan for looking into it!
A probe is for health checking only. It is not supposed to trigger
translation, even with an incremental implementation. Translation should be
triggered only when an ovn-northd becomes active.


Tony

> -----Original Message-----
> From: Numan Siddique
> Sent: Tuesday, August 4, 2020 12:38 AM
> To: Tony Liu
> Cc: Han Zhou; ovs-dev; ovs-discuss
> Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> configuration update
> 
> 
> 
> On Tue, Aug 4, 2020 at 9:02 AM Tony Liu wrote:
> 
> 
>   The probe awakes recomputing?
>   There is probe every 5 seconds. Without any connection up/down or failover,
>   ovn-northd will recompute everything every 5 seconds, no matter what?
>   Really?
> 
>   Anyways, I will increase the probe interval for now, see if that helps.
> 
> 
> 
> I think we should optimise this case. I am planning to look into this.
> 
> Thanks
> Numan
> 
> 
> 
> 
>   Thanks!
> 
>   Tony
> 
>   > -----Original Message-----
>   > From: Han Zhou <hz...@ovn.org>
>   > Sent: Monday, August 3, 2020 8:22 PM
>   > To: Tony Liu
>   > Cc: Han Zhou <hz...@ovn.org>; ovs-discuss <ovs-discuss@openvswitch.org>;
>   > ovs-dev <ovs-d...@openvswitch.org>
>   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
>   > configuration update
>   >
>   > Sorry that I didn't make it clear enough. The OVSDB probe itself doesn't
>   > take much CPU, but the probe awakes ovn-northd main loop, which
>   > recompute everything, which is why you see CPU spike.
>   > It will be solved by incremental-processing, when only delta is
>   > processed, and in case of probe handling, there is no change in
>   > configuration, so the delta is zero.
>   > For now, please follow the steps to adjust probe interval, if the CPU of
>   > ovn-northd (when there is no configuration change) is a concern for you.
>   > But please remember that this has no impact to the real CPU usage for
>   > handling configuration changes.
>   >
>   >
>   > Thanks,
>   > Han
>   >
>   >
>   > On Mon, Aug 3, 2020 at 8:11 PM Tony Liu wrote:
>   >
>   >
>   >   Health check (5 sec interval) taking 30%-100% CPU is definitely not
>   >   acceptable, if that's really the case. There must be some blocking (and
>   >   not yielding CPU) in coding, which is not supposed to be there.
>   >
>   >   Could you point me to the coding for such health check?
>   >   Is it single thread? Does it use any event library?
>   >
>   >
>   >   Thanks!
>   >
>   >   Tony
>   >
>   >   > -----Original Message-----
>   >   > From: Han Zhou <hz...@ovn.org>
>   >   > Sent: Saturday, August 1, 2020 9:11 PM
>   >   > To: Tony Liu
>   >   > Cc: ovs-discuss <ovs-disc...@openvswitch.org>; ovs-dev <ovs-d...@openvswitch.org>
>   >   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
>   >   > configuration update
>   >   >
>   >   >
>   >   >
>   >   > On Fri, Jul 31, 2020 at 4:14 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
>   >   >
>   >   >
>   >   >   Hi,
>   >   >
>   >   >   I see the active ovn-northd takes much CPU (30% - 100%) when there
>   >   >   is no configuration from OpenStack, nothing happening on all chassis
>   >   >   nodes either.
>   >   >
>   >   >   Is this expected? What is it busy with?
>   >   >
>   >   >
>   >   >
>   >   >
>   >   > Yes, this is expected. It is due to the OVSDB probe between ovn-northd
>   >   > and NB/SB OVSDB servers, which is used to detect the OVSDB connection
>   >   > failure.
>   >   > Usually this is not a concern (unlike the probe with a large
>   > 

Re: [ovs-discuss] [ovs-dev] there are many error logs when processing igmp packet

2020-08-04 Thread Ben Pfaff
On Tue, Aug 04, 2020 at 11:55:19AM +, Frank Wang(王培辉) wrote:
> Hello, 
> 
> I found there are many error logs as follows in ovs-vswitchd.log:
> Aug  4 18:48:03 node-170 ovs-vswitchd:
> ovs|00100|odp_util(handler171)|ERR|internal error parsing flow key
> recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),c
> t_zone(0),ct_label(0),eth(src=6c:92:bf:14:e9:f8,
> dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=100.7.40.41,dst=224.0.0.22,
> proto=2,tos=0,ttl=1,frag=no)
> 
>It seems parse_l2_5_onward function return ODP_FIT_TOO_LITTLE when it
> processing igmp packet after digging,

ODP_FIT_TOO_LITTLE is expected for IGMP, because the kernel doesn't
parse the IGMP header, whereas userspace does.
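
A minimal sketch of the idea, assuming fitness is decided by comparing the
attributes userspace expects with the attributes the datapath key actually
carries. ODP_FIT_TOO_LITTLE is the value named in this thread; the other two
enum names are quoted from memory of OVS's odp_key_fitness and should be
checked against the source. The attribute labels and the fitness() helper are
hypothetical, not the real odp-util code.

```python
from enum import Enum


class Fitness(Enum):
    # TOO_LITTLE comes from the thread; the other names are an assumption
    # based on OVS's enum odp_key_fitness.
    PERFECT = "ODP_FIT_PERFECT"
    TOO_MUCH = "ODP_FIT_TOO_MUCH"
    TOO_LITTLE = "ODP_FIT_TOO_LITTLE"


def fitness(expected: set, present: set) -> Fitness:
    """Compare what userspace expects with what the datapath key carries."""
    if expected - present:
        return Fitness.TOO_LITTLE   # userspace parsed more than the key has
    if present - expected:
        return Fitness.TOO_MUCH     # the key has fields userspace ignores
    return Fitness.PERFECT


# Hypothetical attribute labels for the IGMP packet in the log above.
kernel_key = {"eth", "eth_type", "ipv4"}
userspace_expected = kernel_key | {"igmp"}   # userspace also parses IGMP

print(fitness(userspace_expected, kernel_key).value)  # ODP_FIT_TOO_LITTLE
```

In this model TOO_LITTLE on its own is not an error, which matches the point
that the logged "internal error" must come from somewhere else.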

>I'm wondering the reason to do this ,how to avoid this?

The log message shows that there is definitely a bug but it does not
sound like it is in parse_l2_5_onward().


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-04 Thread Gregory Rose



On 8/3/2020 12:01 PM, Johan Knöös via discuss wrote:

Hi Open vSwitch contributors,

We have found openvswitch is causing double-freeing of memory. The
issue was not present in kernel version 5.5.17 but is present in
5.6.14 and newer kernels.

After reverting the RCU commits below for debugging, enabling
slub_debug, lockdep, and KASAN, we see the warnings at the end of this
email in the kernel log (the last one shows the double-free). When I
revert 50b0e61b32ee890a75b4377d5fbe770a86d6a4c1 ("net: openvswitch:
fix possible memleak on destroy flow-table"), the symptoms disappear.
While I have a reliable way to reproduce the issue, I unfortunately
don't yet have a process that's amenable to sharing. Please take a
look.

189a6883dcf7 rcu: Remove kfree_call_rcu_nobatch()
77a40f97030b rcu: Remove kfree_rcu() special casing and lazy-callback handling
e99637becb2e rcu: Add support for debug_objects debugging for kfree_rcu()
0392bebebf26 rcu: Add multiple in-flight batches of kfree_rcu() work
569d767087ef rcu: Make kfree_rcu() use a non-atomic ->monitor_todo
a35d16905efc rcu: Add basic support for kfree_rcu() batching

Thanks,
Johan Knöös


Let's add the author of the patch you reverted and the Linux netdev
mailing list.

- Greg



Traces:

------------[ cut here ]------------
WARNING: CPU: 30 PID: 0 at net/openvswitch/flow_table.c:272
table_instance_flow_free+0x2fd/0x340 [openvswitch]
Modules linked in: ...
CPU: 30 PID: 0 Comm: swapper/30 Tainted: GE 5.6.14+ #18
Hardware name: ...
RIP: 0010:table_instance_flow_free+0x2fd/0x340 [openvswitch]
Code: c1 fa 1f 48 c1 e8 20 29 d0 41 39 c7 0f 8f 95 fe ff ff 48 83 c4
10 48 89 ef d1 fe 5b 5d 41 5c 41 5d 41 5e 41 5f e9 33 fb ff ff <0f> 0b
e9 59 fe ff ff 0f 0b e8 65 f1 fe ff 85 c0 0f 85 9b fe ff ff
RSP: 0018:888c3e589da8 EFLAGS: 00010246
RAX:  RBX: 889f954ee580 RCX: dc00
RDX: 0007 RSI: 0003 RDI: 0246
RBP: 888c295150a0 R08: 9297f341 R09: 
R10:  R11:  R12: 889f1ed55000
R13: 888b72efa020 R14: 888c24209480 R15: 888b731bb6f8
FS:  () GS:888c3e58() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0733feb8a700 CR3: 000ba726e004 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:

table_instance_destroy+0xf9/0x1b0 [openvswitch]
? new_vport+0xb0/0xb0 [openvswitch]
destroy_dp_rcu+0x12/0x50 [openvswitch]
rcu_core+0x34d/0x9b0
? rcu_all_qs+0x90/0x90
? rcu_read_lock_sched_held+0xa5/0xc0
? rcu_read_lock_bh_held+0xc0/0xc0
? run_rebalance_domains+0x11d/0x140
__do_softirq+0x128/0x55c
irq_exit+0x101/0x110
smp_apic_timer_interrupt+0xfd/0x2f0
apic_timer_interrupt+0xf/0x20

RIP: 0010:cpuidle_enter_state+0xda/0x5d0
Code: 80 7c 24 10 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 be 04
00 00 31 ff e8 c2 1a 7a ff e8 9d 4d 84 ff fb 66 0f 1f 44 00 00 <45> 85
ed 0f 88 b6 03 00 00 4d 63 f5 4b 8d 04 76 4e 8d 3c f5 00 00
RSP: 0018:888103f07d58 EFLAGS: 0246 ORIG_RAX: ff13
RAX:  RBX: 888c3e5c1800 RCX: dc00
RDX: 0007 RSI: 0006 RDI: 888103ec88d4
RBP: 945a3940 R08: 92982042 R09: 
R10:  R11:  R12: 0002
R13: 0002 R14: 00d0 R15: 945a3a10
? lockdep_hardirqs_on+0x182/0x260
? cpuidle_enter_state+0xd3/0x5d0
cpuidle_enter+0x3c/0x60
do_idle+0x36a/0x450
? arch_cpu_idle_exit+0x40/0x40
cpu_startup_entry+0x19/0x20
start_secondary+0x21f/0x290
? set_cpu_sibling_map+0xcb0/0xcb0
secondary_startup_64+0xa4/0xb0
irq event stamp: 1626911
hardirqs last  enabled at (1626910): [] __call_rcu+0x1b7/0x3b0
hardirqs last disabled at (1626911): []
trace_hardirqs_off_thunk+0x1a/0x1c
softirqs last  enabled at (1626882): [] irq_enter+0x75/0x80
softirqs last disabled at (1626883): [] irq_exit+0x101/0x110
---[ end trace 8dc48dec48bb79c0 ]---


---


=
WARNING: suspicious RCU usage
5.6.14+ #18 Tainted: GW   E
-
net/openvswitch/flow_table.c:239 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by swapper/30/0:
#0: 94315e00 (rcu_callback){}, at: rcu_core+0x395/0x9b0

stack backtrace:
CPU: 30 PID: 0 Comm: swapper/30 Tainted: GW   E 5.6.14+ #18
Hardware name: ...
Call Trace:

dump_stack+0xb8/0x110
table_instance_flow_free+0x332/0x340 [openvswitch]
table_instance_destroy+0xf9/0x1b0 [openvswitch]
? new_vport+0xb0/0xb0 [openvswitch]
destroy_dp_rcu+0x12/0x50 [openvswitch]
rcu_core+0x34d/0x9b0
? rcu_all_qs+0x90/0x90
? rcu_read_lock_sched_held+0xa5/0xc0
? 

[ovs-discuss] there are many error logs when processing igmp packet

2020-08-04 Thread 王培辉
Hello, 

I found there are many error logs as follows in ovs-vswitchd.log:
Aug  4 18:48:03 node-170 ovs-vswitchd: ovs|00100|odp_util(handler171)|ERR|internal error parsing flow key recirc_id(0),dp_hash(0),skb_priority(0),in_port(5),skb_mark(0),ct_state(0),ct_zone(0),ct_label(0),eth(src=6c:92:bf:14:e9:f8,dst=01:00:5e:00:00:16),eth_type(0x0800),ipv4(src=100.7.40.41,dst=224.0.0.22,proto=2,tos=0,ttl=1,frag=no)

After some digging, it seems that the parse_l2_5_onward() function returns
ODP_FIT_TOO_LITTLE when it processes an IGMP packet.
I'm wondering why it does this and how to avoid it.


Thanks 




Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-04 Thread Tonghao Zhang
On Tue, Aug 4, 2020 at 3:02 AM Johan Knöös  wrote:
>
> Hi Open vSwitch contributors,
>
> We have found openvswitch is causing double-freeing of memory. The
> issue was not present in kernel version 5.5.17 but is present in
> 5.6.14 and newer kernels.
>
> After reverting the RCU commits below for debugging, enabling
> slub_debug, lockdep, and KASAN, we see the warnings at the end of this
> email in the kernel log (the last one shows the double-free). When I
> revert 50b0e61b32ee890a75b4377d5fbe770a86d6a4c1 ("net: openvswitch:
> fix possible memleak on destroy flow-table"), the symptoms disappear.
> While I have a reliable way to reproduce the issue, I unfortunately
> don't yet have a process that's amenable to sharing. Please take a
> look.
>
> 189a6883dcf7 rcu: Remove kfree_call_rcu_nobatch()
> 77a40f97030b rcu: Remove kfree_rcu() special casing and lazy-callback handling
> e99637becb2e rcu: Add support for debug_objects debugging for kfree_rcu()
> 0392bebebf26 rcu: Add multiple in-flight batches of kfree_rcu() work
> 569d767087ef rcu: Make kfree_rcu() use a non-atomic ->monitor_todo
> a35d16905efc rcu: Add basic support for kfree_rcu() batching
Thanks, I will take a look.
> Thanks,
> Johan Knöös
>
> Traces:
>
> [ cut here ]
> WARNING: CPU: 30 PID: 0 at net/openvswitch/flow_table.c:272
> table_instance_flow_free+0x2fd/0x340 [openvswitch]
> Modules linked in: ...
> CPU: 30 PID: 0 Comm: swapper/30 Tainted: GE 5.6.14+ #18
> Hardware name: ...
> RIP: 0010:table_instance_flow_free+0x2fd/0x340 [openvswitch]
> Code: c1 fa 1f 48 c1 e8 20 29 d0 41 39 c7 0f 8f 95 fe ff ff 48 83 c4
> 10 48 89 ef d1 fe 5b 5d 41 5c 41 5d 41 5e 41 5f e9 33 fb ff ff <0f> 0b
> e9 59 fe ff ff 0f 0b e8 65 f1 fe ff 85 c0 0f 85 9b fe ff ff
> RSP: 0018:888c3e589da8 EFLAGS: 00010246
> RAX:  RBX: 889f954ee580 RCX: dc00
> RDX: 0007 RSI: 0003 RDI: 0246
> RBP: 888c295150a0 R08: 9297f341 R09: 
> R10:  R11:  R12: 889f1ed55000
> R13: 888b72efa020 R14: 888c24209480 R15: 888b731bb6f8
> FS:  () GS:888c3e58() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0733feb8a700 CR3: 000ba726e004 CR4: 003606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
> 
> table_instance_destroy+0xf9/0x1b0 [openvswitch]
> ? new_vport+0xb0/0xb0 [openvswitch]
> destroy_dp_rcu+0x12/0x50 [openvswitch]
> rcu_core+0x34d/0x9b0
> ? rcu_all_qs+0x90/0x90
> ? rcu_read_lock_sched_held+0xa5/0xc0
> ? rcu_read_lock_bh_held+0xc0/0xc0
> ? run_rebalance_domains+0x11d/0x140
> __do_softirq+0x128/0x55c
> irq_exit+0x101/0x110
> smp_apic_timer_interrupt+0xfd/0x2f0
> apic_timer_interrupt+0xf/0x20
> 
> RIP: 0010:cpuidle_enter_state+0xda/0x5d0
> Code: 80 7c 24 10 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 be 04
> 00 00 31 ff e8 c2 1a 7a ff e8 9d 4d 84 ff fb 66 0f 1f 44 00 00 <45> 85
> ed 0f 88 b6 03 00 00 4d 63 f5 4b 8d 04 76 4e 8d 3c f5 00 00
> RSP: 0018:888103f07d58 EFLAGS: 0246 ORIG_RAX: ff13
> RAX:  RBX: 888c3e5c1800 RCX: dc00
> RDX: 0007 RSI: 0006 RDI: 888103ec88d4
> RBP: 945a3940 R08: 92982042 R09: 
> R10:  R11:  R12: 0002
> R13: 0002 R14: 00d0 R15: 945a3a10
> ? lockdep_hardirqs_on+0x182/0x260
> ? cpuidle_enter_state+0xd3/0x5d0
> cpuidle_enter+0x3c/0x60
> do_idle+0x36a/0x450
> ? arch_cpu_idle_exit+0x40/0x40
> cpu_startup_entry+0x19/0x20
> start_secondary+0x21f/0x290
> ? set_cpu_sibling_map+0xcb0/0xcb0
> secondary_startup_64+0xa4/0xb0
> irq event stamp: 1626911
> hardirqs last  enabled at (1626910): [] 
> __call_rcu+0x1b7/0x3b0
> hardirqs last disabled at (1626911): []
> trace_hardirqs_off_thunk+0x1a/0x1c
> softirqs last  enabled at (1626882): [] irq_enter+0x75/0x80
> softirqs last disabled at (1626883): [] irq_exit+0x101/0x110
> ---[ end trace 8dc48dec48bb79c0 ]---
>
>
> ---
>
>
> =
> WARNING: suspicious RCU usage
> 5.6.14+ #18 Tainted: GW   E
> -
> net/openvswitch/flow_table.c:239 suspicious rcu_dereference_protected() usage!
> other info that might help us debug this:
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by swapper/30/0:
> #0: 94315e00 (rcu_callback){}, at: rcu_core+0x395/0x9b0
>
> stack backtrace:
> CPU: 30 PID: 0 Comm: swapper/30 Tainted: GW   E 5.6.14+ #18
> Hardware name: ...
> Call Trace:
> 
> dump_stack+0xb8/0x110
> table_instance_flow_free+0x332/0x340 [openvswitch]
> table_instance_destroy+0xf9/0x1b0 [openvswitch]
> ? new_vport+0xb0/0xb0 

Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no configuration update

2020-08-04 Thread Numan Siddique
On Tue, Aug 4, 2020 at 9:02 AM Tony Liu  wrote:

> The probe awakes recomputing?
> There is probe every 5 seconds. Without any connection up/down or failover,
> ovn-northd will recompute everything every 5 seconds, no matter what?
> Really?
>
> Anyways, I will increase the probe interval for now, see if that helps.
>

I think we should optimise this case. I am planning to look into this.

Thanks
Numan


>
>
> Thanks!
>
> Tony
>
> > -----Original Message-----
> > From: Han Zhou
> > Sent: Monday, August 3, 2020 8:22 PM
> > To: Tony Liu
> > Cc: Han Zhou; ovs-discuss; ovs-dev
> > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> > configuration update
> >
> > Sorry that I didn't make it clear enough. The OVSDB probe itself doesn't
> > take much CPU, but the probe awakes ovn-northd main loop, which
> > recompute everything, which is why you see CPU spike.
> > It will be solved by incremental-processing, when only delta is
> > processed, and in case of probe handling, there is no change in
> > configuration, so the delta is zero.
> > For now, please follow the steps to adjust probe interval, if the CPU of
> > ovn-northd (when there is no configuration change) is a concern for you.
> > But please remember that this has no impact to the real CPU usage for
> > handling configuration changes.
> >
> >
> > Thanks,
> > Han
> >
> >
> > On Mon, Aug 3, 2020 at 8:11 PM Tony Liu wrote:
> >
> >
> >   Health check (5 sec interval) taking 30%-100% CPU is definitely not
> >   acceptable, if that's really the case. There must be some blocking (and
> >   not yielding CPU) in coding, which is not supposed to be there.
> >
> >   Could you point me to the coding for such health check?
> >   Is it single thread? Does it use any event library?
> >
> >
> >   Thanks!
> >
> >   Tony
> >
> >   > -----Original Message-----
> >   > From: Han Zhou <hz...@ovn.org>
> >   > Sent: Saturday, August 1, 2020 9:11 PM
> >   > To: Tony Liu
> >   > Cc: ovs-discuss <ovs-disc...@openvswitch.org>; ovs-dev <ovs-d...@openvswitch.org>
> >   > Subject: Re: [ovs-discuss] [OVN] ovn-northd takes much CPU when no
> >   > configuration update
> >   >
> >   >
> >   >
> >   > On Fri, Jul 31, 2020 at 4:14 PM Tony Liu <tonyliu0...@hotmail.com> wrote:
> >   >
> >   >
> >   >   Hi,
> >   >
> >   >   I see the active ovn-northd takes much CPU (30% - 100%) when there
> >   >   is no configuration from OpenStack, nothing happening on all chassis
> >   >   nodes either.
> >   >
> >   >   Is this expected? What is it busy with?
> >   >
> >   >
> >   >
> >   >
> >   > Yes, this is expected. It is due to the OVSDB probe between ovn-northd
> >   > and NB/SB OVSDB servers, which is used to detect the OVSDB connection
> >   > failure.
> >   > Usually this is not a concern (unlike the probe with a large number of
> >   > ovn-controller clients), because ovn-northd is a centralized component
> >   > and the CPU cost when there is no configuration change doesn't matter
> >   > that much. However, if it is a concern, the probe interval (default 5
> >   > sec) can be changed.
> >   > If you change, remember to change on both server side and client side.
> >   > For client side (ovn-northd), it is configured in the NB DB's NB_Global
> >   > table's options:northd_probe_interval. See man page of ovn-nb(5).
> >   > For server side (NB and SB), it is configured in the NB and SB DB's
> >   > Connection table's inactivity_probe column.
> >   >
> >   > Thanks,
> >   > Han
> >   >
> >   >
> >   >
> >   >
> >   >   2020-07-31T23:08:09.511Z|04267|poll_loop|DBG|wakeup due to [POLLIN] on fd 8 (10.6.20.84:44358<->10.6.20.84:6641) at lib/stream-fd.c:157 (68% CPU usage)
> >   >   2020-07-31T23:08:09.512Z|04268|jsonrpc|DBG|tcp:10.6.20.84:6641: received request, method="echo", params=[], id="echo"
> >   >   2020-07-31T23:08:09.512Z|04269|jsonrpc|DBG|tcp:10.6.20.84:6641: send reply, result=[], id="echo"
> >   >   2020-07-31T23:08:12.777Z|04270|poll_loop|DBG|wakeup due to [POLLIN] on fd 9 (10.6.20.84:49158<->10.6.20.85:6642

Re: [ovs-discuss] [OVN] no response to inactivity probe

2020-08-04 Thread Numan Siddique
On Tue, Aug 4, 2020 at 9:12 AM Tony Liu wrote:

> In my deployment, there are 13 Neutron server processes on each Neutron
> server node.
> I see 12 of them (monitor, maintenance, RPC, API) connect to both ovn-nb-db
> and ovn-sb-db. With 3 Neutron server nodes, that's 36 OVSDB clients.
> Are that many clients OK?
>
> Any suggestions on how to figure out which side doesn't respond to the
> probe, if it's bi-directional? I don't see any activity in the logs other
> than connect/drop and reconnect...
>
> BTW, please let me know if this is not the right place to discuss the
> Neutron OVN ML2 driver.
>
>
> Thanks!
>
> Tony
>
> > -----Original Message-----
> > From: dev On Behalf Of Tony Liu
> > Sent: Monday, August 3, 2020 7:45 PM
> > To: ovs-discuss; ovs-dev <d...@openvswitch.org>
> > Subject: [ovs-dev] [OVN] no response to inactivity probe
> >
> > Hi,
> >
> > Neutron OVN ML2 driver was disconnected by ovn-nb-db. There are many
> > error messages from ovn-nb-db leader.
> > 
> > 2020-08-04T02:31:39.751Z|03138|reconnect|ERR|tcp:10.6.20.81:58620: no
> > response to inactivity probe after 5 seconds, disconnecting
> > 2020-08-04T02:31:42.484Z|03139|reconnect|ERR|tcp:10.6.20.81:58300: no
> > response to inactivity probe after 5 seconds, disconnecting
> > 2020-08-04T02:31:49.858Z|03140|reconnect|ERR|tcp:10.6.20.81:59582: no
> > response to inactivity probe after 5 seconds, disconnecting
> > 2020-08-04T02:31:53.057Z|03141|reconnect|ERR|tcp:10.6.20.83:42626: no
> > response to inactivity probe after 5 seconds, disconnecting
> > 2020-08-04T02:31:53.058Z|03142|reconnect|ERR|tcp:10.6.20.82:45412: no
> > response to inactivity probe after 5 seconds, disconnecting
> > 2020-08-04T02:31:54.067Z|03143|reconnect|ERR|tcp:10.6.20.81:59416: no
> > response to inactivity probe after 5 seconds, disconnecting
> > 2020-08-04T02:31:54.809Z|03144|reconnect|ERR|tcp:10.6.20.81:60004: no
> > response to inactivity probe after 5 seconds, disconnecting 
> >
> > Could anyone share some details on how this inactivity probe works?
>

The inactivity probe is sent by both the server and the clients,
independently. That is, ovsdb-server sends an inactivity probe every 'x'
configured seconds to each of its connected clients, and if it doesn't get
a reply from a client within some time, it drops that connection.

The inactivity probe from the server side can be configured. Run
"ovn-nbctl list connection" and you will see the inactivity_probe column.
You can set this column to the desired value in milliseconds, like -
ovn-nbctl set connection . inactivity_probe=30000 (for 30 seconds)

The same applies to the SB ovsdb-server (using ovn-sbctl).
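
As a rough sketch of the server-side change for both databases, the
snippet below only prints the commands so they can be reviewed before
being run on a node with access to the NB and SB databases. The helper
names probe_ms and server_probe_cmds are made up for this example. It
relies on inactivity_probe being expressed in milliseconds, as described
in ovn-nb(5) and ovn-sb(5).

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands that set the
# server-side inactivity probe on the NB and SB ovsdb-servers.
# probe_ms and server_probe_cmds are hypothetical helper names.

# Convert whole seconds to milliseconds, the unit of inactivity_probe.
probe_ms() {
    echo $(( $1 * 1000 ))
}

# Print one "set connection" command per database for the given interval.
# "." selects the only row, so this assumes a single Connection row exists.
server_probe_cmds() {
    ms=$(probe_ms "$1")
    echo "ovn-nbctl set connection . inactivity_probe=$ms"
    echo "ovn-sbctl set connection . inactivity_probe=$ms"
}

server_probe_cmds 30
```

Piping the printed lines to sh would apply them. Reviewing them first is
safer on a clustered deployment.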

Similarly, each client (ovn-northd, ovn-controller, neutron server) sends
an inactivity probe every 'y' seconds, and if the client doesn't get a
reply from ovsdb-server it drops the connection and reconnects.

For ovn-northd you can configure this (in milliseconds) as -
ovn-nbctl set NB_Global . options:northd_probe_interval=30000

For ovn-controllers (also in milliseconds) -
ovs-vsctl set open . external_ids:ovn-remote-probe-interval=30000

There is also a probe interval for the OpenFlow connection from
ovn-controller to ovs-vswitchd, which you can configure as -
ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=30 (this is
in seconds)
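
The three client-side settings above use different units, which is easy
to get wrong. Here is a small dry-run sketch that takes one interval in
seconds and prints each command with the value converted to the unit that
option expects. The function name client_probe_cmds is made up for this
example, and nothing is executed.

```shell
#!/bin/sh
# Dry-run sketch: print the client-side probe settings for one interval.
# client_probe_cmds is a hypothetical helper name; nothing is executed.
client_probe_cmds() {
    secs=$1
    ms=$(( secs * 1000 ))
    # ovn-northd -> NB/SB databases: value is in milliseconds.
    echo "ovn-nbctl set NB_Global . options:northd_probe_interval=$ms"
    # ovn-controller -> SB database: milliseconds, set on each chassis.
    echo "ovs-vsctl set open . external_ids:ovn-remote-probe-interval=$ms"
    # ovn-controller -> ovs-vswitchd (OpenFlow): value is in seconds.
    echo "ovs-vsctl set open . external_ids:ovn-openflow-probe-interval=$secs"
}

client_probe_cmds 30
```

The ovn-nbctl line is run once against the NB database. The two ovs-vsctl
lines have to be run on every compute node.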

Regarding the Neutron server, I think its probe interval is set to 60
seconds. Please see this -
https://github.com/openstack/neutron/blob/master/neutron/conf/plugins/ml2/drivers/ovn/ovn_conf.py#L80

From the logs you shared, it looks like ovsdb-server is not getting the
probe reply from the Neutron server within 5 seconds and hence is
disconnecting. I'm not sure why that is happening, though.

You can try increasing the inactivity probe interval on the ovsdb-server
side with the first command I shared.
Note: If "ovn-nbctl list connection" returns empty, you need to create a
connection row first, like - ovn-nbctl set-connection ptcp:6641:


Thanks
Numan



> > From the OVN ML2 driver log, I see it connected to the leader, then the
> > connection was closed by the leader after 5 or 6 seconds. Is this probe
> > one-way or two-way?
> > Neither side is busy or using much CPU. Not sure how this could happen.
> > Any thoughts?
> >
> >
> > Thanks!
> >
> > Tony
> >
> >
> >
> > ___
> > dev mailing list
> > d...@openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> ___
> discuss mailing list
> disc...@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>
>