Re: [ovs-discuss] ovn-controller consuming lots of CPU

2017-12-20 Thread Ben Pfaff
On Tue, Dec 12, 2017 at 01:26:33PM -0800, Kevin Lin wrote:
> Hi again,
> 
> We’re trying to scale up our OVN deployment and we’re seeing some worrying 
> log messages. 
> The topology is 32 containers connected to another 32 containers on 10 
> different ports. This is running on 17 machines (one machine runs ovn-northd 
> and ovsdb-server, the other 16 run ovn-controller, ovs-vswitchd, and 
> ovsdb-server). We’re using an address set for the source group, but not the 
> destination group. We’re also creating a different ACL for each port. So the 
> ACLs look like:
> One address set for { container1, container2, … container32 }
> addressSet -> container1 on port 80
> addressSet -> container1 on port 81
> …
> addressSet -> container1 on port 90
> addressSet -> container2 on port 80
> …
> addressSet -> container32 on port 90

Hmm, can you help me understand better what's going on?  I'd like to
help.

It sounds like this strategy for setting up ACL entries will not scale
well.  I guess you have 32*10 of them now, which doesn't bode well as
the number of containers or ports increases.  But that is not the
current problem.
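
(For my own understanding, I'm picturing the ACLs as something like the
following ovn-nbctl commands; the switch name "sw0", the address set
name "as_containers", the addresses, the priority, and the "allow"
verdict are all placeholders I made up, so please correct me if this
isn't what you're doing:

    ovn-nbctl create Address_Set name=as_containers \
        addresses='"10.0.0.1", "10.0.0.2", "10.0.0.3"'
    # one ACL per destination container per port, e.g. container1 on port 80:
    ovn-nbctl acl-add sw0 to-lport 1000 \
        'ip4.src == $as_containers && ip4.dst == 10.0.0.1 && tcp.dst == 80' allow
    ovn-nbctl acl-add sw0 to-lport 1000 \
        'ip4.src == $as_containers && ip4.dst == 10.0.0.1 && tcp.dst == 81' allow
    # ...and so on, 32 destinations x 10 ports = 320 ACLs
)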

After some reflection, I think that the easiest way to debug this might
be if you are willing to provide a copy of a northbound database that
exhibits the problem (ideally by attaching the DB file rather than some
kind of higher-level dump of it).  Is that OK?  If you do not want to
post it publicly to the mailing list, you can send it to me off-list and
I will keep it confidential.
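
(If it helps: on a stock install the northbound database file usually
lives somewhere like /etc/openvswitch/ovnnb_db.db on the host running
the OVN ovsdb-server, though the exact path depends on how OVS was
packaged, so please double-check. A plain copy of that file, e.g.
"cp /etc/openvswitch/ovnnb_db.db ovnnb_db.db.copy", is all I need.)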

Thanks,

Ben.


Re: [ovs-discuss] ovn-controller consuming lots of CPU

2017-12-20 Thread Russell Bryant
You're right that my patch was a bit of a short term hack.  It
addressed a particular case where we could easily discard flows when
we know they could never match (because of the logical port included
in the match).

Using conjunctive matches is a better, more general solution to the
explosion of flows when using address sets, but that hasn't been
worked on yet.
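
To sketch the idea (this is hand-written ovs-ofctl syntax just to
illustrate the mechanism, not what ovn-controller generates today, and
the table numbers and addresses are made up): instead of expanding the
cross product of sources and destination ports, each dimension gets its
own small set of flows plus one flow keyed on the conjunction ID:

    # dimension 1 of 2: one flow per source address in the set
    ovs-ofctl add-flow br-int 'table=44,priority=100,ip,nw_src=10.0.0.1,actions=conjunction(1,1/2)'
    ovs-ofctl add-flow br-int 'table=44,priority=100,ip,nw_src=10.0.0.2,actions=conjunction(1,1/2)'
    # dimension 2 of 2: one flow per destination port
    ovs-ofctl add-flow br-int 'table=44,priority=100,tcp,tp_dst=80,actions=conjunction(1,2/2)'
    ovs-ofctl add-flow br-int 'table=44,priority=100,tcp,tp_dst=81,actions=conjunction(1,2/2)'
    # a packet that matches one flow from each dimension then matches here
    ovs-ofctl add-flow br-int 'table=44,priority=100,conj_id=1,actions=resubmit(,45)'

So an N-address set crossed with M ports costs roughly N+M flows
instead of N*M.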


On Wed, Dec 13, 2017 at 5:00 PM, Kevin Lin  wrote:
> Thanks for the replies!
>
> We’re using v2.8.1.
>
> I don’t completely understand Russell’s patch, but I don’t think our ACLs
> were taking advantage of it. Do the ACLs need to be “tagged” with port
> information in order for it to be useful?
>
> Before, our ACLs were in terms of L3 and above. I brought down the number of
> flows from 252981 to 16224 by modifying the ACLs to include the logical
> port. For example:
>
> from-lport: (inport == “1” || inport == “2” || inport == “3”) && (ip4.dst ==
> $addressSet)
> to-lport: (ip4.src == $addressSet) && (outport == “1” || outport == “2” ||
> outport == “3”)
>
>
> Is that the right way to take advantage of Russell’s patch? This explodes
> the address set on one side of the connection, but does decrease the number
> of flows installed on each vswitchd.
>
> Even with the decrease in flows in vswitchd, I’m still seeing the log
> messages. Do 16224 flows per vswitchd instance and 559 flows in ovn-sb
> sound reasonable?
>
> Thanks,
> —Kevin
>
> On Dec 13, 2017, at 7:55 AM, Mark Michelson  wrote:
>
> On 12/12/2017 03:26 PM, Kevin Lin wrote:
>
> Hi again,
> We’re trying to scale up our OVN deployment and we’re seeing some worrying
> log messages.
> The topology is 32 containers connected to another 32 containers on 10
> different ports. This is running on 17 machines (one machine runs ovn-northd
> and ovsdb-server, the other 16 run ovn-controller, ovs-vswitchd, and
> ovsdb-server). We’re using an address set for the source group, but not the
> destination group. We’re also creating a different ACL for each port. So the
> ACLs look like:
> One address set for { container1, container2, … container32 }
> addressSet -> container1 on port 80
> addressSet -> container1 on port 81
> …
> addressSet -> container1 on port 90
> addressSet -> container2 on port 80
> …
> addressSet -> container32 on port 90
> The ovn-controller log:
> 2017-12-12T20:14:49Z|11878|timeval|WARN|Unreasonably long 1843ms poll
> interval (1840ms user, 0ms system)
> 2017-12-12T20:14:49Z|11879|timeval|WARN|disk: 0 reads, 16 writes
> 2017-12-12T20:14:49Z|11880|timeval|WARN|context switches: 0 voluntary, 21
> involuntary
> 2017-12-12T20:14:49Z|11881|poll_loop|DBG|wakeup due to [POLLIN] on fd 9
> (172.31.11.193:48460<->172.31.2.181:6640) at lib/stream-fd.c:157 (36% CPU
> usage)
> 2017-12-12T20:14:49Z|11882|poll_loop|DBG|wakeup due to [POLLIN] on fd 12
> (<->/var/run/openvswitch/br-int.mgmt) at lib/stream-fd.c:157 (36% CPU usage)
> 2017-12-12T20:14:49Z|11883|jsonrpc|DBG|tcp:172.31.2.181:6640: received
> reply, result=[], id="echo"
> 2017-12-12T20:14:49Z|11884|netlink_socket|DBG|nl_sock_transact_multiple__
> (Success): nl(len:36, type=38(family-defined), flags=9[REQUEST][ECHO],
> seq=b11, pid=2268452876
> 2017-12-12T20:14:49Z|11885|netlink_socket|DBG|nl_sock_recv__ (Success):
> nl(len:136, type=36(family-defined), flags=0, seq=b11, pid=2268452876
> 2017-12-12T20:14:49Z|11886|vconn|DBG|unix:/var/run/openvswitch/br-int.mgmt:
> received: OFPT_ECHO_REQUEST (OF1.3) (xid=0x0): 0 bytes of payload
> 2017-12-12T20:14:49Z|11887|vconn|DBG|unix:/var/run/openvswitch/br-int.mgmt:
> sent (Success): OFPT_ECHO_REPLY (OF1.3) (xid=0x0): 0 bytes of payload
> 2017-12-12T20:14:51Z|11888|timeval|WARN|Unreasonably long 1851ms poll
> interval (1844ms user, 8ms system)
> 2017-12-12T20:14:51Z|11889|timeval|WARN|context switches: 0 voluntary, 11
> involuntary
> 2017-12-12T20:14:52Z|11890|poll_loop|DBG|wakeup due to [POLLIN] on fd 9
> (172.31.11.193:48460<->172.31.2.181:6640) at lib/stream-fd.c:157 (73% CPU
> usage)
> 2017-12-12T20:14:52Z|11891|jsonrpc|DBG|tcp:172.31.2.181:6640: received
> request, method="echo", params=[], id="echo"
> 2017-12-12T20:14:52Z|11892|jsonrpc|DBG|tcp:172.31.2.181:6640: send reply,
> result=[], id="echo"
> 2017-12-12T20:14:52Z|11893|netlink_socket|DBG|nl_sock_transact_multiple__
> (Success): nl(len:36, type=38(family-defined), flags=9[REQUEST][ECHO],
> seq=b12, pid=2268452876
> 2017-12-12T20:14:52Z|11894|netlink_socket|DBG|nl_sock_recv__ (Success):
> nl(len:136, type=36(family-defined), flags=0, seq=b12, pid=2268452876
> 2017-12-12T20:14:52Z|11895|netdev_linux|DBG|Dropped 18 log messages in last
> 56 seconds (most recently, 3 seconds ago) due to excessive rate
> 2017-12-12T20:14:52Z|11896|netdev_linux|DBG|unknown qdisc "mq"
> 2017-12-12T20:14:54Z|11897|hmap|DBG|Dropped 15511 log messages in last 6
> seconds (most recently, 0 seconds ago) due to excessive rate
> 2017-12-12T20:14:54Z|11898|hmap|DBG|ovn/lib/expr.c:2644: 6 nodes in bucket
> (128 nodes, 64 buckets)
>