Hey Russell,
 
I'm still digging into the details of what's happening, but I'll share what I know so far. We've been running various scale tests and they all generally run into this issue at some point. The one I'm using to reliably reproduce it now is a simple test that tries to create 4k routers connected to private networks. The test goes well for a while (the last run made it to 2.8k routers/networks) and then some failure occurs that causes the neutron-server processes to spin at 100% CPU and all Neutron API calls to hang. Once this happens, restarting neutron-server doesn't help; it just returns to 100% CPU usage. After restarting neutron-server I do see a traceback where a select call didn't return the expected 3-tuple, so unpacking the result fails; later it complains about ports that exist in the Neutron DB but not the OVN DB. I've attached a log that shows those errors.
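For context on that traceback: Python's select.select() returns a 3-tuple of (readable, writable, exceptional) lists, so any code that unpacks its result breaks with a ValueError if something (e.g. a wrapped or monkey-patched select) hands back a different shape. A minimal sketch of the pattern — this is illustrative only, not the actual Neutron code:

```python
import os
import select

# select.select() normally returns a 3-tuple:
# (readable, writable, exceptional).
r_pipe, w_pipe = os.pipe()
os.write(w_pipe, b"ping")

r, w, x = select.select([r_pipe], [], [], 1.0)
assert r == [r_pipe]  # the read end is ready

# Simulate the failure mode: a result that is not a 3-tuple,
# so the usual 3-way unpack raises ValueError.
bad_result = ([r_pipe], [])  # only 2 elements
try:
    readable, writable, exceptional = bad_result
except ValueError as exc:
    print("unpack failed:", exc)

os.close(r_pipe)
os.close(w_pipe)
```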
 
Some notes on this setup:
1) This is using stable/liberty Neutron. We have a separate environment on master that we're testing with too, but I don't have results from that right now.
2) There are 10 neutron-server nodes, each running 2 RPC workers and 4 API workers. Load on them looks reasonable up to the point of failure, then suddenly jumps.
3) After the failure happens, ovsdb-server on the node running ovn-northd is somewhat busy (30-60% CPU), but more importantly it's transmitting around 7-8 Gbps on port 6640 over a 2x10 Gbps bonded link.
4) Here is the test I'm running: https://github.com/mestery/openstack-scripts/blob/master/ovn/create-routers.sh
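For anyone who doesn't want to follow the link, the test amounts to a loop over the standard neutron CLI. Here's a rough Python reconstruction that just builds the command list rather than executing it — the real create-routers.sh is a shell script and may differ in names, CIDRs, and arguments:

```python
def build_commands(n):
    """Return the neutron CLI commands to create n router/network pairs.

    Hypothetical reconstruction of create-routers.sh: one private
    network, subnet, and router per iteration, with the router attached
    to the subnet. The real script may differ in details.
    """
    cmds = []
    for i in range(1, n + 1):
        cmds.append(f"neutron net-create net-{i}")
        # CIDRs chosen here just to stay unique for i < 62500.
        cmds.append(f"neutron subnet-create net-{i} "
                    f"10.{(i // 250) % 250}.{i % 250}.0/24 --name subnet-{i}")
        cmds.append(f"neutron router-create router-{i}")
        cmds.append(f"neutron router-interface-add router-{i} subnet-{i}")
    return cmds

cmds = build_commands(4000)
print(len(cmds))  # 16000 commands for the full 4k-router run
```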
 
- Matt
 
----- Original message -----
From: Kyle Mestery <mest...@mestery.com>
To: Russell Bryant <russ...@ovn.org>, Matthew Mulsow/Austin/IBM@IBMUS
Cc: Ryan Moats/Omaha/IBM@IBMUS, discuss <discuss@openvswitch.org>
Subject: Re: [ovs-discuss] ovsdb behavior under ovn management plane scaling
Date: Thu, Jan 28, 2016 2:59 PM
 
On Thu, Jan 28, 2016 at 2:51 PM, Russell Bryant <russ...@ovn.org> wrote:
> On 01/28/2016 03:36 PM, Ryan Moats wrote:
>> Yes, that was the first bottleneck we hit and we've taken the work
>> that led to your RFC and gone looking for the next bottleneck, which
>> now appears to be communications between the networking-ovn plugin
>> and ovn-northd.  The first step in that path is from the plugin to
>> ovsdb-server, so I view my initial post as one facet of the
>> problem...
>
> OK, thanks.
>
> What behavior are you seeing between the Neutron plugin and the
> northbound database exactly that makes it the bottleneck?  Is
> neutron-server maxed out trying to keep up with the request load?  or
> ovsdb-server?  or?
>
> Also, are you running multiple Neutron API workers?
>
Copying Matt to get the answers to these questions.

Thanks,
Kyle

> --
> Russell Bryant
> _______________________________________________
> discuss mailing list
> discuss@openvswitch.org
> http://openvswitch.org/mailman/listinfo/discuss

 
 

Attachment: server.log
Description: Binary data
