** Changed in: neutron
       Status: Fix Committed => Fix Released

** Changed in: neutron
    Milestone: None => liberty-2
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1368281

Title:
  Scalability issue using neutron-linuxbridge-agent

Status in neutron:
  Fix Released

Bug description:
  Hi all,

  In a cluster of 20 nodes configured to use the neutron-linuxbridge-agent, we ran a load test that launched many instances at once (e.g. 100, 200, ..., 500). We noticed that the more instances we launched at a time, the more of them ended up in ERROR state; for example, with 500 instances, roughly 250 ended up in ERROR state.

  Analysis:
  =========
  All the instances that ended up in ERROR state did so because of a timeout while waiting for neutron to confirm the vif was plugged. I should note that at this point we had already increased every relevant option we could find in both nova and neutron, e.g. rpc_workers, rpc_response_timeout, vif_plugging_timeout, the SQLAlchemy pool_size, ..., all without luck.

  Further analysis of the neutron code revealed that the bottleneck is the neutron-linuxbridge-agent, which does not set the port to UP in time. Digging deeper, we found the cause to be the storm of RPC calls triggered by port creation: each port created sends a fanout of the security_groups_member_updated message to all neutron-linuxbridge-agents (around 24 agents in our case), asking each of them to pull the new security group rules for every port (tap device) it hosts. That in turn triggers iptables changes (did I mention that we use the iptables security group driver? :)), and the result is lock contention between all the greenlets trying to acquire the iptables driver lock.

  Thankfully, this problem is not new to neutron; in fact, it was already fixed for the ovs-agent in https://github.com/openstack/neutron/commit/3046c4ae22b10f9e4fa83a47bfe089554d4a4681. The same idea was implemented by us (i.e.
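  To illustrate the idea, here is a minimal sketch of deferred iptables application. All names here are illustrative, not neutron's real API: the point is only that, while deferral is on, many rule edits are collapsed into a single (locked) apply instead of one apply per RPC notification:

  ```python
  import threading


  class IptablesManager:
      """Toy stand-in for an iptables manager (illustrative names only).

      Applying rules is serialized by a lock; in deferred mode, rule
      edits are batched and flushed in one apply, so greenlets no
      longer pile up on the lock once per notification.
      """

      def __init__(self):
          self._lock = threading.Lock()
          self._rules = []
          self._deferred = False
          self.apply_count = 0  # how many times we actually hit iptables

      def add_rule(self, rule):
          self._rules.append(rule)
          if not self._deferred:
              self._apply()  # immediate mode: one apply per change

      def defer_apply_on(self):
          self._deferred = True

      def defer_apply_off(self):
          # Flush everything accumulated since defer_apply_on() at once.
          self._deferred = False
          self._apply()

      def _apply(self):
          # The real agent shells out to iptables under a lock here;
          # every call is a chance for contention.
          with self._lock:
              self.apply_count += 1


  # 100 security-group updates: deferred, they cost a single apply.
  mgr = IptablesManager()
  mgr.defer_apply_on()
  for port in range(100):
      mgr.add_rule("allow sg-member for tap%d" % port)
  mgr.defer_apply_off()
  print(mgr.apply_count)  # -> 1
  ```

  Without deferral the same loop would take the lock 100 times, which is the contention storm described above.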
defer applying iptables changes), which fixed the scalability problem: we were able to launch 500 instances at once (nova boot ... --min-count 500), all active and reachable within 8 minutes. A patch is coming soon.

  HTH,

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1368281/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp