Public bug reported:

ENV: stable/queens
But master has basically the same code, so the issue likely exists there as well.

Config: L2 ovs-agent with the OpenFlow based security group firewall enabled.
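
For reference, "OpenFlow based security group" here means the native openvswitch firewall driver; a minimal sketch of the relevant agent configuration (assuming the stock option names, adjust the path to your deployment):

# /etc/neutron/plugins/ml2/openvswitch_agent.ini on the compute node
[securitygroup]
firewall_driver = openvswitch
enable_security_group = true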

Recently I ran an extreme test locally, booting 2700 instances for a single tenant.
The instances were booted across 2000 networks, but the entire tenant has only one security group with only 5 rules. (This is the key point of the problem.)

The result is totally unacceptable: more than 2000 instances failed to
boot (ERROR state), and almost every one of them hit the "vif-plug-timeout"
exception.


How to reproduce:
1. create 2700 networks one by one with "openstack network create" (a shell sketch covering steps 1-3 follows the loop in step 4)
2. create one IPv4 subnet and one IPv6 subnet for every network
3. create 2700 routers (a single tenant cannot create more than 255 HA routers because of the VRID range) and connect them to these subnets
4. boot instances:
for i in {1..100}
do
    for j in {1..27}
    do
        nova boot --nic net-name="test-network-xxx" ...
    done
    echo "CLI: booted 27 VMs"
    sleep 30s
done
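
As promised above, here is a rough shell sketch for steps 1-3; the resource names and the CIDR/address patterns are placeholders of mine, not values from the original test:

for i in {1..2700}
do
    # step 1: one network per iteration
    openstack network create "test-network-$i"
    # step 2: one IPv4 and one IPv6 subnet on that network
    openstack subnet create --network "test-network-$i" \
        --subnet-range "10.$(( i / 250 )).$(( i % 250 )).0/24" "test-subnet4-$i"
    openstack subnet create --network "test-network-$i" --ip-version 6 \
        --subnet-range "fd00:$i::/64" "test-subnet6-$i"
    # step 3: one (non-HA) router per network, attached to both subnets
    openstack router create "test-router-$i"
    openstack router add subnet "test-router-$i" "test-subnet4-$i"
    openstack router add subnet "test-router-$i" "test-subnet6-$i"
done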


I have some clues about this issue; the linear increase in processing time looks roughly like this:
(1) rpc_loop X
5 ports are added to the ovs-agent; they are processed and, because of the local notification, will land on the updated list for the next loop.
(2) rpc_loop X + 1
another 10 ports are added to the ovs-agent, and they in turn generate 10 updated-port local notifications.
This loop's processing time is the update processing of the previous 5 ports plus the processing of the 10 newly added ports.
(3) rpc_loop X + 2
another 20 ports are added to the ovs-agent;
processing time is 10 updated ports + 20 added ports.

And the worse part is that as the port count grows, every port under this single security group is related to every other, so the OpenFlow based security group processing takes longer and longer per port. Eventually some instance ports exceed the vif-plug timeout, and those instances fail to boot.
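
To make the scaling concrete, here is a rough back-of-envelope model (my own assumption about the cost shape, not measured data from this test):

    loop_time(X) ~ (added_X + updated_X) * c * N_X

where N_X is the number of ports already in the shared security group and c is the per-related-port flow programming cost. With N_X in the thousands, even a few milliseconds per related port pushes a single rpc_loop iteration to hundreds of seconds, well past nova's vif_plugging_timeout (300 seconds by default), which matches the mass vif-plug failures above.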

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1838431

Title:
  [scale issue] ovs-agent port processing time increases linearly and
  eventually times out

Status in neutron:
  New

