Public bug reported:

When the environment starts (TripleO deployment), we wait until
pacemaker has started everything. While that is happening, the
neutron-agent services have already been started by systemd and are
waiting, with a 10-minute timeout, for a RabbitMQ connection. The
resilience code for neutron agent to RabbitMQ communication [1] does
not take into account the start-up case, in which the connection to
RabbitMQ was never established, and this causes the 10-minute delay.
To solve the problem we should distinguish the following resilience
cases:

1. Initial connection establishment. The connection to RabbitMQ was never
established and the agent is trying to establish it (initial startup of the
whole OpenStack cluster after a power outage or planned reboot, or the reboot
of a single compute node).
2. The connection to RabbitMQ was established but then lost. In this case
[1] does its job perfectly and reduces the load on RabbitMQ.
3. The connection was established but there is no reply from RabbitMQ
(RabbitMQ is overloaded). In this case [1] does its job as well.

To resolve case 1 we should introduce a variable,
is_connection_ever_established. While it is not set, we should try to
connect every 20-30 seconds, and set is_connection_ever_established to
true once the connection is established. When
is_connection_ever_established is true but there is no reply or the
connection is lost, we should use the algorithm in [1]. This change
should speed up initial cluster startup and compute node reboot.
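
A rough sketch of the proposal, to make the two retry regimes concrete.
This is not neutron-lib code: the class name ConnectionRetryPolicy, its
methods and the backoff_base/backoff_max parameters are hypothetical, and
the capped exponential backoff only stands in for whatever [1] actually
does. Only the is_connection_ever_established flag and the 20-30 second
interval come from the description above.

```python
import random


class ConnectionRetryPolicy:
    """Choose the delay before the next RabbitMQ connection attempt.

    Hypothetical helper for illustration only; not part of neutron-lib.
    """

    def __init__(self, initial_range=(20.0, 30.0), backoff_base=1.0,
                 backoff_max=600.0):
        # Flag proposed in the report: False until the first successful
        # connection, True from then on.
        self.is_connection_ever_established = False
        # "every 20-30 seconds" from the report, used for case 1 only.
        self.initial_range = initial_range
        # Assumed placeholder values for the existing backoff in [1].
        self.backoff_base = backoff_base
        self.backoff_max = backoff_max
        self._failures = 0

    def on_connected(self):
        """Record a successful connection and reset the backoff state."""
        self.is_connection_ever_established = True
        self._failures = 0

    def next_delay(self):
        """Return the number of seconds to wait before retrying."""
        if not self.is_connection_ever_established:
            # Case 1: never connected, so retry at a short, fixed cadence.
            return random.uniform(*self.initial_range)
        # Cases 2 and 3: connection lost or no reply. A capped exponential
        # backoff stands in here for the algorithm in [1].
        delay = min(self.backoff_base * 2 ** self._failures,
                    self.backoff_max)
        self._failures += 1
        return delay
```

An agent's connect loop would call next_delay() after each failed attempt
and on_connected() after the first success, so the short retry interval
applies only until the first connection is made.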





[1] https://opendev.org/openstack/neutron-lib/src/branch/master/neutron_lib/rpc.py#L159-L180

** Affects: neutron
     Importance: Wishlist
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed

https://bugs.launchpad.net/bugs/1940084

Title:
  neutron-agent causes 10m delay on start-up

Status in neutron:
  Confirmed
