Public bug reported:

When the environment starts (TripleO deployment), we wait until pacemaker starts everything. While that is happening, the neutron-agent services have already been started by systemd but are waiting, with a 10 minute timeout, for a RabbitMQ connection.

Looking at the resilience code [1] for neutron agent - RabbitMQ communication, it doesn't take into account the start-up case in which the connection to RabbitMQ was never established, which causes the 10m delay.

To solve the problem we should specify the cases for resilience:
1. Initial connection establishment. The connection to RabbitMQ was never established and the agent is trying to establish it (initial startup of the whole OpenStack cluster after a power outage or planned reboot, or a reboot of one compute node).
2. The connection to RabbitMQ was established but then lost. In this case [1] does its job perfectly, reducing the load on RabbitMQ.
3. The connection was established but there is no reply from RabbitMQ (RabbitMQ is overloaded). In this case [1] does its job as well.

To resolve case 1 we should introduce a variable is_connection_ever_established. While it is not set, we should try to connect every 20-30 seconds, and set is_connection_ever_established = true once the connection is established. When is_connection_ever_established == true but there is no reply or the connection is lost, we should use the algorithm in [1].

This change will speed up initial cluster startup and compute node reboot.

[1] https://opendev.org/openstack/neutron-lib/src/branch/master/neutron_lib/rpc.py#L159-L180

** Affects: neutron
     Importance: Wishlist
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed

https://bugs.launchpad.net/bugs/1940084

Title:
  neutron-agent causes 10m delay on start-up

Status in neutron:
  Confirmed
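
The retry policy proposed in the bug description could look roughly like the sketch below. It is only an illustration under stated assumptions, not the neutron-lib code: the class ReconnectPolicy, its method names, and the default values are hypothetical, and a generic exponential back-off capped at 600 seconds stands in for the algorithm in [1], whose details are not reproduced here. Only the flag name is_connection_ever_established and the 20-30 second interval come from the report.

```python
import random


class ReconnectPolicy:
    """Hypothetical helper that picks the delay before the next attempt."""

    def __init__(self, initial_min=20.0, initial_max=30.0,
                 backoff_base=1.0, backoff_max=600.0, rng=None):
        # 20-30 s comes from the bug report; the back-off values are assumptions.
        self.initial_min = initial_min
        self.initial_max = initial_max
        self.backoff_base = backoff_base
        self.backoff_max = backoff_max
        self.is_connection_ever_established = False
        self._failures = 0
        self._rng = rng or random.Random()

    def on_connected(self):
        """Record a successful connection and reset the back-off state."""
        self.is_connection_ever_established = True
        self._failures = 0

    def on_failure(self):
        """Record a failed attempt and return the seconds to wait before retrying."""
        if not self.is_connection_ever_established:
            # Case 1: never connected yet, so retry at a short, jittered interval.
            return self._rng.uniform(self.initial_min, self.initial_max)
        # Cases 2 and 3: connection lost or no reply, so back off exponentially.
        # This is a stand-in for the existing algorithm referenced as [1].
        self._failures += 1
        delay = self.backoff_base * 2 ** (self._failures - 1)
        return min(delay, self.backoff_max)
```

An agent's connection loop would call on_failure() after each failed attempt and sleep for the returned number of seconds, then call on_connected() once the connection succeeds. Keeping the decision in a small object with no I/O makes the two regimes easy to test separately from the messaging layer.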