** Description changed:

+ [Impact]
+
  This issue appears to be a consequence of
  https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1420572 where we
  added a 'wait-for-state running' to the nova-compute upstart so as to
  ensure that neutron-ovs-cleanup has finished before nova-compute starts.
  I have started to spot, however, that on some hosts (metal only) there
  is now a race between the two whereby nova-compute sometimes fails to
  start on system boot/reboot with the following in
  /var/log/upstart/nova-compute.log:

  ...
  libvirt-bin stop/waiting
  wait-for-state stop/waiting
  neutron-ovs-cleanup start/pre-start, process 3084
  start: Job failed to start

  If I manually restart nova-compute all is fine. So this looks like a
  race between nova-compute's wait-for-state and neutron-ovs-cleanup's
  pre-start -> start/running.
+
+ The proposed solution here is to add some retry logic to the nova-compute
+ upstart job to tolerate neutron-ovs-cleanup not being able to start yet.
+ We, therefore, allow a certain number of retries, every other with an
+ incremented delay, before giving up and allowing nova-compute to start
+ anyway. If ovs-cleanup failed to start after what is a fairly liberal
+ retry period, it is assumed to have failed altogether this making it
+ safe(ish) to start nova-compute.
+
+ [Test Case]
+
+ In one terminal (as root) do:
+ service neutron-ovs-cleanup stop; service openvswitch-switch stop; service nova-compute restart
+
+ In another do:
+ sudo tail -F /var/log/upstart/nova-compute.log
+
+ Observe the retries occurring
+
+ Then do 'sudo service openvswitch-switch start' and observe nova-compute
+ retry and succeed.
+
+ [Regression Potential]
+
+  * If openvswitch-switch does not start within the max retries and
+ intervals nova-compute will start anyway and if ovs-cleanup were at some
+ point to run one would see the behaviour that LP 1420572 was intended to
+ resolve. It does not seem to make sense to wait indefinitely for
+ ovs-cleanup to be up and the coded interval is pretty liberal and should
+ be plenty.
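
For illustration only, the retry behaviour described under [Impact] could be sketched roughly as below. This is not the actual nova-compute upstart job: the function name wait_with_retries, the retry count, the base delay and the "grow the delay on every other attempt" rule are all assumptions made for the sketch, based on one reading of the description.

```shell
#!/bin/sh
# Hypothetical sketch, not the packaged upstart job.
# wait_with_retries MAX_RETRIES BASE_DELAY CHECK_CMD...
# Runs CHECK_CMD until it succeeds or MAX_RETRIES attempts have failed.
# Returns 0 if the check succeeded, 1 if every attempt failed.
wait_with_retries() {
    max_retries=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_retries" ]; do
        if "$@"; then
            echo "attempt $attempt: ready"
            return 0
        fi
        echo "attempt $attempt: not ready, sleeping ${delay}s"
        sleep "$delay"
        # Assumed policy: increase the delay after every second attempt.
        if [ $((attempt % 2)) -eq 0 ]; then
            delay=$((delay + 1))
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up after $max_retries attempts, starting anyway"
    return 1
}
```

In an upstart pre-start script this might be called along the lines of `wait_with_retries 10 1 sh -c 'status neutron-ovs-cleanup | grep -q start/running' || true`, where the trailing `|| true` reflects the stated intent of letting nova-compute start anyway once the retries are exhausted. The values 10 and 1 are placeholders, not the ones in the proposed patch.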

** Changed in: nova (Ubuntu Trusty)
       Status: New => In Progress

** Changed in: nova (Ubuntu Utopic)
       Status: New => In Progress

** Changed in: nova (Ubuntu Vivid)
       Status: New => In Progress

** Changed in: nova (Ubuntu Trusty)
     Assignee: (unassigned) => Edward Hope-Morley (hopem)

** Changed in: nova (Ubuntu Utopic)
     Assignee: (unassigned) => Edward Hope-Morley (hopem)

** Changed in: nova (Ubuntu Vivid)
     Assignee: (unassigned) => Edward Hope-Morley (hopem)

** Description changed:

  [Impact]

  This issue appears to be a consequence of
  https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1420572 where we
  added a 'wait-for-state running' to the nova-compute upstart so as to
  ensure that neutron-ovs-cleanup has finished before nova-compute starts.
  I have started to spot, however, that on some hosts (metal only) there
  is now a race between the two whereby nova-compute sometimes fails to
  start on system boot/reboot with the following in
  /var/log/upstart/nova-compute.log:

  ...
  libvirt-bin stop/waiting
  wait-for-state stop/waiting
  neutron-ovs-cleanup start/pre-start, process 3084
  start: Job failed to start

  If I manually restart nova-compute all is fine. So this looks like a
  race between nova-compute's wait-for-state and neutron-ovs-cleanup's
  pre-start -> start/running.

  The proposed solution here is to add some retry logic to the nova-compute
  upstart job to tolerate neutron-ovs-cleanup not being able to start yet.
  We, therefore, allow a certain number of retries, every other with an
  incremented delay, before giving up and allowing nova-compute to start
  anyway. If ovs-cleanup failed to start after what is a fairly liberal
- retry period, it is assumed to have failed altogether this making it
+ retry period, it is assumed to have failed altogether thus making it
  safe(ish) to start nova-compute.

  [Test Case]

  In one terminal (as root) do:
  service neutron-ovs-cleanup stop; service openvswitch-switch stop; service nova-compute restart

  In another do:
  sudo tail -F /var/log/upstart/nova-compute.log

  Observe the retries occurring

  Then do 'sudo service openvswitch-switch start' and observe nova-compute
  retry and succeed.

  [Regression Potential]

-  * If openvswitch-switch does not start within the max retries and
+ * If openvswitch-switch does not start within the max retries and
  intervals nova-compute will start anyway and if ovs-cleanup were at some
  point to run one would see the behaviour that LP 1420572 was intended to
  resolve. It does not seem to make sense to wait indefinitely for
  ovs-cleanup to be up and the coded interval is pretty liberal and should
  be plenty.

** Description changed:

  [Impact]

  This issue appears to be a consequence of
  https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1420572 where we
  added a 'wait-for-state running' to the nova-compute upstart so as to
  ensure that neutron-ovs-cleanup has finished before nova-compute starts.
  I have started to spot, however, that on some hosts (metal only) there
  is now a race between the two whereby nova-compute sometimes fails to
  start on system boot/reboot with the following in
  /var/log/upstart/nova-compute.log:

  ...
  libvirt-bin stop/waiting
  wait-for-state stop/waiting
  neutron-ovs-cleanup start/pre-start, process 3084
  start: Job failed to start

  If I manually restart nova-compute all is fine. So this looks like a
  race between nova-compute's wait-for-state and neutron-ovs-cleanup's
  pre-start -> start/running.

  The proposed solution here is to add some retry logic to the nova-compute
  upstart job to tolerate neutron-ovs-cleanup not being able to start yet.
  We, therefore, allow a certain number of retries, every other with an
  incremented delay, before giving up and allowing nova-compute to start
  anyway.
  If ovs-cleanup failed to start after what is a fairly liberal
  retry period, it is assumed to have failed altogether thus making it
  safe(ish) to start nova-compute.

  [Test Case]

  In one terminal (as root) do:
  service neutron-ovs-cleanup stop; service openvswitch-switch stop; service nova-compute restart

  In another do:
  sudo tail -F /var/log/upstart/nova-compute.log

  Observe the retries occurring

  Then do 'sudo service openvswitch-switch start' and observe nova-compute
  retry and succeed.

  [Regression Potential]

- * If openvswitch-switch does not start within the max retries and
+ If openvswitch-switch does not start within the max retries and
  intervals nova-compute will start anyway and if ovs-cleanup were at some
  point to run one would see the behaviour that LP 1420572 was intended to
  resolve. It does not seem to make sense to wait indefinitely for
  ovs-cleanup to be up and the coded interval is pretty liberal and should
  be plenty.

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to nova in Ubuntu.
https://bugs.launchpad.net/bugs/1471022

Title:
  [SRU] race between nova-compute and neutron-ovs-cleanup

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nova/+bug/1471022/+subscriptions

-- 
Ubuntu-server-bugs mailing list
Ubuntu-server-bugs@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-server-bugs