[Yahoo-eng-team] [Bug 1648242] Re: [SRU] Failure to retry update_ha_routers_states
This bug was fixed in the package neutron - 2:8.4.0-0ubuntu2~cloud0

---------------
neutron (2:8.4.0-0ubuntu2~cloud0) trusty-mitaka; urgency=medium

  * New update for the Ubuntu Cloud Archive.

neutron (2:8.4.0-0ubuntu2) xenial; urgency=medium

  [ Edward Hope-Morley ]
  * Backport fix for Failure to retry update_ha_routers_states (LP: #1648242)
    - d/p/add-check-for-ha-state.patch

  [ Chuck Short ]
  * d/neutron-common.install, d/neutron-dhcp-agent.install: Remove cron jobs
    since they will cause a race when using an L3 agent. The L3 agent cleans
    up after itself now. (LP: #1623664)

** Changed in: cloud-archive/mitaka
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1648242

Title:
  [SRU] Failure to retry update_ha_routers_states

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Xenial:
  Fix Released

Bug description:
  [Impact]

  Mitigates risk of incorrect ha_state reported by l3-agent for HA
  routers in case where rmq connection is lost during update window. Fix
  is already in Ubuntu for O and N but upstream backport just missed the
  Mitaka PR hence this SRU.

  [Test Case]

    * deploy Openstack Mitaka (Xenial) with l3-ha enabled and min/max
      l3-agents-per-router set to 3
    * configure network, router, boot instance with floating ip and start
      pinging
    * check that status is 1 agent showing active and 2 showing standby
    * trigger some router failovers while rabbit server stopped e.g.
    - go to l3-agent hosting your router and do:

        ip netns exec qrouter-${router} ip link set dev down

      check other units to see if ha iface has been failed over

        ip netns exec qrouter-${router} ip link set dev up

    * ensure ping still running
    * eventually all agents will be xxx/standby
    * start rabbit server
    * wait for correct ha_state to be set (takes a few seconds)

  [Regression Potential]

  I do not envisage any regression from this patch. One potential
  side-effect is mildly increased rmq traffic but should be negligible.

  Version: Mitaka

  While performing failover testing of L3 HA routers, we've discovered
  an issue with regards to the failure of an agent to report its state.

  In this scenario, we have a router
  (7629f5d7-b205-4af5-8e0e-a3c4d15e7677) scheduled to (3) L3 agents:

  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | id                                   | host                                             | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | :-)   | active   |
  | 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
  | 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+

  The infra03 node was shut down completely and abruptly.
  The router transitioned to master on infra02 as indicated in these log
  messages:

  2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
  2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
  2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-] - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
  2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
  2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-] - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
  2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
  2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-] - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
  2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
  2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-] - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848

  Traffic to/from VMs through the new master router functioned as
  expected. However, the ha_state remained 'standby':
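
For anyone repeating the test case, it may help to pull the last keepalived transition per router out of the l3-agent log and compare it by hand with the ha_state column the server reports. The following is only a rough sketch, not part of neutron: the function name `last_transitions` and the regular expression are made up for illustration, and the pattern assumes log lines shaped exactly like the ones quoted in this report.

```python
import re

# A few of the lines quoted in the report above, used as sample input.
LOG = """\
2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
"""

# Hypothetical pattern: timestamp at the start of the line, then a
# "Router <uuid> transitioned to <state>" message at the end.
TRANSITION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .* "
    r"Router (?P<router>[0-9a-f-]{36}) transitioned to (?P<state>\w+)$"
)


def last_transitions(log_text):
    """Return {router_id: (timestamp, state)} for the latest transition seen.

    Lines are assumed to be in chronological order, so a later match for
    the same router simply overwrites the earlier one.
    """
    latest = {}
    for line in log_text.splitlines():
        match = TRANSITION.match(line.strip())
        if match:
            latest[match["router"]] = (match["ts"], match["state"])
    return latest


if __name__ == "__main__":
    for router, (ts, state) in sorted(last_transitions(LOG).items()):
        print(router, state, "since", ts)
```

Note that the agent log uses master/backup while the agent listing shows active/standby, so the comparison is between the two vocabularies rather than a literal string match.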
[Yahoo-eng-team] [Bug 1648242] Re: [SRU] Failure to retry update_ha_routers_states
This bug was fixed in the package neutron - 2:8.4.0-0ubuntu2

---------------
neutron (2:8.4.0-0ubuntu2) xenial; urgency=medium

  [ Edward Hope-Morley ]
  * Backport fix for Failure to retry update_ha_routers_states (LP: #1648242)
    - d/p/add-check-for-ha-state.patch

  [ Chuck Short ]
  * d/neutron-common.install, d/neutron-dhcp-agent.install: Remove cron jobs
    since they will cause a race when using an L3 agent. The L3 agent cleans
    up after itself now. (LP: #1623664)

 -- Chuck Short  Wed, 19 Apr 2017 11:39:09 +0100

** Changed in: neutron (Ubuntu Xenial)
       Status: Fix Committed => Fix Released

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Xenial:
  Fix Released
[Yahoo-eng-team] [Bug 1648242] Re: [SRU] Failure to retry update_ha_routers_states
** Changed in: neutron (Ubuntu Xenial)
       Status: New => Triaged

** Changed in: neutron (Ubuntu)
       Status: New => Fix Released

** Also affects: cloud-archive/mitaka
   Importance: Undecided
       Status: New

** Changed in: cloud-archive
       Status: New => Fix Released

** Changed in: cloud-archive/mitaka
   Importance: Undecided => Low

** Changed in: cloud-archive/mitaka
       Status: New => Triaged

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Xenial:
  Triaged