[Yahoo-eng-team] [Bug 1648242] Re: [SRU] Failure to retry update_ha_routers_states

2017-05-17 Thread James Page
This bug was fixed in the package neutron - 2:8.4.0-0ubuntu2~cloud0
---

neutron (2:8.4.0-0ubuntu2~cloud0) trusty-mitaka; urgency=medium

  * New update for the Ubuntu Cloud Archive.

neutron (2:8.4.0-0ubuntu2) xenial; urgency=medium

  [ Edward Hope-Morley ]
  * Backport fix for Failure to retry update_ha_routers_states (LP: #1648242)
    - d/p/add-check-for-ha-state.patch

  [ Chuck Short ]
  * d/neutron-common.install, d/neutron-dhcp-agent.install:
    Remove cron jobs since they will cause a race when
    using an L3 agent. The L3 agent cleans up after itself now.
    (LP: #1623664)


** Changed in: cloud-archive/mitaka
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1648242

Title:
  [SRU] Failure to retry update_ha_routers_states

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Xenial:
  Fix Released

Bug description:
  [Impact]

  Mitigates the risk of an incorrect ha_state being reported by the
  l3-agent for HA routers when the rmq (RabbitMQ) connection is lost
  during the update window. The fix is already in Ubuntu for O and N,
  but the upstream backport just missed the Mitaka PR, hence this SRU.
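
The failure mode described above (a state update lost because the message bus was unreachable at send time) can be illustrated with a minimal sketch. This is not the contents of d/p/add-check-for-ha-state.patch, which is not shown in this thread. `HaStateReporter` and its `send` callable are hypothetical names used only to show the general idea of keeping unsent states queued until a later attempt succeeds.

```python
class HaStateReporter:
    """Hypothetical helper: batches HA state changes and retries failed sends."""

    def __init__(self, send):
        # `send` stands in for the RPC call that reports states to the server.
        self._send = send
        self._pending = {}

    def enqueue(self, router_id, state):
        # A newer state for the same router overwrites the older one.
        self._pending[router_id] = state

    def flush(self):
        """Try to report all pending states; keep them if the send fails."""
        if not self._pending:
            return True
        batch = dict(self._pending)
        try:
            self._send(batch)
        except Exception:
            # Connection lost: leave the batch queued for the next flush.
            return False
        # Drop only the entries that were not updated while sending.
        for router_id, state in batch.items():
            if self._pending.get(router_id) == state:
                del self._pending[router_id]
        return True
```

Calling `flush()` periodically means a transition recorded while the bus is down is reported once it comes back, at the cost of a few extra messages.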

  [Test Case]

  * deploy OpenStack Mitaka (Xenial) with l3-ha enabled and min/max
    l3-agents-per-router set to 3

  * configure a network and router, boot an instance with a floating IP
    and start pinging it

  * check that the status shows 1 agent as active and 2 as standby

  * trigger some router failovers while the rabbit server is stopped, e.g.

    - go to the l3-agent hosting your router and do:

        ip netns exec qrouter-${router} ip link set dev  down

      check the other units to see if the ha iface has failed over

        ip netns exec qrouter-${router} ip link set dev  up

  * ensure the ping is still running

  * eventually all agents will be xxx/standby

  * start the rabbit server

  * wait for the correct ha_state to be set (takes a few seconds)
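
The pass criterion in the steps above (one active agent, the rest standby) can be expressed as a small check. This is a sketch only: `ha_states_settled` is a hypothetical helper, and it assumes the agent rows have already been fetched as dictionaries with an `ha_state` key, matching the column of that name in the agent listing.

```python
from collections import Counter


def ha_states_settled(agents):
    """Return True when exactly one agent is active and the rest are standby."""
    counts = Counter(agent["ha_state"] for agent in agents)
    return counts["active"] == 1 and counts["standby"] == len(agents) - 1
```

Polling this after the rabbit server is restarted gives a simple signal that the correct ha_state has been set.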

  [Regression Potential]

  I do not envisage any regression from this patch. One potential
  side-effect is mildly increased rmq traffic, but it should be negligible.

  
  

  Version: Mitaka

  While performing failover testing of L3 HA routers, we've discovered
  an issue with regard to the failure of an agent to report its state.

  In this scenario, we have a router
  (7629f5d7-b205-4af5-8e0e-a3c4d15e7677) scheduled to three L3 agents:

  
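  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | id                                   | host                                             | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | :-)   | active   |
  | 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
  | 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+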

  The infra03 node was shut down completely and abruptly. The router
  transitioned to master on infra02 as indicated in these log messages:

  2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
  2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
  2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-]  - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
  2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
  2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-]  - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
  2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
  2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-]  - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
  2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
  2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-]  - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848
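
To see which state the agent itself last recorded for each router, the "transitioned to" records can be pulled out of the log. The following is a rough sketch, assuming each log record sits on a single line in the format shown above; `last_transitions` is a hypothetical helper name, not part of neutron.

```python
import re

# Matches the l3 HA transition records in the agent log format shown above.
TRANSITION_RE = re.compile(
    r"neutron\.agent\.l3\.ha \[-\] Router (\S+) transitioned to (\w+)"
)


def last_transitions(lines):
    """Map each router ID to the most recent state logged for it."""
    states = {}
    for line in lines:
        match = TRANSITION_RE.search(line)
        if match:
            router_id, state = match.groups()
            # Later lines overwrite earlier ones, so the last record wins.
            states[router_id] = state
    return states
```

Comparing this mapping with the ha_state reported by the server shows where the two disagree.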

  Traffic to/from VMs through the new master router functioned as
  expected. However, the ha_state remained 'standby':

 

[Yahoo-eng-team] [Bug 1648242] Re: [SRU] Failure to retry update_ha_routers_states

2017-05-15 Thread Launchpad Bug Tracker
This bug was fixed in the package neutron - 2:8.4.0-0ubuntu2

---
neutron (2:8.4.0-0ubuntu2) xenial; urgency=medium

  [ Edward Hope-Morley ]
  * Backport fix for Failure to retry update_ha_routers_states (LP: #1648242)
    - d/p/add-check-for-ha-state.patch

  [ Chuck Short ]
  * d/neutron-common.install, d/neutron-dhcp-agent.install:
    Remove cron jobs since they will cause a race when
    using an L3 agent. The L3 agent cleans up after itself now.
    (LP: #1623664)

 -- Chuck Short   Wed, 19 Apr 2017 11:39:09 +0100

** Changed in: neutron (Ubuntu Xenial)
   Status: Fix Committed => Fix Released

Title:
  [SRU] Failure to retry update_ha_routers_states

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Xenial:
  Fix Released

[Yahoo-eng-team] [Bug 1648242] Re: [SRU] Failure to retry update_ha_routers_states

2017-04-19 Thread James Page
** Changed in: neutron (Ubuntu Xenial)
   Status: New => Triaged

** Changed in: neutron (Ubuntu)
   Status: New => Fix Released

** Also affects: cloud-archive/mitaka
   Importance: Undecided
   Status: New

** Changed in: cloud-archive
   Status: New => Fix Released

** Changed in: cloud-archive/mitaka
   Importance: Undecided => Low

** Changed in: cloud-archive/mitaka
   Status: New => Triaged

Title:
  [SRU] Failure to retry update_ha_routers_states

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Xenial:
  Triaged

Bug description:
  [Impact]

  Mitigates the risk of an incorrect ha_state being reported by the
  l3-agent for HA routers when the rmq (RabbitMQ) connection is lost
  during the update window. The fix is already in Ubuntu for O and N,
  but the upstream backport just missed the Mitaka PR, hence this SRU.

  [Test Case]

  * deploy OpenStack Mitaka (Xenial) with l3-ha enabled and min/max
    l3-agents-per-router set to 3

  * configure a network and router, boot an instance with a floating IP
    and start pinging it

  * check that the status shows 1 agent as active and 2 as standby

  * trigger some router failovers while the rabbit server is stopped, e.g.

    - go to the l3-agent hosting your router and do:

        ip netns exec qrouter-${router} ip link set dev  down

      check the other units to see if the ha iface has failed over

        ip netns exec qrouter-${router} ip link set dev  up

  * ensure the ping is still running

  * eventually all agents will be xxx/standby

  * start the rabbit server

  * wait for the correct ha_state to be set (takes a few seconds)

  [Regression Potential]

  I do not envisage any regression from this patch. One potential
  side-effect is mildly increased rmq traffic, but it should be negligible.

  
  

  Version: Mitaka

  While performing failover testing of L3 HA routers, we've discovered
  an issue with regard to the failure of an agent to report its state.

  In this scenario, we have a router
  (7629f5d7-b205-4af5-8e0e-a3c4d15e7677) scheduled to three L3 agents:

  
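  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | id                                   | host                                             | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | :-)   | active   |
  | 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
  | 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+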

  The infra03 node was shut down completely and abruptly. The router
  transitioned to master on infra02 as indicated in these log messages:

  2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
  2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
  2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-]  - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
  2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
  2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-]  - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
  2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
  2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-]  - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
  2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
  2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-]  - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848

  Traffic to/from VMs through the new master router functioned as
  expected. However, the ha_state remained 'standby':

  
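  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | id                                   | host                                             | admin_state_up | alive | ha_state |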