** Description changed:

+ [Impact]
+ 
+   Mitigates the risk of an incorrect ha_state being reported by the
+   l3-agent for HA routers when the rabbitmq connection is lost during
+   the update window. The fix is already in Ubuntu for Ocata and Newton,
+   but the upstream backport just missed the Mitaka PR, hence this SRU.
+ 
+ [Test Case]
+ 
+   * deploy Openstack Mitaka (Xenial) with l3-ha enabled and
+     min/max l3-agents-per-router set to 3
+ 
+   * configure a network and router, boot an instance with a floating ip
+     and start pinging it
+ 
+   * check the router's agent status: 1 agent should show active and 2
+     should show standby
+ 
+   * trigger some router failovers while the rabbit server is stopped,
+     e.g.
+ 
+     - go to l3-agent hosting your router and do:
+ 
+       ip netns exec qrouter-${router} ip link set dev <ha-iface> down
+ 
+       check the other units to see whether the ha iface has failed over
+ 
+       ip netns exec qrouter-${router} ip link set dev <ha-iface> up
+    
+   * ensure ping still running
+ 
+   * eventually all agents will show as xxx/standby
+ 
+   * start rabbit server
+ 
+   * wait for correct ha_state to be set (takes a few seconds)
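
The failover trigger above can be scripted. The following is a minimal sketch: `ROUTER` and `HA_IF` are placeholders for your router UUID and the ha- interface found inside the qrouter namespace, and the `RUN=echo` dry-run convention is ours, not part of the test procedure. Set `RUN=""` (as root, on a deployed cloud) to actually execute the commands.

```shell
#!/bin/sh
# Sketch of the failover trigger. ROUTER and HA_IF are placeholders:
# substitute your router UUID and the ha- interface from the qrouter netns.
# RUN=echo makes this a dry run that only prints the commands; set RUN=""
# on a real deployment (as root) to execute them.
RUN=${RUN-echo}
ROUTER=${ROUTER:-7629f5d7-b205-4af5-8e0e-a3c4d15e7677}
HA_IF=${HA_IF:-ha-xxxxxxxx-xx}

failover() {
    # take the HA interface down to force a VRRP failover...
    $RUN ip netns exec "qrouter-$ROUTER" ip link set dev "$HA_IF" down
    # ...check ha_state on the other agents, e.g.:
    #   neutron l3-agent-list-hosting-router "$ROUTER"
    # then bring the interface back up
    $RUN ip netns exec "qrouter-$ROUTER" ip link set dev "$HA_IF" up
}

failover
```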
+ 
+ [Regression Potential]
+ 
+   I do not envisage any regression from this patch. One potential
+   side-effect is mildly increased rmq traffic, but it should be
+   negligible.
+ 
+ 
+ --------
+ 
  Version: Mitaka
  
  While performing failover testing of L3 HA routers, we've discovered an
  issue with regards to the failure of an agent to report its state.
  
  In this scenario, we have a router (7629f5d7-b205-4af5-8e0e-
  a3c4d15e7677) scheduled to (3) L3 agents:
  
  
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | id                                   | host                                             | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | :-)   | active   |
  | 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
  | 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  
  The infra03 node was shut down completely and abruptly. The router
  transitioned to master on infra02 as indicated in these log messages:
  
  2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
  2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
  2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
  2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
  2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
  2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
  2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
  2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
  2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848
  
  Traffic to/from VMs through the new master router functioned as
  expected. However, the ha_state remained 'standby':
- 
  
  
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | id                                   | host                                             | admin_state_up | alive | ha_state |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  | 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | xxx   | standby  |
  | 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
  | 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
  +--------------------------------------+--------------------------------------------------+----------------+-------+----------+
  
  A traceback was observed in the logs related to a message timeout,
  probably due to the loss of the AMQP server on infra03:
  
  2016-12-07 15:22:30.525 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
  2016-12-07 15:22:36.537 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
  2016-12-07 15:22:37.553 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.238.65:5671 via [amqp] client
  2016-12-07 15:22:51.210 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.246:5671 is unreachable: Basic.cancel: (0) 1. Trying again in 1 seconds.
  2016-12-07 15:22:52.262 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.237.246:5671 via [amqp] client
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent [-] Failed reporting state!
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent Traceback (most recent call last):
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 686, in _report_state
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     True)
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     return method(context, 'report_state', **kwargs)
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     retry=self.retry)
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     timeout=timeout, retry=retry)
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     retry=retry)
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     result = self._waiter.wait(msg_id, timeout)
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     message = self.waiters.get(msg_id, timeout=timeout)
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     'to message ID %s' % msg_id)
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 03e3909c25d749908652c050eeeaec91
  2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent
  2016-12-07 15:22:55.829 18450 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 0.01 sec
  2016-12-07 15:23:01.086 18450 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 03e3909c25d749908652c050eeeaec91
  2016-12-07 15:23:45.180 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 4 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
  2016-12-07 15:23:45.181 18450 WARNING neutron.common.rpc [-] Increasing timeout for update_ha_routers_states calls to 120 seconds. Restart the agent to restore it to the default value.
  2016-12-07 15:23:51.941 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 35 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
  
  While subsequent transitions for other routers were reported and
  updated Neutron accordingly, this router's ha_state was never updated
  after this initial failure. We could force an update by restarting the
  L3 agent, or by causing a transition from infra02 to another node and
  back. Neither is ideal.
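
  For context on the general shape of the problem: the state report gives
  up as soon as the RPC call raises MessagingTimeout, so a lost rabbitmq
  connection at the wrong moment drops the ha_state transition for good.
  Below is a minimal illustrative sketch of the retry pattern, not the
  actual neutron patch; `report_states` and `MessagingTimeout` here are
  stand-ins for the plugin RPC call and the oslo.messaging exception.

```python
# Illustrative sketch only (not the actual neutron patch): retry an RPC
# state report when the message bus times out, instead of dropping the
# update. `report_states` and `MessagingTimeout` are stand-ins for the
# plugin RPC call and the oslo.messaging exception.
import time


class MessagingTimeout(Exception):
    """Stand-in for oslo_messaging.MessagingTimeout."""


def report_with_retry(report_states, states, retries=3, delay=0):
    """Call report_states(states), retrying on MessagingTimeout.

    Without a retry, one timeout (e.g. while rabbitmq restarts) loses
    the ha_state transition until the next failover or agent restart.
    """
    for attempt in range(retries):
        try:
            return report_states(states)
        except MessagingTimeout:
            if attempt == retries - 1:
                raise
            time.sleep(delay)


# usage: a flaky transport that times out once, then succeeds
calls = []

def flaky(states):
    calls.append(states)
    if len(calls) == 1:
        raise MessagingTimeout()
    return 'ok'

result = report_with_retry(flaky, {'7629f5d7': 'active'})
print(result)  # ok
```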
  
  If you need any other information please let me know.
  
  Thanks,
  James

** Summary changed:

- Failure to retry update_ha_routers_states
+ [SRU] Failure to retry update_ha_routers_states

** Tags added: sts sts-sru-needed

** Patch added: "lp1648242-xenial-mitaka.debdiff"
   
https://bugs.launchpad.net/ubuntu/xenial/+source/neutron/+bug/1648242/+attachment/4855750/+files/lp1648242-xenial-mitaka.debdiff

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1648242

Title:
  [SRU] Failure to retry update_ha_routers_states

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1648242/+subscriptions
