Public bug reported:

Using ha-routers, we've found that getting network node agents down for
T>agent_down time, and then bringing them up fires a race condition
during ovs-agent and l3-agent boot.

Even if you set a constraint on ovs-agent being up before l3-agent, that
won't work, because still the l3-agent will start and sync routers
before the ovs-agent is able to report himself active back to neutron
server.

A possible solution:

   Using systemd notify from the openvswitch-agent, so it won't notify
system-d about readiness until the first heartbeat has been sent to
neutron-server.

Other solutions could be detecting the unbound state ("Device xxxxx not
defined on plugin") from the ovs-agent and trying to rebind it if the
port is present.


How does it reproduce:

1) neutron-servers and agents are down by being set to standby

some time, T>agent_down_time

2) cluster is took out of standby, neutron-servers and agents are
started again, first neutron-server, then agents

3) neutron servers consider the agents to be down (last heartbeat >
agent_down_time), when checking ports in the database, reset them as
binding failed...

4) agents start, try to fetch info from the ports, as those are in
status binding failed... they get no info from the server (or the
interpretation is wrong "Device xxx not defined on plugin")

5) agents mark ports as dead vlan (4095), as the port "does not exist"
on the server


Analysis over the logs:
  

-- SRV2 tryies to bind port on dead agent (: same happening for SRV1 and
3 that's omitted---

2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] Attempting to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on
host neutron-n-0 for vnic_type normal with profile  bind_port
/usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py:574

2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Attempting to bind port 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f-
9b78-185ef15809e3 bind_port /usr/lib/python2.7/site-
packages/neutron/plugins/ml2/drivers/mech_agent.py:57

2015-08-25 09:37:11.689000 [SRV2] 16522 DEBUG
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Checking agent: {'binary': u'neutron-
openvswitch-agent', 'description': None, 'admin_state_up': True,
'heartbeat_timestamp': datetime.datetime(2015, 8, 25, 1, 35, 5),
'alive': False, 'id': u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic':
u'N/A', 'host': u'neutron-n-0', 'agent_type': u'Open vSwitch agent',
'started_at': datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at':
datetime.datetime(2015, 8, 12, 10, 23, 16), 'configurations':
{u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices': 28,
u'l2_population': False, u'tunnel_types': [],
u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet-
external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}} bind_port
/usr/lib/python2.7/site-
packages/neutron/plugins/ml2/drivers/mech_agent.py:65

2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING
neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
4bb6-a652-91359eda496b None] Attempting to bind with dead agent:
{'binary': u'neutron-openvswitch-agent', 'description': None,
'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2015,
8, 25, 1, 35, 5), 'alive': False, 'id':
u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic': u'N/A', 'host':
u'neutron-n-0', 'agent_type': u'Open vSwitch agent', 'started_at':
datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at':
datetime.datetime(2015, 8, 12, 10, 23,16), 'configurations':
{u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices': 28,
u'l2_population': False, u'tunnel_types': [],
u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet-
external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}}

2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING
neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] Failed to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on host
neutron-n-0

2015-08-25 09:37:11.719000 [SRV2] 16522 WARNING
neutron.plugins.ml2.plugin [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
None] In _notify_port_updated(), no bound segment for port 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f-
9b78-185ef15809e3


-- OVS agent 1 requests info for ports, including 
6adcffbf-09d5-4a85-9339-9d6beb2bf82c --


2015-08-25 09:37:16.043000 [OVS1] 12613 DEBUG neutron.common.rpc 
[req-588c942a-6526-464f-a447-782a5e2d436a None] 
neutron.plugins.openvswitch.agent.ovs_neutron_agent.OVSPluginApi method call 
called with arguments (<neutron.context.ContextBase object at 0x326d510>, 
{'args': {'host': 'neutron-n-2', 'agent_id': 'ovs-agent-neutron-n-2', 
'devices': set([u'6adcffbf-09d5-4a85-9339-9d6beb2bf82c', 
u'2cdba01d-e48a-46a6-82c0-bbbfd8e2b490', 
u'fa58d644-8162-410b-8b45-747ae8813a61', 
u'5d5ea109-4807-4df3-bef4-b5d89c3ffebc', 
u'743ccaa6-7ed9-4195-aabd-3d55006338e1', 
u'dc662767-61a5-4807-b2ed-a7c76b541fd6', 
u'b5b271a1-d8b7-4b55-86fd-6316523878a5', 
u'b2d04f15-f2e2-429c-b948-00f88e1b9cf6', 
u'34b826df-9787-443c-9bef-084374827a85', 
u'c12e6e06-ff6a-44dc-b75f-78ec55dd3dd3', 
u'79b33879-3232-4b3a-a27c-c0a79da10379', 
u'499d3db7-9737-43a4-bcbf-923f9df5196e', 
u'55a19f5e-efff-4342-a82a-70ddb33fbefb', 
u'44dca3a9-44d4-418f-9d60-99dfe37afdc6'])}, 'namespace': None, 'method': 
'get_devices_details_list'}) {'version': '
 1.3'} wrapper /usr/lib/python2.7/site-packages/neutron/common/log.py:33


---- SRV1 notes the port is in binding_failed state ----

2015-08-25 09:37:16.048000 [SRV1] 15065 DEBUG neutron.plugins.ml2.rpc
[req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c details requested by agent ovs-agent-
neutron-n-2 with host neutron-n-2 get_device_details /usr/lib/python2.7
/site-packages/neutron/plugins/ml2/rpc.py:60

2015-08-25 09:37:16.174000 [SRV1] 15065 WARNING neutron.plugins.ml2.rpc
[req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-
09d5-4a85-9339-9d6beb2bf82c requested by agent ovs-agent-neutron-n-2 on
network 2c52ca73-b11f-426f-9b78-185ef15809e3 not bound, vif_type:
binding_failed


2015-08-25 09:37:16.807000 [OVS1] 12613 WARNING 
neutron.plugins.openvswitch.agent.ovs_neutron_agent 
[req-588c942a-6526-464f-a447-782a5e2d436a None] Device 
6adcffbf-09d5-4a85-9339-9d6beb2bf82c not defined on plugin

--- OVS1 tags it out ---

2015-08-25 09:37:16.900000 [OVS1] 12613 DEBUG neutron.agent.linux.utils
[req-588c942a-6526-464f-a447-782a5e2d436a None] Running command:
['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl',
'--timeout=10', 'set', 'Port', 'qr-6adcffbf-09', 'tag=4095']
create_process /usr/lib/python2.7/site-
packages/neutron/agent/linux/utils.py:48

** Affects: neutron
     Importance: Low
     Assignee: Jakub Libosvar (libosvar)
         Status: New


** Tags: l3-ha ovs

** Changed in: neutron
   Importance: Undecided => Low

** Tags added: ovs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1518995

Title:
  l3-agent router sync unbinds port on down ovs-agent

Status in neutron:
  New

Bug description:
  Using ha-routers, we've found that getting network node agents down
  for T>agent_down time, and then bringing them up fires a race
  condition during ovs-agent and l3-agent boot.

  Even if you set a constraint on ovs-agent being up before l3-agent,
  that won't work, because still the l3-agent will start and sync
  routers before the ovs-agent is able to report himself active back to
  neutron server.

  A possible solution:

     Using systemd notify from the openvswitch-agent, so it won't notify
  system-d about readiness until the first heartbeat has been sent to
  neutron-server.

  Other solutions could be detecting the unbound state ("Device xxxxx
  not defined on plugin") from the ovs-agent and trying to rebind it if
  the port is present.

  
  How does it reproduce:

  1) neutron-servers and agents are down by being set to standby

  some time, T>agent_down_time

  2) cluster is took out of standby, neutron-servers and agents are
  started again, first neutron-server, then agents

  3) neutron servers consider the agents to be down (last heartbeat >
  agent_down_time), when checking ports in the database, reset them as
  binding failed...

  4) agents start, try to fetch info from the ports, as those are in
  status binding failed... they get no info from the server (or the
  interpretation is wrong "Device xxx not defined on plugin")

  5) agents mark ports as dead vlan (4095), as the port "does not exist"
  on the server

  
  Analysis over the logs:
    

  -- SRV2 tryies to bind port on dead agent (: same happening for SRV1
  and 3 that's omitted---

  2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG
  neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
  None] Attempting to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on
  host neutron-n-0 for vnic_type normal with profile  bind_port
  /usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py:574

  2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG
  neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
  4bb6-a652-91359eda496b None] Attempting to bind port 6adcffbf-
  09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f-
  9b78-185ef15809e3 bind_port /usr/lib/python2.7/site-
  packages/neutron/plugins/ml2/drivers/mech_agent.py:57

  2015-08-25 09:37:11.689000 [SRV2] 16522 DEBUG
  neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
  4bb6-a652-91359eda496b None] Checking agent: {'binary': u'neutron-
  openvswitch-agent', 'description': None, 'admin_state_up': True,
  'heartbeat_timestamp': datetime.datetime(2015, 8, 25, 1, 35, 5),
  'alive': False, 'id': u'196101f4-9680-43bb-8310-9fca49cd4930',
  'topic': u'N/A', 'host': u'neutron-n-0', 'agent_type': u'Open vSwitch
  agent', 'started_at': datetime.datetime(2015, 8, 22, 6, 55, 59),
  'created_at': datetime.datetime(2015, 8, 12, 10, 23, 16),
  'configurations': {u'arp_responder_enabled': False, u'tunneling_ip':
  u'', u'devices': 28, u'l2_population': False, u'tunnel_types': [],
  u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet-
  external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}} bind_port
  /usr/lib/python2.7/site-
  packages/neutron/plugins/ml2/drivers/mech_agent.py:65

  2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING
  neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc-
  4bb6-a652-91359eda496b None] Attempting to bind with dead agent:
  {'binary': u'neutron-openvswitch-agent', 'description': None,
  'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2015,
  8, 25, 1, 35, 5), 'alive': False, 'id':
  u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic': u'N/A', 'host':
  u'neutron-n-0', 'agent_type': u'Open vSwitch agent', 'started_at':
  datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at':
  datetime.datetime(2015, 8, 12, 10, 23,16), 'configurations':
  {u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices':
  28, u'l2_population': False, u'tunnel_types': [],
  u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet-
  external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}}

  2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING
  neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
  None] Failed to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on host
  neutron-n-0

  2015-08-25 09:37:11.719000 [SRV2] 16522 WARNING
  neutron.plugins.ml2.plugin [req-fd4c17c2-0acc-4bb6-a652-91359eda496b
  None] In _notify_port_updated(), no bound segment for port 6adcffbf-
  09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f-
  9b78-185ef15809e3

  
  -- OVS agent 1 requests info for ports, including 
6adcffbf-09d5-4a85-9339-9d6beb2bf82c --

  
  2015-08-25 09:37:16.043000 [OVS1] 12613 DEBUG neutron.common.rpc 
[req-588c942a-6526-464f-a447-782a5e2d436a None] 
neutron.plugins.openvswitch.agent.ovs_neutron_agent.OVSPluginApi method call 
called with arguments (<neutron.context.ContextBase object at 0x326d510>, 
{'args': {'host': 'neutron-n-2', 'agent_id': 'ovs-agent-neutron-n-2', 
'devices': set([u'6adcffbf-09d5-4a85-9339-9d6beb2bf82c', 
u'2cdba01d-e48a-46a6-82c0-bbbfd8e2b490', 
u'fa58d644-8162-410b-8b45-747ae8813a61', 
u'5d5ea109-4807-4df3-bef4-b5d89c3ffebc', 
u'743ccaa6-7ed9-4195-aabd-3d55006338e1', 
u'dc662767-61a5-4807-b2ed-a7c76b541fd6', 
u'b5b271a1-d8b7-4b55-86fd-6316523878a5', 
u'b2d04f15-f2e2-429c-b948-00f88e1b9cf6', 
u'34b826df-9787-443c-9bef-084374827a85', 
u'c12e6e06-ff6a-44dc-b75f-78ec55dd3dd3', 
u'79b33879-3232-4b3a-a27c-c0a79da10379', 
u'499d3db7-9737-43a4-bcbf-923f9df5196e', 
u'55a19f5e-efff-4342-a82a-70ddb33fbefb', 
u'44dca3a9-44d4-418f-9d60-99dfe37afdc6'])}, 'namespace': None, 'method': 
'get_devices_details_list'}) {'version':
  '1.3'} wrapper /usr/lib/python2.7/site-packages/neutron/common/log.py:33

  
  ---- SRV1 notes the port is in binding_failed state ----

  2015-08-25 09:37:16.048000 [SRV1] 15065 DEBUG neutron.plugins.ml2.rpc
  [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-
  09d5-4a85-9339-9d6beb2bf82c details requested by agent ovs-agent-
  neutron-n-2 with host neutron-n-2 get_device_details
  /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:60

  2015-08-25 09:37:16.174000 [SRV1] 15065 WARNING
  neutron.plugins.ml2.rpc [req-588c942a-6526-464f-a447-782a5e2d436a
  None] Device 6adcffbf-09d5-4a85-9339-9d6beb2bf82c requested by agent
  ovs-agent-neutron-n-2 on network 2c52ca73-b11f-426f-9b78-185ef15809e3
  not bound, vif_type: binding_failed

  
  2015-08-25 09:37:16.807000 [OVS1] 12613 WARNING 
neutron.plugins.openvswitch.agent.ovs_neutron_agent 
[req-588c942a-6526-464f-a447-782a5e2d436a None] Device 
6adcffbf-09d5-4a85-9339-9d6beb2bf82c not defined on plugin

  --- OVS1 tags it out ---

  2015-08-25 09:37:16.900000 [OVS1] 12613 DEBUG
  neutron.agent.linux.utils [req-588c942a-6526-464f-a447-782a5e2d436a
  None] Running command: ['sudo', 'neutron-rootwrap',
  '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set',
  'Port', 'qr-6adcffbf-09', 'tag=4095'] create_process
  /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1518995/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

Reply via email to