Public bug reported: Using ha-routers, we've found that getting network node agents down for T>agent_down time, and then bringing them up fires a race condition during ovs-agent and l3-agent boot.
Even if you set a constraint on ovs-agent being up before l3-agent, that won't work, because still the l3-agent will start and sync routers before the ovs-agent is able to report himself active back to neutron server. A possible solution: Using systemd notify from the openvswitch-agent, so it won't notify system-d about readiness until the first heartbeat has been sent to neutron-server. Other solutions could be detecting the unbound state ("Device xxxxx not defined on plugin") from the ovs-agent and trying to rebind it if the port is present. How does it reproduce: 1) neutron-servers and agents are down by being set to standby some time, T>agent_down_time 2) cluster is took out of standby, neutron-servers and agents are started again, first neutron-server, then agents 3) neutron servers consider the agents to be down (last heartbeat > agent_down_time), when checking ports in the database, reset them as binding failed... 4) agents start, try to fetch info from the ports, as those are in status binding failed... they get no info from the server (or the interpretation is wrong "Device xxx not defined on plugin") 5) agents mark ports as dead vlan (4095), as the port "does not exist" on the server Analysis over the logs: -- SRV2 tryies to bind port on dead agent (: same happening for SRV1 and 3 that's omitted--- 2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b None] Attempting to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on host neutron-n-0 for vnic_type normal with profile bind_port /usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py:574 2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc- 4bb6-a652-91359eda496b None] Attempting to bind port 6adcffbf- 09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f- 9b78-185ef15809e3 bind_port /usr/lib/python2.7/site- packages/neutron/plugins/ml2/drivers/mech_agent.py:57 2015-08-25 09:37:11.689000 [SRV2] 16522 DEBUG neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc- 4bb6-a652-91359eda496b None] Checking agent: {'binary': u'neutron- openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2015, 8, 25, 1, 35, 5), 'alive': False, 'id': u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic': u'N/A', 'host': u'neutron-n-0', 'agent_type': u'Open vSwitch agent', 'started_at': datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at': datetime.datetime(2015, 8, 12, 10, 23, 16), 'configurations': {u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices': 28, u'l2_population': False, u'tunnel_types': [], u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet- external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}} bind_port /usr/lib/python2.7/site- packages/neutron/plugins/ml2/drivers/mech_agent.py:65 2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc- 4bb6-a652-91359eda496b None] Attempting to bind with dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2015, 8, 25, 1, 35, 5), 'alive': False, 'id': u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic': u'N/A', 'host': u'neutron-n-0', 'agent_type': u'Open vSwitch agent', 'started_at': datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at': datetime.datetime(2015, 8, 12, 10, 23,16), 'configurations': {u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices': 28, u'l2_population': False, u'tunnel_types': [], u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet- external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}} 2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b None] Failed to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on host neutron-n-0 2015-08-25 09:37:11.719000 [SRV2] 16522 WARNING neutron.plugins.ml2.plugin [req-fd4c17c2-0acc-4bb6-a652-91359eda496b None] In _notify_port_updated(), no bound segment for port 6adcffbf- 09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f- 9b78-185ef15809e3 -- OVS agent 1 requests info for ports, including 6adcffbf-09d5-4a85-9339-9d6beb2bf82c -- 2015-08-25 09:37:16.043000 [OVS1] 12613 DEBUG neutron.common.rpc [req-588c942a-6526-464f-a447-782a5e2d436a None] neutron.plugins.openvswitch.agent.ovs_neutron_agent.OVSPluginApi method call called with arguments (<neutron.context.ContextBase object at 0x326d510>, {'args': {'host': 'neutron-n-2', 'agent_id': 'ovs-agent-neutron-n-2', 'devices': set([u'6adcffbf-09d5-4a85-9339-9d6beb2bf82c', u'2cdba01d-e48a-46a6-82c0-bbbfd8e2b490', u'fa58d644-8162-410b-8b45-747ae8813a61', u'5d5ea109-4807-4df3-bef4-b5d89c3ffebc', u'743ccaa6-7ed9-4195-aabd-3d55006338e1', u'dc662767-61a5-4807-b2ed-a7c76b541fd6', u'b5b271a1-d8b7-4b55-86fd-6316523878a5', u'b2d04f15-f2e2-429c-b948-00f88e1b9cf6', u'34b826df-9787-443c-9bef-084374827a85', u'c12e6e06-ff6a-44dc-b75f-78ec55dd3dd3', u'79b33879-3232-4b3a-a27c-c0a79da10379', u'499d3db7-9737-43a4-bcbf-923f9df5196e', u'55a19f5e-efff-4342-a82a-70ddb33fbefb', u'44dca3a9-44d4-418f-9d60-99dfe37afdc6'])}, 'namespace': None, 'method': 'get_devices_details_list'}) {'version': ' 1.3'} wrapper /usr/lib/python2.7/site-packages/neutron/common/log.py:33 ---- SRV1 notes the port is in binding_failed state ---- 2015-08-25 09:37:16.048000 [SRV1] 15065 DEBUG neutron.plugins.ml2.rpc [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf- 09d5-4a85-9339-9d6beb2bf82c details requested by agent ovs-agent- neutron-n-2 with host neutron-n-2 get_device_details /usr/lib/python2.7 /site-packages/neutron/plugins/ml2/rpc.py:60 2015-08-25 09:37:16.174000 [SRV1] 15065 WARNING neutron.plugins.ml2.rpc [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf- 09d5-4a85-9339-9d6beb2bf82c requested by agent ovs-agent-neutron-n-2 on network 2c52ca73-b11f-426f-9b78-185ef15809e3 not bound, vif_type: binding_failed 2015-08-25 09:37:16.807000 [OVS1] 12613 WARNING neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-09d5-4a85-9339-9d6beb2bf82c not defined on plugin --- OVS1 tags it out --- 2015-08-25 09:37:16.900000 [OVS1] 12613 DEBUG neutron.agent.linux.utils [req-588c942a-6526-464f-a447-782a5e2d436a None] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'qr-6adcffbf-09', 'tag=4095'] create_process /usr/lib/python2.7/site- packages/neutron/agent/linux/utils.py:48 ** Affects: neutron Importance: Low Assignee: Jakub Libosvar (libosvar) Status: New ** Tags: l3-ha ovs ** Changed in: neutron Importance: Undecided => Low ** Tags added: ovs -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1518995 Title: l3-agent router sync unbinds port on down ovs-agent Status in neutron: New Bug description: Using ha-routers, we've found that getting network node agents down for T>agent_down time, and then bringing them up fires a race condition during ovs-agent and l3-agent boot. Even if you set a constraint on ovs-agent being up before l3-agent, that won't work, because still the l3-agent will start and sync routers before the ovs-agent is able to report himself active back to neutron server. A possible solution: Using systemd notify from the openvswitch-agent, so it won't notify system-d about readiness until the first heartbeat has been sent to neutron-server. Other solutions could be detecting the unbound state ("Device xxxxx not defined on plugin") from the ovs-agent and trying to rebind it if the port is present. How does it reproduce: 1) neutron-servers and agents are down by being set to standby some time, T>agent_down_time 2) cluster is took out of standby, neutron-servers and agents are started again, first neutron-server, then agents 3) neutron servers consider the agents to be down (last heartbeat > agent_down_time), when checking ports in the database, reset them as binding failed... 4) agents start, try to fetch info from the ports, as those are in status binding failed... they get no info from the server (or the interpretation is wrong "Device xxx not defined on plugin") 5) agents mark ports as dead vlan (4095), as the port "does not exist" on the server Analysis over the logs: -- SRV2 tryies to bind port on dead agent (: same happening for SRV1 and 3 that's omitted--- 2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b None] Attempting to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on host neutron-n-0 for vnic_type normal with profile bind_port /usr/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py:574 2015-08-25 09:37:11.687000 [SRV2] 16522 DEBUG neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc- 4bb6-a652-91359eda496b None] Attempting to bind port 6adcffbf- 09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f- 9b78-185ef15809e3 bind_port /usr/lib/python2.7/site- packages/neutron/plugins/ml2/drivers/mech_agent.py:57 2015-08-25 09:37:11.689000 [SRV2] 16522 DEBUG neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc- 4bb6-a652-91359eda496b None] Checking agent: {'binary': u'neutron- openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2015, 8, 25, 1, 35, 5), 'alive': False, 'id': u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic': u'N/A', 'host': u'neutron-n-0', 'agent_type': u'Open vSwitch agent', 'started_at': datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at': datetime.datetime(2015, 8, 12, 10, 23, 16), 'configurations': {u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices': 28, u'l2_population': False, u'tunnel_types': [], u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet- external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}} bind_port /usr/lib/python2.7/site- packages/neutron/plugins/ml2/drivers/mech_agent.py:65 2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING neutron.plugins.ml2.drivers.mech_agent [req-fd4c17c2-0acc- 4bb6-a652-91359eda496b None] Attempting to bind with dead agent: {'binary': u'neutron-openvswitch-agent', 'description': None, 'admin_state_up': True, 'heartbeat_timestamp': datetime.datetime(2015, 8, 25, 1, 35, 5), 'alive': False, 'id': u'196101f4-9680-43bb-8310-9fca49cd4930', 'topic': u'N/A', 'host': u'neutron-n-0', 'agent_type': u'Open vSwitch agent', 'started_at': datetime.datetime(2015, 8, 22, 6, 55, 59), 'created_at': datetime.datetime(2015, 8, 12, 10, 23,16), 'configurations': {u'arp_responder_enabled': False, u'tunneling_ip': u'', u'devices': 28, u'l2_population': False, u'tunnel_types': [], u'enable_distributed_routing': False, u'bridge_mappings': {u'physnet- external': u'br-ex', u'physnet-tenants': u'br-enp7s0'}}} 2015-08-25 09:37:11.690000 [SRV2] 16522 WARNING neutron.plugins.ml2.managers [req-fd4c17c2-0acc-4bb6-a652-91359eda496b None] Failed to bind port 6adcffbf-09d5-4a85-9339-9d6beb2bf82c on host neutron-n-0 2015-08-25 09:37:11.719000 [SRV2] 16522 WARNING neutron.plugins.ml2.plugin [req-fd4c17c2-0acc-4bb6-a652-91359eda496b None] In _notify_port_updated(), no bound segment for port 6adcffbf- 09d5-4a85-9339-9d6beb2bf82c on network 2c52ca73-b11f-426f- 9b78-185ef15809e3 -- OVS agent 1 requests info for ports, including 6adcffbf-09d5-4a85-9339-9d6beb2bf82c -- 2015-08-25 09:37:16.043000 [OVS1] 12613 DEBUG neutron.common.rpc [req-588c942a-6526-464f-a447-782a5e2d436a None] neutron.plugins.openvswitch.agent.ovs_neutron_agent.OVSPluginApi method call called with arguments (<neutron.context.ContextBase object at 0x326d510>, {'args': {'host': 'neutron-n-2', 'agent_id': 'ovs-agent-neutron-n-2', 'devices': set([u'6adcffbf-09d5-4a85-9339-9d6beb2bf82c', u'2cdba01d-e48a-46a6-82c0-bbbfd8e2b490', u'fa58d644-8162-410b-8b45-747ae8813a61', u'5d5ea109-4807-4df3-bef4-b5d89c3ffebc', u'743ccaa6-7ed9-4195-aabd-3d55006338e1', u'dc662767-61a5-4807-b2ed-a7c76b541fd6', u'b5b271a1-d8b7-4b55-86fd-6316523878a5', u'b2d04f15-f2e2-429c-b948-00f88e1b9cf6', u'34b826df-9787-443c-9bef-084374827a85', u'c12e6e06-ff6a-44dc-b75f-78ec55dd3dd3', u'79b33879-3232-4b3a-a27c-c0a79da10379', u'499d3db7-9737-43a4-bcbf-923f9df5196e', u'55a19f5e-efff-4342-a82a-70ddb33fbefb', u'44dca3a9-44d4-418f-9d60-99dfe37afdc6'])}, 'namespace': None, 'method': 'get_devices_details_list'}) {'version': '1.3'} wrapper /usr/lib/python2.7/site-packages/neutron/common/log.py:33 ---- SRV1 notes the port is in binding_failed state ---- 2015-08-25 09:37:16.048000 [SRV1] 15065 DEBUG neutron.plugins.ml2.rpc [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf- 09d5-4a85-9339-9d6beb2bf82c details requested by agent ovs-agent- neutron-n-2 with host neutron-n-2 get_device_details /usr/lib/python2.7/site-packages/neutron/plugins/ml2/rpc.py:60 2015-08-25 09:37:16.174000 [SRV1] 15065 WARNING neutron.plugins.ml2.rpc [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-09d5-4a85-9339-9d6beb2bf82c requested by agent ovs-agent-neutron-n-2 on network 2c52ca73-b11f-426f-9b78-185ef15809e3 not bound, vif_type: binding_failed 2015-08-25 09:37:16.807000 [OVS1] 12613 WARNING neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-588c942a-6526-464f-a447-782a5e2d436a None] Device 6adcffbf-09d5-4a85-9339-9d6beb2bf82c not defined on plugin --- OVS1 tags it out --- 2015-08-25 09:37:16.900000 [OVS1] 12613 DEBUG neutron.agent.linux.utils [req-588c942a-6526-464f-a447-782a5e2d436a None] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', 'set', 'Port', 'qr-6adcffbf-09', 'tag=4095'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:48 To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1518995/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp