Public bug reported:

If the namespace of an ovs-managed device (a device created by add-port
followed by set type=internal) is deleted while some process is still
using it, the L3 agent fails to re-create the device.

Steps to repro:
- Stop l3-agent.
- Choose a router namespace with at least one ovs-managed device in it.
  For example, "qrouter-df5a3693-ec4d-4023-9e73-8dce9c4ac184" has a
  device "qg-df5a3693-ec".
- Ensure the namespace is used by at least one process. For demo
  purposes, start another shell using "ip netns exec
  qrouter-df5a3693-ec4d-4023-9e73-8dce9c4ac184 bash". In reality,
  ns-metadata-proxy or keepalived may live in the namespace.
- Delete the namespace with "ip netns del
  qrouter-df5a3693-ec4d-4023-9e73-8dce9c4ac184". The command won't fail,
  and the devices in the deleted namespace are still alive, observable
  with "ip link" in the previously opened shell. However, there is no
  easy way to enter the namespace from outside again.
- Start l3-agent.
- Verify that "qg-df5a3693-ec" cannot be re-created and managed by the
  L3 agent.

The backtrace looks like this (it is from our branch and may differ from
upstream):

ERROR neutron.agent.l3_agent Failed synchronizing routers
TRACE neutron.agent.l3_agent Traceback (most recent call last):
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 1429, in _sync_routers_task
TRACE neutron.agent.l3_agent     self._process_routers(routers, all_routers=True)
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 1354, in _process_routers
TRACE neutron.agent.l3_agent     self._router_added(r['id'], r)
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 672, in _router_added
TRACE neutron.agent.l3_agent     self.process_ha_router_added(ri)
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 923, in process_ha_router_added
TRACE neutron.agent.l3_agent     vip_cidrs=[gw_ip_cidr])
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/l3_agent.py", line 897, in ha_network_added
TRACE neutron.agent.l3_agent     prefix=HA_DEV_PREFIX)
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/interface.py", line 194, in plug
TRACE neutron.agent.l3_agent     ns_dev.link.set_address(mac_address)
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 230, in set_address
TRACE neutron.agent.l3_agent     self._as_root('set', self.name, 'address', mac_address)
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 217, in _as_root
TRACE neutron.agent.l3_agent     kwargs.get('use_root_namespace', False))
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 70, in _as_root
TRACE neutron.agent.l3_agent     namespace)
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 81, in _execute
TRACE neutron.agent.l3_agent     root_helper=root_helper)
TRACE neutron.agent.l3_agent   File "/opt/stack/neutron/neutron/agent/linux/utils.py", line 90, in execute
TRACE neutron.agent.l3_agent     raise RuntimeError(m)
TRACE neutron.agent.l3_agent RuntimeError:
TRACE neutron.agent.l3_agent Command: ['sudo', '/usr/local/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'link', 'set', 'ha-5bd08318-aa', 'address', 'fa:16:3e:f3:2b:6b']
TRACE neutron.agent.l3_agent Exit code: 1
TRACE neutron.agent.l3_agent Stdout: ''
TRACE neutron.agent.l3_agent Stderr: 'Cannot find device "ha-5bd08318-aa"\n'
TRACE neutron.agent.l3_agent

The root cause is that ovs-vsctl "can perform any number of commands in
a single run, implemented as a single atomic transaction against the
database." and neutron currently uses the following to create an
ovs-managed device:

  ovs-vsctl -- --if-exists del-port qr-2f4c613d-b7 \
      -- add-port br-int qr-2f4c613d-b7 \
      -- set Interface qr-2f4c613d-b7 type=internal \
      -- set Interface qr-2f4c613d-b7 external-ids:iface-id=2f4c613d-b7f2-4d63-89c8-af2d48948d19 \
      -- set Interface qr-2f4c613d-b7 external-ids:iface-status=active \
      -- set Interface qr-2f4c613d-b7 external-ids:attached-mac=fa:16:3e:3c:4d:18

ovs can delete a device it manages even if the device is in a deleted
(lost) namespace. But if del-port, add-port and set type=internal are
put together in one ovs-vsctl command, and therefore in one transaction,
ovs does nothing to the device and it is left as is.

In OVSInterfaceDriver.plug(self, network_id, port_id, device_name,
mac_address, bridge=None, namespace=None, prefix=None):

    self._ovs_add_port(bridge, tap_name, port_id, mac_address,
                       internal=internal)

    ns_dev.link.set_address(mac_address)

    if self.conf.network_device_mtu:
        ns_dev.link.set_mtu(self.conf.network_device_mtu)
        if self.conf.ovs_use_veth:
            root_dev.link.set_mtu(self.conf.network_device_mtu)

    # Add an interface created by ovs to the namespace.
    if not self.conf.ovs_use_veth and namespace:
        namespace_obj = ip.ensure_namespace(namespace)
        namespace_obj.add_device_to_namespace(ns_dev)

You can see that setting the MAC address, setting the MTU and moving the
device into the namespace all use the `ip` command directly, which
requires `ip` to have access to the device. The device created or
re-created by ovs (in self._ovs_add_port) must therefore not belong to
any namespace at that point. This can be guaranteed by splitting the
giant ovs-vsctl command above into two parts:

  ovs-vsctl --if-exists del-port qr-2f4c613d-b7

  ovs-vsctl -- add-port br-int qr-2f4c613d-b7 \
      -- set Interface qr-2f4c613d-b7 type=internal \
      -- set Interface qr-2f4c613d-b7 external-ids:iface-id=2f4c613d-b7f2-4d63-89c8-af2d48948d19 \
      -- set Interface qr-2f4c613d-b7 external-ids:iface-status=active \
      -- set Interface qr-2f4c613d-b7 external-ids:attached-mac=fa:16:3e:3c:4d:18

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1346861

Title:
  l3 cannot re-create device in deleted namespace

Status in OpenStack Neutron (virtual network service):
  New

Bug description:
  If the namespace of an ovs-managed device (a device created by
  add-port followed by set type=internal) is deleted while some process
  is still using it, the L3 agent fails to re-create the device.
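
For illustration, the proposed split into two ovs-vsctl invocations could
be sketched in Python roughly as follows. This is only a sketch under
stated assumptions: the function names (build_del_port_cmd,
build_add_port_cmd, replug_port) and the `run` parameter are hypothetical
and are not neutron's actual API, and really executing the commands
requires ovs-vsctl and sufficient privileges. The argument lists mirror
the two commands quoted in this report.

```python
import subprocess


def build_del_port_cmd(device_name):
    """First invocation: remove a stale port, if any, in its own transaction."""
    return ['ovs-vsctl', '--if-exists', 'del-port', device_name]


def build_add_port_cmd(bridge, device_name, port_id, mac_address):
    """Second invocation: create the internal port and set its external-ids."""
    cmd = ['ovs-vsctl', '--', 'add-port', bridge, device_name,
           '--', 'set', 'Interface', device_name, 'type=internal']
    for key, value in (('iface-id', port_id),
                       ('iface-status', 'active'),
                       ('attached-mac', mac_address)):
        cmd += ['--', 'set', 'Interface', device_name,
                'external-ids:%s=%s' % (key, value)]
    return cmd


def replug_port(bridge, device_name, port_id, mac_address,
                run=subprocess.check_call):
    """Run the delete and the add as two separate ovs-vsctl invocations.

    `run` is a hypothetical hook: any callable that takes an argv list.
    It defaults to subprocess.check_call, which raises on a non-zero exit.
    """
    commands = [build_del_port_cmd(device_name),
                build_add_port_cmd(bridge, device_name, port_id, mac_address)]
    for cmd in commands:
        run(cmd)
    return commands
```

Passing a different callable as `run` (for example `print`) shows the two
argument lists without touching OVS. Keeping the delete in its own
invocation is the whole point of the sketch: the add-port then runs in a
separate transaction, as proposed above.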

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1346861/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp