[Yahoo-eng-team] [Bug 1946318] [NEW] [ovn] Memory consumption grows over time due to MAC_Binding entries in SB database
Public bug reported:

MAC_Binding entries are used in OVN as a mechanism to learn MAC addresses on logical ports and avoid sending ARP requests to the network. There is no aging mechanism for these entries [0], so the table can grow indefinitely. In environments with large external networks (e.g. a /16), OVN may learn a considerable number of addresses, significantly growing the size of the database. Today, Neutron monitors this table to work around the missing aging mechanism and removes the MAC_Binding entries associated with Floating IPs; each neutron-server worker keeps an in-memory copy of the table, increasing its memory footprint to several gigabytes and eventually triggering the OOM killer.

[0] https://mail.openvswitch.org/pipermail/ovs-discuss/2019-June/048936.html

** Affects: neutron
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1946318

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1946318/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
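As an illustration of the workaround described above (not the actual Neutron code; the (ip, mac, logical_port) tuple shape and the helper name are assumptions), selecting the MAC_Binding rows to delete for a set of Floating IPs could be sketched as:

```python
def fip_mac_bindings(mac_binding_rows, floating_ips):
    """Return the MAC_Binding rows whose learned IP is a Floating IP.

    Illustrative sketch: real rows are ovsdbapp row objects, not tuples.
    """
    fips = set(floating_ips)
    return [row for row in mac_binding_rows if row[0] in fips]


rows = [
    ("172.24.4.10", "fa:16:3e:aa:bb:cc", "lrp-net1"),
    ("172.24.4.99", "fa:16:3e:dd:ee:ff", "lrp-net1"),
]
# Only the binding for the known Floating IP is selected for deletion.
stale = fip_mac_bindings(rows, {"172.24.4.10"})
print(stale)
```

The point of the bug is that doing this from every neutron-server worker requires each worker to monitor (and cache) the whole MAC_Binding table, which is what blows up the memory footprint.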
[Yahoo-eng-team] [Bug 1945651] [NEW] Updating binding profile through CLI doesn't work
Public bug reported:

Updating the binding profile of a port fails with an "invalid type" error. This suggests a bug in the Neutron code that handles the parameters:

$ neutron --debug port-update subportD --binding:profile type=dict parent_name=4af7ef43-597b-4747-b3ac-2b045db17374,tag=999
...
DEBUG: keystoneauth.session RESP: [400] Content-Length: 152 Content-Type: application/json Date: Thu, 30 Sep 2021 13:14:32 GMT X-Openstack-Request-Id: req-5ca76951-518b-4a73-94bd-5c872a462786
DEBUG: keystoneauth.session RESP BODY: {"NeutronError": {"type": "InvalidInput", "message": "Invalid input for operation: Invalid binding:profile. tag 999 value invalid type.", "detail": ""}}
DEBUG: keystoneauth.session PUT call to network for https://10.0.0.101:13696/v2.0/ports/fa2ba28e-3dfe-43af-b75a-8c466d23ebcd used request id req-5ca76951-518b-4a73-94bd-5c872a462786
DEBUG: neutronclient.v2_0.client Error message: {"NeutronError": {"type": "InvalidInput", "message": "Invalid input for operation: Invalid binding:profile. tag 999 value invalid type.", "detail": ""}}

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1945651
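A plausible reading of the failure (an assumption, not confirmed by the report): the API expects the trunk subport 'tag' to be an integer, while the CLI's "type=dict" parser sends every value as a string. A minimal sketch of that mismatch, with an illustrative validator standing in for Neutron's actual input validation:

```python
def validate_profile_tag(profile):
    """Illustrative stand-in for the server-side type check."""
    tag = profile.get("tag")
    if not isinstance(tag, int):
        # Mirrors the error text seen in the DEBUG output above.
        raise TypeError("tag %r value invalid type" % (tag,))
    return profile


# The CLI's type=dict parsing yields string values, e.g. tag="999".
cli_profile = {"parent_name": "parent-port", "tag": "999"}
try:
    validate_profile_tag(cli_profile)
except TypeError as exc:
    print(exc)

# Sending an actual integer would pass validation.
validate_profile_tag({"parent_name": "parent-port", "tag": 999})
```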
[Yahoo-eng-team] [Bug 1904412] [NEW] [ovn] Don't include IP addresses for OVN ports if both port security and DHCP are disabled
Public bug reported:

Right now, when port security is disabled, the ML2/OVN plugin sets the addresses field to ["unknown", "MAC IP1 IP2..."]. E.g.:

port 2da76786-51f0-4217-a09b-0c16e6728588 (aka servera-port-2)
    addresses: ["52:54:00:02:FA:0A 192.168.0.245", "unknown"]

There are scenarios (e.g. NIC teaming) where traffic may come from two different ports with the same source MAC address. While this is fine on the way out, OVN doesn't learn the location of the MAC on the way back and delivers the traffic to the port that has the MAC address defined in the DB. E.g.:

port1 - MAC1
port2 - MAC2

If traffic goes out from port2 with smac=MAC1, the traffic will be delivered by OVN. However, for incoming traffic arriving at br-int with dmac=MAC1, OVN will deliver it to port1 instead of port2 because of the above configuration. If OVN is not configured with any MAC(s), the traffic is instead flooded to all ports that have addresses=["unknown"].

The only reason "MAC IP" is added is so that OVN can install the necessary flows to serve DHCP natively. To cover these use cases, the ML2/OVN driver could clear the MAC-IP(s) from the 'addresses' column of ports that belong to a network with DHCP disabled.

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1904412
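The proposed behaviour can be sketched as a small decision function (helper name and argument shapes are illustrative assumptions, not the actual ML2/OVN code):

```python
def ovn_addresses(mac, ips, port_security, dhcp_enabled):
    """Sketch of the proposed 'addresses' column computation."""
    mac_ips = " ".join([mac] + list(ips))
    if port_security:
        return [mac_ips]
    if dhcp_enabled:
        # Keep "MAC IP" only so OVN can install native DHCP flows.
        return [mac_ips, "unknown"]
    # Port security and DHCP both disabled: flood to this port instead of
    # pinning the MAC to it (covers NIC-teaming style setups).
    return ["unknown"]


print(ovn_addresses("52:54:00:02:fa:0a", ["192.168.0.245"], False, True))
# -> ['52:54:00:02:fa:0a 192.168.0.245', 'unknown']
print(ovn_addresses("52:54:00:02:fa:0a", ["192.168.0.245"], False, False))
# -> ['unknown']
```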
[Yahoo-eng-team] [Bug 1893656] [NEW] [ovn] Limit the number of metadata workers
Public bug reported:

The OVN Metadata agent reuses the metadata_workers config option from the ML2/OVS Metadata agent. However, it makes sense to split the option: the two agents work very differently, so they warrant different defaults. In OVN, the Metadata agent runs on compute nodes, while in the ML2/OVS case it usually runs on controllers, so the scenario is totally different. We defaulted to 2 in TripleO; this commit includes further details:
https://opendev.org/openstack/puppet-neutron/commit/847f434140ee8435ee842801748a0deccdff8155

** Affects: neutron
     Importance: Medium
         Status: Confirmed

** Tags: ovn

https://bugs.launchpad.net/bugs/1893656
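The split-default idea could be sketched as follows (an illustration only: the helper name is invented, and the CPU-based fallback is an assumption about the ML2/OVS-style default, not a quote of Neutron's config code):

```python
import multiprocessing


def default_metadata_workers(backend):
    """Pick a per-backend default for metadata_workers (illustrative).

    With OVN the agent runs on every compute node, so a small fixed
    default (2, as chosen in TripleO) is preferred over scaling with
    the host's CPU count as controller-hosted agents do.
    """
    if backend == "ovn":
        return 2
    return max(multiprocessing.cpu_count() // 2, 1)


print(default_metadata_workers("ovn"))  # -> 2
```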
[Yahoo-eng-team] [Bug 1883554] [NEW] [ovn] Agent liveness checks create too many writes into OVN db
Public bug reported:

Every time the agent liveness check is triggered (via API, or periodically every agent_down_time / 2 seconds), there are a lot of writes into the SB database on the Chassis table. These writes trigger recomputation in the ovn-controller running on every node, causing a considerable performance hit, especially under stress. After commit [0] was merged we avoided bumping nb_cfg too frequently, but we are still writing into the Chassis table too often, from all the workers. We should use the same logic as in [1] to avoid writes that have happened recently.

[0] https://opendev.org/openstack/neutron/commit/647b7f63f9dafedfa9fb6e09e3d92d66fb512f0b
[1] https://github.com/openstack/neutron/blob/4de18104ae88a835544cefbf30c878aa49efc31f/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L1075

** Affects: neutron
     Importance: Undecided
         Status: New

** Tags: ovn

https://bugs.launchpad.net/bugs/1883554
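The proposed throttle (the same idea as the nb_cfg fix in [0]) could look like the sketch below; class and method names are illustrative, not the actual Neutron code:

```python
import time


class WriteThrottle:
    """Skip a Chassis table write if one happened within `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self._last = {}

    def should_write(self, chassis, now=None):
        now = time.monotonic() if now is None else now
        last = self._last.get(chassis)
        if last is not None and now - last < self.interval:
            return False  # a recent write already refreshed liveness
        self._last[chassis] = now
        return True


t = WriteThrottle(interval=30)
print(t.should_write("chassis-1", now=0))   # True: first write goes through
print(t.should_write("chassis-1", now=10))  # False: throttled
print(t.should_write("chassis-1", now=45))  # True: interval elapsed
```

Applying something like this in each worker would bound Chassis writes to one per interval per chassis instead of one per liveness check per worker.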
[Yahoo-eng-team] [Bug 1874733] [NEW] [OVN] Stale ports can be present in OVN NB leading to metadata errors
Public bug reported:

Right now, there's a chance that deleting a port in Neutron with ML2/OVN deletes the object from the Neutron DB while leaving a stale port in the OVN NB database. This can happen when deleting a port [0] raises a RowNotFound exception. While that may look as though the port already didn't exist in OVN NB, the truth is that the current port deletion code can throw that exception for different reasons (especially against OVN < 2.10, where Address Sets were used instead of Port Groups). The exception can be observed, for example, when some ACL or Address Set doesn't exist [1][2], amongst other cases. In this situation the revision number of the object is deleted [3] and the port stays stale forever in OVN NB (it is skipped by the maintenance task).

One of the main impacts of this issue is that the OVN NB database grows with stale objects that go undetected (they are only detected by the neutron-ovn-db-sync script). Most importantly, multiple ports in the same OVN Logical Switch may end up with the same IP addresses, leaving legitimate ports without metadata: as per the metadata agent code [4], if more than one port in the same network has the same IP address, a 404 is returned to the instance when it requests metadata.

The workaround is running the neutron-ovn-db-sync script in repair mode to get rid of the stale ports.

A proper fix would involve finer granularity of the exceptions that can happen around a port deletion, acting accordingly upon each of them. In the worst case, we would not delete the revision number if the port still exists, leaving it up to the maintenance task to fix it later (< 5 minutes). Ideally, we should identify all possible code paths and delete the port from OVN whenever possible, even if some other associated operation fails (with proper logging).

Also, this scenario seems more likely under high concurrency of API operations (such as with Heat), and possibly when Port Groups are not supported by the schema (OVN < 2.10).

Daniel Alvarez

[0] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L719
[1] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L680
[2] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L690
[3] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L722
[4] https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/agent/ovn/metadata/server.py#L86

** Affects: neutron
     Importance: Undecided
         Status: New

** Tags: ovn

https://bugs.launchpad.net/bugs/1874733
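The finer-grained handling suggested above can be sketched as follows. RowNotFound, FakeNB and delete_port are illustrative stand-ins for the ovsdbapp exception and the Neutron/OVN client objects; the key point is that a RowNotFound raised while cleaning up ACLs/Address Sets must not be treated as "the port row is already gone":

```python
class RowNotFound(Exception):
    """Stand-in for ovsdbapp's RowNotFound."""


class FakeNB:
    """Toy OVN NB client used only to exercise the sketch."""

    def __init__(self):
        self.ports = {"port-1"}

    def delete_lsp(self, port_id):
        if port_id not in self.ports:
            raise RowNotFound(port_id)
        self.ports.discard(port_id)

    def cleanup_acls(self, port_id):
        # Simulate a pre-2.10 style failure: the Address Set is missing.
        raise RowNotFound("address set missing")


def delete_port(nb, revisions, port_id):
    try:
        nb.delete_lsp(port_id)
    except RowNotFound:
        pass  # the port row itself is already absent; safe to proceed
    try:
        nb.cleanup_acls(port_id)
    except RowNotFound:
        pass  # log and continue; do NOT leave the port stale over this
    # The port no longer exists in OVN NB, so drop the revision number.
    revisions.pop(port_id, None)


nb, revisions = FakeNB(), {"port-1": 7}
delete_port(nb, revisions, "port-1")
print(nb.ports, revisions)  # -> set() {}
```

With the current code, the cleanup failure aborts the whole deletion after the revision number was already dropped, which is exactly the stale-port state described above.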
[Yahoo-eng-team] [Bug 1865889] [NEW] [RFE] Routed provider networks support in OVN
Public bug reported:

The routed provider networks feature doesn't work properly with the OVN backend. While the API doesn't return any errors, all the ports are allocated on the same OVN Logical Switch, which provides no Layer 2 isolation whatsoever; moreover, it won't work when multiple segments using different physnets are added to such a network. The reason for the latter is that, currently, core OVN only supports one localnet port per Logical Switch, so only one physical network can be associated with it.

I can think of two different approaches:

1) Change the OVN mech driver to logically separate Neutron segments:

a) Create an OVN Logical Switch *per Neutron segment*. This has some challenges from a consistency point of view, as right now there's a 1:1 mapping between a Neutron network and an OVN Logical Switch: revision numbers, maintenance task, OVN DB sync script, etcetera.

b) Each of those Logical Switches will have a localnet port associated with the physnet of the Neutron segment.

c) A port still belongs to the parent network, so all CRUD operations over a port will require figuring out which underlying OVN LS applies (depending on which segment the port lives in). The same goes for other objects (e.g. OVN Load Balancers, gw ports; if attaching a multisegment network to a Neutron router as a gateway is a valid use case at all).

e) Deferred allocation. A port can be created in a multisegment Neutron network with the IP allocation deferred until a compute node is assigned to the instance. In this case the OVN mech driver might need to move the Logical Switch Port from the Logical Switch of the parent network to that of the segment where it falls (can be prone to race conditions :?).

2) Core OVN changes. The current limitation is that only one localnet port is allowed per Logical Switch, so we can't map different physnets to it. If we add support for multiple localnet ports in core OVN, all the segments can live in the same OVN Logical Switch. My idea here would be:

a) Per Neutron segment, create a localnet port in the single OVN Logical Switch with its physnet and VLAN id (if any). E.g.:

    name          : provnet-f7038db6-7376-4b83-b57b-3f456bea2b80
    options       : {network_name=segment1}
    parent_name   : []
    port_security : []
    tag           : 2016
    tag_request   : []
    type          : localnet

    name          : provnet-84487aa7-5ac7-4f07-877e-1840d325e3de
    options       : {network_name=segment2}
    parent_name   : []
    port_security : []
    tag           : 2017
    tag_request   : []
    type          : localnet

Both ports would belong to the LS corresponding to the multisegment Neutron network.

b) When ovn-controller sees that a port in that network has been bound to it, all it needs to create is the patch port to the provider bridge that the bridge-mappings configuration dictates. E.g.:

    compute1: bridge-mappings = segment1:br-provider1
    compute2: bridge-mappings = segment2:br-provider2

When a port in the multisegment network gets bound to compute1, ovn-controller will create a patch port between br-int and br-provider1. The restriction here is that on a given hypervisor, only ports belonging to the same segment can be present; i.e. we can't mix VMs on different segments on the same hypervisor.

c) Minor changes are required on the Neutron side (just creating the localnet port upon segment creation). We need to discuss whether the restriction mentioned earlier makes sense. If not, perhaps we need to drop this approach completely or look for core OVN alternatives.

I'd lean towards approach number 2, as it seems the least invasive in terms of code changes, but the catch described above may make it a no-go, or we may need to explore other ways to eliminate that restriction in core OVN.

** Affects: neutron
     Importance: Undecided
         Status: New

** Tags: ovn rfe

https://bugs.launchpad.net/bugs/1865889
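The per-segment localnet port of approach 2(a) could be derived mechanically from the Neutron segment, e.g. (an illustrative helper, not the actual mech driver code; the dict mirrors the NB rows shown in the description):

```python
def localnet_port_for_segment(segment):
    """Build the localnet Logical_Switch_Port row for a Neutron segment."""
    return {
        "name": "provnet-%s" % segment["id"],
        "type": "localnet",
        "options": {"network_name": segment["physical_network"]},
        # VLAN segments carry a tag; flat segments would leave it unset.
        "tag": segment.get("segmentation_id"),
    }


seg = {
    "id": "f7038db6-7376-4b83-b57b-3f456bea2b80",
    "physical_network": "segment1",
    "segmentation_id": 2016,
}
port = localnet_port_for_segment(seg)
print(port["name"])  # -> provnet-f7038db6-7376-4b83-b57b-3f456bea2b80
```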
[Yahoo-eng-team] [Bug 1864641] [NEW] [OVN] Run maintenance task whenever the OVN DB schema has been upgraded
Public bug reported:

When the OVN DBs are upgraded (and restarted), there may be cases where we want to accommodate things to a new schema. In this situation we don't want to force a restart of neutron-server (or the metadata agent); instead, we should detect the upgrade and run whatever is needed. This can be achieved by checking the schema version via ovsdbapp [0] and, upon reconnection to the OVN DBs, comparing whether it is greater than what we had before.

[0] https://github.com/openvswitch/ovs/blob/master/python/ovs/db/schema.py#L35

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1864641
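The comparison itself could be sketched as below (an illustrative helper, not the actual maintenance-task code). Note that OVSDB schema versions such as "5.16.0" must be compared numerically, since a plain string comparison would order "5.9.0" after "5.16.0":

```python
def schema_upgraded(previous, current):
    """Return True if `current` is a newer OVSDB schema version.

    Versions are dotted strings like "5.16.0"; compare them as int tuples.
    """
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))

    return as_tuple(current) > as_tuple(previous)


print(schema_upgraded("5.9.0", "5.16.0"))   # -> True
print(schema_upgraded("5.16.0", "5.16.0"))  # -> False
```

On reconnection, a True result would trigger the maintenance task to re-run whatever accommodation the new schema needs.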
[Yahoo-eng-team] [Bug 1861509] [NEW] [OVN] GW rescheduling logic is broken
Public bug reported:

When a Chassis event happens in the SB database, we *always* attempt to reschedule any possible unhosted gateways [0], due to a problem with the existing logic:

    def get_unhosted_gateways(self, port_physnet_dict, chassis_physnets,
                              gw_chassis):
        unhosted_gateways = []
        for lrp in self._tables['Logical_Router_Port'].rows.values():
            if not lrp.name.startswith('lrp-'):
                continue
            physnet = port_physnet_dict.get(lrp.name[len('lrp-'):])
            chassis_list = self._get_logical_router_port_gateway_chassis(lrp)
            is_max_gw_reached = len(chassis_list) < ovn_const.MAX_GW_CHASSIS
            for chassis_name, prio in chassis_list:
                # TODO(azbiswas): Handle the case when a chassis is no
                # longer valid. This may involve moving conntrack states,
                # so it needs to discussed in the OVN community first.
                if is_max_gw_reached or utils.is_gateway_chassis_invalid(
                        chassis_name, gw_chassis, physnet, chassis_physnets):
                    unhosted_gateways.append(lrp.name)
        return unhosted_gateways

1) is_max_gw_reached is always going to be True (normally the possible candidates are fewer than the maximum).

2) unhosted_gateways.append(lrp.name) is executed inside a loop where lrp doesn't change, meaning that if there are 3 candidates in chassis_list, lrp.name is added 3 times to the list!

3) Later on, the caller iterates over the returned list [1]; as it contains every LRP N times (N being the number of gateway chassis), it does a lot of extra and unnecessary work.

This is almost harmless in the sense that it's not breaking any functionality, but it creates unnecessary updates on the logical router port:

2020-01-31 15:54:04.669 37 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): UpdateLRouterPortCommand(name=lrp-93b49ece-2dbc-4fcc-84cb-e7afd482a12e, columns={'gateway_chassis': ['0444b1f1-e9a9-4a73-ba78-997c87e61795', '43d98571-ccd6-48ce-bf4f-08f24aeed522', 'fe8f9887-27ef-4724-8cfc-50ec6e3d4a98']}, if_exists=True) do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2020-01-31 15:54:04.670 37 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121

[0] https://github.com/openstack/neutron/blob/4689564fa29915b042547bdeb3dcb44bca54e20c/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py#L449
[1] https://github.com/openstack/neutron/blob/858d7f33950a80c73501377a4b2cd36b915d0f40/neutron/services/ovn_l3/plugin.py#L324

** Affects: neutron
     Importance: Undecided
       Assignee: Maciej Jozefczyk (maciej.jozefczyk)
         Status: Confirmed

https://bugs.launchpad.net/bugs/1861509
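Point 2) (and the misleading variable name from point 1) could be addressed along these lines; whether rescheduling should fire whenever fewer than MAX_GW_CHASSIS chassis are assigned remains a policy question. The inputs below are simplified stand-ins for the IDL rows and helper functions, so this is a sketch, not a patch:

```python
MAX_GW_CHASSIS = 5  # stand-in for ovn_const.MAX_GW_CHASSIS


def get_unhosted_gateways(lrps, port_physnet_dict, chassis_physnets,
                          gw_chassis, is_invalid):
    """Sketch of the corrected loop.

    `lrps` is a list of (name, chassis_list) pairs and `is_invalid`
    stands in for utils.is_gateway_chassis_invalid.
    """
    unhosted = []
    for name, chassis_list in lrps:
        if not name.startswith("lrp-"):
            continue
        physnet = port_physnet_dict.get(name[len("lrp-"):])
        below_max = len(chassis_list) < MAX_GW_CHASSIS
        # any() short-circuits, so each LRP is appended at most once.
        if below_max or any(
                is_invalid(c, gw_chassis, physnet, chassis_physnets)
                for c, _prio in chassis_list):
            unhosted.append(name)
    return unhosted


lrps = [("lrp-93b49ece", [("c1", 3), ("c2", 2), ("c3", 1)])]
result = get_unhosted_gateways(lrps, {}, {}, [], lambda *args: False)
print(result)  # the LRP appears once, not once per candidate chassis
```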
[Yahoo-eng-team] [Bug 1861510] [NEW] [OVN] GW rescheduling mechanism is triggered on every Chassis update unnecessarily
Public bug reported:

Whenever a chassis is updated, for whatever reason, we trigger the rescheduling mechanism [0]. As the current agent liveness check involves updating the Chassis table quite frequently, we should avoid rescheduling gateways for those checks (i.e. when only nb_cfg or external_ids change).

[0] https://github.com/openstack/neutron/blob/4689564fa29915b042547bdeb3dcb44bca54e20c/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L87

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1861510
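The proposed filter could be sketched as below. Here `old_values` stands in for the dict of previous values of the changed columns that an OVSDB event handler receives (shapes are illustrative, not the actual ovsdb_monitor code):

```python
# Columns bumped by the agent liveness check; changes limited to these
# should not trigger gateway rescheduling.
LIVENESS_COLUMNS = {"nb_cfg", "external_ids"}


def should_reschedule(old_values):
    """Return True only if a non-liveness Chassis column changed."""
    changed = set(old_values)
    return bool(changed - LIVENESS_COLUMNS)


print(should_reschedule({"nb_cfg": 41}))             # -> False (liveness bump)
print(should_reschedule({"hostname": "compute-0"}))  # -> True
```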
[Yahoo-eng-team] [Bug 1860436] [NEW] [ovn] Agent liveness checks are flaky and report false positives
Public bug reported:

The way the networking-ovn mech driver performs health checks on agents reports false positives due to a race condition:

1) neutron-server increments nb_cfg in the NB_Global table from X to X+1 [0]
2) neutron-server almost immediately checks all the Chassis rows to see if they have written X+1 [1]
3) neutron-server processes the updates from each agent from X to X+1

Most of the time the condition in step 2 doesn't hold, so the timestamp is not updated. The result is that after 60 seconds (the default agent timeout), the agent is shown as dead. Sometimes 3) happens before 2), so the timestamp gets updated and all is fine, but this is not the normal case:

1) Bump of nb_cfg:

2020-01-21 11:35:59.534 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36915
2020-01-21 11:35:59.538 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36916

2) Check of each chassis ext_ids against the newly bumped nb_cfg:

2020-01-21 11:35:59.539 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.540 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.541 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.542 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.543 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.544 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.546 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915

3) Processing updates [3] in the ChassisEvent (some are even older!):

2020-01-21 11:35:59.546 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.548 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.556 32 INFO networking_ovn.ovsdb.ovsdb_monitor [req-efa34cac-2296-4d30-b153-9630b0309fcd - - - - -] XXX chassis update:
2020-01-21 11:35:59.556 27 INFO networking_ovn.ovsdb.ovsdb_monitor [req-91f7d181-bfa3-4646-9814-bb680d011081 - - - - -] XXX chassis update:
2020-01-21 11:35:59.557 25 INFO networking_ovn.ovsdb.ovsdb_monitor [req-420e5a25-13e4-4da6-8277-8a3a1028c9e9 - - - - -] XXX chassis update:
2020-01-21 11:35:59.756 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36916
2020-01-21 11:35:59.778 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36916

IMO, we need to space the bump of nb_cfg [2] and the check [3] in time, since the NB_Global change needs to be propagated to the SB, processed by all agents, and then travel back to neutron-server, which has to process the JSON updates and refresh its internal tables. Even though this is fast, most of the time it is not fast enough. Another solution is to allow a difference of '1' when updating timestamps.

[0] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1093
[1] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1098
[2] https://github.com/openstack/networking-ovn/blob/bf577e5a999f7db4cb9b790664ad596e1926d9a0/networking_ovn/ml2/mech_driver.py#L988
[3] https://github.com/openstack/networking-ovn/blob/6302298e9c4313f1200c543c89d92629daff9e89/networking_ovn/ovsdb/ovsdb_monitor.py#L74

** Affects: neutron Importance: Undecided Assignee: Daniel Alvarez (dalvarezs)
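The "allow a difference of '1'" fix suggested above can be sketched in a few lines. This is a hypothetical illustration, not the actual networking-ovn code; chassis_is_alive and NB_CFG_TOLERANCE are invented names.

```python
# Hypothetical sketch of the tolerance-based liveness check suggested in
# the report: a chassis whose acknowledged nb_cfg lags the global value by
# at most NB_CFG_TOLERANCE is still considered alive, absorbing the
# propagation delay NB -> SB -> agent -> neutron-server.

NB_CFG_TOLERANCE = 1

def chassis_is_alive(global_nb_cfg, chassis_nb_cfg):
    """Return True if the chassis acknowledged a recent enough nb_cfg."""
    return global_nb_cfg - chassis_nb_cfg <= NB_CFG_TOLERANCE
```

With the values from the log above (global 36916, chassis still at 36915), a strict equality check reports the agent dead, while the tolerant check does not.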
[Yahoo-eng-team] [Bug 1804259] [NEW] DB: sorting on elements which are AssociationProxy fails
Public bug reported:

If I do a DB query trying to sort by a column which is an AssociationProxy, I get the following exception:

Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers   File "/opt/stack/neutron/neutron/db/db_base_plugin_v2.py", line 1438, in get_ports
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers     page_reverse=page_reverse)
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers   File "/opt/stack/neutron/neutron/plugins/ml2/plugin.py", line 1935, in _get_ports_qu
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers     *args, **kwargs)
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers   File "/opt/stack/neutron/neutron/db/db_base_plugin_v2.py", line 1418, in _get_ports_
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers     *args, **kwargs)
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers   File "/opt/stack/neutron/neutron/db/_model_query.py", line 159, in get_collection_qu
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers     sort_keys = db_utils.get_and_validate_sort_keys(sorts, model)
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers   File "/usr/lib/python2.7/site-packages/neutron_lib/db/utils.py", line 45, in get_and
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers     if isinstance(sort_key_attr.property,
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR neutron.plugins.ml2.managers AttributeError: 'AssociationProxy' object has no attribute 'property'

This is reproducible, for example, by querying ports sorted by the 'created_at' attribute:

ports = self._plugin.get_ports(context, sorts=[('created_at', True)])

It looks like we may need to special-case the AssociationProxy columns, as we already do in the filtering code at:
https://github.com/openstack/neutron/blob/0bb6136919a31751242d2efbefedbd8922b6bd0a/neutron/db/_model_query.py#L88

** Affects: neutron Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1804259
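The special-casing could look roughly like the sketch below. The two stub classes are minimal stand-ins for SQLAlchemy constructs (the real fix would test against sqlalchemy.ext.associationproxy.AssociationProxy); resolve_sort_attr is an invented helper name.

```python
class AssociationProxyStub:
    """Stands in for sqlalchemy.ext.associationproxy.AssociationProxy."""
    def __init__(self, remote_attr):
        self.remote_attr = remote_attr   # column on the related model

class ColumnStub:
    """Stands in for an ordinary mapped column, which exposes .property."""
    def __init__(self, name):
        self.property = name

def resolve_sort_attr(attr):
    """Return a sortable column, unwrapping association proxies first."""
    if isinstance(attr, AssociationProxyStub):
        # Sorting must target the proxied column of the related model;
        # touching attr.property here would raise the AttributeError above.
        return attr.remote_attr
    return attr
```

The key point is checking the attribute's type before ever reading .property, mirroring what the filtering code already does.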
[Yahoo-eng-team] [Bug 1802369] Re: Unit tests failing due to recent Neutron patch
When importing that module, these event listeners are created: https://github.com/openstack/neutron/blob/master/neutron/db/api.py#L110 and https://github.com/openstack/neutron/blob/master/neutron/db/api.py#L134

Adding them manually fixed the issue. For now the workaround imports the file so that the listeners get registered, but this is something the Neutron folks need to confirm, and perhaps the listeners should be moved somewhere else so they get registered either way.

** Also affects: neutron Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1802369

Title: Unit tests failing due to recent Neutron patch
Status in networking-ovn: New
Status in neutron: New

Bug description: Since Nov 7th we have unit tests failing. I've been doing a git bisect on neutron and found that [0] is the culprit. Digging further, I checked that the problem is not that MAX_RETRIES changed from 10 (neutron code) to 20 (neutron-lib) but the fact that "from neutron.db import api as db_api" is no longer imported.

[0] https://github.com/openstack/neutron/commit/3316b45665a99b0f61e45a8c7facf538618861bf

To manage notifications about this bug go to: https://bugs.launchpad.net/networking-ovn/+bug/1802369/+subscriptions

-- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
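Why does merely importing the module matter? Registration happens as an import side effect, which the toy model below illustrates. This is not the neutron code; the registry and names are invented for illustration.

```python
# Illustration (not neutron's actual code) of import-side-effect listener
# registration: the handler is attached at module import time, so any
# consumer that stops importing the module silently loses the behaviour.

_LISTENERS = []

def register(event, handler):
    _LISTENERS.append((event, handler))

# Module-level registration: runs exactly once, when the module is imported.
register('retry_request', lambda exc: 'retrying')

def fire(event, arg):
    """Invoke every handler registered for this event."""
    return [handler(arg) for (ev, handler) in _LISTENERS if ev == event]
```

A test suite that calls fire() without ever importing the registering module would see no retry behaviour, which is the analogue of the failing unit tests above.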
[Yahoo-eng-team] [Bug 1797084] [NEW] Stale namespaces when fallback tunnels are present
Public bug reported:

When a network namespace is created and the fb_tunnels_only_for_init_net sysctl option is set to 0 (the default), fallback tunnel devices are automatically created in it if the initial network namespace had them. This leads to neutron's ip_lib detecting such namespaces as 'not empty' and thus being unable to clean them up. We need to add these devices to the set that is ignored when determining whether a namespace is empty.

More info at: https://www.kernel.org/doc/Documentation/sysctl/net.txt

** Affects: networking-ovn Importance: Undecided Status: New
** Affects: neutron Importance: Undecided Assignee: Daniel Alvarez (dalvarezs) Status: In Progress
** Also affects: networking-ovn Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1797084
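The proposed fix amounts to subtracting the kernel-created fallback devices before deciding a namespace is non-empty. A minimal sketch, assuming the device names listed in the kernel docs (the exact set used by neutron's ip_lib may differ):

```python
# Sketch of an emptiness check that ignores fallback tunnel devices.
# FALLBACK_TUNNEL_DEVICES is assumed from the kernel documentation on
# fb_tunnels_only_for_init_net; it is not copied from neutron.

FALLBACK_TUNNEL_DEVICES = {'sit0', 'tunl0', 'gre0', 'gretap0',
                           'ip6tnl0', 'ip6gre0', 'erspan0'}

def namespace_is_empty(device_names):
    """True if the namespace holds only loopback and fallback tunnels."""
    real_devices = set(device_names) - FALLBACK_TUNNEL_DEVICES - {'lo'}
    return not real_devices
```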
[Yahoo-eng-team] [Bug 1785615] [NEW] DNS resolution through eventlet contacts nameservers if there's an IPv4 or IPv6 entry present in hosts file
Public bug reported:

When trying to resolve a hostname on a node with no nameservers configured and only one entry present for it in /etc/hosts (IPv4 or IPv6), eventlet will try to fetch the other entry over the network. This diverges from what the original getaddrinfo() implementation does and causes 30-second delays, and often timeouts, when, for example, the metadata agent tries to contact Nova [0].

Here is a simple reproducer which shows the behavior once we do the monkey patching:

import eventlet
import socket
import time

print socket.getaddrinfo('overcloud.internalapi.localdomain', 80, 0, socket.SOCK_STREAM)
print time.time()
eventlet.monkey_patch()
print socket.getaddrinfo('overcloud.internalapi.localdomain', 80, 0, socket.SOCK_STREAM)
print time.time()

The eventlet issue was reported at [1] and a fix got merged in its master branch.

[0] https://github.com/openstack/neutron/blob/13.0.0.0b3/neutron/agent/metadata/agent.py#L189
[1] https://github.com/eventlet/eventlet/issues/511

** Affects: neutron Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1785615
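The expected behavior, in other words, is "if /etc/hosts answers for either address family, never fall through to the network." The toy model below illustrates that contract; it is purely a sketch of the expectation, not eventlet's implementation, and all names are invented.

```python
# Model of the resolution order the report expects from getaddrinfo():
# a hosts-file hit for the name short-circuits resolution entirely,
# so a node with no nameservers never blocks on the network.

def resolve(host, hosts_entries, query_dns):
    """hosts_entries: list of (address, hostname) pairs from /etc/hosts."""
    addrs = [addr for (addr, name) in hosts_entries if name == host]
    if addrs:
        return addrs            # the hosts file answered: no network I/O
    return query_dns(host)      # only unknown names may hit the resolver
```

Under this model, a single IPv4 entry for overcloud.internalapi.localdomain is returned immediately, instead of triggering a doomed AAAA lookup against absent nameservers.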
[Yahoo-eng-team] [Bug 1779882] [NEW] Deleting a port on a system with 1K ports takes too long
Public bug reported:

When attempting to delete a port on a system with 1K ports, it takes around 35 seconds to complete:

$ time openstack port delete port60_2
real    0m34.367s
user    0m3.497s
sys     0m0.187s

The log is *full* of the following messages when I issue the CLI:

neutron-server[324]: DEBUG neutron.pecan_wsgi.hooks.policy_enforcement [None req-a936bb85-d881-441b-aa07-74c4779d1771 demo demo] Attributes excluded by policy engine: [u'binding:profile', u'binding:vif_details', u'binding:vif_type', u'binding:host_id'] {{(pid=342) _exclude_attributes_by_policy /opt/stack/neutron/neutron/pecan_wsgi/hooks/policy_enforcement.py:256}}

To be precise, 896 messages like this per delete:

$ sudo journalctl -u devstack@q-svc | grep "Attributes excluded by policy engine" | wc -l
33626
$ time openstack port delete port60_2
real    0m34.367s
user    0m3.497s
sys     0m0.187s
$ sudo journalctl -u devstack@q-svc | grep "Attributes excluded by policy engine" | wc -l
34522

I'm using networking-ovn as the mechanism driver, but this looks unrelated to the backend.

** Affects: neutron Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1779882
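The 896 identical messages per delete suggest the exclusion set is recomputed (and logged) once per port, even though it cannot vary across ports within one request. One plausible mitigation, sketched here with entirely hypothetical names (this is not neutron's policy engine), is to memoize the computation per (resource, role):

```python
# Hypothetical sketch: memoize the per-(resource, role) attribute
# exclusion so 1000 ports cost one policy evaluation instead of 1000.

import functools

POLICY = {('port', 'demo'): ('binding:profile', 'binding:vif_details',
                             'binding:vif_type', 'binding:host_id')}

CALLS = {'count': 0}

@functools.lru_cache(maxsize=None)
def excluded_attributes(resource, role):
    CALLS['count'] += 1          # tracks how often the slow path runs
    return POLICY.get((resource, role), ())

# 1000 ports in a request: the exclusion set is computed exactly once.
for _ in range(1000):
    excluded_attributes('port', 'demo')
```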
[Yahoo-eng-team] [Bug 1769609] [NEW] neutron-tempest-plugin: there is no way to create a subnet without a gateway and this breaks trunk tests
Public bug reported:

This commit [0] fixed an issue with the subnet CIDR generation in tempest tests. With the fix, all subnets get a gateway assigned regardless of whether they are attached to a router, so it may happen that the gateway port doesn't exist. Normally this shouldn't be a big deal, but for trunk ports it's currently an issue in test_subport_connectivity [1], where the test boots a VM (advanced image), opens an SSH connection to its FIP to configure the interface for the subport, and runs dhclient on it. When dhclient runs, a new default gateway route is installed and the connectivity to the FIP is lost, causing the test to fail since it cannot execute/read any further commands.

I logged into the VM with virsh and checked the routes:

[root@tempest-server-test-378882328 ~]# ip r
default via 10.100.0.17 dev eth0.10
default via 10.100.0.1 dev eth0 proto static metric 100
default via 10.100.0.17 dev eth0.10 proto static metric 400
10.100.0.0/28 dev eth0 proto kernel scope link src 10.100.0.5 metric 100
10.100.0.16/28 dev eth0.10 proto kernel scope link src 10.100.0.25
169.254.169.254 via 10.100.0.18 dev eth0.10 proto dhcp
169.254.169.254 via 10.100.0.2 dev eth0 proto dhcp metric 100

This shouldn't happen, as the subnet is not even connected to a router, and 10.100.0.17 doesn't even exist in Neutron. Prior to [0] the test didn't fail, because the old code created the subnet with gateway=None and the gateway was skipped (it actually only sets up a gateway automatically if the gateway equals '' [2], but it was None instead [3]). Let's add a way to configure subnets without a gateway.

[0] https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e
[1] https://github.com/openstack/neutron-tempest-plugin/blob/02a5e2b07680d8c4dd69d681ae9a01d92b4be0ac/neutron_tempest_plugin/scenario/test_trunk.py#L229
[2] https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e#diff-872f814e35c7437b9f42aef71a991279L295
[3] https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e#diff-2f4232239c10eae0d0688617a3e6f98dL238

** Affects: neutron Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1769609
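The API shape being asked for is the ability to distinguish "pick a gateway automatically" from "no gateway at all", rather than overloading None. A minimal sketch with a sentinel default; create_subnet and AUTO are illustrative names only, not the tempest plugin's API:

```python
# Sketch of a subnet helper whose default derives the gateway from the
# CIDR, while an explicit gateway=None means "no gateway at all".

import ipaddress

AUTO = object()   # sentinel: "derive the gateway from the CIDR"

def create_subnet(cidr, gateway=AUTO):
    if gateway is AUTO:
        # Default to the first host address of the network.
        gateway = str(next(ipaddress.ip_network(cidr).hosts()))
    return {'cidr': cidr, 'gateway_ip': gateway}   # None => no gateway
```

For the 10.100.0.16/28 subport subnet from the routes above, the automatic choice is 10.100.0.17, which is exactly the phantom next hop the VM ended up with.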
[Yahoo-eng-team] [Bug 1765545] [NEW] tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_port_security_disable_security_group fails due to instances failing to retrieve pub
Public bug reported:

Running the tempest test tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_port_security_disable_security_group sometimes fails when trying to authenticate via public key to the access point instance [0]. After debugging, I managed to connect to the instance via the virsh console and check that the instance did not have the SSH key installed:

(CirrOS login banner: http://cirros-cloud.net)
login as 'cirros' user. default password: 'cubswin:)'. use 'sudo' for root.
tempest-server-tempest-testsecuritygroupsbasicops-679466304-acc login: cirros
Password:
$ cat .ssh/authorized_keys
$

Looking at the console log, I can see the following:

cirros-ds 'net' up at 6.37
checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 6.54. iid=i-006e
failed to get http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key
warning: no ec2 metadata for public-keys
failed to get http://169.254.169.254/2009-04-04/user-data
warning: no ec2 metadata for user-data
found datasource (ec2, net)

So it looks like it is able to fetch the instance-id but not the public key. When I try to do it manually, it retrieves it successfully:

$ curl 169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key
ssh-rsa B3NzaC1yc2EDAQABAAABAQDICvVroPErVzHbx+a1lhI4RU33f0Nb4DT2FiNbKhaI1ZBl4/zRbqFY5a4lMipV810dCzJSViGJVw0VzNgDOf/zCt6Joosem5qC8hKwRgX5tcEXQ0UnVCiXddP1bydbRVt4BofTCTUPb4SZ3Z4zl0+L4WWB1CY58KYl19Lr7H4zqMXPqa6Mw+k1dpo0YBk3ZZR4pIxGtN916w6x6vtSIy2oDg4zaxUuewGaQNp9wENEuP3+TOseTymBxpbdys2RpUKXM2vhWWDDbrzG0+juOFxn111SgFYom05sjONDM310xHX5KBm6QuJO6ObCkSIKre9wvU60i19YW7pxBtyfztIJ Generated-by-Nova

Also, running the following command doesn't work:

$ sudo cirros-apply net -v
$ cat .ssh/authorized_keys
$

If, instead, I run the following command and reboot, the key gets properly installed:

$ sudo cirros-per boot cirros-apply-net cirros-apply net && reboot
...
$ cat .ssh/authorized_keys
ssh-rsa B3NzaC1yc2EDAQABAAABAQDICvVroPErVzHbx+a1lhI4RU33f0Nb4DT2FiNbKhaI1ZBl4/zRbqFY5a4lMipV810dCzJSViGJVw0VzNgDOf/zCt6Joosem5qC8hKwRgX5tcEXQ0UnVCiXddP1bydbRVt4BofTCTUPb4SZ3Z4zl0+L4WWB1CY58KYl19Lr7H4zqMXPqa6Mw+k1dpo0YBk3ZZR4pIxGtN916w6x6vtSIy2oDg4zaxUuewGaQNp9wENEuP3+TOseTymBxpbdys2RpUKXM2vhWWDDbrzG0+juOFxn111SgFYom05sjONDM310xHX5KBm6QuJO6ObCkSIKre9wvU60i19YW7pxBtyfztIJ Generated-by-Nova

After checking the OVN metadata proxy log and also the nova-metadata-api logs, I can see the requests and the 200 OK responses:

2018-04-19 22:31:37.383 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 4.0820560
2018-04-19 22:31:38.800 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 200 len: 183 time: 1.1210902
2018-04-19 22:31:49.148 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200 len: 146 time: 0.0230849
2018-04-19 22:31:49.387 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/ami-launch-index HTTP/1.1" status: 200 len: 136 time: 0.0262089
2018-04-19 22:31:50.225 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 200 len: 142 time: 0.7244408
2018-04-19 22:31:50.482 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/local-ipv4 HTTP/1.1" status: 200 len: 146 time: 0.0143349
2018-04-19 22:31:50.612 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/public-ipv4 HTTP/1.1" status: 200 len: 144 time: 0.0130348
2018-04-19 22:31:50.793 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/hostname HTTP/1.1" status: 200 len: 199 time: 0.0100901
2018-04-19 22:31:51.039 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/local-hostname HTTP/1.1" status: 200 len: 199 time: 0.0094490
2018-04-19 22:31:51.197 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/user-data HTTP/1.1" status: 404 len: 297 time: 0.0226381
2018-04-19 22:31:51.475 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/block-device-mapping HTTP/1.1" status: 200 len: 143 time: 0.0118120
2018-04-19 22:31:51.579 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/block-device-mapping/ami HTTP/1.1" status: 200 len: 138 time: 0.0084291
2018-04-19 22:31:51.672 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" status: 200 len: 535 time: 12.7038500
2018-04-19 22:31:51.735 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 143 time: 0.0147779
2018-04-19 22:31:51.930 24 INFO eventlet.wsgi.server [-] 10.100.0.7, "GET /2009-04-04/meta-data/public-hostname HTTP/1.1" stat
[Yahoo-eng-team] [Bug 1753540] [NEW] When isolated/force metadata is enabled, metadata proxy doesn't get automatically started/stopped when needed
Public bug reported:

When the enable_isolated_metadata option is set to True in the DHCP agent configuration, metadata proxy instances won't get started dynamically when a network becomes isolated. Similarly, when a subnet is added to a router, they don't get stopped if they were already running.

100% reproducible, with enable_isolated_metadata=True:

1. Create a network, a subnet and a router.
2. Check that there's a proxy instance running in the DHCP namespace for this network:
   neutron 89 1 0 17:01 ? 00:00:00 haproxy -f /var/lib/neutron/ns-metadata-proxy/9d1c7905-a887-419a-a885-9b07c20c2012.conf
3. Attach the subnet to the router.
4. Verify that the proxy instance is still running.
5. Restart the DHCP agent.
6. Verify that the proxy instance went away (since the network is not isolated).
7. Remove the subnet from the router.
8. Verify that the proxy instance has not been spawned.

At this point, booting any VM on the network will fail since it won't be able to fetch metadata. However, any update on the network/subnet will trigger the agent to refresh the status of the isolated metadata proxy. For example, "openstack network set --name foo" would trigger the DHCP agent to spawn the proxy for that network.

** Affects: neutron Importance: Undecided Assignee: Daniel Alvarez (dalvarezs) Status: In Progress
** Changed in: neutron Assignee: (unassigned) => Daniel Alvarez (dalvarezs)

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1753540
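The lifecycle being asked for can be modeled as a small reconciliation rule: spawn the proxy exactly when the network is isolated (no subnet attached to a router), re-evaluated on every attach/detach rather than only on unrelated network updates. The function names below are invented for illustration and are not the DHCP agent's code.

```python
# Model of the expected metadata-proxy lifecycle for a network.

def network_is_isolated(subnets_on_routers):
    """Isolated == none of the network's subnets is attached to a router."""
    return not subnets_on_routers

def reconcile_proxy(proxy_running, subnets_on_routers):
    """Return the action ('spawn', 'kill' or 'noop') for the proxy."""
    isolated = network_is_isolated(subnets_on_routers)
    if isolated and not proxy_running:
        return 'spawn'
    if not isolated and proxy_running:
        return 'kill'
    return 'noop'
```

Evaluated at steps 3 and 7 of the reproducer, this rule would kill the now-redundant proxy after the subnet attach and respawn it after the detach, instead of leaving the stale state described above.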
[Yahoo-eng-team] [Bug 1748658] [NEW] Restarting Neutron containers which make use of network namespaces doesn't work
Public bug reported:

When DHCP, L3, Metadata or OVN-Metadata containers are restarted, they fail to set up their previously created namespaces:

[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker restart 8559f5a7fa45
8559f5a7fa45
[heat-admin@overcloud-novacompute-0 neutron]$ tail -f /var/log/containers/neutron/networking-ovn-metadata-agent.log
2018-02-09 08:34:41.059 5 CRITICAL neutron [-] Unhandled error: ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Invalid argument
2018-02-09 08:34:41.059 5 ERROR neutron Traceback (most recent call last):
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/bin/networking-ovn-metadata-agent", line 10, in <module>
2018-02-09 08:34:41.059 5 ERROR neutron     sys.exit(main())
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/cmd/eventlet/agents/metadata.py", line 17, in main
2018-02-09 08:34:41.059 5 ERROR neutron     metadata_agent.main()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata_agent.py", line 38, in main
2018-02-09 08:34:41.059 5 ERROR neutron     agt.start()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 147, in start
2018-02-09 08:34:41.059 5 ERROR neutron     self.sync()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 56, in wrapped
2018-02-09 08:34:41.059 5 ERROR neutron     return f(*args, **kwargs)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 169, in sync
2018-02-09 08:34:41.059 5 ERROR neutron     metadata_namespaces = self.ensure_all_networks_provisioned()
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 350, in ensure_all_networks_provisioned
2018-02-09 08:34:41.059 5 ERROR neutron     netns = self.provision_datapath(datapath)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 294, in provision_datapath
2018-02-09 08:34:41.059 5 ERROR neutron     veth_name[0], veth_name[1], namespace)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 182, in add_veth
2018-02-09 08:34:41.059 5 ERROR neutron     self._as_root([], 'link', tuple(args))
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 94, in _as_root
2018-02-09 08:34:41.059 5 ERROR neutron     namespace=namespace)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 102, in _execute
2018-02-09 08:34:41.059 5 ERROR neutron     log_fail_as_error=self.log_fail_as_error)
2018-02-09 08:34:41.059 5 ERROR neutron   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in execute
2018-02-09 08:34:41.059 5 ERROR neutron     raise ProcessExecutionError(msg, returncode=returncode)
2018-02-09 08:34:41.059 5 ERROR neutron ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Invalid argument
2018-02-09 08:34:41.059 5 ERROR neutron
2018-02-09 08:34:41.059 5 ERROR neutron
2018-02-09 08:34:41.177 21 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
2018-02-09 08:34:41.178 21 INFO eventlet.wsgi.server [-] (21) wsgi exited, is_accepting=True

An easy way to reproduce the bug:

[heat-admin@overcloud-novacompute-0 ~]$ sudo docker exec -u root -it 5c5f254a9321bd74b5911f46acb9513574c2cd9a3c59805a85cffd960bcc864d /bin/bash
[root@overcloud-novacompute-0 /]# ip netns a my_netns
[root@overcloud-novacompute-0 /]# exit
[heat-admin@overcloud-novacompute-0 ~]$ sudo ip netns
[heat-admin@overcloud-novacompute-0 ~]$ sudo docker restart 5c5f254a9321bd74b5911f46acb9513574c2cd9a3c59805a85cffd960bcc864d
5c5f254a9321bd74b5911f46acb9513574c2cd9a3c59805a85cffd960bcc864d
[heat-admin@overcloud-novacompute-0 ~]$ sudo docker exec -u root -it 5c5f254a9321bd74b5911f46acb9513574c2cd9a3c59805a85cffd960bcc864d /bin/bash
[root@overcloud-novacompute-0 /]# ip netns
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
my_netns
[root@overcloud-novacompute-0 /]# ip netns e my_netns ip a
RTNETLINK answers: Invalid argument
setting the network namespace "my_netns" failed: Invalid argument

A possible workaround is deleting everything under /run/netns/* from kolla_start, but this would involve a full sync of the agents, which is not desirable:

[root@overcloud-novacompute-0 /]# rm /run/netns/my_netns
rm: remove regular empty file '/run/netns/my_netns'? y
[root@overcloud-novacompute-0 /]# ip netns
[root@overcloud-novacompute-0 /]# ip netns a my_netns
[root@overcloud-novacompute-0 /]#

** Affects: neutron
   Importance:
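The stale entries under /run/netns are plain files whose backing bind mounts died with the old container. The cleanup workaround above can be sketched as a small helper (hypothetical, not part of Neutron; as noted, running it at container start forces a full agent resync):

```python
import os


def clean_stale_netns(netns_dir="/run/netns"):
    """Remove leftover namespace files so 'ip netns add' works again.

    Live entries under /run/netns are bind mounts of /proc/<pid>/ns/net;
    stale ones left behind by a container restart are plain files, which
    make RTNETLINK calls fail with 'Invalid argument'.
    """
    removed = []
    for name in sorted(os.listdir(netns_dir)):
        path = os.path.join(netns_dir, name)
        if not os.path.ismount(path):  # stale: the backing mount is gone
            os.unlink(path)
            removed.append(name)
    return removed
```

Calling this from kolla_start before the agent boots would restore "ip netns" behaviour at the cost of the full sync mentioned above.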
[Yahoo-eng-team] [Bug 1744359] [NEW] Neutron haproxy logs are not being collected
Public bug reported:

In Neutron, we use haproxy to proxy metadata requests from instances to the Nova Metadata service. By default, haproxy logs to /dev/log, but in Ubuntu those messages get redirected by rsyslog to /var/log/haproxy.log, which is not being collected.

ubuntu@devstack:~$ cat /etc/rsyslog.d/49-haproxy.conf
# Create an additional socket in haproxy's chroot in order to allow logging via
# /dev/log to chroot'ed HAProxy processes
$AddUnixListenSocket /var/lib/haproxy/dev/log

# Send HAProxy messages to a dedicated logfile
if $programname startswith 'haproxy' then /var/log/haproxy.log
&~

Another possibility would be to change the haproxy.cfg file to include the log-tag option so that haproxy uses a different tag [0]; its messages would then be dumped into syslog instead, but this would break backwards compatibility.

[0] https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#3.1-log-tag

** Affects: devstack
   Importance: Undecided
   Status: New

** Affects: neutron
   Importance: Undecided
   Status: New

** Tags: l3-ipam-dhcp

** Also affects: neutron
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1744359

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack/+bug/1744359/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
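For reference, the log-tag alternative mentioned above would look roughly like this in haproxy.cfg (a sketch; the tag name is a hypothetical example, and as the report notes, changing the tag breaks backwards compatibility):

```
global
    # Log through the chroot'ed socket as before...
    log /dev/log local0
    # ...but tag messages so rsyslog's "startswith 'haproxy'" match no
    # longer diverts them to /var/log/haproxy.log (hypothetical tag name)
    log-tag neutron-metadata-proxy
```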
[Yahoo-eng-team] [Bug 1739798] [NEW] update_network_postcommit is being called from delete_network_precommit with an open session
Public bug reported:

When a network is deleted, its segments are also deleted [0]. For each segment, a notification is sent for resources.SEGMENT and events.AFTER_DELETE [1], which ends up calling update_network_postcommit [2]. This should be avoided, since drivers expect their postcommit methods to be called with no open sessions to the database. There should be separate callbacks for segments so that no database transactions are open during any of the postcommit calls.

We detected this in the networking-ovn driver because we're attempting to bump revision numbers in a separate table in the Neutron database when a network is updated, but we can't commit that change to the database because there's already an open session on the network delete operation. This may be affecting other drivers as well.

[0] https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/services/segments/db.py#L315
[1] https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/services/segments/db.py#L178
[2] https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/plugins/ml2/plugin.py#L1917

** Affects: neutron
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1739798
[Yahoo-eng-team] [Bug 1738768] [NEW] Dataplane downtime when containers are stopped/restarted
Public bug reported:

I have deployed a 3 controllers / 3 computes HA environment with ML2/OVS and observed dataplane downtime when restarting/stopping the neutron-l3 container on the controllers. This is what I did:

1. Created a network, subnet, router and a VM, and attached a FIP to the VM
2. Left a ping running on the undercloud to the FIP
3. Stopped the l3 container on controller-0.
   Result: Observed some packet loss while the router was failed over to controller-1
4. Stopped the l3 container on controller-1.
   Result: Observed some packet loss while the router was failed over to controller-2
5. Stopped the l3 container on controller-2.
   Result: No traffic to/from the FIP at all.

(overcloud) [stack@undercloud ~]$ ping 10.0.0.131
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=1.83 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=1.56 ms
< Last l3 container was stopped here (step 5 above) >
From 10.0.0.1 icmp_seq=10 Destination Host Unreachable
From 10.0.0.1 icmp_seq=11 Destination Host Unreachable

When containers are stopped, I guess that the qrouter namespace is not accessible by the kernel:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns e qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" failed: Invalid argument

This means that we're getting not only controlplane downtime but also dataplane downtime, which could be seen as a regression compared to non-containerized environments. The same would happen with DHCP, and I expect instances not to be able to fetch IP addresses from dnsmasq when dhcp containers are stopped.

** Affects: neutron
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1738768
[Yahoo-eng-team] [Bug 1735724] [NEW] Metadata iptables rules never inserted upon exception on router creation
Public bug reported:

We've been debugging some issues seen lately [0] and found out that there's a bug in the l3 agent when creating routers (or during the initial sync). Jakub Libosvar and I spent some time recreating the issue and this is what we got:

Especially since we bumped to ovsdbapp 0.8.0, we've seen some jobs failing due to errors when authenticating to a VM using a public key. The TCP connection to the SSH port was successfully established but the authentication failed. After debugging further, we found out that the metadata rules in the qrouter namespace which redirect traffic to haproxy (which replaced the old neutron-ns-metadata-proxy) were missing, so VMs weren't fetching metadata (hence, the public key).

These rules are installed by the metadata driver after a router is created [1], on the AFTER_CREATE notification. They will also get created during the initial sync of the l3 agent (since the router is still unknown to the agent) [2]. Here, if we don't know the router yet, we'll call _process_added_router(), and if it's a known router we'll call _process_updated_router().

After our tests, we've seen that the iptables rules are never restored if we simulate an exception inside ri.process() at [3], even though the router is scheduled for resync [4]. The reason is that we've already added the router to our router info [5], so even though ri.process() fails at L481 and the router is scheduled for resync, next time _process_updated_router() will get called instead of _process_added_router(), thus not pushing the notification into the metadata driver to install the iptables rules, and they never get installed.

In conclusion, if an error occurs during _process_added_router(), we might end up losing metadata forever until we restart the agent and this call succeeds. Worse, we will be forwarding metadata requests via br-ex, which could lead to security issues (i.e. wrong metadata could be injected from the outside, or the metadata server running in the underlying cloud may respond).

With ovsdbapp 0.9.0 we're minimizing this, because if a port fails to be added to br-int, ovsdbapp will enqueue the transaction instead of throwing an exception; but I guess there could still be other exceptions that reproduce this scenario outside of ovsdbapp, so we need to fix it in Neutron.

Thanks,
Daniel Alvarez

---
[0] https://bugs.launchpad.net/tripleo/+bug/1731063
[1] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/metadata/driver.py#L288
[2] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L472
[3] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L481
[4] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L565
[5] https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L478

** Affects: neutron
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1735724
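The faulty sequence can be condensed into a toy model (hypothetical and heavily simplified from the l3-agent flow referenced above): the router lands in router_info *before* process() succeeds, so the resync retry takes the "updated" path and the AFTER_CREATE notification that installs the metadata redirect rules is never re-sent:

```python
class ToyL3Agent:
    def __init__(self):
        self.router_info = {}        # routers the agent already "knows"
        self.metadata_rules = set()  # routers with the iptables redirect
        self.fail_once = False       # simulate a transient ri.process() error

    def _process(self, rid):
        if self.fail_once:
            self.fail_once = False
            raise RuntimeError("simulated failure inside ri.process()")

    def _process_added_router(self, rid):
        self.router_info[rid] = object()  # registered *before* processing
        self._process(rid)                # may raise -> router queued for resync
        self.metadata_rules.add(rid)      # AFTER_CREATE -> metadata driver

    def _process_updated_router(self, rid):
        self._process(rid)                # no AFTER_CREATE notification here

    def sync(self, rid):
        if rid not in self.router_info:
            self._process_added_router(rid)
        else:
            self._process_updated_router(rid)


agent = ToyL3Agent()
agent.fail_once = True
try:
    agent.sync("r1")   # first attempt fails mid-creation
except RuntimeError:
    pass               # router is scheduled for resync
agent.sync("r1")       # resync succeeds, but via the *updated* path
print("r1" in agent.metadata_rules)  # False: the redirect rules are lost
```

One possible fix (an assumption, not necessarily the merged patch): pop the router from router_info when _process_added_router() fails, so the resync retries the full creation path and re-fires AFTER_CREATE.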
[Yahoo-eng-team] [Bug 1731494] [NEW] neutron-openvswitch-agent crashes due to TypeError exception in ovs_ryuapp
Public bug reported:

At some point during some rally test, we saw this exception in the ovs agent logs:

2017-11-07 13:35:51.428 597682 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-62f85bb3-db4c-4485-b35c-b7c1cafb3970 3d527bdd3ede4c6a97f91b701393b8e3 5f753e92a5d740fc97252bd39f868561 - - -] port_delete message processed for port 3e8348d0-40e1-4146-b803-1e6c6eddba53 port_delete /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:430
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp [req-141ecd16-22d7-4b1c-aa91-25d5077414f5 - - - - -] Agent main thread died of an exception: TypeError: int() can't convert non-string with explicit base
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp Traceback (most recent call last):
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_ryuapp.py", line 40, in agent_main_wrapper
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     ovs_agent.main(bridge_classes)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2205, in main
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     agent.daemon_loop()
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2120, in daemon_loop
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     self.rpc_loop(polling_manager=pm)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1985, in rpc_loop
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     ovs_status = self.check_ovs_status()
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1787, in check_ovs_status
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     status = self.int_br.check_canary_table()
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     flows = self.dump_flows(constants.CANARY_TABLE)
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 141, in dump_flows
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp     (dp, ofp, ofpp) = self._get_dp()
2017-11-07 13:35:51.439 597682 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_brid
[Yahoo-eng-team] [Bug 1723472] Re: [DVR] Lowering the MTU breaks FIP traffic
We have seen that the MAC address of the FIP changes to the qf interface of a different controller. However, the environment was running openstack-neutron-11.0.0-1.el7.noarch. After upgrading to openstack-neutron-11.0.1-1.el7.noarch, this bug no longer occurs. Marking it as invalid.

** Changed in: neutron
   Status: Confirmed => Invalid

https://bugs.launchpad.net/bugs/1723472

Status in neutron: Invalid

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1723472/+subscriptions
[Yahoo-eng-team] [Bug 1723472] [NEW] [DVR] Lowering the MTU breaks FIP traffic
Public bug reported: In a DVR environment, when lowering the MTU of a network, traffic going to an instance through a floating IP is broken. Description: * Ping traffic to a VM through its FIP works. * Change the MTU of its network through "neutron net-update --mtu 1440". * Ping to the same FIP doesn't work. After a long debugging session with Anil Venkata, we've found that packets reach br-ex and then they hit this OF rule with normal action: cookie=0x1f847e4bf0de0aea, duration=70306.532s, table=3, n_packets=1579251, n_bytes=796614220, idle_age=0, hard_age=65534, priority=1 actions=NORMAL We would expect this rule to switch the packet to br-int so that it can be forwarded to the fip namespace (ie. with dst MAC address set to the floating ip gw (owner=network:floatingip_agent_gateway): $ sudo ovs-vsctl list interface _uuid : 1f2b6e86-d303-42f4-9467-5dab78fc7199 admin_state : down bfd : {} bfd_status : {} cfm_fault : [] cfm_fault_status: [] cfm_flap_count : [] cfm_health : [] cfm_mpid: [] cfm_remote_mpids: [] cfm_remote_opstate : [] duplex : [] error : [] external_ids: {attached-mac="fa:16:3e:9d:0c:4f", iface-id="8ec34826-b1a6-48ce-9c39-2fd3e8167eb4", iface-status=active} name: "fg-8ec34826-b1" [heat-admin@overcloud-novacompute-0 ~]$ sudo ovs-appctl fdb/show br-ex port VLAN MACAge [...] 710 fa:16:3e:9d:0c:4f0 $ sudo ovs-ofctl show br-ex | grep "7(" 7(phy-br-ex): addr:36:63:93:fc:af:e2 And from there, to the fip namespace which would route the packet to the qrouter namespace, etc. However, when we change the MTU through the following command: "neutron net-update --mtu 1440" We see that, after a few seconds, the MAC address of the FIP changes so when traffic arrives br-ex and NORMAL action is performed, it will not be output to br-int through the patch-port but instead, through eth1 and traffic won't work anymore. 
[heat-admin@overcloud-novacompute-0 ~]$ arp -n | grep ".113" 10.0.0.113 ether fa:16:3e:9d:0c:4f C vlan10 neutron port-set x --mtu 1440 $ arp -n | grep ".113" 10.0.0.113 ether fa:16:3e:20:f9:85 C vlan10 When setting the MAC address manually, ping starts working again: $ arp -s 10.0.0.113 fa:16:3e:9d:0c:4f $ ping 10.0.0.113 PING 10.0.0.113 (10.0.0.113) 56(84) bytes of data. 64 bytes from 10.0.0.113: icmp_seq=1 ttl=62 time=1.17 ms 64 bytes from 10.0.0.113: icmp_seq=2 ttl=62 time=0.561 ms Additional notes: When we set the MAC address manually and traffic gets working back again, lowering the MTU doesn't change the MAC address (we can't see any gARP's coming through). When we delete the ARP entry for the FIP and try to ping the FIP, the wrong MAC address is set. [heat-admin@overcloud-novacompute-0 ~]$ sudo arp -d 10.0.0.113 [heat-admin@overcloud-novacompute-0 ~]$ ping 10.0.0.113 -c 2 PING 10.0.0.113 (10.0.0.113) 56(84) bytes of data. --- 10.0.0.113 ping statistics --- 2 packets transmitted, 0 received, 100% packet loss, time 999ms [heat-admin@overcloud-novacompute-0 ~]$ arp -n | grep ".113" 10.0.0.113 ether fa:16:3e:20:f9:85 C vlan10 ** Affects: neutron Importance: Undecided Assignee: Daniel Alvarez (dalvarezs) Status: New ** Tags: l3-dvr-backlog ** Changed in: neutron Assignee: (unassigned) => Daniel Alvarez (dalvarezs) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1723472 Title: [DVR] Lowering the MTU breaks FIP traffic Status in neutron: New Bug description: In a DVR environment, when lowering the MTU of a network, traffic going to an instance through a floating IP is broken. Description: * Ping traffic to a VM through its FIP works. * Change the MTU of its network through "neutron net-update --mtu 1440". * Ping to the same FIP doesn't work. 
[Yahoo-eng-team] [Bug 1695191] [NEW] pyroute2 wrong version constraint
Public bug reported:

With this recent change [0] we're now importing the asyncio module from
pyroute2, and neutron-server fails to start when the installed pyroute2
doesn't ship that module:

  File "/opt/stack/neutron/neutron/common/eventlet_utils.py", line 25, in monkey_patch
    p_c_e = importutils.import_module('pyroute2.config.asyncio')
  ImportError: No module named asyncio

I'm using pyroute2==0.4.13, which is acceptable according to
global-requirements.txt, but that version doesn't include the asyncio
module. Version 0.4.14 does include it, but since that version is
currently forbidden in our requirements, we should bump the minimum
version to 0.4.15.

[0] https://review.openstack.org/#/c/469650/

** Affects: neutron
   Importance: Undecided
   Status: New

--
https://bugs.launchpad.net/bugs/1695191

Title: pyroute2 wrong version constraint
Status in neutron: New
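The failure mode can be reproduced with a plain importlib call; a
sketch of a guarded import that tolerates older releases (the helper is
hypothetical and for illustration only, not what the fix does — the fix
is the version bump):

```python
import importlib

def import_or_none(dotted_path):
    # Older pyroute2 releases simply don't ship pyroute2.config.asyncio,
    # so importing it raises ImportError at import time rather than
    # failing later at call time.
    try:
        return importlib.import_module(dotted_path)
    except ImportError:
        return None
```

With this, callers can detect the missing module and fail with a clear
message about the required minimum version instead of a bare traceback.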
[Yahoo-eng-team] [Bug 1691969] [NEW] Functional tests failing due to uid 65534 not present
Public bug reported:

We're relying on uid 65534 existing on the system to run functional
tests [0]; if it doesn't exist, the metadata proxy will fail to spawn
[1] and so will the tests.

From what I've seen on CentOS 7, a user with uid 65534 exists when
deploying devstack because the libvirt package is installed and
nfs-utils is a dependency. nfs-utils creates the nfsnobody user under
this uid [2], and the functional tests pass.

We shouldn't rely on this uid being present on the system. I'll try to
come up with something to fix the tests, but feedback is very welcome :)

Daniel

[0] https://github.com/openstack/neutron/blob/master/neutron/tests/functional/agent/l3/test_metadata_proxy.py#L188
[1] https://github.com/openstack/neutron/blob/03c5283c69f1f5cba8a9f29e7bd7fd306ee0c123/neutron/agent/metadata/driver.py#L100
[2] http://paste.openstack.org/show/609989/

** Affects: neutron
   Importance: Undecided
   Assignee: Daniel Alvarez (dalvarezs)
   Status: New

** Tags: functional-tests

** Changed in: neutron
   Assignee: (unassigned) => Daniel Alvarez (dalvarezs)

** Tags added: functional-tests

--
https://bugs.launchpad.net/bugs/1691969

Title: Functional tests failing due to uid 65534 not present
Status in neutron: New
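The dependency is easy to check at test setup time; a sketch (assuming
a POSIX system; the function name and fallback behaviour are mine, not
the eventual fix) of resolving the uid defensively instead of assuming
an account for it exists:

```python
import pwd

def username_for_uid(uid, fallback=None):
    # pwd.getpwuid raises KeyError when no account has this uid, which
    # is exactly the situation the functional tests currently assume
    # can't happen (uid 65534 / nfsnobody).
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return fallback
```

A test fixture could use this to skip, or to pick a uid that actually
exists, rather than letting the metadata proxy fail to spawn.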
[Yahoo-eng-team] [Bug 1677279] [NEW] Don't depend on l3-agent running for IPv6 failover
Public bug reported:

Right now, we're enabling IPv6 RA on the gateway interface for master
instances [0]. This happens only when the l3-agent is running, so we
depend on it for the correct configuration of HA routers.

If the l3-agent is shut down for maintenance, RA won't be enabled on
the master instance even though keepalived and keepalived-state-change
are both running.

We should get rid of this dependency by moving this code into
keepalived-state-change, which we can assume will always be running
along with keepalived.

[0] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha.py#L124

** Affects: neutron
   Importance: Undecided
   Assignee: Daniel Alvarez (dalvarezs)
   Status: New

** Tags: l3-ha

** Tags added: l3-ha

** Changed in: neutron
   Assignee: (unassigned) => Daniel Alvarez (dalvarezs)

--
https://bugs.launchpad.net/bugs/1677279

Title: Don't depend on l3-agent running for IPv6 failover
Status in neutron: New
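"Enabling RA on the gateway interface" boils down to writing a
per-interface kernel sysctl. A sketch of the knob involved (the path
and values are the kernel's; the helper and its injectable writer are
hypothetical, so the sketch can be exercised without root):

```python
ACCEPT_RA_DISABLED = 0
ACCEPT_RA_RTR_ADVERT = 2  # accept RAs even if forwarding is enabled

def set_accept_ra(ifname, value, write=None):
    # Write the per-interface accept_ra sysctl; 'write' defaults to a
    # real file write but can be replaced by a recording stub in tests.
    path = '/proc/sys/net/ipv6/conf/%s/accept_ra' % ifname
    if write is None:
        def write(p, v):
            with open(p, 'w') as f:
                f.write('%d\n' % v)
    write(path, value)
    return path
```

Whichever process owns this write (l3-agent today, possibly
keepalived-state-change after the proposed move) just needs to call it
on the master transition.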
[Yahoo-eng-team] [Bug 1669805] [NEW] rally job failing in gate due to "Quota for tenant X could not be found" Error
Public bug reported:

The Rally job is failing in the gate due to the following error during
cleanup [0]:

  ERROR rally.plugins.openstack.cleanup.manager
      rutils.retry(resource._max_attempts, resource.delete)
    File "/opt/stack/new/rally/rally/common/utils.py", line 223, in retry
      return func(*args, **kwargs)
    File "/opt/stack/new/rally/rally/plugins/openstack/cleanup/resources.py", line 472, in delete
      self._manager().delete_quota(self.tenant_uuid)
    File "/usr/local/lib/python2.7/dist-packages/debtcollector/renames.py", line 43, in decorator
      return wrapped(*args, **kwargs)
    File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 742, in delete_quota
      return self.delete(self.quota_path % (project_id))
    File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 357, in delete
      headers=headers, params=params)
    File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 338, in retry_request
      headers=headers, params=params)
    File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 301, in do_request
      self._handle_fault_response(status_code, replybody, resp)
    File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 276, in _handle_fault_response
      exception_handler_v20(status_code, error_body)
    File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 92, in exception_handler_v20
      request_ids=request_ids)
  NotFound: Quota for tenant 82ee0ba1b6534f958d1acd2f717b5c3d could not be found.

It seems we've been hitting these errors since 1AM today (03/03/2017)
[1].

[0] http://logs.openstack.org/91/431691/30/check/gate-rally-dsvm-neutron-neutron-ubuntu-xenial/ab3471c/console.html#_2017-03-03_13_14_56_925720
[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%20%5C%22gate-rally-dsvm-neutron-neutron-ubuntu-xenial%5C%22%20AND%20message%3A%20%5C%22Quota%20for%20tenant%5C%22

** Affects: neutron
   Importance: Undecided
   Status: New

** Tags: gate-failure

--
https://bugs.launchpad.net/bugs/1669805

Title: rally job failing in gate due to "Quota for tenant X could not be found" Error
Status in neutron: New
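Whatever the root cause, the cleanup-side deletion could tolerate the
quota already being gone; a hedged sketch (the stub client and
exception class are mine for illustration, not Rally's or
neutronclient's actual code):

```python
class NotFound(Exception):
    """Stand-in for neutronclient's 404 exception."""

def delete_quota_idempotent(client, tenant_id):
    # For cleanup purposes, a quota that is already back at its defaults
    # is a success, so swallow NotFound instead of failing the whole run.
    try:
        client.delete_quota(tenant_id)
        return True
    except NotFound:
        return False

class FakeClient:
    # Minimal stand-in that raises NotFound on a second delete, which is
    # the behaviour seen in the gate logs above.
    def __init__(self, existing):
        self.existing = set(existing)

    def delete_quota(self, tenant_id):
        if tenant_id not in self.existing:
            raise NotFound(tenant_id)
        self.existing.remove(tenant_id)
```

The return value distinguishes "deleted now" from "already gone"
without treating the latter as a cleanup failure.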
[Yahoo-eng-team] [Bug 1669765] [NEW] RA is not disabled on backup HA routers
Public bug reported:

When an HA router is created, RA is enabled on the gateway interface
for the 'master' router [0]. However, it is not disabled in the 'else'
clause, and therefore:

1. If the router was set to 'master' before, it will still have RA
   enabled on its gateway interface.
2. If the default value in '/proc/sys/net/ipv6/conf/default/accept_ra'
   is > 0, it will also have RA enabled on its gateway interface.

Having RA enabled on a backup router leads to the following unwanted
situation: it may respond to RA packets coming from an external switch
and, because it has the same MAC address as the master instance, the
switch will learn its MAC address and may send traffic to it until the
master sends some packets. As a result, any existing connections will
be interrupted.

The fix would consist in disabling RA on the gateway interface whenever
the conditions to enable it are not met.

[0] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha.py#L136

** Affects: neutron
   Importance: Undecided
   Status: New

--
https://bugs.launchpad.net/bugs/1669765

Title: RA is not disabled on backup HA routers
Status in neutron: New
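The proposed fix makes the decision symmetric: always write an explicit
accept_ra value instead of leaving the backup path to inherit whatever
conf/default/accept_ra happens to be. A sketch (the function is
illustrative, not the actual patch; 2 mirrors the kernel value that
accepts RAs even with forwarding enabled):

```python
def accept_ra_value(is_master):
    # Master: accept Router Advertisements on the gateway interface.
    # Backup: explicitly disable them (the missing 'else' branch),
    # so a stale master state or a >0 system default can't leave RA on.
    return 2 if is_master else 0
```

Writing the result on every state transition, rather than only on the
master path, closes both failure cases listed in the report.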
[Yahoo-eng-team] [Bug 1654287] Re: functional test netns_cleanup failing in gate
** Also affects: oslo.rootwrap
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1654287

Title: functional test netns_cleanup failing in gate
Status in neutron: In Progress
Status in oslo.rootwrap: New

Bug description:

The functional test for netns_cleanup has failed in the gate today [0].
Apparently, when trying to get the list of devices
(ip_lib.get_devices(), 'find /sys/class/net -maxdepth 1 -type l
-printf %f') through the rootwrap daemon, it gets the output of the
previous command instead ('netstat -nlp'). This causes the
netns_cleanup module to try to unplug random "devices" which actually
correspond to the output of the 'netstat' command.

This bug doesn't look related to the test itself but to the rootwrap
daemon — maybe due to the long output of the netstat command?

Relevant part of the log:

  2017-01-05 12:17:04.609 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'netstat', '-nlp'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.613 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.614 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'find', '/sys/class/net', '-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.645 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.686 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.688 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'Active'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.746 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.747 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'Internet'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.758 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.815 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 7 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): InterfaceToBridgeCommand(name=Internet) do_commit neutron/agent/ovsdb/impl_idl.py:100
  2017-01-05 12:17:04.823 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction aborted do_commit neutron/agent/ovsdb/impl_idl.py:124
  2017-01-05 12:17:04.824 27615 DEBUG neutron.cmd.netns_cleanup [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for device: Internet unplug_device neutron/cmd/netns_cleanup.py:138
  2017-01-05 12:17:04.824 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'connections'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:06.388 27615 DEBUG neutron.cmd.netns_cleanup [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for device: Path unplug_device neutron/cmd/netns_cleanup.py:138
  2017-01-05 12:17:06.389 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', '-o', 'netns', 'list'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:06.454 27615 ERROR neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot find device "Path"
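The seemingly random device names in the log ('Active', 'Internet',
'connections', 'Path') support the mixed-up-output theory: they look
like words from netstat's own header lines fed into the device-list
parser. A sketch (the header text is a typical 'netstat -nlp' banner
reproduced for illustration; the parser mirrors what splitting find's
'%f '-separated output amounts to):

```python
# Typical header lines netstat -nlp prints before its tables.
NETSTAT_HEADER = (
    "Active Internet connections (only servers)\n"
    "Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name\n"
    "Active UNIX domain sockets (only servers)\n"
    "Proto RefCnt Flags Type State I-Node PID/Program name Path\n"
)

def parse_device_list(output):
    # ip_lib.get_devices() expects the space-separated output of
    # "find /sys/class/net -maxdepth 1 -type l -printf '%f '";
    # splitting arbitrary text the same way yields bogus "devices".
    return output.split()
```

Feeding netstat's banner through this parser produces exactly the
names netns_cleanup tried to delete, which is consistent with the
rootwrap daemon returning the previous command's output.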
[Yahoo-eng-team] [Bug 1654287] [NEW] functional test netns_cleanup failing in gate
Public bug reported:

The functional test for netns_cleanup has failed in the gate today [0].
Apparently, when trying to get the list of devices
(ip_lib.get_devices(), 'find /sys/class/net -maxdepth 1 -type l
-printf %f') through the rootwrap daemon, it gets the output of the
previous command instead ('netstat -nlp'). This causes the
netns_cleanup module to try to unplug random "devices" which actually
correspond to the output of the 'netstat' command.

This bug doesn't look related to the test itself but to the rootwrap
daemon — maybe due to the long output of the netstat command?

Relevant part of the log:

  2017-01-05 12:17:04.609 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'netstat', '-nlp'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.613 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.614 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'find', '/sys/class/net', '-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.645 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.686 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.688 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'Active'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.746 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.747 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'Internet'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.758 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.815 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 14 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] [POLLIN] on fd 7 __log_wakeup /opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running txn command(idx=0): InterfaceToBridgeCommand(name=Internet) do_commit neutron/agent/ovsdb/impl_idl.py:100
  2017-01-05 12:17:04.823 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] Transaction aborted do_commit neutron/agent/ovsdb/impl_idl.py:124
  2017-01-05 12:17:04.824 27615 DEBUG neutron.cmd.netns_cleanup [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for device: Internet unplug_device neutron/cmd/netns_cleanup.py:138
  2017-01-05 12:17:04.824 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 'connections'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:06.388 27615 DEBUG neutron.cmd.netns_cleanup [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for device: Path unplug_device neutron/cmd/netns_cleanup.py:138
  2017-01-05 12:17:06.389 27615 DEBUG neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap daemon): ['ip', '-o', 'netns', 'list'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:06.454 27615 ERROR neutron.agent.linux.utils [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 1; Stdin: ; Stdout: ; Stderr: Cannot find device "Path"
  2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup [req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Error unable to destroy namespace: qrouter-cf2030c6-c924-45bb-b13b-6774d275b394
  2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup Traceback (most recent call last
[Yahoo-eng-team] [Bug 1652124] [NEW] netns-cleanup functional test fails on some conditions
Public bug reported:

We've seen this functional test failing in the gate [0], and it's due
to a bug in the helper module that was written for the functional test
[1].

The problem shows up when process_spawn is not able to find a port to
listen on and the process stays running anyway. That means that
netns-cleanup won't clean it up, and this condition [2] doesn't hold
(1 != 0).

As per the logs in the gate, I can tell that it's only a bug in the
functional test and not in the module itself. I'll submit a patch for
it right now.

[0] http://logs.openstack.org/45/358845/24/check/gate-neutron-dsvm-functional-ubuntu-xenial/b018ed7/testr_results.html.gz
[1] https://github.com/openstack/neutron/blob/master/neutron/tests/functional/cmd/process_spawn.py#L107
[2] https://github.com/openstack/neutron/blob/master/neutron/tests/functional/cmd/test_netns_cleanup.py#L84

** Affects: neutron
   Importance: Undecided
   Status: New

** Tags: functional-tests gate-failure

--
https://bugs.launchpad.net/bugs/1652124

Title: netns-cleanup functional test fails on some conditions
Status in neutron: New
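The failure mode — no port available, yet the process lingers — can be
avoided by exiting when binding fails; a sketch (names and structure
are mine, not the actual process_spawn fix):

```python
import socket
import sys

def listen_on_first_free(ports, host='127.0.0.1'):
    # Return a listening socket bound to the first available candidate
    # port. If none can be bound, exit instead of staying alive with no
    # socket, which is what made the cleanup assertion fail (1 != 0).
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(1)
            return sock
        except OSError:
            sock.close()
    sys.exit(1)
```

Exiting on failure keeps the process count consistent with what the
test observes after netns-cleanup runs.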
[Yahoo-eng-team] [Bug 1650611] [NEW] dhcp agent reporting state as down during the initial sync
Public bug reported:

When the dhcp agent is started, 'neutron agent-list' reports its state
as dead until the initial sync is complete. This can lead to unwanted
alarms in monitoring systems, especially in large environments where
the initial sync may take hours. During this time, systemctl shows
that the agent is actually alive while 'neutron agent-list' reports it
as down.

Technical details:

If I'm right, this line [0] is the exact point where the initial sync
takes place, right after the first state report (with start_flag=True)
is sent to the server. As it's being done in the same thread, a second
state report won't be sent until the sync is done. Doing it in a
separate thread would let the heartbeat task continue sending state
reports to the server, but I don't know whether this has any unwanted
side effects.

[0] https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L751

** Affects: neutron
   Importance: Undecided
   Status: New

** Tags: l3-bgp

--
https://bugs.launchpad.net/bugs/1650611

Title: dhcp agent reporting state as down during the initial sync
Status in neutron: New
[Yahoo-eng-team] [Bug 1647431] [NEW] grenade job times out on Xenial
Public bug reported:

The gate-grenade-dsvm-neutron-multinode-ubuntu-xenial job is failing on
the neutron gate. I have checked some other patches and the job doesn't
seem to fail on them, so apparently it's not deterministic.

From the logs (exit code 124 is what 'timeout' returns when the command
it wraps doesn't finish in time, i.e. cinder_server1 was still present
after 30 seconds):

[1]
  2016-12-05 09:07:46.832799 | ERROR: the main setup script run by this job failed - exit code: 124

[2]
  2016-12-05 09:07:10.778 | + /opt/stack/new/grenade/projects/70_cinder/resources.sh:destroy:207 : timeout 30 sh -c 'while openstack server show cinder_server1 >/dev/null; do sleep 1; done'
  2016-12-05 09:07:40.781 | + /opt/stack/new/grenade/projects/70_cinder/resources.sh:destroy:1 : exit_trap
  2016-12-05 09:07:40.782 | + /opt/stack/new/grenade/functions:exit_trap:103 : local r=124

[1] http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/console.html
[2] http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/logs/grenade.sh.txt.gz

** Affects: neutron
   Importance: Critical
   Status: Confirmed

** Tags: gate-failure

--
https://bugs.launchpad.net/bugs/1647431

Title: grenade job times out on Xenial
Status in neutron: Confirmed
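The failing grenade step is a poll-with-deadline: loop while the server
still shows up, give up after 30 seconds. The same pattern in Python
(function name mine), returning False on timeout instead of letting an
external 'timeout' kill the script with exit code 124:

```python
import time

def wait_until_gone(still_exists, timeout=30.0, interval=1.0):
    # Mirror of the shell loop
    #   timeout 30 sh -c 'while openstack server show X; do sleep 1; done'
    # but with an explicit boolean result the caller can act on.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not still_exists():
            return True
        time.sleep(interval)
    return False
```

In the gate failure above, the equivalent call would have returned
False: the cinder server was still present when the 30-second deadline
expired.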