[Yahoo-eng-team] [Bug 1946318] [NEW] [ovn] Memory consumption grows over time due to MAC_Binding entries in SB database

2021-10-07 Thread Daniel Alvarez
Public bug reported:

MAC_Binding entries are used in OVN as a mechanism to learn MAC
addresses on logical ports and avoid sending ARP requests to the
network.

There is no aging mechanism for these entries [0] and the table can grow
indefinitely. In environments with large external networks (e.g. a /16),
OVN may learn a considerable number of addresses, growing the size of the
database significantly.

Today, Neutron monitors this table to work around the lack of an aging
mechanism and removes the MAC_Binding entries associated with Floating IPs.
Each neutron-server worker keeps an in-memory copy of the table, increasing
its memory footprint to several gigabytes and eventually leading to OOM
kills.
 

[0] https://mail.openvswitch.org/pipermail/ovs-
discuss/2019-June/048936.html
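
For illustration, here is a minimal standalone sketch (not the actual
Neutron code, which goes through ovsdbapp and the OVN SB IDL) of the kind
of cleanup the workaround performs: given the MAC_Binding rows and the set
of floating IP addresses being released, pick the rows to delete. Keeping
the whole table in memory per worker is exactly what drives the footprint
up.

def mac_bindings_to_delete(mac_binding_rows, released_fip_addresses):
    """Return the MAC_Binding rows whose 'ip' matches a released FIP."""
    released = set(released_fip_addresses)
    return [row for row in mac_binding_rows if row['ip'] in released]

if __name__ == '__main__':
    # Rows are modelled as plain dicts here instead of IDL Row objects.
    rows = [
        {'_uuid': 'aaa', 'ip': '172.24.4.10', 'mac': 'fa:16:3e:01:02:03'},
        {'_uuid': 'bbb', 'ip': '172.24.4.11', 'mac': 'fa:16:3e:04:05:06'},
    ]
    print(mac_bindings_to_delete(rows, {'172.24.4.10'}))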

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1946318

Title:
  [ovn] Memory consumption grows over time due to MAC_Binding entries in
  SB database

Status in neutron:
  New

Bug description:
  MAC_Binding entries are used in OVN as a mechanism to learn MAC
  addresses on logical ports and avoid sending ARP requests to the
  network.

  There is no aging mechanism for these entries [0] and the table can
  grow indefinitely. In environments with large external networks (e.g. a
  /16), OVN may learn a considerable number of addresses, growing the size
  of the database significantly.

  Today, Neutron monitors this table to work around the lack of an aging
  mechanism and removes the MAC_Binding entries associated with Floating
  IPs. Each neutron-server worker keeps an in-memory copy of the table,
  increasing its memory footprint to several gigabytes and eventually
  leading to OOM kills.
   

  [0] https://mail.openvswitch.org/pipermail/ovs-
  discuss/2019-June/048936.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1946318/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1945651] [NEW] Updating binding profile through CLI doesn't work

2021-09-30 Thread Daniel Alvarez
Public bug reported:

Updating the binding profile of a port fails with an "invalid type" error.
This suggests a bug in the Neutron code that validates the binding:profile
parameters.


$ neutron --debug port-update subportD --binding:profile type=dict 
parent_name=4af7ef43-597b-4747-b3ac-2b045db17374,tag=999

...

DEBUG: keystoneauth.session RESP: [400] Content-Length: 152 Content-Type: 
application/json Date: Thu, 30 Sep 2021 13:14:32 GMT X-Openstack-Request-Id: 
req-5ca76951-518b-4a73-94bd-5c872a462786
DEBUG: keystoneauth.session RESP BODY: {"NeutronError": {"type": 
"InvalidInput", "message": "Invalid input for operation: Invalid 
binding:profile. tag 999 value invalid type.", "detail": ""}}
DEBUG: keystoneauth.session PUT call to network for 
https://10.0.0.101:13696/v2.0/ports/fa2ba28e-3dfe-43af-b75a-8c466d23ebcd used 
request id req-5ca76951-518b-4a73-94bd-5c872a462786
DEBUG: neutronclient.v2_0.client Error message: {"NeutronError": {"type": 
"InvalidInput", "message": "Invalid input for operation: Invalid 
binding:profile. tag 999 value invalid type.", "detail": ""}}
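
For illustration only (the real validator lives in Neutron's port binding
code), a standalone sketch of how a string-encoded segmentation tag trips a
strict integer type check: the legacy neutron CLI parses "tag=999" into the
string "999", so an isinstance(value, int) check rejects it even though the
value itself is usable.

def validate_profile_tag(tag):
    # Hypothetical strict check of the kind that would produce the error above.
    if not isinstance(tag, int):
        raise ValueError(
            "Invalid binding:profile. tag %r value invalid type." % tag)
    return tag

if __name__ == '__main__':
    validate_profile_tag(999)         # passes
    try:
        validate_profile_tag("999")   # what the CLI ends up sending
    except ValueError as exc:
        print(exc)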

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1945651

Title:
  Updating binding profile through CLI doesn't work

Status in neutron:
  New

Bug description:
  Updating the binding profile of a port fails with an "invalid type"
  error. This suggests a bug in the Neutron code that validates the
  binding:profile parameters.

  
  $ neutron --debug port-update subportD --binding:profile type=dict 
parent_name=4af7ef43-597b-4747-b3ac-2b045db17374,tag=999

  ...

  DEBUG: keystoneauth.session RESP: [400] Content-Length: 152 Content-Type: 
application/json Date: Thu, 30 Sep 2021 13:14:32 GMT X-Openstack-Request-Id: 
req-5ca76951-518b-4a73-94bd-5c872a462786
  DEBUG: keystoneauth.session RESP BODY: {"NeutronError": {"type": 
"InvalidInput", "message": "Invalid input for operation: Invalid 
binding:profile. tag 999 value invalid type.", "detail": ""}}
  DEBUG: keystoneauth.session PUT call to network for 
https://10.0.0.101:13696/v2.0/ports/fa2ba28e-3dfe-43af-b75a-8c466d23ebcd used 
request id req-5ca76951-518b-4a73-94bd-5c872a462786
  DEBUG: neutronclient.v2_0.client Error message: {"NeutronError": {"type": 
"InvalidInput", "message": "Invalid input for operation: Invalid 
binding:profile. tag 999 value invalid type.", "detail": ""}}

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1945651/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1904412] [NEW] [ovn] Don't include IP addresses for OVN ports if both port security and DHCP are disabled

2020-11-16 Thread Daniel Alvarez
Public bug reported:

Right now, when port security is disabled, the ML2/OVN plugin sets the
addresses field to ["unknown", "mac IP1 IP2..."]. E.g.:

port 2da76786-51f0-4217-a09b-0c16e6728588 (aka servera-port-2)
addresses: ["52:54:00:02:FA:0A 192.168.0.245", "unknown"]

There are scenarios (eg. NIC teaming) where the traffic may come from
two different ports with the same source MAC address. While this is
fine, on the way back, OVN doesn't learn the location of the MAC and it
will deliver to the port which has the MAC address defined in the DB.

E.g

port1 - MAC1
port2 - MAC2

If traffic goes out from port2 with smac=MAC1, then the traffic will be 
delivered by OVN.
However, for incoming traffic getting to br-int with dmac=MAC1, OVN will 
deliver that to port1 instead of port2 because of the above configuration.

If OVN is not configured with any MAC(s) then the traffic will be
flooded to all ports which have addresses=["unknown"].

The reason why "MAC IP" is added is merely so that OVN can install the
necessary flows to serve DHCP natively.

In order to cover these use cases, the ML2/OVN driver could clear the
MAC/IP pairs from the 'addresses' column of those ports that belong to a
network with DHCP disabled.
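
A sketch of that idea (this is the proposal from this report, not the
current driver behaviour): compute the 'addresses' value for a Logical
Switch Port from the port security and DHCP flags, keeping the MAC/IP pair
only when OVN still needs it to serve DHCP natively.

def lsp_addresses(mac, fixed_ips, port_security_enabled, dhcp_enabled):
    if port_security_enabled:
        return ['%s %s' % (mac, ' '.join(fixed_ips))]
    addresses = ['unknown']
    if dhcp_enabled:
        # MAC/IP pair kept only so that OVN can answer DHCP natively.
        addresses.append('%s %s' % (mac, ' '.join(fixed_ips)))
    return addresses

if __name__ == '__main__':
    # Port security and DHCP both disabled -> only "unknown", so incoming
    # traffic is flooded/learned instead of pinned to a single port.
    print(lsp_addresses('52:54:00:02:fa:0a', ['192.168.0.245'], False, False))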

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1904412

Title:
  [ovn] Don't include IP addresses for OVN ports if both port security
  and DHCP are disabled

Status in neutron:
  New

Bug description:
  Right now, when port security is disabled, the ML2/OVN plugin sets the
  addresses field to ["unknown", "mac IP1 IP2..."]. E.g.:

  port 2da76786-51f0-4217-a09b-0c16e6728588 (aka servera-port-2)
  addresses: ["52:54:00:02:FA:0A 192.168.0.245", "unknown"]

  There are scenarios (eg. NIC teaming) where the traffic may come from
  two different ports with the same source MAC address. While this is
  fine, on the way back, OVN doesn't learn the location of the MAC and
  it will deliver to the port which has the MAC address defined in the
  DB.

  E.g

  port1 - MAC1
  port2 - MAC2

  If traffic goes out from port2 with smac=MAC1, then the traffic will be 
delivered by OVN.
  However, for incoming traffic getting to br-int with dmac=MAC1, OVN will 
deliver that to port1 instead of port2 because of the above configuration.

  If OVN is not configured with any MAC(s) then the traffic will be
  flooded to all ports which have addresses=["unknown"].

  The reason why "MAC IP" is added is merely so that OVN can install the
  necessary flows to serve DHCP natively.

  In order to cover these use cases, the ML2/OVN driver could clear the
  MAC/IP pairs from the 'addresses' column of those ports that belong
  to a network with DHCP disabled.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1904412/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1893656] [NEW] [ovn] Limit the number of metadata workers

2020-08-31 Thread Daniel Alvarez
Public bug reported:

The OVN Metadata agent reuses the metadata_workers config option from
the ML2/OVS Metadata agent.

However, it makes sense to split the option: the two agents work in
completely different ways, so they warrant different defaults.

In OVN, the metadata agent runs on compute nodes, while in the ML2/OVS
case it usually runs on controllers, so the scenarios are completely
different.

We defaulted to 2 in TripleO and this commit includes further details:

https://opendev.org/openstack/puppet-
neutron/commit/847f434140ee8435ee842801748a0deccdff8155
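
A sketch of what a dedicated option could look like using oslo.config; the
option name and the default of 2 come from the TripleO change above, while
the group name used here is just an assumption for illustration.

from oslo_config import cfg

OVN_METADATA_OPTS = [
    cfg.IntOpt('metadata_workers',
               default=2,
               help='Number of separate worker processes for the OVN '
                    'metadata agent. The agent runs on every compute '
                    'node, so a small default avoids multiplying worker '
                    'processes by the number of hypervisors.'),
]

def register_opts(conf=cfg.CONF):
    # Hypothetical group name; the real option layout may differ.
    conf.register_opts(OVN_METADATA_OPTS, group='ovn_metadata')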

** Affects: neutron
 Importance: Medium
 Status: Confirmed


** Tags: ovn

** Tags added: ovn

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1893656

Title:
  [ovn] Limit the number of metadata workers

Status in neutron:
  Confirmed

Bug description:
  The OVN Metadata agent reuses the metadata_workers config option from
  the ML2/OVS Metadata agent.

  However, it makes sense to split the option: the two agents work in
  completely different ways, so they warrant different defaults.

  In OVN, the metadata agent runs on compute nodes, while in the ML2/OVS
  case it usually runs on controllers, so the scenarios are completely
  different.

  We defaulted to 2 in TripleO and this commit includes further details:

  https://opendev.org/openstack/puppet-
  neutron/commit/847f434140ee8435ee842801748a0deccdff8155

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1893656/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1883554] [NEW] [ovn] Agent liveness checks create too many writes into OVN db

2020-06-15 Thread Daniel Alvarez
Public bug reported:

Every time the agent liveness check is triggered (via API or periodically 
every agent_down_time / 2 seconds), there are a lot of writes into the SB 
database on the Chassis table.
These writes trigger recomputation in the ovn-controller running on every 
node, causing a considerable performance hit, especially under stress.


After this commit was merged [0] we avoid bumping nb_cfg too frequently, 
but we are still writing into the Chassis table too often, from all the 
workers.

We should use the same logic as in [1] to skip writes when one has happened
recently.


[0] 
https://opendev.org/openstack/neutron/commit/647b7f63f9dafedfa9fb6e09e3d92d66fb512f0b
[1] 
https://github.com/openstack/neutron/blob/4de18104ae88a835544cefbf30c878aa49efc31f/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L1075
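
A standalone sketch of that rate-limiting idea (the cache layout and the
threshold below are illustrative assumptions, not the code in [1]): remember
when each Chassis was last written and skip the SB transaction if it
happened too recently.

import time

_last_chassis_write = {}

def should_write_chassis(chassis_name, min_interval=30.0, now=None):
    """Return True (and record the write) only if enough time has passed."""
    now = time.time() if now is None else now
    last = _last_chassis_write.get(chassis_name, 0.0)
    if now - last < min_interval:
        return False
    _last_chassis_write[chassis_name] = now
    return True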

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: ovn

** Tags added: ovn

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1883554

Title:
  [ovn] Agent liveness checks create too many writes into OVN db

Status in neutron:
  New

Bug description:
  Every time the agent liveness check is triggered (via API or periodically 
  every agent_down_time / 2 seconds), there are a lot of writes into the SB 
  database on the Chassis table.
  These writes trigger recomputation in the ovn-controller running on every 
  node, causing a considerable performance hit, especially under stress.

  
  After this commit was merged [0] we avoid bumping nb_cfg too frequently, 
  but we are still writing into the Chassis table too often, from all the 
  workers.

  We should use the same logic as in [1] to skip writes when one has
  happened recently.

  
  [0] 
https://opendev.org/openstack/neutron/commit/647b7f63f9dafedfa9fb6e09e3d92d66fb512f0b
  [1] 
https://github.com/openstack/neutron/blob/4de18104ae88a835544cefbf30c878aa49efc31f/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L1075

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1883554/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1874733] [NEW] [OVN] Stale ports can be present in OVN NB leading to metadata errors

2020-04-24 Thread Daniel Alvarez
Public bug reported:

Right now, there's a chance that deleting a port in Neutron with ML2/OVN
actually deletes the object from Neutron DB while leaving a stale port
in the OVN NB database.

This can happen when deleting a port [0] raises a RowNotFound exception.
While that may look like the port already didn't exist in OVN NB, the
truth is that the current port_delete function can throw that exception
for different reasons (especially against OVN < 2.10, when Address Sets
were used instead of Port Groups).

Such an exception can be observed, for example, if some ACL or Address Set
doesn't exist [1][2], amongst others. In this case, the revision number
of the object will be deleted [3] and the port will remain stale forever in
OVN NB (it'll be skipped by the maintenance task).

One of the main impacts of this issue is that the OVN NB database will
grow and accumulate stale objects that go undetected (they'll only be
detected by the neutron-ovn-db-sync script) but, most importantly, that
multiple ports in the same OVN Logical Switch may have the same IP
addresses, and this causes legitimate ports to be left without metadata.

As per the metadata agent code here [4], if more than one port in the same
network has the same IP address, a 404 will be returned to the instance
when it requests metadata.

The workaround is running the neutron-db-sync script in repair mode to
get rid of the stale ports.

A proper fix would involve finer-grained handling of the exceptions that
can happen around a port deletion, acting accordingly upon each of them.
In the worst case, we would not delete the revision number if the port
still exists, leaving it up to the maintenance task to fix it later on
(< 5 minutes). Ideally, we should identify all possible code paths and
delete the port from OVN whenever possible, even if some other associated
operation fails (with proper logging).


Also, this scenario seems to be more likely under high concurrency of API 
operations (such as those driven by Heat) and possibly when Port Groups are 
not supported by the schema (OVN < 2.10).

Daniel Alvarez


[0] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L719
[1] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L680
[2] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L690
[3] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py#L722
[4] 
https://github.com/openstack/neutron/blob/99774a0465bce893e0b7178fe83fe1985432c704/neutron/agent/ovn/metadata/server.py#L86
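
A sketch of the finer-grained handling described above. The helpers passed
in (delete_acls, delete_lsp, lsp_exists, delete_revision) are placeholders
for the real ovn_client/ovsdbapp calls; the point is only that a failure
while cleaning up ACLs or Address Sets must not be confused with "the port
is already gone", and the revision number is only dropped once the LSP is
really deleted.

class RowNotFound(Exception):
    """Stand-in for the ovsdbapp RowNotFound exception."""

def delete_port(port_id, delete_acls, delete_lsp, lsp_exists,
                delete_revision, log):
    try:
        delete_acls(port_id)
    except RowNotFound:
        # Missing ACL/Address Set entries are not fatal for the port itself.
        log('ACL/Address Set cleanup for %s raised RowNotFound, '
            'continuing' % port_id)
    try:
        delete_lsp(port_id)
    except RowNotFound:
        if lsp_exists(port_id):
            # Something else failed: keep the revision number so that the
            # maintenance task retries the deletion instead of leaving a
            # stale LSP behind forever.
            log('LSP %s still present, deferring to maintenance task'
                % port_id)
            return
    delete_revision(port_id)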

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: ovn

** Tags added: ovn

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1874733

Title:
  [OVN] Stale ports can be present in OVN NB leading to metadata errors

Status in neutron:
  New

Bug description:
  Right now, there's a chance that deleting a port in Neutron with
  ML2/OVN actually deletes the object from Neutron DB while leaving a
  stale port in the OVN NB database.

  This can happen when deleting a port [0] raises a RowNotFound
  exception. While that may look like the port already didn't exist in
  OVN NB, the truth is that the current port_delete function can throw
  that exception for different reasons (especially against OVN < 2.10,
  when Address Sets were used instead of Port Groups).

  Such an exception can be observed, for example, if some ACL or Address
  Set doesn't exist [1][2], amongst others. In this case, the revision
  number of the object will be deleted [3] and the port will remain stale
  forever in OVN NB (it'll be skipped by the maintenance task).

  One of the main impacts of this issue is that the OVN NB database will
  grow and accumulate stale objects that go undetected (they'll only be
  detected by the neutron-ovn-db-sync script) but, most importantly, that
  multiple ports in the same OVN Logical Switch may have the same IP
  addresses, and this causes legitimate ports to be left without metadata.

  As per the metadata agent code here [4], if more than one port in the
  same network has the same IP address, a 404 will be returned to the
  instance when it requests metadata.

  The workaround is running the neutron-db-sync script in repair mode to
  get rid of the stale ports.

  A proper fix would involve a better granularity of the exceptions that
  can happen around a port deletion and acting accordingly upon each of
  them. In the worst case, we won't be deleting the revision number if
  the port still exists leaving up to the Maintenance task to fix it
  later on (< 5 minutes). Ideally, we should identify all possible code

[Yahoo-eng-team] [Bug 1865889] [NEW] [RFE] Routed provider networks support in OVN

2020-03-03 Thread Daniel Alvarez
Public bug reported:

The routed provider networks feature doesn't work properly with the OVN
backend. While the API doesn't return any errors, all the ports are
allocated to the same OVN Logical Switch and, besides providing no Layer 2
isolation whatsoever, it won't work when multiple segments using
different physnets are added to such a network.

The reason for the latter is that, currently, in core OVN only one
localnet port is supported per Logical Switch, so only one physical
network can be associated with it. I can think of two different approaches:

1) Change the OVN mech driver to logically separate Neutron segments:

a) Create an OVN Logical Switch *per Neutron segment*. This has some
challenges from a consistency point of view as right now there's a 1:1
mapping between a Neutron Network and an OVN Logical Switch. Revision
numbers, maintenance task, OVN DB Sync script, etcetera.

b) Each of those Logical Switches will have a localnet port associated
to the physnet of the Neutron segment.

c) The port still belongs to the parent network, so all CRUD operations on 
a port will require figuring out which underlying OVN LS applies (depending 
on which segment the port lives in).
The same goes for other objects (e.g. OVN Load Balancers, gateway ports - if 
attaching a multisegment network to a Neutron router as a gateway is a valid 
use case at all).

e) Deferred allocation. A port can be created in a multisegment Neutron
network, but the IP allocation is deferred until a compute node is
assigned to the instance. In this case the OVN mech driver might need to
move the Logical Switch Port from the Logical Switch of the parent network
to that of the segment where it falls (which can be prone to race
conditions :?).


2) Core OVN changes:

The current limitation is that right now only one localnet port is
allowed per Logical Switch so we can't map different physnets to it. If
we add support for multiple localnet ports in core OVN, we can have all
the segments living in the same OVN Logical Switch.

My idea here would be:

a) For each Neutron segment, we create a localnet port in the single OVN
Logical Switch with its physnet and VLAN ID (if any). E.g.:

name: provnet-f7038db6-7376-4b83-b57b-3f456bea2b80
options : {network_name=segment1}
parent_name : []
port_security   : []
tag : 2016
tag_request : []
type: localnet


name: provnet-84487aa7-5ac7-4f07-877e-1840d325e3de
options : {network_name=segment2}
parent_name : []
port_security   : []
tag : 2017
tag_request : []
type: localnet

And both ports would belong to the LS corresponding to the multisegment
Neutron network.

b) In this case, when ovn-controller sees that a port in that network
has been bound to it, all it needs to create is the patch port to the
provider bridge that the bridge mappings configuration dictates.

E.g

compute1:bridge-mappings = segment1:br-provider1
compute2:bridge-mappings = segment2:br-provider2

When a port in the multisegment network gets bound to compute1, ovn-
controller will create a patch-port between br-int and br-provider1. The
restriction here is that on a given hypervisor, only ports belonging to
the same segment will be present. ie. we can't mix VMs on different
segments on the same hypervisor.


c) Minor changes are required on the Neutron side (just creating the localnet 
port upon segment creation).


We need to discuss if the restriction mentioned earlier makes sense. If not, 
perhaps we need to drop this approach completely or look for core OVN 
alternatives.


I'd lean towards approach number 2 as it seems the least invasive in terms of 
code changes, but the catch described above may make it a no-go or require 
exploring other ways to eliminate that restriction somehow in core OVN.
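
To make approach 2 more concrete, here is a sketch of what creating one
localnet port per segment in the parent network's single Logical Switch
could look like, expressed as the equivalent ovn-nbctl invocations (the
driver would go through ovsdbapp instead); the naming scheme and the tag
handling are illustrative assumptions.

def localnet_port_cmds(ls_name, segment_id, physnet, vlan_id=None):
    """Build the ovn-nbctl commands for one localnet port per segment."""
    port = 'provnet-%s' % segment_id
    cmds = [
        ['ovn-nbctl', 'lsp-add', ls_name, port],
        ['ovn-nbctl', 'lsp-set-type', port, 'localnet'],
        ['ovn-nbctl', 'lsp-set-addresses', port, 'unknown'],
        ['ovn-nbctl', 'lsp-set-options', port, 'network_name=%s' % physnet],
    ]
    if vlan_id is not None:
        cmds.append(['ovn-nbctl', 'set', 'Logical_Switch_Port', port,
                     'tag=%d' % vlan_id])
    return cmds

if __name__ == '__main__':
    for cmd in localnet_port_cmds('neutron-multiseg-net',
                                  'f7038db6-7376-4b83-b57b-3f456bea2b80',
                                  'segment1', vlan_id=2016):
        print(' '.join(cmd))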

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: ovn rfe

** Tags added: rfe

** Tags added: ovn

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1865889

Title:
  [RFE] Routed provider networks support in OVN

Status in neutron:
  New

Bug description:
  The routed provider networks feature doesn't work properly with the OVN
  backend. While the API doesn't return any errors, all the ports are
  allocated to the same OVN Logical Switch and, besides providing no
  Layer 2 isolation whatsoever, it won't work when multiple segments
  using different physnets are added to such a network.

  The reason for the latter is that, currently, in core OVN only one
  localnet port is supported per Logical Switch, so only one physical
  network can be associated with it. I can think of two different approaches:

  1) Change the OVN mech driver to logically separate Neutron segments:

  a) Create an OVN Logical Switch *per Neutron segment*. This has some
  challenges from a consistency 

[Yahoo-eng-team] [Bug 1864641] [NEW] [OVN] Run maintenance task whenever the OVN DB schema has been upgraded

2020-02-25 Thread Daniel Alvarez
Public bug reported:

When the OVN DBs are upgraded (and restarted), there might be cases where
we want to adapt things to the new schema. In this situation we don't want
to force a restart of neutron-server (or the metadata agent) but instead
detect the upgrade and run whatever is needed.

This can be achieved by checking the schema version via ovsdbapp [0] upon a
reconnection to the OVN DBs and comparing whether it is greater than the
one we had before.

[0]
https://github.com/openvswitch/ovs/blob/master/python/ovs/db/schema.py#L35
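
A standalone sketch of the comparison (the get-schema-version call itself is
whatever ovsdbapp/python-ovs exposes, see [0]; version strings are dotted,
e.g. "5.16.0"): remember the version seen at startup and, on every
reconnection, trigger the maintenance task if the reported version grew.

def parse_version(version_string):
    return tuple(int(part) for part in version_string.split('.'))

def schema_upgraded(version_at_startup, version_on_reconnect):
    return parse_version(version_on_reconnect) > parse_version(version_at_startup)

if __name__ == '__main__':
    print(schema_upgraded('5.16.0', '5.17.0'))   # True -> run maintenance task
    print(schema_upgraded('5.16.0', '5.16.0'))   # False -> nothing to do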

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1864641

Title:
  [OVN] Run maintenance task whenever the OVN DB schema has been
  upgraded

Status in neutron:
  New

Bug description:
  When the OVN DBs are upgraded (and restarted), there might be cases
  where we want to adapt things to the new schema. In this situation we
  don't want to force a restart of neutron-server (or the metadata agent)
  but instead detect the upgrade and run whatever is needed.

  This can be achieved by checking the schema version via ovsdbapp [0]
  upon a reconnection to the OVN DBs and comparing whether it is greater
  than the one we had before.

  [0]
  https://github.com/openvswitch/ovs/blob/master/python/ovs/db/schema.py#L35

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1864641/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1861509] [NEW] [OVN] GW rescheduling logic is broken

2020-01-31 Thread Daniel Alvarez
Public bug reported:

When a Chassis event happens in the SB database, we attempt to
reschedule any possible unhosted gateways [0] *always* due to a problem
with the existing logic:


def get_unhosted_gateways(self, port_physnet_dict, chassis_physnets,
                          gw_chassis):
    unhosted_gateways = []
    for lrp in self._tables['Logical_Router_Port'].rows.values():
        if not lrp.name.startswith('lrp-'):
            continue
        physnet = port_physnet_dict.get(lrp.name[len('lrp-'):])
        chassis_list = self._get_logical_router_port_gateway_chassis(lrp)
        is_max_gw_reached = len(chassis_list) < ovn_const.MAX_GW_CHASSIS
        for chassis_name, prio in chassis_list:
            # TODO(azbiswas): Handle the case when a chassis is no
            # longer valid. This may involve moving conntrack states,
            # so it needs to discussed in the OVN community first.
            if is_max_gw_reached or utils.is_gateway_chassis_invalid(
                    chassis_name, gw_chassis, physnet, chassis_physnets):
                unhosted_gateways.append(lrp.name)
    return unhosted_gateways


1) is_max_gw_reached is always going to be True (normally the number of 
candidate chassis is lower than the maximum, so the comparison always holds).

2) unhosted_gateways.append(lrp.name) is executed inside a loop where
lrp doesn't change, meaning that if there are 3 candidates in
chassis_list, lrp.name is added 3 times to the list!

3) Later on, the caller iterates over the returned list [1], and since it
contains every LRP N times (N being the number of gateway chassis), it
will do a lot of extra and unnecessary work.


This is almost harmless in the sense that it's not breaking any functionality 
but it creates unnecessary updates on the logical router port:


2020-01-31 15:54:04.669 37 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] 
Running txn n=1 command(idx=0): 
UpdateLRouterPortCommand(name=lrp-93b49ece-2dbc-4fcc-84cb-e7afd482a12e, 
columns={'gateway_chassis': ['0444b1f1-e9a9-4a73-ba78-997c87e61795', 
'43d98571-ccd6-48ce-bf4f-08f24aeed522', 
'fe8f9887-27ef-4724-8cfc-50ec6e3d4a98']}, if_exists=True) do_commit 
/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:84
2020-01-31 15:54:04.670 37 DEBUG ovsdbapp.backend.ovs_idl.transaction [-] 
Transaction caused no change do_commit 
/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121


[0] 
https://github.com/openstack/neutron/blob/4689564fa29915b042547bdeb3dcb44bca54e20c/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/impl_idl_ovn.py#L449
[1] 
https://github.com/openstack/neutron/blob/858d7f33950a80c73501377a4b2cd36b915d0f40/neutron/services/ovn_l3/plugin.py#L324
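
For reference, a standalone sketch of one possible shape of the fix for
points 1) to 3) above (the OVN lookups are replaced by plain arguments, and
the exact rescheduling policy is left aside): an LRP is reported at most
once, and only when one of its hosting chassis is actually invalid or it
has no chassis at all.

def get_unhosted_gateways(lrp_chassis_map, is_chassis_invalid):
    """lrp_chassis_map: {lrp_name: [(chassis_name, priority), ...]}."""
    unhosted = []
    for lrp_name, chassis_list in lrp_chassis_map.items():
        if not chassis_list or any(is_chassis_invalid(name)
                                   for name, _prio in chassis_list):
            unhosted.append(lrp_name)   # appended once, not once per chassis
    return unhosted

if __name__ == '__main__':
    lrps = {'lrp-93b49ece': [('chassis-a', 3), ('chassis-b', 2)]}
    print(get_unhosted_gateways(lrps, lambda name: False))  # [] -> no work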

** Affects: neutron
 Importance: Undecided
 Assignee: Maciej Jozefczyk (maciej.jozefczyk)
 Status: Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1861509

Title:
  [OVN] GW rescheduling logic is broken

Status in neutron:
  Confirmed

Bug description:
  When a Chassis event happens in the SB database, we attempt to
  reschedule any possible unhosted gateways [0] *always* due to a
  problem with the existing logic:

  
  def get_unhosted_gateways(self, port_physnet_dict, chassis_physnets,
                            gw_chassis):
      unhosted_gateways = []
      for lrp in self._tables['Logical_Router_Port'].rows.values():
          if not lrp.name.startswith('lrp-'):
              continue
          physnet = port_physnet_dict.get(lrp.name[len('lrp-'):])
          chassis_list = self._get_logical_router_port_gateway_chassis(lrp)
          is_max_gw_reached = len(chassis_list) < ovn_const.MAX_GW_CHASSIS
          for chassis_name, prio in chassis_list:
              # TODO(azbiswas): Handle the case when a chassis is no
              # longer valid. This may involve moving conntrack states,
              # so it needs to discussed in the OVN community first.
              if is_max_gw_reached or utils.is_gateway_chassis_invalid(
                      chassis_name, gw_chassis, physnet, chassis_physnets):
                  unhosted_gateways.append(lrp.name)
      return unhosted_gateways

  
  1) is_max_gw_reached is always going to be True (normally the number of 
  candidate chassis is lower than the maximum, so the comparison always holds).

  2) unhosted_gateways.append(lrp.name) is executed inside a loop where
  lrp doesn't change, meaning that if there are 3 candidates in
  chassis_list, lrp.name is added 3 times to the list!

  3) Later on, the caller iterates over the returned list [1], and since
  it contains every LRP N times (N being the number of gateway chassis),
  it will do a lot of extra and unnecessary work.

  
  This is almost harmless in the sense that it's not breaking any functionality 
but it creates unnecessary

[Yahoo-eng-team] [Bug 1861510] [NEW] [OVN] GW rescheduling mechanism is triggered on every Chassis updated unnecessarily

2020-01-31 Thread Daniel Alvarez
Public bug reported:

Whenever a chassis is updated for whatever reason, we're triggering the
rescheduling mechanism [0]. As the current agent liveness check involves
updating the Chassis table quite frequently, we should avoid
rescheduling gateways for those checks (ie. when either nb_cfg or
external_ids change).


[0] 
https://github.com/openstack/neutron/blob/4689564fa29915b042547bdeb3dcb44bca54e20c/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L87
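
A standalone sketch of the proposed filter (the real check would live in the
ChassisEvent handler referenced in [0]; the column set is an assumption):
skip gateway rescheduling when the only columns that changed between the
old and the new Chassis row are the ones touched by the liveness check.

LIVENESS_ONLY_COLUMNS = {'nb_cfg', 'external_ids'}

def should_reschedule(old_row, new_row):
    """old_row/new_row: dicts of Chassis column name -> value."""
    changed = {col for col in set(old_row) | set(new_row)
               if old_row.get(col) != new_row.get(col)}
    return bool(changed - LIVENESS_ONLY_COLUMNS)

if __name__ == '__main__':
    old = {'nb_cfg': 100, 'external_ids': {'x': '1'}, 'hostname': 'compute-0'}
    new = dict(old, nb_cfg=101)
    print(should_reschedule(old, new))  # False: only liveness data changed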

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1861510

Title:
  [OVN] GW rescheduling mechanism is triggered on every Chassis updated
  unnecessarily

Status in neutron:
  New

Bug description:
  Whenever a chassis is updated for whatever reason, we're triggering
  the rescheduling mechanism [0]. As the current agent liveness check
  involves updating the Chassis table quite frequently, we should avoid
  rescheduling gateways for those checks (ie. when either nb_cfg or
  external_ids change).

  
  [0] 
https://github.com/openstack/neutron/blob/4689564fa29915b042547bdeb3dcb44bca54e20c/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L87

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1861510/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1860436] [NEW] [ovn] Agent liveness checks are flaky and report false positives

2020-01-21 Thread Daniel Alvarez
Public bug reported:

The way that networking-ovn mech driver performs health checks on agents
reports false positives due to race conditions:

1) neutron-server increments the nb_cfg in NB_Global table from X to X+1
2) neutron-server almost immediately checks all the Chassis rows to see if they 
have written (X+1) . [1]
3) neutron-server process the updates from each agent from X to X+1

*Most* of the time, in step 2, this condition doesn't hold, so the
timestamp is not updated. The result is that after 60 seconds (the default
agent timeout), the agent is shown as dead. Sometimes 3) happens before 2),
so the timestamp gets updated and all is fine, but this is not the normal
case:


1) Bump of nb_cfg
2020-01-21 11:35:59.534 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36915
2020-01-21 11:35:59.538 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36916


2) Check of each chassis ext_id against our new bumped nb_cfg: 
2020-01-21 11:35:59.539 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   
chassis nb_cfg = 36915
2020-01-21 11:35:59.540 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   
chassis nb_cfg = 36915
2020-01-21 11:35:59.541 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   
chassis nb_cfg = 36915
2020-01-21 11:35:59.542 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   
chassis nb_cfg = 36915
2020-01-21 11:35:59.543 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   
chassis nb_cfg = 36915
2020-01-21 11:35:59.544 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   
chassis nb_cfg = 36915
2020-01-21 11:35:59.546 28 INFO networking_ovn.ml2.mech_driver 
[req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 
d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   
chassis nb_cfg = 36915


3) Processing updates [2] in the ChassisEvent (some are even older!)
2020-01-21 11:35:59.546 30 INFO networking_ovn.ovsdb.ovsdb_monitor 
[req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.548 29 INFO networking_ovn.ovsdb.ovsdb_monitor 
[req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.556 32 INFO networking_ovn.ovsdb.ovsdb_monitor 
[req-efa34cac-2296-4d30-b153-9630b0309fcd - - - - -] XXX chassis update:
2020-01-21 11:35:59.556 27 INFO networking_ovn.ovsdb.ovsdb_monitor 
[req-91f7d181-bfa3-4646-9814-bb680d011081 - - - - -] XXX chassis update:
2020-01-21 11:35:59.557 25 INFO networking_ovn.ovsdb.ovsdb_monitor 
[req-420e5a25-13e4-4da6-8277-8a3a1028c9e9 - - - - -] XXX chassis update:
2020-01-21 11:35:59.756 30 INFO networking_ovn.ovsdb.ovsdb_monitor 
[req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36916
2020-01-21 11:35:59.778 29 INFO networking_ovn.ovsdb.ovsdb_monitor 
[req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36916

IMO, we need to space the bump of nb_cfg [2] and the check [3] further
apart in time, as the NB_Global change needs to be propagated to the SB,
processed by all agents and then make it back to neutron-server, which
needs to process the JSON and update its internal tables. So even if this
is fast, most of the time it is not fast enough.

Another solution is to allow a difference of '1' when updating the timestamps.
 

[0] 
https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1093
[1] 
https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1098
[2] 
https://github.com/openstack/networking-ovn/blob/bf577e5a999f7db4cb9b790664ad596e1926d9a0/networking_ovn/ml2/mech_driver.py#L988
[3] 
https://github.com/openstack/networking-ovn/blob/6302298e9c4313f1200c543c89d92629daff9e89/networking_ovn/ovsdb/ovsdb_monitor.py#L74
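
A tiny sketch of that second option (names are illustrative, not the mech
driver code): when deciding whether a chassis is alive, accept that it may
still be reporting the previous nb_cfg value, since the bumped value may not
have round-tripped through the SB DB yet.

def chassis_is_alive(global_nb_cfg, chassis_nb_cfg, tolerance=1):
    """Treat a chassis lagging by at most `tolerance` bumps as alive."""
    return chassis_nb_cfg >= global_nb_cfg - tolerance

if __name__ == '__main__':
    print(chassis_is_alive(36916, 36915))  # True thanks to the tolerance
    print(chassis_is_alive(36916, 36914))  # False: genuinely lagging behind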

** Affects: neutron
 Importance: Undecided
     Assignee: Daniel Alvarez (dalvarezs)

[Yahoo-eng-team] [Bug 1804259] [NEW] DB: sorting on elements which are AssociationProxy fails

2018-11-20 Thread Daniel Alvarez
Public bug reported:

If I do a DB query trying to sort by a column which is an
AssociationProxy I get the following exception:


Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/opt/stack/neutron/neutron/db/db_base_plugin_v2.py", line 1438, in get_ports
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers page_reverse=page_reverse)
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/opt/stack/neutron/neutron/plugins/ml2/plugin.py", line 1935, in _get_ports_qu
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers *args, **kwargs)
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/opt/stack/neutron/neutron/db/db_base_plugin_v2.py", line 1418, in _get_ports_
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers *args, **kwargs)
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/opt/stack/neutron/neutron/db/_model_query.py", line 159, in get_collection_qu
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers sort_keys = 
db_utils.get_and_validate_sort_keys(sorts, model)
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/usr/lib/python2.7/site-packages/neutron_lib/db/utils.py", line 45, in get_and
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers if isinstance(sort_key_attr.property,
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers AttributeError: 'AssociationProxy' object has no 
attribute 'property'
Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers


This is reproducible for example by querying ports sorted by 'created_at' 
attribute:

ports = self._plugin.get_ports(context, sorts=[('created_at', True)])


Looks like we may need to special-case the AssociationProxy columns, as we 
already do in the filtering code at:

https://github.com/openstack/neutron/blob/0bb6136919a31751242d2efbefedbd8922b6bd0a/neutron/db/_model_query.py#L88
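
For illustration, a standalone sketch of the guard the sort-key validation
is missing (this is not the neutron-lib patch; a complete fix would resolve
the proxied attribute, as the filtering code in _model_query.py does,
instead of just detecting it): attributes without a 'property', such as an
AssociationProxy like 'created_at' on Port, need a dedicated code path
rather than the plain-column one that currently blows up.

def get_and_validate_sort_keys(sorts, model):
    sort_keys = []
    for sort_key, _direction in sorts:
        attr = getattr(model, sort_key, None)
        if attr is None:
            raise ValueError("%s is not a valid attribute of %s"
                             % (sort_key, model.__name__))
        if not hasattr(attr, 'property'):
            # Association proxy (or similar): handle it explicitly instead
            # of raising AttributeError deep inside the query machinery.
            raise NotImplementedError("sorting on %r requires special-casing "
                                      "the association proxy" % sort_key)
        sort_keys.append(sort_key)
    return sort_keys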

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1804259

Title:
  DB: sorting on elements which are AssociationProxy fails

Status in neutron:
  New

Bug description:
  If I do a DB query trying to sort by a column which is an
  AssociationProxy I get the following exception:

  
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/opt/stack/neutron/neutron/db/db_base_plugin_v2.py", line 1438, in get_ports
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers page_reverse=page_reverse)
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/opt/stack/neutron/neutron/plugins/ml2/plugin.py", line 1935, in _get_ports_qu
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers *args, **kwargs)
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/opt/stack/neutron/neutron/db/db_base_plugin_v2.py", line 1418, in _get_ports_
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers *args, **kwargs)
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/opt/stack/neutron/neutron/db/_model_query.py", line 159, in get_collection_qu
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers sort_keys = 
db_utils.get_and_validate_sort_keys(sorts, model)
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers   File 
"/usr/lib/python2.7/site-packages/neutron_lib/db/utils.py", line 45, in get_and
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers if isinstance(sort_key_attr.property,
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers AttributeError: 'AssociationProxy' object has no 
attribute 'property'
  Nov 20 14:41:20 centos.rdocloud neutron-server[11934]: ERROR 
neutron.plugins.ml2.managers

  
  This is reproducible for example by querying ports sorted by 'created_at' 
attribute:

  ports = self._plugin.get_ports(context, sorts=[('created_at', True)])

  
  Looks like we may need to special-case the AssociationProxy columns, as 
  we already do in the filtering code at:

  
https://github.com/openstack/neutron/blob/0bb6136919a31751242d2efbefedbd8922b6bd0a/neutron/db/_model_query.py#L88

To manage notificati

[Yahoo-eng-team] [Bug 1802369] Re: Unit tests failing due to recent Neutron patch

2018-11-08 Thread Daniel Alvarez
When importing that module, these event listeners are created:

https://github.com/openstack/neutron/blob/master/neutron/db/api.py#L110
and
https://github.com/openstack/neutron/blob/master/neutron/db/api.py#L134


Adding them manually fixed the issue. For now the workaround is to import 
that module so that the listeners get registered, but this is something that 
the Neutron folks need to confirm and perhaps move the registration somewhere 
else so that it happens either way.
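
The interim workaround in networking-ovn boils down to something like the
following sketch: importing the module purely for its side effect of
registering the sqlalchemy event listeners, until Neutron moves that
registration somewhere that is always imported.

# Imported only for its side effect of registering the event listeners above.
from neutron.db import api as db_api  # noqa: F401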

** Also affects: neutron
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1802369

Title:
  Unit tests failing due to recent Neutron patch

Status in networking-ovn:
  New
Status in neutron:
  New

Bug description:
  Since Nov 7th we have unit tests failing. I've doing git bisect on
  neutron and found that [0] is the culprit. Digging further I checked
  that it's not actually that MAX_RETRIES changed from 10 (neutron code)
  to 20 (neutron-lib) but the fact that "from neutron.db import api as
  db_api" is no longer imported.

  
  [0] 
https://github.com/openstack/neutron/commit/3316b45665a99b0f61e45a8c7facf538618861bf

To manage notifications about this bug go to:
https://bugs.launchpad.net/networking-ovn/+bug/1802369/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1797084] [NEW] Stale namespaces when fallback tunnels are present

2018-10-10 Thread Daniel Alvarez
Public bug reported:

When a network namespace is created, if the sysctl
fb_tunnels_only_for_init_net option is set to 0 (the default), fallback
tunnel devices are automatically created in it if the initial namespace
had them.

This leads to neutron's ip_lib detecting such namespaces as 'not empty' and
thus being unable to clean them up.

We need to add these devices so that they are taken into account when
determining if a namespace is empty or not.

More info at: https://www.kernel.org/doc/Documentation/sysctl/net.txt
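
A sketch of the fix direction: treat the kernel's fallback tunnel devices as
expected when deciding whether a namespace is empty. The device names below
are the usual fallback interfaces (sit0, tunl0, gre0, ...); the exact set
depends on which tunnel modules are loaded, so the list is illustrative.

FALLBACK_TUNNEL_DEVICES = {'lo', 'sit0', 'tunl0', 'gre0', 'gretap0',
                           'erspan0', 'ip_vti0', 'ip6_vti0', 'ip6tnl0',
                           'ip6gre0'}

def namespace_is_empty(device_names):
    """True if the namespace only holds loopback/fallback devices."""
    return not (set(device_names) - FALLBACK_TUNNEL_DEVICES)

if __name__ == '__main__':
    print(namespace_is_empty(['lo', 'sit0', 'tunl0']))           # True
    print(namespace_is_empty(['lo', 'sit0', 'tap1234abcd-56']))  # False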

** Affects: networking-ovn
 Importance: Undecided
 Status: New

** Affects: neutron
 Importance: Undecided
 Assignee: Daniel Alvarez (dalvarezs)
 Status: In Progress

** Also affects: networking-ovn
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1797084

Title:
  Stale namespaces when fallback tunnels are present

Status in networking-ovn:
  New
Status in neutron:
  In Progress

Bug description:
  When a network namespace is created, if the sysctl
  fb_tunnels_only_for_init_net option is set to 0 (the default), fallback
  tunnel devices are automatically created in it if the initial namespace
  had them.

  This leads to neutron's ip_lib detecting such namespaces as 'not empty'
  and thus being unable to clean them up.

  We need to add these devices so that they are taken into account when
  determining if a namespace is empty or not.

  More info at: https://www.kernel.org/doc/Documentation/sysctl/net.txt

To manage notifications about this bug go to:
https://bugs.launchpad.net/networking-ovn/+bug/1797084/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1785615] [NEW] DNS resolution through eventlet contact nameservers if there's an IPv4 or IPv6 entry present in hosts file

2018-08-06 Thread Daniel Alvarez
Public bug reported:

When trying to resolve a hostname on a node with no nameservers
configured and only one entry is present for it in /etc/hosts (IPv4 or
IPv6), eventlet will try to fetch the other entry over the network.

This changes the behavior from what the original getaddrinfo()
implementation does and causes 30 second delays and often timeouts when,
for example, metadata agent tries to contact Nova [0].

Here is a simple reproducer which shows the behavior when we do the
monkey patching:

import eventlet
import socket
import time

print(socket.getaddrinfo('overcloud.internalapi.localdomain', 80, 0,
                         socket.SOCK_STREAM))
print(time.time())
eventlet.monkey_patch()
print(socket.getaddrinfo('overcloud.internalapi.localdomain', 80, 0,
                         socket.SOCK_STREAM))
print(time.time())


The eventlet issue was reported here [1] and the fix got merged in the
master branch.

[0] 
https://github.com/openstack/neutron/blob/13.0.0.0b3/neutron/agent/metadata/agent.py#L189
[1] https://github.com/eventlet/eventlet/issues/511

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1785615

Title:
  DNS resolution through eventlet contact nameservers if there's an IPv4
  or IPv6 entry present in hosts file

Status in neutron:
  New

Bug description:
  When trying to resolve a hostname on a node with no nameservers
  configured and only one entry is present for it in /etc/hosts (IPv4 or
  IPv6), eventlet will try to fetch the other entry over the network.

  This changes the behavior from what the original getaddrinfo()
  implementation does and causes 30 second delays and often timeouts
  when, for example, metadata agent tries to contact Nova [0].

  Here is a simple reproducer which shows the behavior when we do the
  monkey patching:

  import eventlet
  import socket
  import time

  print(socket.getaddrinfo('overcloud.internalapi.localdomain', 80, 0,
                           socket.SOCK_STREAM))
  print(time.time())
  eventlet.monkey_patch()
  print(socket.getaddrinfo('overcloud.internalapi.localdomain', 80, 0,
                           socket.SOCK_STREAM))
  print(time.time())

  
  The eventlet issue was reported here [1] and the fix got merged in the
  master branch.

  [0] 
https://github.com/openstack/neutron/blob/13.0.0.0b3/neutron/agent/metadata/agent.py#L189
  [1] https://github.com/eventlet/eventlet/issues/511

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1785615/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1779882] [NEW] Deleting a port on a system with 1K ports takes too long

2018-07-03 Thread Daniel Alvarez
Public bug reported:

When attempting to delete a port on a system with 1K ports, it takes
around 35 seconds to complete:

$ time openstack port delete port60_2

real0m34.367s
user0m3.497s
sys 0m0.187s


Log is *full* of the following messages when I issue the CLI:

neutron-server[324]: DEBUG neutron.pecan_wsgi.hooks.policy_enforcement
[None req-a936bb85-d881-441b-aa07-74c4779d1771 demo demo] Attributes
excluded by policy engine: [u'binding:profile', u'binding:vif_details',
u'binding:vif_type', u'binding:host_id'] {{(pid=342)
_exclude_attributes_by_policy
/opt/stack/neutron/neutron/pecan_wsgi/hooks/policy_enforcement.py:256}}

To be precise: 896 messages like this ^

$ sudo journalctl  -u devstack@q-svc | grep "Attributes excluded by policy 
engine" | wc -l
33626

$ time openstack port delete port60_2

real0m34.367s
user0m3.497s
sys 0m0.187s

$ sudo journalctl  -u devstack@q-svc | grep "Attributes excluded by policy 
engine" | wc -l
34522

I'm using networking-ovn as the mechanism driver, but this looks unrelated
to the backend :?

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1779882

Title:
  Deleting a port on a system with 1K ports takes too long

Status in neutron:
  New

Bug description:
  When attempting to delete a port on a system with 1K ports, it takes
  around 35 seconds to complete:

  $ time openstack port delete port60_2

  real0m34.367s
  user0m3.497s
  sys 0m0.187s

  
  Log is *full* of the following messages when I issue the CLI:

  neutron-server[324]: DEBUG neutron.pecan_wsgi.hooks.policy_enforcement
  [None req-a936bb85-d881-441b-aa07-74c4779d1771 demo demo] Attributes
  excluded by policy engine: [u'binding:profile',
  u'binding:vif_details', u'binding:vif_type', u'binding:host_id']
  {{(pid=342) _exclude_attributes_by_policy
  /opt/stack/neutron/neutron/pecan_wsgi/hooks/policy_enforcement.py:256}}

  To be precise: 896 messages like this ^

  $ sudo journalctl  -u devstack@q-svc | grep "Attributes excluded by policy 
engine" | wc -l
  33626

  $ time openstack port delete port60_2

  real0m34.367s
  user0m3.497s
  sys 0m0.187s

  $ sudo journalctl  -u devstack@q-svc | grep "Attributes excluded by policy 
engine" | wc -l
  34522

  I'm using networking-ovn as the mechanism driver, but this looks
  unrelated to the backend :?

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1779882/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1769609] [NEW] neutron-tempest-plugin: there is no way to create a subnet without a gateway and this breaks trunk tests

2018-05-07 Thread Daniel Alvarez
Public bug reported:

This commit [0] fixed an issue with the subnet CIDR generation in tempest tests.
With the fix, all subnets get a gateway assigned regardless of whether they 
are attached to a router, so it may happen that the gateway port doesn't 
exist. Normally this shouldn't be a big deal, but for trunk ports it's 
currently an issue with test_subport_connectivity [1], where the test boots a 
VM (advanced image), opens an SSH connection to its FIP to configure the 
interface for the subport, and runs dhclient on it.

When dhclient runs, a new default gateway route is installed and the
connectivity to the FIP is lost, making the test fail as it cannot
execute/read any further commands:

I logged into the VM with virsh and checked the routes:

[root@tempest-server-test-378882328 ~]# ip r
default via 10.100.0.17 dev eth0.10
default via 10.100.0.1 dev eth0 proto static metric 100
default via 10.100.0.17 dev eth0.10 proto static metric 400
10.100.0.0/28 dev eth0 proto kernel scope link src 10.100.0.5 metric 100
10.100.0.16/28 dev eth0.10 proto kernel scope link src 10.100.0.25
169.254.169.254 via 10.100.0.18 dev eth0.10 proto dhcp
169.254.169.254 via 10.100.0.2 dev eth0 proto dhcp metric 100

This shouldn't happen, as the subnet is not even connected to a router
and 10.100.0.17 doesn't even exist in Neutron. Prior to [0] it didn't
fail because the old code would create the subnet with gateway=None and
the gateway was skipped (a gateway is actually only set up automatically
if it equals '' [2], but it was None instead [3]).

Let's add a way to configure subnets without a gateway.

[0] 
https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e
[1] 
https://github.com/openstack/neutron-tempest-plugin/blob/02a5e2b07680d8c4dd69d681ae9a01d92b4be0ac/neutron_tempest_plugin/scenario/test_trunk.py#L229
[2] 
https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e#diff-872f814e35c7437b9f42aef71a991279L295
[3] 
https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e#diff-2f4232239c10eae0d0688617a3e6f98dL238
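
A sketch of the kind of knob this asks for (the helper name and the sentinel
are assumptions, not the final neutron-tempest-plugin API): let callers
distinguish "pick a gateway automatically" from "explicitly no gateway",
since the Neutron API itself already accepts gateway_ip=None on subnet
create to mean no gateway.

_AUTO = object()

def build_subnet_body(network_id, cidr, gateway=_AUTO):
    body = {'network_id': network_id, 'cidr': cidr, 'ip_version': 4}
    if gateway is not _AUTO:
        # May be an address, or None for an explicitly gateway-less subnet.
        body['gateway_ip'] = gateway
    return body

if __name__ == '__main__':
    print(build_subnet_body('net-1', '10.100.0.16/28'))                # auto
    print(build_subnet_body('net-1', '10.100.0.16/28', gateway=None))  # none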

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1769609

Title:
  neutron-tempest-plugin: there is no way to create a subnet without a
  gateway and this breaks trunk tests

Status in neutron:
  New

Bug description:
  This commit [0] fixed an issue with the subnet CIDR generation in tempest 
  tests.
  With the fix, all subnets get a gateway assigned regardless of whether 
  they are attached to a router, so it may happen that the gateway port 
  doesn't exist. Normally this shouldn't be a big deal, but for trunk ports 
  it's currently an issue with test_subport_connectivity [1], where the test 
  boots a VM (advanced image), opens an SSH connection to its FIP to 
  configure the interface for the subport, and runs dhclient on it.

  When dhclient runs, a new default gateway route is installed and the
  connectivity to the FIP is lost, making the test fail as it cannot
  execute/read any further commands:

  I logged into the VM with virsh and checked the routes:

  [root@tempest-server-test-378882328 ~]# ip r
  default via 10.100.0.17 dev eth0.10
  default via 10.100.0.1 dev eth0 proto static metric 100
  default via 10.100.0.17 dev eth0.10 proto static metric 400
  10.100.0.0/28 dev eth0 proto kernel scope link src 10.100.0.5 metric 100
  10.100.0.16/28 dev eth0.10 proto kernel scope link src 10.100.0.25
  169.254.169.254 via 10.100.0.18 dev eth0.10 proto dhcp
  169.254.169.254 via 10.100.0.2 dev eth0 proto dhcp metric 100

  This shouldn't happen, as the subnet is not even connected to a router
  and 10.100.0.17 doesn't even exist in Neutron. Prior to [0] it didn't
  fail because the old code would create the subnet with gateway=None and
  the gateway was skipped (a gateway is actually only set up automatically
  if it equals '' [2], but it was None instead [3]).

  Let's add a way to configure subnets without a gateway.

  [0] 
https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e
  [1] 
https://github.com/openstack/neutron-tempest-plugin/blob/02a5e2b07680d8c4dd69d681ae9a01d92b4be0ac/neutron_tempest_plugin/scenario/test_trunk.py#L229
  [2] 
https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e#diff-872f814e35c7437b9f42aef71a991279L295
  [3] 
https://github.com/openstack/neutron-tempest-plugin/commit/0ddc93b1b19922d08bedf331b57c363535bb357e#diff-2f4232239c10eae0d0688617a3e6f98dL238

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1769609/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team

[Yahoo-eng-team] [Bug 1765545] [NEW] tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_port_security_disable_security_group fails due to instances failing to retrieve pub

2018-04-19 Thread Daniel Alvarez
Public bug reported:

Running tempest test
tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_port_security_disable_security_group
fails sometimes when trying to authenticate via public key to the access
point instance [0].

After debugging, I managed to connect to the instance via virsh console
and confirm that the instance did not have the SSH key installed:

       
[CirrOS boot banner]
   http://cirros-cloud.net
login as 'cirros' user. default password: 'cubswin:)'. use 'sudo' for root.
tempest-server-tempest-testsecuritygroupsbasicops-679466304-acc login: cirros
Password:
$ cat .ssh/authorized_keys
$


Looking further up in the console log, I can see the following:
cirros-ds 'net' up at 6.37
checking http://169.254.169.254/2009-04-04/instance-id
successful after 1/20 tries: up 6.54. iid=i-006e
failed to get 
http://169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key
warning: no ec2 metadata for public-keys
failed to get http://169.254.169.254/2009-04-04/user-data
warning: no ec2 metadata for user-data
found datasource (ec2, net)


So it looks like it is able to fetch the instance-id but not the
public key.
When I try to fetch it manually, it is retrieved successfully:

$ curl 169.254.169.254/2009-04-04/meta-data/public-keys/0/openssh-key
ssh-rsa 
B3NzaC1yc2EDAQABAAABAQDICvVroPErVzHbx+a1lhI4RU33f0Nb4DT2FiNbKhaI1ZBl4/zRbqFY5a4lMipV810dCzJSViGJVw0VzNgDOf/zCt6Joosem5qC8hKwRgX5tcEXQ0UnVCiXddP1bydbRVt4BofTCTUPb4SZ3Z4zl0+L4WWB1CY58KYl19Lr7H4zqMXPqa6Mw+k1dpo0YBk3ZZR4pIxGtN916w6x6vtSIy2oDg4zaxUuewGaQNp9wENEuP3+TOseTymBxpbdys2RpUKXM2vhWWDDbrzG0+juOFxn111SgFYom05sjONDM310xHX5KBm6QuJO6ObCkSIKre9wvU60i19YW7pxBtyfztIJ
 Generated-by-Nova

Also, running the following command doesn't work:

$ sudo cirros-apply net -v
$ cat .ssh/authorized_keys
$

If, instead, I run the following command and reboot, the key gets properly
installed:
$ sudo cirros-per boot cirros-apply-net cirros-apply net && reboot
...
$ cat .ssh/authorized_keys
ssh-rsa 
B3NzaC1yc2EDAQABAAABAQDICvVroPErVzHbx+a1lhI4RU33f0Nb4DT2FiNbKhaI1ZBl4/zRbqFY5a4lMipV810dCzJSViGJVw0VzNgDOf/zCt6Joosem5qC8hKwRgX5tcEXQ0UnVCiXddP1bydbRVt4BofTCTUPb4SZ3Z4zl0+L4WWB1CY58KYl19Lr7H4zqMXPqa6Mw+k1dpo0YBk3ZZR4pIxGtN916w6x6vtSIy2oDg4zaxUuewGaQNp9wENEuP3+TOseTymBxpbdys2RpUKXM2vhWWDDbrzG0+juOFxn111SgFYom05sjONDM310xHX5KBm6QuJO6ObCkSIKre9wvU60i19YW7pxBtyfztIJ
 Generated-by-Nova


After checking the OVN metadata proxy log and the nova-metadata-api logs, I
can see the requests and the 200 OK responses:

2018-04-19 22:31:37.383 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200  len: 146 time: 
4.0820560
2018-04-19 22:31:38.800 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 200  len: 183 time: 
1.1210902
2018-04-19 22:31:49.148 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 200  len: 146 time: 
0.0230849
2018-04-19 22:31:49.387 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/ami-launch-index HTTP/1.1" status: 200  len: 136 
time: 0.0262089
2018-04-19 22:31:50.225 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/instance-type HTTP/1.1" status: 200  len: 142 time: 
0.7244408
2018-04-19 22:31:50.482 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/local-ipv4 HTTP/1.1" status: 200  len: 146 time: 
0.0143349
2018-04-19 22:31:50.612 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/public-ipv4 HTTP/1.1" status: 200  len: 144 time: 
0.0130348
2018-04-19 22:31:50.793 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/hostname HTTP/1.1" status: 200  len: 199 time: 
0.0100901
2018-04-19 22:31:51.039 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/local-hostname HTTP/1.1" status: 200  len: 199 time: 
0.0094490
2018-04-19 22:31:51.197 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/user-data HTTP/1.1" status: 404  len: 297 time: 0.0226381
2018-04-19 22:31:51.475 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/block-device-mapping HTTP/1.1" status: 200  len: 143 
time: 0.0118120
2018-04-19 22:31:51.579 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/block-device-mapping/ami HTTP/1.1" status: 200  len: 
138 time: 0.0084291
2018-04-19 22:31:51.672 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" status: 200  
len: 535 time: 12.7038500
2018-04-19 22:31:51.735 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/block-device-mapping/root HTTP/1.1" status: 200  
len: 143 time: 0.0147779
2018-04-19 22:31:51.930 24 INFO eventlet.wsgi.server [-] 10.100.0.7, 
"GET /2009-04-04/meta-data/public-hostname HTTP/1.1" stat

[Yahoo-eng-team] [Bug 1753540] [NEW] When isolated/force metadata is enabled, metadata proxy doesn't get automatically started/stopped when needed

2018-03-05 Thread Daniel Alvarez
Public bug reported:

When the enable_isolated_metadata option is set to True in the DHCP agent
configuration, metadata proxy instances won't get started dynamically
when the network becomes isolated. Similarly, when a subnet is added to
the router, they don't get stopped if they were already running.

100% reproducible:

With enable_isolated_metadata=True:

1. Create a network, a subnet and a router.
2. Check that there's a proxy instance running in the DHCP namespace for this 
network:

neutron   89   1  0 17:01 ?00:00:00 haproxy -f
/var/lib/neutron/ns-metadata-
proxy/9d1c7905-a887-419a-a885-9b07c20c2012.conf

3. Attach the subnet to the router.
4. Verify that the proxy instance is still running.
5. Restart DHCP agent
6. Verify that the proxy instance went away (since the network is not isolated).
7. Remove the subnet from the router.
8. Verify that the proxy instance has not been spawned.

At this point, booting any VM on the network will fail since it won't be able 
to fetch metadata.
However, any update on the network/subnet will trigger the agent to refresh the 
status of the isolated metadata proxy:

For example: openstack network set <network> --name foo
would trigger the DHCP agent to spawn the proxy for that network.
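
As a rough sketch of the expected behaviour (this is not the actual DHCP
agent code and the helper names are made up for illustration), the agent
should re-evaluate the proxy whenever the isolation state of the network
changes, e.g. on router-interface add/remove, rather than only on agent
restart or unrelated network updates:

    def refresh_isolated_metadata_proxy(self, network):
        # Spawn the proxy when isolated metadata is enabled and the network
        # has no subnet attached to a router; tear it down otherwise.
        needed = (self.conf.enable_isolated_metadata
                  and self._is_network_isolated(network))
        if needed and not self._metadata_proxy_running(network):
            self.enable_isolated_metadata_proxy(network)
        elif not needed and self._metadata_proxy_running(network):
            self.disable_isolated_metadata_proxy(network)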

** Affects: neutron
 Importance: Undecided
     Assignee: Daniel Alvarez (dalvarezs)
 Status: In Progress

** Changed in: neutron
 Assignee: (unassigned) => Daniel Alvarez (dalvarezs)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1753540

Title:
  When isolated/force metadata is enabled, metadata proxy doesn't get
  automatically started/stopped when needed

Status in neutron:
  In Progress

Bug description:
  When the enable_isolated_metadata option is set to True in the DHCP
  agent configuration, metadata proxy instances won't get started
  dynamically when the network becomes isolated. Similarly, when a subnet
  is added to the router, they don't get stopped if they were already
  running.

  100% reproducible:

  With enable_isolated_metadata=True:

  1. Create a network, a subnet and a router.
  2. Check that there's a proxy instance running in the DHCP namespace for this 
network:

  neutron   89   1  0 17:01 ?00:00:00 haproxy -f
  /var/lib/neutron/ns-metadata-
  proxy/9d1c7905-a887-419a-a885-9b07c20c2012.conf

  3. Attach the subnet to the router.
  4. Verify that the proxy instance is still running.
  5. Restart DHCP agent
  6. Verify that the proxy instance went away (since the network is not 
isolated).
  7. Remove the subnet from the router.
  8. Verify that the proxy instance has not been spawned.

  At this point, booting any VM on the network will fail since it won't be able 
to fetch metadata.
  However, any update on the network/subnet will trigger the agent to refresh 
the status of the isolated metadata proxy:

  For example: openstack network set <network> --name foo
  would trigger the DHCP agent to spawn the proxy for that network.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1753540/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1748658] [NEW] Restarting Neutron containers which make use of network namespaces doesn't work

2018-02-10 Thread Daniel Alvarez
Public bug reported:

When DHCP, L3, Metadata or OVN-Metadata containers are restarted, they cannot
use the network namespaces that were created before the restart:


[heat-admin@overcloud-novacompute-0 neutron]$ sudo docker restart 8559f5a7fa45
8559f5a7fa45


[heat-admin@overcloud-novacompute-0 neutron]$ tail -f 
/var/log/containers/neutron/networking-ovn-metadata-agent.log 
2018-02-09 08:34:41.059 5 CRITICAL neutron [-] Unhandled error: 
ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: RTNETLINK 
answers: Invalid argument
2018-02-09 08:34:41.059 5 ERROR neutron Traceback (most recent call last):
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/bin/networking-ovn-metadata-agent", line 10, in 
2018-02-09 08:34:41.059 5 ERROR neutron sys.exit(main())
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/networking_ovn/cmd/eventlet/agents/metadata.py",
 line 17, in main
2018-02-09 08:34:41.059 5 ERROR neutron metadata_agent.main()
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata_agent.py", line 
38, in main
2018-02-09 08:34:41.059 5 ERROR neutron agt.start()
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 
147, in start
2018-02-09 08:34:41.059 5 ERROR neutron self.sync()
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 
56, in wrapped
2018-02-09 08:34:41.059 5 ERROR neutron return f(*args, **kwargs)
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 
169, in sync
2018-02-09 08:34:41.059 5 ERROR neutron metadata_namespaces = 
self.ensure_all_networks_provisioned()
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 
350, in ensure_all_networks_provisioned
2018-02-09 08:34:41.059 5 ERROR neutron netns = 
self.provision_datapath(datapath)
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/networking_ovn/agent/metadata/agent.py", line 
294, in provision_datapath
2018-02-09 08:34:41.059 5 ERROR neutron veth_name[0], veth_name[1], 
namespace)
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 182, in 
add_veth
2018-02-09 08:34:41.059 5 ERROR neutron self._as_root([], 'link', 
tuple(args))
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 94, in 
_as_root
2018-02-09 08:34:41.059 5 ERROR neutron namespace=namespace)
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/neutron/agent/linux/ip_lib.py", line 102, in 
_execute
2018-02-09 08:34:41.059 5 ERROR neutron 
log_fail_as_error=self.log_fail_as_error)
2018-02-09 08:34:41.059 5 ERROR neutron   File 
"/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 151, in 
execute
2018-02-09 08:34:41.059 5 ERROR neutron raise ProcessExecutionError(msg, 
returncode=returncode)
2018-02-09 08:34:41.059 5 ERROR neutron ProcessExecutionError: Exit code: 2; 
Stdin: ; Stdout: ; Stderr: RTNETLINK answers: Invalid argument
2018-02-09 08:34:41.059 5 ERROR neutron 
2018-02-09 08:34:41.059 5 ERROR neutron 
2018-02-09 08:34:41.177 21 INFO oslo_service.service [-] Parent process has 
died unexpectedly, exiting
2018-02-09 08:34:41.178 21 INFO eventlet.wsgi.server [-] (21) wsgi exited, 
is_accepting=True


An easy way to reproduce the bug:

[heat-admin@overcloud-novacompute-0 ~]$ sudo docker exec -u root -it
5c5f254a9321bd74b5911f46acb9513574c2cd9a3c59805a85cffd960bcc864d
/bin/bash

[root@overcloud-novacompute-0 /]# ip netns a my_netns
[root@overcloud-novacompute-0 /]# exit

[heat-admin@overcloud-novacompute-0 ~]$ sudo ip netns
[heat-admin@overcloud-novacompute-0 ~]$ sudo docker restart 
5c5f254a9321bd74b5911f46acb9513574c2cd9a3c59805a85cffd960bcc864d
5c5f254a9321bd74b5911f46acb9513574c2cd9a3c59805a85cffd960bcc864d

[heat-admin@overcloud-novacompute-0 ~]$ sudo docker exec -u root -it 
5c5f254a9321bd74b5911f46acb9513574c2cd9a3c59805a85cffd960bcc864d /bin/bash
[root@overcloud-novacompute-0 /]# ip netns
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
my_netns

[root@overcloud-novacompute-0 /]# ip netns e my_netns ip a
RTNETLINK answers: Invalid argument
setting the network namespace "my_netns" failed: Invalid argument

Deleting everything under /run/netns/* from kolla_start would work around this,
but it would involve a full sync of the agents, which is not desirable:

[root@overcloud-novacompute-0 /]# rm /run/netns/my_netns 
rm: remove regular empty file '/run/netns/my_netns'? y
[root@overcloud-novacompute-0 /]# ip netns
[root@overcloud-novacompute-0 /]# ip netns a my_netns
[root@overcloud-novacompute-0 /]#
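
For illustration only (this is not the actual kolla_start code, and as noted
above simply wiping /run/netns forces an undesirable full resync), a cleanup
along these lines would remove the entries that are no longer backed by a
mount after the restart:

    import os

    NETNS_DIR = '/run/netns'

    def _mounted_paths():
        # The nsfs bind mounts under /run/netns do not survive the container
        # restart; only the plain placeholder files are left behind.
        with open('/proc/mounts') as mounts:
            return set(line.split()[1] for line in mounts
                       if len(line.split()) > 1)

    def remove_stale_netns_entries():
        if not os.path.isdir(NETNS_DIR):
            return
        mounted = _mounted_paths()
        for name in os.listdir(NETNS_DIR):
            path = os.path.join(NETNS_DIR, name)
            if path not in mounted:
                os.unlink(path)  # leftover from before the restart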

** Affects: neutron
 Importance:

[Yahoo-eng-team] [Bug 1744359] [NEW] Neutron haproxy logs are not being collected

2018-01-19 Thread Daniel Alvarez
Public bug reported:

In Neutron, we use haproxy to proxy metadata requests from instances to the
Nova Metadata service.
By default, haproxy logs to /dev/log, but in Ubuntu those log messages get
redirected by rsyslog to /var/log/haproxy.log, which is not being collected.

ubuntu@devstack:~$ cat /etc/rsyslog.d/49-haproxy.conf 
# Create an additional socket in haproxy's chroot in order to allow logging via
# /dev/log to chroot'ed HAProxy processes
$AddUnixListenSocket /var/lib/haproxy/dev/log

# Send HAProxy messages to a dedicated logfile
if $programname startswith 'haproxy' then /var/log/haproxy.log
&~


Another possibility would be to change the haproxy.cfg file to include the
log-tag option so that haproxy uses a different tag [0]; the messages would
then be dumped into syslog instead, but this would break backwards
compatibility.

[0] https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#3.1-log-tag
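
For illustration, that second option would mean rendering something like the
following into the generated haproxy.cfg (the tag value here is made up):

    global
        log /dev/log local0
        log-tag neutron-metadata-proxy

Since the tag would no longer start with 'haproxy', the rsyslog rule above
would stop matching and the messages would land in syslog, which is exactly
the backwards-compatibility concern.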

** Affects: devstack
 Importance: Undecided
 Status: New

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: l3-ipam-dhcp

** Also affects: neutron
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1744359

Title:
  Neutron haproxy logs are not being collected

Status in devstack:
  New
Status in neutron:
  New

Bug description:
  In Neutron, we use haproxy to proxy metadata requests from instances to
  the Nova Metadata service.
  By default, haproxy logs to /dev/log, but in Ubuntu those log messages
  get redirected by rsyslog to /var/log/haproxy.log, which is not being
  collected.

  ubuntu@devstack:~$ cat /etc/rsyslog.d/49-haproxy.conf 
  # Create an additional socket in haproxy's chroot in order to allow logging 
via
  # /dev/log to chroot'ed HAProxy processes
  $AddUnixListenSocket /var/lib/haproxy/dev/log

  # Send HAProxy messages to a dedicated logfile
  if $programname startswith 'haproxy' then /var/log/haproxy.log
  &~

  
  Another possibility would be to change the haproxy.cfg file to include
  the log-tag option so that haproxy uses a different tag [0]; the messages
  would then be dumped into syslog instead, but this would break backwards
  compatibility.

  [0] https://cbonte.github.io/haproxy-dconv/configuration-1.5.html#3.1-log-tag

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack/+bug/1744359/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1739798] [NEW] update_network_postcommit is being called from delete_network_precommit with an open session

2017-12-22 Thread Daniel Alvarez
Public bug reported:

When a network is deleted, its segments are also deleted [0]. For each
segment, it will notify about resources.SEGMENT and events.AFTER_DELETE
[1], which ends up calling update_network_postcommit [2].

This should be avoided since drivers expect their postcommit methods to
be called with no open sessions to the database. There should be
separate callbacks for segments so that there are no transactions open
to the database in any of the postcommit calls.

We detected this in networking-ovn driver because we're attempting to
bump revision numbers in a separate table in Neutron database when a
network is updated but we can't commit that change to the database
because there's already an open session on a network delete operation.
This may be affecting other drivers as well.

[0] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/services/segments/db.py#L315
[1] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/services/segments/db.py#L178
[2] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/plugins/ml2/plugin.py#L1917

** Affects: neutron
 Importance: Undecided
 Status: New

** Description changed:

- When a network is delete, its segments are also deleted [0]. For each
+ When a network is deleted, its segments are also deleted [0]. For each
  segment, it will notify about resources.SEGMENT and events.AFTER_DELETE
  [1] which will turn out in calling update_network_postcommit [2].
  
  This should be avoided since drivers expect their postcommit methods to
  be called with no open sessions to the database. There should be
  separate callbacks for segments so that there's no transactions opened
  to the database in any of the postcommit calls.
  
  We detected this in networking-ovn driver because we're attempting to
  bump revision numbers in Neutron database when a network is updated but
  we can't commit that change to the database because there's already an
  open session. This may be affecting other drivers as well.
  
  [0] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/services/segments/db.py#L315
  [1] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/services/segments/db.py#L178
  [2] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/plugins/ml2/plugin.py#L1917

** Description changed:

  When a network is deleted, its segments are also deleted [0]. For each
  segment, it will notify about resources.SEGMENT and events.AFTER_DELETE
  [1] which will turn out in calling update_network_postcommit [2].
  
  This should be avoided since drivers expect their postcommit methods to
  be called with no open sessions to the database. There should be
  separate callbacks for segments so that there's no transactions opened
  to the database in any of the postcommit calls.
  
  We detected this in networking-ovn driver because we're attempting to
- bump revision numbers in Neutron database when a network is updated but
- we can't commit that change to the database because there's already an
- open session. This may be affecting other drivers as well.
+ bump revision numbers in a separate table in Neutron database when a
+ network is updated but we can't commit that change to the database
+ because there's already an open session on a network delete operation.
+ This may be affecting other drivers as well.
  
  [0] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/services/segments/db.py#L315
  [1] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/services/segments/db.py#L178
  [2] 
https://github.com/openstack/neutron/blob/6cdd079f8f3e6994734fa806b3c819cecb5f521a/neutron/plugins/ml2/plugin.py#L1917

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1739798

Title:
  update_network_postcommit is being called from
  delete_network_precommit with an open session

Status in neutron:
  New

Bug description:
  When a network is deleted, its segments are also deleted [0]. For each
  segment, it will notify about resources.SEGMENT and
  events.AFTER_DELETE [1], which ends up calling
  update_network_postcommit [2].

  This should be avoided since drivers expect their postcommit methods
  to be called with no open sessions to the database. There should be
  separate callbacks for segments so that there are no transactions open
  to the database in any of the postcommit calls.

  We detected this in networking-ovn driver because we're attempting to
  bump revision numbers in a separate table in Neutron database when a
  network is updated but we can't commit that change to the database
  because there's already an open session on a network 

[Yahoo-eng-team] [Bug 1738768] [NEW] Dataplane downtime when containers are stopped/restarted

2017-12-18 Thread Daniel Alvarez
Public bug reported:

I have deployed an HA environment with 3 controllers and 3 computes using
ML2/OVS and observed dataplane downtime when restarting/stopping the
neutron-l3 container on the controllers. This is what I did:

1. Created a network, subnet, router, a VM and attached a FIP to the VM
2. Left a ping running on the undercloud to the FIP
3. Stopped l3 container in controller-0.
   Result: Observed some packet loss while the router was failed over to 
controller-1
4. Stopped l3 container in controller-1
   Result: Observed some packet loss while the router was failed over to 
controller-2
5. Stopped l3 container in controller-2
   Result: No traffic to/from the FIP at all.

(overcloud) [stack@undercloud ~]$ ping 10.0.0.131
PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=1.83 ms
64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=1.56 ms

< Last l3 container was stopped here (step 5 above)>

>From 10.0.0.1 icmp_seq=10 Destination Host Unreachable
>From 10.0.0.1 icmp_seq=11 Destination Host Unreachable

When containers are stopped, I guess that the qrouter namespace is not
accessible by the kernel:

[heat-admin@overcloud-controller-2 ~]$ sudo ip netns e 
qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a
RTNETLINK answers: Invalid argument
RTNETLINK answers: Invalid argument
setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" 
failed: Invalid argument

This means that we're getting not only control-plane downtime but also
dataplane downtime, which could be seen as a regression compared to
non-containerized environments.
The same would happen with DHCP: I expect instances won't be able to fetch
IP addresses from dnsmasq when DHCP containers are stopped.

** Affects: neutron
 Importance: Undecided
 Status: New

** Description changed:

  I have deployed a 3 controllers - 3 computes HA environment with ML2/OVS
  and observed dataplane downtime when restarting/stopping neutron-l3
  container on controllers. This is what I did:
  
- 1. Created a network, subnet, router, a VM and attached a FIP to the VIM
+ 1. Created a network, subnet, router, a VM and attached a FIP to the VM
  2. Left a ping running on the undercloud to the FIP
  3. Stopped l3 container in controller-0.
-Result: Observed some packet loss while the router was failed over to 
controller-1
+    Result: Observed some packet loss while the router was failed over to 
controller-1
  4. Stopped l3 container in controller-1
-Result: Observed some packet loss while the router was failed over to 
controller-2
+    Result: Observed some packet loss while the router was failed over to 
controller-2
  5. Stopped l3 container in controller-2
-Result: No traffic to/from the FIP at all.
+    Result: No traffic to/from the FIP at all.
  
- 
- (overcloud) [stack@undercloud ~]$ ping 10.0.0.131 
 
+ (overcloud) [stack@undercloud ~]$ ping 10.0.0.131
  PING 10.0.0.131 (10.0.0.131) 56(84) bytes of data.
  64 bytes from 10.0.0.131: icmp_seq=1 ttl=63 time=1.83 ms
  64 bytes from 10.0.0.131: icmp_seq=2 ttl=63 time=1.56 ms
  
- < Last l3 container was stopped here (step 5) in the above description 
>
- 
+ < Last l3 container was stopped here (step 5 above)>
+ 
  From 10.0.0.1 icmp_seq=10 Destination Host Unreachable
  From 10.0.0.1 icmp_seq=11 Destination Host Unreachable
  
- 
- When containers are stopped, I guess that the qrouter namespace is not 
accessible by the kernel:
+ When containers are stopped, I guess that the qrouter namespace is not
+ accessible by the kernel:
  
  [heat-admin@overcloud-controller-2 ~]$ sudo ip netns e 
qrouter-5244e91c-f533-4128-9289-f37c9656792c ip a
  RTNETLINK answers: Invalid argument
  RTNETLINK answers: Invalid argument
  setting the network namespace "qrouter-5244e91c-f533-4128-9289-f37c9656792c" 
failed: Invalid argument
  
  This means that not only we're getting controlplane downtime but also 
dataplane which could be seen as a regression when compared to 
non-containerized environments.
  The same would happen with DHCP and I expect instances not being able to 
fetch IP addresses from dnsmasq when dhcp containers are stopped.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1738768

Title:
  Dataplane downtime when containers are stopped/restarted

Status in neutron:
  New

Bug description:
  I have deployed a 3 controllers - 3 computes HA environment with
  ML2/OVS and observed dataplane downtime when restarting/stopping
  neutron-l3 container on controllers. This is what I did:

  1. Created a network, subnet, router, a VM and attached a FIP to the VM
  2. Left a ping running on the undercloud to the FIP
  3. Stopped l3 container in controller-0.
     Result: Obs

[Yahoo-eng-team] [Bug 1735724] [NEW] Metadata iptables rules never inserted upon exception on router creation

2017-12-01 Thread Daniel Alvarez
Public bug reported:

We've been debugging some issues being seen lately [0] and found out
that there's a bug in l3 agent when creating routers (or during initial
sync). Jakub Libosvar and I spent some time recreating the issue and
this is what we got:

Especially since we bumped to ovsdbapp 0.8.0, we've seen some jobs
failing due to errors when authenticating using PK to a VM. The TCP
connection to the SSH port was successfully established but the
authentication failed. After debugging further, we found out that
metadata rules in qrouter namespace which redirect traffic to haproxy
(which replaced the old neutron-ns-metadata-proxy) were missing, so VMs
weren't fetching metadata (hence, no public key).

These rules are installed by the metadata driver after a router is created [1],
on the AFTER_CREATE notification. They will also get created during the initial
sync of the l3 agent (since the router is still unknown to the agent) [2].
Here, if we don't know the router yet we call _process_added_router(), and if
it's a known router we call _process_updated_router().
In our tests, we've seen that the iptables rules are never restored if we
simulate an exception inside ri.process() at [3], even though the router is
scheduled for resync [4]. The reason is that we've already added the router to
our router info [5], so even though ri.process() fails at L481 and the router
is scheduled for resync, the next time _process_updated_router() will get
called instead of _process_added_router(), thus never pushing the notification
to the metadata driver to install the iptables rules, so they never get
installed.
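
A simplified sketch of that flow (condensed here for clarity, not the actual
agent code) shows why the notification is lost after the first failure:

    def _process_router_if_compatible(self, router):
        if router['id'] not in self.router_info:
            # Only this path ends up emitting AFTER_CREATE, which is what
            # makes the metadata driver insert the iptables rules.
            self._process_added_router(router)
        else:
            # After a failed first attempt the router is already present in
            # router_info, so every resync takes this path and the metadata
            # rules are never installed.
            self._process_updated_router(router)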

In conclusion, if an error occurs during _process_added_router() we might end 
up losing
metadata forever until we restart the agent and this call succeeds. Worse, we 
will be
forwarding metadata requests via br-ex which could lead to security issues (ie. 
we could be injecting wrong metadata from the outside or the metadata server 
running in the underlying cloud may respond).

With ovsdbapp 0.9.0 we're minimizing this because, if a port fails to be added
to br-int, ovsdbapp will enqueue the transaction instead of throwing an
exception. However, there could still be other exceptions that reproduce this
scenario outside of ovsdbapp, so we need to fix it in Neutron.

Thanks
Daniel Alvarez

---

[0] https://bugs.launchpad.net/tripleo/+bug/1731063
[1] 
https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/metadata/driver.py#L288
[2] 
https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L472
[3] 
https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L481
[4] 
https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L565
[5] 
https://github.com/openstack/neutron/blob/02fa049c5f5a38a276bec6e55c68ac19cd08c59f/neutron/agent/l3/agent.py#L478

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1735724

Title:
  Metadata iptables rules never inserted upon exception on router
  creation

Status in neutron:
  New

Bug description:
  We've been debugging some issues being seen lately [0] and found out
  that there's a bug in l3 agent when creating routers (or during
  initial sync). Jakub Libosvar and I spent some time recreating the
  issue and this is what we got:

  Especially since we bumped to ovsdbapp 0.8.0, we've seen some jobs
  failing due to errors when authenticating using PK to a VM. The TCP
  connection to the SSH port was successfully established but the
  authentication failed. After debugging further, we found out that
  metadata rules in qrouter namespace which redirect traffic to haproxy
  (which replaced the old neutron-ns-metadata-proxy) were missing, so VMs
  weren't fetching metadata (hence, no public key).

  These rules are installed by metadata driver after a router is created [1] on 
the AFTER_CREATE notification. Also, they will get created during the initial 
sync of the l3 agent (since it's still unknown for the agent) [2]. Here, if we 
don't know the router yet, we'll call _process_added_router() and if it's a 
known router we'll call _process_updated_router().
  After our tests, we've seen that iptables rules are never restored if we 
simulate an
  Exception inside ri.process() at [3] even though the router is scheduled for 
resync [4]. The reason why this happens is because we've already added it to 
our router info [5] so even though
  ri.process() fails at L481 and it's scheduled for resync, next time 
_process_updated_router()
  will get called instead of _process_added_router() thus not pushing

[Yahoo-eng-team] [Bug 1731494] [NEW] neutron-openvswitch-agent crashes due to TypeError exception in ovs_ryuapp

2017-11-10 Thread Daniel Alvarez
Public bug reported:

At some point during a rally test, we saw this exception in the OVS agent
logs:

2017-11-07 13:35:51.428 597682 DEBUG 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-62f85bb3-db4c-4485-b35c-b7c1cafb3970 3d527bdd3ede4c6a97f91b701393b8e3 
5f753e92a5d740fc97252bd39f868561 - - -] port_delete message processed for port 
3e8348d0-40e1-4146-b803-1e6c6eddba53 port_delete 
/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:430
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
[req-141ecd16-22d7-4b1c-aa91-25d5077414f5 - - - - -] Agent main thread died of 
an exception: TypeError: int() can't convert non-string with explicit base
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
Traceback (most recent call last):
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_ryuapp.py",
 line 40, in agent_main_wrapper
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
ovs_agent.main(bridge_classes)
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 2205, in main
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
agent.daemon_loop()
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 2120, in daemon_loop
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
self.rpc_loop(polling_manager=pm)
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 1985, in rpc_loop
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
ovs_status = self.check_ovs_status()
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 153, in wrapper
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
return f(*args, **kwargs)
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py",
 line 1787, in check_ovs_status
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
status = self.int_br.check_canary_table()
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py",
 line 52, in check_canary_table
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
flows = self.dump_flows(constants.CANARY_TABLE)
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py",
 line 141, in dump_flows
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp 
(dp, ofp, ofpp) = self._get_dp()
2017-11-07 13:35:51.439 597682 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ovs_ryuapp   File 
"/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_brid

[Yahoo-eng-team] [Bug 1723472] Re: [DVR] Lowering the MTU breaks FIP traffic

2017-10-18 Thread Daniel Alvarez
We have seen that the MAC address of the FIP changes to the qf interface of a 
different controller.
However, the environment was running openstack-neutron-11.0.0-1.el7.noarch.

After upgrading to openstack-neutron-11.0.1-1.el7.noarch, this bug no longer 
occurs.
Marking it as invalid.

** Changed in: neutron
   Status: Confirmed => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1723472

Title:
  [DVR] Lowering the MTU breaks FIP traffic

Status in neutron:
  Invalid

Bug description:
  In a DVR environment, when lowering the MTU of a network, traffic
  going to an instance through a floating IP is broken.

  Description:

  * Ping traffic to a VM through its FIP works.
  * Change the MTU of its network through "neutron net-update  --mtu 
1440".
  * Ping to the same FIP doesn't work.

  After a long debugging session with Anil Venkata, we've found that
  packets reach br-ex and then they hit this OF rule with normal action:

   cookie=0x1f847e4bf0de0aea, duration=70306.532s, table=3,
  n_packets=1579251, n_bytes=796614220, idle_age=0, hard_age=65534,
  priority=1 actions=NORMAL

  
  We would expect this rule to switch the packet to br-int so that it can be 
forwarded to the fip namespace (ie. with dst MAC address set to the floating ip 
gw (owner=network:floatingip_agent_gateway):

  $ sudo ovs-vsctl list interface

  _uuid   : 1f2b6e86-d303-42f4-9467-5dab78fc7199
  admin_state : down
  bfd : {}
  bfd_status  : {}
  cfm_fault   : []
  cfm_fault_status: []
  cfm_flap_count  : []
  cfm_health  : []
  cfm_mpid: []
  cfm_remote_mpids: []
  cfm_remote_opstate  : []
  duplex  : []
  error   : []
  external_ids: {attached-mac="fa:16:3e:9d:0c:4f", 
iface-id="8ec34826-b1a6-48ce-9c39-2fd3e8167eb4", iface-status=active}
  name: "fg-8ec34826-b1"


  [heat-admin@overcloud-novacompute-0 ~]$ sudo ovs-appctl fdb/show br-ex


   port  VLAN  MACAge
   [...]
  710  fa:16:3e:9d:0c:4f0

  
  $ sudo ovs-ofctl show br-ex | grep "7("
   7(phy-br-ex): addr:36:63:93:fc:af:e2

  
  And from there, to the fip namespace which would route the packet to the 
qrouter namespace, etc.

  However, when we change the MTU through the following command:

  "neutron net-update  --mtu 1440"

  We see that, after a few seconds, the MAC address of the FIP changes
  so when traffic arrives br-ex and NORMAL action is performed, it will
  not be output to br-int through the patch-port but instead, through
  eth1 and traffic won't work anymore.

  [heat-admin@overcloud-novacompute-0 ~]$ arp -n | grep ".113"
  10.0.0.113   ether   fa:16:3e:9d:0c:4f   C 
vlan10

  neutron port-set x --mtu 1440

  $ arp -n | grep ".113"
  10.0.0.113   ether   fa:16:3e:20:f9:85   C 
vlan10

  
  When setting the MAC address manually, ping starts working again:

  $ arp -s 10.0.0.113 fa:16:3e:9d:0c:4f
  $ ping 10.0.0.113
  PING 10.0.0.113 (10.0.0.113) 56(84) bytes of data.
  64 bytes from 10.0.0.113: icmp_seq=1 ttl=62 time=1.17 ms
  64 bytes from 10.0.0.113: icmp_seq=2 ttl=62 time=0.561 ms

  
  Additional notes:

  When we set the MAC address manually and traffic gets working back
  again, lowering the MTU doesn't change the MAC address (we can't see
  any gARP's coming through).

  When we delete the ARP entry for the FIP and try to ping the FIP, the
  wrong MAC address is set.

  [heat-admin@overcloud-novacompute-0 ~]$ sudo arp -d 10.0.0.113

  [heat-admin@overcloud-novacompute-0 ~]$ ping 10.0.0.113 -c 2
  PING 10.0.0.113 (10.0.0.113) 56(84) bytes of data.

  --- 10.0.0.113 ping statistics ---
  2 packets transmitted, 0 received, 100% packet loss, time 999ms

  [heat-admin@overcloud-novacompute-0 ~]$ arp -n | grep ".113"
  10.0.0.113   ether   fa:16:3e:20:f9:85   C 
vlan10

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1723472/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1723472] [NEW] [DVR] Lowering the MTU breaks FIP traffic

2017-10-13 Thread Daniel Alvarez
Public bug reported:

In a DVR environment, when lowering the MTU of a network, traffic going
to an instance through a floating IP is broken.

Description:

* Ping traffic to a VM through its FIP works.
* Change the MTU of its network through "neutron net-update  --mtu 
1440".
* Ping to the same FIP doesn't work.

After a long debugging session with Anil Venkata, we've found that
packets reach br-ex and then they hit this OF rule with normal action:

 cookie=0x1f847e4bf0de0aea, duration=70306.532s, table=3,
n_packets=1579251, n_bytes=796614220, idle_age=0, hard_age=65534,
priority=1 actions=NORMAL


We would expect this rule to switch the packet to br-int so that it can be 
forwarded to the fip namespace (ie. with dst MAC address set to the floating ip 
gw (owner=network:floatingip_agent_gateway):

$ sudo ovs-vsctl list interface

_uuid   : 1f2b6e86-d303-42f4-9467-5dab78fc7199
admin_state : down
bfd : {}
bfd_status  : {}
cfm_fault   : []
cfm_fault_status: []
cfm_flap_count  : []
cfm_health  : []
cfm_mpid: []
cfm_remote_mpids: []
cfm_remote_opstate  : []
duplex  : []
error   : []
external_ids: {attached-mac="fa:16:3e:9d:0c:4f", 
iface-id="8ec34826-b1a6-48ce-9c39-2fd3e8167eb4", iface-status=active}
name: "fg-8ec34826-b1"


[heat-admin@overcloud-novacompute-0 ~]$ sudo ovs-appctl fdb/show br-ex  

  
 port  VLAN  MACAge
 [...]
710  fa:16:3e:9d:0c:4f0


$ sudo ovs-ofctl show br-ex | grep "7("
 7(phy-br-ex): addr:36:63:93:fc:af:e2


And from there, to the fip namespace which would route the packet to the 
qrouter namespace, etc.

However, when we change the MTU through the following command:

"neutron net-update  --mtu 1440"

We see that, after a few seconds, the MAC address of the FIP changes, so
when traffic arrives at br-ex and the NORMAL action is performed, it will
not be output to br-int through the patch port but through eth1 instead,
and traffic won't work anymore.

[heat-admin@overcloud-novacompute-0 ~]$ arp -n | grep ".113"
10.0.0.113   ether   fa:16:3e:9d:0c:4f   C 
vlan10

neutron port-set x --mtu 1440

$ arp -n | grep ".113"
10.0.0.113   ether   fa:16:3e:20:f9:85   C 
vlan10


When setting the MAC address manually, ping starts working again:

$ arp -s 10.0.0.113 fa:16:3e:9d:0c:4f
$ ping 10.0.0.113
PING 10.0.0.113 (10.0.0.113) 56(84) bytes of data.
64 bytes from 10.0.0.113: icmp_seq=1 ttl=62 time=1.17 ms
64 bytes from 10.0.0.113: icmp_seq=2 ttl=62 time=0.561 ms


Additional notes:

When we set the MAC address manually and traffic starts working again,
lowering the MTU doesn't change the MAC address (we can't see any
gARPs coming through).

When we delete the ARP entry for the FIP and try to ping the FIP, the
wrong MAC address is set.

[heat-admin@overcloud-novacompute-0 ~]$ sudo arp -d 10.0.0.113

[heat-admin@overcloud-novacompute-0 ~]$ ping 10.0.0.113 -c 2
PING 10.0.0.113 (10.0.0.113) 56(84) bytes of data.

--- 10.0.0.113 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

[heat-admin@overcloud-novacompute-0 ~]$ arp -n | grep ".113"
10.0.0.113   ether   fa:16:3e:20:f9:85   C 
vlan10

** Affects: neutron
 Importance: Undecided
 Assignee: Daniel Alvarez (dalvarezs)
 Status: New


** Tags: l3-dvr-backlog

** Changed in: neutron
 Assignee: (unassigned) => Daniel Alvarez (dalvarezs)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1723472

Title:
  [DVR] Lowering the MTU breaks FIP traffic

Status in neutron:
  New

Bug description:
  In a DVR environment, when lowering the MTU of a network, traffic
  going to an instance through a floating IP is broken.

  Description:

  * Ping traffic to a VM through its FIP works.
  * Change the MTU of its network through "neutron net-update  --mtu 
1440".
  * Ping to the same FIP doesn't work.

  After a long debugging session with Anil Venkata, we've found that
  packets reach br-ex and then they hit this OF rule with normal action:

   cookie=0x1f847e4bf0de0aea, duration=70306.532s, table=3,
  n_packets=1579251, n_bytes=796614220, idle_age=0, hard_age=65534,
  priority=1 actions=NORMAL

  
  We would expect this rule to switch the packet to br-int so that it can be 
forwarded to the fip namespace (ie. with dst MAC address set to the floating ip 
gw (owner=network:floatingip_agent_gateway):

  $ sudo ovs-vsctl list interface

  _uuid   : 1f2b6e86-d303-42f4-9467-5dab78fc7199
  admin_st

[Yahoo-eng-team] [Bug 1695191] [NEW] pyroute2 wrong version constraint

2017-06-02 Thread Daniel Alvarez
Public bug reported:

With this recent change [0] we're now importing the asyncio module from
pyroute2, and neutron-server fails to start if pyroute2<0.4.15:

File "/opt/stack/neutron/neutron/common/eventlet_utils.py", line 25, in 
monkey_patch
p_c_e = importutils.import_module('pyroute2.config.asyncio')
ImportError: No module named asyncio

I'm using pyroute2==0.4.13, which is OK according to global-requirements.txt,
but this version doesn't include the asyncio module. Version 0.4.14 includes
it, but since that release is forbidden right now in our requirements, we
should bump the minimum version to 0.4.15.

[0] https://review.openstack.org/#/c/469650/
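
The requested change boils down to a one-line bump of the minimum version in
the requirements files, e.g. (illustrative):

    pyroute2>=0.4.15  # pyroute2.config.asyncio exists from 0.4.14 on, but
                      # that release is currently excluded, hence 0.4.15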

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1695191

Title:
  pyroute2 wrong version constraint

Status in neutron:
  New

Bug description:
  With this recent change [0] we're now importing the asyncio module from
  pyroute2, and neutron-server fails to start if pyroute2<0.4.15:

  File "/opt/stack/neutron/neutron/common/eventlet_utils.py", line 25, in 
monkey_patch
  p_c_e = importutils.import_module('pyroute2.config.asyncio')
  ImportError: No module named asyncio

  I'm using pyroute2==0.4.13, which is OK according to
  global-requirements.txt, but this version doesn't include the asyncio
  module. Version 0.4.14 includes it, but since that release is forbidden
  right now in our requirements, we should bump the minimum version to
  0.4.15.

  [0] https://review.openstack.org/#/c/469650/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1695191/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1691969] [NEW] Functional tests failing due to uid 65534 not present

2017-05-19 Thread Daniel Alvarez
Public bug reported:

We're relying on uid 65534 existing on the system to run functional tests [0];
if it doesn't exist, the metadata proxy will fail to spawn [1] and so will the
tests.

From what I've seen on CentOS 7, a user with uid 65534 exists when
deploying devstack because the libvirt package is installed and nfs-utils is
a dependency. nfs-utils creates the nfsnobody user with this uid [2]
and the functional tests pass.

We shouldn't rely on this uid to be present on the system. I'll try to
come up with something to fix the tests but feedback is very welcome :)

Daniel

[0] 
https://github.com/openstack/neutron/blob/master/neutron/tests/functional/agent/l3/test_metadata_proxy.py#L188
[1] 
https://github.com/openstack/neutron/blob/03c5283c69f1f5cba8a9f29e7bd7fd306ee0c123/neutron/agent/metadata/driver.py#L100
[2] http://paste.openstack.org/show/609989/
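
One possible direction, shown only as a hypothetical sketch (the eventual fix
may look different), is to look up an unprivileged user that actually exists
on the host instead of hardcoding 65534:

    import pwd

    def pick_unprivileged_uid():
        # Prefer the usual names; fall back to uid 65534 only if a user with
        # that uid really exists on this system.
        for name in ('nobody', 'nfsnobody'):
            try:
                return pwd.getpwnam(name).pw_uid
            except KeyError:
                continue
        try:
            return pwd.getpwuid(65534).pw_uid
        except KeyError:
            return None  # let the test skip instead of failing to spawn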

** Affects: neutron
 Importance: Undecided
 Assignee: Daniel Alvarez (dalvarezs)
 Status: New


** Tags: functional-tests

** Changed in: neutron
     Assignee: (unassigned) => Daniel Alvarez (dalvarezs)

** Tags added: functional-tests

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1691969

Title:
  Functional tests failing due to uid 65534 not present

Status in neutron:
  New

Bug description:
  We're relying on uid 65534 existing on the system to run functional
  tests [0]; if it doesn't exist, the metadata proxy will fail to spawn [1]
  and so will the tests.

  From what I've seen in centos7, user with uid 65534 exists when
  deploying devstack because libvirt package is installed and nfs-utils
  is a dependency. nfs-utils will create nfsnobody user under this uid
  [2] and the functional tests pass.

  We shouldn't rely on this uid to be present on the system. I'll try to
  come up with something to fix the tests but feedback is very welcome
  :)

  Daniel

  [0] 
https://github.com/openstack/neutron/blob/master/neutron/tests/functional/agent/l3/test_metadata_proxy.py#L188
  [1] 
https://github.com/openstack/neutron/blob/03c5283c69f1f5cba8a9f29e7bd7fd306ee0c123/neutron/agent/metadata/driver.py#L100
  [2] http://paste.openstack.org/show/609989/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1691969/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1677279] [NEW] Don't depend on l3-agent running for IPv6 failover

2017-03-29 Thread Daniel Alvarez
Public bug reported:

Right now, we're enabling IPv6 RA on the gateway interface for master
instances [0].

This happens only when the l3-agent is running so we depend on it for
the correct configuration of the HA routers.

If the l3-agent is shut down for maintenance, then we won't get RA
enabled on the master instance even though keepalived and
keepalived-state-change are both running.

We should get rid of this dependency by moving this code into
keepalived-state-change, which we can assume will always be
running along with keepalived.

[0]
https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha.py#L124

** Affects: neutron
 Importance: Undecided
 Assignee: Daniel Alvarez (dalvarezs)
 Status: New


** Tags: l3-ha

** Tags added: l3-ha

** Changed in: neutron
 Assignee: (unassigned) => Daniel Alvarez (dalvarezs)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1677279

Title:
  Don't depend on l3-agent running for IPv6 failover

Status in neutron:
  New

Bug description:
  Right now, we're enabling IPv6 RA on the gateway interface for master
  instances [0].

  This happens only when the l3-agent is running so we depend on it for
  the correct configuration of the HA routers.

  If the l3-agent is shut down for maintenance, then we won't get RA
  enabled on the master instance even though keepalived and
  keepalived-state-change are both running.

  We should get rid of this dependency by moving this code into
  keepalived-state-change, which we can assume will always be
  running along with keepalived.

  [0]
  https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha.py#L124

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1677279/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1669805] [NEW] rally job failing in gate due to "Quota for tenant X could not be found" Error

2017-03-03 Thread Daniel Alvarez
Public bug reported:

Rally job is failing in the gate due to the following error during
cleanup [0]:

2017-03-03 13:14:56.897549 | 2017-03-03 13:14:56.897 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager 
rutils.retry(resource._max_attempts, resource.delete)
2017-03-03 13:14:56.899015 | 2017-03-03 13:14:56.898 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/opt/stack/new/rally/rally/common/utils.py", line 223, in retry
2017-03-03 13:14:56.900375 | 2017-03-03 13:14:56.900 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager return func(*args, 
**kwargs)
2017-03-03 13:14:56.901935 | 2017-03-03 13:14:56.901 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/opt/stack/new/rally/rally/plugins/openstack/cleanup/resources.py", line 472, 
in delete
2017-03-03 13:14:56.903254 | 2017-03-03 13:14:56.902 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager 
self._manager().delete_quota(self.tenant_uuid)
2017-03-03 13:14:56.904746 | 2017-03-03 13:14:56.904 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/usr/local/lib/python2.7/dist-packages/debtcollector/renames.py", line 43, in 
decorator
2017-03-03 13:14:56.906156 | 2017-03-03 13:14:56.905 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager return wrapped(*args, 
**kwargs)
2017-03-03 13:14:56.907779 | 2017-03-03 13:14:56.907 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 
742, in delete_quota
2017-03-03 13:14:56.909166 | 2017-03-03 13:14:56.908 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager return 
self.delete(self.quota_path % (project_id))
2017-03-03 13:14:56.910455 | 2017-03-03 13:14:56.910 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 
357, in delete
2017-03-03 13:14:56.911722 | 2017-03-03 13:14:56.911 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager headers=headers, 
params=params)
2017-03-03 13:14:56.913068 | 2017-03-03 13:14:56.912 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 
338, in retry_request
2017-03-03 13:14:56.914760 | 2017-03-03 13:14:56.914 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager headers=headers, 
params=params)
2017-03-03 13:14:56.916094 | 2017-03-03 13:14:56.915 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 
301, in do_request
2017-03-03 13:14:56.917732 | 2017-03-03 13:14:56.917 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager 
self._handle_fault_response(status_code, replybody, resp)
2017-03-03 13:14:56.919266 | 2017-03-03 13:14:56.918 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 
276, in _handle_fault_response
2017-03-03 13:14:56.920699 | 2017-03-03 13:14:56.920 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager 
exception_handler_v20(status_code, error_body)
2017-03-03 13:14:56.922342 | 2017-03-03 13:14:56.922 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager   File 
"/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", line 92, 
in exception_handler_v20
2017-03-03 13:14:56.923773 | 2017-03-03 13:14:56.923 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager request_ids=request_ids)
2017-03-03 13:14:56.925720 | 2017-03-03 13:14:56.925 | 2017-03-03 13:14:56.886 
6099 ERROR rally.plugins.openstack.cleanup.manager NotFound: Quota for tenant 
82ee0ba1b6534f958d1acd2f717b5c3d could not be found.

Seems like we've been hitting these errors since 1 AM today (03/03/2017) [1]

[0] 
http://logs.openstack.org/91/431691/30/check/gate-rally-dsvm-neutron-neutron-ubuntu-xenial/ab3471c/console.html#_2017-03-03_13_14_56_925720
[1] 
http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%20%5C%22gate-rally-dsvm-neutron-neutron-ubuntu-xenial%5C%22%20AND%20message%3A%20%5C%22Quota%20for%20tenant%5C%22

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1669805

Title:
  rally job failing in gate due to "Quota for tenant X could not be
  found" Error

Status in neutron:
  New

Bug description:
  Rally job is 

[Yahoo-eng-team] [Bug 1669765] [NEW] RA is not disabled on backup HA routers

2017-03-03 Thread Daniel Alvarez
Public bug reported:

When an HA router is created, RA is enabled on the gateway interface for the 
'master' router [0].
However, it is not disabled in the 'else' clause and therefore:

1. If the router was set to 'master' before, it will still have RA enabled
on its gateway interface.
2. If the default value for accept_ra in
'/proc/sys/net/ipv6/conf/default/accept_ra' is > 0, then it will still have
RA enabled on its gateway interface.

Having RA enabled on a backup router leads to the following unwanted
situation:

- It may respond to RA packets coming from an external switch and,
because it has the same MAC address as the master instance, the switch
will learn its MAC address and may send the traffic to it until the
master sends some packets. Therefore, any existing connections will be
interrupted.

The fix would consist of disabling RA on the gateway interface whenever
the conditions to enable it are not met.

[0]
https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha.py#L136
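
A rough sketch of that behaviour (illustrative only; these helper names
are made up and this is not the actual ha.py code):

import subprocess

def _set_accept_ra(namespace, interface, enabled):
    # accept_ra=2 accepts RAs even with forwarding enabled (the router
    # case); accept_ra=0 disables RA processing on the interface.
    value = '2' if enabled else '0'
    subprocess.check_call(
        ['ip', 'netns', 'exec', namespace, 'sysctl', '-w',
         'net.ipv6.conf.%s.accept_ra=%s' % (interface, value)])

def configure_gateway_ra(namespace, gw_interface, state, ra_conditions_met):
    # Write the sysctl explicitly in *both* branches, so a backup router
    # never keeps a value inherited from a previous 'master' state or
    # from /proc/sys/net/ipv6/conf/default/accept_ra.
    _set_accept_ra(namespace, gw_interface,
                   state == 'master' and ra_conditions_met)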

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1669765

Title:
  RA is not disabled on backup HA routers

Status in neutron:
  New

Bug description:
  When an HA router is created, RA is enabled on the gateway interface for the 
'master' router [0].
  However, it is not disabled in the 'else' clause and therefore:

  1. If the router was set to 'master' before, it will still have RA
  enabled on its gateway interface.
  2. If the default value for accept_ra in
  '/proc/sys/net/ipv6/conf/default/accept_ra' is > 0, then it will still
  have RA enabled on its gateway interface.

  Having RA enabled on a backup router leads to the following unwanted
  situation:

  - It may respond to RA packets coming from an external switch and,
  because it has the same MAC address as the master instance, the switch
  will learn its MAC address and may send the traffic to it until the
  master sends some packets. Therefore, any existing connections will be
  interrupted.

  The fix would consist of disabling RA on the gateway interface
  whenever the conditions to enable it are not met.

  [0]
  https://github.com/openstack/neutron/blob/master/neutron/agent/l3/ha.py#L136

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1669765/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1654287] Re: functional test netns_cleanup failing in gate

2017-01-18 Thread Daniel Alvarez
** Also affects: oslo.rootwrap
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1654287

Title:
  functional test netns_cleanup failing in gate

Status in neutron:
  In Progress
Status in oslo.rootwrap:
  New

Bug description:
  
  The functional test for netns_cleanup has failed in the gate today [0].

  Apparently, when trying to get the list of devices
  (ip_lib.get_devices(), i.e. 'find /sys/class/net -maxdepth 1 -type l
  -printf %f') through the rootwrap daemon, it gets the output of the
  previous command ('netstat -nlp') instead. This causes the
  netns_cleanup module to try to unplug random devices that actually
  correspond to the output of the 'netstat' command.

  This bug doesn't look related to the test itself but to the rootwrap
  daemon. Maybe it is caused by the long output of the netstat command?

  
  Relevant part of the log

  2017-01-05 12:17:04.609 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'netstat', '-nlp'] 
execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.613 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute 
neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.614 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'find', '/sys/class/net', 
'-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon 
neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.645 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] 
[POLLIN] on fd 14 __log_wakeup 
/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.686 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute 
neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.688 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 
'Active'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.746 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute 
neutron/agent/linux/utils.py:149
  2017-01-05 12:17:04.747 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 
'Internet'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  2017-01-05 12:17:04.758 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] 
[POLLIN] on fd 14 __log_wakeup 
/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.815 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] 
[POLLIN] on fd 14 __log_wakeup 
/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] 
[POLLIN] on fd 7 __log_wakeup 
/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
  2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running 
txn command(idx=0): InterfaceToBridgeCommand(name=Internet) do_commit 
neutron/agent/ovsdb/impl_idl.py:100
  2017-01-05 12:17:04.823 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] 
Transaction aborted do_commit neutron/agent/ovsdb/impl_idl.py:124
  2017-01-05 12:17:04.824 27615 DEBUG neutron.cmd.netns_cleanup 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for 
device: Internet unplug_device neutron/cmd/netns_cleanup.py:138
  2017-01-05 12:17:04.824 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 
'connections'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
  
  2017-01-05 12:17:06.388 27615 DEBUG neutron.cmd.netns_cleanup 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for 
device: Path unplug_device neutron/cmd/netns_cleanup.py:138
  2017-01-05 12:17:06.389 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', '-o', 'netns', 'list'] execute_rootwrap_daemon 
neutron/agent/linux/utils.py:108
  2017-01-05 12:17:06.454 27615 ERROR neutron.agent.linux.ut

[Yahoo-eng-team] [Bug 1654287] [NEW] functional test netns_cleanup failing in gate

2017-01-05 Thread Daniel Alvarez
Public bug reported:


The functional test for netns_cleanup has failed in the gate today [0].

Apparently, when trying to get the list of devices (ip_lib.get_devices(),
i.e. 'find /sys/class/net -maxdepth 1 -type l -printf %f') through the
rootwrap daemon, it gets the output of the previous command
('netstat -nlp') instead. This causes the netns_cleanup module to try to
unplug random devices that actually correspond to the output of the
'netstat' command.

This bug doesn't look related to the test itself but to the rootwrap
daemon. Maybe it is caused by the long output of the netstat command?
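
Purely to illustrate the hypothesis (this is not oslo.rootwrap code), a
client that gives up on a reply without draining it from the channel
will read that stale reply as the answer to its next command:

import multiprocessing

def daemon(conn):
    # Toy daemon: answers each command with a tagged string.
    while True:
        cmd = conn.recv()
        if cmd is None:
            break
        conn.send('output of %r' % (cmd,))

if __name__ == '__main__':
    parent, child = multiprocessing.Pipe()
    proc = multiprocessing.Process(target=daemon, args=(child,))
    proc.start()

    parent.send(['netstat', '-nlp'])
    # Simulate a client that times out on this reply and never reads it
    # from the pipe.
    parent.send(['find', '/sys/class/net', '-maxdepth', '1'])
    print(parent.recv())  # prints the reply for the *netstat* command

    parent.send(None)
    proc.join()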


Relevant part of the log

2017-01-05 12:17:04.609 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'netstat', '-nlp'] 
execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-05 12:17:04.613 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute 
neutron/agent/linux/utils.py:149
2017-01-05 12:17:04.614 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'find', '/sys/class/net', 
'-maxdepth', '1', '-type', 'l', '-printf', '%f '] execute_rootwrap_daemon 
neutron/agent/linux/utils.py:108
2017-01-05 12:17:04.645 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] 
[POLLIN] on fd 14 __log_wakeup 
/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
2017-01-05 12:17:04.686 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute 
neutron/agent/linux/utils.py:149
2017-01-05 12:17:04.688 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 
'Active'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-05 12:17:04.746 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 0 execute 
neutron/agent/linux/utils.py:149
2017-01-05 12:17:04.747 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 
'Internet'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108
2017-01-05 12:17:04.758 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] 
[POLLIN] on fd 14 __log_wakeup 
/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
2017-01-05 12:17:04.815 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] 
[POLLIN] on fd 14 __log_wakeup 
/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.native.vlog [-] 
[POLLIN] on fd 7 __log_wakeup 
/opt/stack/new/neutron/.tox/dsvm-functional/local/lib/python2.7/site-packages/ovs/poller.py:202
2017-01-05 12:17:04.822 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] Running 
txn command(idx=0): InterfaceToBridgeCommand(name=Internet) do_commit 
neutron/agent/ovsdb/impl_idl.py:100
2017-01-05 12:17:04.823 27615 DEBUG neutron.agent.ovsdb.impl_idl [-] 
Transaction aborted do_commit neutron/agent/ovsdb/impl_idl.py:124
2017-01-05 12:17:04.824 27615 DEBUG neutron.cmd.netns_cleanup 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for 
device: Internet unplug_device neutron/cmd/netns_cleanup.py:138
2017-01-05 12:17:04.824 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', 'netns', 'exec', 
'qrouter-cf2030c6-c924-45bb-b13b-6774d275b394', 'ip', 'link', 'delete', 
'connections'] execute_rootwrap_daemon neutron/agent/linux/utils.py:108

2017-01-05 12:17:06.388 27615 DEBUG neutron.cmd.netns_cleanup 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Unable to find bridge for 
device: Path unplug_device neutron/cmd/netns_cleanup.py:138
2017-01-05 12:17:06.389 27615 DEBUG neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Running command (rootwrap 
daemon): ['ip', '-o', 'netns', 'list'] execute_rootwrap_daemon 
neutron/agent/linux/utils.py:108
2017-01-05 12:17:06.454 27615 ERROR neutron.agent.linux.utils 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Exit code: 1; Stdin: ; 
Stdout: ; Stderr: Cannot find device "Path"

2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup 
[req-68eceb29-052a-4c8c-8152-38bbe636cba5 - - - - -] Error unable to destroy 
namespace: qrouter-cf2030c6-c924-45bb-b13b-6774d275b394
2017-01-05 12:17:06.454 27615 ERROR neutron.cmd.netns_cleanup Traceback (most 
recent call last

[Yahoo-eng-team] [Bug 1652124] [NEW] netns-cleanup functional test fails on some conditions

2016-12-22 Thread Daniel Alvarez
Public bug reported:

We've seen this functional test failing in the gate [0] and it's due to
a bug in the helper module that was written for the functional test. [1]

The problem shows up when process_spawn is not able to find a port to
listen on and the process stays running anyway. That means that
netns-cleanup won't clean it up and this condition [2] doesn't hold
(1 != 0).

From the gate logs I can tell that it's only a bug in the functional
test, not in the module itself. I will submit a patch for it right now.


[0]  
http://logs.openstack.org/45/358845/24/check/gate-neutron-dsvm-functional-ubuntu-xenial/b018ed7/testr_results.html.gz

[1]
https://github.com/openstack/neutron/blob/master/neutron/tests/functional/cmd/process_spawn.py#L107

[2]
https://github.com/openstack/neutron/blob/master/neutron/tests/functional/cmd/test_netns_cleanup.py#L84
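
A sketch of one possible fix for the helper: give up instead of staying
alive when no port can be bound (illustrative only; the function below
is not the real process_spawn.py API):

import socket
import sys

def listen_on_first_free_port(host, ports):
    # Try each candidate port and return a listening socket, or None
    # if every bind() fails.
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(1)
            return sock, port
        except socket.error:
            sock.close()
    return None, None

if __name__ == '__main__':
    sock, port = listen_on_first_free_port('127.0.0.1', range(32768, 32778))
    if sock is None:
        # Today the helper keeps running in this case, so netns-cleanup
        # has nothing left to kill and the test's "no processes left"
        # check fails.
        sys.exit('no free port available, exiting instead of lingering')
    print('listening on port %d' % port)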

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: functional-tests gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1652124

Title:
  netns-cleanup functional test fails on some conditions

Status in neutron:
  New

Bug description:
  We've seen this functional test failing in the gate [0] and it's due
  to a bug in the helper module that was written for the functional
  test. [1]

  The problem shows up when process_spawn is not able to find a port to
  listen on and the process stays running anyway. That means that
  netns-cleanup won't clean it up and this condition [2] doesn't hold
  (1 != 0).

  From the gate logs I can tell that it's only a bug in the functional
  test, not in the module itself. I will submit a patch for it right now.

  
  [0]  
http://logs.openstack.org/45/358845/24/check/gate-neutron-dsvm-functional-ubuntu-xenial/b018ed7/testr_results.html.gz

  [1]
  
https://github.com/openstack/neutron/blob/master/neutron/tests/functional/cmd/process_spawn.py#L107

  [2]
  
https://github.com/openstack/neutron/blob/master/neutron/tests/functional/cmd/test_netns_cleanup.py#L84

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1652124/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1650611] [NEW] dhcp agent reporting state as down during the initial sync

2016-12-16 Thread Daniel Alvarez
Public bug reported:

When the dhcp agent is started, 'neutron agent-list' reports its state as
dead until the initial sync is complete.

This can lead to unwanted alarms in monitoring systems, especially in
large environments where the initial sync may take hours. During this
time, systemctl shows that the agent is actually alive while neutron
agent-list reports it as down.

Technical details:

If I'm right, this line [0] is the exact point where the initial sync
takes place right after the first state report (with start_flag=True) is
sent to the server. As it's being done in the same thread, it won't send
a second state report until it's done with the sync.

Doing it in a separate thread would let the heartbeat task continue
sending state reports to the server, but I don't know whether this has
any unwanted side effects.


[0] 
https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L751
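
A minimal sketch of that idea (assuming plain threads for illustration;
the real agent uses eventlet, and these function names are made up):

import threading

def heartbeat(stop, report_state):
    # Keep sending state reports so the server never marks the agent
    # as dead while the initial sync is still running.
    while not stop.is_set():
        report_state()
        stop.wait(30)

def start_agent(report_state, sync_all_networks):
    stop = threading.Event()
    hb = threading.Thread(target=heartbeat, args=(stop, report_state))
    hb.daemon = True
    hb.start()
    # The potentially hours-long initial sync runs in its own thread and
    # no longer blocks the heartbeat above.
    sync = threading.Thread(target=sync_all_networks)
    sync.daemon = True
    sync.start()
    return stop, sync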

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: l3-bgp

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1650611

Title:
  dhcp agent reporting state as down during the initial sync

Status in neutron:
  New

Bug description:
  When the dhcp agent is started, 'neutron agent-list' reports its state
  as dead until the initial sync is complete.

  This can lead to unwanted alarms in monitoring systems, especially in
  large environments where the initial sync may take hours. During this
  time, systemctl shows that the agent is actually alive while neutron
  agent-list reports it as down.

  Technical details:

  If I'm right, this line [0] is the exact point where the initial sync
  takes place right after the first state report (with start_flag=True)
  is sent to the server. As it's being done in the same thread, it won't
  send a second state report until it's done with the sync.

  Doing it in a separate thread would let the heartbeat task continue
  sending state reports to the server, but I don't know whether this has
  any unwanted side effects.

  
  [0] 
https://github.com/openstack/neutron/blob/master/neutron/agent/dhcp/agent.py#L751

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1650611/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1647431] [NEW] grenade job times out on Xenial

2016-12-05 Thread Daniel Alvarez
Public bug reported:

gate-grenade-dsvm-neutron-multinode-ubuntu-xenial job is failing on
neutron gate

I have checked some other patches and it looks like the job doesn't fail
on them, so apparently it's not deterministic.


From the logs:

[1]
2016-12-05 09:07:46.832799 | ERROR: the main setup script run by this job 
failed - exit code: 124

[2]
2016-12-05 09:07:10.778 | + 
/opt/stack/new/grenade/projects/70_cinder/resources.sh:destroy:207 :   timeout 
30 sh -c 'while openstack server show cinder_server1 >/dev/null; do sleep 1; 
done'
2016-12-05 09:07:40.781 | + 
/opt/stack/new/grenade/projects/70_cinder/resources.sh:destroy:1 :   exit_trap
2016-12-05 09:07:40.782 | + /opt/stack/new/grenade/functions:exit_trap:103 :   
local r=124


[1] 
http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/console.html
[2] 
http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/logs/grenade.sh.txt.gz

** Affects: neutron
 Importance: Critical
 Status: Confirmed


** Tags: gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1647431

Title:
  grenade job times out on Xenial

Status in neutron:
  Confirmed

Bug description:
  gate-grenade-dsvm-neutron-multinode-ubuntu-xenial job is failing on
  neutron gate

  I have checked some other patches and it looks like the job doesn't
  fail on them, so apparently it's not deterministic.

  
  From the logs: 

  [1]
  2016-12-05 09:07:46.832799 | ERROR: the main setup script run by this job 
failed - exit code: 124

  [2]
  2016-12-05 09:07:10.778 | + 
/opt/stack/new/grenade/projects/70_cinder/resources.sh:destroy:207 :   timeout 
30 sh -c 'while openstack server show cinder_server1 >/dev/null; do sleep 1; 
done'
  2016-12-05 09:07:40.781 | + 
/opt/stack/new/grenade/projects/70_cinder/resources.sh:destroy:1 :   exit_trap
  2016-12-05 09:07:40.782 | + /opt/stack/new/grenade/functions:exit_trap:103 :  
 local r=124

  
  [1] 
http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/console.html
  [2] 
http://logs.openstack.org/40/402140/7/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/ad0cf41/logs/grenade.sh.txt.gz

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1647431/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp