[Yahoo-eng-team] [Bug 1821284] [NEW] "group type spec" and "group spec" are used inconsistently

2019-03-22 Thread Akihiro Motoki
Public bug reported:

In the group type spec page, "group type spec" and "group spec" are used
inconsistently.
It would be better to pick one of the two and use it consistently ("group spec"?).

This is targeted to Train. (Stein is now in the hard string freeze.)

** Affects: horizon
 Importance: Low
 Status: Triaged


** Tags: cinder

** Changed in: horizon
   Importance: Undecided => Low

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1821284

Title:
  "group type spec" and "group spec" are used inconsistently

Status in OpenStack Dashboard (Horizon):
  Triaged

Bug description:
  In the group type spec page, "group type spec" and "group spec" are used
inconsistently.
  It would be better to pick one of the two and use it consistently ("group
spec"?).

  This is targeted to Train. (Stein is now in the hard string freeze.)

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1821284/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821289] [NEW] stein: Sync policy.json files with service projects

2019-03-22 Thread Akihiro Motoki
Public bug reported:

We need to sync policy.json files with service projects before the Stein
release.

** Affects: horizon
 Importance: High
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1821289

Title:
  stein: Sync policy.json files with service projects

Status in OpenStack Dashboard (Horizon):
  New

Bug description:
  We need to sync policy.json files with service projects before the
  Stein release.

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1821289/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821288] [NEW] volume_groups view refers to consistency group index page incorrectly

2019-03-22 Thread Akihiro Motoki
Public bug reported:

The volume_groups view incorrectly refers to the consistency group index
page.

https://github.com/openstack/horizon/blob/1d2145b888af836f4aa69d0ed53c27d4864188de/openstack_dashboard/dashboards/project/volume_groups/views.py#L40

** Affects: horizon
 Importance: Medium
 Assignee: Akihiro Motoki (amotoki)
 Status: New


** Tags: stein-rc-potential

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Dashboard (Horizon).
https://bugs.launchpad.net/bugs/1821288

Title:
  volume_groups view refers to consistency group index page incorrectly

Status in OpenStack Dashboard (Horizon):
  New

Bug description:
  The volume_groups view incorrectly refers to the consistency group index
  page.

  
https://github.com/openstack/horizon/blob/1d2145b888af836f4aa69d0ed53c27d4864188de/openstack_dashboard/dashboards/project/volume_groups/views.py#L40

To manage notifications about this bug go to:
https://bugs.launchpad.net/horizon/+bug/1821288/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821299] [NEW] Associating Floating IP with a port can result in duplicate Floating IPs, due to the original FIP not being removed from the SNAT namespace.

2019-03-22 Thread piotrrr
Public bug reported:

Associating a Floating IP with a port can result in duplicate Floating
IPs, due to the original FIP not being removed from the SNAT namespace.
This is likely specific to using DVR.

We're creating a Heat stack containing, among other things, a Floating
IP and a Port.

  head_a_floating_ip:
    properties:
      floating_network_id: cf0a6df9-b533-457b-8cd7-0336f0649213
      port_id:
        get_resource: head_a_external_port
    type: OS::Neutron::FloatingIP
  head_a_external_port:
    properties:
      network_id:
        get_resource: external_net
      port_security_enabled: true
      replacement_policy: AUTO
      security_groups:
      - get_resource: head_sec_group
    type: OS::Neutron::Port

During this initial stack creation, we are not creating any VMs. So, the
port is not attached to any device.

It looks like, because of those two lines in the floating-ip definition:
      port_id:
        get_resource: head_a_external_port
after the initial stack creation the Floating IP gets allocated in a SNAT
namespace on one of the hypervisors and starts to respond to ARP requests.

However, as soon as we update this stack, adding a VM and attaching the
above-mentioned port to it, something weird happens. Neutron then allocates
that FIP on the hypervisor hosting the VM (as expected, since we're running
DVR), but it fails to remove the FIP it had created initially in the SNAT
namespace after the first stack creation.

This results in the FIP being present on two different hypervisors, causing
duplicate ARP replies (one MAC from the SNAT namespace, the other from the
floating IP namespace) and obvious connectivity issues.

Note that the issue does not appear if the initial FIP happens to land
in the SNAT namespace of the same hypervisor that will later (after the
stack update) also host the VM.

A simple, confirmed workaround is to NOT include those two lines during the
initial Heat stack creation, and only add them in the stack update that adds
the VM:
      port_id:
        get_resource: head_a_external_port
Not including those lines initially results in Neutron not allocating the
FIP anywhere.
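
For anyone trying to confirm the duplicate, one quick check is to look for the
FIP address in the namespaces on each hypervisor. The short Python sketch below
only shells out to "ip netns"; the DVR namespace naming it relies on
(snat-<router_id>, qrouter-<router_id>, fip-<ext_net_id>) is the usual
convention, and the script itself is purely illustrative, not part of any
OpenStack tool.

    #!/usr/bin/env python3
    # Diagnostic sketch: list which network namespaces on this host
    # currently carry a given floating IP. Run on every hypervisor;
    # normally the FIP should show up on one node only.
    import subprocess
    import sys

    def namespaces_with_ip(address):
        out = subprocess.run(['ip', 'netns', 'list'],
                             capture_output=True, text=True, check=True).stdout
        hits = []
        for line in out.splitlines():
            ns = line.split()[0]      # 'ip netns list' may append "(id: N)"
            addrs = subprocess.run(['ip', 'netns', 'exec', ns,
                                    'ip', '-o', 'addr', 'show'],
                                   capture_output=True, text=True).stdout
            if address + '/' in addrs:
                hits.append(ns)
        return hits

    if __name__ == '__main__':
        fip = sys.argv[1]             # the floating IP address to look for
        print('%s found in: %s' % (fip, namespaces_with_ip(fip) or 'no namespace'))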

Environment: Neutron Pike (11.0.5), with DVR, OVS, VLAN-based isolation.

** Affects: neutron
 Importance: Undecided
 Status: New


[Yahoo-eng-team] [Bug 1821303] [NEW] Online data migration bases on hit count rather than total count

2019-03-22 Thread Maciej Jozefczyk
Public bug reported:

Imagine an online data migration run that keeps matching 50 rows per batch
but, in its last batch, migrates none of them, like:

Running batches of 50 until complete
50 rows matched query fake_migration, 50 migrated
50 rows matched query fake_migration, 40 migrated
50 rows matched query fake_migration, 0 migrated
+----------------+--------------+-----------+
|   Migration    | Total Needed | Completed |
+----------------+--------------+-----------+
| fake_migration |     150      |     90    |
+----------------+--------------+-----------+

After the last run, the online data migration will not move on to the next
batch, even though there are still rows that should be checked/migrated.

This is because the check for whether a migration is done looks at the
'completed' counter instead of the 'total needed' counter:
https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L733
https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L744

For some of the online data migration scripts, like:
https://github.com/openstack/nova/blob/master/nova/objects/virtual_interface.py#L154

the operator could be misled, because the migration ends while in fact there
are still rows that need to be checked.
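
To make the difference concrete, here is a minimal sketch of the batch loop
shape (illustrative names only, not nova's actual code). The behaviour
described above corresponds to stopping as soon as one batch migrates nothing;
stopping only when a batch matches nothing would keep walking the remaining
rows:

    def run_until_complete(migration, batch_size=50, stop_on='migrated'):
        """Drive a toy online data migration in batches.

        `migration(batch_size)` is assumed to return a tuple
        (rows_matched, rows_migrated), mirroring the per-batch counters
        printed by `nova-manage db online_data_migrations`.
        """
        total_matched = total_migrated = 0
        while True:
            matched, migrated = migration(batch_size)
            total_matched += matched
            total_migrated += migrated
            if stop_on == 'migrated' and migrated == 0:
                # Reported behaviour: bail out even though rows still match.
                break
            if stop_on == 'matched' and matched == 0:
                # Alternative: only stop once nothing matches any more
                # (needs care to avoid spinning on permanently stuck rows).
                break
        return total_matched, total_migrated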

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821303

Title:
  Online data migration bases on hit count rather than total count

Status in OpenStack Compute (nova):
  New

Bug description:
  Imagine an online data migration run that keeps matching 50 rows per
  batch but, in its last batch, migrates none of them, like:

  Running batches of 50 until complete
  50 rows matched query fake_migration, 50 migrated
  50 rows matched query fake_migration, 40 migrated
  50 rows matched query fake_migration, 0 migrated
  +----------------+--------------+-----------+
  |   Migration    | Total Needed | Completed |
  +----------------+--------------+-----------+
  | fake_migration |     150      |     90    |
  +----------------+--------------+-----------+

  After the last run, the online data migration will not move on to the
  next batch, even though there are still rows that should be
  checked/migrated.

  This is because the check for whether a migration is done looks at the
  'completed' counter instead of the 'total needed' counter:
  https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L733
  https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L744

  For some of the online data migration scripts, like:
  
https://github.com/openstack/nova/blob/master/nova/objects/virtual_interface.py#L154

  the operator could be misled, because the migration ends while in fact
  there are still rows that need to be checked.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1821303/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821306] [NEW] Using or importing the ABCs from 'collections' is deprecated

2019-03-22 Thread Attila Fazekas
Public bug reported:

Mar 22 09:09:30 controller-02 glance-api[23536]: 
/opt/stack/glance/glance/location.py:189: DeprecationWarning: Using or 
importing the ABCs from 'collections' instead of from 'collections.abc' is 
deprecated, and>
Mar 22 09:09:30 controller-02 glance-api[23536]:   class 
StoreLocations(collections.MutableSequence):
Mar 22 09:09:30 controller-02 glance-api[23536]: 
/opt/stack/glance/glance/api/common.py:115: DeprecationWarning: invalid escape 
sequence \d
Mar 22 09:09:30 controller-02 glance-api[23536]:   pattern = 
re.compile('^(\d+)((K|M|G|T)?B)?$')

Mar 22 09:09:30 controller-02 glance-api[23536]:
/usr/local/lib/python3.7/site-
packages/os_brick/initiator/linuxrbd.py:24: DeprecationWarning: Using or
importing the ABCs from 'collections' instead of from 'col>

(Today version)
py: Python 3.7.2
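
Both warnings quoted above point at mechanical fixes. A minimal, self-contained
sketch of the two patterns (not the actual glance/os-brick code) looks like
this:

    import collections.abc
    import re

    # Inherit from collections.abc.* instead of the deprecated aliases that
    # used to live directly in the collections module.
    class StoreLocations(collections.abc.MutableSequence):
        def __init__(self):
            self._data = []

        def __getitem__(self, idx):
            return self._data[idx]

        def __setitem__(self, idx, value):
            self._data[idx] = value

        def __delitem__(self, idx):
            del self._data[idx]

        def __len__(self):
            return len(self._data)

        def insert(self, idx, value):
            self._data.insert(idx, value)

    # Use a raw string so '\d' is no longer an invalid escape sequence.
    pattern = re.compile(r'^(\d+)((K|M|G|T)?B)?$')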

** Affects: glance
 Importance: Undecided
 Status: New

** Affects: os-brick
 Importance: Undecided
 Status: New

** Also affects: os-brick
   Importance: Undecided
   Status: New

** Description changed:

  Mar 22 09:09:30 controller-02 glance-api[23536]: 
/opt/stack/glance/glance/location.py:189: DeprecationWarning: Using or 
importing the ABCs from 'collections' instead of from 'collections.abc' is 
deprecated, and>
  Mar 22 09:09:30 controller-02 glance-api[23536]:   class 
StoreLocations(collections.MutableSequence):
  Mar 22 09:09:30 controller-02 glance-api[23536]: 
/opt/stack/glance/glance/api/common.py:115: DeprecationWarning: invalid escape 
sequence \d
  Mar 22 09:09:30 controller-02 glance-api[23536]:   pattern = 
re.compile('^(\d+)((K|M|G|T)?B)?$')
  
  Mar 22 09:09:30 controller-02 glance-api[23536]:
  /usr/local/lib/python3.7/site-
  packages/os_brick/initiator/linuxrbd.py:24: DeprecationWarning: Using or
  importing the ABCs from 'collections' instead of from 'col>
  
  (Today version)
- py: Python 2.7.15
+ py: Python 3.7.2

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/1821306

Title:
  Using or importing the ABCs from 'collections'  is deprecated

Status in Glance:
  New
Status in os-brick:
  New

Bug description:
  Mar 22 09:09:30 controller-02 glance-api[23536]: 
/opt/stack/glance/glance/location.py:189: DeprecationWarning: Using or 
importing the ABCs from 'collections' instead of from 'collections.abc' is 
deprecated, and>
  Mar 22 09:09:30 controller-02 glance-api[23536]:   class 
StoreLocations(collections.MutableSequence):
  Mar 22 09:09:30 controller-02 glance-api[23536]: 
/opt/stack/glance/glance/api/common.py:115: DeprecationWarning: invalid escape 
sequence \d
  Mar 22 09:09:30 controller-02 glance-api[23536]:   pattern = 
re.compile('^(\d+)((K|M|G|T)?B)?$')

  Mar 22 09:09:30 controller-02 glance-api[23536]:
  /usr/local/lib/python3.7/site-
  packages/os_brick/initiator/linuxrbd.py:24: DeprecationWarning: Using
  or importing the ABCs from 'collections' instead of from 'col>

  (Today version)
  py: Python 3.7.2

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/1821306/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821311] [NEW] openstack router remove/add command exits without error when it fails

2019-03-22 Thread Candido Campos Rivas
Public bug reported:

The command fails, but the failure is not shown unless you use the --debug
option:

(overcloud) [stack@undercloud-0 ~]$ openstack router remove subnet router 
selfservice ; echo $?
0
(overcloud) [stack@undercloud-0 ~]$ 
(overcloud) [stack@undercloud-0 ~]$ 
(overcloud) [stack@undercloud-0 ~]$ openstack router add subnet router 
selfservice ; echo $?
0


(overcloud) [stack@undercloud-0 ~]$ openstack router remove subnet router 
selfservice --debug
START with options: [u'router', u'remove', u'subnet', u'router', 
u'selfservice', u'--debug']
...

RESP: [409] Content-Length: 268 Content-Type: application/json Date: Fri, 22 
Mar 2019 09:29:10 GMT X-Openstack-Request-Id: 
req-e46e3e8b-76d3-4535-8cad-e14eed2c9190
RESP BODY: {"NeutronError": {"message": "Router interface for subnet 
ca7de33b-98c7-4ff4-9fae-cc2fcb7c41cc on router 
daa62d34-037d-4188-a37c-ab5d058d5489 cannot be deleted, as it is required by 
one or more floating IPs.", "type": "RouterInterfaceInUseByFloatingIP", 
"detail": ""}}
PUT call to network for 
http://10.0.0.107:9696/v2.0/routers/daa62d34-037d-4188-a37c-ab5d058d5489/remove_router_interface
 used request id req-e46e3e8b-76d3-4535-8cad-e14eed2c9190
Manager unknown ran task network.PUT.routers.remove_router_interface in 
1.10061788559s
clean_up RemoveSubnetFromRouter: 
END return value: 0
(overcloud) [stack@undercloud-0 ~]$ 
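
The 409 shown in the debug output above does surface at the API level; it is
only the CLI exit code that hides it. As a rough illustration (a sketch that
assumes you already have a valid token; the endpoint, router and subnet IDs are
copied from the debug output), calling the same remove_router_interface action
directly makes the failure visible:

    import requests

    neutron = 'http://10.0.0.107:9696'
    token = '<valid keystone token>'   # assumption: obtained separately
    router_id = 'daa62d34-037d-4188-a37c-ab5d058d5489'
    subnet_id = 'ca7de33b-98c7-4ff4-9fae-cc2fcb7c41cc'

    resp = requests.put(
        '%s/v2.0/routers/%s/remove_router_interface' % (neutron, router_id),
        headers={'X-Auth-Token': token},
        json={'subnet_id': subnet_id})

    # With a floating IP still attached this returns 409
    # (RouterInterfaceInUseByFloatingIP), while the CLI above exits 0.
    print(resp.status_code, resp.text)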


Reproduction example:

 -create a router:

(overcloud) [stack@undercloud-0 ~]$ history | grep router
   18  openstack router create router
   19  openstack router add subnet router selfservice
   20  openstack router set router --external-gateway public

 -associate a floating ip to a vm:
   56  openstack server add floating ip provider-instance 10.0.0.216

 -try to add/remove the subnet

   61  openstack router remove subnet router selfservice --debug
   62  openstack router add subnet router selfservice --debug


Logs and version:


(overcloud) [stack@undercloud-0 ~]$ yum info openstack-neutron 
Loaded plugins: search-disabled-repos
Available Packages
Name: openstack-neutron
Arch: noarch
Epoch   : 1
Version : 13.0.3
Release : 0.20190119134915.886782c.el7ost
Size: 28 k
Repo: rhelosp-14.0-puddle/x86_64
Summary : OpenStack Networking Service
URL : http://launchpad.net/neutron/
License : ASL 2.0
Description : 
: Neutron is a virtual network service for Openstack. Just like
: OpenStack Nova provides an API to dynamically request and 
configure
: virtual servers, Neutron provides an API to dynamically request 
and
: configure virtual networks. These networks connect "interfaces" 
from
: other OpenStack services (e.g., virtual NICs from Nova VMs). The
: Neutron API supports extensions to provide advanced network
: capabilities (e.g., QoS, ACLs, network monitoring, etc.)

(overcloud) [stack@undercloud-0 ~]$ cat /etc/rhosp-release 
Red Hat OpenStack Platform release 14.0.1 RC (Rocky)

(overcloud) [stack@undercloud-0 ~]$ openstack server add floating ip 
provider-instance 10.0.0.216
(overcloud) [stack@undercloud-0 ~]$ openstack router remove subnet router 
selfservice
(overcloud) [stack@undercloud-0 ~]$ 
(overcloud) [stack@undercloud-0 ~]$ 
(overcloud) [stack@undercloud-0 ~]$ 
(overcloud) [stack@undercloud-0 ~]$ openstack router show router
+-------------------------+-------+
| Field                   | Value |
+-------------------------+-------+
| admin_state_up          | UP    |
| availability_zone_hints |       |
| availability_zones      | nova  |

[Yahoo-eng-team] [Bug 1821357] [NEW] VRRP vip on VM not reachable from other network on DVR setup

2019-03-22 Thread David Rabel
Public bug reported:

Hi.

We are using OpenStack Queens with DVR and have the following problem:

We have a VRRP setup (OpenSense firewalls) on VMs. The vip is reachable
from all other VMs in the same network, but not from VMs in different
networks. Both OpenSense VMs are reachable from the other network.


So, routing in general between the two networks works fine, but we cannot reach 
the vip from the other network.

Port Security is deactivated.

It does work if the VRRP master VM is on the same compute node as the
test VM trying to reach it.

Further investigation shows that when trying to ping the vip, the ICMP
message reaches the router interface on the compute node where the VM
sending it is located. But an ovs-tcpdump on the patch-int port shows that
there is no traffic tunneled between the hosts.

So, if the VRRP master with the vip is on the same node as the VM trying
to reach it, it receives the ping and answers. If it is on a different
node, we can observe an arp request from the router interface only on
the node where the VM sending the ping is located. This arp request is
unanswered.
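
One thing worth checking on the node hosting the pinging VM is whether br-tun
has an ARP responder entry and flood flows for the vip at all; with l2pop
(commonly enabled with DVR) the vip is often not advertised because it is not
a real Neutron port. The snippet below is only a diagnostic sketch and assumes
the default in-tree OVS agent pipeline, where table 21 holds the l2pop ARP
responder entries and table 22 the flood-to-tunnel flows:

    import subprocess

    def dump_flows(table):
        """Return br-tun flows for one OpenFlow table (diagnostic sketch)."""
        return subprocess.run(
            ['ovs-ofctl', 'dump-flows', 'br-tun', 'table=%d' % table],
            capture_output=True, text=True, check=True).stdout

    vip = '192.168.10.100'   # the VRRP vip, adjust to your setup
    arp_flows = dump_flows(21)
    print('ARP responder entry for %s present: %s' % (vip, vip in arp_flows))
    print('flood-to-tunnel flows:')
    print(dump_flows(22))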


It seems to us that this is a bug in Neutron.

Yours
  David

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1821357

Title:
  VRRP vip on VM not reachable from other network on DVR setup

Status in neutron:
  New

Bug description:
  Hi.

  We are using OpenStack Queens with DVR and have the following problem:

  We have a VRRP setup (OpenSense firewalls) on VMs. The vip is
  reachable from all other VMs in the same network, but not from VMs in
  different networks. Both OpenSense VMs are reachable from the other
  network.

  
  So, routing in general between the two networks works fine, but we cannot 
reach the vip from the other network.

  Port Security is deactivated.

  It does work if the VRRP master VM is on the same compute node as the
  test VM trying to reach it.

  Further investigation shows that when trying to ping the vip, the ICMP
  message reaches the router interface on the compute node where the VM
  sending it is located. But an ovs-tcpdump on the patch-int port shows
  that there is no traffic tunneled between the hosts.

  So, if the VRRP master with the vip is on the same node as the VM
  trying to reach it, it receives the ping and answers. If it is on a
  different node, we can observe an arp request from the router
  interface only on the node where the VM sending the ping is located.
  This arp request is unanswered.

  
  It seems to us that this is a bug in Neutron.

  Yours
David

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1821357/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1791075] Re: update_available_resource periodic does not take into account all evacuation states

2019-03-22 Thread Matt Riedemann
** Also affects: nova/queens
   Importance: Undecided
   Status: New

** Changed in: nova/queens
   Status: New => In Progress

** Changed in: nova/queens
   Importance: Undecided => Medium

** Changed in: nova/queens
 Assignee: (unassigned) => huanhongda (hongda)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1791075

Title:
  update_available_resource periodic does not take into account all
  evacuation states

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  In Progress
Status in OpenStack Compute (nova) rocky series:
  Fix Committed

Bug description:
  The current _update_usage_from_migrations code takes into account only
  the REBUILDING task state and does not properly handle the
  rebuild-spawning and rebuild-volume-attachment phases. This can cause
  issues with NUMA topologies or PCI devices if several instances are being
  evacuated and some of them begin evacuation before an
  update_available_resource periodic pass while others begin immediately
  after, causing the latter ones to claim e.g. already pinned CPUs.
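
  A minimal sketch of the kind of guard implied here (illustrative only, not
  nova's actual code) would treat all rebuild-related task states as an
  in-progress evacuation rather than only REBUILDING:

      # Names follow nova.compute.task_states; string values here are only
      # illustrative, so the sketch stands alone.
      REBUILDING = 'rebuilding'
      REBUILD_BLOCK_DEVICE_MAPPING = 'rebuild_block_device_mapping'
      REBUILD_SPAWNING = 'rebuild_spawning'

      EVACUATION_TASK_STATES = (REBUILDING,
                                REBUILD_BLOCK_DEVICE_MAPPING,
                                REBUILD_SPAWNING)

      def is_evacuation_in_progress(task_state):
          # Checking only REBUILDING (the reported behaviour) misses the
          # spawn and volume-attachment phases of an evacuation.
          return task_state in EVACUATION_TASK_STATES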

  Here is an example traceback that appears in nova-compute log after
  the instance was evacuated:

  2018-06-27T16:16:59.181573+02:00 compute-0-8.domain.tld nova-compute[19571]: 
2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
[req-79bc5f9f-9d5e-4f55-ad56-8351930afcb3 - - - - -] Error updating resources 
for node compute-0-8.domain.tld.
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager Traceback (most 
recent call last):
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6533, in 
update_available_resource_for_node
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
rt.update_available_resource(context, periodic=True)
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 594, 
in update_available_resource
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
self._update_available_resource(context, resources, periodic=periodic)
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in 
inner
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager return f(*args, 
**kwargs)
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 661, 
in _update_available_resource
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
self._update_usage_from_instances(context, instances)
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1035, 
in _update_usage_from_instances
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
self._update_usage_from_instance(context, instance)
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1001, 
in _update_usage_from_instance
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
self._update_usage(instance, sign=sign)
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 834, 
in _update_usage
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
self.compute_node, usage, free)
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1491, in 
get_host_numa_usage_from_instance
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
host_numa_topology, instance_numa_topology, free=free))
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1356, in 
numa_usage_from_instances
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
newcell.pin_cpus(pinned_cpus)
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager   File 
"/usr/lib/python2.7/dist-packages/nova/objects/numa.py", line 85, in pin_cpus
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager 
pinned=list(self.pinned_cpus))
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager CPUPinningInvalid: 
Cannot pin/unpin cpus [10, 34] from the following pinned set [9, 10, 34, 33]
  2018-06-27 16:16:59.163 19571 ERROR nova.compute.manager

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1791075/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821373] [NEW] Most instance actions can be called concurrently

2019-03-22 Thread Matthew Booth
Public bug reported:

A customer reported that they were getting DB corruption if they called
shelve twice in quick succession on the same instance. This should be
prevented by the guard in nova.API.shelve, which does:

  instance.task_state = task_states.SHELVING
  instance.save(expected_task_state=[None])

This is intended to act as a robust gate against 2 instance actions
happening concurrently. The first will set the task state to SHELVING,
the second will fail because the task state is not SHELVING. The
comparison is done atomically in db.instance_update_and_get_original(),
and should be race free.

However, instance.save() shortcuts if there is no update and does not
call db.instance_update_and_get_original(). Therefore this guard fails
if we call the same operation twice:

  instance = get_instance()
=> Returned instance.task_state is None
  instance.task_state = task_states.SHELVING
  instance.save(expected_task_state=[None])
=> task_state was None, now SHELVING, updates = {'task_state': SHELVING}
=> db.instance_update_and_get_original() executes and succeeds

  instance = get_instance()
=> Returned instance.task_state is SHELVING
  instance.task_state = task_states.SHELVING
  instance.save(expected_task_state=[None])
=> task_state was SHELVING, still SHELVING, updates = {}
=> db.instance_update_and_get_original() does not execute, therefore 
doesn't raise the expected exception
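
The behaviour is easy to reproduce outside nova with a tiny model of the
save() semantics described above (a sketch, not nova's object code): the
atomic compare-and-swap only runs when there is something to update, so a
second identical assignment slips through unchecked.

    class FakeInstance(object):
        """Toy model of the expected_task_state guard described above."""

        def __init__(self, db_task_state=None):
            self._db_task_state = db_task_state   # what the DB row holds
            self.task_state = db_task_state
            self._dirty = False

        def set_task_state(self, value):
            if value != self.task_state:
                self._dirty = True
            self.task_state = value

        def save(self, expected_task_state):
            if not self._dirty:
                # Mirrors the shortcut: no updates, so the DB-side
                # compare-and-swap (and its exception) never runs.
                return
            if self._db_task_state not in expected_task_state:
                raise RuntimeError('UnexpectedTaskStateError')
            self._db_task_state = self.task_state
            self._dirty = False

    inst = FakeInstance(db_task_state=None)
    inst.set_task_state('shelving')
    inst.save(expected_task_state=[None])     # CAS runs and succeeds

    inst2 = FakeInstance(db_task_state='shelving')   # re-fetched instance
    inst2.set_task_state('shelving')
    inst2.save(expected_task_state=[None])    # no update -> guard skipped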

This pattern is common to almost all instance actions in nova api. A
quick scan suggests that all of the following actions are affected by
this bug, and can therefore all potentially be executed multiple times
concurrently for the same instance:

restore
force_stop
start
backup
snapshot
soft reboot
hard reboot
rebuild
revert_resize
resize
shelve
shelve_offload
unshelve
pause
unpause
suspend
resume
rescue
unrescue
set_admin_password
live_migrate
evacuate

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821373

Title:
  Most instance actions can be called concurrently

Status in OpenStack Compute (nova):
  New

Bug description:
  A customer reported that they were getting DB corruption if they
  called shelve twice in quick succession on the same instance. This
  should be prevented by the guard in nova.API.shelve, which does:

instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])

  This is intended to act as a robust gate against 2 instance actions
  happening concurrently. The first will set the task state to SHELVING,
  the second will fail because the task state is not SHELVING. The
  comparison is done atomically in
  db.instance_update_and_get_original(), and should be race free.

  However, instance.save() shortcuts if there is no update and does not
  call db.instance_update_and_get_original(). Therefore this guard fails
  if we call the same operation twice:

instance = get_instance()
  => Returned instance.task_state is None
instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])
  => task_state was None, now SHELVING, updates = {'task_state': SHELVING}
  => db.instance_update_and_get_original() executes and succeeds

instance = get_instance()
  => Returned instance.task_state is SHELVING
instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])
  => task_state was SHELVING, still SHELVING, updates = {}
  => db.instance_update_and_get_original() does not execute, therefore 
doesn't raise the expected exception

  This pattern is common to almost all instance actions in nova api. A
  quick scan suggests that all of the following actions are affected by
  this bug, and can therefore all potentially be executed multiple times
  concurrently for the same instance:

  restore
  force_stop
  start
  backup
  snapshot
  soft reboot
  hard reboot
  rebuild
  revert_resize
  resize
  shelve
  shelve_offload
  unshelve
  pause
  unpause
  suspend
  resume
  rescue
  unrescue
  set_admin_password
  live_migrate
  evacuate

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1821373/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821384] [NEW] Customization of ubuntu image fails before upload to devstack

2019-03-22 Thread Slawek Kaplonski
Public bug reported:

In tempest scenario jobs we have a customization script which customizes the
image before uploading it to Glance in devstack. Sometimes it fails when
there is a package and index mismatch, e.g. in
http://logs.openstack.org/86/643486/12/check/neutron-tempest-plugin-
scenario-
linuxbridge/48055f7/controller/logs/devstacklog.txt.gz#_2019-03-22_17_00_19_424

Logstash query:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%20%5C%22E%3A%20Some%20index%20files%20failed%20to%20download.%20They%20have%20been%20ignored%2C%20or%20old%20ones%20used%20instead.%5C%22

** Affects: neutron
 Importance: Critical
 Status: Confirmed


** Tags: gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1821384

Title:
  Customization of ubuntu image fails before upload to devstack

Status in neutron:
  Confirmed

Bug description:
  In tempest scenario jobs we have a customization script which customizes
  the image before uploading it to Glance in devstack. Sometimes it fails
  when there is a package and index mismatch, e.g. in
  http://logs.openstack.org/86/643486/12/check/neutron-tempest-plugin-
  scenario-
  
linuxbridge/48055f7/controller/logs/devstacklog.txt.gz#_2019-03-22_17_00_19_424

  Logstash query:
  
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%20%5C%22E%3A%20Some%20index%20files%20failed%20to%20download.%20They%20have%20been%20ignored%2C%20or%20old%20ones%20used%20instead.%5C%22

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1821384/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821411] [NEW] Configuring nova.conf - [keystone_auth] mistake

2019-03-22 Thread Luis Bustos
Public bug reported:

In the controller node configuration of the Compute service, when
editing nova.conf, the guide instructs to define project_domain_name and
user_domain_name as 'default'. They should be capitalized as 'Default'
to be consistent with the rest of the guide.

URL referred: https://docs.openstack.org/nova/rocky/install/controller-
install-rdo.html#install-and-configure-components
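
For reference, this is the shape of the corrected excerpt being requested
(option names and values are taken from the report itself; the section name is
assumed to be [keystone_authtoken], which the bug title abbreviates):

    [keystone_authtoken]
    # ... other options from the guide unchanged ...
    project_domain_name = Default
    user_domain_name = Default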

I am a new Linux user, but I believe most parameters in Linux are
case-sensitive, which is why I consider it relevant to report this minor
issue.


This bug tracker is for errors with the documentation, use the following as a 
template and remove or add fields as you see fit. Convert [ ] into [x] to check 
boxes:

- [X] This doc is inaccurate in this way: project_domain_name and
user_domain_name should be capitalized as 'Default' instead of default
to be consistent with the other parts of the installation guide.
Currently both their values are 'default'.

- [ ] This is a doc addition request.
- [ ] I have a fix to the document that I can paste below including example: 
input and output. 

If you have a troubleshooting or support issue, use the following
resources:

 - Ask OpenStack: http://ask.openstack.org
 - The mailing list: http://lists.openstack.org
 - IRC: 'openstack' channel on Freenode

---
Release: 18.1.1.dev95 on 2019-03-21 16:37
SHA: 9bb78d5765dab01e38327f57312583c189a352d5
Source: 
https://git.openstack.org/cgit/openstack/nova/tree/doc/source/install/controller-install-rdo.rst
URL: https://docs.openstack.org/nova/rocky/install/controller-install-rdo.html

** Affects: nova
 Importance: Undecided
 Status: New


** Tags: doc

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821411

Title:
  Configuring nova.conf - [keystone_auth] mistake

Status in OpenStack Compute (nova):
  New

Bug description:
  In the controller node configuration of the Compute service, when
  editing nova.conf, the guide instructs to define project_domain_name
  and user_domain_name as 'default'. They should be capitalized as
  'Default' to be consistent with the rest of the guide.

  URL referred: https://docs.openstack.org/nova/rocky/install
  /controller-install-rdo.html#install-and-configure-components

  I am a new Linux user, but I believe most parameters in Linux are
  case-sensitive, which is why I consider it relevant to report this minor
  issue.

  
  This bug tracker is for errors with the documentation, use the following as a 
template and remove or add fields as you see fit. Convert [ ] into [x] to check 
boxes:

  - [X] This doc is inaccurate in this way: project_domain_name and
  user_domain_name should be capitalized as 'Default' instead of default
  to be consistent with the other parts of the installation guide.
  Currently both their values are 'default'.

  - [ ] This is a doc addition request.
  - [ ] I have a fix to the document that I can paste below including example: 
input and output. 

  If you have a troubleshooting or support issue, use the following
  resources:

   - Ask OpenStack: http://ask.openstack.org
   - The mailing list: http://lists.openstack.org
   - IRC: 'openstack' channel on Freenode

  ---
  Release: 18.1.1.dev95 on 2019-03-21 16:37
  SHA: 9bb78d5765dab01e38327f57312583c189a352d5
  Source: 
https://git.openstack.org/cgit/openstack/nova/tree/doc/source/install/controller-install-rdo.rst
  URL: https://docs.openstack.org/nova/rocky/install/controller-install-rdo.html

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1821411/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1808975] Re: python3 + Fedora + SSL + nova compute RecursionError: maximum recursion depth exceeded while calling a Python object

2019-03-22 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/626952
Committed: 
https://git.openstack.org/cgit/openstack/nova/commit/?id=3c5e2b0e9fac985294a949852bb8c83d4ed77e04
Submitter: Zuul
Branch:master

commit 3c5e2b0e9fac985294a949852bb8c83d4ed77e04
Author: Matthew Booth 
Date:   Wed Jan 30 15:10:25 2019 +

Eventlet monkey patching should be as early as possible

We were seeing infinite recursion opening an ssl socket when running
various combinations of python3, eventlet, and urllib3. It is not
clear exactly what combination of versions are affected, but for
background there is an example of this issue documented here:

https://github.com/eventlet/eventlet/issues/371

The immediate cause in nova's case was that we were calling
eventlet.monkey_patch() after importing urllib3. Specifically, change
Ie7bf5d012e2ccbcd63c262ddaf739782afcdaf56 introduced the
nova.utils.monkey_patch() method to make monkey patching common
between WSGI and non-WSGI services. Unfortunately, before executing
this method you must first import nova.utils, which imports a large
number of modules itself. Anything imported (transitively) by
nova.utils would therefore be imported before monkey patching, which
included urllib3. This triggers the infinite recursion problem
described above if you have an affected combination of library
versions.

While this specific issue may eventually be worked around or fixed in
eventlet or urllib3, it remains true that eventlet best practises are
to monkey patch as early as possible, which we were not doing. To
avoid this and hopefully future similar issues, this change ensures
that monkey patching happens as early as possible, and only a minimum
number of modules are imported first.

This change fixes monkey patching for both non-wsgi and wsgi callers:

* Non-WSGI services (nova/cmd)

  This is fixed by using the new monkey_patch module, which has minimal
  dependencies.

* WSGI services (nova/api/openstack)

  This is fixed both by using the new monkey_patch module, and by moving
  the patching point up one level so that it is done before importing
  anything in nova/api/openstack/__init__.py.

  This move causes issues for some external tools which load this path
  from nova and now monkey patch where they previously did not. However,
  it is unfortunately unavoidable to enable monkey patching for the wsgi
  entry point without major restructuring. This change includes a
  workaround for sphinx to avoid this issue.

This change has been through several iterations. I started with what
seemed like the simplest and most obvious change, and moved on as I
discovered more interactions which broke. It is clear that eventlet
monkey patching is extremely fragile, especially when done implicitly at
module load time as we do. I would advocate a code restructure to
improve this situation, but I think the time would be better spent
removing the eventlet dependency entirely.

Co-authored-by: Lee Yarwood 

Closes-Bug: #1808975
Closes-Bug: #1808951
Change-Id: Id46e7b553a10ec4654d4418a9884975b5b95
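
The practical pattern the commit message argues for is simply to monkey patch
before anything that can drag in ssl/urllib3 is imported. A stripped-down
sketch of a service entry point (illustrative, not nova's actual module):

    # entry_point.py -- monkey patch first, import everything else afterwards.
    import eventlet
    eventlet.monkey_patch()

    # Only now is it safe to import modules that (transitively) pull in
    # urllib3/ssl; importing them above this line reintroduces the recursion
    # problem described in these bugs.
    import urllib3  # noqa: E402


    def main():
        pool = eventlet.GreenPool()
        http = urllib3.PoolManager()
        # ... spawn green threads that use `http` here ...
        pool.waitall()


    if __name__ == '__main__':
        main()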


** Changed in: nova
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1808975

Title:
  python3 + Fedora + SSL + nova compute RecursionError: maximum
  recursion depth exceeded while calling a Python object

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description:- While testing a python3 Fedora deployment for nova in [1],
  the below RecursionError appeared in nova-compute:

  2018-12-18 08:00:05.266 2428 ERROR nova.compute.manager 
[req-f908a9e0-e77a-4d35-9266-fc5e8d79dfde - - - - -] Error updating resources 
for node rdo-fedora-stable-rdo-cloud-358855.: RecursionError: maximum 
recursion depth exceeded while calling a Python object
  2018-12-18 08:00:05.266 2428 ERROR nova.compute.manager Traceback (most 
recent call last):
  2018-12-18 08:00:05.266 2428 ERROR nova.compute.manager   File 
"/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 7690, in 
_update_available_resource_for_node
  2018-12-18 08:00:05.266 2428 ERROR nova.compute.manager 
rt.update_available_resource(context, nodename, startup=startup)
  2018-12-18 08:00:05.266 2428 ERROR nova.compute.manager   File 
"/usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 738, 
in update_available_resource
  2018-12-18 08:00:05.266 2428 ERROR nova.compute.manager 
self._update_available_resource(context, resources, startup=startup)
  2018-12-18 08:00:05.266 2428 ERROR nova.compute.manager   File 
"/usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 328, in 
inner
  2018-12-18 08:00:05.266 242

[Yahoo-eng-team] [Bug 1808951] Re: python3 + Fedora + SSL + wsgi nova deployment, nova api returns RecursionError: maximum recursion depth exceeded while calling a Python object

2019-03-22 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/626952
Committed: 
https://git.openstack.org/cgit/openstack/nova/commit/?id=3c5e2b0e9fac985294a949852bb8c83d4ed77e04
Submitter: Zuul
Branch:master

commit 3c5e2b0e9fac985294a949852bb8c83d4ed77e04
Author: Matthew Booth 
Date:   Wed Jan 30 15:10:25 2019 +

Eventlet monkey patching should be as early as possible

We were seeing infinite recursion opening an ssl socket when running
various combinations of python3, eventlet, and urllib3. It is not
clear exactly what combination of versions are affected, but for
background there is an example of this issue documented here:

https://github.com/eventlet/eventlet/issues/371

The immediate cause in nova's case was that we were calling
eventlet.monkey_patch() after importing urllib3. Specifically, change
Ie7bf5d012e2ccbcd63c262ddaf739782afcdaf56 introduced the
nova.utils.monkey_patch() method to make monkey patching common
between WSGI and non-WSGI services. Unfortunately, before executing
this method you must first import nova.utils, which imports a large
number of modules itself. Anything imported (transitively) by
nova.utils would therefore be imported before monkey patching, which
included urllib3. This triggers the infinite recursion problem
described above if you have an affected combination of library
versions.

While this specific issue may eventually be worked around or fixed in
eventlet or urllib3, it remains true that eventlet best practises are
to monkey patch as early as possible, which we were not doing. To
avoid this and hopefully future similar issues, this change ensures
that monkey patching happens as early as possible, and only a minimum
number of modules are imported first.

This change fixes monkey patching for both non-wsgi and wsgi callers:

* Non-WSGI services (nova/cmd)

  This is fixed by using the new monkey_patch module, which has minimal
  dependencies.

* WSGI services (nova/api/openstack)

  This is fixed both by using the new monkey_patch module, and by moving
  the patching point up one level so that it is done before importing
  anything in nova/api/openstack/__init__.py.

  This move causes issues for some external tools which load this path
  from nova and now monkey patch where they previously did not. However,
  it is unfortunately unavoidable to enable monkey patching for the wsgi
  entry point without major restructuring. This change includes a
  workaround for sphinx to avoid this issue.

This change has been through several iterations. I started with what
seemed like the simplest and most obvious change, and moved on as I
discovered more interactions which broke. It is clear that eventlet
monkey patching is extremely fragile, especially when done implicitly at
module load time as we do. I would advocate a code restructure to
improve this situation, but I think the time would be better spent
removing the eventlet dependency entirely.

Co-authored-by: Lee Yarwood 

Closes-Bug: #1808975
Closes-Bug: #1808951
Change-Id: Id46e7b553a10ec4654d4418a9884975b5b95


** Changed in: nova
   Status: In Progress => Fix Released

** Bug watch added: github.com/eventlet/eventlet/issues #371
   https://github.com/eventlet/eventlet/issues/371

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1808951

Title:
  python3 + Fedora + SSL + wsgi nova deployment, nova api returns
  RecursionError: maximum recursion depth exceeded while calling a
  Python object

Status in OpenStack Compute (nova):
  Fix Released
Status in tripleo:
  Triaged

Bug description:
  Description:-

  So while testing python3 with Fedora in [1], an issue was found while
  running nova-api behind wsgi. It fails with the below traceback:

  2018-12-18 07:41:55.364 26870 INFO nova.api.openstack.requestlog 
[req-e1af4808-ecd8-47c7-9568-a5dd9691c2c9 - - - - -] 127.0.0.1 "GET 
/v2.1/servers/detail?all_tenants=True&deleted=True" status: 500 len: 0 
microversion: - time: 0.007297
  2018-12-18 07:41:55.364 26870 ERROR nova.api.openstack 
[req-e1af4808-ecd8-47c7-9568-a5dd9691c2c9 - - - - -] Caught error: maximum 
recursion depth exceeded while calling a Python object: RecursionError: maximum 
recursion depth exceeded while calling a Python object
  2018-12-18 07:41:55.364 26870 ERROR nova.api.openstack Traceback (most recent 
call last):
  2018-12-18 07:41:55.364 26870 ERROR nova.api.openstack   File 
"/usr/lib/python3.6/site-packages/nova/api/openstack/__init__.py", line 94, in 
__call__
  2018-12-18 07:41:55.364 26870 ERROR nova.api.openstack return 
req.get_response(self.application)
  2018-12-18 07:41:55.364 26870 ERROR nova.api.openstack   File 
"/usr/lib/python3.6/site-pac

[Yahoo-eng-team] [Bug 1821311] Re: openstack router remove/add command exits without error when it fails

2019-03-22 Thread Miguel Lavalle
This bug should be filed against python-openstackclient

** Project changed: neutron => python-openstackclient

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1821311

Title:
  openstack router remove/add command exits without error when it fails

Status in python-openstackclient:
  New

Bug description:
  The command fails, but the failure is not shown unless you use the
  --debug option:

  (overcloud) [stack@undercloud-0 ~]$ openstack router remove subnet router 
selfservice ; echo $?
  0
  (overcloud) [stack@undercloud-0 ~]$ 
  (overcloud) [stack@undercloud-0 ~]$ 
  (overcloud) [stack@undercloud-0 ~]$ openstack router add subnet router 
selfservice ; echo $?
  0

  
  (overcloud) [stack@undercloud-0 ~]$ openstack router remove subnet router 
selfservice --debug
  START with options: [u'router', u'remove', u'subnet', u'router', 
u'selfservice', u'--debug']
  ...

  RESP: [409] Content-Length: 268 Content-Type: application/json Date: Fri, 22 
Mar 2019 09:29:10 GMT X-Openstack-Request-Id: 
req-e46e3e8b-76d3-4535-8cad-e14eed2c9190
  RESP BODY: {"NeutronError": {"message": "Router interface for subnet 
ca7de33b-98c7-4ff4-9fae-cc2fcb7c41cc on router 
daa62d34-037d-4188-a37c-ab5d058d5489 cannot be deleted, as it is required by 
one or more floating IPs.", "type": "RouterInterfaceInUseByFloatingIP", 
"detail": ""}}
  PUT call to network for 
http://10.0.0.107:9696/v2.0/routers/daa62d34-037d-4188-a37c-ab5d058d5489/remove_router_interface
 used request id req-e46e3e8b-76d3-4535-8cad-e14eed2c9190
  Manager unknown ran task network.PUT.routers.remove_router_interface in 
1.10061788559s
  clean_up RemoveSubnetFromRouter: 
  END return value: 0
  (overcloud) [stack@undercloud-0 ~]$ 

  
  Reproduction example:

   -create a router:

  (overcloud) [stack@undercloud-0 ~]$ history | grep router
 18  openstack router create router
 19  openstack router add subnet router selfservice
 20  openstack router set router --external-gateway public

   -associate a floating ip to a vm:
 56  openstack server add floating ip provider-instance 10.0.0.216

   -try to add/remove the subnet

 61  openstack router remove subnet router selfservice --debug
 62  openstack router add subnet router selfservice --debug

  
  Logs and version:

  
  (overcloud) [stack@undercloud-0 ~]$ yum info openstack-neutron 
  Loaded plugins: search-disabled-repos
  Available Packages
  Name: openstack-neutron
  Arch: noarch
  Epoch   : 1
  Version : 13.0.3
  Release : 0.20190119134915.886782c.el7ost
  Size: 28 k
  Repo: rhelosp-14.0-puddle/x86_64
  Summary : OpenStack Networking Service
  URL : http://launchpad.net/neutron/
  License : ASL 2.0
  Description : 
  : Neutron is a virtual network service for Openstack. Just like
  : OpenStack Nova provides an API to dynamically request and 
configure
  : virtual servers, Neutron provides an API to dynamically request 
and
  : configure virtual networks. These networks connect "interfaces" 
from
  : other OpenStack services (e.g., virtual NICs from Nova VMs). The
  : Neutron API supports extensions to provide advanced network
  : capabilities (e.g., QoS, ACLs, network monitoring, etc.)

  (overcloud) [stack@undercloud-0 ~]$ cat /etc/rhosp-release 
  Red Hat OpenStack Platform release 14.0.1 RC (Rocky)

  (overcloud) [stack@undercloud-0 ~]$ openstack server add floating ip 
provider-instance 10.0.0.216
  (overcloud) [stack@undercloud-0 ~]$ openstack router remove subnet router 
selfservice
  (overcloud) [stack@undercloud-0 ~]$ 
  (overcloud) [stack@undercloud-0 ~]$ 
  (overcloud) [stack@undercloud-0 ~]$ 
  (overcloud) [stack@undercloud-0 ~]$ openstack router show router
  
  +-------------------------+-------+
  | Field                   | Value |
  +-------------------------+-------+
  | admin_state_up          | UP    |

[Yahoo-eng-team] [Bug 1813715] Re: [L2][scale issue] ovs-agent meets unexpected tunnel lost

2019-03-22 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/640797
Committed: 
https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch:master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong 
Date:   Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

Ovs-agent can be very time-consuming in handling a large number
of ports. At this point, the ovs-agent status report may have
exceeded the set timeout value. Some flows updating operations
will not be triggerred. This results in flows loss during agent
restart, especially for hosts to hosts of vxlan tunnel flow.

This fix will let the ovs-agent explicitly, in the first rpc loop,
indicate that the status is restarted. Then l2pop will be required
to update fdb entries.

Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178

Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813715

Title:
  [L2][scale issue] ovs-agent meets unexpected tunnel lost

Status in neutron:
  Fix Released

Bug description:
  The ovs-agent will lose some tunnels to other nodes, for instance to the
  DHCP node or L3 node; these lost tunnels can sometimes cause VMs to fail
  to boot or take the dataplane down.
  When the number of subnets or security group ports reaches 2000+, this
  issue can be seen with high probability.

  This is a subproblem of bug #1813703, for more information, please see the 
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813715/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1799178] Re: l2 pop doesn't always provide the whole list of fdb entries on agent restart

2019-03-22 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/640797
Committed: 
https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch:master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong 
Date:   Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

Ovs-agent can be very time-consuming in handling a large number
of ports. At this point, the ovs-agent status report may have
exceeded the set timeout value. Some flows updating operations
will not be triggerred. This results in flows loss during agent
restart, especially for hosts to hosts of vxlan tunnel flow.

This fix will let the ovs-agent explicitly, in the first rpc loop,
indicate that the status is restarted. Then l2pop will be required
to update fdb entries.

Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178

Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1799178

Title:
  l2 pop doesn't always provide the whole list of fdb entries on agent
  restart

Status in neutron:
  Fix Released

Bug description:
  The whole list of fdb entries is provided to the agent when a port from a
  new network appears, or when the agent is restarted.
  Currently agent restart is detected by the agent_boot_time option, 180 sec
  by default.
  In fact boot time differs depending on port count, and on some loaded
  clusters it may easily exceed 180 secs on gateway nodes. Changing the boot
  time in config works, but honestly this is not an ideal solution.
  There should be a smarter way for agent restart detection (like the agent
  itself sending a flag in its state report).
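
  As an illustration of that last idea (a sketch only; the field name loosely
  follows neutron's agent state-report convention and is not actual neutron
  code), the agent could mark its very first report after a (re)start so the
  server can resend the full fdb table without any boot-time heuristic:

      def build_agent_state_report(first_report_since_start,
                                   binary='neutron-openvswitch-agent'):
          report = {
              'binary': binary,
              'agent_type': 'Open vSwitch agent',
              'configurations': {},
          }
          if first_report_since_start:
              # Explicit restart marker: the server side (l2pop) can key the
              # "send the whole fdb table" decision off this flag instead of
              # comparing timestamps against agent_boot_time.
              report['start_flag'] = True
          return report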

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1799178/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813714] Re: [L2][scale issue] ovs-agent meets unexpected flow lost

2019-03-22 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/640797
Committed: 
https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch:master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong 
Date:   Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

Ovs-agent can be very time-consuming in handling a large number
of ports. At this point, the ovs-agent status report may have
exceeded the set timeout value. Some flows updating operations
will not be triggerred. This results in flows loss during agent
restart, especially for hosts to hosts of vxlan tunnel flow.

This fix will let the ovs-agent explicitly, in the first rpc loop,
indicate that the status is restarted. Then l2pop will be required
to update fdb entries.

Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178

Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813714

Title:
  [L2][scale issue] ovs-agent meets unexpected flow lost

Status in neutron:
  Fix Released

Bug description:
  The ovs-agent can lose some flows during restart, for instance flows to the
DHCP or L3 agents and tunnel flows. These lost flows can sometimes cause a VM
to fail to boot or take the dataplane down.
  When the number of ports in a subnet or security group reaches 2000+, this
issue can be seen with high probability.

  This is a subproblem of bug #1813703; for more information, please see the
summary:
  https://bugs.launchpad.net/neutron/+bug/1813703

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813714/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1794991] Re: Inconsistent flows with DVR l2pop VxLAN on br-tun

2019-03-22 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/640797
Committed: 
https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch: master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong 
Date:   Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

Ovs-agent can be very time-consuming when handling a large number
of ports. By that point the ovs-agent status report may already have
exceeded the configured timeout, so some flow-updating operations
are never triggered. This results in flow loss during agent
restart, especially for the host-to-host vxlan tunnel flows.

This fix lets the ovs-agent explicitly indicate, in its first rpc
loop, that it has just restarted. l2pop is then required to update
the fdb entries.

Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178

Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1794991

Title:
  Inconsistent flows with DVR l2pop VxLAN on br-tun

Status in neutron:
  Fix Released

Bug description:
  We are using Neutron (Pike) configured as DVR with l2pop, ARP
  responder and VxLAN. For a few weeks we have been experiencing unexpected
  behaviors:

  - [1] Some instances are not able to get a DHCP address
  - [2] Instances are not able to ping instances on a different compute node

  This is totally random: sometimes it works as expected and sometimes
  we see the behaviors described above.

  After checking the flows between the network and compute nodes we
  discovered that behavior [1] is due to missing flows
  on the compute nodes pointing to the DHCP agent on the network node.

  Behavior [2] is also related to missing flows: some compute
  nodes are missing the output to other compute nodes (vxlan-xx), which
  prevents an instance on compute 1 from communicating with an instance on
  compute 2.

  When we add the missing flows for [1] and [2] we are able to fix the
  issues but if we restart neutron-openvswitch-agent the flows are
  missing again.
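
  One way to confirm exactly which flows disappear across a restart is to
  save the output of "ovs-ofctl dump-flows br-tun" before and after the
  restart and diff the normalized dumps; a minimal Python sketch (the file
  names are assumptions):

# diff_flows.py -- usage: python diff_flows.py before.txt after.txt
# where each file holds the output of "ovs-ofctl dump-flows br-tun".
import re
import sys

# Strip the per-flow fields that always differ between two dumps
# (counters, ages, and the agent cookie, which changes on restart).
VOLATILE = re.compile(
    r'(cookie|duration|n_packets|n_bytes|idle_age|hard_age)=[^,]+,\s*')


def normalize(path):
    flows = set()
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and the "NXST_FLOW/OFPST_FLOW reply" header.
            if not line or 'FLOW reply' in line:
                continue
            flows.add(VOLATILE.sub('', line))
    return flows


before, after = normalize(sys.argv[1]), normalize(sys.argv[2])
for flow in sorted(before - after):
    print('missing after restart:', flow)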

  For [1], sometimes simply disabling/enabling the DHCP ports on the network
  nodes solves the problem, and sometimes it does not.

  For [2] the only way we found to fix the flows without adding them
  manually is to remove all instances of a network from the compute node and
  create a new instance on this network, which sends a
  notification message to all compute and network nodes, but again when
  neutron-openvswitch-agent restarts the flows vanish.

  We cherry-picked these commits but nothing changed:
    - https://review.openstack.org/#/c/600151/
    - https://review.openstack.org/#/c/573785/

  Information about our deployment:
    - OS: Ubuntu 16.04.5
    - Deployer: Kolla
    - Docker: 18.06
    - OpenStack: Pike/Rocky

  Any ideas?

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1794991/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1813703] Re: [L2] [summary] ovs-agent issues at large scale

2019-03-22 Thread OpenStack Infra
Reviewed:  https://review.openstack.org/640797
Committed: 
https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch: master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong 
Date:   Mon Mar 4 21:17:20 2019 +0800

More accurate agent restart state transfer

Ovs-agent can be very time-consuming when handling a large number
of ports. By that point the ovs-agent status report may already have
exceeded the configured timeout, so some flow-updating operations
are never triggered. This results in flow loss during agent
restart, especially for the host-to-host vxlan tunnel flows.

This fix lets the ovs-agent explicitly indicate, in its first rpc
loop, that it has just restarted. l2pop is then required to update
the fdb entries.

Closes-Bug: #1813703
Closes-Bug: #1813714
Closes-Bug: #1813715
Closes-Bug: #1794991
Closes-Bug: #1799178

Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961


** Changed in: neutron
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1813703

Title:
  [L2] [summary] ovs-agent issues at large scale

Status in neutron:
  Fix Released

Bug description:
  [L2] [summary] ovs-agent issues at large scale

  Recently we have tested the ovs-agent with the openvswitch flow-based
  security group, and we ran into some issues at large scale. This bug will
  give us a centralized location to track the following problems.

  Problems:
  (1) RPC timeout during ovs-agent restart
  https://bugs.launchpad.net/neutron/+bug/1813704
  (2) local connection to ovs-vswitchd was drop or timeout
  https://bugs.launchpad.net/neutron/+bug/1813705
  (3) ovs-agent failed to restart
  https://bugs.launchpad.net/neutron/+bug/1813706
  (4) ovs-agent restart costs too long time  (15-40mins+)
  https://bugs.launchpad.net/neutron/+bug/1813707
  (5) unexpected flow lost
  https://bugs.launchpad.net/neutron/+bug/1813714
  (6) unexpected tunnel lost
  https://bugs.launchpad.net/neutron/+bug/1813715
  (7) multiple cookie flows (stale flows)
  https://bugs.launchpad.net/neutron/+bug/1813712
  (8) dump-flows takes a lot of time
  https://bugs.launchpad.net/neutron/+bug/1813709
  (9) really hard to troubleshoot if one VM loses its connection; the flow
tables are almost unreadable (reaching 30k+ flows).
  https://bugs.launchpad.net/neutron/+bug/1813708

  The problem can be seen in the following scenarios:
  (1) 2000-3000 ports related to one single security group (or one remote 
security group)
  (2) create 2000-3000 VMs in one single subnet (network)
  (3) create 2000-3000 VMs under one single security group

  Yes, scale is the main problem: when one host's VM count is
  approaching 150-200 (and at the same time the number of ports in one subnet
  or security group is approaching 2000), the ovs-agent restart gets worse.

  Test ENV:
  stable/queens

  Deployment topology:
  neutron-server, the database and the message queue each have their own
dedicated physical hosts, with at least 3 nodes per service.

  Configurations:
  The ovs-agent was set up with l2pop and the ovs-flow-based security group,
and the config was basically like the following:
  [agent]
  enable_distributed_routing = True
  l2_population = True
  tunnel_types = vxlan
  arp_responder = True
  prevent_arp_spoofing = True
  extensions = qos
  report_interval = 60

  [ovs]
  bridge_mappings = tenant:br-vlan,external:br-ex
  local_ip = 10.114.4.48

  [securitygroup]
  firewall_driver = openvswitch
  enable_security_group = True

  Some issue tracking:
  (1) mostly because of the great number of ports related to one security
group or one network
  (2) unnecessary RPC calls during ovs-agent restart
  (3) inefficient database query conditions
  (4) the full sync is redone again and again if any exception is raised in
rpc_loop
  (5) cleaning stale flows dumps all flows first (not once but multiple
times), which is really time-consuming

  So this is a summary bug for all the scale issues we have met.

  Some potential solutions:
  Increasing config options such as rpc_response_timeout, of_connect_timeout,
of_request_timeout, ovsdb_timeout etc.
  does not help much; these changes can make the restart take even longer,
and the issues can still be seen.

  One workaround is to disable the openvswitch flow-based security
  group; with it disabled, the ovs-agent can restart in less than 10 minutes.
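
  For reference, the knobs referred to above look roughly like this (sections
and values are illustrative only; raising them mostly trades timeouts for an
even longer restart rather than fixing the root cause):

  # neutron.conf
  [DEFAULT]
  rpc_response_timeout = 180

  # openvswitch_agent.ini
  [ovs]
  of_connect_timeout = 60
  of_request_timeout = 30
  ovsdb_timeout = 30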

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1813703/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp