[Yahoo-eng-team] [Bug 2028009] [NEW] Modifying IPv6 subnet with ipv6-address-mode set results in error

2023-07-17 Thread James Denton
Public bug reported:

Attempting to update an IPv6 subnet within Horizon that was created via
the CLI, and whose ipv6-address-mode is set (dhcpv6-stateful rather than
None), results in an error:

list index out of range

Creating the subnet via Horizon with the exact same parameters behaves
properly.

--

Looking at the DB, I compared the CLI-created subnet to the Horizon-
created subnet and found that Horizon sets the ipv6_ra_mode, too, which
I suspect is the reason for the error:

+----------------------------------+--------------------------------------+---------------+--------------------------------------+------------+-------------------------+-----------------------+-------------+-----------------+-------------------+---------------+------------------+------------+--------+
| project_id                       | id                                   | name          | network_id                           | ip_version | cidr                    | gateway_ip            | enable_dhcp | ipv6_ra_mode    | ipv6_address_mode | subnetpool_id | standard_attr_id | segment_id | in_use |
+----------------------------------+--------------------------------------+---------------+--------------------------------------+------------+-------------------------+-----------------------+-------------+-----------------+-------------------+---------------+------------------+------------+--------+
| 78c44dffa72a44c58327088142091d12 | 828b3e14-307f-4d69-bd90-3f23e39b3c10 | IPV6-stateful | 57b47430-106d-489c-8219-388790d35db4 |          6 | 2001:4801:12a1:408::/64 | 2001:4801:12a1:408::1 |           1 | NULL            | dhcpv6-stateful   | NULL          |              392 | NULL       |      1 |
| 78c44dffa72a44c58327088142091d12 | a0aa908a-0c7f-4b7c-b53a-c21f997b5bbe | IPv6-horizon  | 57b47430-106d-489c-8219-388790d35db4 |          6 | 6001:4801:12a1:408::/64 | 6001:4801:12a1:408::1 |           1 | dhcpv6-stateful | dhcpv6-stateful   | NULL          |              402 | NULL       |      1 |
+----------------------------------+--------------------------------------+---------------+--------------------------------------+------------+-------------------------+-----------------------+-------------+-----------------+-------------------+---------------+------------------+------------+--------+

Looking at https://docs.openstack.org/neutron/latest/admin/config-ipv6.html,
it makes sense why this would be the case. The CLI is a little too lax in
that regard, since it appears to allow an invalid or unsupported
configuration. I'm not sure whether the error could be made clearer.
Thanks!
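
For reference, a reproduction sketch using the values from the table above
(the report does not include the original commands, so these are assumed):

# CLI-created subnet that later fails to update in Horizon: only the
# address mode is set
openstack subnet create --network 57b47430-106d-489c-8219-388790d35db4 \
  --ip-version 6 --ipv6-address-mode dhcpv6-stateful \
  --subnet-range 2001:4801:12a1:408::/64 IPV6-stateful

# Horizon-equivalent create: the RA mode is set as well
openstack subnet create --network 57b47430-106d-489c-8219-388790d35db4 \
  --ip-version 6 --ipv6-address-mode dhcpv6-stateful \
  --ipv6-ra-mode dhcpv6-stateful \
  --subnet-range 6001:4801:12a1:408::/64 IPv6-horizon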

** Affects: horizon
 Importance: Undecided
 Status: New

[Yahoo-eng-team] [Bug 2007167] [NEW] OVN DHCP replies source from subnet gateway IP

2023-02-13 Thread James Denton
Public bug reported:

We recently switched from the DHCP agent to the built-in OVN DHCP service
for baremetal deployments.

Version: Zed
OS: 22.04 LTS
OVS: 3.0.1
OVN: 22.09

When a baremetal node is provisioned, during PXE it gets a lease from an
OVN controller but nothing further (i.e., no TFTP). Here are the DHCP
request and reply:

root@lab-infra02:~# tcpdump -i ens192 -ne port 67 or port 68
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:16:23.767513 14:02:ec:32:3e:0c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 405: vlan 6, p 0, ethertype IPv4 (0x0800), 0.0.0.0.68 > 
255.255.255.255.67: BOOTP/DHCP, Request from 14:02:ec:32:3e:0c, length 359
16:16:23.768943 fa:16:3e:1f:ab:d3 > 14:02:ec:32:3e:0c, ethertype 802.1Q 
(0x8100), length 398: vlan 6, p 0, ethertype IPv4 (0x0800), 192.168.208.1.67 > 
255.255.255.255.68: BOOTP/DHCP, Reply, length 352

I've noticed two things:

1. The MAC fa:16:3e:1f:ab:d3 is not documented in Neutron's port list (and I am 
not sure it should be), but it appears to be owned by OVN in some way.
2. The source IP 192.168.208.1 on the reply is the *gateway* IP for the 
provisioning subnet, which is a VLAN with a real external gateway *also* 
configured with 192.168.208.1.

Best I can tell, OVN is sending the DHCP reply as 192.168.208.1, which
is not in the allocation pool, as it's configured as the subnet gateway
and not used by Neutron at all. The subnet is not attached to a Neutron
router, so I am not sure why that address would be claimed. There ARE
Neutron ports with owner network:dhcp, and one of these is allocated to
lab-infra02 and listed in the logical port list in the NB DB.
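
One way to confirm what OVN is handing out (a sketch; this output was not
captured in the report):

# Inspect the DHCP options OVN serves for the subnet; the server_id option
# is what shows up as the reply source and the DHCP server-id
ovn-nbctl --columns=cidr,options list DHCP_Options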

Here is more detail on the DHCP request/reply. Notice server-id is
192.168.208.1 where it ought to be 192.168.208.202:

17:19:04.278903 14:02:ec:32:3e:0c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 393: vlan 6, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, 
id 4507, offset 0, flags [none], proto UDP (17), length 375)
0.0.0.0.68 > 255.255.255.255.67: [udp sum ok] BOOTP/DHCP, Request from 
14:02:ec:32:3e:0c, length 347, xid 0x56cdc32e, Flags [Broadcast] (0x8000)
  Client-Ethernet-Address 14:02:ec:32:3e:0c
  Vendor-rfc1048 Extensions
Magic Cookie 0x63825363
DHCP-Message (53), length 1: Discover
MSZ (57), length 2: 1464
Parameter-Request (55), length 35:
  Subnet-Mask (1), Time-Zone (2), Default-Gateway (3), Time-Server 
(4)
  IEN-Name-Server (5), Domain-Name-Server (6), Hostname (12), BS 
(13)
  Domain-Name (15), RP (17), EP (18), RSZ (22)
  TTL (23), BR (28), YD (40), YS (41)
  NTP (42), Vendor-Option (43), Requested-IP (50), Lease-Time (51)
  Server-ID (54), RN (58), RB (59), Vendor-Class (60)
  TFTP (66), BF (67), GUID (97), Unknown (128)
  Unknown (129), Unknown (130), Unknown (131), Unknown (132)
  Unknown (133), Unknown (134), Unknown (135)
GUID (97), length 17: 
0.55.53.53.50.53.56.54.67.85.54.48.49.89.82.78.48
NDI (94), length 3: 1.3.16
ARCH (93), length 2: 7
Vendor-Class (60), length 32: "PXEClient:Arch:7:UNDI:003016"
END (255), length 0
0x0000:  4500 0177 119b 0000 4011 67dc 0000 0000  E..w....@.g.....
[remainder of the hex dump garbled in the original message and omitted]

fa:16:3e:1f:ab:d3 > 14:02:ec:32:3e:0c, ethertype 802.1Q 
(0x8100), length 398: vlan 6, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, 
id 4507, offset 0, flags [none], proto UDP (17), length 380)
192.168.208.1.67 > 255.255.255.255.68: [no cksum] 

[Yahoo-eng-team] [Bug 1906406] [NEW] [segments] dnsmasq can't delete lease for instance due to mismatch between client ip and local addr

2020-12-01 Thread James Denton
Public bug reported:

Issue:

The Neutron DHCP agent bootstraps the DHCP leases file for a network
using all associated subnets[1]. In a multi-segment environment,
however, a DHCP agent can only service a single segment/subnet of a
given network.

The DHCP namespace, then, is configured with an interface containing a
single IP address for the respective segment/subnet it's servicing. When
a VM from the same network but a different segment/subnet is deleted, the
DHCP release packet that would be issued by dhcp_release isn't sent, due
to a mismatch between the client IP and the local address.
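
For context, the agent invokes dhcp_release roughly like this (a sketch;
the namespace interface name here is hypothetical):

# dnsmasq-utils usage: dhcp_release <interface> <address> <MAC> [<client_id>]
# Run inside the qdhcp namespace; the release packet is not sent when the
# namespace interface holds no IP on the client's subnet
ip netns exec qdhcp-0e4fa560-1483-4ac5-be44-0542503f1e5a \
  dhcp_release ns-xxxxxxxx-xx 10.206.0.53 fa:16:3e:ce:b1:b5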

Brian Haley patched dhcp_release.c recently to fix a similar issue here:

http://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commitdiff;h=d9f882bea2806799bf3d1f73937f5e72d0bfc650;hp=fef2f1c75eba56b7355cbe729e4362474d558aa4;ds=sidebyside

We can probably update dnsmasq-utils in the short term, but maybe making
the DHCP agent segment-aware is a better long-term solution?

Here are the steps to reproduce:

-=-=-=-=-

Network: rpn_multisegment

Segment 1:
VLAN 106 10.106.0.0/24
Provider Mapping: physnet1:bond1

Segment 2:
VLAN 206 10.206.0.0/24
Provider Mapping: physnet2:bond1

Two VMs:

OpenStack Lab % openstack server list
+--------------------------------------+---------+--------+-------------------------------+--------------------+----------------+
| ID                                   | Name    | Status | Networks                      | Image              | Flavor         |
+--------------------------------------+---------+--------+-------------------------------+--------------------+----------------+
| 40f94b68-7e38-45b6-855d-792399c2a9ff | vm-seg2 | ACTIVE | rpn_multisegment=10.206.0.53  | bionic-osa-master  | osa-dev-8-8-60 |
| 34f8ff53-e505-4267-a13a-b881dfcec240 | vm-seg1 | ACTIVE | rpn_multisegment=10.106.0.98  | bionic-osa-master  | osa-dev-8-8-60 |
+--------------------------------------+---------+--------+-------------------------------+--------------------+----------------+

On compute01, we can see the host file populated with entries for each
subnet associated with the network:

root@lab-compute01:~# cat 
/var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/host
fa:16:3e:07:f7:af,host-10-206-0-2.openstacklocal,10.206.0.2
fa:16:3e:2c:da:6d,host-10-106-0-2.openstacklocal,10.106.0.2
fa:16:3e:46:7b:d1,host-10-106-0-98.openstacklocal,10.106.0.98
fa:16:3e:ce:b1:b5,host-10-206-0-53.openstacklocal,10.206.0.53

Same on compute02:


root@lab-compute02:~# cat 
/var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/host
fa:16:3e:07:f7:af,host-10-206-0-2.openstacklocal,10.206.0.2
fa:16:3e:2c:da:6d,host-10-106-0-2.openstacklocal,10.106.0.2
fa:16:3e:46:7b:d1,host-10-106-0-98.openstacklocal,10.106.0.98
fa:16:3e:ce:b1:b5,host-10-206-0-53.openstacklocal,10.206.0.53

The leases file, however, contains only those hosts that have obtained
leases (expected):

root@lab-compute01:~# cat 
/var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606916842 fa:16:3e:46:7b:d1 10.106.0.98 host-10-106-0-98 
ff:b5:5e:67:ff:00:02:00:00:ab:11:9e:a5:86:fd:ae:2f:49:ad
1606916738 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606916738 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

root@lab-compute02:~# cat 
/var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606916917 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 
ff:b5:5e:67:ff:00:02:00:00:ab:11:9e:a5:86:fd:ae:2f:49:ad
1606916626 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

Everything looks OK so far.

When restarting the neutron-dhcp-agent, however, the leases file is
bootstrapped and contains entries for all subnets associated with the
network:

root@lab-compute01:~# cat 
/var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606917246 fa:16:3e:46:7b:d1 10.106.0.98 host-10-106-0-98 *
1606917246 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606917246 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 *
1606917246 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

root@lab-compute02:~# cat 
/var/lib/neutron/dhcp/0e4fa560-1483-4ac5-be44-0542503f1e5a/leases
1606917254 fa:16:3e:46:7b:d1 10.106.0.98 host-10-106-0-98 *
1606917254 fa:16:3e:2c:da:6d 10.106.0.2 host-10-106-0-2 *
1606917254 fa:16:3e:ce:b1:b5 10.206.0.53 host-10-206-0-53 *
1606917254 fa:16:3e:07:f7:af 10.206.0.2 host-10-206-0-2 *

This configuration becomes a problem when a VM is deleted and
dhcp_release is executed, as the namespaces on each host only have
an IP from their respective segment and will not be able to delete a
lease for what is essentially a non-connected subnet:

root@lab-compute01:~# ip netns exec qdhcp-0e4fa560-1483-4ac5-be44-0542503f1e5a 
ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group 
default qlen 1000
[remainder of output truncated in the original message]

[Yahoo-eng-team] [Bug 1879747] [NEW] Manual install & Configuration in neutron

2020-05-20 Thread James Denton
Public bug reported:


This bug tracker is for errors with the documentation; use the following
as a template and remove or add fields as you see fit. Convert [ ] into
[x] to check boxes:

- [x] This doc is inaccurate in this way: service_plugin path is incorrect 
since rolling OVN into Neutron
- [ ] This is a doc addition request.
- [x] I have a fix to the document that I can paste below including example: 
input and output. 

If you have a troubleshooting or support issue, use the following
resources:

 - Ask OpenStack: https://ask.openstack.org
 - The mailing list: https://lists.openstack.org
 - IRC: 'openstack' channel on Freenode

---
Release: 16.1.0.dev94 on 2020-01-08 17:10:46
SHA: 182f345018d9c544464f344741e1a6885376cd0f
Source: 
https://opendev.org/openstack/neutron/src/doc/source/install/ovn/manual_install.rst
URL: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html

-=-=-=-=-=-

The docs mention that OVN is rolled into Neutron in recent releases,
which means that the networking_ovn path is no longer valid when
defining the service plugin.

Incorrect: service_plugins = networking_ovn.l3.l3_ovn.OVNL3RouterPlugin
Correct: service_plugins = neutron.services.ovn_l3.plugin.OVNL3RouterPlugin

** Affects: neutron
 Importance: Undecided
 Status: New



[Yahoo-eng-team] [Bug 1870228] [NEW] cloud-init metadata fallback broken

2020-04-01 Thread James Denton
Public bug reported:

I came across an issue today for a user who was experiencing problems
connecting to metadata at 169.254.169.254. For a long time, cloud-init
has had a fallback mechanism that allowed it to contact the metadata
service at http://<dhcp-server>/latest/meta-data if
http://169.254.169.254/latest/meta-data were unavailable, like so:

[  157.574921] cloud-init[1313]: 2020-03-31 09:53:24,158 - 
url_helper.py[WARNING]: Calling 
'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [50/120s]: 
request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries 
exceeded with url: /2009-04-04/meta-data/instance-id (Caused by 
ConnectTimeoutError(, 'Connection to 169.254.169.254 timed out. (connect 
timeout=50.0)'))]
[  208.629083] cloud-init[1313]: 2020-03-31 09:54:15,214 - 
url_helper.py[WARNING]: Calling 
'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [101/120s]: 
request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries 
exceeded with url: /2009-04-04/meta-data/instance-id (Caused by 
ConnectTimeoutError(, 'Connection to 169.254.169.254 timed out. (connect 
timeout=50.0)'))]
[  226.639267] cloud-init[1313]: 2020-03-31 09:54:33,224 - 
url_helper.py[WARNING]: Calling 
'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: 
request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries 
exceeded with url: /2009-04-04/meta-data/instance-id (Caused by 
ConnectTimeoutError(, 'Connection to 169.254.169.254 timed out. (connect 
timeout=17.0)'))]
[  227.640812] cloud-init[1313]: 2020-03-31 09:54:34,225 - 
DataSourceEc2.py[CRITICAL]: Giving up on md from 
['http://169.254.169.254/2009-04-04/meta-data/instance-id'] after 120 seconds
[  227.651134] cloud-init[1313]: 2020-03-31 09:54:34,236 - 
url_helper.py[WARNING]: Calling 
'http://10.19.48.2/latest/meta-data/instance-id' failed [0/120s]: request error 
[('Connection aborted.', error(111, 'Connection refused'))]
[  228.655226] cloud-init[1313]: 2020-03-31 09:54:35,240 - 
url_helper.py[WARNING]: Calling 
'http://10.19.48.2/latest/meta-data/instance-id' failed [1/120s]: request error 
[('Connection aborted.', error(111, 'Connection refused'))]

In this Stein environment, isolated metadata is enabled, and the qdhcp
namespace has a listener at 169.254.169.254:80. Previous versions of
Neutron had the listener on 0.0.0.0:80, which helped facilitate the
fallback mechanism described above. The bug/patch where this was changed
is here:

[1] https://bugs.launchpad.net/neutron/+bug/1745618
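
A quick way to see the difference (a sketch; substitute your own network ID):

# Check where the isolated-metadata proxy listens inside the qdhcp namespace
ip netns exec qdhcp-<network-id> ss -lnt 'sport = :80'
# Stein (this environment): LISTEN ... 169.254.169.254:80
# Pre-change behavior:      LISTEN ... 0.0.0.0:80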

Having this functionality back would be nice. Thoughts?

** Affects: neutron
 Importance: Undecided
 Status: New


[Yahoo-eng-team] [Bug 1865223] [NEW] [scale issue] regression for security group list between Newton and Rocky+

2020-02-28 Thread James Denton
Public bug reported:

We recently upgraded an environment from Newton -> Rocky, and
experienced a dramatic increase in the amount of time it takes to return
a full security group list. For ~8,000 security groups, it takes nearly
75 seconds. This was not observed in Newton.

I was able to replicate this in the following 4 environments:

Newton (virtual machine)
Rocky (baremetal)
Stein (virtual machine)
Train (baremetal)

Command: openstack security group list

> Sec Grps vs. Seconds

Qty    Newton VM  Rocky BM  Stein VM  Train BM
200    4.1        3.7       5.4       5.2
500    5.3        7         11        9.4
1000   7.2        12.4      19.2      16
2000   9.2        24.2      35.3      30.7
3000   12.1       36.5      52        44
4000   16.1       47.2      73        58.9
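
A rough way to reproduce the timings above (a sketch; assumes admin
credentials are sourced and the group names are hypothetical):

# Create N empty security groups, then time a full list
for i in $(seq 1 2000); do
  openstack security group create "sg-scale-$i" > /dev/null
done
time openstack security group list > /dev/null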

At this time, we do not know if this increase in time extends to other
'list' commands at scale. The 'show' commands appear to be fairly
performant. This increase in time does have a negative impact on user
perception, scripts, other dependent resources, etc. The Stein VM is
slower than Train, but that could be due to VM vs. BM. The Newton
environment is virtual, too, so I would expect even better performance
on bare metal.

Any assistance or insight into what might have changed between releases
to cause this would be helpful.

** Affects: neutron
 Importance: Undecided
 Status: New



[Yahoo-eng-team] [Bug 1837252] [NEW] IFLA_BR_AGEING_TIME of 0 causes flooding across bridges

2019-07-19 Thread James Denton
Public bug reported:

Release: OpenStack Stein
Driver: LinuxBridge

Using Stein w/ the LinuxBridge mech driver/agent, we have found that
traffic is being flooded across bridges. Using tcpdump inside an
instance, you can see unicast traffic for other instances.

We have confirmed that the MACs table shows the ageing timer set to 0 for
permanent entries, and that the bridge is NOT learning new MACs:

root@lab-compute01:~# brctl showmacs brqd0084ac0-f7
port no mac addris local?   ageing timer
  5 24:be:05:a3:1f:e1   yes0.00
  5 24:be:05:a3:1f:e1   yes0.00
  1 fe:16:3e:02:62:18   yes0.00
  1 fe:16:3e:02:62:18   yes0.00
  7 fe:16:3e:07:65:47   yes0.00
  7 fe:16:3e:07:65:47   yes0.00
  4 fe:16:3e:1d:d6:33   yes0.00
  4 fe:16:3e:1d:d6:33   yes0.00
  9 fe:16:3e:2b:2f:f0   yes0.00
  9 fe:16:3e:2b:2f:f0   yes0.00
  8 fe:16:3e:3c:42:64   yes0.00
  8 fe:16:3e:3c:42:64   yes0.00
 10 fe:16:3e:5c:a6:6c   yes0.00
 10 fe:16:3e:5c:a6:6c   yes0.00
  2 fe:16:3e:86:9c:dd   yes0.00
  2 fe:16:3e:86:9c:dd   yes0.00
  6 fe:16:3e:91:9b:45   yes0.00
  6 fe:16:3e:91:9b:45   yes0.00
 11 fe:16:3e:b3:30:00   yes0.00
 11 fe:16:3e:b3:30:00   yes0.00
  3 fe:16:3e:dc:c3:3e   yes0.00
  3 fe:16:3e:dc:c3:3e   yes0.00

root@lab-compute01:~# bridge fdb show | grep brqd0084ac0-f7
01:00:5e:00:00:01 dev brqd0084ac0-f7 self permanent
fe:16:3e:02:62:18 dev tap74af38f9-2e master brqd0084ac0-f7 permanent
fe:16:3e:02:62:18 dev tap74af38f9-2e vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:86:9c:dd dev tapb00b3c18-b3 master brqd0084ac0-f7 permanent
fe:16:3e:86:9c:dd dev tapb00b3c18-b3 vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:dc:c3:3e dev tap7284d235-2b master brqd0084ac0-f7 permanent
fe:16:3e:dc:c3:3e dev tap7284d235-2b vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:1d:d6:33 dev tapbeb9441a-99 vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:1d:d6:33 dev tapbeb9441a-99 master brqd0084ac0-f7 permanent
24:be:05:a3:1f:e1 dev eno1.102 vlan 1 master brqd0084ac0-f7 permanent
24:be:05:a3:1f:e1 dev eno1.102 master brqd0084ac0-f7 permanent
fe:16:3e:91:9b:45 dev tapc8ad2cec-90 master brqd0084ac0-f7 permanent
fe:16:3e:91:9b:45 dev tapc8ad2cec-90 vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:07:65:47 dev tap86e2c412-24 master brqd0084ac0-f7 permanent
fe:16:3e:07:65:47 dev tap86e2c412-24 vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:3c:42:64 dev tap37bcb70e-9e master brqd0084ac0-f7 permanent
fe:16:3e:3c:42:64 dev tap37bcb70e-9e vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:2b:2f:f0 dev tap40f6be7c-2d master brqd0084ac0-f7 permanent
fe:16:3e:b3:30:00 dev tap6548bacb-c0 vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:b3:30:00 dev tap6548bacb-c0 master brqd0084ac0-f7 permanent
fe:16:3e:5c:a6:6c dev tap61107236-1e vlan 1 master brqd0084ac0-f7 permanent
fe:16:3e:5c:a6:6c dev tap61107236-1e master brqd0084ac0-f7 permanent

The ageing time for the bridge is set to 0:

root@lab-compute01:~# brctl showstp brqd0084ac0-f7
brqd0084ac0-f7
 bridge id  8000.24be05a31fe1
 designated root8000.24be05a31fe1
 root port 0path cost  0
 max age  20.00 bridge max age20.00
 hello time2.00 bridge hello time  2.00
 forward delay 0.00 bridge forward delay   0.00
 ageing time   0.00
 hello timer   0.00 tcn timer  0.00
 topology change timer 0.00 gc timer   0.00
 flags

The default ageing time of 300 is being overridden by the value set
here:

Stein: https://github.com/openstack/os-vif/blob/stable/stein/os_vif/internal/command/ip/linux/impl_pyroute2.py#L89

Master: https://github.com/openstack/os-vif/blob/master/os_vif/internal/ip/linux/impl_pyroute2.py#L89

I am not sure of the behavior in OVS environments using the iptables
firewall, but I have confirmed that the 'qbr' bridges also have an
ageing time of 0 (formerly 300).
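
A possible interim workaround (a sketch, not a fix for os-vif itself) is
to restore the kernel default on an affected bridge:

# Restore a 300-second ageing time so the bridge learns MACs again
brctl setageing brqd0084ac0-f7 300

# iproute2 equivalent; ageing_time is expressed in centiseconds
ip link set dev brqd0084ac0-f7 type bridge ageing_time 30000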

Please let me know if you have any questions.

** Affects: neutron
 Importance: Undecided
 Status: New

** Affects: nova
 Importance: Undecided
 Status: New

** Affects: os-vif
 Importance: Undecided
 Status: New

** Also affects: neutron
   Importance: Undecided
   Status: New

** Also affects: nova
   Importance: Undecided
   Status: New


[Yahoo-eng-team] [Bug 1643991] Re: 504 Gateway Timeout when creating a port

2019-02-13 Thread James Denton
** Changed in: openstack-ansible
   Status: Confirmed => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1643991

Title:
  504 Gateway Timeout when creating a port

Status in neutron:
  Invalid
Status in openstack-ansible:
  Invalid

Bug description:
  We are using OpenStack installed on containers and trying to create ports or 
networks using the Neutron CLI, but we receive this error message on the CLI: 
"504 Gateway Time-out: The server didn't respond in time." There is no error in 
the Neutron logs. Sometimes networks or ports are created and sometimes not.
  In another test, we try to perform a deploy; Neutron creates and then deletes 
the network or port, and the same error message appears in the Ironic log.

  The error message is like this:
  https://bugs.launchpad.net/fuel/+bug/1540346

  
  Error log from CLI:

  DEBUG: neutronclient.v2_0.client Error message:
  504 Gateway Time-out: The server didn't respond in time.

  DEBUG: neutronclient.v2_0.client POST call to neutron for 
http://XXX.XX.XXX.XXX:9696/v2.0/ports.json used request id None
  ERROR: neutronclient.shell 504 Gateway Time-out
  The server didn't respond in time.
  
  Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/neutronclient/shell.py", line 
877, in run_subcommand
  return run_command(cmd, cmd_parser, sub_argv)
File "/usr/local/lib/python2.7/dist-packages/neutronclient/shell.py", line 
114, in run_command
  return cmd.run(known_args)
File 
"/usr/local/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py",
 line 324, in run
  return super(NeutronCommand, self).run(parsed_args)
File "/usr/local/lib/python2.7/dist-packages/cliff/display.py", line 100, 
in run
  column_names, data = self.take_action(parsed_args)
File 
"/usr/local/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py",
 line 407, in take_action
  data = obj_creator(body)
File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", 
line 750, in create_port
  return self.post(self.ports_path, body=body)
File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", 
line 365, in post
  headers=headers, params=params)
File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", 
line 300, in do_request
  self._handle_fault_response(status_code, replybody, resp)
File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", 
line 275, in _handle_fault_response
  exception_handler_v20(status_code, error_body)
File "/usr/local/lib/python2.7/dist-packages/neutronclient/v2_0/client.py", 
line 91, in exception_handler_v20
  request_ids=request_ids)
  NeutronClientException: 504 Gateway Time-out
  The server didn't respond in time.
  

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1643991/+subscriptions



[Yahoo-eng-team] [Bug 1807400] Re: networksegments table in neutron can not be cleared automatically

2019-02-13 Thread James Denton
Marking invalid for OSA. If this is still an issue, please submit it
against the Neutron project.

** Changed in: openstack-ansible
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1807400

Title:
  networksegments table in neutron can not be cleared automatically

Status in neutron:
  Invalid
Status in openstack-ansible:
  Invalid

Bug description:
  The _process_port_binding function in neutron/plugins/ml2/plugin.py uses
  clear_binding_levels to clear the ml2_port_binding_levels table, but it
  will not do anything to networksegments under the hierarchical port
  binding condition:

  @db_api.context_manager.writer
  def clear_binding_levels(context, port_id, host):
      if host:
          for l in (context.session.query(models.PortBindingLevel).
                    filter_by(port_id=port_id, host=host)):
              context.session.delete(l)
          LOG.debug("For port %(port_id)s, host %(host)s, "
                    "cleared binding levels",
                    {'port_id': port_id,
                     'host': host})
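
  A sketch of how one might spot the leftover rows (assumed schema:
  networksegments.is_dynamic and ml2_port_binding_levels.segment_id):

  # Find dynamic network segments that no port binding level references
  mysql neutron -e "
    SELECT ns.id, ns.network_id
    FROM networksegments ns
    LEFT JOIN ml2_port_binding_levels pbl ON pbl.segment_id = ns.id
    WHERE ns.is_dynamic = 1 AND pbl.segment_id IS NULL;"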

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1807400/+subscriptions



[Yahoo-eng-team] [Bug 1754062] [NEW] openstack client does not pass prefixlen when creating subnet

2018-03-07 Thread James Denton
Public bug reported:

Version: Pike
OpenStack Client: 3.12.0

When testing Subnet Pool functionality, I found that the behavior
between the openstack and neutron clients is different.

Subnet pool:

root@controller01:~# openstack subnet pool show MySubnetPool
+---+--+
| Field | Value|
+---+--+
| address_scope_id  | None |
| created_at| 2018-03-07T13:18:22Z |
| default_prefixlen | 8|
| default_quota | None |
| description   |  |
| id| e49703d8-27f4-4a16-9bf4-91a6cf00fff3 |
| ip_version| 4|
| is_default| False|
| max_prefixlen | 32   |
| min_prefixlen | 8|
| name  | MySubnetPool |
| prefixes  | 172.31.0.0/16|
| project_id| 9233b6b4f6a54386af63c0a7b8f043c2 |
| revision_number   | 0|
| shared| False|
| tags  |  |
| updated_at| 2018-03-07T13:18:22Z |
+---+--+

When attempting to create a /28 subnet from that pool with the openstack
client, the following error is observed:

root@controller01:~# openstack subnet create \
> --subnet-pool MySubnetPool \
> --prefix-length 28 \
> --network MyVLANNetwork2 \
> MyFlatSubnetFromPool
HttpException: Internal Server Error (HTTP 500) (Request-ID: 
req-61b3f00a-9764-4bcb-899d-e85d66f54e5a), Failed to allocate subnet: 
Insufficient prefix space to allocate subnet size /8.

However, the same request is successful with the neutron client:

root@controller01:~# neutron subnet-create --subnetpool MySubnetPool 
--prefixlen 28 --name MySubnetFromPool MyVLANNetwork2
neutron CLI is deprecated and will be removed in the future. Use openstack CLI 
instead.
Created a new subnet:
+---+---+
| Field | Value |
+---+---+
| allocation_pools  | {"start": "172.31.0.2", "end": "172.31.0.14"} |
| cidr  | 172.31.0.0/28 |
| created_at| 2018-03-07T13:35:35Z  |
| description   |   |
| dns_nameservers   |   |
| enable_dhcp   | True  |
| gateway_ip| 172.31.0.1|
| host_routes   |   |
| id| 43cb9dda-1b7e-436d-9dc1-5312866a1b63  |
| ip_version| 4 |
| ipv6_address_mode |   |
| ipv6_ra_mode  |   |
| name  | MySubnetFromPool  |
| network_id| e01ca743-607c-4a94-9176-b572a46fba84  |
| project_id| 9233b6b4f6a54386af63c0a7b8f043c2  |
| revision_number   | 0 |
| service_types |   |
| subnetpool_id | e49703d8-27f4-4a16-9bf4-91a6cf00fff3  |
| tags  |   |
| tenant_id | 9233b6b4f6a54386af63c0a7b8f043c2  |
| updated_at| 2018-03-07T13:35:35Z  |
+---+---+

The payload is different between these clients: the openstack client
fails to send the prefixlen key.

openstack client:

REQ: curl -g -i -X POST http://controller01:9696/v2.0/subnets -H "User-Agent: 
openstacksdk/0.9.17 keystoneauth1/3.1.0 python-requests/2.18.1 CPython/2.7.12" 
-H "Content-Type: application/json" -H "X-Auth-Token: 
{SHA1}ec04a71699eee2c70dc4abb35037de272523fef0" -d '{"subnet": {"network_id": 
"e01ca743-607c-4a94-9176-b572a46fba84", "ip_version": 4, "name": 
"MyFlatSubnetFromPool", "subnetpool_id": 
"e49703d8-27f4-4a16-9bf4-91a6cf00fff3"}}'
http://controller01:9696 "POST /v2.0/subnets HTTP/1.1" 500 160
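
For comparison, a request that includes the prefixlen key should succeed
(a sketch; the token is elided):

# Same create request with the prefixlen key the openstack client omits
curl -g -i -X POST http://controller01:9696/v2.0/subnets \
  -H "Content-Type: application/json" -H "X-Auth-Token: $TOKEN" \
  -d '{"subnet": {"network_id": "e01ca743-607c-4a94-9176-b572a46fba84",
       "ip_version": 4, "name": "MyFlatSubnetFromPool",
       "subnetpool_id": "e49703d8-27f4-4a16-9bf4-91a6cf00fff3",
       "prefixlen": 28}}'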

neutron client:

REQ: curl -g -i -X POST http://controller01:9696/v2.0/subnets -H "User-
Agent: python-neutronclient" -H "Content-Type: application/json" -H
"Accept: application/json" -H "X-Auth-Token:
{SHA1}b3b6f0fa14c2b28c5c9784f857ee753455c1d375" -d '{"subnet":
{"network_id": [request truncated in the original message]

[Yahoo-eng-team] [Bug 1734445] [NEW] Customize and configure the Dashboard in horizon

2017-11-25 Thread James Denton
Public bug reported:


- [x] This doc is inaccurate in this way: The documentation states that the 
'openstack-dashboard-ubuntu-theme' package can be removed to revert to the 
default Horizon theme. However, that package did not appear to be installed on 
my system via the 'openstack-dashboard' package. 

In local_settings.py, modifying 'DEFAULT_THEME' from 'ubuntu' to
'default' and restarting Apache seems to work just as well, and may be
a more reliable approach.

DEFAULT_THEME = 'default'

---
Release: 12.0.2.dev4 on 2017-11-22 01:48
SHA: 12cfe72f193be6de6511257bd2b901b47d1450d3
Source: 
https://git.openstack.org/cgit/openstack/horizon/tree/doc/source/admin/customize-configure.rst
URL: https://docs.openstack.org/horizon/pike/admin/customize-configure.html

** Affects: horizon
 Importance: Undecided
 Status: New


** Tags: documentation



[Yahoo-eng-team] [Bug 1734341] [NEW] Install and configure (Ubuntu) in glance

2017-11-24 Thread James Denton
Public bug reported:


- [x] This doc is inaccurate in this way: The guide is missing the steps
needed to create the Glance database. It mentions using the mysql
client, but does not include the commands to create glance DB and user.
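
For reference, the missing steps are presumably along these lines (a
sketch; GLANCE_DBPASS is a placeholder):

# Create the glance database and grant access to the glance user
mysql -u root -p <<'EOF'
CREATE DATABASE glance;
GRANT ALL PRIVILEGES ON glance.* TO 'glance'@'localhost' IDENTIFIED BY 'GLANCE_DBPASS';
GRANT ALL PRIVILEGES ON glance.* TO 'glance'@'%' IDENTIFIED BY 'GLANCE_DBPASS';
EOF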

---
Release: 15.0.1.dev1 on 'Mon Aug 7 01:28:54 2017, commit 9091d26'
SHA: 9091d262afb120fd077bae003d52463f833a4fde
Source: 
https://git.openstack.org/cgit/openstack/glance/tree/doc/source/install/install-ubuntu.rst
URL: https://docs.openstack.org/glance/pike/install/install-ubuntu.html

** Affects: glance
 Importance: Undecided
 Status: New



[Yahoo-eng-team] [Bug 1732067] [NEW] openvswitch firewall flows cause flooding on integration bridge

2017-11-13 Thread James Denton
Public bug reported:

Environment: OpenStack Newton
Driver: ML2 w/ OVS
Firewall: openvswitch

In this environment, we have observed OVS flooding network traffic
across all ports in a given VLAN on the integration bridge due to the
lack of an FDB entry for the destination MAC address. Across a large
fleet of 240+ nodes, this is causing a considerable amount of noise on
any given node.

In this test, we have 3 machines:

Client: fa:16:3e:e8:59:00 (10.10.60.2)
Server: fa:16:3e:80:cb:0a (10.10.60.9)
Bystander: fa:16:3e:a0:ee:02 (10.10.60.10)

The server is running a web server using netcat:

while true ; do sudo nc -l -p 80 < index.html ; done

Client requests page using curl:

ip netns exec qdhcp-b07e6cb3-0943-45a2-b5ff-efb7e99e4d3d curl
http://10.10.60.9/

We should expect to see the communication limited to the client and
server. However, the captures below reflect the server->client responses
being broadcast out all tap interfaces connected to br-int in the same
local vlan:

root@osa-newton-ovs-compute01:~# tcpdump -i tap5f03424d-1c -ne port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tap5f03424d-1c, link-type EN10MB (Ethernet), capture size 262144 
bytes
02:20:30.190675 fa:16:3e:e8:59:00 > fa:16:3e:80:cb:0a, ethertype IPv4 (0x0800), 
length 74: 10.10.60.2.54796 > 10.10.60.9.80: Flags [S], seq 213484442, win 
29200, options [mss 1460,sackOK,TS val 140883559 ecr 0,nop,wscale 7], length 0
02:20:30.191926 fa:16:3e:80:cb:0a > fa:16:3e:e8:59:00, ethertype IPv4 (0x0800), 
length 74: 10.10.60.9.80 > 10.10.60.2.54796: Flags [S.], seq 90006557, ack 
213484443, win 14480, options [mss 1460,sackOK,TS val 95716 ecr 
140883559,nop,wscale 4], length 0
02:20:30.192837 fa:16:3e:e8:59:00 > fa:16:3e:80:cb:0a, ethertype IPv4 (0x0800), 
length 66: 10.10.60.2.54796 > 10.10.60.9.80: Flags [.], ack 1, win 229, options 
[nop,nop,TS val 140883560 ecr 95716], length 0
02:20:30.192986 fa:16:3e:e8:59:00 > fa:16:3e:80:cb:0a, ethertype IPv4 (0x0800), 
length 140: 10.10.60.2.54796 > 10.10.60.9.80: Flags [P.], seq 1:75, ack 1, win 
229, options [nop,nop,TS val 140883560 ecr 95716], length 74: HTTP: GET / 
HTTP/1.1
02:20:30.195806 fa:16:3e:80:cb:0a > fa:16:3e:e8:59:00, ethertype IPv4 (0x0800), 
length 79: 10.10.60.9.80 > 10.10.60.2.54796: Flags [P.], seq 1:14, ack 1, win 
905, options [nop,nop,TS val 95717 ecr 140883560], length 13: HTTP
02:20:30.196207 fa:16:3e:e8:59:00 > fa:16:3e:80:cb:0a, ethertype IPv4 (0x0800), 
length 66: 10.10.60.2.54796 > 10.10.60.9.80: Flags [.], ack 14, win 229, 
options [nop,nop,TS val 140883561 ecr 95717], length 0
02:20:30.197481 fa:16:3e:80:cb:0a > fa:16:3e:e8:59:00, ethertype IPv4 (0x0800), 
length 66: 10.10.60.9.80 > 10.10.60.2.54796: Flags [.], ack 75, win 905, 
options [nop,nop,TS val 95717 ecr 140883560], length 0

^^^ On the server tap we see the bi-directional traffic

root@osa-newton-ovs-compute01:/home/ubuntu# tcpdump -i tapb8051da9-60 -ne port 
80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on tapb8051da9-60, link-type EN10MB (Ethernet), capture size 262144 
bytes
02:20:30.192165 fa:16:3e:80:cb:0a > fa:16:3e:e8:59:00, ethertype IPv4 (0x0800), 
length 74: 10.10.60.9.80 > 10.10.60.2.54796: Flags [S.], seq 90006557, ack 
213484443, win 14480, options [mss 1460,sackOK,TS val 95716 ecr 
140883559,nop,wscale 4], length 0
02:20:30.195827 fa:16:3e:80:cb:0a > fa:16:3e:e8:59:00, ethertype IPv4 (0x0800), 
length 79: 10.10.60.9.80 > 10.10.60.2.54796: Flags [P.], seq 1:14, ack 1, win 
905, options [nop,nop,TS val 95717 ecr 140883560], length 13: HTTP
02:20:30.197500 fa:16:3e:80:cb:0a > fa:16:3e:e8:59:00, ethertype IPv4 (0x0800), 
length 66: 10.10.60.9.80 > 10.10.60.2.54796: Flags [.], ack 75, win 905, 
options [nop,nop,TS val 95717 ecr 140883560], length 0

^^^ On the bystander tap we see the flooded traffic

The FDB tables reflect the lack of a CAM entry for the client on the
br-int bridge. I would expect to see the MAC address on the patch uplink:

root@osa-newton-ovs-compute01:/home/ubuntu# ovs-appctl fdb/show br-int | grep 
'fa:16:3e:e8:59:00'
root@osa-newton-ovs-compute01:/home/ubuntu# ovs-appctl fdb/show br-provider | 
grep 'fa:16:3e:e8:59:00'
2   850  fa:16:3e:e8:59:00    3
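
The table 82 entries referenced below can be dumped with something like
(a sketch):

# Dump the openvswitch firewall driver's flows in table 82 on br-int
ovs-ofctl dump-flows br-int table=82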

Sources[1] point to the fact that an 'output' action negates the MAC learning 
mechanism in OVS. Related Table 82 entries are below, and code is here[2]:

cookie=0x94ebb7913c37a0ec, duration=415.490s, table=82, n_packets=5, 
n_bytes=424, idle_age=31, 
priority=70,ct_state=+est-rel-rpl,tcp,reg5=0xd,dl_dst=fa:16:3e:80:cb:0a,tp_dst=80
 actions=strip_vlan,output:13
cookie=0x94ebb7913c37a0ec, duration=415.489s, table=82, n_packets=354, 
n_bytes=35229, idle_age=154, 
priority=70,ct_state=+est-rel-rpl,tcp,reg5=0xd,dl_dst=fa:16:3e:80:cb:0a,tp_dst=22
 actions=strip_vlan,output:13
cookie=0x94ebb7913c37a0ec, duration=415.489s, table=82, n_packets=1, 
n_bytes=78, idle_age=154, 
priority=70,ct_state=+new-est,tcp,reg5=0xd,dl_dst=fa:16:3e:80:cb:0a,tp_dst=80 
[remainder of flow output truncated in the original message]

[Yahoo-eng-team] [Bug 1731953] [NEW] Modifying security groups when using openvswitch firewall causes existing connections to drop

2017-11-13 Thread James Denton
Public bug reported:

Environment: OpenStack Newton
Driver: ML2 w/ OVS
Firewall: openvswitch

Clients using an OpenStack cloud based on the Newton release are facing
network issues when updating security groups/rules. We are able to
replicate the issue by modifying security group rules in an existing
security group applied to a port.

Test scenario:
--
1. Built a test instance. Example:

root@osctrl-utility-container-8ad9622f:~# openstack server show 
rackspace-jamesdenton-01
WARNING: openstackclient.common.utils is deprecated and will be removed after 
Jun 2017. Please use osc_lib.utils
+--++
| Field| Value  
|
+--++
| OS-DCF:diskConfig| MANUAL 
|
| OS-EXT-AZ:availability_zone  | nova   
|
| OS-EXT-SRV-ATTR:host | oscomp-h126
|
| OS-EXT-SRV-ATTR:hypervisor_hostname  | oscomp-h126
|
| OS-EXT-SRV-ATTR:instance_name| instance-00014fed  
|
| OS-EXT-STS:power_state   | Running
|
| OS-EXT-STS:task_state| None   
|
| OS-EXT-STS:vm_state  | active 
|
| OS-SRV-USG:launched_at   | 2017-11-13T14:57:09.00 
|
| OS-SRV-USG:terminated_at | None   
|
| accessIPv4   |
|
| accessIPv6   |
|
| addresses| 
Public=2001::::f816:3eff:fef2:457a, 192.168.2.200  |
| config_drive |
|
| created  | 2017-11-13T14:56:54Z   
|
| flavor   | m1.medium (103)
|
| hostId   | 
1599f0caa6bb0775a5b8b2b4ee76a23a9135e9d84e7844c53543541f   |
| id   | 5d5afb5b-778c-46fc-8dbb-31c62a4e45d5   
|
| image| Ubuntu-Trusty-20170310 
(80267974-d0fc-4016-9338-3a057671782a)  |
| key_name | rpc_support
|
| name | rackspace-jamesdenton-01   
|
| os-extended-volumes:volumes_attached | [] 
|
| progress | 0  
|
| project_id   | 723cdf11c4dd41ca9eeb47cb0576eb71   
|
| properties   |
|
| security_groups  | [{u'name': u'rpc-support'}]
|
| status   | ACTIVE 
|
| updated  | 2017-11-13T14:57:10Z   
|
| user_id  | 74cebd9525a843fcb374af1ea3a91fea   
|
+--++

2. Initiate a 4G image download from the VM

# wget -4 -O /dev/null
http://centos.mirror.constant.com/7.4.1708/isos/x86_64/CentOS-7-x86_64-DVD-1708.iso

--2017-11-13 15:00:59--  
http://centos.mirror.constant.com/7.4.1708/isos/x86_64/CentOS-7-x86_64-DVD-1708.iso
Resolving centos.mirror.constant.com (centos.mirror.constant.com)... 108.61.5.83
Connecting to centos.mirror.constant.com 
(centos.mirror.constant.com)|108.61.5.83|:80... connected.
HTTP request sent, 

[Yahoo-eng-team] [Bug 1728665] [NEW] Removing gateway ip for tenant network (DVR) causes traceback in neutron-openvswitch-agent

2017-10-30 Thread James Denton
Public bug reported:

Version: OpenStack Newton (OSA v14.2.11)
neutron-openvswitch-agent version 9.4.2.dev21

Issue:

Users complained that instances were unable to obtain their IPs via
DHCP. On the controllers, numerous ports were found in the BUILD state.
Tracebacks similar to the following could be observed in the
neutron-openvswitch-agent logs across the (3) controllers.

2017-10-26 16:24:28.458 4403 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] Port 
e9c11103-9d10-4b27-b739-e428773d8fac updated. Details: {u'profile': {}, 
u'network_qos_policy_id': None, u'qos_policy_id': None, 
u'allowed_address_pairs': [], u'admin_state_up': True, u'network_id': 
u'e57257d9-f915-4c60-ac30-76b0e2d36378', u'segmentation_id': 2123, 
u'device_owner': u'network:dhcp', u'physical_network': u'physnet1', 
u'mac_address': u'fa:16:3e:af:aa:f5', u'device': 
u'e9c11103-9d10-4b27-b739-e428773d8fac', u'port_security_enabled': False, 
u'port_id': u'e9c11103-9d10-4b27-b739-e428773d8fac', u'fixed_ips': 
[{u'subnet_id': u'b7196c99-0df6-4b0e-bbfa-e62da96dac86', u'ip_address': 
u'10.1.1.32'}], u'network_type': u'vlan'}
2017-10-26 16:24:28.458 4403 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] Assigning 48 as local vlan 
for net-id=e57257d9-f915-4c60-ac30-76b0e2d36378
2017-10-26 16:24:28.462 4403 INFO neutron.agent.l2.extensions.qos 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] QoS extension did have no 
information about the port e9c11103-9d10-4b27-b739-e428773d8fac that we were 
trying to reset
2017-10-26 16:24:28.462 4403 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] Port 
610c3924-5e94-4f95-b19b-75e43c5729ff updated. Details: {u'profile': {}, 
u'network_qos_policy_id': None, u'qos_policy_id': None, 
u'allowed_address_pairs': [], u'admin_state_up': True, u'network_id': 
u'f09a8be9-a7c7-4f90-8cb3-d08b61095c25', u'segmentation_id': 5, 
u'device_owner': u'network:router_gateway', u'physical_network': u'physnet1', 
u'mac_address': u'fa:16:3e:bf:39:43', u'device': 
u'610c3924-5e94-4f95-b19b-75e43c5729ff', u'port_security_enabled': False, 
u'port_id': u'610c3924-5e94-4f95-b19b-75e43c5729ff', u'fixed_ips': 
[{u'subnet_id': u'3ce21ed4-bb6a-4e67-b222-a055df40af08', u'ip_address': 
u'96.116.48.132'}], u'network_type': u'vlan'}
2017-10-26 16:24:28.463 4403 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] Assigning 43 as local vlan 
for net-id=f09a8be9-a7c7-4f90-8cb3-d08b61095c25
2017-10-26 16:24:28.466 4403 INFO neutron.agent.l2.extensions.qos 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] QoS extension did have no 
information about the port 610c3924-5e94-4f95-b19b-75e43c5729ff that we were 
trying to reset
2017-10-26 16:24:28.467 4403 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] Port 
66db7e2d-bd92-48ea-85fa-5e20dfc5311c updated. Details: {u'profile': {}, 
u'network_qos_policy_id': None, u'qos_policy_id': None, 
u'allowed_address_pairs': [], u'admin_state_up': True, u'network_id': 
u'fd67eae2-9db7-4f7c-a622-39be67090cb4', u'segmentation_id': 2170, 
u'device_owner': u'network:dhcp', u'physical_network': u'physnet1', 
u'mac_address': u'fa:16:3e:c9:24:8a', u'device': 
u'66db7e2d-bd92-48ea-85fa-5e20dfc5311c', u'port_security_enabled': False, 
u'port_id': u'66db7e2d-bd92-48ea-85fa-5e20dfc5311c', u'fixed_ips': 
[{u'subnet_id': u'47366a54-22ca-47a2-b7a0-987257fa83ea', u'ip_address': 
u'192.168.189.3'}], u'network_type': u'vlan'}
2017-10-26 16:24:28.467 4403 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] Assigning 54 as local vlan 
for net-id=fd67eae2-9db7-4f7c-a622-39be67090cb4
2017-10-26 16:24:28.470 4403 INFO neutron.agent.l2.extensions.qos 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] QoS extension did have no 
information about the port 66db7e2d-bd92-48ea-85fa-5e20dfc5311c that we were 
trying to reset
{...snip...}
2017-10-26 16:24:28.501 4403 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-00e34b5f-346a-4c33-a71b-822fde6e6f46 - - - - -] Port 
c53c48d4-77a8-4185-bc87-ff999bdfd4a1 updated. Details: {u'profile': {}, 
u'network_qos_policy_id': None, u'qos_policy_id': None, 
u'allowed_address_pairs': [], u'admin_state_up': True, u'network_id': 
u'06390e9c-6aa4-427a-91dc-5cf2c62be143', u'segmentation_id': 2003, 
u'device_owner': u'network:router_interface_distributed', u'physical_network': 
u'physnet1', u'mac_address': u'fa:16:3e:38:8b:f0', u'device': 
u'c53c48d4-77a8-4185-bc87-ff999bdfd4a1', u'port_security_enabled': False, 
u'port_id': u'c53c48d4-77a8-4185-bc87-ff999bdfd4a1', u'fixed_ips': 
[{u'subnet_id': 

[Yahoo-eng-team] [Bug 1715734] Re: Gratuitous ARP for floating IPs not so gratuitous

2017-09-11 Thread James Denton
Thanks, Brian. I confirmed that the other 'arping' package was being
installed over iputils-arping post-deploy by another set of playbooks.
The difference in behavior between the two packages is subtle and not
enough to cause any outright errors, but will affect users in a negative
way as described in the initial report.
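
For the record, the iputils-arping form of the same gratuitous ARP would
be (a sketch; iputils uses lowercase -s for the source address, while the
other arping package uses -S):

# Gratuitous ARP with source set to the floating IP itself
ip netns exec qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6 \
  arping -U -I qg-6582bfec-7d -c 1 -w 1.5 -s 172.29.77.36 172.29.77.36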

Feel free to close this bug, and thanks again for your help.

** No longer affects: openstack-ansible

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1715734

Title:
  Gratuitous ARP for floating IPs not so gratuitous

Status in neutron:
  In Progress

Bug description:
  OpenStack Release: Newton
  OS: Ubuntu 16.04 LTS

  When working in an environment with multiple application deployments
  that build up/tear down routers and floating ips, it has been observed
  that connectivity to new instances using recycled floating IPs may be
  impacted.

  In this environment, the external provider network is connected to a
  Cisco Nexus 7010 with a default arp cache timeout of 1500 seconds. We
  have observed that the L3 agent is sending out the following arpings
  when floating IPs are assigned:

  2017-09-07 16:57:17.396 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-A', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.36'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
  2017-09-07 16:57:19.644 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.29'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
  2017-09-07 16:57:19.913 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.44'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89

  Here's the respective packet capture:

  18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 
tell 172.29.77.39, length 28
  18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 
tell 172.29.77.39, length 28
  18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 
tell 172.29.77.39, length 28

  The source address in all of those ARP requests is 172.29.77.39 - the
  IP primary address on the qg interface. The ARP entry for the recycled
  floating IPs on the Nexus is not being refreshed and remains stale.
  For the gratuitous ARP to be successful, the source IP needs to be
  changed to the respective floating IP, so that both the source and
  destination IPs are the same. The following code change was made in
  ip_lib.py:

  FROM:
  arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
                # Pass -w to set timeout to ensure exit if interface
                # removed while running
                '-w', 1.5, address]

  TO:
  arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
                # Pass -w to set timeout to ensure exit if interface
                # removed while running
                '-w', 1.5, '-S', address, address]

  With that change in place, the following packet captures reflects the
  new behavior:

  18:10:30.389966 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 
tell 172.29.77.36, length 28
  18:10:30.390068 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 
tell 172.29.77.29, length 28
  18:10:30.390143 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 
tell 172.29.77.44, length 28

  Since making the change, we have not had a failed deployment and all
  recycled floating IPs appear to be reachable immediately.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1715734/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

[Yahoo-eng-team] [Bug 1715734] Re: Gratuitous ARP for floating IPs not so gratuitous

2017-09-08 Thread James Denton
** Also affects: openstack-ansible
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1715734

Title:
  Gratuitous ARP for floating IPs not so gratuitous

Status in neutron:
  In Progress
Status in openstack-ansible:
  New

Bug description:
  OpenStack Release: Newton
  OS: Ubuntu 16.04 LTS

  When working in an environment with multiple application deployments
  that build up/tear down routers and floating IPs, it has been observed
  that connectivity to new instances using recycled floating IPs may be
  impacted.

  In this environment, the external provider network is connected to a
  Cisco Nexus 7010 with a default arp cache timeout of 1500 seconds. We
  have observed that the L3 agent is sending out the following arpings
  when floating IPs are assigned:

  2017-09-07 16:57:17.396 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-A', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.36'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
  2017-09-07 16:57:19.644 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.29'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
  2017-09-07 16:57:19.913 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.44'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89

  Here's the respective packet capture:

  18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 
tell 172.29.77.39, length 28
  18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 
tell 172.29.77.39, length 28
  18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 
tell 172.29.77.39, length 28

  The source address in all of those ARP requests is 172.29.77.39 - the
  IP primary address on the qg interface. The ARP entry for the recycled
  floating IPs on the Nexus is not being refreshed and remains stale.
  For the gratuitous ARP to be successful, the source IP needs to be
  changed to the respective floating IP, so that both the source and
  destination IPs are the same. The following code change was made in
  ip_lib.py:

  FROM:
  arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
# Pass -w to set timeout to ensure exit if interface
# removed while running
'-w', 1.5, address]

  TO:
  arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
# Pass -w to set timeout to ensure exit if interface
# removed while running
'-w', 1.5, '-S', address, address]

  With that change in place, the following packet capture reflects the
  new behavior:

  18:10:30.389966 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 
tell 172.29.77.36, length 28
  18:10:30.390068 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 
tell 172.29.77.29, length 28
  18:10:30.390143 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 
tell 172.29.77.44, length 28

  Since making the change, we have not had a failed deployment and all
  recycled floating IPs appear to be reachable immediately.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1715734/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1715734] [NEW] Gratuitous ARP for floating IPs not so gratuitous

2017-09-07 Thread James Denton
Public bug reported:

OpenStack Release: Newton
OS: Ubuntu 16.04 LTS

When working in an environment with multiple application deployments
that build up/tear down routers and floating IPs, it has been observed
that connectivity to new instances using recycled floating IPs may be
impacted.

In this environment, the external provider network is connected to a
Cisco Nexus 7010 with a default arp cache timeout of 1500 seconds. We
have observed that the L3 agent is sending out the following arpings
when floating IPs are assigned:

2017-09-07 16:57:17.396 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-A', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.36'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2017-09-07 16:57:19.644 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.29'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2017-09-07 16:57:19.913 13048 DEBUG neutron.agent.linux.utils [-] Running 
command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', 
'/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 
'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 
'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.44'] create_process 
/openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89

Here's the respective packet capture:

18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 
tell 172.29.77.39, length 28
18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 
tell 172.29.77.39, length 28
18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 
tell 172.29.77.39, length 28

The source address in all of those ARP requests is 172.29.77.39 - the IP
primary address on the qg interface. The ARP entry for the recycled
floating IPs on the Nexus is not being refreshed and remains stale. For
the gratuitous ARP to be successful, the source IP needs to be changed
to the respective floating IP, so that both the source and destination
IPs are the same. The following code change was made in ip_lib.py:

FROM:
arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
  # Pass -w to set timeout to ensure exit if interface
  # removed while running
  '-w', 1.5, address]

TO:
arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
  # Pass -w to set timeout to ensure exit if interface
  # removed while running
  '-w', 1.5, '-S', address, address]

With that change in place, the following packet capture reflects the
new behavior:

18:10:30.389966 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 
tell 172.29.77.36, length 28
18:10:30.390068 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 
tell 172.29.77.29, length 28
18:10:30.390143 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q 
(0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 
tell 172.29.77.44, length 28

Since making the change, we have not had a failed deployment and all
recycled floating IPs appear to be reachable immediately.
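
For anyone wanting to confirm the behavior by hand before patching, the
equivalent one-off command inside the router namespace would look roughly
like this (namespace, interface, and IP taken from the capture above;
assumes the installed arping supports -S as in the change):

ip netns exec qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6 \
    arping -U -I qg-6582bfec-7d -c 1 -w 1.5 -S 172.29.77.36 172.29.77.36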

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1715734

Title:
  Gratuitous ARP for floating IPs not so gratuitous

Status in neutron:
  New

Bug description:
  OpenStack Release: Newton
  OS: Ubuntu 16.04 LTS

  When working in an environment with multiple application deployments
  that build up/tear down routers and floating ips, it has been observed
  that connectivity to new instances using recycled floating IPs may be
  impacted.

  In this environment, the external provider network is connected to a
  Cisco Nexus 7010 with a default arp cache timeout of 1500 seconds. We
  have observed that the L3 agent is sending out the following arpings
  when floating IPs are assigned:

  2017-09-07 16:57:17.396 13048 DEBUG neutron.agent.linux.utils [-] Running 

[Yahoo-eng-team] [Bug 1667755] Re: Default scope rules added to router may drop traffic unexpectedly

2017-03-01 Thread James Denton
It has been determined that the networks attached to the router were
associated with different scopes. Additional testing has found that the
proper rules are being added. Marking as invalid.

** Changed in: neutron
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1667755

Title:
  Default scope rules added to router may drop traffic unexpectedly

Status in neutron:
  Invalid

Bug description:
  Release: OpenStack-Ansible 13.3.4 (Mitaka)

  Scenario:

  Neutron routers are connected to a single provider network and a single
  tenant network. Floating IPs are *not* used, and SNAT is disabled on
  the router:

  
  +-------------------------+-------------------------------------------------------------------------------+
  | Field                   | Value                                                                         |
  +-------------------------+-------------------------------------------------------------------------------+
  | admin_state_up          | True                                                                          |
  | availability_zone_hints |                                                                               |
  | availability_zones      | nova                                                                          |
  | description             |                                                                               |
  | distributed             | False                                                                         |
  | external_gateway_info   | {"network_id": "ce830329-4133-41fe-868f-698cc761e247", "enable_snat": false, |
  |                         | "external_fixed_ips": [{"subnet_id": "cf34a5c3-5d26-449f-b22e-2e3fdd69f262", |
  |                         | "ip_address": "10.152.114.39"}]}                                              |
  | ha                      | False                                                                         |
  | id                      | c965e7a1-98c0-4d5e-8dcb-cfafc2667ee1                                          |
  | name                    | RTR                                                                           |
  | routes                  |                                                                               |
  | status                  | ACTIVE                                                                        |
  | tenant_id               | 2ed1712187674c64acae83948e5b1928                                              |
  +-------------------------+-------------------------------------------------------------------------------+

  Upstream routes exist that route tenant network traffic to the qg
  interface of the router (static, not BGP - yet).

  In some cases, we have found that inbound/outbound traffic is getting
  dropped within the Neutron qrouter namespace. Comparing to a working
  router, we have found some differences in iptables:

  Working router:

  *mangle
  -A neutron-l3-agent-scope -i qr-3dd65e85-f2 -j MARK --set-xmark 
0x401/0x
  -A neutron-l3-agent-scope -i qg-2f55db22-5b -j MARK --set-xmark 
0x401/0x

  *filter
  -A neutron-l3-agent-scope -o qr-3dd65e85-f2 -m mark ! --mark 
0x401/0x -j DROP
  -A neutron-l3-agent-scope -o qg-2f55db22-5b -m mark ! --mark 
0x401/0x -j DROP

  Non-working router:

  *mangle
  -A neutron-l3-agent-scope -i qg-e3f65cf1-29 -j MARK --set-xmark 
0x401/0x
  -A neutron-l3-agent-scope -i qr-125a3dc5-e3 -j MARK --set-xmark 
0x400/0x

  *filter
  -A neutron-l3-agent-scope -o qg-e3f65cf1-29 -m mark ! --mark 
0x401/0x -j DROP
  -A neutron-l3-agent-scope -o qr-125a3dc5-e3 -m mark ! --mark 
0x400/0x -j DROP

  Our working theory is that the marks in filter rules on the non-
  working router are incorrectly set - traffic ingress to the qg
  interface is being marked as x401, and the egress filter on the qr
  interface is checking for x400. We were able to test this theory by
  swapping the marks on those two filter rules and observed that
  inbound/outbound 

[Yahoo-eng-team] [Bug 1667755] [NEW] Default scope rules added to router may drop traffic unexpectedly

2017-02-24 Thread James Denton
Public bug reported:

Release: OpenStack-Ansible 13.3.4 (Mitaka)

Scenario:

Neutron routers are connected to a single provider network and a single
tenant network. Floating IPs are *not* used, and SNAT is disabled on the
router:

+-------------------------+-------------------------------------------------------------------------------+
| Field                   | Value                                                                         |
+-------------------------+-------------------------------------------------------------------------------+
| admin_state_up          | True                                                                          |
| availability_zone_hints |                                                                               |
| availability_zones      | nova                                                                          |
| description             |                                                                               |
| distributed             | False                                                                         |
| external_gateway_info   | {"network_id": "ce830329-4133-41fe-868f-698cc761e247", "enable_snat": false, |
|                         | "external_fixed_ips": [{"subnet_id": "cf34a5c3-5d26-449f-b22e-2e3fdd69f262", |
|                         | "ip_address": "10.152.114.39"}]}                                              |
| ha                      | False                                                                         |
| id                      | c965e7a1-98c0-4d5e-8dcb-cfafc2667ee1                                          |
| name                    | RTR                                                                           |
| routes                  |                                                                               |
| status                  | ACTIVE                                                                        |
| tenant_id               | 2ed1712187674c64acae83948e5b1928                                              |
+-------------------------+-------------------------------------------------------------------------------+

Upstream routes exist that route tenant network traffic to the qg
interface of the router (static, not BGP - yet).

In some cases, we have found that inbound/outbound traffic is getting
dropped within the Neutron qrouter namespace. Comparing to a working
router, we have found some differences in iptables:

Working router:

*mangle
-A neutron-l3-agent-scope -i qr-3dd65e85-f2 -j MARK --set-xmark 
0x401/0x
-A neutron-l3-agent-scope -i qg-2f55db22-5b -j MARK --set-xmark 
0x401/0x

*filter
-A neutron-l3-agent-scope -o qr-3dd65e85-f2 -m mark ! --mark 
0x401/0x -j DROP
-A neutron-l3-agent-scope -o qg-2f55db22-5b -m mark ! --mark 
0x401/0x -j DROP

Non-working router:

*mangle
-A neutron-l3-agent-scope -i qg-e3f65cf1-29 -j MARK --set-xmark 
0x401/0x
-A neutron-l3-agent-scope -i qr-125a3dc5-e3 -j MARK --set-xmark 
0x400/0x

*filter
-A neutron-l3-agent-scope -o qg-e3f65cf1-29 -m mark ! --mark 
0x401/0x -j DROP
-A neutron-l3-agent-scope -o qr-125a3dc5-e3 -m mark ! --mark 
0x400/0x -j DROP

Our working theory is that the marks in filter rules on the non-working
router are incorrectly set - traffic ingress to the qg interface is
being marked as x401, and the egress filter on the qr interface is
checking for x400. We were able to test this theory by swapping the
marks on those two filter rules and observed that inbound/outbound
traffic was working properly.

In the case of the working router, the mark set in the mangle rules is
the same (x401 for both), so the filter rules work fine.

We are not sure at this time how the mark is determined, and while we
can replicate the issue on new routers in the environment, we are unable
to replicate this behavior in other environments at this time.
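
For anyone reproducing this, the rules above can be collected per-router
with something like the following (the qrouter namespace name embeds the
router's UUID):

ip netns exec qrouter-c965e7a1-98c0-4d5e-8dcb-cfafc2667ee1 iptables-save | grep neutron-l3-agent-scope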

Please let us know if you need any additional info.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.

[Yahoo-eng-team] [Bug 1658802] [NEW] Issue booting instance with normal port and macvtap agent

2017-01-23 Thread James Denton
Public bug reported:

OpenStack Release: Newton
Operating System: Ubuntu 16.04 LTS 4.4.0-45-generic
OpenStack Distro: OpenStack-Ansible 14.0.2

While working to test/implement macvtap functionality, I found it was
not possible to boot an instance when using the macvtap mech driver and
macvtap agent. My goal is/was to boot an instance with a 'normal' (non-
PCI) port, and have a macvtap interface created on the compute node.
What I found was that that booting the instance resulted in an ERROR
state, with errors reported in neutron-server.log[1] and nova-
compute.log[2]:

[1] Neutron Server: http://paste.openstack.org/show/596128/
[2] Nova Compute: http://paste.openstack.org/show/596129/

The Nova boot syntax can be seen here:

nova boot --image 'Cirros Test Image' --flavor 'm1.test' --nic port-
id=5f08dcec-6689-4c1c-9f95-ffb69548c606 --availability-zone=ZONE-B
TEST-1

An availability zone consisting of a single node was specified.

The following error was observed (multiple times) in neutron-
server.log during the boot process:

2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers 
[req-f531783a-8f29-423e-9764-026f4cd4693f 3e202337454b4813b0ce6ecff74d43d6 
b009912cbdcd45a0b4ef34fb3d22e7e1 - - -] Mechanism driver macvtap failed in 
bind_port
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers Traceback 
(most recent call last):
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers   File 
"/openstack/venvs/neutron-14.0.2/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py",
 line 787, in _bind_port_level
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers 
driver.obj.bind_port(context)
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers   File 
"/openstack/venvs/neutron-14.0.2/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_agent.py",
 line 109, in bind_port
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers agent):
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers   File 
"/openstack/venvs/neutron-14.0.2/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/macvtap/mech_driver/mech_macvtap.py",
 line 106, in try_to_bind_segment_for_agent
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers if 
self._is_live_migration(context):
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers   File 
"/openstack/venvs/neutron-14.0.2/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/macvtap/mech_driver/mech_macvtap.py",
 line 68, in _is_live_migration
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers 
port_profile = context.original.get(portbindings.PROFILE)
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers 
AttributeError: 'NoneType' object has no attribute 'get'
2017-01-23 15:25:57.464 17265 ERROR neutron.plugins.ml2.managers

The _is_live_migration function was implemented in [3]:

[3]
https://review.openstack.org/#/c/361301/5/neutron/plugins/ml2/drivers/macvtap/mech_driver/mech_macvtap.py

context.original is set to None, which is resulting in the Traceback
observed here. I have logged the object data at the time it hits the
function in [4]:

[4] Object: http://paste.openstack.org/show/596130/
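
A guard along these lines would presumably avoid the traceback; this is
only a sketch of the idea (the import path and the 'migrating_to' profile
key are assumptions based on the Newton-era driver code), not necessarily
the actual fix:

from neutron.extensions import portbindings

def _is_live_migration(self, context):
    # context.original is only populated on port updates; on the initial
    # bind_port() call it is None, which is what raises the
    # AttributeError seen in the traceback above.
    if context.original is None:
        return False
    port_profile = context.original.get(portbindings.PROFILE) or {}
    return bool(port_profile.get('migrating_to'))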

To work around this issue for the sake of booting the instance, I set
'port_profile' to None in mech_macvtap.py and restarted Neutron server.
I was then able to boot the instance, and the agent created the VLAN and
macvtap interfaces accordingly:

+--------------------------------------+--------+--------+------------+-------------+------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks         |
+--------------------------------------+--------+--------+------------+-------------+------------------+
| 0022df55-6500-44c0-8ff5-4433c37157c1 | TEST-1 | ERROR  | -          | NOSTATE     |                  |
| 69eda7de-00ad-4927-955f-9e5e1d9d6909 | TEST-1 | ACTIVE | -          | Running     | TEST=192.168.7.8 |
+--------------------------------------+--------+--------+------------+-------------+------------------+

113: em50.62@em50:  mtu 1500 qdisc noqueue 
state UP group default qlen 1000
link/ether 5c:b9:01:88:fd:a5 brd ff:ff:ff:ff:ff:ff
inet6 fe80::5eb9:1ff:fe88:fda5/64 scope link
   valid_lft forever preferred_lft forever
115: macvtap0@em50.62:  mtu 1500 
qdisc pfifo_fast state UNKNOWN group default qlen 500
link/ether fa:16:3e:67:da:b8 brd ff:ff:ff:ff:ff:ff
inet6 fe80::f816:3eff:fe67:dab8/64 scope link
   valid_lft forever preferred_lft forever
   

[Yahoo-eng-team] [Bug 1653810] [NEW] [sriov] Modifying or removing pci_passthrough_whitelist may result in inconsistent VF availability

2017-01-03 Thread James Denton
Public bug reported:

OpenStack Version: v14 (Newton)
NIC: Mellanox ConnectX-3 Pro

While testing an SR-IOV implementation, we found that
pci_passthrough_whitelist in nova.conf is involved in the population of
the pci_devices table in the Nova DB. Making changes to the
device/interface in the whitelist or commenting out the line altogether,
and restarting nova-compute, can result in the entries being marked as
'deleted' in the database. Reconfiguring the pci_passthrough_whitelist
option with the same device/interface will result in new entries being
created and marked as 'available'. This can cause PCI device claim
issues if an existing instance is still running and using a VF and
another instance is booted using a 'direct' port.
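
For reference, the whitelist entry being toggled looked something like the
following in nova.conf (the devname and physical_network values here are
illustrative, not the actual ones from this deployment):

[DEFAULT]
pci_passthrough_whitelist = { "devname": "p2p1", "physical_network": "physnet1" }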

In the following table, you can see the original implementation that
includes an allocated VF. During testing, we commented out the
pci_passthrough_whitelist line in nova.conf, and restarted nova-compute.
The entries were marked as 'deleted', though the running instance was
not deleted and continued to function.  The pci_passthrough_whitelist
config was then returned and nova-compute restarted. New entries were
created and marked as 'available':

MariaDB [nova]> select * from pci_devices;
+---------------------+---------------------+---------------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-------------+------------+---------------+------------+-----------+--------------+
| created_at          | updated_at          | deleted_at          | deleted | id | compute_node_id | address      | product_id | vendor_id | dev_type | dev_id           | label           | status      | extra_info | instance_uuid | request_id | numa_node | parent_addr  |
+---------------------+---------------------+---------------------+---------+----+-----------------+--------------+------------+-----------+----------+------------------+-----------------+-------------+------------+---------------+------------+-----------+--------------+
| 2016-12-29 15:23:36 | 2016-12-29 20:40:34 | 2016-12-29 20:42:26 |      72 | 72 | 6               | 0000:07:00.0 | 1007       | 15b3      | type-PF  | pci_0000_07_00_0 | label_15b3_1007 | unavailable | {}         | NULL          | NULL       | 0         | NULL         |
| 2016-12-29 15:23:36 | 2016-12-29 20:40:34 | 2016-12-29 20:43:23 |      75 | 75 | 6               | 0000:07:00.1 | 1004       | 15b3      | type-VF  | pci_0000_07_00_1 | label_15b3_1004 | available   | {}         | NULL          | NULL       | 0         | 0000:07:00.0 |
| 2016-12-29 15:23:36 | 2016-12-29 20:40:34 | 2016-12-29 20:42:26 |      78 | 78 | 6               | 0000:07:00.2 | 1004       | 15b3      | type-VF  | pci_0000_07_00_2 | label_15b3_1004 | available   | {}         | NULL          | NULL       | 0         | 0000:07:00.0 |
| 2016-12-29 15:23:36 | 2016-12-29 20:40:34 | 2016-12-29 20:44:25 |      81 | 81 | 6               | 0000:07:00.3 | 1004       | 15b3      | type-VF  | pci_0000_07_00_3 | label_15b3_1004 | available   | {}         | NULL          | NULL       | 0         | 0000:07:00.0 |
| 2016-12-29 15:23:36 | 2016-12-29 20:40:34 | 2016-12-29 20:42:26 |      84 | 84 | 6               | 0000:07:00.4 | 1004       | 15b3      | type-VF  | pci_0000_07_00_4 | label_15b3_1004 | available   | {}         | NULL          | NULL       | 0         | 0000:07:00.0 |
| 2016-12-29 15:23:36 | 2016-12-29 20:40:34 | 2016-12-29 20:43:23 |      87 | 87 | 6               | 0000:07:00.5 | 1004       | 15b3      | type-VF  | pci_0000_07_00_5 | label_15b3_1004 | available   | {}         | NULL          | NULL       | 0         | 0000:07:00.0 |
| 2016-12-29 15:23:36 | 2016-12-29 20:40:34 | 2016-12-29 20:42:26 |      90 | 90 | 6               | 0000:07:00.6 | 1004       | 15b3      | type-VF  | pci_0000_07_00_6 | label_15b3_1004 | available   | {}         | NULL          | NULL       | 0         | 0000:07:00.0 |
| 2016-12-29 15:23:36 | 2016-12-29 20:40:34 | 2016-12-29 20:44:51 |      93 | 93 | 6               | 0000:07:00.7 | 1004       | 15b3      | type-VF  | pci_0000_07_00_7 | label_15b3_1004 | available   | {}         | NULL          | NULL       | 0         | 0000:07:00.0 |
| 2016-12-29 15:23:36 | 2016-12-29 17:40:25 | 2016-12-29 20:42:26 |      96 | 96 | 6               | 0000:07:01.0 | 1004       | 15b3      | type-VF  | pci_0000_07_01_0 |

[Yahoo-eng-team] [Bug 1648242] [NEW] Failure to retry update_ha_routers_states

2016-12-07 Thread James Denton
Public bug reported:

Version: Mitaka

While performing failover testing of L3 HA routers, we've discovered an
issue regarding the failure of an agent to report its state.

In this scenario, we have a router (7629f5d7-b205-4af5-8e0e-a3c4d15e7677)
scheduled to three (3) L3 agents:

+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id                                   | host                                             | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | :-)   | active   |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+

The infra03 node was shut down completely and abruptly. The router
transitioned to master on infra02 as indicated in these log messages:

2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device 
qg-d48918fa-eb already exists
2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router 
c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-]  - - 
[07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router 
c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-]  - - 
[07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 
7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-]  - - 
[07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 
4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-]  - - 
[07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848

Traffic to/from VMs through the new master router functioned as
expected. However, the ha_state remained 'standby':


+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id                                   | host                                             | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | xxx   | standby  |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+

A traceback was observed in the logs related to a message timeout,
probably due to the abrupt loss of the AMQP server on infra03:

2016-12-07 15:22:30.525 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] 
AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 
seconds.
2016-12-07 15:22:36.537 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] 
AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 
seconds.
2016-12-07 15:22:37.553 18450 INFO oslo.messaging._drivers.impl_rabbit [-] 
Reconnected to AMQP server on 172.29.238.65:5671 via [amqp] client
2016-12-07 15:22:51.210 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] 
AMQP server on 172.29.237.246:5671 is unreachable: Basic.cancel: (0) 1. Trying 
again in 1 seconds.
2016-12-07 15:22:52.262 18450 INFO oslo.messaging._drivers.impl_rabbit [-] 
Reconnected to AMQP server on 172.29.237.246:5671 via [amqp] client
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent [-] Failed reporting 
state!
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent Traceback (most 
recent call last):
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File 
"/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/l3/agent.py",
 line 686, in _report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent True)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File 
"/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/rpc.py",
 line 87, in report_state
2016-12-07 

[Yahoo-eng-team] [Bug 1616208] [NEW] [RFE] Support creating a subnet without an allocation pool

2016-08-23 Thread James Denton
Public bug reported:

Problem Description
===

Currently, subnets are created with an allocation pool(s) that is either
a) user-defined or b) automatically generated based on CIDR. This RFE
asks that the community support the creation of subnets without an
allocation pool.

Neutron allows users to create ports using fixed IP addresses that fall
outside of the subnet allocation pool(s) but within the range defined by
the CIDR. Neutron keeps track of assigned addresses and does not allow
for overlap within the same subnet.

Use cases:
* An external IPAM service is utilized that is not integrated with 
OpenStack/Neutron. The user wants to create a port with a specific IP address 
using the --fixed-ip flag, and does not want Neutron automatically consuming 
addresses from the pool if an address is not manually allocated via Neutron or 
Nova. 


Proposed Change
===


Allow 'None', or similar value, as a valid start/end value. The result would be 
that Neutron would not create an allocation pool for the subnet. The Neutron 
client would have a new flag, such as --no-allocation-pool, or something 
similar.

As I see it, not creating an allocation pool for a subnet would mean
that when a port is created without an IP specified, Neutron would
return the 'no more addresses available for allocation' error.
Otherwise, the current behavior of allowing the user to specify a
particular fixed IP address remains the same.
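
For example (flag name and network name hypothetical, per the proposal
above):

neutron subnet-create mynet 192.0.2.0/24 --no-allocation-pool

...which would leave the subnet's allocation_pools empty, so addresses
would only ever be assigned when explicitly requested via --fixed-ip.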

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: rfe

** Tags added: rfe

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1616208

Title:
  [RFE] Support creating a subnet without an allocation pool

Status in neutron:
  New

Bug description:
  Problem Description
  ===

  Currently, subnets are created with an allocation pool(s) that is
  either a) user-defined or b) automatically generated based on CIDR.
  This RFE asks that the community support the creation of subnets
  without an allocation pool.

  Neutron allows users to create ports using fixed IP addresses that
  fall outside of the subnet allocation pool(s) but within the range
  defined by the CIDR. Neutron keeps track of assigned addresses and
  does not allow for overlap within the same subnet.

  Use cases:
  * An external IPAM service is utilized that is not integrated with 
OpenStack/Neutron. The user wants to create a port with a specific IP address 
using the --fixed-ip flag, and does not want Neutron automatically consuming 
addresses from the pool if an address is not manually allocated via Neutron or 
Nova. 

  
  Proposed Change
  ===

  
  Allow 'None', or similar value, as a valid start/end value. The result would 
be that Neutron would not create an allocation pool for the subnet. The Neutron 
client would have a new flag, such as --no-allocation-pool, or something 
similar.

  As I see it, not creating an allocation pool for a subnet would mean
  that when a port is created without an IP specified, Neutron would
  return the 'no more addresses available for allocation' error.
  Otherwise, the current behavior of allowing the user to specify a
  particular fixed IP address remains the same.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1616208/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1572390] [NEW] Scheduling router without external network results in unclear error

2016-04-19 Thread James Denton
Public bug reported:

When attempting to schedule a router to an L3 agent without an external
network set, the following error is observed:

root@infra01_neutron_server_container-96ae0d98:~# neutron l3-agent-router-add 7ec8336e-3d82-46f5-8e15-2f2477090021 TestRouter
Agent 7ec8336e-3d82-46f5-8e15-2f2477090021 is not a L3 Agent or has been disabled

The error could lead users to believe the L3 agent(s) are not alive when
the agent-list clearly shows otherwise:

root@infra01_neutron_server_container-96ae0d98:~# neutron agent-list | grep l3
| 7ec8336e-3d82-46f5-8e15-2f2477090021 | L3 agent | infra01_neutron_agents_container-4bbf1e68 | :-)   | True | neutron-l3-agent |
| b5d7be19-3143-431e-ac8b-1c8237acaace | L3 agent | infra02_neutron_agents_container-fb552892 | :-)   | True | neutron-l3-agent |
| e8bed137-6ec3-41d1-a524-3e5e3a884fcd | L3 agent | infra03_neutron_agents_container-ad2973dd | :-)   | True | neutron-l3-agent |

The get_l3_agent_candidates function in l3_agentschedulers_db.py checks
whether external_gateway_info is set and, when it is not, returns zero
candidates. As a result, the presented agent is considered invalid.

The proper workflow appears to be that a gateway must be set before a
router can be scheduled. However, already scheduled routers can have
their gateway cleared with no observed issues and remain scheduled.
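
In other words, the sequence that works appears to be (external network
name hypothetical):

neutron router-gateway-set TestRouter EXT-NET
neutron l3-agent-router-add 7ec8336e-3d82-46f5-8e15-2f2477090021 TestRouter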

It may help to create a new error to represent this failed test case
rather than lumping tests together, or provide additional pointers in
the error message to reflect possible issues other than a disabled or
non-L3 agent.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1572390

Title:
  Scheduling router without external network results in unclear error

Status in neutron:
  New

Bug description:
  When attempting to schedule a router to an L3 agent without an
  external network set, the following error is observed:

  root@infra01_neutron_server_container-96ae0d98:~# neutron l3-agent-router-add 7ec8336e-3d82-46f5-8e15-2f2477090021 TestRouter
  Agent 7ec8336e-3d82-46f5-8e15-2f2477090021 is not a L3 Agent or has been disabled

  The error could lead users to believe the L3 agent(s) are not alive
  when the agent-list clearly shows otherwise:

  root@infra01_neutron_server_container-96ae0d98:~# neutron agent-list | grep l3
  | 7ec8336e-3d82-46f5-8e15-2f2477090021 | L3 agent | infra01_neutron_agents_container-4bbf1e68 | :-)   | True | neutron-l3-agent |
  | b5d7be19-3143-431e-ac8b-1c8237acaace | L3 agent | infra02_neutron_agents_container-fb552892 | :-)   | True | neutron-l3-agent |
  | e8bed137-6ec3-41d1-a524-3e5e3a884fcd | L3 agent | infra03_neutron_agents_container-ad2973dd | :-)   | True | neutron-l3-agent |

  The get_l3_agent_candidates function in l3_agentschedulers_db.py checks
  whether external_gateway_info is set and, when it is not, returns zero
  candidates. As a result, the presented agent is considered invalid.

  The proper workflow appears to be that a gateway must be set before a
  router can be scheduled. However, already scheduled routers can have
  their gateway cleared with no observed issues and remain scheduled.

  It may help to create a new error to represent this failed test case
  rather than lumping tests together, or provide additional pointers in
  the error message to reflect possible issues other than a disabled or
  non-L3 agent.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1572390/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1531013] [NEW] Duplicate entries in FDB table

2016-01-04 Thread James Denton
Public bug reported:

Posting here, because I'm not sure of a better place at the moment.

Environment: Juno
OS: Ubuntu 14.04 LTS
Plugin: ML2/LinuxBridge

root@infra01_neutron_agents_container-4c850328:~# bridge -V
bridge utility, 0.0
root@infra01_neutron_agents_container-4c850328:~# ip -V
ip utility, iproute2-ss131122
root@infra01_neutron_agents_container-4c850328:~# uname -a
Linux infra01_neutron_agents_container-4c850328 3.13.0-46-generic #79-Ubuntu 
SMP Tue Mar 10 20:06:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

We recently discovered that across the environment (5 controller, 50+
compute) there are (tens of) thousands of duplicate entries in the FDB
table, but only for the 00:00:00:00:00:00 broadcast entries. This is in
an environment of ~1600 instances, ~4,100 ports, and 80 networks.
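
For context: the all-zero MAC entries are the VXLAN flood entries that the
l2pop mechanism programs, one per remote VTEP, with something roughly
equivalent to the following (illustrative):

bridge fdb append 00:00:00:00:00:00 dev vxlan-10 dst 172.29.243.157 self permanent

so each of these tuples should normally appear exactly once.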

In this example, the number of duplicate FDB entries for this particular
VTEP jumps wildly:

root@infra01_neutron_agents_container-4c850328:~# bridge fdb show | grep 
"00:00:00:00:00:00 dev vxlan-10 dst 172.29.243.157" | wc -l
1429
root@infra01_neutron_agents_container-4c850328:~# bridge fdb show | grep 
"00:00:00:00:00:00 dev vxlan-10 dst 172.29.243.157" | wc -l
81057
root@infra01_neutron_agents_container-4c850328:~# bridge fdb show | grep 
"00:00:00:00:00:00 dev vxlan-10 dst 172.29.243.157" | wc -l
25806
root@infra01_neutron_agents_container-4c850328:~# bridge fdb show | grep 
"00:00:00:00:00:00 dev vxlan-10 dst 172.29.243.157" | wc -l
473141
root@infra01_neutron_agents_container-4c850328:~# bridge fdb show | grep 
"00:00:00:00:00:00 dev vxlan-10 dst 172.29.243.157" | wc -l
225472

That behavior can be observed for all other VTEPs. We're seeing over 13
million total FDB entries on this node:

root@infra01_neutron_agents_container-4c850328:~# bridge fdb show >> 
james_fdb2.txt
root@infra01_neutron_agents_container-4c850328:~# cat james_fdb2.txt | wc -l
13554258

We're also seeing wildly varying counts on the compute nodes. Each of
these commands was run within 1 second of the previous one completing:

root@compute032:~# bridge fdb show | wc -l
898981
root@compute032:~# bridge fdb show | wc -l
734916
root@compute032:~# bridge fdb show | wc -l
1483081
root@compute032:~# bridge fdb show | wc -l
508811
root@compute032:~# bridge fdb show | wc -l
2349221

On this node, you can see over 28,000 duplicates for each of the
entries:

root@compute032:~# bridge fdb show | sort | uniq -c | sort -nr
  28871 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.39 self permanent
  28871 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.38 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.243.252 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.243.157 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.243.133 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.242.66 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.242.193 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.60 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.59 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.58 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.57 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.55 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.54 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.53 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.51 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.50 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.49 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.48 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.47 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.46 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.45 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.44 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.43 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.42 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.40 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.37 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.36 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.35 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.34 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.33 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.32 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.31 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.30 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.29 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.28 self permanent
  28870 00:00:00:00:00:00 dev vxlan-15 dst 172.29.240.27 self permanent
  28870 

[Yahoo-eng-team] [Bug 1521314] [NEW] Changing physical interface mapping may result in multiple physical interfaces in bridge

2015-11-30 Thread James Denton
Public bug reported:

Version: 2015.2 (Liberty)
Plugin: ML2 w/ LinuxBridge

While testing various NICs, I found that changing the physical interface
mapping in the ML2 configuration file and restarting the agent resulted
in the old physical interface remaining in the bridge. This can be
observed with the following steps:

Original configuration:

[linux_bridge]
physical_interface_mappings = physnet1:eth2

racker@compute01:~$ brctl show
bridge name       bridge id           STP enabled   interfaces
brqad516357-47    8000.e41d2d5b6213   no            eth2
                                                    tap72e7d2be-24

Modify the bridge mapping:

[linux_bridge]
#physical_interface_mappings = physnet1:eth2
physical_interface_mappings = physnet1:eth1

Restart the agent:

racker@compute01:~$ sudo service neutron-plugin-linuxbridge-agent restart
neutron-plugin-linuxbridge-agent stop/waiting
neutron-plugin-linuxbridge-agent start/running, process 12803

Check the bridge:

racker@compute01:~$ brctl show
bridge name       bridge id           STP enabled   interfaces
brqad516357-47    8000.6805ca37dc39   no            eth1
                                                    eth2
                                                    tap72e7d2be-24

This behavior was observed with flat or vlan networks, and can result in
some wonky behavior. Removing the original interface from the bridge(s)
by hand or restarting the node is a workaround, but I suspect
LinuxBridge users aren't used to modifying the bridges manually as the
agent usually handles that.
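
The by-hand cleanup mentioned above amounts to the following, per affected
bridge (names from the example output):

brctl delif brqad516357-47 eth2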

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1521314

Title:
  Changing physical interface mapping may result in multiple physical
  interfaces in bridge

Status in neutron:
  New

Bug description:
  Version: 2015.2 (Liberty)
  Plugin: ML2 w/ LinuxBridge

  While testing various NICs, I found that changing the physical
  interface mapping in the ML2 configuration file and restarting the
  agent resulted in the old physical interface remaining in the bridge.
  This can be observed with the following steps:

  Original configuration:

  [linux_bridge]
  physical_interface_mappings = physnet1:eth2

  racker@compute01:~$ brctl show
  bridge name       bridge id           STP enabled   interfaces
  brqad516357-47    8000.e41d2d5b6213   no            eth2
                                                      tap72e7d2be-24

  Modify the bridge mapping:

  [linux_bridge]
  #physical_interface_mappings = physnet1:eth2
  physical_interface_mappings = physnet1:eth1

  Restart the agent:

  racker@compute01:~$ sudo service neutron-plugin-linuxbridge-agent restart
  neutron-plugin-linuxbridge-agent stop/waiting
  neutron-plugin-linuxbridge-agent start/running, process 12803

  Check the bridge:

  racker@compute01:~$ brctl show
  bridge name       bridge id           STP enabled   interfaces
  brqad516357-47    8000.6805ca37dc39   no            eth1
                                                      eth2
                                                      tap72e7d2be-24

  This behavior was observed with flat or vlan networks, and can result
  in some wonky behavior. Removing the original interface from the
  bridge(s) by hand or restarting the node is a workaround, but I
  suspect LinuxBridge users aren't used to modifying the bridges
  manually as the agent usually handles that.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1521314/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1505781] [NEW] Unexpected SNAT behavior between instances when SNAT disabled on router

2015-10-13 Thread James Denton
Public bug reported:

= Scenario =

• Kilo/Juno
• Single Neutron router with enable_snat=false
• two instances in two tenant networks attached to router
• each instance has a floating IP

INSTANCE A: TestNet1=192.167.7.3, 10.1.1.7
INSTANCE B: TestNet2=10.0.8.3, 10.1.1.6

When instances communicate out (i.e., to the Internet), they are properly
SNAT'd using their respective floating IP. If an instance does not have
a floating IP, the traffic is routed out without SNAT.

When instances in tenant networks behind the same router communicate via
their fixed IPs, the source address is SNAT'd as the respective floating
IP while the destination is unmodified:

Pinging from INSTANCE A to INSTANCE B:

$ ping 10.0.8.3 -c1
PING 10.0.8.3 (10.0.8.3): 56 data bytes
64 bytes from 10.0.8.3: seq=0 ttl=63 time=7.483 ms

From the Neutron router:

root@controller01:~# ip netns exec qrouter-dd15e8f3-8612-4925-81d4-88fcad49807f 
tcpdump -i any -ne icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
10:37:48.840404: 192.167.7.3 > 10.0.8.3: ICMP echo request, id 37121, seq 12, 
length 64
10:37:48.840467: 10.1.1.7 > 10.0.8.3: ICMP echo request, id 37121, seq 12, 
length 64 <-- SNAT as FLOAT
10:37:48.842506: 10.0.8.3 > 10.1.1.7: ICMP echo reply, id 37121, seq 12, length 
64
10:37:48.842565: 10.0.8.3 > 192.167.7.3: ICMP echo reply, id 37121, seq 12, 
length 64

This behavior has a negative effect for a couple of reasons:

1. The expectation is that traffic between the two instances behind the same 
router using fixed IPs would not be source NAT'd
2. Security group rules that use 'Remote Security Group' rather than 'Remote IP 
Prefix' fail to work since the source address is modified

When SNAT is enabled on the router, traffic between the instances via
their fixed IP works as expected:

From INSTANCE A to B:

$ ping 10.0.8.3 -c 1
PING 10.0.8.3 (10.0.8.3): 56 data bytes
64 bytes from 10.0.8.3: seq=0 ttl=63 time=8.024 ms

From the Neutron router:

root@controller01:~# ip netns exec qrouter-dd15e8f3-8612-4925-81d4-88fcad49807f 
tcpdump -i any -ne icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
10:52:19.945863: 192.167.7.3 > 10.0.8.3: ICMP echo request, id 39425, seq 0, 
length 64
10:52:19.945953: 192.167.7.3 > 10.0.8.3: ICMP echo request, id 39425, seq 0, 
length 64
10:52:19.951498: 10.0.8.3 > 192.167.7.3: ICMP echo reply, id 39425, seq 0, 
length 64
10:52:19.951554: 10.0.8.3 > 192.167.7.3: ICMP echo reply, id 39425, seq 0, 
length 64

We believe the existence of the following iptables nat rule causes the
desired behavior, in that traffic not traversing the qg interface is not
NAT'd:

-A neutron-l3-agent-POSTROUTING ! -i qg-80aa20be-9b ! -o qg-80aa20be-9b
-m conntrack ! --ctstate DNAT -j ACCEPT

That rule only exists when SNAT is *enabled* on the router, and not when
it is disabled, as shown below:

SNAT enabled:

-A PREROUTING -j neutron-l3-agent-PREROUTING
-A OUTPUT -j neutron-l3-agent-OUTPUT
-A POSTROUTING -j neutron-l3-agent-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-OUTPUT -d 10.1.1.6/32 -j DNAT --to-destination 10.0.8.3
-A neutron-l3-agent-OUTPUT -d 10.1.1.7/32 -j DNAT --to-destination 192.167.7.3
-A neutron-l3-agent-POSTROUTING ! -i qg-80aa20be-9b ! -o qg-80aa20be-9b -m 
conntrack ! --ctstate DNAT -j ACCEPT
-A neutron-l3-agent-PREROUTING -d 10.1.1.6/32 -j DNAT --to-destination 10.0.8.3
-A neutron-l3-agent-PREROUTING -d 10.1.1.7/32 -j DNAT --to-destination 
192.167.7.3
-A neutron-l3-agent-float-snat -s 10.0.8.3/32 -j SNAT --to-source 10.1.1.6
-A neutron-l3-agent-float-snat -s 192.167.7.3/32 -j SNAT --to-source 10.1.1.7
-A neutron-l3-agent-snat -j neutron-l3-agent-float-snat
-A neutron-l3-agent-snat -o qg-80aa20be-9b -j SNAT --to-source 10.1.1.5
-A neutron-l3-agent-snat -m mark ! --mark 0x2 -m conntrack --ctstate DNAT -j 
SNAT --to-source 10.1.1.5
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on 
outgoing traffic." -j neutron-l3-agent-snat

SNAT disabled:

-A PREROUTING -j neutron-l3-agent-PREROUTING
-A OUTPUT -j neutron-l3-agent-OUTPUT
-A POSTROUTING -j neutron-l3-agent-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-OUTPUT -d 10.1.1.6/32 -j DNAT --to-destination 10.0.8.3
-A neutron-l3-agent-OUTPUT -d 10.1.1.7/32 -j DNAT --to-destination 192.167.7.3
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp 
--dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-PREROUTING -d 10.1.1.6/32 -j DNAT --to-destination 10.0.8.3
-A neutron-l3-agent-PREROUTING -d 10.1.1.7/32 -j DNAT --to-destination 
192.167.7.3
-A neutron-l3-agent-float-snat -s 10.0.8.3/32 -j SNAT --to-source 10.1.1.6
-A neutron-l3-agent-float-snat -s 192.167.7.3/32 -j SNAT --to-source 10.1.1.7
-A neutron-l3-agent-snat -j neutron-l3-agent-float-snat
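
A quick way to validate the theory would be to re-add the missing
short-circuit rule by hand on the SNAT-disabled router and re-test
east-west traffic (rule text copied from the SNAT-enabled case above):

ip netns exec qrouter-dd15e8f3-8612-4925-81d4-88fcad49807f \
    iptables -t nat -I neutron-l3-agent-POSTROUTING 1 \
    ! -i qg-80aa20be-9b ! -o qg-80aa20be-9b -m conntrack ! --ctstate DNAT -j ACCEPT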

[Yahoo-eng-team] [Bug 1452886] [NEW] Port stuck in BUILD state results in limited instance connectivity

2015-05-07 Thread James Denton
Public bug reported:

I am currently experiencing (random) cases where newly spun-up instances
have limited connectivity. There are about 650 instances in the
environment and 45 networks.

Network Info:
- ML2/LinuxBridge/l2pop
- VXLAN networks

Symptoms:
- On the local compute node, the instance tap is in the bridge. Everything 
looks good.
- Instance is reachable from some, but not all, instances/devices in the same 
subnet across all compute and network nodes
- On some compute nodes and network nodes, the ARP and FDB entries for the 
instance do not exist. Instances/devices on these nodes cannot communicate with 
the new instance.
- No errors are logged

Here are some observations for the non-working instances:
- The corresponding Neutron port is stuck in a BUILD state
- The binding:host_id value of the port (ie. compute-xxx) does not match the 
OS-EXT-SRV-ATTR:host value of the instance (ie. compute-zzz). For working 
instances, these values match.
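
For anyone hitting the same symptoms, the mismatch is quick to check with
(IDs hypothetical):

neutron port-show PORT-ID -F status -F binding:host_id
nova show INSTANCE-ID | grep OS-EXT-SRV-ATTR:host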

I am unable to replicate this consistently at this time, nor am I sure
where to begin pinpointing the issue. Any help is appreciated.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1452886

Title:
  Port stuck in BUILD state results in limited instance connectivity

Status in OpenStack Neutron (virtual network service):
  New

Bug description:
  I am currently experiencing (random) cases where newly spun-up instances
  have limited connectivity. There are about 650 instances in the
  environment and 45 networks.

  Network Info:
  - ML2/LinuxBridge/l2pop
  - VXLAN networks

  Symptoms:
  - On the local compute node, the instance tap is in the bridge. Everything 
looks good.
  - Instance is reachable from some, but not all, instances/devices in the same 
subnet across all compute and network nodes
  - On some compute nodes and network nodes, the ARP and FDB entries for the 
instance do not exist. Instances/devices on these nodes cannot communicate with 
the new instance.
  - No errors are logged

  Here are some observations for the non-working instances:
  - The corresponding Neutron port is stuck in a BUILD state
  - The binding:host_id value of the port (ie. compute-xxx) does not match the 
OS-EXT-SRV-ATTR:host value of the instance (ie. compute-zzz). For working 
instances, these values match.

  I am unable to replicate this consistently at this time, nor am I sure
  where to begin pinpointing the issue. Any help is appreciated.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1452886/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1447242] [NEW] Use of allowed-address-pairs can allow tenant to cause denial of service in shared networks

2015-04-22 Thread James Denton
Public bug reported:

By assigning the subnet gateway address to a port as an allowed address,
a user can cause ARP conflicts and deny service to other users in the
network. This can be exacerbated by the use of arping to send gratuitous
ARPs and poison the ARP cache of instances in the same network.

Steps to reproduce:

1. Build a VM. In this case, the network was a VLAN type with external=false 
and shared=true. 
2. Assign the subnet gateway address as a secondary address in the VM
3. Use the 'port-update' command to add the gateway address as an allowed 
address on the VM port
4. Use 'arping' from iputils-arping to send gratuitous ARPs as the gateway IP 
from the instance
5. Watch as the ARP cache is updated on other instances in the network, 
effectively taking them offline.
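
Mapped to commands, steps 3 and 4 look roughly like this (port ID,
interface, and addresses hypothetical):

neutron port-update PORT-ID --allowed-address-pairs type=dict list=true ip_address=GATEWAY-IP
arping -U -I eth0 -c 5 GATEWAY-IP   # run inside the VM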

This was tested with LinuxBridge/VLAN as a non-admin user, but may
affect other combinations.

Possible remedies may include removing the ability to use allowed-
address-pairs as a non-admin user, or ensuring that the user cannot add
the gateway_ip of the subnet associated with the port as an allowed
address. Either of those two remedies may negatively impact certain use
cases, so at a minimum it may be a good idea to document this somewhere.

If you need more information please reach out to me.

** Affects: neutron
 Importance: Undecided
 Status: New


** Tags: allowed-address-pairs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1447242

Title:
  Use of allowed-address-pairs can allow tenant to cause denial of
  service in shared networks

Status in OpenStack Neutron (virtual network service):
  New

Bug description:
  By assigning the subnet gateway address to a port as an allowed
  address, a user can cause ARP conflicts and deny service to other
  users in the network. This can be exacerbated by the use of arping to
  send gratuitous ARPs and poison the ARP cache of instances in the same
  network.

  Steps to reproduce:

  1. Build a VM. In this case, the network was a VLAN type with external=false 
and shared=true. 
  2. Assign the subnet gateway address as a secondary address in the VM
  3. Use the 'port-update' command to add the gateway address as an allowed 
address on the VM port
  4. Use 'arping' from iputils-arping to send gratuitous ARPs as the gateway IP 
from the instance
  5. Watch as the ARP cache is updated on other instances in the network, 
effectively taking them offline.

  This was tested with LinuxBridge/VLAN as a non-admin user, but may
  affect other combinations.

  Possible remedies may include removing the ability to use allowed-
  address-pairs as a non-admin user, or ensuring that the user cannot
  add the gateway_ip of the subnet associated with the port as an
  allowed address. Either of those two remedies may negatively impact
  certain use cases, so at a minimum it may be a good idea to document
  this somewhere.

  If you need more information please reach out to me.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1447242/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1445089] [NEW] allowed-address-pairs broken with l2pop/arp responder and LinuxBridge/VXLAN

2015-04-16 Thread James Denton
Public bug reported:

Problem:

In Icehouse/Juno, when using ML2/LinuxBridge and VXLAN networks,
allowed-address-pairs functionality is broken. It appears to be a case
where the node drops broadcast traffic (ff:ff:ff:ff:ff:ff), specifically
ARP requests, from an instance.
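
If the agent's arp_responder option is enabled, this drop would be consistent with the kernel vxlan 'proxy' behaviour, where the device answers ARP requests from its static neighbor table and consumes requests it cannot answer rather than flooding them. One quick way to check (a sketch; the device name matches the outputs below):

ip -d link show vxlan-2      # look for the 'proxy' flag in the vxlan details
bridge fdb show dev vxlan-2  # forwarding entries programmed by l2pop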

Steps to reproduce:

1. Create two instances in the same VXLAN network on two different hosts
2. Add a secondary IP address to instance #1, and add it to the port using --allowed-address-pairs
3. Ping from instance #1 to instance #2 using the secondary IP address
4. On the compute node hosting instance #2, observe that the ARP request can be seen on the vxlan interface, but not the parent interface

Steps to resolve:

1. Add static ARP entry to instance #2 
2. -OR- Add static ARP entry/neighbor entry to compute node hosting instance #2

The resolutions above become problematic when the allowed addresses are
networks rather than single IPs, as in the cases where instances are
acting as routers or NFV devices of some kind.
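
As a hedged sketch of the two resolutions above, using the shared IP, MAC, and device names that appear in the outputs below (treat them as illustrative):

# Resolution 1, inside instance #2: pin the shared IP to instance #1's MAC
arp -s 192.168.100.254 fa:16:3e:bf:b0:a1

# Resolution 2, on the compute node hosting instance #2
ip neighbor replace 192.168.100.254 lladdr fa:16:3e:bf:b0:a1 dev vxlan-2 nud permanent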

---

Example:

Create network:
neutron net-create testnet
neutron subnet-create testnet 192.168.100.0/24

Create ports, one for each instance:
neutron port-create 56c413ca-6ef1-45c8-a3e5-6241ad24bb23
neutron port-create 56c413ca-6ef1-45c8-a3e5-6241ad24bb23

Add security group and allowed-address-pairs to each port (IP to be shared)
neutron port-update 6d6796cd-455f-4b48-9e1a-8316bd336aa4 --security-group 378e3851-ae7f-40b3-94e3-c05cad5cb56b --allowed-address-pairs type=dict list=true ip_address=192.168.100.254
neutron port-update 0715121b-4cc8-4437-8840-aa74be619c2e --security-group 378e3851-ae7f-40b3-94e3-c05cad5cb56b --allowed-address-pairs type=dict list=true ip_address=192.168.100.254

Boot instances:
nova boot --flavor 2 --image 0af87835-f50f-4461-abaa-b6f088c64744 --nic port-id=6d6796cd-455f-4b48-9e1a-8316bd336aa4 --key_name rpc_support --availability-zone nova:626976-Compute001 20150331-COMP1-TEST
nova boot --flavor 2 --image 0af87835-f50f-4461-abaa-b6f088c64744 --nic port-id=0715121b-4cc8-4437-8840-aa74be619c2e --key_name rpc_support --availability-zone nova:626977-Compute002 20150331-COMP2-TEST

Observe that the proper iptables rules are in place on the compute
nodes:

root@Compute001:~# iptables-save | grep 6d6796cd
-A neutron-linuxbri-s6d6796cd-4 -s 192.168.100.254/32 -m mac --mac-source FA:16:3E:BF:B0:A1 -j RETURN
-A neutron-linuxbri-s6d6796cd-4 -s 192.168.100.5/32 -m mac --mac-source FA:16:3E:BF:B0:A1 -j RETURN
-A neutron-linuxbri-s6d6796cd-4 -j DROP

root@Compute002:~# iptables-save | grep 0715121b
-A neutron-linuxbri-s0715121b-4 -s 192.168.100.254/32 -m mac --mac-source FA:16:3E:1C:9D:55 -j RETURN
-A neutron-linuxbri-s0715121b-4 -s 192.168.100.6/32 -m mac --mac-source FA:16:3E:1C:9D:55 -j RETURN
-A neutron-linuxbri-s0715121b-4 -j DROP

Verify that ARP entries exist on the compute nodes (instances can ping
each other at fixed IP as expected):

root@Compute001:~# arp -an | grep 192.168.100
? (192.168.100.4) at fa:16:3e:4d:73:7b [ether] PERM on vxlan-2
? (192.168.100.6) at fa:16:3e:1c:9d:55 [ether] PERM on vxlan-2
? (192.168.100.2) at fa:16:3e:d4:53:75 [ether] PERM on vxlan-2
? (192.168.100.3) at fa:16:3e:a6:a4:03 [ether] PERM on vxlan-2

root@Compute002:~# arp -an | grep 192.168.100
? (192.168.100.3) at fa:16:3e:a6:a4:03 [ether] PERM on vxlan-2
? (192.168.100.4) at fa:16:3e:4d:73:7b [ether] PERM on vxlan-2
? (192.168.100.2) at fa:16:3e:d4:53:75 [ether] PERM on vxlan-2
? (192.168.100.5) at fa:16:3e:bf:b0:a1 [ether] PERM on vxlan-2
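
Note that neither node has a PERM entry for the shared address 192.168.100.254; l2pop appears to program only the fixed IPs, which would explain why the ARP responder cannot answer for the allowed-address-pair IP. A quick check (sketch):

ip neigh show dev vxlan-2 | grep 192.168.100.254   # returns nothing on either node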

! TEST !

Test: Configure 192.168.100.254 as a secondary address on INSTANCE#1 and
ping out to INSTANCE#2

root@20150331-comp1-test:~# ip a a 192.168.100.254/32 dev eth0

root@20150331-comp1-test:~# ping -I 192.168.100.254 192.168.100.6
PING 192.168.100.6 (192.168.100.6) from 192.168.100.254 : 56(84) bytes of data.
^C
--- 192.168.100.6 ping statistics ---
26 packets transmitted, 0 received, 100% packet loss, time 25200ms

Result: Failure to reach destination

! TROUBLESHOOT !

Process: 
1. Start ping:

root@20150331-comp1-test:~# ping -I 192.168.100.254 192.168.100.6
PING 192.168.100.6 (192.168.100.6) from 192.168.100.254 : 56(84) bytes of data.

2. Dump on vxlan interface on local compute node:

root@Compute001:~# tcpdump -i vxlan-2 -ne
tcpdump: WARNING: vxlan-2: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vxlan-2, link-type EN10MB (Ethernet), capture size 65535 bytes
14:22:06.595700 fa:16:3e:bf:b0:a1 > fa:16:3e:1c:9d:55, ethertype IPv4 (0x0800), length 98: 192.168.100.254 > 192.168.100.6: ICMP echo request, id 1521, seq 28, length 64
14:22:07.603721 fa:16:3e:bf:b0:a1 > fa:16:3e:1c:9d:55, ethertype IPv4 (0x0800), length 98: 192.168.100.254 > 192.168.100.6: ICMP echo request, id 1521, seq 29, length 64
14:22:08.611701 fa:16:3e:bf:b0:a1 > fa:16:3e:1c:9d:55, ethertype IPv4 (0x0800), length 98: 192.168.100.254 > 192.168.100.6: ICMP echo request, id 1521, seq 30, length