[Yahoo-eng-team] [Bug 2059128] [NEW] Internal Server Error when attempting to use an incorrect URL within the metadata API

2024-03-26 Thread Anton Kurbatov
Public bug reported:

When trying to GET a non-existent metadata key within the VM, like
'/latest/meta-data/hostname/abc', the Nova metadata service responds
with a 500 HTTP status code:

Inside a VM:

$ curl http://169.254.169.254/latest/meta-data/hostname/abc
500 Internal Server Error
An unknown error has occurred. Please try your request again.
$


The nova metadata service logs:

CRITICAL nova [None req-3286f047-98c4-41c8-a11b-02a140fd2e4d None None] 
Unhandled error: TypeError: string indices must be integers
ERROR nova Traceback (most recent call last):
ERROR nova   File "/usr/local/lib/python3.9/site-packages/paste/urlmap.py", 
line 216, in __call__
ERROR nova return app(environ, start_response)
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 
129, in __call__
ERROR nova resp = self.call_func(req, *args, **kw)
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 
193, in call_func
ERROR nova return self.func(req, *args, **kwargs)
ERROR nova   File 
"/usr/local/lib/python3.9/site-packages/oslo_middleware/base.py", line 124, in 
__call__
ERROR nova response = req.get_response(self.application)
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/request.py", 
line 1313, in send
ERROR nova status, headers, app_iter = self.call_application(
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/request.py", 
line 1278, in call_application
ERROR nova app_iter = application(self.environ, start_response)
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 
129, in __call__
ERROR nova resp = self.call_func(req, *args, **kw)
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 
193, in call_func
ERROR nova return self.func(req, *args, **kwargs)
ERROR nova   File 
"/usr/local/lib/python3.9/site-packages/oslo_middleware/base.py", line 124, in 
__call__
ERROR nova response = req.get_response(self.application)
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/request.py", 
line 1313, in send
ERROR nova status, headers, app_iter = self.call_application(
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/request.py", 
line 1278, in call_application
ERROR nova app_iter = application(self.environ, start_response)
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 
129, in __call__
ERROR nova resp = self.call_func(req, *args, **kw)
ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 
193, in call_func
ERROR nova return self.func(req, *args, **kwargs)
ERROR nova   File "/opt/stack/nova/nova/api/metadata/handler.py", line 129, in 
__call__
ERROR nova data = meta_data.lookup(req.path_info)
ERROR nova   File "/opt/stack/nova/nova/api/metadata/base.py", line 576, in 
lookup
ERROR nova data = self.get_ec2_item(path_tokens[1:])
ERROR nova   File "/opt/stack/nova/nova/api/metadata/base.py", line 308, in 
get_ec2_item
ERROR nova return find_path_in_tree(data, path_tokens[1:])
ERROR nova   File "/opt/stack/nova/nova/api/metadata/base.py", line 737, in 
find_path_in_tree
ERROR nova data = data[path_tokens[i]]
ERROR nova TypeError: string indices must be integers
ERROR nova
[pid: 156048|app: 0|req: 5/9] 10.136.16.184 () {40 vars in 687 bytes} [Tue Mar 
26 04:37:44 2024] GET /latest/meta-data/hostname/abc => generated 0 bytes in 82 
msecs (HTTP/1.1 500) 0 headers in 0 bytes (0 switches on core 0)
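
A minimal sketch (not Nova's actual implementation) of the failure mode: once
the lookup reaches a leaf string such as the hostname, applying one more path
token indexes the string with a str key, which raises TypeError instead of
something that could be mapped to a 404:

def lookup(tree, path_tokens):
    data = tree
    for token in path_tokens:
        data = data[token]   # fails with TypeError once 'data' is a str
    return data

metadata = {'meta-data': {'hostname': 'test-vm.novalocal'}}  # illustrative data

try:
    lookup(metadata, ['meta-data', 'hostname', 'abc'])
except TypeError as exc:
    # A friendlier handler would map this (and KeyError) to an HTTP 404.
    print(exc)   # string indices must be integers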

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2059128

Title:
  Internal Server Error when attempting to use an incorrect URL within
  the metadata API

Status in OpenStack Compute (nova):
  New

Bug description:
  When trying to GET a non-existent metadata key within the VM, like
  '/latest/meta-data/hostname/abc', the Nova metadata service responds
  with a 500 HTTP status code:

  Inside a VM:

  $ curl http://169.254.169.254/latest/meta-data/hostname/abc
  
   
500 Internal Server Error
   
   
500 Internal Server Error
An unknown error has occurred. Please try your request again.

   
  $

  
  The nova metadata service logs:

  CRITICAL nova [None req-3286f047-98c4-41c8-a11b-02a140fd2e4d None None] 
Unhandled error: TypeError: string indices must be integers
  ERROR nova Traceback (most recent call last):
  ERROR nova   File "/usr/local/lib/python3.9/site-packages/paste/urlmap.py", 
line 216, in __call__
  ERROR nova return app(environ, start_response)
  ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 
129, in __call__
  ERROR nova resp = self.call_func(req, *args, **kw)
  ERROR nova   File "/usr/local/lib/python3.9/site-packages/webob/dec.py", line 
193, in call_func
  ERROR 

[Yahoo-eng-team] [Bug 2059032] [NEW] Neutron metadata service returns http code 500 if nova metadata service is down

2024-03-25 Thread Anton Kurbatov
Public bug reported:

We discovered that if the nova metadata service is down, the neutron
metadata service returns a 500 HTTP code to the user and logs a stack
trace.

Demo on a newly installed devstack

$ systemctl stop devstack@n-api-meta.service

Then inside a VM:

$ curl http://169.254.169.254/latest/meta-data/hostname
500 Internal Server Error
An unknown error has occurred. Please try your request again.
$

Stack trace:

ERROR neutron.agent.metadata.agent Traceback (most recent call last):
ERROR neutron.agent.metadata.agent   File 
"/opt/stack/neutron/neutron/agent/metadata/agent.py", line 85, in __call__
ERROR neutron.agent.metadata.agent res = self._proxy_request(instance_id, 
tenant_id, req)
ERROR neutron.agent.metadata.agent   File 
"/opt/stack/neutron/neutron/agent/metadata/agent.py", line 249, in 
_proxy_request
ERROR neutron.agent.metadata.agent resp = 
requests.request(method=req.method, url=url,
ERROR neutron.agent.metadata.agent   File 
"/usr/local/lib/python3.9/site-packages/requests/api.py", line 59, in request
ERROR neutron.agent.metadata.agent return session.request(method=method, 
url=url, **kwargs)
ERROR neutron.agent.metadata.agent   File 
"/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 589, in 
request
ERROR neutron.agent.metadata.agent resp = self.send(prep, **send_kwargs)
ERROR neutron.agent.metadata.agent   File 
"/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
ERROR neutron.agent.metadata.agent r = adapter.send(request, **kwargs)
ERROR neutron.agent.metadata.agent   File 
"/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 519, in send
ERROR neutron.agent.metadata.agent raise ConnectionError(e, request=request)
ERROR neutron.agent.metadata.agent requests.exceptions.ConnectionError: 
HTTPConnectionPool(host='10.136.16.184', port=8775): Max retries exceeded with 
url: /latest/meta-data/hostname (Caused by 
NewConnectionError(': Failed to establish a new connection: [Errno 111] 
ECONNREFUSED'))
ERROR neutron.agent.metadata.agent
INFO eventlet.wsgi.server [-] :::192.168.100.14, "GET 
/latest/meta-data/hostname HTTP/1.1" status: 500  len: 362 time: 0.1392403


Also, in our installation the nova metadata service is behind nginx. If we stop
the nova metadata service there, we also get a 500 HTTP code, but with a
different traceback:

2024-03-25 20:27:01.985 24 ERROR neutron.agent.metadata.agent [-] Unexpected 
error.: Exception: Unexpected response code: 502
2024-03-25 20:27:01.985 24 ERROR neutron.agent.metadata.agent Traceback (most 
recent call last):
2024-03-25 20:27:01.985 24 ERROR neutron.agent.metadata.agent   File 
"/usr/lib/python3.6/site-packages/neutron/agent/metadata/agent.py", line 93, in 
__call__
2024-03-25 20:27:01.985 24 ERROR neutron.agent.metadata.agent res = 
self._proxy_request(instance_id, tenant_id, req)
2024-03-25 20:27:01.985 24 ERROR neutron.agent.metadata.agent   File 
"/usr/lib/python3.6/site-packages/neutron/agent/metadata/agent.py", line 288, 
in _proxy_request
2024-03-25 20:27:01.985 24 ERROR neutron.agent.metadata.agent 
resp.status_code)
2024-03-25 20:27:01.985 24 ERROR neutron.agent.metadata.agent Exception: 
Unexpected response code: 502
2024-03-25 20:27:01.985 24 ERROR neutron.agent.metadata.agent
2024-03-25 20:27:01.988 24 INFO eventlet.wsgi.server [-] 10.197.115.207, 
"GET /latest/meta-data/hostname HTTP/1.1" status: 500  len: 362 time: 0.1369441

It seems to me that gateway errors from an nginx-like proxy should also be
handled more gracefully.

These 500 HTTP codes worry us because we are building an alerting system,
and one of its criteria is the occurrence of 500 codes.
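
A hedged sketch (assuming requests and webob, not Neutron's actual code) of how
the metadata proxy could map upstream failures to gateway-style errors instead
of a generic 500:

import requests
from webob import exc as webob_exc

def proxy_metadata_request(url, method='GET', timeout=30):
    try:
        resp = requests.request(method=method, url=url, timeout=timeout)
    except requests.exceptions.ConnectionError:
        # Nova metadata service unreachable: report it as the upstream being
        # unavailable rather than an internal error in the proxy itself.
        return webob_exc.HTTPServiceUnavailable()
    if resp.status_code in (502, 503, 504):
        # Propagate gateway errors coming from an intermediate proxy such as
        # nginx instead of raising and turning them into a 500.
        return webob_exc.HTTPBadGateway()
    return resp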

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2059032

Title:
  Neutron metadata service returns http code 500 if nova metadata
  service is down

Status in neutron:
  New

Bug description:
  We discovered that if the nova metadata service is down, then the
  neutron metadata service starts printing stack traces with a 500 HTTP
  code to the user.

  Demo on a newly installed devstack

  $ systemctl stop devstack@n-api-meta.service

  Then inside a VM:

  $ curl http://169.254.169.254/latest/meta-data/hostname
  
   
500 Internal Server Error
   
   
500 Internal Server Error
An unknown error has occurred. Please try your request again.
   
  $

  Stack trace:

  ERROR neutron.agent.metadata.agent Traceback (most recent call last):
  ERROR neutron.agent.metadata.agent   File 
"/opt/stack/neutron/neutron/agent/metadata/agent.py", line 85, in __call__
  ERROR neutron.agent.metadata.agent res = self._proxy_request(instance_id, 
tenant_id, req)
  ERROR neutron.agent.metadata.agent   File 
"/opt/stack/neutron/neutron/agent/metadata/agent.py", line 249, in 
_proxy_request
 

[Yahoo-eng-team] [Bug 2038931] [NEW] ovsfw: OVS br-int rule disappears from the table=60 after stop/start VM

2023-10-10 Thread Anton Kurbatov
Public bug reported:

I found out that the set of OVS rules in br-int table=60 (TRANSIENT_TABLE)
differs between initial VM creation and a subsequent VM stop/start.

I have a flat network and create a VM on it. After the VM stop/start, the
set of rules in table 60 for this VM is different from the one present
right after the VM was created.

Here is a demo:

[root@devstack0 ~]# openstack server create test-vm --image cirros-0.6.2-x86_64-disk --network public --flavor m1.tiny -c id
+-------+--------------------------------------+
| Field | Value                                |
+-------+--------------------------------------+
| id    | 84c7ed9c-c78e-4d15-8a09-6eb18b0f872a |
+-------+--------------------------------------+
[root@devstack0 ~]# openstack port list --device-id 84c7ed9c-c78e-4d15-8a09-6eb18b0f872a -c ID -c mac_address
+--------------------------------------+-------------------+
| ID                                   | MAC Address       |
+--------------------------------------+-------------------+
| 4fd0022b-223d-43ac-9134-1623b38ee2a6 | fa:16:3e:4b:db:3e |
+--------------------------------------+-------------------+
[root@devstack0 ~]#


Table 60: two rules with dl_dst=fa:16:3e:4b:db:3e after VM is created:

[root@devstack0 neutron]# ovs-ofctl dump-flows br-int table=60 | grep 
fa:16:3e:4b:db:3e
 cookie=0x1a51dc2aa3392248, duration=23.420s, table=60, n_packets=0, n_bytes=0, 
idle_age=1961, priority=90,vlan_tci=0x/0x1fff,dl_dst=fa:16:3e:4b:db:3e 
actions=load:0x1c->NXM_NX_REG5[],load:0x2->NXM_NX_REG6[],resubmit(,81)
 cookie=0x1a51dc2aa3392248, duration=23.420s, table=60, n_packets=25, 
n_bytes=2450, idle_age=678, priority=90,dl_vlan=2,dl_dst=fa:16:3e:4b:db:3e 
actions=load:0x1c->NXM_NX_REG5[],load:0x2->NXM_NX_REG6[],strip_vlan,resubmit(,81)
[root@devstack0 neutron]#


Stop/start the VM and check it again:

[root@devstack0 ~]# openstack server stop test-vm
[root@devstack0 ~]# openstack server start test-vm
[root@devstack0 ~]#
[root@devstack0 neutron]# ovs-ofctl dump-flows br-int table=60 | grep 
fa:16:3e:4b:db:3e
 cookie=0x1a51dc2aa3392248, duration=14.201s, table=60, n_packets=25, 
n_bytes=2450, idle_age=697, priority=90,dl_vlan=2,dl_dst=fa:16:3e:4b:db:3e 
actions=load:0x1d->NXM_NX_REG5[],load:0x2->NXM_NX_REG6[],strip_vlan,resubmit(,81)
[root@devstack0 neutron]#

You can see that the rule [1] has disappeared.

And there is a neutron-openvswitch-agent message 'Initializing port ... that
was already initialized' while the VM is starting:

Oct 10 08:50:05 devstack0 neutron-openvswitch-agent[232791]: INFO 
neutron.agent.securitygroups_rpc [None req-df876af2-5007-42ae-ae4e-8c968f59fb5c 
None None] Preparing filters for devices 
{'4fd0022b-223d-43ac-9134-1623b38ee2a6'}
Oct 10 08:50:05 devstack0 neutron-openvswitch-agent[232791]: INFO 
neutron.agent.linux.openvswitch_firewall.firewall [None 
req-df876af2-5007-42ae-ae4e-8c968f59fb5c None None] Initializing port 
4fd0022b-223d-43ac-9134-1623b38ee2a6 that was already initialized.

I get this behavior on devstack with neutron from master branch.

It looks like this rule disappears because the OVS interface under the OVS
port is recreated after the VM stop/start, and the new OFPort object is
created with network_type=None (and physical_network=None as well). Compare
this to a few lines above, where the OFPort object is created with
network_type/physical_network set [2]
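
A hedged sketch (my assumption about the mechanism, not the actual ovsfw driver
code) of why one of the two table=60 flows depends on the network_type carried
by the OFPort object: with network_type=None on the recreated OFPort, the
flat/VLAN branch is skipped and the vlan_tci-based rule is never re-installed.

def install_transient_flows(add_flow, mac, vlan_tag, network_type):
    if network_type in ('flat', 'vlan'):
        # Rule matching traffic that does not yet carry the local VLAN tag;
        # this is the one that is missing after the VM stop/start.
        add_flow(table=60, priority=90,
                 match='vlan_tci=0x0000/0x1fff,dl_dst=%s' % mac)
    # Rule matching the local VLAN tag is installed unconditionally.
    add_flow(table=60, priority=90,
             match='dl_vlan=%d,dl_dst=%s' % (vlan_tag, mac))

# With network_type=None (the recreated OFPort), only the dl_vlan rule appears:
install_transient_flows(lambda **kw: print(kw), 'fa:16:3e:4b:db:3e', 2, None)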


I actually discovered this behavior while testing my neutron port-check plugin 
[3]

[root@devstack0 ~]# openstack port check 4fd0022b-223d-43ac-9134-1623b38ee2a6 
-c firewall
+--+--+
| Field| Value  
  |
+--+--+
| firewall | - No flow: table=60, priority=90,vlan_tci=(0, 
8191),eth_dst=fa:16:3e:4b:db:3e 
actions=set_field:29->reg5,set_field:2->reg6,resubmit(,81) |
+--+--+
[root@devstack0 ~]#

[1] 
https://opendev.org/openstack/neutron/src/commit/78027da56ccb25d19ac2c3bc1c174acb2150e6a5/neutron/agent/linux/openvswitch_firewall/firewall.py#L915
[2] 
https://opendev.org/openstack/neutron/src/commit/78027da56ccb25d19ac2c3bc1c174acb2150e6a5/neutron/agent/linux/openvswitch_firewall/firewall.py#L724
[3] https://github.com/antonkurbatov/neutron-portcheck

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2038931

Title:
  ovsfw: OVS br-int rule disappears from the table=60 after stop/start
  VM

Status in neutron:
 

[Yahoo-eng-team] [Bug 2024381] [NEW] keepalived fails to start after updating DVR-HA internal network MTU

2023-06-19 Thread Anton Kurbatov
Public bug reported:

We hit an issue where keepalived stops running after the MTU is updated on the
internal network of a DVR-HA router.
It turned out that the keepalived config references an interface from the
qrouter namespace, although the keepalived process itself runs in the snat
namespace.

Here is a simple demo on the latest master branch:
$ openstack network create net1
$ openstack subnet create sub1 --network net1 --subnet-range 192.168.100.0/24
$ openstack router create r1 --distributed --ha
$ openstack router add subnet r1 sub1

The keepalived process is running and the config looks like this:

$ ps axf | grep -w pid.keepalived
...
 130250 ?S  0:00  \_ keepalived -P -f 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd/keepalived.conf
 -p 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd.pid.keepalived
 -r 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd.pid.keepalived-vrrp
 -D
$ cat 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd/keepalived.conf
global_defs {
    notification_email_from neutron@openstack.local
    router_id neutron
}
vrrp_instance VR_60 {
    state BACKUP
    interface ha-77ee55dc-5c
    virtual_router_id 60
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    track_interface {
        ha-77ee55dc-5c
    }
    virtual_ipaddress {
        169.254.0.60/24 dev ha-77ee55dc-5c
    }
$


Now update the MTU of the internal network:

$ openstack network set net1 --mtu 1400
$ ps axf | grep -w pid.keepalived
 131097 pts/0S+ 0:00  |   \_ grep --color=auto -w pid.keepalived
$ 

$ ip netns exec snat-f7df848f-f168-4305-8ba2-a31902bdbbfd keepalived -t -f 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd/keepalived.conf
(/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd/keepalived.conf:
 Line 20) WARNING - interface qr-035f8095-76 for ip address 192.168.100.1/24 
doesn't exist
(/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd/keepalived.conf:
 Line 21) WARNING - interface qr-035f8095-76 for ip address 
fe80::f816:3eff:fe88:e922/64 doesn't exist
Non-existent interface specified in configuration
$
$ cat 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd/keepalived.conf
global_defs {
    notification_email_from neutron@openstack.local
    router_id neutron
}
vrrp_instance VR_60 {
    state BACKUP
    interface ha-77ee55dc-5c
    virtual_router_id 60
    priority 50
    garp_master_delay 60
    nopreempt
    advert_int 2
    track_interface {
        ha-77ee55dc-5c
    }
    virtual_ipaddress {
        169.254.0.60/24 dev ha-77ee55dc-5c
    }
    virtual_ipaddress_excluded {
        192.168.100.1/24 dev qr-035f8095-76
        fe80::f816:3eff:fe88:e922/64 dev qr-035f8095-76 scope link
    }
}$

$ ip netns exec snat-f7df848f-f168-4305-8ba2-a31902bdbbfd ip link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode 
DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
10: ha-77ee55dc-5c:  mtu 1450 qdisc noqueue 
state UNKNOWN mode DEFAULT group default qlen 1000
link/ether fa:16:3e:46:30:c4 brd ff:ff:ff:ff:ff:ff
$
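
A hedged sketch of the mismatch (the helper structure and names are
hypothetical, this is not the l3-agent code): addresses sitting on qr- devices
live in the qrouter namespace and therefore cannot be referenced by the
keepalived instance that runs in the snat namespace, which is exactly what
'keepalived -t' complains about above.

def build_keepalived_vips(router_ports, is_dvr):
    vips = []
    for port in router_ports:
        device = port['device']            # e.g. 'qr-035f8095-76'
        if is_dvr and device.startswith('qr-'):
            # Device exists only in the qrouter namespace on this node;
            # listing it makes keepalived (running in snat-...) refuse to start.
            continue
        for cidr in port['cidrs']:
            vips.append('%s dev %s' % (cidr, device))
    return vips

# hypothetical example ports
ports = [{'device': 'qr-035f8095-76', 'cidrs': ['192.168.100.1/24']},
         {'device': 'ha-77ee55dc-5c', 'cidrs': ['169.254.0.60/24']}]
print(build_keepalived_vips(ports, is_dvr=True))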

** Affects: neutron
 Importance: Undecided
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2024381

Title:
  keepalived fails to start after updating DVR-HA internal network MTU

Status in neutron:
  In Progress

Bug description:
  We got an issue when keepalived stops to be running after update MTU on the 
internal network of the DVR-HA router.
  It turned out that the keepalived config has an interface from qrouter-ns 
although the keepalived process itself is running in snat-ns.

  Here is a simple demo on the latest master branch:
  $ openstack network create net1
  $ openstack subnet create sub1 --network net1 --subnet-range 192.168.100.0/24
  $ openstack router create r1 --distributed --ha
  $ openstack router add subnet r1 sub1

  Keepalived process is running and the config looks like:

  $ ps axf | grep -w pid.keepalived
  ...
   130250 ?S  0:00  \_ keepalived -P -f 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd/keepalived.conf
 -p 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd.pid.keepalived
 -r 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd.pid.keepalived-vrrp
 -D
  $ cat 
/opt/stack/data/neutron/ha_confs/f7df848f-f168-4305-8ba2-a31902bdbbfd/keepalived.conf
  global_defs {
  notification_email_from neutron@openstack.local
  router_id neutron
  }
  vrrp_instance VR_60 {
  state BACKUP
  interface ha-77ee55dc-5c
  virtual_router_id 60
  priority 50
  garp_master_delay 60
  nopreempt
  advert_int 2
  track_interface {
  ha-77ee55dc-5c
  }
  virtual_ipaddress {
  169.254.0.60/24 

[Yahoo-eng-team] [Bug 2008270] [NEW] Neutron allows you to delete router_ha_interface ports, which can lead to issues

2023-02-23 Thread Anton Kurbatov
Public bug reported:

We ran into a problem with a customer whose external integration tries to
remove all ports using the neutron API, including router ports.

It seems that only router ports with the router_ha_interface device owner
are allowed to be deleted; all other router ports cannot be deleted
directly through the API.
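
A hedged sketch (not the actual neutron code) of the kind of guard that seems
to be missing: treating network:router_ha_interface like the other router port
owners in the port-delete path.

ROUTER_PORT_OWNERS = (
    'network:router_interface',
    'network:router_gateway',
    'network:router_ha_interface',   # currently not protected, hence this bug
)

def prevent_router_port_deletion(port):
    if port['device_owner'] in ROUTER_PORT_OWNERS:
        raise ValueError('Port %s is owned by router %s and cannot be '
                         'deleted directly via the port API'
                         % (port['id'], port['device_id']))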

Here is a simple example that demonstrates the doubling of ARP responses
if such a port is deleted:

[root@dev0 ~]# openstack router create r1 --ha --external-gateway public -c id
+-------+--------------------------------------+
| Field | Value                                |
+-------+--------------------------------------+
| id    | 5d9d6fee-6652-4843-9f7c-54c11899d721 |
+-------+--------------------------------------+
[root@dev0 ~]# neutron l3-agent-list-hosting-router r1
neutron CLI is deprecated and will be removed in the Z cycle. Use openstack CLI 
instead.
+--------------------------------------+------+----------------+-------+----------+
| id                                   | host | admin_state_up | alive | ha_state |
+--------------------------------------+------+----------------+-------+----------+
| 9dd0920a-cb0c-47f1-a976-3e208e3e2e6c | dev0 | True           | :-)   | active   |
| 6fa92056-ca25-42e0-aee4-c4e744008239 | dev2 | True           | :-)   | standby  |
| 8fbda128-dc9c-4b3b-be1b-bb3f11ad1447 | dev1 | True           | :-)   | standby  |
+--------------------------------------+------+----------------+-------+----------+
[root@dev0 ~]# openstack port list --device-id 
5d9d6fee-6652-4843-9f7c-54c11899d721 -c id -c device_owner -c fixed_ips --long
+--+-++
| ID   | Device Owner| Fixed IP 
Addresses |
+--+-++
| 555a9272-c9df-4a05-9f08-752c91c5a4c9 | network:router_ha_interface | 
ip_address='169.254.192.147', subnet_id='20c159f7-13f8-4093-9a4a-8380bdcfea60' |
| 6a196ff7-f3d4-4bee-aed0-b5d7ba727741 | network:router_ha_interface | 
ip_address='169.254.193.243', subnet_id='20c159f7-13f8-4093-9a4a-8380bdcfea60' |
| 7a849dcc-eac4-4d5b-a547-7ce3986ffb95 | network:router_ha_interface | 
ip_address='169.254.192.155', subnet_id='20c159f7-13f8-4093-9a4a-8380bdcfea60' |
| d77e624d-87a2-4135-9118-3d8e78539cee | network:router_gateway  | 
ip_address='10.136.17.172', subnet_id='ee15c548-e497-449e-b46d-50e9ccc0f70c'   |
+--+-++
[root@dev0 ~]#

[root@dev0 ~]# ip netns exec snat-5d9d6fee-6652-4843-9f7c-54c11899d721 ip a
...
25: ha-555a9272-c9:  mtu 1450 qdisc noqueue 
state UNKNOWN group default qlen 1000
link/ether fa:16:3e:7d:cf:a0 brd ff:ff:ff:ff:ff:ff
inet 169.254.192.147/18 brd 169.254.255.255 scope global ha-555a9272-c9
   valid_lft forever preferred_lft forever
inet 169.254.0.189/24 scope global ha-555a9272-c9
   valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe7d:cfa0/64 scope link
   valid_lft forever preferred_lft forever
28: qg-d77e624d-87:  mtu 1500 qdisc noqueue 
state UNKNOWN group default qlen 1000
link/ether fa:16:3e:a8:54:29 brd ff:ff:ff:ff:ff:ff
inet 10.136.17.172/20 scope global qg-d77e624d-87
   valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fea8:5429/64 scope link nodad
   valid_lft forever preferred_lft forever
[root@dev0 ~]#

[root@dev0 ~]# openstack port delete 555a9272-c9df-4a05-9f08-752c91c5a4c9
[root@dev0 ~]# neutron l3-agent-list-hosting-router r1
neutron CLI is deprecated and will be removed in the Z cycle. Use openstack CLI 
instead.
+--------------------------------------+------+----------------+-------+----------+
| id                                   | host | admin_state_up | alive | ha_state |
+--------------------------------------+------+----------------+-------+----------+
| 6fa92056-ca25-42e0-aee4-c4e744008239 | dev2 | True           | :-)   | active   |
| 8fbda128-dc9c-4b3b-be1b-bb3f11ad1447 | dev1 | True           | :-)   | standby  |
+--------------------------------------+------+----------------+-------+----------+
[root@dev0 ~]#

[root@dev0 ~]# ip netns exec snat-5d9d6fee-6652-4843-9f7c-54c11899d721 ip a s 
qg-d77e624d-87
28: qg-d77e624d-87:  mtu 1500 qdisc noqueue 
state UNKNOWN group default qlen 1000
link/ether fa:16:3e:a8:54:29 brd ff:ff:ff:ff:ff:ff
inet 10.136.17.172/20 scope global qg-d77e624d-87
   valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fea8:5429/64 scope link nodad
   valid_lft forever preferred_lft forever
[root@dev0 ~]# ssh dev2 ip netns exec 

[Yahoo-eng-team] [Bug 2003532] [NEW] Floating IP stuck in snat-ns after binding host to associated fixed IP

2023-01-20 Thread Anton Kurbatov
Public bug reported:

We encountered a problem where the floating IP is not removed from the snat-ns
when the FIP moves from the centralized to the distributed state (i.e. when a
host is bound to the associated fixed IP address).
This happens when the fixed IP port was originally created with a non-empty
device_owner field.

Steps to reproduce.
Create a router, a port on a private network, and a FIP with this port as a 
fixed IP port:

[root@devstack0 ~]# openstack router create --distributed r1 --external-gateway 
public
[root@devstack0 ~]# openstack router add subnet r1 private
[root@devstack0 ~]# openstack port create my-port --network private 
--device-owner compute:nova
+--------------+--------------------------------------------------------------------------------+
| Field        | Value                                                                          |
+--------------+--------------------------------------------------------------------------------+
| device_owner | compute:nova                                                                   |
| fixed_ips    | ip_address='192.168.10.133', subnet_id='8ec1cd23-363a-474c-a53f-bab4692c312f' |
+--------------+--------------------------------------------------------------------------------+
[root@devstack0 ~]# openstack floating ip create public --port my-port -c 
floating_ip_address
+---------------------+---------------+
| Field               | Value         |
+---------------------+---------------+
| floating_ip_address | 10.136.17.171 |
+---------------------+---------------+
[root@devstack0 ~]#

The FIP is added to the snat-ns:

[root@devstack0 ~]# ip netns exec snat-b961c902-8cd9-4c5c-a03c-6595368a2314 ip a
...
38: qg-6a663b96-e1:  mtu 1500 qdisc noqueue 
state UNKNOWN group default qlen 1000
link/ether fa:16:3e:bf:85:ab brd ff:ff:ff:ff:ff:ff
inet 10.136.17.175/20 brd 10.136.31.255 scope global qg-6a663b96-e1
   valid_lft forever preferred_lft forever
inet 10.136.17.171/32 brd 10.136.17.171 scope global qg-6a663b96-e1
   valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:febf:85ab/64 scope link
   valid_lft forever preferred_lft forever
...
[root@devstack0 ~]#

Create a VM with `my-port` and boot it on another node:

[root@devstack0 ~]# openstack server create vm --port my-port --image
cirros-0.5.2-x86_64-disk --flavor 1 --host devstack2


Check FIP state on the node with VM (OK):

[root@devstack2 ~]# ip netns exec qrouter-b961c902-8cd9-4c5c-a03c-6595368a2314 
ip rule
...
65426:  from 192.168.10.133 lookup 16
3232238081: from 192.168.10.1/24 lookup 3232238081
[root@devstack2 ~]#

Check the FIP on the node with the snat-ns (not OK, it's still here):

[root@devstack0 ~]# ip netns exec snat-b961c902-8cd9-4c5c-a03c-6595368a2314 ip a
...
38: qg-6a663b96-e1:  mtu 1500 qdisc noqueue 
state UNKNOWN group default qlen 1000
link/ether fa:16:3e:bf:85:ab brd ff:ff:ff:ff:ff:ff
inet 10.136.17.175/20 brd 10.136.31.255 scope global qg-6a663b96-e1
   valid_lft forever preferred_lft forever
inet 10.136.17.171/32 brd 10.136.17.171 scope global qg-6a663b96-e1
   valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:febf:85ab/64 scope link
   valid_lft forever preferred_lft forever
...
[root@devstack0 ~]#


We found that the FIP "moving" status notification is not sent to the snat
nodes in this scenario, see [1].
There was also some small discussion about why the notification should be sent
only when the device_owner changes from empty to non-empty [2].
It looks like such behavior can be considered a bug.


[1] 
https://opendev.org/openstack/neutron/src/commit/c1eff1dd440b2243a4a31cf3c3af06a01e899f1d/neutron/db/l3_dvrscheduler_db.py#L647
[2] 
https://review.opendev.org/c/openstack/neutron/+/609924/10/neutron/db/l3_dvrscheduler_db.py#503
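
A hedged sketch of the condition discussed in [1] and [2] (simplified, not the
actual neutron code): the notification fires only when device_owner goes from
empty to non-empty, so a port pre-created with device_owner='compute:nova'
never triggers it when a host is later bound.

def should_notify_fip_centralized_to_distributed(original_port, new_port):
    host_bound = (not original_port.get('binding:host_id') and
                  new_port.get('binding:host_id'))
    owner_became_set = (not original_port.get('device_owner') and
                        new_port.get('device_owner'))
    # The port in this bug already had device_owner='compute:nova' at creation
    # time, so owner_became_set stays False and the snat node is never told to
    # remove the centralized FIP.
    return bool(host_bound and owner_became_set)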

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2003532

Title:
  Floating IP stuck in snat-ns after binding host to associated fixed IP

Status in neutron:
  New

Bug description:
  We encountered a problem when the floating IP is not removed from the snat-ns 
when FIP is moving from the centralized to the distributed state (i.e. when the 
host is binding to the associated fixed IP address).
  This happens when the the fixed IP was originally created with a non-empty 
device_owner field.

  Steps to reproduce.
  Create a router, a port on a private network, and a FIP with this port as a 
fixed IP port:

  [root@devstack0 ~]# openstack router create --distributed r1 
--external-gateway public
  [root@devstack0 ~]# openstack router add subnet r1 private
  [root@devstack0 ~]# openstack port create my-port --network private 
--device-owner compute:nova
  
+--+---+
  | Field| Value   

[Yahoo-eng-team] [Bug 2003359] [NEW] DVR HA router gets stuck in backup state

2023-01-19 Thread Anton Kurbatov
Public bug reported:

We found an issue where a newly created HA DVR router gets stuck in the backup
state and does not transition to the primary state.
Preconditions:
1) there is no router using the given external network yet
2) the router goes through a quick creation->deletion, and then the next
creation of the router can get stuck in the backup state

The reason for this behavior is a fip namespace that is not removed on the
agent even though the floatingip_agent_gateway port was removed.
Below is a demo with which I managed to reproduce this behavior on a
single-node devstack setup.

Create a router and quickly delete it while the l3 agent is processing the
external gateway addition:

[root@devstack ~]# r_id=$(openstack router create r1 --distributed --ha -c id 
-f value); sleep 30 # give time to process
[root@devstack ~]# count_fip_requests() { journalctl -u devstack@q-l3.service | 
grep 'FloatingIP agent gateway port received' | wc -l; }
[root@devstack ~]# # add an external gateway and then delete the router while 
the agent processes gw
[root@devstack ~]# fip_requests=$(count_fip_requests); openstack router set 
$r_id --external-gateway public; while :; do [[ $fip_requests == 
$(count_fip_requests) ]] && { echo "waiting before deletion..."; sleep 1; } || 
break; done; openstack router delete $r_id
waiting before deletion...
waiting before deletion...
[root@devstack ~]#

As a result fip-ns is not deleted even though the
floatingip_agent_gateway port was removed:

[root@devstack ~]# ip netns
fip-8d4bc2d5-c6e7-44d0-99f7-1333bafa991f (id: 1)
[root@devstack ~]# openstack port list --network public -c ID -c device_owner 
-c status --long

[root@devstack ~]#
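
A hedged sketch (the helper names are hypothetical, this is not the l3-agent
code) of the cleanup that appears to be skipped in this race: when the
floatingip_agent_gateway port is removed, the matching fip- namespace should go
away too, otherwise the stale namespace breaks the next router creation.

def cleanup_fip_namespace(agent, ext_net_id):
    fip_ns = agent.get_fip_ns(ext_net_id)                 # hypothetical helper
    if fip_ns and not agent.get_fip_agent_gw_port(ext_net_id):
        # No agent gateway port left on this external network: the
        # fip-<ext_net_id> namespace (and its fg- device) is stale.
        fip_ns.delete()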

Now re-create the router together with the external gateway:

[root@devstack ~]# openstack router create r1 --ha --distributed
--external-gateway public

In the logs, one can see a traceback showing that the creation of this router
initially failed, followed by a successful creation:

ERROR neutron.agent.l3.dvr_fip_ns Traceback (most recent call last):
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 152, in 
create_or_update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns self._update_gateway_port(
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/opt/stack/neutron/neutron/agent/l3/dvr_fip_ns.py", line 323, in 
_update_gateway_port
ERROR neutron.agent.l3.dvr_fip_ns self.driver.set_onlink_routes(
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/opt/stack/neutron/neutron/agent/linux/interface.py", line 193, in 
set_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns onlink = 
device.route.list_onlink_routes(constants.IP_VERSION_4)
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 633, in 
list_onlink_routes
ERROR neutron.agent.l3.dvr_fip_ns routes = self.list_routes(ip_version, 
scope='link')
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 629, in list_routes
ERROR neutron.agent.l3.dvr_fip_ns return 
list_ip_routes(self._parent.namespace, ip_version, scope=scope,
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/opt/stack/neutron/neutron/agent/linux/ip_lib.py", line 1585, in list_ip_routes
ERROR neutron.agent.l3.dvr_fip_ns routes = 
privileged.list_ip_routes(namespace, ip_version, device=device,
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 333, in 
wrapped_f
ERROR neutron.agent.l3.dvr_fip_ns return self(f, *args, **kw)
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 423, in 
__call__
ERROR neutron.agent.l3.dvr_fip_ns do = self.iter(retry_state=retry_state)
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 360, in iter
ERROR neutron.agent.l3.dvr_fip_ns return fut.result()
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/usr/lib64/python3.9/concurrent/futures/_base.py", line 439, in result
ERROR neutron.agent.l3.dvr_fip_ns return self.__get_result()
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/usr/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
ERROR neutron.agent.l3.dvr_fip_ns raise self._exception
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/usr/local/lib/python3.9/site-packages/tenacity/__init__.py", line 426, in 
__call__
ERROR neutron.agent.l3.dvr_fip_ns result = fn(*args, **kwargs)
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/usr/local/lib/python3.9/site-packages/oslo_privsep/priv_context.py", line 
271, in _wrap
ERROR neutron.agent.l3.dvr_fip_ns return self.channel.remote_call(name, 
args, kwargs,
ERROR neutron.agent.l3.dvr_fip_ns   File 
"/usr/local/lib/python3.9/site-packages/oslo_privsep/daemon.py", line 215, in 
remote_call
ERROR neutron.agent.l3.dvr_fip_ns raise exc_type(*result[2])
ERROR neutron.agent.l3.dvr_fip_ns 
neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network 

[Yahoo-eng-team] [Bug 2000078] [NEW] neutron-remove-duplicated-port-bindings doesn't remove binding_levels

2022-12-19 Thread Anton Kurbatov
Public bug reported:

I'm trying to clean up INACTIVE port bindings using the
neutron-remove-duplicated-port-bindings tool from #1979072.
But I found an issue with this tool: it doesn't remove entries from the
ml2_port_binding_levels table, and those entries still block a new port
binding to the host.

Demo:
1) create VM and bind a port to another host:
$ openstack port create my-port --network private --device-owner compute:test
-> get the port ID -> 075c4058-2933-4f6f-90a9-f754e81cef52
$  curl -k -H "x-auth-token: $t" -H "Content-Type: application/json" -X POST 
http://10.136.16.186:9696/networking/v2.0/ports/075c4058-2933-4f6f-90a9-f754e81cef52/bindings
 -d '{"binding": {"host": "ak-dev2"}}'

MariaDB [neutron]> select port_id,host,vif_type,status from ml2_port_bindings where port_id='075c4058-2933-4f6f-90a9-f754e81cef52';
+--------------------------------------+---------+----------+----------+
| port_id                              | host    | vif_type | status   |
+--------------------------------------+---------+----------+----------+
| 075c4058-2933-4f6f-90a9-f754e81cef52 | ak-dev1 | ovs      | ACTIVE   |
| 075c4058-2933-4f6f-90a9-f754e81cef52 | ak-dev2 | ovs      | INACTIVE |
+--------------------------------------+---------+----------+----------+
2 rows in set (0.000 sec)

MariaDB [neutron]> select * from ml2_port_binding_levels where port_id='075c4058-2933-4f6f-90a9-f754e81cef52';
+--------------------------------------+---------+-------+-------------+--------------------------------------+
| port_id                              | host    | level | driver      | segment_id                           |
+--------------------------------------+---------+-------+-------------+--------------------------------------+
| 075c4058-2933-4f6f-90a9-f754e81cef52 | ak-dev1 |     0 | openvswitch | 2250e731-0046-46ae-8cf0-8da7fd3aad98 |
| 075c4058-2933-4f6f-90a9-f754e81cef52 | ak-dev2 |     0 | openvswitch | 2250e731-0046-46ae-8cf0-8da7fd3aad98 |
+--------------------------------------+---------+-------+-------------+--------------------------------------+
2 rows in set (0.000 sec)

MariaDB [neutron]>

2) remove INACTIVE port bindings via neutron-remove-duplicated-port-bindings:
$ neutron-remove-duplicated-port-bindings --config-file 
/etc/neutron/neutron.conf

MariaDB [neutron]> select port_id,host,vif_type,status from ml2_port_bindings where port_id='075c4058-2933-4f6f-90a9-f754e81cef52';
+--------------------------------------+---------+----------+--------+
| port_id                              | host    | vif_type | status |
+--------------------------------------+---------+----------+--------+
| 075c4058-2933-4f6f-90a9-f754e81cef52 | ak-dev1 | ovs      | ACTIVE |
+--------------------------------------+---------+----------+--------+
1 row in set (0.000 sec)

MariaDB [neutron]> select * from ml2_port_binding_levels where port_id='075c4058-2933-4f6f-90a9-f754e81cef52';
+--------------------------------------+---------+-------+-------------+--------------------------------------+
| port_id                              | host    | level | driver      | segment_id                           |
+--------------------------------------+---------+-------+-------------+--------------------------------------+
| 075c4058-2933-4f6f-90a9-f754e81cef52 | ak-dev1 |     0 | openvswitch | 2250e731-0046-46ae-8cf0-8da7fd3aad98 |
| 075c4058-2933-4f6f-90a9-f754e81cef52 | ak-dev2 |     0 | openvswitch | 2250e731-0046-46ae-8cf0-8da7fd3aad98 |
+--------------------------------------+---------+-------+-------------+--------------------------------------+
2 rows in set (0.000 sec)

MariaDB [neutron]>

3) Create the port binding again. It fails:

$ # curl -k -H "x-auth-token: $t" -H "Content-Type: application/json" -X POST 
http://10.136.16.186:9696/networking/v2.0/ports/075c4058-2933-4f6f-90a9-f754e81cef52/bindings
 -d '{"binding": {"host": "ak-dev2"}}'
{"NeutronError": {"type": "NeutronDbObjectDuplicateEntry", "message": "Failed 
to create a duplicate PortBindingLevel: for attribute(s) ['PRIMARY'] with 
value(s) 075c4058-2933-4f6f-90a9-f754e81cef52-ak-dev2-0", "detail": ""}}
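
A hedged sketch (not the actual tool code; the DSN is a placeholder) of the
extra cleanup the tool would need: dropping the matching
ml2_port_binding_levels rows together with the INACTIVE ml2_port_bindings row,
so that re-binding does not hit the duplicate PortBindingLevel error.

from sqlalchemy import create_engine, text

engine = create_engine('mysql+pymysql://neutron:secret@127.0.0.1/neutron')  # placeholder DSN

def remove_duplicated_binding(port_id, inactive_host):
    with engine.begin() as conn:
        conn.execute(
            text("DELETE FROM ml2_port_bindings WHERE port_id = :port "
                 "AND host = :host AND status = 'INACTIVE'"),
            {'port': port_id, 'host': inactive_host})
        # The part reported missing in this bug:
        conn.execute(
            text("DELETE FROM ml2_port_binding_levels WHERE port_id = :port "
                 "AND host = :host"),
            {'port': port_id, 'host': inactive_host})

remove_duplicated_binding('075c4058-2933-4f6f-90a9-f754e81cef52', 'ak-dev2')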

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/278

Title:
  neutron-remove-duplicated-port-bindings doesn't remove binding_levels

Status in neutron:
  New

Bug description:
  I'm trying to do an INACTIVE port binding cleanup using 
neutron-remove-duplicated-port-bindings tool from #1979072
  But I found an issue with this help tool: it doesn't remove entries from the 
ml2_port_binding_levels table that still blocks new port binding to the host.

  Demo:
  1) 
  create VM and bind a port to another host:
  $ openstack port create my-port --network private  --device-owner compute:test
  -> get port port ID -> 075c4058-2933-4f6f-90a9-f754e81cef52
  $  curl -k -H "x-auth-token: $t" 

[Yahoo-eng-team] [Bug 1999678] [NEW] Static route can get stuck in the router snat namespace

2022-12-14 Thread Anton Kurbatov
Public bug reported:

I ran into a problem where a static route just gets stuck in the snat
namespace, even after removing all static routes from a distributed router
with HA enabled.
Here is a simple demo from my devstack setup:

[root@node0 ~]# openstack network create private
[root@node0 ~]# openstack subnet create private --network private 
--subnet-range 192.168.10.0/24 --dhcp --gateway 192.168.10.1
[root@node0 ~]# openstack router create r1 --external-gateway public 
--distributed --ha
[root@node0 ~]# openstack router add subnet r1 private
[root@node0 ~]# openstack router set r1 --route 
destination=8.8.8.0/24,gateway=192.168.10.100 --route 
destination=8.8.8.0/24,gateway=192.168.10.200

After multipath route was added, snat-ns routes look like this:

[root@node0 ~]# ip netns exec snat-dcbec74b-2003-4447-8854-524d918260ac ip r
default via 10.136.16.1 dev qg-94c43336-56 proto keepalived
8.8.8.0/24 via 192.168.10.200 dev sg-dcf4a20b-8a proto keepalived
8.8.8.0/24 via 192.168.10.100 dev sg-dcf4a20b-8a proto keepalived
8.8.8.0/24 via 192.168.10.100 dev sg-dcf4a20b-8a proto static
10.136.16.0/20 dev qg-94c43336-56 proto kernel scope link src 10.136.17.171
169.254.0.0/24 dev ha-11b5b7d3-4e proto kernel scope link src 169.254.0.21
169.254.192.0/18 dev ha-11b5b7d3-4e proto kernel scope link src 169.254.195.228
192.168.10.0/24 dev sg-dcf4a20b-8a proto kernel scope link src 192.168.10.228
[root@node0 ~]#

Note that there is only one 'static' route added by neutron and no multipath
route, plus two 'proto keepalived' routes that have been added by the
keepalived process.
Now delete all the routes and check the routes inside the snat-ns; the
'static' route is still there:

[root@node0 ~]# openstack router set r1 --no-route
[root@node0 ~]# ip netns exec snat-dcbec74b-2003-4447-8854-524d918260ac ip r
default via 10.136.16.1 dev qg-94c43336-56 proto keepalived
8.8.8.0/24 via 192.168.10.100 dev sg-dcf4a20b-8a proto static
10.136.16.0/20 dev qg-94c43336-56 proto kernel scope link src 10.136.17.171
169.254.0.0/24 dev ha-11b5b7d3-4e proto kernel scope link src 169.254.0.21
169.254.192.0/18 dev ha-11b5b7d3-4e proto kernel scope link src 169.254.195.228
192.168.10.0/24 dev sg-dcf4a20b-8a proto kernel scope link src 192.168.10.228
[root@node0 ~]#
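
A hedged diagnostic sketch (not a fix): after clearing the router's routes,
leftover 'proto static' entries in the snat namespace can be listed like this
to spot the stale 8.8.8.0/24 route shown above.

import subprocess

def stale_static_routes(router_id):
    ns = 'snat-%s' % router_id
    out = subprocess.check_output(
        ['ip', 'netns', 'exec', ns, 'ip', 'route', 'show', 'proto', 'static'],
        text=True)
    return [line for line in out.splitlines() if line.strip()]

print(stale_static_routes('dcbec74b-2003-4447-8854-524d918260ac'))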

** Affects: neutron
 Importance: Undecided
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1999678

Title:
  Static route can get stuck in the router snat namespace

Status in neutron:
  In Progress

Bug description:
  I ran into a problem where a static route just gets stuck in the snat 
namespace, even when removing all static routes from a distributed router with 
ha enabled.
  Here is a simple demo from my devstack setup:

  [root@node0 ~]# openstack network create private
  [root@node0 ~]# openstack subnet create private --network private 
--subnet-range 192.168.10.0/24 --dhcp --gateway 192.168.10.1
  [root@node0 ~]# openstack router create r1 --external-gateway public 
--distributed --ha
  [root@node0 ~]# openstack router add subnet r1 private
  [root@node0 ~]# openstack router set r1 --route 
destination=8.8.8.0/24,gateway=192.168.10.100 --route 
destination=8.8.8.0/24,gateway=192.168.10.200

  After multipath route was added, snat-ns routes look like this:

  [root@node0 ~]# ip netns exec snat-dcbec74b-2003-4447-8854-524d918260ac ip r
  default via 10.136.16.1 dev qg-94c43336-56 proto keepalived
  8.8.8.0/24 via 192.168.10.200 dev sg-dcf4a20b-8a proto keepalived
  8.8.8.0/24 via 192.168.10.100 dev sg-dcf4a20b-8a proto keepalived
  8.8.8.0/24 via 192.168.10.100 dev sg-dcf4a20b-8a proto static
  10.136.16.0/20 dev qg-94c43336-56 proto kernel scope link src 10.136.17.171
  169.254.0.0/24 dev ha-11b5b7d3-4e proto kernel scope link src 169.254.0.21
  169.254.192.0/18 dev ha-11b5b7d3-4e proto kernel scope link src 
169.254.195.228
  192.168.10.0/24 dev sg-dcf4a20b-8a proto kernel scope link src 192.168.10.228
  [root@node0 ~]#

  Note that there is only one 'static' route added by neutron and no multipath 
route.
  And two routes with 'proto keepalived' that have been added by keepalived 
process.
  Now delete all routes and check the routes inside snat-ns, the route is still 
there:

  [root@node0 ~]# openstack router set r1 --no-route
  [root@node0 ~]# ip netns exec snat-dcbec74b-2003-4447-8854-524d918260ac ip r
  default via 10.136.16.1 dev qg-94c43336-56 proto keepalived
  8.8.8.0/24 via 192.168.10.100 dev sg-dcf4a20b-8a proto static
  10.136.16.0/20 dev qg-94c43336-56 proto kernel scope link src 10.136.17.171
  169.254.0.0/24 dev ha-11b5b7d3-4e proto kernel scope link src 169.254.0.21
  169.254.192.0/18 dev ha-11b5b7d3-4e proto kernel scope link src 
169.254.195.228
  192.168.10.0/24 dev sg-dcf4a20b-8a proto kernel scope link src 192.168.10.228
  [root@node0 ~]#

To manage notifications about this bug go to:

[Yahoo-eng-team] [Bug 1998343] [NEW] Unittest test_distributed_port_binding_deleted_by_port_deletion fails: DeprecationWarning('ssl.PROTOCOL_TLS is deprecated')

2022-11-30 Thread Anton Kurbatov
Public bug reported:

I got an error in the test_distributed_port_binding_deleted_by_port_deletion 
test on my CI run [1].
Also, I found the same failure in another CI run [2].

FAIL: 
neutron.tests.unit.plugins.ml2.test_db.Ml2DvrDBTestCase.test_distributed_port_binding_deleted_by_port_deletion
tags: worker-0
--
stderr: {{{
/home/zuul/src/opendev.org/openstack/neutron/.tox/shared/lib/python3.10/site-packages/ovs/stream.py:794:
 DeprecationWarning: ssl.PROTOCOL_TLS is deprecated
  ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
/home/zuul/src/opendev.org/openstack/neutron/.tox/shared/lib/python3.10/site-packages/ovs/stream.py:794:
 DeprecationWarning: ssl.PROTOCOL_TLS is deprecated
  ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
}}}

Traceback (most recent call last):
  File "/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/base.py", 
line 182, in func
return f(self, *args, **kwargs)
  File 
"/home/zuul/src/opendev.org/openstack/neutron/neutron/tests/unit/plugins/ml2/test_db.py",
 line 535, in test_distributed_port_binding_deleted_by_port_deletion
self.assertEqual(
  File 
"/home/zuul/src/opendev.org/openstack/neutron/.tox/shared/lib/python3.10/site-packages/testtools/testcase.py",
 line 393, in assertEqual
self.assertThat(observed, matcher, message)
  File 
"/home/zuul/src/opendev.org/openstack/neutron/.tox/shared/lib/python3.10/site-packages/testtools/testcase.py",
 line 480, in assertThat
raise mismatch_error
testtools.matchers._impl.MismatchError: [] != []: Warnings: {message : DeprecationWarning('ssl.PROTOCOL_TLS 
is deprecated'), category : 'DeprecationWarning', filename : 
'/home/zuul/src/opendev.org/openstack/neutron/.tox/shared/lib/python3.10/site-packages/ovs/stream.py',
 lineno : 794, line : None}

I have spent some time and seem to have found the reason for this behavior on 
python 3.10.
First of all, since python3.10 we get a warning when using ssl.PROTOCOL_TLS [3]:

[root@node0 neutron]# python
Python 3.10.8+ (heads/3.10-dirty:ca3c480, Nov 30 2022, 12:16:40) [GCC 4.8.5 
20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ssl
>>> ssl.SSLContext(ssl.PROTOCOL_SSLv23)
:1: DeprecationWarning: ssl.PROTOCOL_TLS is deprecated

>>>

I also found that the `test_ssl_connection` test case affects catching warnings 
in the test_distributed_port_binding_deleted_by_port_deletion test case.
I was then able to reproduce the issue like this:

[root@node0 neutron]# cat run_list.txt
neutron.tests.unit.agent.ovsdb.native.test_connection.ConfigureSslConnTestCase.test_ssl_connection
neutron.tests.unit.plugins.ml2.test_db.Ml2DvrDBTestCase.test_distributed_port_binding_deleted_by_port_deletion
[root@node0 neutron]# git diff
diff --git a/neutron/tests/unit/plugins/ml2/test_db.py 
b/neutron/tests/unit/plugins/ml2/test_db.py
index 578a01a..d837871 100644
--- a/neutron/tests/unit/plugins/ml2/test_db.py
+++ b/neutron/tests/unit/plugins/ml2/test_db.py
@@ -531,6 +531,8 @@ class Ml2DvrDBTestCase(testlib_api.SqlTestCase):
 router_id='router_id',
 status=constants.PORT_STATUS_DOWN).create()
 with warnings.catch_warnings(record=True) as warning_list:
+import time
+time.sleep(0.1)
 port.delete()
 self.assertEqual(
 [], warning_list,
[root@node0 neutron]# source .tox/shared/bin/activate
(shared) [root@node0 neutron]# stestr run --concurrency=1 --load-list 
./run_list.txt
...
neutron.tests.unit.plugins.ml2.test_db.Ml2DvrDBTestCase.test_distributed_port_binding_deleted_by_port_deletion
--
Captured traceback:
~~~
Traceback (most recent call last):
  File "/root/github/neutron/neutron/tests/base.py", line 182, in func
return f(self, *args, **kwargs)
  File "/root/github/neutron/neutron/tests/unit/plugins/ml2/test_db.py", 
line 537, in test_distributed_port_binding_deleted_by_port_deletion
self.assertEqual(
  File 
"/root/github/neutron/.tox/shared/lib/python3.10/site-packages/testtools/testcase.py",
 line 393, in assertEqual
self.assertThat(observed, matcher, message)
  File 
"/root/github/neutron/.tox/shared/lib/python3.10/site-packages/testtools/testcase.py",
 line 480, in assertThat
raise mismatch_error
testtools.matchers._impl.MismatchError: [] != []: Warnings: {message : 
DeprecationWarning('ssl.PROTOCOL_TLS is deprecated'), category : 
'DeprecationWarning', filename : 
'/root/github/neutron/.tox/shared/lib/python3.10/site-packages/ovs/stream.py', 
lineno : 794, line : None}

==
Totals
==
Ran: 2 tests in 1.3571 sec.
 - Passed: 1
 - Skipped: 0
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 1
Sum of execute time for each test: 1.3053 sec.
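
A hedged, standalone illustration of what seems to happen (my assumption about
the mechanism): warnings.catch_warnings() swaps process-global state, so a
DeprecationWarning emitted from a thread/greenthread started by the earlier ssl
test can land inside the later test's record=True block on Python 3.10+.

import ssl
import threading
import time
import warnings

def late_ssl_use():
    time.sleep(0.05)
    ssl.SSLContext(ssl.PROTOCOL_SSLv23)     # DeprecationWarning on Python 3.10+

threading.Thread(target=late_ssl_use).start()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    time.sleep(0.1)                         # same trick as the sleep in the diff

print(caught)   # the ssl DeprecationWarning shows up in the "unrelated" block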


[1] 

[Yahoo-eng-team] [Bug 1998110] [NEW] Tempest test test_resize_server_revert: failed to build and is in ERROR status: Virtual Interface creation failed

2022-11-28 Thread Anton Kurbatov
Public bug reported:

In my CI run I got an error in test_resize_server_revert test case [1]

{3}
tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
[401.454625s] ... FAILED

Captured traceback:
~~~
Traceback (most recent call last):

  File 
"/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 
430, in test_resize_server_revert
waiters.wait_for_server_status(self.client, self.server_id, 'ACTIVE')

  File "/opt/stack/tempest/tempest/common/waiters.py", line 101, in 
wait_for_server_status
raise lib_exc.TimeoutException(message)

tempest.lib.exceptions.TimeoutException: Request timed out
Details: (ServerActionsTestJSON:test_resize_server_revert) Server 
e69e6d33-c494-415a-9cb8-b597af2ea052 failed to reach ACTIVE status and task 
state "None" within the required time (196 s). Current status: REVERT_RESIZE. 
Current task state: resize_reverting.


Captured traceback-1:
~
Traceback (most recent call last):

  File "/opt/stack/tempest/tempest/api/compute/base.py", line 228, in 
server_check_teardown
waiters.wait_for_server_status(cls.servers_client,

  File "/opt/stack/tempest/tempest/common/waiters.py", line 81, in 
wait_for_server_status
raise exceptions.BuildErrorException(details, server_id=server_id)

tempest.exceptions.BuildErrorException: Server 
e69e6d33-c494-415a-9cb8-b597af2ea052 failed to build and is in ERROR status
Details: Fault: {'code': 500, 'created': '2022-11-23T21:46:15Z', 'message': 
'Virtual Interface creation failed'}.

The test checks the following:
1) resize to a new flavor;
2) wait for the VM VERIFY_RESIZE status;
3) revert the resize;
4) wait for the VM ACTIVE status <- I got the failure here.

The test did a resize with a change of node:
the VM was on node 0032209120 -> resize the VM, the new node is 0032209122 ->
revert the resize

The `resize revert` (p3) started here:
Nov 23 21:41:05.514686 ubuntu-jammy-rax-dfw-0032209120 
devstack@n-api.service[54681]: DEBUG nova.api.openstack.wsgi [None 
req-83266751-d6d9-4a35-89fc-b4c97c1b481d 
tempest-ServerActionsTestJSON-1939410532 
tempest-ServerActionsTestJSON-1939410532-project] Action: 'action', calling 
method: >, body: {"revertResize": {}} {{(pid=54681) _process_stack 
/opt/stack/nova/nova/api/openstack/wsgi.py:511}}

Nova got an unexpected network-vif-plugged event:
Nov 23 21:41:12.404453 ubuntu-jammy-rax-dfw-0032209122 nova-compute[31414]: 
WARNING nova.compute.manager [req-b389f403-c195-4fa0-b578-7b687f85b79d 
req-c9eab04d-708d-4666-b4ce-f7bb760c7aa6 service nova] [instance: 
e69e6d33-c494-415a-9cb8-b597af2ea052] Received unexpected event 
network-vif-plugged-775d8945-1367-4e08-8306-9c683e1891cf for instance with 
vm_state resized and task_state resize_reverting.

Nova is preparing to receive the network-vif-plugged notification:
Nov 23 21:41:13.497369 ubuntu-jammy-rax-dfw-0032209122 nova-compute[31414]: 
DEBUG nova.compute.manager [None req-83266751-d6d9-4a35-89fc-b4c97c1b481d 
tempest-ServerActionsTestJSON-1939410532 
tempest-ServerActionsTestJSON-1939410532-project] [instance: 
e69e6d33-c494-415a-9cb8-b597af2ea052] Preparing to wait for external event 
network-vif-plugged-775d8945-1367-4e08-8306-9c683e1891cf {{(pid=31414) 
prepare_for_instance_event /opt/stack/nova/nova/compute/manager.py:281}}

So, there is an unexpected network-vif-plugged event.
I believe that the trigger of this event is the `resize` operation from p1: Nova
does not wait for network interfaces to be plugged when resizing a VM
(vifs_already_plugged=True), and a VM can switch to the VERIFY_RESIZE status
without waiting for Neutron to finish processing the port [2]
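
A hedged sketch (not nova's implementation) of the external-event handshake
that appears to go wrong here: if neutron emits network-vif-plugged before the
destination compute has called prepare_for_instance_event(), the event is
dropped as 'unexpected' and the later wait times out with 'Virtual Interface
creation failed'.

import threading

class ExternalEventWaiter:
    def __init__(self):
        self._events = {}

    def prepare(self, tag):
        # Must be called before neutron sends the event.
        self._events[tag] = threading.Event()

    def notify(self, tag):
        event = self._events.get(tag)
        if event is None:
            print('Received unexpected event %s' % tag)   # the WARNING above
            return
        event.set()

    def wait(self, tag, timeout=300):
        if not self._events[tag].wait(timeout):
            raise RuntimeError('Virtual Interface creation failed')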


At the same time, on the Neutron server side:

Binding the port to the node 0032209120 in `resize` operation (p1):
Nov 23 21:40:57.981780 ubuntu-jammy-rax-dfw-0032209120 neutron-server[55724]: 
DEBUG neutron.api.v2.base [req-b1a98064-7e8e-4ad3-84cf-09e3bf12727e 
req-036d315c-21e6-47d5-be1d-44a4efc8a3e9 service neutron] Request body: 
{'port': {'binding:host_id': 'ubuntu-jammy-rax-dfw-0032209120', 'device_owner': 
'compute:nova'}} {{(pid=55724) prepare_request_body 
/opt/stack/neutron/neutron/api/v2/base.py:731}}

Binding the port to the node 0032209122 in `resize revert` operation (p3):
Nov 23 21:41:10.832391 ubuntu-jammy-rax-dfw-0032209120 neutron-server[55723]: 
DEBUG neutron.api.v2.base [req-83266751-d6d9-4a35-89fc-b4c97c1b481d 
req-268a8b14-b6b9-438d-bc3f-446f5eaad88d service neutron] Request body: 
{'port': {'binding:host_id': 'ubuntu-jammy-rax-dfw-0032209122', 'device_owner': 
'compute:nova'}} {{(pid=55723) prepare_request_body 
/opt/stack/neutron/neutron/api/v2/base.py:731}}

Provisioning completed by L2 from `resize` operation (p1):
Nov 23 21:41:10.950190 ubuntu-jammy-rax-dfw-0032209120 neutron-server[55725]: 
DEBUG neutron.db.provisioning_blocks [None 
req-793235de-b92d-459f-b016-a3d9ba1a1ddd None None] Provisioning complete for 
port 

[Yahoo-eng-team] [Bug 1997492] [NEW] Neutron server doesn't wait for port DHCP provisioning while VM creation

2022-11-22 Thread Anton Kurbatov
Public bug reported:

I found that neutron-server does not wait for successful port provisioning by
the DHCP agent in the case of VM creation: the DHCP entity is not added to the
provisioning blocks by neutron-server for such a port.
As a result, nova receives a notification that the port is plugged while the
DHCP agent is still processing the port, or is even failing to process it.

Steps to reproduce on devstack from master:

- make the port_create_end method fail on the DHCP agent side [1]
- create a VM on a network with DHCP enabled

The VM is successfully created and the port is ACTIVE, while the DHCP entry
for this port is not configured.

[root@node0 neutron]# git diff
diff --git a/neutron/agent/dhcp/agent.py b/neutron/agent/dhcp/agent.py
index 7349d7e297..553ba81fdc 100644
--- a/neutron/agent/dhcp/agent.py
+++ b/neutron/agent/dhcp/agent.py
@@ -676,6 +676,7 @@ class DhcpAgent(manager.Manager):
 payload.get('priority', DEFAULT_PRIORITY),
 action='_port_create',
 resource=created_port, obj_type='port')
+raise Exception('fail for testing purposes')
 self._queue.add(update)

 @_wait_if_syncing
 
 
[root@node0 neutron]# openstack server create test-vm --network net1 --flavor 
m1.tiny --image cirros-0.5.2-x86_64-disk
[root@node0 ~]# openstack server list
+--+-+++--+-+
| ID   | Name| Status | Networks   
| Image| Flavor  |
+--+-+++--+-+
| cce75084-b1e0-4407-a0d6-0074ed05abad | test-vm | ACTIVE | net1=192.168.1.111 
| cirros-0.5.2-x86_64-disk | m1.tiny |
+--+-+++--+-+
[root@node0 ~]# openstack port list --device-id 
cce75084-b1e0-4407-a0d6-0074ed05abad
+--+--+---+--++
| ID   | Name | MAC Address   | Fixed IP 
Addresses   | Status |
+--+--+---+--++
| d7e55e08-05ae-4ac4-8cd0-4f88b93c5872 |  | fa:16:3e:9e:30:b3 | 
ip_address='192.168.1.111', subnet_id='281f70f3-8996-436b-ab90-bff1f9dbf5f8' | 
ACTIVE |
+--+--+---+--++
[root@node0 ~]#
[root@node0 ~]# cat 
/opt/stack/data/neutron/dhcp/710bcfcd-44d9-445d-a895-8ec522f64016/addn_hosts
[root@node0 ~]#


During VM creation there are two API calls from nova:
1) Port 'create' API call:
Nov 22 16:19:40 node0 neutron-server[953593]: DEBUG neutron.api.v2.base 
[req-5cbe6387-fe21-4509-81f6-cfcfe268252f 
req-0b7496ea-3697-4bc8-abb4-95d8f23d3497 demo admin] Request body: {'port': 
{'device_id': 'cce75084-b1e0-4407-a0d6-0074ed05abad', 'network_id': 
'710bcfcd-44d9-445d-a895-8ec522f64016', 'admin_state_up': True, 'tenant_id': 
'a022c969871149e9b19ec31c896a0701'}} {{(pid=953593) prepare_request_body 
/opt/stack/neutron/neutron/api/v2/base.py:730}}

2) Port 'update' API call:
Nov 22 16:16:11 node0 neutron-server[953593]: DEBUG neutron.api.v2.base 
[req-145264e0-96a0-450b-9ad5-a5181c2497b1 
req-9015e2c3-7dbb-430f-9cba-c7d6972f5134 service neutron] Request body: 
{'port': {'device_id': '4a4f87c0-a357-49eb-8639-58b499b8ae1f', 'device_owner': 
'compute:nova', 'binding:host_id': 'node1'}} {{(pid=953593) 
prepare_request_body /opt/stack/neutron/neutron/api/v2/base.py:730}}

For the port 'create' API call, a DHCP provisioning block is not set up because
device_owner is absent [2].
For the port 'update' API call, a DHCP provisioning block is not set up because
none of the fixed_ips/mac_address fields is updated [3].

[1] 
https://opendev.org/openstack/neutron/src/commit/51827d8e78db4926f3aa347c4b2237a7b210f861/neutron/agent/dhcp/agent.py#L670
[2] 
https://opendev.org/openstack/neutron/src/commit/51827d8e78db4926f3aa347c4b2237a7b210f861/neutron/plugins/ml2/plugin.py#L1501
[3] 
https://opendev.org/openstack/neutron/src/commit/51827d8e78db4926f3aa347c4b2237a7b210f861/neutron/plugins/ml2/plugin.py#L1925
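
A hedged sketch of the two conditions described above (simplified, not the
actual ml2 plugin code; add_provisioning_component() is a stand-in for
neutron.db.provisioning_blocks): neither the nova 'create' call (no
device_owner) nor the 'bind' update (no fixed_ips/mac_address change) adds a
DHCP provisioning block.

DHCP_ENTITY = 'DHCP'

def add_provisioning_component(port_id, entity):
    print('provisioning block %s added for port %s' % (entity, port_id))

def maybe_add_dhcp_provisioning_block(port, original=None):
    if original is None:
        # Port create: nova creates the port without a device_owner.
        if port.get('device_owner'):
            add_provisioning_component(port['id'], DHCP_ENTITY)
    else:
        # Port update: nova only sets device_id/device_owner/binding:host_id.
        changed = any(
            field in port and port[field] != original.get(field)
            for field in ('fixed_ips', 'mac_address'))
        if changed:
            add_provisioning_component(port['id'], DHCP_ENTITY)

# Nova's two calls from the bug: neither adds a DHCP provisioning block.
maybe_add_dhcp_provisioning_block({'id': 'd7e55e08', 'device_id': 'cce75084'})
maybe_add_dhcp_provisioning_block(
    {'id': 'd7e55e08', 'device_owner': 'compute:nova', 'binding:host_id': 'node1'},
    original={'id': 'd7e55e08', 'fixed_ips': [], 'mac_address': 'fa:16:3e:9e:30:b3'})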

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1997492

Title:
  Neutron server doesn't wait for port DHCP provisioning while VM
  creation

Status in neutron:
  New

Bug description:
  I found that neutron-server does not wait for successful 

[Yahoo-eng-team] [Bug 1997090] [NEW] VMs listing with sort keys throws exception when trying to compare None values

2022-11-18 Thread Anton Kurbatov
Public bug reported:

The nova-api raises an exception on an attempt to list VMs sorted by, e.g.,
the task_state key.

Here are the steps to reproduce:

- create two VMs: vm1 in ACTIVE state (cell1) and vm2 in ERROR state (cell0)
- try to list servers sorted by sort_key=task_state

[root@node0 ~]# openstack server create vm1 --network net1 --flavor m1.tiny 
--image cirros-0.5.2-x86_64-disk
[root@node0 ~]# openstack server create vm2 --network net1 --flavor m1.xlarge 
--image cirros-0.5.2-x86_64-disk
[root@node0 ~]# openstack server list -f json --long -c ID -c 'Task State' -c 
'Status'
[
  {
"ID": "3a3927c4-9f67-4356-8a3e-a3e58cf0744e",
"Status": "ERROR",
"Task State": null
  },
  {
"ID": "9af631ec-3e59-45da-bafa-85141e3707da",
"Status": "ACTIVE",
"Task State": null
  }
]
[root@node0 ~]#
[root@node0 ~]# curl -k -H "x-auth-token: $s" 
'http://10.136.16.186/compute/v2.1/servers/detail?sort_key=task_state'
{"computeFault": {"code": 500, "message": "Unexpected API Error. Please report 
this at http://bugs.launchpad.net/nova/ and attach the Nova API log if 
possible.\n"}}[root@node0 ~]#
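
The traceback below ends up comparing the task_state of the two instances
directly. A hedged sketch, assuming a cmp-style comparison over the requested
sort keys (not nova's exact code), shows why two None values blow up on
Python 3:

vm_error = {'uuid': '3a3927c4', 'task_state': None}    # record from cell0
vm_active = {'uuid': '9af631ec', 'task_state': None}   # record from cell1

def compare_records(rec1, rec2, sort_keys=('task_state',)):
    for key in sort_keys:
        if rec1[key] < rec2[key]:      # TypeError when both values are None
            return -1
        elif rec1[key] > rec2[key]:
            return 1
    return 0

try:
    compare_records(vm_error, vm_active)
except TypeError as exc:
    print(exc)   # '<' not supported between instances of 'NoneType' and 'NoneType'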

Traceback:

Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi [None req-59ce5d12-1c84-4c45-8b10-da863b721d6f demo 
admin] Unexpected exception in API method: TypeError: '<' not supported between 
instances of 'NoneType' and 'NoneType'
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi Traceback (most recent call last):
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/wsgi.py", 
line 664, in wrapped
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi return f(*args, **kwargs)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/validation/__init__.py", line 192, in wrapper
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi return func(*args, **kwargs)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/validation/__init__.py", line 192, in wrapper
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi return func(*args, **kwargs)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/validation/__init__.py", line 192, in wrapper
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi return func(*args, **kwargs)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   [Previous line repeated 2 more times]
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/openstack/compute/servers.py", line 143, in detail
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi servers = self._get_servers(req, is_detail=True)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/api/openstack/compute/servers.py", line 327, in 
_get_servers
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi instance_list = self.compute_api.get_all(elevated 
or context,
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/opt/stack/nova/nova/compute/api.py", line 
3140, in get_all
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi insts, down_cell_uuids = 
instance_list.get_instance_objects_sorted(
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/opt/stack/nova/nova/compute/instance_list.py", 
line 176, in get_instance_objects_sorted
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi instance_list = 
instance_obj._make_instance_list(ctx,
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/opt/stack/nova/nova/objects/instance.py", line 
1287, in _make_instance_list
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi for db_inst in db_inst_list:
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 
"/opt/stack/nova/nova/compute/multi_cell_list.py", line 411, in 
get_records_sorted
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi item = next(feeder)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File "/usr/lib64/python3.9/heapq.py", line 353, in 
merge
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi _heapify(h)
Nov 18 09:59:09 node0 devstack@n-api.service[1156072]: ERROR 
nova.api.openstack.wsgi   File 

[Yahoo-eng-team] [Bug 1996788] [NEW] The virtual network is broken on the node after neutron-openvswitch-agent is restarted if RPC requests return an error for a while.

2022-11-16 Thread Anton Kurbatov
Public bug reported:

We ran into a problem in our OpenStack cluster where traffic does not pass
through the virtual network on the node on which the neutron-openvswitch-agent
was restarted.
We were updating from one OpenStack version to another and, by chance, ended up
with an inconsistency between the DB and neutron-server: any port select from
the DB returned an error.
For a while after its restart, the neutron-openvswitch-agent could not get any
information via RPC in its rpc_loop iterations because of this
DB/neutron-server inconsistency.
But even after the database was fixed, the virtual network remained broken on
the node where the neutron-openvswitch-agent had been restarted.

I believe I have found the problematic place in the neutron-ovs-agent logic.
The easiest way to demonstrate it is to emulate an RPC request failure from
neutron-ovs-agent to neutron-server.

Here are the steps to reproduce on devstack setup from the master branch.
Two nodes: node0 is controller, node1 is compute.

0) Prepare a vxlan based network and a VM.
[root@node0 ~]# openstack network create net1
[root@node0 ~]# openstack subnet create sub1 --network net1 --subnet-range 
192.168.1.0/24
[root@node0 ~]# openstack server create vm1 --network net1 --flavor m1.tiny 
--image cirros-0.5.2-x86_64-disk --host node1

Just after creating the VM, there is a message in the devstack@q-agt
logs:

Nov 16 09:53:35 node1 neutron-openvswitch-agent[374810]: INFO
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None
req-77753b72-cb23-4dae-b68a-7048b63faf8b None None] Assigning 1 as local
vlan for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466

So the local VLAN used on node1 for this network is `1`.
A ping from node0 to the VM on node1 succeeds:

[root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping 
192.168.1.211
PING 192.168.1.211 (192.168.1.211) 56(84) bytes of data.
64 bytes from 192.168.1.211: icmp_seq=1 ttl=64 time=1.86 ms
64 bytes from 192.168.1.211: icmp_seq=2 ttl=64 time=0.891 ms

1) Now, please don't misunderstand me: I am not patching the code just to show
that something obviously breaks afterwards; I only want to emulate a problem
that is hard to reproduce in a normal way but can still happen.
So, emulate a failure where the get_resource_by_id method (an RPC-based call)
returns an error just after the neutron-ovs-agent restart:

[root@node1 neutron]# git diff
diff --git a/neutron/agent/rpc.py b/neutron/agent/rpc.py
index 9a133afb07..299eb25981 100644
--- a/neutron/agent/rpc.py
+++ b/neutron/agent/rpc.py
@@ -327,6 +327,11 @@ class CacheBackedPluginApi(PluginApi):

 def get_device_details(self, context, device, agent_id, host=None,
agent_restarted=False):
+import time
+if not hasattr(self, '_stime'):
+self._stime = time.time()
+if self._stime + 5 > time.time():
+raise Exception('Emulate RPC error in get_resource_by_id call')
 port_obj = self.remote_resource_cache.get_resource_by_id(
 resources.PORT, device, agent_restarted)
 if not port_obj:


Restart the neutron-openvswitch-agent and try to ping after 1-2 minutes:

[root@node1 ~]# systemctl restart devstack@q-agt

[root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping 
-c 2 192.168.1.234
PING 192.168.1.234 (192.168.1.234) 56(84) bytes of data.

--- 192.168.1.234 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1058ms

[root@node0 ~]#

The ping does not work.
Just after the neutron-ovs-agent restart, once RPC starts working correctly,
the logs show:

Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None 
req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Assigning 2 as local vlan 
for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466
Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO 
neutron.agent.securitygroups_rpc [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 
None None] Preparing filters for devices 
{'40d82f69-274f-4de5-84d9-6290159f288b'}
Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO 
neutron.agent.linux.openvswitch_firewall.firewall [None 
req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Initializing port 
40d82f69-274f-4de5-84d9-6290159f288b that was already initialized.

So, `Assigning 2 as local vlan` followed by `Initializing port ... that
was already initialized.`

2) Using pyrasite, the eventlet backdoor was set up, and in the
OVSFirewallDriver's internal structures the port's `vlan_tag` is still `1`
instead of `2`:

>>> import gc
>>> from neutron.agent.linux.openvswitch_firewall.firewall import 
>>> OVSFirewallDriver
>>> for ob in gc.get_objects():
... if isinstance(ob, OVSFirewallDriver):
... break
...
>>> ob.sg_port_map.ports['40d82f69-274f-4de5-84d9-6290159f288b'].vlan_tag
1
>>>
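
A minimal sketch of why the stale tag matters (hypothetical values and helper,
not the agent's actual data structures): the firewall driver installs its
accept flows based on the cached vlan_tag, so those flows keep matching the old
tag while the port's traffic on br-int is now tagged with the new local VLAN.

cached_vlan_tag = 1      # what OVSFirewallDriver still holds for the port
current_local_vlan = 2   # what the agent re-assigned for the network after the restart

def firewall_rule_matches(packet_vlan):
    # the accept flows were built from the cached tag
    return packet_vlan == cached_vlan_tag

print(firewall_rule_matches(current_local_vlan))   # False -> the VM's traffic is dropped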

So, the OVSFirewallDriver still thinks that 

[Yahoo-eng-team] [Bug 1995872] [NEW] A stuck INACTIVE port binding causes wrong l2pop fdb entries to be sent

2022-11-07 Thread Anton Kurbatov
Public bug reported:

We are testing the network availability of VMs during HA events.
We ran into a problem where aborting the live migration of a VM can later break
communication with that VM at the OVS rules level.
The wrong OVS rules are caused by a stuck INACTIVE port binding in the neutron
`ml2_port_bindings` table.

Steps to reproduce:
Install a 3-node cluster via devstack from the `master` branch.
The mechanism driver is `openvswitch` with `l2population` enabled:

[root@node0 ~]# grep -r l2population /etc/neutron/*
/etc/neutron/plugins/ml2/ml2_conf.ini:mechanism_drivers = 
openvswitch,l2population
[root@node0 ~]#

0) preparation:
- create a vxlan based internal network,
- start 3 VMs, one per node: vm0 -> node0, vm1 -> node1, vm2 -> node2

[root@node0 ~]# for i in {0..2}; do openstack server create vm$i --network 
vxlan-net --flavor m1.tiny --image cirros-0.5.2-x86_64-disk; done
[root@node0 ~]# for i in {0..2}; do openstack server migrate vm$i --host node$i 
--live-migration; done

1) abort the `vm1` live migration from node1 -> node0
[root@node0 ~]# openstack server migrate vm1 --host node0 --live-migration; 
sleep 1; ssh root@node1 systemctl stop devstack@n-cpu.service
[root@node0 ~]# openstack server list
+--+--+---+-+--+-+
| ID   | Name | Status| Networks
| Image| Flavor  |
+--+--+---+-+--+-+
| 56ec7007-5470-42df-863e-8ae7d6a0110f | vm1  | MIGRATING | 
vxlan-net=192.168.0.169 | cirros-0.5.2-x86_64-disk | m1.tiny |
| 5bc93710-8da8-4b12-b1f0-767cf1768d27 | vm2  | ACTIVE| 
vxlan-net=192.168.0.82  | cirros-0.5.2-x86_64-disk | m1.tiny |
| 6f93f40f-0065-413c-81e6-724a21b3756b | vm0  | ACTIVE| 
vxlan-net=192.168.0.135 | cirros-0.5.2-x86_64-disk | m1.tiny |
+--+--+---+-+--+-+
[root@node0 ~]# ssh root@node1 systemctl start devstack@n-cpu.service
[root@node0 ~]# openstack server list
+--+--++-+--+-+
| ID   | Name | Status | Networks   
 | Image| Flavor  |
+--+--++-+--+-+
| 56ec7007-5470-42df-863e-8ae7d6a0110f | vm1  | ACTIVE | 
vxlan-net=192.168.0.169 | cirros-0.5.2-x86_64-disk | m1.tiny |
| 5bc93710-8da8-4b12-b1f0-767cf1768d27 | vm2  | ACTIVE | vxlan-net=192.168.0.82 
 | cirros-0.5.2-x86_64-disk | m1.tiny |
| 6f93f40f-0065-413c-81e6-724a21b3756b | vm0  | ACTIVE | 
vxlan-net=192.168.0.135 | cirros-0.5.2-x86_64-disk | m1.tiny |
+--+--++-+--+-+
[root@node0 ~]#

The VM failed to migrate and is still on node1:
[root@node0 ~]# openstack server show vm1 -c OS-EXT-SRV-ATTR:host
+--+-+
| Field| Value   |
+--+-+
| OS-EXT-SRV-ATTR:host | node1   |
+--+-+
[root@node0 ~]# ssh node1 virsh list
 Id   NameState
---
 3instance-0009   running

[root@node0 ~]#

Now there are two port bindings, ACTIVE and INACTIVE, for the `vm1` port:

MariaDB [neutron]> select port_id,host,vif_type,profile from ml2_port_bindings 
where port_id='3be55a45-83c6-42b7-82fc-fb6c4855f255';
+--+-+--+-+
| port_id  | host| vif_type | profile   
  |
+--+-+--+-+
| 3be55a45-83c6-42b7-82fc-fb6c4855f255 | node0   | ovs  | 
{"os_vif_delegation": true} |
| 3be55a45-83c6-42b7-82fc-fb6c4855f255 | node1   | ovs  | {"migrating_to": 
"node0"}   |
+--+-+--+-+

2) Restart the neutron-openvswitch-agent on node2, which forces neutron-server
to repopulate the neighbor fdb entries:

[root@node0 ~]# ssh node2 systemctl restart devstack@q-agt.service

Now a ping from vm2 to vm1 does not work:

[root@node0 ~]# ip netns exec qdhcp-f0f8f0b6-3cd3-4ae5-b5cf-25f2834bcdb2 ssh 
cirros@192.168.0.82
sign_and_send_pubkey: no mutual signature supported
cirros@192.168.0.82's password:
$ ping 192.168.0.169
PING 192.168.0.169 (192.168.0.169): 56 data bytes
^C
--- 192.168.0.169 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
$

This is because the br-tun rules on node2 send traffic for `vm1` to
node0 and not to node1 where VM is 

[Yahoo-eng-team] [Bug 1990561] [NEW] Network filtering by provider attributes has a race condition with network removal

2022-09-22 Thread Anton Kurbatov
Public bug reported:

I ran into a problem where the list of networks filtered by segment ID does
not match the expected list.
A key condition is the parallel removal of another network.

Here is a demo:

Console 1:
$ while :; do openstack network create test-net --provider-segment 200 
--provider-network-type vxlan >/dev/null; openstack network delete test-net; 
done

Console 2:
$ for i in {0..1000}; do net=$(openstack network list --provider-segment 100); 
[[ -n "${net}" ]] && echo "${net}" && echo "Iter=$i" && break; done
+--+--+-+
| ID   | Name | Subnets |
+--+--+-+
| 64ccd339-c669-4b8b-9d11-758e98295955 | test-net | |
+--+--+-+
Iter=81
$


A log file has a message:

2022-09-22 20:13:15.706 25 DEBUG neutron.plugins.ml2.managers [None
req-4c379e00-4794-4625-afe7-64643aa801cf
4f5e975fb1044192a4930fd01ca7d9d7 1958e62e718f468299ae302a12364c08 -
default default] Network 64ccd339-c669-4b8b-9d11-758e98295955 has no
segments extend_network_with_provider_segments
/usr/lib64/python3.6/site-packages/neutron/plugins/ml2/managers.py:169


So, it looks like there is a race condition.
OS version: Xena
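
To illustrate one plausible way this could happen (a hypothetical sketch, not
the actual neutron code): the provider attributes are attached to the network
dicts after the DB query, so a network whose segments were deleted mid-request
has nothing left to compare against the requested segment ID and is not
excluded by the post-query filter.

def filter_by_segment(networks, wanted_segment):
    # hypothetical post-query filter over the extended network dicts
    kept = []
    for net in networks:
        segments = net.get('segments') or []
        if not segments:
            # the 'has no segments' case from the log above: nothing to
            # compare, so the network slips through the filter
            kept.append(net)
            continue
        if any(s.get('segmentation_id') == wanted_segment for s in segments):
            kept.append(net)
    return kept

nets = [{'id': '64ccd339-c669-4b8b-9d11-758e98295955', 'segments': []}]  # deleted in parallel
print(filter_by_segment(nets, 100))   # the network is unexpectedly returned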

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1990561

Title:
  Network filtering by provider attributes has a race condition with
  network removal

Status in neutron:
  New

Bug description:
  I ran into a problem where the list of networks filtered by segment ID does
not match the expected list.
  A key condition is the parallel removal of another network.

  Here is a demo:

  Console 1:
  $ while :; do openstack network create test-net --provider-segment 200 
--provider-network-type vxlan >/dev/null; openstack network delete test-net; 
done

  Console 2:
  $ for i in {0..1000}; do net=$(openstack network list --provider-segment 
100); [[ -n "${net}" ]] && echo "${net}" && echo "Iter=$i" && break; done
  +--+--+-+
  | ID   | Name | Subnets |
  +--+--+-+
  | 64ccd339-c669-4b8b-9d11-758e98295955 | test-net | |
  +--+--+-+
  Iter=81
  $

  
  A log file has a message:

  2022-09-22 20:13:15.706 25 DEBUG neutron.plugins.ml2.managers [None
  req-4c379e00-4794-4625-afe7-64643aa801cf
  4f5e975fb1044192a4930fd01ca7d9d7 1958e62e718f468299ae302a12364c08 -
  default default] Network 64ccd339-c669-4b8b-9d11-758e98295955 has no
  segments extend_network_with_provider_segments
  /usr/lib64/python3.6/site-packages/neutron/plugins/ml2/managers.py:169

  
  So, it looks like there is a race condition.
  OS version: Xena

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1990561/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1967142] [NEW] No way to set quotas for neutron-vpnaas resources using openstack CLI tool

2022-03-30 Thread Anton Kurbatov
Public bug reported:

I can't find a way to set up VPN quotas using the CLI tools: neither the
openstack CLI nor the deprecated neutron CLI has this feature.

I can only update VPN quotas using a direct API request (e.g. via curl),
and I can only list VPN quotas using the neutron CLI tool.

[root@node4578 ~]# curl -ks -H "x-auth-token: $token" -X PUT 
https://192.168.1.10:9696/v2.0/quotas/e28d46f9ce084b21a163f72ce1a49adf -d 
'{"quota": {"ipsec_site_connection": 5}}'
{"quota": {"subnet": -1, "ikepolicy": -1, "subnetpool": -1, "network": -1, 
"ipsec_site_connection": 5, "endpoint_group": -1, "ipsecpolicy": -1, 
"security_group_device": -1, "security_group_rule": -1, "vpnservice": -1, 
"floatingip": -1, "security_group": -1, "router": -1, "rbac_policy": -1, 
"port": -1}}
[root@node4578 ~]#
[root@node4578 ~]# neutron quota-show e28d46f9ce084b21a163f72ce1a49adf
neutron CLI is deprecated and will be removed in the future. Use openstack CLI 
instead.
+---+---+
| Field | Value |
+---+---+
| endpoint_group| -1|
| floatingip| -1|
| ikepolicy | -1|
| ipsec_site_connection | 5 |
| ipsecpolicy   | -1|
| network   | -1|
| port  | -1|
| rbac_policy   | -1|
| router| -1|
| security_group| -1|
| security_group_device | -1|
| security_group_rule   | -1|
| subnet| -1|
| subnetpool| -1|
| vpnservice| -1|
+---+---+
[root@node4578 ~]# openstack quota list --network --detail --project 
e28d46f9ce084b21a163f72ce1a49adf
+--++--+---+
| Resource | In Use | Reserved | Limit |
+--++--+---+
| subnets  |  0 |0 |-1 |
| routers  |  0 |0 |-1 |
| security_group_rules |  0 |0 |-1 |
| subnet_pools |  0 |0 |-1 |
| security_groups  |  0 |0 |-1 |
| rbac_policies|  0 |0 |-1 |
| floating_ips |  0 |0 |-1 |
| networks |  0 |0 |-1 |
| ports|  0 |0 |-1 |
+--++--+---+
[root@node4578 ~]#
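
For completeness, the same PUT request can be scripted; a minimal sketch using
python-requests (the token, endpoint and project ID are the same placeholders
as in the curl call above):

import requests

token = '<x-auth-token>'   # same token as $token in the curl example
url = 'https://192.168.1.10:9696/v2.0/quotas/e28d46f9ce084b21a163f72ce1a49adf'

resp = requests.put(
    url,
    headers={'x-auth-token': token},
    json={'quota': {'ipsec_site_connection': 5}},
    verify=False,  # the curl call above also skips certificate verification (-k)
)
print(resp.json()['quota']['ipsec_site_connection'])   # 5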

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1967142

Title:
  No way to set quotas for neutron-vpnaas resources using openstack CLI
  tool

Status in neutron:
  New

Bug description:
  I can't find a way to set up VPN quotas using the CLI tools: neither
  the openstack CLI nor the deprecated neutron CLI has this feature.

  I can only update VPN quotas using a direct API request (e.g. via curl),
  and I can only list VPN quotas using the neutron CLI tool.

  [root@node4578 ~]# curl -ks -H "x-auth-token: $token" -X PUT 
https://192.168.1.10:9696/v2.0/quotas/e28d46f9ce084b21a163f72ce1a49adf -d 
'{"quota": {"ipsec_site_connection": 5}}'
  {"quota": {"subnet": -1, "ikepolicy": -1, "subnetpool": -1, "network": -1, 
"ipsec_site_connection": 5, "endpoint_group": -1, "ipsecpolicy": -1, 
"security_group_device": -1, "security_group_rule": -1, "vpnservice": -1, 
"floatingip": -1, "security_group": -1, "router": -1, "rbac_policy": -1, 
"port": -1}}
  [root@node4578 ~]#
  [root@node4578 ~]# neutron quota-show e28d46f9ce084b21a163f72ce1a49adf
  neutron CLI is deprecated and will be removed in the future. Use openstack 
CLI instead.
  +---+---+
  | Field | Value |
  +---+---+
  | endpoint_group| -1|
  | floatingip| -1|
  | ikepolicy | -1|
  | ipsec_site_connection | 5 |
  | ipsecpolicy   | -1|
  | network   | -1|
  | port  | -1|
  | rbac_policy   | -1|
  | router| -1|
  | security_group| -1|
  | security_group_device | -1|
  | security_group_rule   | -1|
  | subnet| -1|
  | subnetpool| -1|
  | vpnservice| -1|
  +---+---+
  [root@node4578 ~]# openstack quota list --network --detail --project 
e28d46f9ce084b21a163f72ce1a49adf
  +--++--+---+
  | Resource | In Use | Reserved | Limit |
  +--++--+---+
  | subnets  |  0 |0 |-1 |
  | routers  |  0 |0 |-1 |
  | security_group_rules |  0 |0 |-1 |
  | subnet_pools |  0 |0 |-1 |
  | security_groups  |  0 |0 |-1 |
  | rbac_policies|  0 |0 |-1 |
 

[Yahoo-eng-team] [Bug 1959697] [NEW] VM gets wrong ipv6 address from dhcp-agent after ipv6 address on port was changed

2022-02-01 Thread Anton Kurbatov
Public bug reported:

I ran into a problem where the neutron dhcp-agent still positively replies to
the client's confirmation of the old address.
Simple steps to reproduce:
- create a port with IPv6 address in dhcpv6-stateful subnet
- create a VM with cloud-init inside
- change the IPv6 port address
- reboot the VM

Here are my commands:

$ openstack subnet create --subnet-range 2001:db8:123::/64 --ip-version 6 
--ipv6-address-mode dhcpv6-stateful --network public subv6
$ openstack subnet list --network public
+--+---+--+---+
| ID   | Name  | Network
  | Subnet|
+--+---+--+---+
| 6d9a7fb5-5c1b-4759-b32b-5720b5cedbf4 | subv4 | 
f1f3d967-26db-41b3-b6f6-1d5356e33a84 | 10.136.16.0/22|
| 76db898c-6a7a-4301-9253-23241cafaa83 | subv6 | 
f1f3d967-26db-41b3-b6f6-1d5356e33a84 | 2001:db8:123::/64 |
+--+---+--+---+
$

$ openstack port create my-port  --network public --fixed-ip 
ip-address=10.136.17.163 --fixed-ip ip-address=2001:db8:123::111
$ openstack server create test --flavor m1.small --port my-port --image 
CentOS-7-x86_64-GenericCloud-2009.qcow2 --key-name key --use-config-drive

Check IPv6 address inside VM (it's correct):

[centos@test ~]$ ip a s eth0
2: eth0:  mtu 1500 qdisc pfifo_fast state UP 
group default qlen 1000
link/ether fa:16:3e:2e:66:ac brd ff:ff:ff:ff:ff:ff
inet 10.136.17.163/22 brd 10.136.19.255 scope global dynamic eth0
   valid_lft 86371sec preferred_lft 86371sec
inet6 2001:db8:123::111/128 scope global dynamic
   valid_lft 7473sec preferred_lft 7173sec
inet6 fe80::f816:3eff:fe2e:66ac/64 scope link
   valid_lft forever preferred_lft forever
[centos@test ~]$

Change IPv6 address and reboot the VM:
$ openstack port set my-port --no-fixed-ip --fixed-ip ip-address=10.136.17.163 
--fixed-ip ip-address=2001:db8:123::222
$ openstack server reboot test

[centos@test ~]$ ip a s eth0
2: eth0:  mtu 1500 qdisc pfifo_fast state UP 
group default qlen 1000
link/ether fa:16:3e:2e:66:ac brd ff:ff:ff:ff:ff:ff
inet 10.136.17.163/22 brd 10.136.19.255 scope global dynamic eth0
   valid_lft 86382sec preferred_lft 86382sec
inet6 2001:db8:123::111/128 scope global dynamic
   valid_lft 7482sec preferred_lft 7182sec
inet6 fe80::f816:3eff:fe2e:66ac/64 scope link
   valid_lft forever preferred_lft forever
[centos@test ~]$

^^ As you can see, the VM got the old IPv6 address, and all of its traffic is
then blocked by the port-security feature. If I remove the lease file and
re-spawn dhclient, everything is fine:

[centos@test ~]$ ps axf | grep dhcl
  780 ?Ss 0:00 /sbin/dhclient -1 -q -lf 
/var/lib/dhclient/dhclient--eth0.lease -pf /var/run/dhclient-eth0.pid -H test 
eth0
  868 ?Ss 0:00 /sbin/dhclient -6 -1 -lf 
/var/lib/dhclient/dhclient6--eth0.lease -pf /var/run/dhclient6-eth0.pid eth0 -H 
test
 1371 pts/0S+ 0:00  \_ grep --color=auto dhcl
[centos@test ~]$ sudo kill -9 868
[centos@test ~]$ sudo ip addr del 2001:db8:123::111/128 dev eth0
[centos@test ~]$ sudo rm -rf /var/lib/dhclient/dhclient6--eth0.lease
[centos@test ~]$ sudo /sbin/dhclient -6 -1 -lf 
/var/lib/dhclient/dhclient6--eth0.lease -pf /var/run/dhclient6-eth0.pid eth0 -H 
test
[centos@test ~]$ ip a s eth0
2: eth0:  mtu 1500 qdisc pfifo_fast state UP 
group default qlen 1000
link/ether fa:16:3e:2e:66:ac brd ff:ff:ff:ff:ff:ff
inet 10.136.17.163/22 brd 10.136.19.255 scope global dynamic eth0
   valid_lft 86319sec preferred_lft 86319sec
inet6 2001:db8:123::222/128 scope global dynamic
   valid_lft 7481sec preferred_lft 7181sec
inet6 fe80::f816:3eff:fe2e:66ac/64 scope link
   valid_lft forever preferred_lft forever
[centos@test ~]$

I found some logic here that removes dhcpv6 leases:
https://opendev.org/openstack/neutron/src/commit/e7b70521d0e230143a80974e7e4795a2acafcc9b/neutron/agent/linux/dhcp.py#L600
but it does not seem to help in the case of a DHCPCONFIRM client request.
In the dnsmasq logs I see the following DHCPCONFIRM->DHCPREPLY message exchange
after the VM comes back from the reboot (see also
https://datatracker.ietf.org/doc/html/rfc3315#page-50):

Feb  1 16:49:12 dnsmasq-dhcp[1360521]: DHCPREQUEST(tapc233cb5c-8f) 
10.136.17.163 fa:16:3e:2e:66:ac
Feb  1 16:49:12 dnsmasq-dhcp[1360521]: DHCPACK(tapc233cb5c-8f) 10.136.17.163 
fa:16:3e:2e:66:ac host-10-136-17-163
Feb  1 16:49:15 dnsmasq-dhcp[1360521]: DHCPCONFIRM(tapc233cb5c-8f) 
00:01:00:01:29:8c:20:5e:fa:16:3e:2e:66:ac
Feb  1 16:49:15 dnsmasq-dhcp[1360521]: DHCPREPLY(tapc233cb5c-8f) 
2001:db8:123::111 00:01:00:01:29:8c:20:5e:fa:16:3e:2e:66:ac 
host-2001-db8-123--222

** Affects: neutron
 Importance: Undecided
 Status: New

** Description changed:

  I run into a 

[Yahoo-eng-team] [Bug 1958643] [NEW] Unicast RA messages for a VM are filtered out by ovs rules

2022-01-21 Thread Anton Kurbatov
Public bug reported:

I ran into a problem where unicast RA messages are not accepted by the
openflow rules.
In my configuration I'm using the radvd daemon to send RA messages in my IPv6
network.
Here is the radvd config with the `clients` directive to turn off multicast
messages:

[root@radvd ~]# cat /etc/radvd.conf
interface br-eth0
{
AdvSendAdvert on;
MinRtrAdvInterval 3;
MaxRtrAdvInterval 5;
prefix 2001:db8:123::/64
{
AdvOnLink on;
AdvAutonomous on;
AdvRouterAddr off;
};
clients
{
fe80::f816:3eff:fed7:358a;
};
};
[root@radvd ~]#

I use a devstack installation with Neutron from the master branch.
I've created a virtual flat network with a dual stack: IPv4 and IPv6 subnets;
the IPv6 subnet uses the SLAAC address mode.
Then I created a VM to test IPv6 address assignment inside it, but the RA
messages do not reach the VM.
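
For what it's worth, the unicast delivery can also be probed without radvd; a
minimal scapy sketch (run on the radvd host; the interface name is taken from
the radvd config, and the VM MAC fa:16:3e:d7:35:8a is an assumption derived
from the link-local address above):

# Hypothetical test helper, not part of the original report: send a unicast RA
# toward the VM's link-local address and check whether it arrives in the VM.
from scapy.all import Ether, IPv6, ICMPv6ND_RA, ICMPv6NDOptPrefixInfo, sendp

vm_mac = 'fa:16:3e:d7:35:8a'                 # assumed from fe80::f816:3eff:fed7:358a
ra = (Ether(dst=vm_mac) /
      IPv6(src='fe80::1',                    # placeholder router link-local address
           dst='fe80::f816:3eff:fed7:358a') /
      ICMPv6ND_RA() /
      ICMPv6NDOptPrefixInfo(prefix='2001:db8:123::', prefixlen=64))
sendp(ra, iface='br-eth0')                   # interface from the radvd config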

VM/port/security group rules:

[root@devstack ~]# openstack server list
+--+--++--+-+--+
| ID   | Name | Status | Networks   
  | Image   | 
Flavor   |
+--+--++--+-+--+
| 332942be-0869-403f-9aba-386f88b9bc9d | test | ACTIVE | public=10.136.17.163, 
2001:db8:123:0:f816:3eff:fed7:358a | CentOS-7-x86_64-GenericCloud-2009.qcow2 | 
m1.small |
+--+--++--+-+--+
[root@devstack ~]#
[root@devstack ~]# openstack port show 664489d1-f15f-4990-99eb-b53ad21f673a
+-++
| Field   | Value   

   |
+-++
| admin_state_up  | UP  

   |
| allowed_address_pairs   | 

   |
| binding_host_id | devstack

   |
| binding_profile | 

   |
| binding_vif_details | bridge_name='br-int', connectivity='l2', 
datapath_type='system', ovs_hybrid_plug='False', port_filter='True' 
  |
| binding_vif_type| ovs 

   |
| binding_vnic_type   | normal  

   |
| created_at  | 2022-01-21T11:32:19Z

   |
| data_plane_status   | None

   |
| description | 

   |
| device_id   | 332942be-0869-403f-9aba-386f88b9bc9d

   |
| device_owner| compute:nova 

[Yahoo-eng-team] [Bug 1938191] [NEW] L3 agent fails to process a DVR router external network change

2021-07-27 Thread Anton Kurbatov
Public bug reported:

I ran into a problem where the L3 agent fails to process an external network
change on a router and hits the retry limit.
I'm using a devstack deployment over the master branch.

* Pre-conditions:
L3 agent in DVR mode
mechanism driver is openvswitch

* Step-by-step reproduction steps:
- create two external networks and three internal
- create three routers and add the corresponding internal networks
- connect external networks to the routers (according to the scheme: net1->r1, 
net2->r2, net1->r3)
- switch the external network of the r3 router from net1 to net2

Here are the CLI commands:

openstack network create phys-net1 --external
openstack network create phys-net2 --external
openstack network create priv-net1
openstack network create priv-net2
openstack network create priv-net3

openstack subnet create --network phys-net1 --subnet-range 192.168.1.0/24 
phys-sub1
openstack subnet create --network phys-net2 --subnet-range 192.168.2.0/24 
phys-sub2
openstack subnet create --network priv-net1 --subnet-range 192.168.10.0/24 
priv-sub1
openstack subnet create --network priv-net2 --subnet-range 192.168.20.0/24 
priv-sub2
openstack subnet create --network priv-net3 --subnet-range 192.168.30.0/24 
priv-sub3

openstack router create r1
openstack router create r2
openstack router create r3

openstack router add subnet r1 priv-sub1
openstack router add subnet r2 priv-sub2
openstack router add subnet r3 priv-sub3

openstack router set r1 --external-gateway phys-net1
openstack router set r2 --external-gateway phys-net2
openstack router set r3 --external-gateway phys-net1

# Switch r3 external network from phys-net1 to phys-net2:
openstack router set r3 --external-gateway phys-net2

After the switch, the L3 agent logs show unsuccessful attempts to process the
change and the message
(see the router processing logs below):
'Hit retry limit with router update for , action 3'

The state of resources and net devices:

[root@devstack ~]# openstack router list
+--+--++---+--+-+---+
| ID   | Name | Status | State | Project
  | Distributed | HA|
+--+--++---+--+-+---+
| 6cb4a81f-9b5a-4f98-9ef2-705b369d4240 | r2   | ACTIVE | UP| 
f3f8c288836f47ca930e13620f27a8c8 | True| False |
| 9e15faf3-8478-4b2a-83f1-ad2cc8cd9de4 | r3   | ACTIVE | UP| 
f3f8c288836f47ca930e13620f27a8c8 | True| False |
| c37e75aa-4bc1-4d56-95a1-3045d8817c26 | r1   | ACTIVE | UP| 
f3f8c288836f47ca930e13620f27a8c8 | True| False |
+--+--++---+--+-+---+
[root@devstack ~]# openstack network list
+--+---+--+
| ID   | Name  | Subnets
  |
+--+---+--+
| 34cf22a5-8368-4935-a5a6-47bf2763d6a1 | priv-net2 | 
2f067140-d6a8-4341-ac53-aef48be15877 |
| 86f5bceb-a945-48c0-ad50-ae3e395fd21f | phys-net1 | 
d03016ee-5724-47ea-891c-018cdd8338f1 |
| 8bbaff79-4e40-4341-b48d-76b8a62f80cd | priv-net1 | 
ef7dca63-29f8-4483-af7f-8ab9661232f2 |
| a3704615-3e3e-4a03-a425-5851a381e702 | phys-net2 | 
647ed571-c6ee-4f7f-8ecf-8a78b5f0b534 |
| f142ca45-9cce-4619-9964-ad68b64aa0a2 | priv-net3 | 
e386cfdd-d52c-4830-a90b-bdc5cb656ad7 |
+--+---+--+
[root@devstack ~]# openstack router show r3 -c external_gateway_info
+---+--+
| Field | Value 

   |
+---+--+
| external_gateway_info | {"network_id": 
"a3704615-3e3e-4a03-a425-5851a381e702", "external_fixed_ips": [{"subnet_id": 
"647ed571-c6ee-4f7f-8ecf-8a78b5f0b534", "ip_address": "192.168.2.42"}], 
"enable_snat": true} |
+---+--+
[root@devstack ~]#

[root@devstack ~]# ip netns
snat-9e15faf3-8478-4b2a-83f1-ad2cc8cd9de4 (id: 12)

[Yahoo-eng-team] [Bug 1929438] [NEW] Cannot provision flat network after reconfiguring physical bridges

2021-05-24 Thread Anton Kurbatov
Public bug reported:

I ran into a problem where the network inside a newly created VM does not
work.


* Pre-conditions:
- the neutron ovs agent has not yet seen any ports from the VM network;
- any other bridge (except for the network in which the VM is created) is 
recreated on the node.


* Step-by-step reproduction steps:

The bridge mapping from ml2_conf.ini looks like:
[ovs]
bridge_mappings = Public:br-eth0,Test:br-test


The 'Test:br-test' mapping is a test bridge used to demonstrate the problem;
I created it with the ovs-vsctl tool: ovs-vsctl add-br br-test.

1) Recreate this test bridge that triggers
_reconfigure_physical_bridges:

[root@sqvm2-2009 ~]# ovs-vsctl del-br br-test; ovs-vsctl add-br br-test
[root@sqvm2-2009 ~]#


2) Create the first VM from the 'public' network that is mapped to the 'Public' 
bridge and try to ping it:
---
[root@sqvm2-2009 ~]# openstack server create test-vm --image cirros  --flavor 
100 --network public --boot-from-volume 1
[root@sqvm2-2009 ~]# openstack server list
+--+-++-+--++
| ID   | Name| Status | Networks
| Image| Flavor |
+--+-++-+--++
| 68c32b4d-8f90-4ced-8ca4-67a9e4ff255b | test-vm | ACTIVE | public=10.34.111.12 
| N/A (booted from volume) | tiny   |
+--+-++-+--++
[root@sqvm2-2009 ~]# virsh console 68c32b4d-8f90-4ced-8ca4-67a9e4ff255b
Connected to domain instance-0005
Escape character is ^]

login as 'cirros' user. default password: 'gocubsgo'. use 'sudo' for root.
cirros login: cirros
Password:
$ sudo ip addr add 10.34.111.12/18 dev eth0
$ ip a s eth0
2: eth0:  mtu 1500 qdisc pfifo_fast qlen 1000
link/ether fa:16:3e:67:f8:4e brd ff:ff:ff:ff:ff:ff
inet 10.34.111.12/18 scope global eth0
   valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe67:f84e/64 scope link
   valid_lft forever preferred_lft forever
$
[root@sqvm2-2009 ~]# ping 10.34.111.12
PING 10.34.111.12 (10.34.111.12) 56(84) bytes of data.
From 10.34.66.138 icmp_seq=1 Destination Host Unreachable
From 10.34.66.138 icmp_seq=2 Destination Host Unreachable
From 10.34.66.138 icmp_seq=3 Destination Host Unreachable
From 10.34.66.138 icmp_seq=4 Destination Host Unreachable
^C
--- 10.34.111.12 ping statistics ---
5 packets transmitted, 0 received, +4 errors, 100% packet loss, time 4000ms
pipe 4
[root@sqvm2-2009 ~]#
---


* Actual result:
The VM is not pingable, but it should be.
During port processing in the neutron-openvswitch-agent rpc_loop, one can see
the following logs:

2021-05-24 17:28:29.776 13744 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-7f3667e0-56a5-4830-8376-10577a2ee167 - - - - -] Port 
c010955a-4782-418d-a612-0bfbd66b3c09 updated. Details: {'device': 
'c010955a-4782-418d-a612-0bfbd66b3c09', 'device_id': 
'68c32b4d-8f90-4ced-8ca4-67a9e4ff255b', 'network_id': 
'568fb8ce-8f1b-456e-8a31-330ef19f2f5c', 'port_id': 
'c010955a-4782-418d-a612-0bfbd66b3c09', 'mac_address': 'fa:16:3e:67:f8:4e', 
'admin_state_up': True, 'network_type': 'flat', 'segmentation_id': None, 
'physical_network': 'Public', 'fixed_ips': [{'subnet_id': 
'b6f963e3-ad77-4bde-8431-049f87871422', 'ip_address': '10.34.111.12'}], 
'device_owner': 'compute:nova', 'allowed_address_pairs': [], 
'port_security_enabled': True, 'qos_policy_id': None, 'network_qos_policy_id': 
None, 'profile': {}, 'vif_type': 'ovs', 'vnic_type': 'normal', 
'security_groups': ['09948793-2e11-4d89-ad1f-0c0d0eef80f0'], 'migrating_to': 
None}
2021-05-24 17:28:29.776 13744 INFO 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-7f3667e0-56a5-4830-8376-10577a2ee167 - - - - -] Assigning 2 as local vlan 
for net-id=568fb8ce-8f1b-456e-8a31-330ef19f2f5c
2021-05-24 17:28:29.777 13744 ERROR 
neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 
[req-7f3667e0-56a5-4830-8376-10577a2ee167 - - - - -] Cannot provision flat 
network for net-id=568fb8ce-8f1b-456e-8a31-330ef19f2f5c - no bridge for 
physical_network Public


* Version:
Stein release.
The issue is also reproducible on the master branch.


* Attachments:
Full neutron-openvswitch-agent service logs attached

** Affects: neutron
 Importance: Undecided
 Status: New

** Attachment added: "neutron-openvswitch-agent.log"
   
https://bugs.launchpad.net/bugs/1929438/+attachment/5499913/+files/neutron-openvswitch-agent.log

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1929438

Title:
  Cannot provision flat network after reconfiguring physical bridges

Status in neutron:
  New

Bug description:
  I ran into a problem when the network inside the 

[Yahoo-eng-team] [Bug 1808541] [NEW] Openflow entries are not totally removed for stopped VM

2018-12-14 Thread Anton Kurbatov
Public bug reported:

I am using the Queens release, and the VMs' tap interfaces are plugged into the
ovs br-int.
I'm seeing a case where openflow entries are not completely removed when I stop
my VM (name='my-vm').
It is only reproducible when there is some other activity on the node for
different VMs: in my case I attach a new network to another VM (name='vm-other')

ovs-agent logs are attached.

I managed to simulate the issue using the following steps:

1) grep the current openflow entries for my-vm by its MAC address:

# ovs-ofctl dump-flows br-int | grep fa:16:3e:ec:d3:45
 cookie=0xf4d7d970f5382f3d, duration=93.162s, table=60, n_packets=146, 
n_bytes=21001, idle_age=4, priority=90,dl_vlan=9,dl_dst=fa:16:3e:ec:d3:45 
actions=load:0xa3->NXM_NX_REG5[],load:0x9->NXM_NX_REG6[],strip_vlan,resubmit(,81)
 cookie=0xf4d7d970f5382f3d, duration=93.162s, table=71, n_packets=2, 
n_bytes=84, idle_age=4, 
priority=95,arp,reg5=0xa3,in_port=163,dl_src=fa:16:3e:ec:d3:45,arp_spa=10.94.152.212
 actions=NORMAL
 cookie=0xf4d7d970f5382f3d, duration=93.162s, table=71, n_packets=28, 
n_bytes=2448, idle_age=9, 
priority=65,ip,reg5=0xa3,in_port=163,dl_src=fa:16:3e:ec:d3:45,nw_src=10.94.152.212
 actions=ct(table=72,zone=NXM_NX_REG6[0..15])
 cookie=0xf4d7d970f5382f3d, duration=93.162s, table=71, n_packets=0, n_bytes=0, 
idle_age=93, 
priority=65,ipv6,reg5=0xa3,in_port=163,dl_src=fa:16:3e:ec:d3:45,ipv6_src=fe80::f816:3eff:feec:d345
 actions=ct(table=72,zone=NXM_NX_REG6[0..15])
 cookie=0xf4d7d970f5382f3d, duration=93.162s, table=73, n_packets=0, n_bytes=0, 
idle_age=3401, priority=100,reg6=0x9,dl_dst=fa:16:3e:ec:d3:45 
actions=load:0xa3->NXM_NX_REG5[],resubmit(,81)
#

2) # ps ax | grep libvirt
 4887 pts/6S+ 0:00 grep --color=auto libvirt
 3934 ?Sl 0:18 /usr/libexec/qemu-kvm -name 
guest=instance-0012,debug-threads=on -S -object 
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-131-instance-0012/master-key.aes
 -machine pc-i440fx-vz7.8.0,accel=kvm,usb=off,dump-guest-core=off -cpu 
Westmere-IBRS,vme=on,ss=on,pcid=on,x2apic=on,tsc-deadline=on,hypervisor=on,arat=on,tsc_adjust=on,ssbd=on,stibp=on,pdpe1gb=on,rdtscp=on,aes=off,+kvmclock
 -m 512 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -numa 
node,nodeid=0,cpus=0,mem=512 -uuid 89fccc31-96a6-47ce-abd1-e40fba7274e6 -smbios 
type=1,manufacturer=Virtuozzo Infrastructure Platform,product=OpenStack 
Compute,version=17.0.6-1.vl7,serial=71f55add-ef93-4ec2-a4dd-ab8098b6312d,uuid=89fccc31-96a6-47ce-abd1-e40fba7274e6,family=Virtual
 Machine -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-131-instance-0012/monitor.sock,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -global 
kvm-pit.lost_tick_policy=discard -no-shutdown -boot strict=on -device 
nec-usb-x,id=usb,bus=pci.0,addr=0x4 -device 
virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x5 -device 
virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive 
file=/mnt/vstorage/vols/datastores/cinder/volume-e30d1874-d68e-4578-bf89-aa599e8383c7/volume-e30d1874-d68e-4578-bf89-aa599e8383c7,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none,l2-cache-size=128M,discard=unmap,aio=native
 -device 
scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1,logical_block_size=512,physical_block_size=4096,serial=e30d1874-d68e-4578-bf89-aa599e8383c7
 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=30 -device 
virtio-net-pci,host_mtu=1500,netdev=hostnet0,id=net0,mac=fa:16:3e:ec:d3:45,bus=pci.0,addr=0x3
 -chardev 
pty,id=charserial0,logfile=/mnt/vstorage/vols/datastores/nova/instances/89fccc31-96a6-47ce-abd1-e40fba7274e6/console.log,logappend=on
 -device isa-serial,chardev=charserial0,id=serial0 -chardev 
socket,id=charchannel0,path=/var/lib/libvirt/qemu/org.qemu.guest_agent.0.instance-0012.sock,server,nowait
 -device 
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
 -chardev 
socket,id=charchannel1,path=/var/lib/libvirt/qemu/org.qemu.guest_agent.1.instance-0012.sock,server,nowait
 -device 
virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.1
 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 10.10.1.237:0 -device 
VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -device vmcoreinfo -device 
pvpanic,ioport=1285 -msg timestamp=on
24553 ?Ssl   11:44 /usr/sbin/libvirtd --listen

Note: "-netdev tap,fd=28", so, net device is passed to qemu as handle
and AFAIU tap interface is auto removed (by kernel) when qemu process
exits.

3) SIGSTOP libvirtd to emulate a delay in the port deletion that libvirtd
performs when the guest is stopped
(I believe libvirtd removes the port when the guest stops with something like
'ovs-vsctl --timeout=5 -- --if-exists del-port taped0487c9-23')

# kill -SIGSTOP 24553
#

4) Kill the guest:
# kill -9 3934
#

OVS agent logs right after killing the guest:

2018-12-14