Public bug reported:
## General description:
After an OpenStack upgrade from Zed (2022.2) to Antelope (2023.1), then
Antelope to Caracal (2024.1), we found ourselves with a few resources
stuck in bad states in Neutron. More specifically, these resources are
in an invalid state and can no longer be fixed through the usual APIs.
From what I found, there seem to be discrepancies between the Neutron
and OVN databases which the mechanism driver is unable to fix
automatically, and which in turn prevent the resources from being
updated by the neutron-server service.
## Context on the deployment & bug
OpenStack version: 2024.1
Control-plane deployment: on Kubernetes, using the community Helm charts
Compute deployment: initramfs-booted (no state is kept between boots,
except the compute_id, which is laid down by custom services at initialization)
It may be important to note that, due to our deployment method for the
compute nodes, our upgrade procedure for them is a simple reboot,
preceded by the required steps to disable the node in Nova. We currently
do not handle Neutron's agents, which might mean the operation is brutal
from Neutron/OVN's point of view, akin to an unexpected server crash.
## Traces and useful outputs:
Initially, a user reported being unable to update a router's routes
because the service answered with an HTTP 500. The neutron-server logs
showed nothing useful about the error apart from the access log line.
While going through the logs of all Neutron services, I ended up finding
a number of entries similar to the following (always the same KeyError,
on various router_ports) in the neutron-server logs:
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance [None
> req-0e6a2a8f-cdbe-40a5-8d27-a224690040c5 - - - - - -] Maintenance task:
> Failed to fix resource 9f036aca-78e1-4246-97f9-b98c5cd48011 (type:
> router_ports): KeyError: 'neutron:provnet-network-type'
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance Traceback (most
> recent call last):
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File
> "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py",
> line 377, in check_for_inconsistencies
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance
> self._fix_create_update(admin_context, row)
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File
> "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py",
> line 286, in _fix_create_update
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance
> res_map['ovn_update'](context, n_obj)
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File
> "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py",
> line 1878, in update_router_port
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance
> self._update_lrouter_port(context, port, if_exists=if_exists,
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File
> "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py",
> line 1863, in _update_lrouter_port
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance
> options=self._gen_router_port_options(port),
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File
> "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py",
> line 1701, in _gen_router_port_options
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance
> network_type = ls.external_ids[ovn_const.OVN_NETTYPE_EXT_ID_KEY]
> 2025-05-22 11:47:06.959 16 ERROR
> neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance KeyError:
> 'neutron:provnet-network-type'
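The failing lookup in the traceback boils down to an unconditional `[]` access on the logical switch's `external_ids`: any Logical_Switch row missing the `neutron:provnet-network-type` key makes `_gen_router_port_options` raise. A minimal sketch of the failure mode (the class and the fallback variant are simplifications of mine, not the actual neutron code):

```python
# Minimal reproduction of the failure mode seen in the traceback above.
# LogicalSwitch stands in for the ovsdbapp row object; names simplified.

OVN_NETTYPE_EXT_ID_KEY = 'neutron:provnet-network-type'

class LogicalSwitch:
    def __init__(self, external_ids):
        self.external_ids = external_ids

def gen_router_port_options(ls):
    # This is the pattern that fails: a plain [] lookup on external_ids.
    network_type = ls.external_ids[OVN_NETTYPE_EXT_ID_KEY]
    return {'network_type': network_type}

# A switch row that lost the key (or never had it set, e.g. created
# before an upgrade added it) has no 'neutron:provnet-network-type':
stale_switch = LogicalSwitch(external_ids={'neutron:network_name': 'public'})

try:
    gen_router_port_options(stale_switch)
except KeyError as exc:
    print(f'KeyError: {exc}')   # KeyError: 'neutron:provnet-network-type'

# A defensive variant would fall back to a default (hypothetical fix,
# shown only to illustrate where the behavior would differ):
def gen_router_port_options_safe(ls, default='geneve'):
    network_type = ls.external_ids.get(OVN_NETTYPE_EXT_ID_KEY, default)
    return {'network_type': network_type}

print(gen_router_port_options_safe(stale_switch))  # {'network_type': 'geneve'}
```

The `.get()` variant is only there to show where a defensive fallback would change the outcome; whether defaulting is the right fix is for the neutron developers to judge.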
I ended up finding that one of these errors was related to the resource
update issue I had been called about, by checking the northbound OVN database:
> switch fbb1b26a-b915-454e-81e8-6600e1a70811
> (neutron-31727984-5c9f-42fd-94f9-d0ccd98f19ba) (aka public)
> ...
> port 550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
> type: router
> router-port: lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
> ...
> router 91c7b5e2-ca25-445a-93f3-9b80c4a43d20
> (neutron-575d7f62-f79e-40e7-b496-d81c9f525b78) (aka prod)
> port lrp-526e9d00-ccbd-4548-843d-6e84634e50b7
> mac: "REDACTED"
> networks: ["REDACTED"]
> port lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
> mac: "REDACTED"
> networks: ["REDACTED"]
> gateway chassis: [COMPUTE1 COMPUTE2 COMPUTE3 COMPUTE4 COMPUTE5]
> nat 1c2d6bcd-bcf4-441c-9d3d-66cad4b5e6c3
> external ip: "REDACTED"
> logical ip: "REDACTED"
> type: "snat"
Here, one of the router's five gateway chassis (COMPUTE5) is a compute
node that has been unavailable since the upgrade.
I wondered if that could be a factor, but checking the southbound OVN
database for the port showed that COMPUTE5 was not the active chassis
for it:
> Chassis COMPUTE3
> hostname: COMPUTE3
> Encap geneve
> ip: "REDACTED"
> options: {csum="true"}
> Port_Binding cr-lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
All I can tell so far is that the Neutron router we are unable to update
is related to an OVN router_port that the neutron-server cannot
automatically fix.
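Since the maintenance task logged the same KeyError for several router_ports, it may help to enumerate which Logical_Switch rows are missing the key. A sketch of mine, assuming `ovn-nbctl` output is fed in (the helper names are hypothetical; OVSDB's JSON output serializes maps as `["map", [[key, value], ...]]`):

```python
import json
import subprocess

KEY = 'neutron:provnet-network-type'

def fetch_switch_json():
    # Query the live NB database; run this where ovn-nbctl can reach it.
    return subprocess.check_output(
        ['ovn-nbctl', '--format=json', '--columns=name,external_ids',
         'list', 'Logical_Switch'])

def decode_ovsdb_map(value):
    # OVSDB's JSON output serializes maps as ["map", [[key, value], ...]].
    if isinstance(value, list) and value and value[0] == 'map':
        return dict(value[1])
    return {}

def switches_missing_key(raw_json):
    """Names of Logical_Switch rows whose external_ids lack KEY."""
    doc = json.loads(raw_json)
    name_idx = doc['headings'].index('name')
    ext_idx = doc['headings'].index('external_ids')
    return [row[name_idx]
            for row in doc['data']
            if KEY not in decode_ovsdb_map(row[ext_idx])]
```

On a controller node, `switches_missing_key(fetch_switch_json())` would return the switch names worth inspecting; I have not verified this against every NB schema version, so treat it as a starting point only.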
## Reproduction steps:
Sadly, I am not knowledgeable enough about Neutron/OVN to tell what went
wrong. All I can say is that we noticed these errors after upgrading our
OpenStack deployment. It is possible that our reboots are too brutal for
Neutron/OVN as is, and that we need to refine our upgrade procedure.
## Perceived Severity
A few existing resources are now unusable for our users, but new
resources are unaffected.
Given that our users expect to rely on existing resources, I would call
this issue a blocker, though to the community it may rather be a High priority.
## Expectations
I hope to learn what we may have done wrong, if our procedure is the
cause, and how I can fix the current situation manually to unblock our
users.
I'll stay available on this bug report to provide any additional insight
I can. Hopefully this initial version provides enough information.
** Affects: neutron
Importance: Undecided
Status: New
** Tags: ovn
--
https://bugs.launchpad.net/bugs/2111498
Title:
[OVN] Mechanism driver fails to fix router_ports: KeyError
'neutron:provnet-network-type'
Status in neutron:
New