** Also affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2103500
Title:
Race condition - Error creating instance if network is deleted by
different process
Status in neutron:
New
Status in OpenStack Compute (nova):
New
Bug description:
Note: I'm creating this bug report in both nova and neutron because I'm not
sure where the bug is present.
Description
===========
We run automated Python scripts on our monitoring system that create, test,
and delete OpenStack resources. Each of our three Availability Zones (AZs) has
two of these processes. These monitoring cycles repeat every 5 to 10 minutes.
Approximately once every 1 to 2 days, one of these monitoring checks fails
with the following error:
https://paste.openstack.org/show/bt3BWnyrBbQspiFx8KPZ/
"""
2025-03-12 18:42:17,764 ERROR: Unexpected API Error. Please report
this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'neutronclient.common.exceptions.InternalServerError'> (HTTP
500) (Request-ID: req-3b755d12-27b1-4531-b948-b613520c05ae)
"""
Upon investigating this Request-ID in the nova-api logs, we found that nova
reports an InternalServerError coming from neutron, along with a neutron
Request-ID:
https://paste.openstack.org/show/b6wGVVZNwaAuLmFgHvaq/
"""
2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi
neutronclient.common.exceptions.InternalServerError: The server has either
erred or is incapable of performing the requested operation.<br /><br />
"""
Further investigation of that Request-ID in neutron logs revealed this
specific error:
https://paste.openstack.org/show/bJL9b5aY99XpZlLZdzgy/
"""
2025-03-12 18:42:17.754 24 ERROR oslo_middleware.catch_errors
oslo_db.exception.RetryRequest: Network 6163d404-aa33-4d83-a8a4-ead7dc0d9e14
could not be found.
"""
Additionally, looking at the HTTP DELETE request, we confirmed that the
network was indeed deleted just milliseconds (0.005s) before the nova error
occurred:
"""
log_level:INFO,
logger:openstack.neutron,
programname:neutron-server,
python_module:neutron.wsgi,
timestamp:2025-03-12 18:42:17.749
payload:
XX.XXX.XX.XXX,10.170.65.35 \"DELETE
/v2.0/networks/6163d404-aa33-4d83-a8a4-ead7dc0d9e14 HTTP/1.0\" status: 204
len: 168 time: 0.6581874
"""
This specific example was pretty close in timing — the network deletion
happened just 0.005 seconds before the nova error.
In other occurrences, the gap between the DELETE and the nova error was
slightly larger (typically 0.5 to 1.2 seconds), but the same errors were
triggered.
What we suspect
===============
What we suspect is happening is that during instance creation, nova requests
all ports from neutron using the neutronclient.
"""
2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi File
"/var/lib/kolla/venv/lib/python3.10/site-packages/nova/network/neutron.py",
line 2664, in validate_networks
2025-03-12 18:42:17.759 19 ERROR nova.api.openstack.wsgi ports =
neutron.list_ports(**params)['ports']
"""
We believe that either Neutron or neutronclient tries to fetch the network
details for each port it has already gathered in a list. Since it has collected
all the ports in the project, it then attempts to fetch the information for
each port's network in the project. The issue happens if another process (in our
case, one of the other five Python scripts) deletes a network right after
Neutron has gathered the ports but just before it tries to fetch the network
details. When that happens, Neutron tries to look up the now-deleted network,
triggering the error because the network was removed a split second earlier.
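To make this concrete, here is a simplified sketch of the sequence we suspect.
It is illustrative only; the per-port network lookup loop is our assumption
about what neutron/neutronclient does, not actual nova or neutron code.
"""
def suspected_sequence(neutron, project_id):
    # Step 1: nova asks neutron for all ports in the project
    # (nova/network/neutron.py, validate_networks() -> neutron.list_ports()).
    ports = neutron.list_ports(tenant_id=project_id)['ports']

    # Step 2 (our assumption): the network details are then fetched for every
    # collected port. If another process deletes one of those networks between
    # step 1 and step 2, this lookup fails with "Network ... could not be
    # found" and bubbles up to nova-api as an HTTP 500.
    networks = {}
    for port in ports:
        networks[port['network_id']] = neutron.show_network(
            port['network_id'])['network']
    return networks
"""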
Temporary workaround
====================
We've figured out a workaround for this problem:
/nova/nova/network/neutron.py:2715-2717
"""
if quotas.get('port', -1) == -1:
# Unlimited Port Quota
return num_instances
"""
Setting the port quota of the project in question to unlimited (-1) makes
sure that validate_networks() just returns num_instances and skips the part
where neutron.list_ports() is called (line 2721). This has, for now, fixed the
problem for us.
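For reference, the workaround can also be applied from a script. The
openstacksdk calls below are our best guess at the right proxy methods, and
the cloud and project names are placeholders; the CLI equivalent would be
"openstack quota set --ports -1 <project>".
"""
import openstack

# Connect with credentials that are allowed to modify network quotas.
conn = openstack.connect(cloud='mycloud')              # placeholder cloud name
project = conn.identity.find_project('monitoring')     # placeholder project

# Set the neutron port quota to unlimited so validate_networks() returns
# num_instances early and never calls neutron.list_ports().
conn.network.update_quota(project.id, ports=-1)
"""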
Steps to reproduce
==================
So far, all occurrences have happened randomly during automated processes.
The steps below are based on my understanding and speculation of what’s
happening; a scripted sketch of these steps follows after the list:
1. Create two private networks in the same project — Network A and Network B.
2. Ensure that both networks contain one or more ports.
3. Request to create an instance attached to Network A using openstacksdk.
- During this process, nova-api calls neutron.list_ports().
- Neutron gathers all ports in the project.
4. Delete Network B — make sure this happens right AFTER Neutron has finished
gathering all ports, but right BEFORE Neutron starts fetching network details
for the ports it collected.
- Neutron attempts to fetch the network information for all gathered
ports, which includes Network B (now deleted).
- Since Network B no longer exists, Neutron fails to find it, causing
an error.
5. As a result, the nova-api call and the openstacksdk operation fail.
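The sketch below scripts these steps with openstacksdk. Hitting the exact
window described in step 4 is a matter of timing, so it will only reproduce
the failure occasionally; the cloud name, image ID and flavor ID are
placeholders.
"""
import threading
import openstack

conn = openstack.connect(cloud='mycloud')   # placeholder cloud name

# Steps 1-2: two private networks in the same project, network B with a port.
net_a = conn.network.create_network(name='race-net-a')
net_b = conn.network.create_network(name='race-net-b')
conn.network.create_subnet(network_id=net_a.id, ip_version=4,
                           cidr='192.0.2.0/24')
conn.network.create_subnet(network_id=net_b.id, ip_version=4,
                           cidr='198.51.100.0/24')
port_b = conn.network.create_port(network_id=net_b.id)

def boot_on_net_a():
    # Step 3: nova-api calls neutron.list_ports() while validating this request.
    conn.compute.create_server(name='race-test',
                               image_id='IMAGE_ID',    # placeholder
                               flavor_id='FLAVOR_ID',  # placeholder
                               networks=[{'uuid': net_a.id}])

def delete_net_b():
    # Step 4: delete network B while the server-create request is in flight.
    conn.network.delete_port(port_b.id)
    conn.network.delete_network(net_b.id)

t1 = threading.Thread(target=boot_on_net_a)
t2 = threading.Thread(target=delete_net_b)
t1.start()
t2.start()
t1.join()
t2.join()
"""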
Expected result
===============
The expected result is that either nova, neutron, or neutronclient recognizes
that the network has been deleted and then either re-collects all ports or
ignores the ports from the deleted network, so that nova can continue with the
instance creation.
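For illustration, the kind of tolerance we would expect could look like the
sketch below (not a proposed patch; NetworkNotFoundClient is the neutronclient
exception raised when a network cannot be found).
"""
from neutronclient.common import exceptions as neutron_exc

def networks_for_ports(neutron, ports):
    # Look up the network for each collected port, but skip ports whose
    # network was deleted between the port listing and this lookup.
    networks = {}
    for port in ports:
        net_id = port['network_id']
        if net_id in networks:
            continue
        try:
            networks[net_id] = neutron.show_network(net_id)['network']
        except neutron_exc.NetworkNotFoundClient:
            # Deleted by another process in the meantime; ignore its ports
            # instead of failing the whole instance-create request.
            continue
    return networks
"""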
Actual result
=============
Already described in description.
Environment
===========
OpenStack 2023.1 deployed w/ kolla-ansible
Ubuntu 24.04
Python scripts: python3.8.10 w/ openstacksdk 4.0.1
Neutron networking
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2103500/+subscriptions
--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help : https://help.launchpad.net/ListHelp