Public bug reported:

Seen with an ironic re-balance in this job:

https://d01b2e57f0a56cb7edf0-b6bc206936c08bb07a5f77cfa916a2d4.ssl.cf5.rackcdn.com/678298/4/check/ironic-tempest-ipa-wholedisk-direct-tinyipa-multinode/92c65ac/

On the subnode we see the RT detect that the node is moving hosts:

Aug 26 18:41:38.818412 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: INFO nova.compute.resource_tracker [None req-a894abee-a2f1-4423-8ede-2a1b9eef28a4 None None] ComputeNode 61dbc9c7-828b-4c42-b19c-a3716037965f moving from ubuntu-bionic-rax-ord-0010443317 to ubuntu-bionic-rax-ord-0010443319

On that new host, the ProviderTree cache is getting updated with refreshed associations for inventory:

Aug 26 18:41:38.881026 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: DEBUG nova.scheduler.client.report [None req-a894abee-a2f1-4423-8ede-2a1b9eef28a4 None None] Refreshing inventories for resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f {{(pid=747) _refresh_associations /opt/stack/nova/nova/scheduler/client/report.py:761}}

aggregates:

Aug 26 18:41:38.953685 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: DEBUG nova.scheduler.client.report [None req-a894abee-a2f1-4423-8ede-2a1b9eef28a4 None None] Refreshing aggregate associations for resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f, aggregates: None {{(pid=747) _refresh_associations /opt/stack/nova/nova/scheduler/client/report.py:770}}

and traits - but when we get traits the provider is gone:

Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager [None req-a894abee-a2f1-4423-8ede-2a1b9eef28a4 None None] Error updating resources for node 61dbc9c7-828b-4c42-b19c-a3716037965f.: ResourceProviderTraitRetrievalFailed: Failed to get traits for resource provider with UUID 61dbc9c7-828b-4c42-b19c-a3716037965f
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager Traceback (most recent call last):
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/compute/manager.py", line 8250, in _update_available_resource_for_node
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     startup=startup)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 715, in update_available_resource
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     self._update_available_resource(context, resources, startup=startup)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     return f(*args, **kwargs)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 738, in _update_available_resource
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     is_new_compute_node = self._init_compute_node(context, resources)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 561, in _init_compute_node
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     if self._check_for_nodes_rebalance(context, resources, nodename):
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 516, in _check_for_nodes_rebalance
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     self._update(context, cn)
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1054, in _update
Aug 26 18:41:38.995595 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     self._update_to_placement(context, compute_node, startup)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 49, in wrapped_f
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     return Retrying(*dargs, **dkw).call(f, *args, **kw)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 206, in call
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     return attempt.get(self._wrap_exception)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 247, in get
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     six.reraise(self.value[0], self.value[1], self.value[2])
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/usr/local/lib/python2.7/dist-packages/retrying.py", line 200, in call
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/compute/resource_tracker.py", line 970, in _update_to_placement
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     context, compute_node.uuid, name=compute_node.hypervisor_hostname)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/scheduler/client/report.py", line 858, in get_provider_tree_and_ensure_root
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     parent_provider_uuid=parent_provider_uuid)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/scheduler/client/report.py", line 666, in _ensure_resource_provider
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     self._refresh_associations(context, uuid_to_refresh, force=True)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/scheduler/client/report.py", line 778, in _refresh_associations
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     trait_info = self.get_provider_traits(context, rp_uuid)
Aug 26 18:41:38.996935 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager   File "/opt/stack/nova/nova/scheduler/client/report.py", line 381, in get_provider_traits
Aug 26 18:41:38.998320 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager     raise exception.ResourceProviderTraitRetrievalFailed(uuid=rp_uuid)
Aug 26 18:41:38.998320 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager ResourceProviderTraitRetrievalFailed: Failed to get traits for resource provider with UUID 61dbc9c7-828b-4c42-b19c-a3716037965f
Aug 26 18:41:38.998320 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.manager

That's because, back on the original host, the RT deleted the node that is no longer reported there:

Aug 26 18:41:38.832749 ubuntu-bionic-rax-ord-0010443317 nova-compute[19290]: INFO nova.compute.manager [None req-d5a9c4b6-f197-4f6c-8b12-8f736bbdb11c None None] Deleting orphan compute node 6 hypervisor host is 61dbc9c7-828b-4c42-b19c-a3716037965f, nodes are set([u'1d23263a-31d4-49d9-ad68-be19219c3bae', u'be80f41d-73ed-46ad-b8e4-cefb0193de36', u'f3c6add0-3eda-47d9-9624-c1f73d488066', u'2c909342-b5dc-4203-b9cb-05a8f29c6c35', u'4921f5d8-8b39-4d03-8423-8e8404128ece'])
Aug 26 18:41:38.962237 ubuntu-bionic-rax-ord-0010443317 nova-compute[19290]: INFO nova.scheduler.client.report [None req-d5a9c4b6-f197-4f6c-8b12-8f736bbdb11c None None] Deleted resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f

Every 60 seconds or so after that, the update_available_resource periodic task on the new host should correct this, but there are a couple of problems - we continue to see that the resource provider is not re-created:

Aug 26 18:42:37.122768 ubuntu-bionic-rax-ord-0010443319 nova-compute[747]: ERROR nova.compute.resource_tracker [None req-ab8d1a0e-385f-4333-bbb5-7b82250968fb None None] Skipping removal of allocations for deleted instances: Failed to retrieve allocations for resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f: {"errors": [{"status": 404, "request_id": "req-46f04ab5-2bd7-4a13-add9-4b9073587138", "detail": "The resource could not be found.\n\n Resource provider '61dbc9c7-828b-4c42-b19c-a3716037965f' not found: No resource provider with uuid 61dbc9c7-828b-4c42-b19c-a3716037965f found ", "title": "Not Found"}]}: ResourceProviderAllocationRetrievalFailed: Failed to retrieve allocations for resource provider 61dbc9c7-828b-4c42-b19c-a3716037965f: {"errors": [{"status": 404, "request_id": "req-46f04ab5-2bd7-4a13-add9-4b9073587138", "detail": "The resource could not be found.\n\n Resource provider '61dbc9c7-828b-4c42-b19c-a3716037965f' not found: No resource provider with uuid 61dbc9c7-828b-4c42-b19c-a3716037965f found ", "title": "Not Found"}]}

First, we don't go through the RT._update flow again, because when the new host detects the moved node it adds it to the compute_nodes dict:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/compute/resource_tracker.py#L513

and then the _update call fails to get the traits because the old host deleted the provider concurrently:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/compute/resource_tracker.py#L516

After that, update_available_resource will see the node in RT.compute_nodes already and not call _update:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/compute/resource_tracker.py#L546
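To make that concrete, here is a minimal, runnable toy model of the flow (hypothetical names and heavily simplified bodies - this is not the actual Nova code, it just illustrates why a failed _update during the rebalance leaves both the compute_nodes dict and the local provider cache poisoned):

    # Toy model of the race described above (made-up names, not Nova code).

    placement = set()  # resource providers that exist in the placement service

    class ProviderGone(Exception):
        """Stands in for ResourceProviderTraitRetrievalFailed."""

    class ToyResourceTracker(object):
        def __init__(self):
            self.compute_nodes = {}      # nodename -> compute node record
            self.provider_tree = set()   # local ProviderTree cache (rp uuids)

        def _update(self, nodename):
            # Models RT._update -> get_provider_tree_and_ensure_root: the
            # provider lands in the local cache, then its associations
            # (inventory, aggregates, traits) are refreshed from placement.
            self.provider_tree.add(nodename)
            if nodename not in placement:
                raise ProviderGone(nodename)

        def update_available_resource(self, nodename):
            # Models _init_compute_node: once the node is in compute_nodes,
            # later periodic runs return early and never reach _update again.
            if nodename in self.compute_nodes:
                return
            # Rebalance path: the node is cached *before* _update succeeds.
            self.compute_nodes[nodename] = object()
            self._update(nodename)

    rt = ToyResourceTracker()

    # First periodic run on the new host races with the old host's
    # "Deleting orphan compute node" / "Deleted resource provider".
    try:
        rt.update_available_resource('61dbc9c7')
    except ProviderGone:
        pass

    # Every later run hits the early return, so nothing re-creates the
    # provider and both caches keep their stale entries.
    rt.update_available_resource('61dbc9c7')
    print('61dbc9c7' in placement)         # False - never re-created
    print('61dbc9c7' in rt.provider_tree)  # True - stale local cache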
Another issue: because of the race between the time the new host was refreshing and adding the resource provider to its local ProviderTree cache:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L640

and the time the old host deleted the provider, the new host's ProviderTree cache now has the provider locally even though it is actually gone from placement - remember that this is where we failed to get the traits:

https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L778

So there seem to be two things to clean up:

1. If we fail here:

   https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/compute/resource_tracker.py#L516

   we should remove the node from the RT.compute_nodes dict - similar to this fix: https://review.opendev.org/#/c/675704/. That way we go through RT._update again on the next update_available_resource periodic task run.

2. If we fail here:

   https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L666

   we should remove the provider from the local ProviderTree cache, so that the next run will find that the provider does not exist and re-create it here:

   https://github.com/openstack/nova/blob/71478c3eedd95e2eeb219f47460603221ee249b9/nova/scheduler/client/report.py#L642

Now maybe the "remove from the ProviderTree cache on failure" logic needs to be conditional on whether or not created_rp is None, I'm not sure. It might be best to just always remove the entry from the cache if we got an error trying to refresh associations for it, so we start clean next time - that's likely a better question for Eric Fried.
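For illustration, the two cleanups could look roughly like the following untested sketches (the helper names here are placeholders; the real changes would live inside ResourceTracker._check_for_nodes_rebalance and SchedulerReportClient._ensure_resource_provider, and their bodies are more involved than shown):

    # Untested, simplified sketches of the two cleanups; helper names are
    # placeholders, not actual Nova methods.

    def update_rebalanced_node(rt, context, cn, nodename):
        """Cleanup 1: don't leave a stale entry in RT.compute_nodes.

        The rebalance path caches the node first; if the placement update
        fails, drop the entry again so the next update_available_resource
        periodic retries RT._update instead of returning early.
        """
        rt.compute_nodes[nodename] = cn
        try:
            rt._update(context, cn)
        except Exception:
            rt.compute_nodes.pop(nodename, None)
            raise

    def refresh_associations_or_evict(reportclient, context, rp_uuid):
        """Cleanup 2: keep the local ProviderTree cache consistent.

        If refreshing associations fails (e.g. the provider was deleted
        from placement concurrently by the old host), remove it from the
        local cache so the next _ensure_resource_provider call re-creates
        it. Whether to only do this when created_rp is None is the open
        question above; this sketch always evicts.
        """
        try:
            reportclient._refresh_associations(context, rp_uuid, force=True)
        except Exception:
            if reportclient._provider_tree.exists(rp_uuid):
                reportclient._provider_tree.remove(rp_uuid)
            raise

In both sketches the original exception is re-raised so the periodic still logs the failure, but the next run starts from clean caches and can retry.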
** Affects: nova
     Importance: Medium
         Status: Triaged

** Tags: ironic placement resource-tracker

--
https://bugs.launchpad.net/bugs/1841481

Title:
  Race during ironic re-balance corrupts local RT ProviderTree and compute_nodes cache

Status in OpenStack Compute (nova):
  Triaged