Reviewed:  https://review.openstack.org/556669
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3518ccb665b9b6374c476b4c2e63fa966aee1f3a
Submitter: Zuul
Branch:    master
commit 3518ccb665b9b6374c476b4c2e63fa966aee1f3a
Author: Eric Fried <efr...@us.ibm.com>
Date:   Tue Jul 3 14:34:00 2018 -0500

    Check provider generation and retry on conflict

    Update aggregate-related scheduler report client methods to use
    placement microversion 1.19, which returns the provider generation in
    GET /rps/{u}/aggregates and handles generation conflicts in
    PUT /rps/{u}/aggregates.

    Helper methods previously returning aggregates and traits now also
    return the generation, which is fed through appropriately to
    subsequent calls. As a result, the generation kwarg is no longer
    needed in _refresh_associations, so it is removed.

    Doing this exposes the race described in the cited bug, so we add a
    retry decorator to the resource tracker's _update and the report
    client's aggregate_{add|remove}_host methods.

    Related to blueprint placement-aggregate-generation
    Closes-Bug: #1779931

    Change-Id: I3c5fbb18297db71e682fcddb5bf4536595d92383


** Changed in: nova
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1779931

Title:
  Provider update race between host aggregate sync and resource tracker

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  The resource tracker (in n-cpu) used to be the only place we pushed
  changes to placement, all funneled through a single mutex
  (COMPUTE_RESOURCE_SEMAPHORE) to prevent conflicts. When we started
  mirroring host aggregates as placement aggregates [1], which happens
  in the n-api process, we introduced races with the resource tracker,
  e.g.
  as follows:

    n-api: aggregate_add_host => _get_provider_by_name [2]
    n-cpu: get_provider_tree_and_ensure_root [3]
    n-api: set_aggregates_for_provider [4]
    n-cpu: update_from_provider_tree [5] => set_aggregates_for_provider [6]

  (similar for aggregate_remove_host)

  Whoever gets to set_aggregates_for_provider first will push their view
  of the aggregates to placement. Until we start checking for generation
  conflicts in set_aggregates_for_provider, whoever gets there second
  will simply blow away the first one's changes. The race therefore
  doesn't cause failures, and we don't notice it. Once we do start
  checking for generation conflicts in set_aggregates_for_provider [7],
  we start seeing actual failures, like:

  tempest.api.compute.admin.test_aggregates.AggregatesAdminTestJSON.test_aggregate_add_host_get_details[id-eeef473c-7c52-494d-9f09-2ed7fc8fc036]
  ----------------------------------------------------------------------------------------------------------------------------------------------

  Captured traceback-1:
  ~~~~~~~~~~~~~~~~~~~~~
      Traceback (most recent call last):
        File "tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc
          return func(*args, **kwargs)
        File "tempest/lib/services/compute/aggregates_client.py", line 70, in delete_aggregate
          resp, body = self.delete("os-aggregates/%s" % aggregate_id)
        File "tempest/lib/common/rest_client.py", line 310, in delete
          return self.request('DELETE', url, extra_headers, headers, body)
        File "tempest/lib/services/compute/base_compute_client.py", line 48, in request
          method, url, extra_headers, headers, body, chunked)
        File "tempest/lib/common/rest_client.py", line 668, in request
          self._error_checker(resp, resp_body)
        File "tempest/lib/common/rest_client.py", line 779, in _error_checker
          raise exceptions.BadRequest(resp_body, resp=resp)
      tempest.lib.exceptions.BadRequest: Bad request
      Details: {u'code': 400, u'message': u'Cannot remove host from aggregate 2. Reason: Host aggregate is not empty.'}

  ...
  Captured traceback:
  ~~~~~~~~~~~~~~~~~~~
      Traceback (most recent call last):
        File "tempest/api/compute/admin/test_aggregates.py", line 193, in test_aggregate_add_host_get_details
          self.client.add_host(aggregate['id'], host=self.host)
        File "tempest/lib/services/compute/aggregates_client.py", line 95, in add_host
          post_body)
        File "tempest/lib/common/rest_client.py", line 279, in post
          return self.request('POST', url, extra_headers, headers, body, chunked)
        File "tempest/lib/services/compute/base_compute_client.py", line 48, in request
          method, url, extra_headers, headers, body, chunked)
        File "tempest/lib/common/rest_client.py", line 668, in request
          self._error_checker(resp, resp_body)
        File "tempest/lib/common/rest_client.py", line 845, in _error_checker
          message=message)
      tempest.lib.exceptions.ServerFault: Got server fault
      Details: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
      <class 'nova.exception.ResourceProviderUpdateConflict'>

  [1] https://review.openstack.org/#/c/553597/
  [2] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1935
  [3] https://github.com/openstack/nova/blob/ee7c39e4416e215d5bf5fbf07c0a8a4301828248/nova/compute/resource_tracker.py#L883
  [4] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1956
  [5] https://github.com/openstack/nova/blob/ee7c39e4416e215d5bf5fbf07c0a8a4301828248/nova/compute/resource_tracker.py#L897
  [6] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1454
  [7] https://review.openstack.org/#/c/556669/

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1779931/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   :
https://help.launchpad.net/ListHelp
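
The pattern the fix relies on, i.e. sending each write with the provider
generation last read, having the server reject stale generations, and
retrying the whole read-modify-write on conflict, can be sketched as
follows. This is an illustrative toy, not nova's actual code:
GenerationConflict, retry_on_conflict, and FakeProvider are hypothetical
names standing in for ResourceProviderUpdateConflict, the retry
decorator on _update / aggregate_{add|remove}_host, and the placement
service respectively.

```python
import functools


class GenerationConflict(Exception):
    """Raised when a write carries a stale provider generation."""


def retry_on_conflict(max_attempts=4):
    """Retry the wrapped callable when it raises GenerationConflict.

    The wrapped function must re-read the current state (and generation)
    on each attempt, so a retry picks up the other writer's changes
    instead of clobbering them.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except GenerationConflict:
                    if attempt == max_attempts - 1:
                        raise
        return wrapper
    return decorator


class FakeProvider:
    """Toy stand-in for a placement resource provider."""
    def __init__(self):
        self.generation = 0
        self.aggregates = set()

    def put_aggregates(self, aggregates, generation):
        # Like PUT /rps/{u}/aggregates at microversion 1.19: reject the
        # write if the caller's generation is stale, bump it otherwise.
        if generation != self.generation:
            raise GenerationConflict()
        self.aggregates = set(aggregates)
        self.generation += 1
        return self.generation


provider = FakeProvider()

@retry_on_conflict()
def add_aggregate(agg):
    # Re-read the current view and generation on every attempt, as the
    # report client does after a conflict.
    provider.put_aggregates(provider.aggregates | {agg},
                            provider.generation)

add_aggregate('some-aggregate-uuid')
```

A second writer that raced in between the read and the PUT would bump
the generation, the first writer's PUT would raise GenerationConflict,
and the decorator would rerun the read-modify-write against the fresh
state, which is exactly the behavior the conflict check in
set_aggregates_for_provider [7] forces on both n-api and n-cpu.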