Reviewed:  https://review.openstack.org/556669
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3518ccb665b9b6374c476b4c2e63fa966aee1f3a
Submitter: Zuul
Branch:    master
commit 3518ccb665b9b6374c476b4c2e63fa966aee1f3a
Author: Eric Fried <efr...@us.ibm.com>
Date:   Tue Jul 3 14:34:00 2018 -0500

    Check provider generation and retry on conflict

    Update aggregate-related scheduler report client methods to use
    placement microversion 1.19, which returns the provider generation in
    GET /rps/{u}/aggregates and handles generation conflicts in
    PUT /rps/{u}/aggregates.

    Helper methods previously returning aggregates and traits now also
    return the generation, which is fed through appropriately to
    subsequent calls. As a result, the generation kwarg is no longer
    needed in _refresh_associations, so it is removed.

    Doing this exposes the race described in the cited bug, so we add a
    retry decorator to the resource tracker's _update and the report
    client's aggregate_{add|remove}_host methods.

    Related to blueprint placement-aggregate-generation
    Closes-Bug: #1779931

    Change-Id: I3c5fbb18297db71e682fcddb5bf4536595d92383


** Changed in: nova
       Status: In Progress => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1779931

Title:
  Provider update race between host aggregate sync and resource tracker

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  The resource tracker (in n-cpu) used to be the only place we pushed
  changes to placement, all funneled through a single mutex
  (COMPUTE_RESOURCE_SEMAPHORE) to prevent conflicts. When we started
  mirroring host aggregates as placement aggregates [1], which happens
  in the n-api process, we introduced races with the resource tracker,
  e.g.
  as follows:

    n-api: aggregate_add_host => _get_provider_by_name [2]
    n-cpu: get_provider_tree_and_ensure_root [3]
    n-api: set_aggregates_for_provider [4]
    n-cpu: update_from_provider_tree [5] => set_aggregates_for_provider [6]

  (similar for aggregate_remove_host)

  Whoever gets to set_aggregates_for_provider first will push their view
  of the aggregates to placement. Until we start checking for generation
  conflicts in set_aggregates_for_provider, whoever gets there second
  will simply blow away the first one's changes. The race therefore
  doesn't cause failures, and we don't notice it. Once we do start
  checking for generation conflicts in set_aggregates_for_provider [7],
  we start seeing actual failures, like:

  tempest.api.compute.admin.test_aggregates.AggregatesAdminTestJSON.test_aggregate_add_host_get_details[id-eeef473c-7c52-494d-9f09-2ed7fc8fc036]
  ----------------------------------------------------------------------------------------------------------------------------------------------

  Captured traceback-1:
  ~~~~~~~~~~~~~~~~~~~~~
      Traceback (most recent call last):
        File "tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc
          return func(*args, **kwargs)
        File "tempest/lib/services/compute/aggregates_client.py", line 70, in delete_aggregate
          resp, body = self.delete("os-aggregates/%s" % aggregate_id)
        File "tempest/lib/common/rest_client.py", line 310, in delete
          return self.request('DELETE', url, extra_headers, headers, body)
        File "tempest/lib/services/compute/base_compute_client.py", line 48, in request
          method, url, extra_headers, headers, body, chunked)
        File "tempest/lib/common/rest_client.py", line 668, in request
          self._error_checker(resp, resp_body)
        File "tempest/lib/common/rest_client.py", line 779, in _error_checker
          raise exceptions.BadRequest(resp_body, resp=resp)
      tempest.lib.exceptions.BadRequest: Bad request
      Details: {u'code': 400, u'message': u'Cannot remove host from aggregate 2. Reason: Host aggregate is not empty.'}

  ...
  Captured traceback:
  ~~~~~~~~~~~~~~~~~~~
      Traceback (most recent call last):
        File "tempest/api/compute/admin/test_aggregates.py", line 193, in test_aggregate_add_host_get_details
          self.client.add_host(aggregate['id'], host=self.host)
        File "tempest/lib/services/compute/aggregates_client.py", line 95, in add_host
          post_body)
        File "tempest/lib/common/rest_client.py", line 279, in post
          return self.request('POST', url, extra_headers, headers, body, chunked)
        File "tempest/lib/services/compute/base_compute_client.py", line 48, in request
          method, url, extra_headers, headers, body, chunked)
        File "tempest/lib/common/rest_client.py", line 668, in request
          self._error_checker(resp, resp_body)
        File "tempest/lib/common/rest_client.py", line 845, in _error_checker
          message=message)
      tempest.lib.exceptions.ServerFault: Got server fault
      Details: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
      <class 'nova.exception.ResourceProviderUpdateConflict'>

  [1] https://review.openstack.org/#/c/553597/
  [2] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1935
  [3] https://github.com/openstack/nova/blob/ee7c39e4416e215d5bf5fbf07c0a8a4301828248/nova/compute/resource_tracker.py#L883
  [4] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1956
  [5] https://github.com/openstack/nova/blob/ee7c39e4416e215d5bf5fbf07c0a8a4301828248/nova/compute/resource_tracker.py#L897
  [6] https://github.com/openstack/nova/blob/df5c253b58f82dcca7f59ac34fc8b8b51e824ca4/nova/scheduler/client/report.py#L1454
  [7] https://review.openstack.org/#/c/556669/

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1779931/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   :
https://help.launchpad.net/ListHelp
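
The pattern the fix relies on, i.e. sending each write with the provider
generation last read, having the server reject stale generations, and
retrying the whole read-modify-write on conflict, can be sketched as
follows. This is an illustrative toy, not nova's actual code:
GenerationConflict, retry_on_conflict, and FakeProvider are hypothetical
names standing in for ResourceProviderUpdateConflict, the retry
decorator on _update / aggregate_{add|remove}_host, and the placement
service respectively.

```python
import functools


class GenerationConflict(Exception):
    """Raised when a write carries a stale provider generation."""


def retry_on_conflict(max_attempts=4):
    """Retry the wrapped callable when it raises GenerationConflict.

    The wrapped function must re-read the current state (and generation)
    on each attempt, so a retry picks up the other writer's changes
    instead of clobbering them.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except GenerationConflict:
                    if attempt == max_attempts - 1:
                        raise
        return wrapper
    return decorator


class FakeProvider:
    """Toy stand-in for a placement resource provider."""
    def __init__(self):
        self.generation = 0
        self.aggregates = set()

    def put_aggregates(self, aggregates, generation):
        # Like PUT /rps/{u}/aggregates at microversion 1.19: reject the
        # write if the caller's generation is stale, bump it otherwise.
        if generation != self.generation:
            raise GenerationConflict()
        self.aggregates = set(aggregates)
        self.generation += 1
        return self.generation


provider = FakeProvider()

@retry_on_conflict()
def add_aggregate(agg):
    # Re-read the current view and generation on every attempt, as the
    # report client does after a conflict.
    provider.put_aggregates(provider.aggregates | {agg},
                            provider.generation)

add_aggregate('some-aggregate-uuid')
```

A second writer that raced in between the read and the PUT would bump
the generation, the first writer's PUT would raise GenerationConflict,
and the decorator would rerun the read-modify-write against the fresh
state, which is exactly the behavior the conflict check in
set_aggregates_for_provider [7] forces on both n-api and n-cpu.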