Public bug reported:

I recently found a bug in Mitaka, and it appears to be still present in
master.

I was testing a separate patch by doing resizes, and bugs in my code had
left behind a number of incomplete resizes involving compute-1.  I then
did a resize from compute-0 to compute-0, and saw compute-1's resource
usage go up when it ran the resource audit.

This got me curious, so I went digging and discovered a gap in the current
resource audit logic.  The problem arises if:

    1) You have one or more stale migrations involving the current compute
    node that didn't complete properly.

    2) The instance from the stale migration is currently doing a
    resize/migration that does not involve the current compute node.

When this happens, _update_usage_from_migrations() will be passed the stale
migration, and since the instance is in fact in a resize state, the current
compute node will erroneously account for the instance, even though the
instance isn't doing anything involving the current compute node.

The fix is to check that the instance's migration ID matches the ID of the
migration being analyzed.  This works because, for a stale migration, we
will have hit the error case in _pair_instances_to_migrations(), so the
instance will be lazy-loaded from the DB, ensuring that its migration ID is
up to date.
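
As a rough illustration of the gap and the proposed check, here is a small
standalone model.  It is not nova code: the Migration and Instance classes,
their fields (including in_resize_state and migration_id), and the
migrations_to_account() helper are all hypothetical stand-ins for the real
objects handled by the resource audit.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Migration:
    # Hypothetical stand-in for a migration record.
    id: int
    instance_uuid: str
    source_node: str
    dest_node: str


@dataclass
class Instance:
    # Hypothetical stand-in for an instance freshly loaded from the DB.
    uuid: str
    in_resize_state: bool
    # ID of the migration the instance is currently part of, if any.
    migration_id: Optional[int] = None


def migrations_to_account(node: str,
                          migrations: List[Migration],
                          instances: Dict[str, Instance],
                          check_migration_id: bool) -> List[Migration]:
    """Return the migrations whose instance would be counted against node."""
    counted = []
    for migration in migrations:
        # The audit only looks at migrations involving this node.
        if node not in (migration.source_node, migration.dest_node):
            continue
        instance = instances[migration.instance_uuid]
        if not instance.in_resize_state:
            continue
        # Proposed check: skip migrations the instance is no longer part of.
        if check_migration_id and instance.migration_id != migration.id:
            continue
        counted.append(migration)
    return counted


# Migration 1 is stale and involves compute-1.  Migration 2 is the resize
# currently in progress, from compute-0 to compute-0.
stale = Migration(1, "inst-a", "compute-1", "compute-2")
active = Migration(2, "inst-a", "compute-0", "compute-0")
instances = {"inst-a": Instance("inst-a", True, migration_id=active.id)}

without_check = migrations_to_account(
    "compute-1", [stale, active], instances, check_migration_id=False)
with_check = migrations_to_account(
    "compute-1", [stale, active], instances, check_migration_id=True)

print([m.id for m in without_check])  # [1]: stale migration is counted
print([m.id for m in with_check])     # []: stale migration is skipped
```

In this model, compute-0 still accounts for the active migration with the
check enabled, since the instance's migration ID matches that migration.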

** Affects: nova
     Importance: Undecided
     Assignee: Chris Friesen (cbf123)
         Status: In Progress


** Tags: compute

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1600304

Title:
  _update_usage_from_migrations() can end up processing stale migrations

Status in OpenStack Compute (nova):
  In Progress
