[Yahoo-eng-team] [Bug 1821373] Re: Most instance actions can be called concurrently

2019-07-01 Thread Matt Riedemann
** Also affects: nova/rocky
   Importance: Undecided
   Status: New

** Also affects: nova/queens
   Importance: Undecided
   Status: New

** Also affects: nova/stein
   Importance: Undecided
   Status: New

** Changed in: nova/queens
   Status: New => In Progress

** Changed in: nova/stein
   Status: New => Fix Released

** Changed in: nova/queens
   Importance: Undecided => Low

** Changed in: nova/rocky
   Importance: Undecided => Low

** Changed in: nova/queens
 Assignee: (unassigned) => Matthew Booth (mbooth-9)

** Changed in: nova/stein
   Importance: Undecided => Low

** Changed in: nova/stein
 Assignee: (unassigned) => Matthew Booth (mbooth-9)

** Changed in: nova/rocky
 Assignee: (unassigned) => Matthew Booth (mbooth-9)

** Changed in: nova/rocky
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821373

Title:
  Most instance actions can be called concurrently

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  In Progress
Status in OpenStack Compute (nova) rocky series:
  Fix Released
Status in OpenStack Compute (nova) stein series:
  Fix Released

Bug description:
  A customer reported that they were getting DB corruption if they
  called shelve twice in quick succession on the same instance. This
  should be prevented by the guard in nova.API.shelve, which does:

instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])

  This is intended to act as a robust gate against 2 instance actions
  happening concurrently. The first will set the task state to SHELVING,
  the second will fail because the task state is not SHELVING. The
  comparison is done atomically in
  db.instance_update_and_get_original(), and should be race free.

  However, instance.save() shortcuts if there is no update and does not
  call db.instance_update_and_get_original(). Therefore this guard fails
  if we call the same operation twice:

instance = get_instance()
  => Returned instance.task_state is None
instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])
  => task_state was None, now SHELVING, updates = {'task_state': SHELVING}
  => db.instance_update_and_get_original() executes and succeeds

instance = get_instance()
  => Returned instance.task_state is SHELVING
instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])
  => task_state was SHELVING, still SHELVING, updates = {}
  => db.instance_update_and_get_original() does not execute, therefore 
doesn't raise the expected exception

  This pattern is common to almost all instance actions in nova api. A
  quick scan suggests that all of the following actions are affected by
  this bug, and can therefore all potentially be executed multiple times
  concurrently for the same instance:

  restore
  force_stop
  start
  backup
  snapshot
  soft reboot
  hard reboot
  rebuild
  revert_resize
  resize
  shelve
  shelve_offload
  unshelve
  pause
  unpause
  suspend
  resume
  rescue
  unrescue
  set_admin_password
  live_migrate
  evacuate

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1821373/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1821373] Re: Most instance actions can be called concurrently

2019-05-15 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/658845
Committed: 
https://git.openstack.org/cgit/openstack/nova/commit/?id=aae5c7aa3819ad9161fd2effed3872d540099230
Submitter: Zuul
Branch:master

commit aae5c7aa3819ad9161fd2effed3872d540099230
Author: Matthew Booth 
Date:   Mon May 13 16:04:39 2019 +0100

Fix retry of instance_update_and_get_original

_instance_update modifies its 'values' argument. Consequently if it is
retried due to an update conflict, the second invocation has the wrong
arguments.

A specific issue this causes is that if we called it with
expected_task_state a concurrent modification to task_state will cause
us to fail and retry. However, expected_task_state will have been popped
from values on the first invocation and will not be present for the
second. Consequently the second invocation will fail to perform the
task_state check and therefore succeed, resulting in a race.

We rewrite the old race unit test which wasn't testing the correct
thing for 2 reasons:

1. Due to the bug fixed in this patch, although we were calling
   update_on_match() twice, the second call didn't check the task state.
2. side_effect=iterable returns function items without executing them,
   but we weren't hitting this due to the bug fixed in this patch.

Closes-Bug: #1821373
Change-Id: I01c63e685113bf30e687ccb14a4d18e344b306f6


** Changed in: nova
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821373

Title:
  Most instance actions can be called concurrently

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  A customer reported that they were getting DB corruption if they
  called shelve twice in quick succession on the same instance. This
  should be prevented by the guard in nova.API.shelve, which does:

instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])

  This is intended to act as a robust gate against 2 instance actions
  happening concurrently. The first will set the task state to SHELVING,
  the second will fail because the task state is not SHELVING. The
  comparison is done atomically in
  db.instance_update_and_get_original(), and should be race free.

  However, instance.save() shortcuts if there is no update and does not
  call db.instance_update_and_get_original(). Therefore this guard fails
  if we call the same operation twice:

instance = get_instance()
  => Returned instance.task_state is None
instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])
  => task_state was None, now SHELVING, updates = {'task_state': SHELVING}
  => db.instance_update_and_get_original() executes and succeeds

instance = get_instance()
  => Returned instance.task_state is SHELVING
instance.task_state = task_states.SHELVING
instance.save(expected_task_state=[None])
  => task_state was SHELVING, still SHELVING, updates = {}
  => db.instance_update_and_get_original() does not execute, therefore 
doesn't raise the expected exception

  This pattern is common to almost all instance actions in nova api. A
  quick scan suggests that all of the following actions are affected by
  this bug, and can therefore all potentially be executed multiple times
  concurrently for the same instance:

  restore
  force_stop
  start
  backup
  snapshot
  soft reboot
  hard reboot
  rebuild
  revert_resize
  resize
  shelve
  shelve_offload
  unshelve
  pause
  unpause
  suspend
  resume
  rescue
  unrescue
  set_admin_password
  live_migrate
  evacuate

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1821373/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp