[Yahoo-eng-team] [Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

2017-09-22 Thread Alex Schultz
This is a really old bug and I don't think it applies to tripleo anymore
(if ever). Setting to invalid for tripleo.

** Changed in: tripleo
   Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1341420

Title:
  gap between scheduler selection and claim causes spurious failures
  when the instance is the last one to fit

Status in OpenStack Compute (nova):
  Invalid
Status in tripleo:
  Invalid

Bug description:
  There is a race between the scheduler in select_destinations, which
  selects a set of hosts, and the nova compute manager, which claims
  resources on those hosts when building the instance. The race is
  particularly noticeable with Ironic, where every request consumes a
  full host, but it can turn up with libvirt etc. too. Multiple
  schedulers will likely exacerbate this as well, unless they run on a
  version of Python with randomised dictionary ordering, in which case
  they will actually make it better :).
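
  To make the window concrete, here is a minimal, purely illustrative
  Python sketch - not Nova code, the names only loosely mirror the real
  methods: the scheduler decides against a cached capacity view, and
  nothing is deducted until the compute manager claims much later.

    # cached scheduler view: one slot left on the only host
    hosts = {'node-1': 1}

    def select_destination():
        # scheduler side: a read-only decision, nothing is claimed here
        return next(h for h, free in hosts.items() if free > 0)

    def instance_claim(host):
        # compute side: the claim happens later, when the build starts
        if hosts[host] <= 0:
            raise Exception('claim failed on %s, rescheduling' % host)
        hosts[host] -= 1

    a = select_destination()   # request A -> 'node-1'
    b = select_destination()   # request B -> 'node-1' too, nothing claimed yet
    instance_claim(a)          # A wins the last slot
    try:
        instance_claim(b)      # B discovers the host is full only now
    except Exception as exc:
        print(exc)             # claim failed on node-1, rescheduling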

  I've put https://review.openstack.org/106677 up to remove a comment
  which comes from before we introduced this race.

  One mitigation already exists: the filter scheduler's _schedule
  method picks a host at random from the top few weighed candidates to
  avoid returning the same host for repeated requests, but the default
  size of that candidate set is 1 - so when Heat requests a single
  instance, the same candidate is chosen every time. Setting that
  number higher can avoid all concurrent requests hitting the same
  host, but it is still a race, and still likely to fail fairly hard in
  near-capacity situations (e.g. deploying all machines in a cluster
  with Ironic and Heat).
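
  For reference: if memory serves, the candidate-set size above is
  governed by the scheduler_host_subset_size option of that era, e.g.
  in nova.conf (illustrative value):

    [DEFAULT]
    # pick randomly among the top 5 weighed hosts instead of always
    # returning the single best-weighed host
    scheduler_host_subset_size = 5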

  Folks wanting to reproduce this: take a decent-size cloud - e.g. 5 or
  10 hypervisor hosts (KVM is fine). Deploy VMs until each hypervisor
  has capacity for only one more. Then deploy a bunch of VMs one at a
  time but very close together - e.g. use the Python API with cached
  keystone credentials and boot 5 in a loop, as in the sketch below.
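
  A minimal sketch of that reproducer using python-novaclient; the auth
  URL, credentials, and image/flavor IDs are placeholders:

    from keystoneauth1 import loading, session
    from novaclient import client

    # build a session from (placeholder) credentials
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://keystone.example.com:5000/v3',
        username='demo', password='secret', project_name='demo',
        user_domain_name='Default', project_domain_name='Default')
    nova = client.Client('2.1', session=session.Session(auth=auth))

    image_id = 'IMAGE-UUID'    # placeholder
    flavor_id = 'FLAVOR-UUID'  # placeholder: a near-capacity (or Ironic) flavor

    # fire the boots back to back so they land inside the
    # select_destinations -> instance_claim window
    for i in range(5):
        nova.servers.create(name='race-test-%d' % i,
                            image=image_id, flavor=flavor_id)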

  If using Ironic you will want https://review.openstack.org/106676 to
  let you see which host is being returned from the selection.

  Possible fixes:
   - have the scheduler be a bit smarter about returning hosts - e.g. track destination selection counts since the last refresh and weight hosts by that count as well (see the sketch after this list)
   - reinstate actioning claims in the scheduler, allowing the audit to correct any claimed-but-not-started resource counts asynchronously
   - special-case the retry behaviour if there are lots of resources available elsewhere in the cluster.
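
  A self-contained sketch of the first option - not the Nova weigher
  API, just the idea of penalising hosts already handed out since the
  last host-state refresh:

    import collections

    class SelectionAwareChooser(object):
        def __init__(self, penalty=1.0):
            self.penalty = penalty
            self._selected = collections.Counter()

        def reset(self):
            # call whenever host state is refreshed from the compute nodes
            self._selected.clear()

        def pick(self, weighed_hosts):
            # weighed_hosts: list of (hostname, base_weight) tuples
            adjusted = [(name, weight - self.penalty * self._selected[name])
                        for name, weight in weighed_hosts]
            best = max(adjusted, key=lambda item: item[1])[0]
            self._selected[best] += 1
            return best

    chooser = SelectionAwareChooser()
    hosts = [('host-a', 2.0), ('host-b', 2.0)]
    # back-to-back requests now spread out instead of piling onto host-a
    print(chooser.pick(hosts), chooser.pick(hosts))   # host-a host-b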

  Stats-wise: I just tested a 29-instance deployment with Ironic and a
  Heat stack, with 45 machines to deploy onto (so 45 hosts in the
  scheduler set), and 4 instances failed with this race - which means
  they were rescheduled and failed 3 times each, or 12 cases of the
  scheduler racing *at minimum*.

  background chat

  15:43 < lifeless> mikal: around? I need to sanity check something
  15:44 < lifeless> ulp, nope, am sure of it. filing a bug.
  15:45 < mikal> lifeless: ok
  15:46 < lifeless> mikal: oh, you're here, I will run it past you :)
  15:46 < lifeless> mikal: if you have ~5m
  15:46 < mikal> Sure
  15:46 < lifeless> so, symptoms
  15:46 < lifeless> nova boot <...> --num-instances 45 -> works fairly 
reliably. Some minor timeout related things to fix but nothing dramatic.
  15:47 < lifeless> heat create-stack <...> with a stack with 45 instances in 
it -> about 50% of instances fail to come up
  15:47 < lifeless> this is with Ironic
  15:47 < mikal> Sure
  15:47 < lifeless> the failure on all the instances is the retry-three-times 
failure-of-death
  15:47 < lifeless> what I believe is happening is this
  15:48 < lifeless> the scheduler is allocating the same weighed list of hosts 
for requests that happen close enough together
  15:49 < lifeless> and I believe its able to do that because the target hosts 
(from select_destinations) need to actually hit the compute node manager and 
have
  15:49 < lifeless> with rt.instance_claim(context, instance, 
limits):
  15:49 < lifeless> happen in _build_and_run_instance
  15:49 < lifeless> before the resource usage is assigned
  15:49 < mikal> Is heat making 45 separate requests to the nova API?
  15:49 < lifeless> eys
  15:49 < lifeless> yes
  15:49 < lifeless> thats the key difference
  15:50 < lifeless> same flavour, same image
  15:50 < openstackgerrit> Sam Morrison proposed a change to openstack/nova: 
Remove cell api overrides for lock and unlock  
https://review.openstack.org/89487
  15:50 < mikal> And you have enough quota for these instances, right?
  15:50 < lifeless> yes
  15:51 < mikal> I'd have to dig deeper to have an answer, but it sure does 
seem worth filing a bug for
  15:51 < lifeless> my theory is that there is enough time between 
select_destinations in the conductor, and _build_and_run_instance in compute 
for another request to come in the front door and be scheduled to the same host
  15:51 < mikal> That seems possible to me

[Yahoo-eng-team] [Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

2017-02-03 Thread Dan Smith
What you describe is fundamental to how nova works right now. We
speculate in the scheduler, and if two requests race, we handle it with
a reschedule. Nova specifically states that scheduling every last
resource is out of scope. When trying to do that (which is often the
use case for ironic) you're likely to hit this race as you run out of
capacity:

https://github.com/openstack/nova/blob/master/doc/source/project_scope.rst#iaas-not-batch-processing

In the next few cycles we plan to move the claim process to the
placement engine, which will eliminate most of these race-to-claim
issues and make things better for this kind of arrangement.

Until that point, this is not a bug though, because it's specifically
how nova is designed to work.
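
A rough sketch of why claiming at schedule time closes the window; this
is only an illustration of the check-and-claim idea, not the placement
API:

  import threading

  class ResourceProvider(object):
      def __init__(self, capacity):
          self.capacity = capacity
          self._lock = threading.Lock()

      def claim(self, amount=1):
          # check and decrement atomically: a losing request is told "no"
          # at scheduling time instead of failing later in instance_claim
          with self._lock:
              if self.capacity < amount:
                  return False
              self.capacity -= amount
              return True

  node = ResourceProvider(capacity=1)
  print(node.claim())   # True  - the first request wins the last slot up front
  print(node.claim())   # False - the second can immediately try another host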

** Changed in: nova
   Status: New => Invalid

Status in OpenStack Compute (nova):
  Invalid
Status in tripleo:
  New


[Yahoo-eng-team] [Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

2017-02-03 Thread Vasyl Saienko
The issue was reproduced in the ironic multinode job:
http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs

Several instances were scheduled to the same ironic node.

http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs/screen-n-sch.txt.gz#_2017-02-03_18_46_35_112

2017-02-03 18:46:35.112 21443 DEBUG nova.scheduler.filter_scheduler [req-f8358730-fc5a-4f56-959c-af8742c8c3c2 tempest-ServersTestJSON-264305795 tempest-ServersTestJSON-264305795] Selected host: WeighedHost [host: (ubuntu-xenial-2-node-osic-cloud1-s3500-7106433, e85b2653-5563-4742-b23f-7ceb15b2032f) ram: 384MB disk: 10240MB io_ops: 0 instances: 0, weight: 2.0] _schedule /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:127


http://logs.openstack.org/75/427675/11/check/gate-tempest-dsvm-ironic-ipa-wholedisk-agent_ipmitool-tinyipa-multinode-ubuntu-xenial-nv/a66f08f/logs/screen-n-sch.txt.gz#_2017-02-03_18_46_35_204

2017-02-03 18:46:35.204 21443 DEBUG nova.scheduler.filter_scheduler [req-3ff28238-f02b-4914-9ee1-dc07a15f6c9d tempest-ServerActionsTestJSON-1934209030 tempest-ServerActionsTestJSON-1934209030] Selected host: WeighedHost [host: (ubuntu-xenial-2-node-osic-cloud1-s3500-7106433, e85b2653-5563-4742-b23f-7ceb15b2032f) ram: 384MB disk: 10240MB io_ops: 0 instances: 0, weight: 2.0] _schedule /opt/stack/new/nova/nova/scheduler/filter_scheduler.py:127
2017-02-03 18:46:35.204 21443 DEBUG oslo_concurrency.lockutils [req-3ff28238-f02b-4914-9ee1-dc07a15f6c9d tempest-ServerActionsTestJSON-1934209030 tempest-ServerActionsTestJSON-1934209030] Lock "(u'ubuntu-xenial-2-node-osic-cloud1-s3500-7106433', u'e85b2653-5563-4742-b23f-7ceb15b2032f')" acquired by "nova.scheduler.host_manager._locked" :: waited 0.000s inner /usr/local/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:270


** Changed in: nova
   Status: Invalid => New

Status in OpenStack Compute (nova):
  New
Status in tripleo:
  New


[Yahoo-eng-team] [Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

2016-04-18 Thread Sylvain Bauza
That bug report is still accurate, but given that we're working towards
new resource providers for the scheduler and actively discussing the
implementation details, I'd by far prefer to close this bug and leave a
spec that better describes the problem.


** Changed in: nova
   Status: In Progress => Invalid

Status in OpenStack Compute (nova):
  Invalid
Status in tripleo:
  New


[Yahoo-eng-team] [Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

2016-03-08 Thread James Slagle
** Also affects: tripleo
   Importance: Undecided
   Status: New

Status in OpenStack Compute (nova):
  In Progress
Status in tripleo:
  New


[Yahoo-eng-team] [Bug 1341420] Re: gap between scheduler selection and claim causes spurious failures when the instance is the last one to fit

2015-06-02 Thread Michael Davies
This is not an Ironic bug. It's just triggered by using Ironic. This is
a Nova scheduling bug, and a PITA to fix :(

** No longer affects: ironic

** Tags removed: ironic

Status in OpenStack Compute (Nova):
  Confirmed
