[Yahoo-eng-team] [Bug 1721843] [NEW] Unversioned notifications not being sent.

2017-10-06 Thread Charles Volzka
Public bug reported:

Description
===
After a VM moves from state 'building' to 'error', an unversioned notification
is no longer sent if CONF.notifications.notification_format is set to
'unversioned'.

Steps to reproduce
==
In nova.conf set
[notifications]
notification_format = unversioned

Set up the environment so the VM deploy fails.
To reproduce easily in my environment, I raised a generic Exception just after
the call to spawn in the orchestrator's start_deploy_simple().
Attempt to deploy a VM.
Wait for the deploy to fail.

Expected result
===
When the vm_state changes to 'error' an unversioned notification should be sent.

Actual result
=
The unversioned notification is not sent.

Environment
===
(pike)nova-compute/now 10:16.0.0-201710030907


Additional Info:

Problem seems to stem from this change:
https://github.com/openstack/nova/commit/29cb8f1c459e6d23dd9303fb570cee773d9c4d02
at:

if (NOTIFIER.is_enabled() and
        CONF.notifications.notification_format in ('both',
                                                    'versioned')):
Because 'unversioned' is not in the list, the @rpc.if_notifications_enabled
decorator causes send_instance_update_notification(), as well as
_send_versioned_instance_update(), to effectively be skipped. The name of the
decorator and the comment describing its functionality make it hard to
determine its precise intended purpose. The decorator name implies it checks
whether notifications are enabled at all, while the comment states it
specifically checks whether versioned notifications are enabled, which is in
fact what it does. Since the decorator was applied to
send_instance_update_notification(), it effectively blocks unversioned
notifications whenever versioned notifications are not enabled.
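
To make the gating concrete, here is a minimal self-contained sketch of the
pattern described above; the decorator body is paraphrased and the sender
function is a stand-in, not the actual nova code:

notification_format = 'unversioned'  # stand-in for CONF.notifications.notification_format

def if_notifications_enabled(func):
    # Mirrors the quoted check: only 'both' and 'versioned' pass, so with
    # 'unversioned' the wrapped sender is silently skipped.
    def wrapper(*args, **kwargs):
        if notification_format in ('both', 'versioned'):
            return func(*args, **kwargs)
        return None
    return wrapper

@if_notifications_enabled
def send_instance_update_notification(payload):
    # In nova this would emit the legacy (unversioned) notification and then
    # call _send_versioned_instance_update(); here it just prints.
    print('notification sent:', payload)

send_instance_update_notification({'vm_state': 'error'})  # prints nothing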

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1721843

Title:
  Unversioned notifications not being sent.

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1721843/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1721652] [NEW] Evacuate cleanup fails at _delete_allocation_for_moved_instance

2017-10-05 Thread Charles Volzka
Public bug reported:

Description
===
After an evacuation, when nova-compute is restarted on the source host, cleanup
of the old instance on the source host fails. The traceback in
nova-compute.log ends with:
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 679, in _destroy_evacuated_instances
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service instance, migration.source_node)
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1216, in delete_allocation_for_evacuated_instance
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service instance, node, 'evacuated', node_type)
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1227, in _delete_allocation_for_moved_instance
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service cn_uuid = self.compute_nodes[node].uuid
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service KeyError: u''
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service


Steps to reproduce
==
Deploy instance on Host A.
Shut down Host A.
Evacuate instance to Host B.
Turn back on Host A.
Wait for cleanup of old instance allocation to occur

Expected result
===
Clean up of old instance from Host A is successful

Actual result
=
Old instance cleanup appears to work, but there's a traceback in the log and
the allocation is not cleaned up.

Environment
===
(pike)nova-compute/now 10:16.0.0-201710030907


Additional Info:

Problem seems to come from this change:
https://github.com/openstack/nova/commit/0de806684f5d670dd5f961f7adf212961da3ed87
at:

rt = self._get_resource_tracker()
rt.delete_allocation_for_evacuated_instance

That is called very early in the init_host flow to clean up the allocations. The
problem is that at this point in the startup the resource tracker's
self.compute_nodes dict is still empty. That makes
delete_allocation_for_evacuated_instance() blow up with a KeyError at:

cn_uuid = self.compute_nodes[node].uuid

The resource tracker's self.compute_nodes is actually populated later in the
startup process via update_available_resource() ->
_update_available_resource() -> _init_compute_node(). It isn't populated when
the tracker is first created, which appears to be the assumption made by
the referenced commit.
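
A tiny self-contained reproduction of that ordering problem (simplified
stand-ins, not the actual ResourceTracker class):

class ResourceTracker:
    def __init__(self):
        # Empty at construction time; in nova it is only filled in later via
        # update_available_resource() -> _init_compute_node().
        self.compute_nodes = {}

    def delete_allocation_for_evacuated_instance(self, node):
        # Raises KeyError when called from init_host before the tracker has
        # recorded the node, matching the traceback above.
        return self.compute_nodes[node].uuid

rt = ResourceTracker()
try:
    rt.delete_allocation_for_evacuated_instance('source-node')
except KeyError as exc:
    print('KeyError:', exc)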

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1721652

Title:
  Evacuate cleanup fails at _delete_allocation_for_moved_instance

Status in OpenStack Compute (nova):
  New


[Yahoo-eng-team] [Bug 1682621] [NEW] http 404 instead of 403 for role with read but not write access

2017-04-13 Thread Charles Volzka
Public bug reported:

While attempting DELETE on /networks and /routers I found I was getting
an HTTPNotFound error instead of PolicyNotAuthorized. The role I'm
testing has read access. Calls to GET for the network or router in
question are successful. Since the user has GET ability, I'd expect a
more accurate error when attempting a DELETE.

Steps to reproduce
1. Create a neutron network and/or router
2. Set a user to have a role whose policy allows get_network and get_router but 
not delete_network or delete_router ability
3. Confirm GET calls from the user for /network and /router are successful
4. Attempt DELETE call on the network created in step 1.

Expected Results:
1. DELETE call is unsuccessful and returns http 403 and PolicyNotAuthorized (or 
equivalent)

Actual Results:
1. DELETE call is unsuccessful and returns http 404 HTTPNotFound

Issue discovered using an early April Ocata build, but it has likely existed
for a while.
Note: There may be other endpoints where this occurs. So far I've only noticed
it in the two mentioned, but I have not searched extensively. I see the
try/except with oslo_policy.PolicyNotAuthorized in several places.
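
For illustration, a self-contained sketch of the handler pattern being
described; the exception classes and the policy check are stand-ins, not the
neutron/oslo.policy code:

class PolicyNotAuthorized(Exception):
    pass

class HTTPNotFound(Exception):
    pass

def enforce(action, allowed_actions):
    # Fake policy check: the role only carries the actions it is allowed.
    if action not in allowed_actions:
        raise PolicyNotAuthorized(action)

def delete_network(allowed_actions, network_id):
    try:
        enforce('delete_network', allowed_actions)
    except PolicyNotAuthorized:
        # The reported behaviour: the policy failure is surfaced as 404 even
        # though the same role can GET the resource, where a 403 would be the
        # more accurate response.
        raise HTTPNotFound(network_id)

try:
    delete_network({'get_network'}, 'net-1')
except HTTPNotFound as exc:
    print('got 404 (HTTPNotFound) instead of 403 for', exc)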

** Affects: neutron
 Importance: Undecided
 Assignee: Matthew Edmonds (edmondsw)
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1682621

Title:
  http 404 instead of 403 for role with read but not write access

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1682621/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1674003] [NEW] Race condition cause instance system_updates to be lost

2017-03-18 Thread Charles Volzka
Public bug reported:

We ran into an issue using a March Ocata build. We have some
system_metadata that we need to save very early in a VM's life.
Previously we did this during scheduling. After the switch to cells v2,
we now listen for the compute.instance.create.start notification and add
the key to the instance's system_metadata then. The problem is that,
because of how nova.objects.Instance.save() works when saving metadata,
there is a race condition that causes some of the system_metadata to be
lost.


Basic setup of the instance.save() problem:
test_uuid = 
inst_ref_1 = nova.objects.Instance.get_by_uuid(context, test_uuid)
inst_ref_2 = nova.objects.Instance.get_by_uuid(context, test_uuid)

inst_ref_1.system_metadata.update({'key1': 'val1'})
inst_ref_2.system_metadata.update({'key2': 'val2'})
(Note: You need to read or update inst_ref_2.system_metadata at least once
before calling inst_ref_1.save() the first time; otherwise the lazy load on
inst_ref_2.system_metadata will pick up inst_ref_1's change and hide the issue.)
inst_ref_1.save()
(Note: you can check the db before the next save to confirm the first save worked.)
inst_ref_2.save()

Afterward, nova.objects.Instance.get_by_uuid(context,
test_uuid).system_metadata returns {'key2': 'val2'} instead of the desired
{'key1': 'val1', 'key2': 'val2'}.
Watching the db also shows that key1 was present after inst_ref_1.save()
but was then removed and replaced with key2 after inst_ref_2.save().

The issue is the flow of Instance.save(). It eventually calls
nova.db.sqlalchemy.api._instance_metadata_update_in_place(). That method
assumes that if a key is found in the db but is not in the passed metadata
dict, it should delete the key from the db. So in the example above, because
the inst_ref_2.system_metadata dictionary does not have the key added by
inst_ref_1.save(), the inst_ref_2.save() deletes the entry added by
inst_ref_1.save().
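
A toy reproduction of that in-place update behaviour, using plain dicts
instead of the Instance object and the db layer:

db_system_metadata = {}  # stands in for the instance's system_metadata rows

def save(view):
    # Mimics the described _instance_metadata_update_in_place() behaviour:
    # any db key missing from the caller's view is deleted, the rest upserted.
    for key in list(db_system_metadata):
        if key not in view:
            del db_system_metadata[key]
    db_system_metadata.update(view)

inst_ref_1 = dict(db_system_metadata)  # two independent views of the instance
inst_ref_2 = dict(db_system_metadata)

inst_ref_1['key1'] = 'val1'
inst_ref_2['key2'] = 'val2'

save(inst_ref_1)
print(db_system_metadata)  # {'key1': 'val1'}
save(inst_ref_2)
print(db_system_metadata)  # {'key2': 'val2'} -- key1 has been lost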


Issue this creates:
nova.compute.manager._build_and_run_instance() starts by sending the
compute.instance.create.start notification. Immediately after that, a recent
unrelated change
(https://github.com/openstack/nova/commit/6d8b58dc6f1cbda8d664b3487674f87049491c74)
calls instance.system_metadata.update({'boot_roles': ','.join(context.roles)}).
The first instance.save() in _build_and_run_instance() is called as a side
effect of 'with rt.instance_claim(context, instance, node, limits)'. (FWIW,
it's also called again very shortly after that in _build_and_run_instance()
itself when vm_state and task_state are set.)

This creates the race condition mentioned at the top. Our listener, which
gets the compute.instance.create.start notification, is also attempting to
update the instance's system_metadata. The listener has to create its
own reference to the same instance, so depending on which instance
reference's save() is called first (the one in our listener or the one
from _build_and_run_instance()), one of the updates to system_metadata
gets lost.

Expected result:
Independent Instance.save() calls do not wipe out each other's non-conflicting
key changes.

Actual result:
They do.

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1674003

Title:
  Race condition cause instance system_updates to be lost

Status in OpenStack Compute (nova):
  New


[Yahoo-eng-team] [Bug 1671648] [NEW] Instances are not rescheduled after deploy fails

2017-03-09 Thread Charles Volzka
Public bug reported:

Steps to reproduce:
Pre-step. Need to force the deploy to fail in such a way that it can be 
rescheduled. For testing I just forced it to fail by adding raise 
nova.exception.ComputeResourcesUnavailable('forced failure') during the 
instance spawn on the host.
1. Make sure environment is set to retry failed deploys.
2. Attempt to deploy VM and wait for it to fail.

Expected result:
Failed instance is rescheduled and attempted on another host.

Actual result:
Deploy fails but is not rescheduled.


I am just beginning to experiment with an Ocata build from early March. I
found that when an instance fails to deploy and throws a
RescheduledException, it is not getting rescheduled as expected. The
problem appears to be that filter_properties['retry'] is not getting
set during the initial deploy.

On initial deploy, nova.conductor.manager.schedule_and_build_instances()
schedules the build_request and creates the instance object. That method
also creates the filter properties (filter_props) that are passed on to
compute_rpcapi.build_and_run_instance(). The problem is that
scheduler_utils.populate_retry() is not called before filter_props
is passed on to the build call. When the deploy later fails on the host,
nova.compute.manager._do_build_and_run_instance() catches the
RescheduledException but does not try to reschedule it because
filter_properties.get('retry') returns None.

In the past it looks like populate_retry() was called by
nova.conductor.manager.build_instances() during the initial deploy. I'm
not seeing build_instances() get called during the initial deploy after
switching to Ocata. As an experiment I added
scheduler_utils.populate_retry(filter_props,
build_request.instance_uuid) immediately after filter_props is set in
schedule_and_build_instances(). Afterward I do see the instances get
rescheduled. I also noticed nova.conductor.manager.build_instances()
gets called for each attempt after the first.
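
The retry gate itself can be shown with a small self-contained sketch;
populate_retry() and the exception handling below are simplified stand-ins for
the nova code paths named above, not the real implementations:

class RescheduledException(Exception):
    pass

def populate_retry(filter_properties, instance_uuid):
    # Stand-in for scheduler_utils.populate_retry(): record retry bookkeeping.
    filter_properties['retry'] = {'num_attempts': 1, 'hosts': []}

def do_build_and_run_instance(filter_properties):
    try:
        raise RescheduledException('forced failure')  # the deploy fails
    except RescheduledException:
        retry = filter_properties.get('retry')
        if not retry:
            # No retry info was populated, so the failure is not rescheduled.
            return 'not rescheduled'
        return 'rescheduled (attempt %d)' % (retry['num_attempts'] + 1)

print(do_build_and_run_instance({}))            # reported behaviour: 'not rescheduled'

filter_props = {}
populate_retry(filter_props, 'fake-instance-uuid')
print(do_build_and_run_instance(filter_props))  # with the experiment: 'rescheduled (attempt 2)'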

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1671648

Title:
  Instances are not rescheduled after deploy fails

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1671648/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1606269] [NEW] Incorrect flavor in request_spec on resize

2016-07-25 Thread Charles Volzka
Public bug reported:

On resizes, the RequestSpec object sent to the scheduler contains the
instance's original flavor. This is causing an issue in our scheduler
because it is not seeing the new flavor for the resize task.

The issue appears to be that at
https://github.com/openstack/nova/blob/76dfb4ba9fa0fed1350021591956c4e8143b1ce9/nova/conductor/tasks/migrate.py#L52
the RequestSpec is hydrated with self.instance.flavor rather than the new
flavor, which is self.flavor.

Issue discovered in Newton nova; it appeared after
https://github.com/openstack/nova/commit/76dfb4ba9fa0fed1350021591956c4e8143b1ce9?diff=split#diff-b839034e35c154b8c3a1c65bf7791eefL42
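
A tiny illustration of the mix-up, using stand-in classes rather than the
actual MigrationTask/RequestSpec objects:

class Flavor:
    def __init__(self, name):
        self.name = name

class MigrationTask:
    def __init__(self, current_flavor, new_flavor):
        self.instance_flavor = current_flavor  # stands in for self.instance.flavor
        self.flavor = new_flavor               # the flavor requested for the resize

    def build_request_spec(self):
        # Reported behaviour: the spec is hydrated from the instance's
        # current flavor, so the scheduler never sees the resize target.
        return {'flavor': self.instance_flavor.name}

task = MigrationTask(Flavor('m1.small'), Flavor('m1.large'))
print(task.build_request_spec())  # {'flavor': 'm1.small'} -- the old flavor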

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1606269

Title:
  Incorrect flavor in request_spec on resize

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1606269/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp