[Yahoo-eng-team] [Bug 1721843] [NEW] Unversioned notifications not being sent.
Public bug reported:

Description
===========
After a VM moves from state 'building' to 'error', an unversioned notification is no longer sent when CONF.notifications.notification_format is set to 'unversioned'.

Steps to reproduce
==================
In nova.conf set:

    [notifications]
    notification_format = unversioned

Set up the environment so that the VM deploy fails. To reproduce easily in my environment I raised a generic Exception just after the call to spawn in orchestrator's start_deploy_simple().
Attempt to deploy a VM. Wait for the deploy to fail.

Expected result
===============
When the vm_state changes to 'error', an unversioned notification should be sent.

Actual result
=============
The unversioned notification is not sent.

Environment
===========
(pike)nova-compute/now 10:16.0.0-201710030907

Additional Info:
The problem seems to stem from this change:
https://github.com/openstack/nova/commit/29cb8f1c459e6d23dd9303fb570cee773d9c4d02
at:

    if (NOTIFIER.is_enabled() and
            CONF.notifications.notification_format in ('both', 'versioned')):

Because 'unversioned' is not in the list, the @rpc.if_notifications_enabled decorator causes send_instance_update_notification() as well as _send_versioned_instance_update() to effectively be skipped.

The name of the decorator and the comment describing its functionality make it hard to determine its precise intended purpose. The decorator name implies it checks whether notifications are enabled at all, while the comment states that it specifically checks whether versioned notifications are enabled, which is in fact what it appears to do. Since the decorator is applied to send_instance_update_notification(), it effectively blocks unversioned notifications whenever versioned notifications are not enabled.

** Affects: nova
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1721843

Title:
  Unversioned notifications not being sent.

Status in OpenStack Compute (nova):
  New
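(Editor's note) To make the gate described above concrete, here is a minimal, self-contained sketch. It uses stand-in names for NOTIFIER and CONF rather than the real nova objects, and is only an illustration of why the legacy sender is skipped, not the upstream code or fix:

    import functools

    # Stand-ins for nova's NOTIFIER.is_enabled() and
    # CONF.notifications.notification_format (illustrative only).
    NOTIFIER_ENABLED = True
    NOTIFICATION_FORMAT = 'unversioned'

    def if_notifications_enabled(func):
        """Mimics the decorator check quoted from the referenced commit."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # 'unversioned' is missing from this tuple, so the wrapped sender
            # is skipped entirely when notification_format = 'unversioned'.
            if NOTIFIER_ENABLED and NOTIFICATION_FORMAT in ('both', 'versioned'):
                return func(*args, **kwargs)
        return wrapper

    @if_notifications_enabled
    def send_instance_update_notification():
        print('legacy instance.update notification sent')

    send_instance_update_notification()  # prints nothing -> the behaviour reported here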
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1721843/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

[Yahoo-eng-team] [Bug 1721652] [NEW] Evacuate cleanup fails at _delete_allocation_for_moved_instance
Public bug reported:

Description
===========
After an evacuation, when nova-compute is restarted on the source host, the cleanup of the old instance on the source host fails. The traceback in nova-compute.log ends with:

    2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 679, in _destroy_evacuated_instances
    2017-10-04 05:32:18.725 5575 ERROR oslo_service.service     instance, migration.source_node)
    2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1216, in delete_allocation_for_evacuated_instance
    2017-10-04 05:32:18.725 5575 ERROR oslo_service.service     instance, node, 'evacuated', node_type)
    2017-10-04 05:32:18.725 5575 ERROR oslo_service.service   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1227, in _delete_allocation_for_moved_instance
    2017-10-04 05:32:18.725 5575 ERROR oslo_service.service     cn_uuid = self.compute_nodes[node].uuid
    2017-10-04 05:32:18.725 5575 ERROR oslo_service.service KeyError: u''
    2017-10-04 05:32:18.725 5575 ERROR oslo_service.service

Steps to reproduce
==================
Deploy an instance on Host A.
Shut down Host A.
Evacuate the instance to Host B.
Turn Host A back on.
Wait for cleanup of the old instance allocation to occur.

Expected result
===============
Cleanup of the old instance from Host A is successful.

Actual result
=============
The old instance cleanup appears to work, but there is a traceback in the log and the allocation is not cleaned up.

Environment
===========
(pike)nova-compute/now 10:16.0.0-201710030907

Additional Info:
The problem seems to come from this change:
https://github.com/openstack/nova/commit/0de806684f5d670dd5f961f7adf212961da3ed87
at:

    rt = self._get_resource_tracker()
    rt.delete_allocation_for_evacuated_instance

That is called very early in the init_host flow to clean up the allocations. The problem is that at this point in startup the resource tracker's self.compute_nodes has not been populated yet. That makes delete_allocation_for_evacuated_instance() blow up with a KeyError at:

    cn_uuid = self.compute_nodes[node].uuid

The resource tracker's self.compute_nodes is actually initialized later in the startup process via update_available_resources() -> _update_available_resources() -> _init_compute_node(). It is not initialized when the tracker is first created, which appears to be the assumption made by the referenced commit.

** Affects: nova
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1721652

Title:
  Evacuate cleanup fails at _delete_allocation_for_moved_instance

Status in OpenStack Compute (nova):
  New
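(Editor's note) A simplified stand-in for the tracker state described above, assuming the node map is still empty at init_host time; the .get() fallback is only a hypothetical defensive variant, not the real nova code or the upstream fix:

    class FakeResourceTracker:
        """Toy model of the resource tracker state at nova-compute startup."""

        def __init__(self):
            # Populated later by update_available_resource() -> _init_compute_node();
            # still empty when _destroy_evacuated_instances() runs during init_host.
            self.compute_nodes = {}

        def delete_allocation_for_moved_instance(self, node):
            # Behaviour per the traceback: KeyError when the node is not
            # in compute_nodes yet:
            #   cn_uuid = self.compute_nodes[node].uuid

            # Hypothetical defensive variant: look the node up first and fall
            # back (e.g. to a DB/placement lookup or a warning) when missing.
            cn = self.compute_nodes.get(node)
            if cn is None:
                print('compute node %r not tracked yet; skipping cleanup' % node)
                return
            return cn.uuid

    FakeResourceTracker().delete_allocation_for_moved_instance(u'')  # exercises the failing path
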
[Yahoo-eng-team] [Bug 1682621] [NEW] http 404 instead of 403 for role with read but not write access
Public bug reported:

While attempting DELETE on /networks and /routers I found I was getting an HTTPNotFound error instead of PolicyNotAuthorized. The role I'm testing has read access: GET calls for the network or router in question are successful. Since the user has GET ability, I'd expect a more accurate error when attempting a DELETE.

Steps to reproduce
1. Create a neutron network and/or router.
2. Set a user to have a role whose policy allows get_network and get_router but not delete_network or delete_router.
3. Confirm GET calls from the user for /networks and /routers are successful.
4. Attempt a DELETE call on the network created in step 1.

Expected results:
1. The DELETE call is unsuccessful and returns HTTP 403 with PolicyNotAuthorized (or equivalent).

Actual results:
1. The DELETE call is unsuccessful and returns HTTP 404 HTTPNotFound.

The issue was discovered using an early April ocata build, but has likely existed for a while.

Note: There may be other endpoints where this occurs. So far I've only noticed it in the two mentioned, but I have not searched extensively. I see the try/except with oslo_policy.PolicyNotAuthorized in several places.

** Affects: neutron
   Importance: Undecided
   Assignee: Matthew Edmonds (edmondsw)
   Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1682621

Title:
  http 404 instead of 403 for role with read but not write access

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1682621/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
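(Editor's note) The try/except pattern the reporter refers to, and one possible refinement, can be sketched as follows. This is illustrative only: the enforce callable and handler are simplified stand-ins, not the actual neutron controller code, and the refinement is merely a hypothetical way to return 403 when the caller can already read the resource:

    from oslo_policy import policy as oslo_policy
    import webob.exc

    def delete_network(context, network, enforce):
        """enforce(context, action, target) raises PolicyNotAuthorized on denial."""
        try:
            enforce(context, 'delete_network', network)
        except oslo_policy.PolicyNotAuthorized:
            # Reported behaviour: the denial is masked as a 404 so callers
            # cannot probe for resource existence.
            # Hypothetical refinement: only hide the resource when the caller
            # cannot even GET it; otherwise return the accurate 403.
            try:
                enforce(context, 'get_network', network)
            except oslo_policy.PolicyNotAuthorized:
                raise webob.exc.HTTPNotFound()
            raise webob.exc.HTTPForbidden(
                explanation='delete_network is not authorized for this role')
        # ... perform the actual delete ...
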
[Yahoo-eng-team] [Bug 1674003] [NEW] Race condition cause instance system_updates to be lost
Public bug reported:

We ran into an issue using a March ocata build. We have some system_metadata that we need to save very early in a VM's life. Previously we did this during scheduling. After the switch to cells v2, we now listen for the compute.instance.create.start notification and add the key to the instance's system_metadata then. The problem is that, because of how nova.objects.Instance.save() works when saving metadata, there is a race condition that causes some of the system_metadata to be lost.

Basic setup of the instance.save() problem:

    test_uuid =
    inst_ref_1 = nova.objects.Instance.get_by_uuid(context, test_uuid)
    inst_ref_2 = nova.objects.Instance.get_by_uuid(context, test_uuid)
    inst_ref_1.system_metadata.update({'key1': 'val1'})
    inst_ref_2.system_metadata.update({'key2': 'val2'})

(Note: You need to read or update inst_ref_2.system_metadata at least once before calling inst_ref_1.save() the first time; otherwise the lazy load on inst_ref_2.system_metadata will pick up inst_ref_1's change and hide the issue.)

    inst_ref_1.save()

(Note: you can check the db before the next save to confirm the first save worked)

    inst_ref_2.save()

Afterward, nova.objects.Instance.get_by_uuid(context, test_uuid).system_metadata returns {'key2': 'val2'} instead of the desired {'key1': 'val1', 'key2': 'val2'}. Watching the db also shows that key1 was present after inst_ref_1.save() but was then removed and replaced with key2 after inst_ref_2.save().

The issue is the flow of Instance.save(). It eventually calls nova.db.sqlalchemy.api._instance_metadata_update_in_place(). That method assumes that if a key is found in the db but is not in the passed metadata dict, it should delete the key from the db. So in the example above, because the inst_ref_2.system_metadata dictionary does not have the key added by inst_ref_1.save(), the inst_ref_2.save() deletes the entry added by inst_ref_1.save().

Issue this creates:
nova.compute.manager._build_and_run_instance() starts by sending the compute.instance.create.start notification. Immediately after that, a recent unrelated change (https://github.com/openstack/nova/commit/6d8b58dc6f1cbda8d664b3487674f87049491c74) calls instance.system_metadata.update({'boot_roles': ','.join(context.roles)}). The first instance.save() in _build_and_run_instance() is called as a side effect of 'with rt.instance_claim(context, instance, node, limits)'. (FWIW it is also called again very shortly after that in _build_and_run_instance() itself when vm_state and task_state are set.) This creates the race condition mentioned at the top: our listener gets the compute.instance.create.start notification and also attempts to update the instance's system_metadata. The listener has to create its own reference to the same instance, so depending on which instance reference's save() is called first (the one in our listener or the one from _build_and_run_instance()), one of the updates to system_metadata gets lost.

Expected result: Independent Instance.save() calls don't wipe out non-conflicting key changes made through other references.

Actual result: They do.

** Affects: nova
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1674003

Title:
  Race condition cause instance system_updates to be lost

Status in OpenStack Compute (nova):
  New
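(Editor's note) A simplified model of the update-in-place behaviour described above (not the real _instance_metadata_update_in_place, which operates on SQLAlchemy rows); it shows why the second save deletes the key written by the first:

    def metadata_update_in_place(db_rows, new_metadata):
        """db_rows: current {key: value} state in the DB; new_metadata: dict being saved."""
        for key in list(db_rows):
            if key not in new_metadata:
                # Any key the saving reference never saw is treated as deleted.
                del db_rows[key]
        db_rows.update(new_metadata)

    db = {}                                            # system_metadata rows for the instance
    ref1_view = dict(db); ref1_view['key1'] = 'val1'   # inst_ref_1's view
    ref2_view = dict(db); ref2_view['key2'] = 'val2'   # inst_ref_2's view

    metadata_update_in_place(db, ref1_view)   # db == {'key1': 'val1'}
    metadata_update_in_place(db, ref2_view)   # db == {'key2': 'val2'} -- key1 is lost
    print(db)
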
[Yahoo-eng-team] [Bug 1671648] [NEW] Instances are not rescheduled after deploy fails
Public bug reported:

Steps to reproduce:
Pre-step: force the deploy to fail in such a way that it can be rescheduled. For testing I forced it to fail by adding raise nova.exception.ComputeResourcesUnavailable('forced failure') during the instance spawn on the host.
1. Make sure the environment is set to retry failed deploys.
2. Attempt to deploy a VM and wait for it to fail.

Expected result: The failed instance is rescheduled and attempted on another host.

Actual result: The deploy fails but is not rescheduled.

I am just beginning to experiment with an ocata build from early March. I found that when an instance fails to deploy and throws a RescheduledException, it is not getting rescheduled as expected. The problem appears to be that filter_properties['retry'] is not getting set during the initial deploy.

On initial deploy, nova.conductor.manager.schedule_and_build_instances() schedules the build_request and creates the instance object. That method also creates the filter properties (filter_props) that are passed on to compute_rpcapi.build_and_run_instance(). The problem is that scheduler_utils.populate_retry() is not called before filter_props is passed on to the build call. When the deploy later fails on the host, nova.compute.manager._do_build_and_run_instance() catches the RescheduledException but does not try to reschedule it because filter_properties.get('retry') returns None.

In the past it looks like populate_retry() was called by nova.conductor.manager.build_instances() during the initial deploy. I'm not seeing build_instances() get called during the initial deploy after switching to ocata, though I did notice it gets called for each attempt after the first.

As an experiment I added scheduler_utils.populate_retry(filter_props, build_request.instance_uuid) immediately after filter_props is set in schedule_and_build_instances(). Afterward I do see the instances get rescheduled.

** Affects: nova
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1671648

Title:
  Instances are not rescheduled after deploy fails

Status in OpenStack Compute (nova):
  New
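(Editor's note) A cut-down sketch of what populate_retry() contributes and what the compute manager later checks. The helper below is a simplified stand-in (the real nova.scheduler.utils.populate_retry reads the max attempts from config and raises once the limit is exceeded); it only illustrates why filter_properties.get('retry') is None when the call is skipped in schedule_and_build_instances():

    def populate_retry(filter_properties, instance_uuid, max_attempts=3):
        """Cut-down stand-in for nova.scheduler.utils.populate_retry()."""
        if max_attempts == 1:
            return  # retries disabled
        retry = filter_properties.setdefault('retry',
                                             {'num_attempts': 0, 'hosts': []})
        retry['num_attempts'] += 1

    filter_props = {}                          # as built in schedule_and_build_instances()

    # Without the call below, _do_build_and_run_instance() sees
    # filter_properties.get('retry') is None and swallows the RescheduledException.
    populate_retry(filter_props, 'fake-instance-uuid')
    print(filter_props.get('retry'))           # {'num_attempts': 1, 'hosts': []}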
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1671648/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp

[Yahoo-eng-team] [Bug 1606269] [NEW] Incorrect flavor in request_spec on resize
Public bug reported:

On resizes, the RequestSpec object sent to the scheduler contains the instance's original flavor. This is causing an issue in our scheduler because it does not see a new flavor for the resize task.

The issue appears to be that at
https://github.com/openstack/nova/blob/76dfb4ba9fa0fed1350021591956c4e8143b1ce9/nova/conductor/tasks/migrate.py#L52
the RequestSpec is hydrated with self.instance.flavor rather than the new flavor, which is self.flavor.

The issue was discovered in newton nova and appeared after
https://github.com/openstack/nova/commit/76dfb4ba9fa0fed1350021591956c4e8143b1ce9?diff=split#diff-b839034e35c154b8c3a1c65bf7791eefL42

** Affects: nova
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1606269

Title:
  Incorrect flavor in request_spec on resize

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1606269/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
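(Editor's note) A minimal illustration of the mix-up using stand-in names (not the real migrate.py code): the migration task holds both the instance's current flavor and the target flavor, and per this report the request spec sent to the scheduler should be hydrated with the latter:

    class MigrationTask:
        """Toy stand-in for the conductor migrate task in the referenced commit."""

        def __init__(self, instance, new_flavor):
            self.instance = instance    # instance.flavor is the original flavor
            self.flavor = new_flavor    # flavor the resize is moving to

        def flavor_for_request_spec(self):
            # Reported behaviour: the spec is hydrated with self.instance.flavor,
            # so the scheduler never sees the resize target.
            # Expected behaviour (per this report): use the new flavor instead.
            return self.flavor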