Yes, this is a race. However, it's my understanding that this is 'ok'. The resource tracker doesn't claim to be 100% accurate at all times, right? Otherwise, why would it update itself in a periodic task in the first place? My understanding is that the resource tracker is basically a best-effort cache, and that scheduling decisions can still fail at the host. The resource tracker will fix itself the next time its periodic task runs.
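To illustrate what I mean by "best effort cache", here's a minimal sketch of a self-correcting periodic audit. The names (audit_usage, query_instances_on_host, etc.) are made up for illustration and are not Nova's actual code; in Nova the real thing is _update_available_resource() serialized by COMPUTE_RESOURCE_SEMAPHORE:

    import threading

    _lock = threading.Lock()      # plays the role of COMPUTE_RESOURCE_SEMAPHORE
    cached_usage = {"vcpus": 0}   # the best-effort cache

    def query_instances_on_host():
        # Stand-in for the authoritative DB query; in Nova this is
        # objects.InstanceList.get_by_host_and_node().
        return [{"uuid": "a", "vcpus": 2}, {"uuid": "b", "vcpus": 4}]

    def audit_usage():
        # Periodic task: recompute usage from scratch. Even if a racing
        # operation left the cache temporarily wrong, this run rebuilds it
        # from authoritative data, so any error lasts at most one period.
        with _lock:
            cached_usage["vcpus"] = sum(
                inst["vcpus"] for inst in query_instances_on_host())

    audit_usage()
    print(cached_usage)  # {'vcpus': 6}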
Matt (not a scheduler person)

On Thu, Jun 9, 2016 at 10:41 PM, Chris Friesen <[email protected]> wrote:
> Hi,
>
> I'm wondering if we might have a race between live migration and the
> resource audit. I've included a few people on the receiver list who have
> worked directly with this code in the past.
>
> In _update_available_resource() we have code that looks like this:
>
>     instances = objects.InstanceList.get_by_host_and_node()
>     self._update_usage_from_instances()
>     migrations = objects.MigrationList.get_in_progress_by_host_and_node()
>     self._update_usage_from_migrations()
>
> In post_live_migration_at_destination() we do this (updating the host and
> node as well as the task state):
>
>     instance.host = self.host
>     instance.task_state = None
>     instance.node = node_name
>     instance.save(expected_task_state=task_states.MIGRATING)
>
> And in _post_live_migration() we update the migration status to
> "completed":
>
>     if migrate_data and migrate_data.get('migration'):
>         migrate_data['migration'].status = 'completed'
>         migrate_data['migration'].save()
>
> Neither of the latter two routines is serialized by
> COMPUTE_RESOURCE_SEMAPHORE, so they can race relative to the code in
> _update_available_resource().
>
> I'm wondering if we can have a situation like this:
>
> 1) A migration is in progress.
> 2) We start running _update_available_resource() on the destination, and
> we call instances = objects.InstanceList.get_by_host_and_node(). This
> will not return the migrating instance, because it is not yet on the
> destination host.
> 3) The migration completes and we call
> post_live_migration_at_destination(), which sets the host/node/task_state
> on the instance.
> 4) In _update_available_resource() on the destination, we call migrations
> = objects.MigrationList.get_in_progress_by_host_and_node(). This will
> return the migration for the instance in question, but when we run
> self._update_usage_from_migrations() the uuid will not be in "instances",
> so we will use the instance from the newly-queried migration. We will
> then ignore the instance because it is no longer in a "migrating" state.
>
> Am I imagining things, or is there a race here? If so, the negative
> effect would be that the resources of the migrating instance would be
> "lost", allowing a newly-scheduled instance to claim the same resources
> (PCI devices, pinned CPUs, etc.)
>
> Chris

--
Matthew Booth
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
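For concreteness, here is a condensed, runnable reconstruction of the four-step interleaving Chris describes above. The dicts and helper functions are hypothetical stand-ins for Nova's objects and resource-tracker methods, not the actual implementation:

    # Hypothetical stand-ins; not Nova's actual objects or API.
    instance = {"uuid": "i-1", "host": "src", "task_state": "migrating",
                "vcpus": 4}
    migration = {"instance_uuid": "i-1", "status": "running",
                 "dest_compute": "dest"}

    def get_instances_on(host):
        # Stand-in for objects.InstanceList.get_by_host_and_node().
        return [instance] if instance["host"] == host else []

    def get_in_progress_migrations_on(host):
        # Stand-in for objects.MigrationList.get_in_progress_by_host_and_node().
        return [migration] if migration["dest_compute"] == host else []

    def update_usage_from_migrations(instances_by_uuid, migrations):
        # Stand-in for self._update_usage_from_migrations().
        usage = 0
        for mig in migrations:
            # The uuid is not in "instances", so fall back to the
            # newly-queried (already updated) instance record ...
            inst = instances_by_uuid.get(mig["instance_uuid"], instance)
            # ... which is then ignored because task_state was cleared.
            if inst["task_state"] == "migrating":
                usage += inst["vcpus"]
        return usage

    # Step 2: the audit on the destination queries instances -> empty.
    instances = get_instances_on("dest")

    # Step 3: post_live_migration_at_destination() races in.
    instance["host"] = "dest"
    instance["task_state"] = None

    # Step 4: the audit processes in-progress migrations.
    migrations = get_in_progress_migrations_on("dest")
    usage = update_usage_from_migrations(
        {i["uuid"]: i for i in instances}, migrations)
    print(usage)  # 0 -- the instance's 4 vCPUs are "lost" on the destination

Until the next audit runs (by which point the instance's host is the destination, so InstanceList.get_by_host_and_node() returns it), the scheduler could hand the same pinned CPUs or PCI devices to a new instance; that window is exactly what the "best effort cache" argument accepts.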
