Hi,

I'm wondering if we might have a race between live migration and the resource audit. I've added a few people to the recipient list who have worked directly with this code in the past.

In _update_available_resource() we have code that looks like this:

instances = objects.InstanceList.get_by_host_and_node()
self._update_usage_from_instances()
migrations = objects.MigrationList.get_in_progress_by_host_and_node()
self._update_usage_from_migrations()


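To make the ordering concrete, here is a stripped-down toy model of the shape of that audit (my own sketch, not the actual resource tracker code; the dict fields like 'pcpus'/'pci_devs' and the lookup_instance callback are invented for illustration). The point is only that the two result sets are consumed in two separate passes, and the second pass trusts whatever state it sees at that later moment:

    # Toy model of the two-pass accounting described above (not Nova code).
    def account_usage(instances, migrations, lookup_instance):
        usage = {'pcpus': 0, 'pci_devs': 0}
        tracked = set()

        # Pass 1: instances reported as already being on this host/node.
        for inst in instances:
            usage['pcpus'] += inst['pcpus']
            usage['pci_devs'] += inst['pci_devs']
            tracked.add(inst['uuid'])

        # Pass 2: in-progress migrations whose instance was not seen above.
        for mig in migrations:
            if mig['instance_uuid'] in tracked:
                continue
            inst = lookup_instance(mig['instance_uuid'])
            # Only counted if the instance still looks like it is migrating;
            # otherwise it is silently skipped and contributes nothing.
            if inst['task_state'] == 'migrating':
                usage['pcpus'] += inst['pcpus']
                usage['pci_devs'] += inst['pci_devs']

        return usage
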
In post_live_migration_at_destination() we do this (updating the host and node as well as the task state):
            instance.host = self.host
            instance.task_state = None
            instance.node = node_name
            instance.save(expected_task_state=task_states.MIGRATING)


And in _post_live_migration() we update the migration status to "completed":
        if migrate_data and migrate_data.get('migration'):
            migrate_data['migration'].status = 'completed'
            migrate_data['migration'].save()


Neither of these latter two routines is serialized by the COMPUTE_RESOURCE_SEMAPHORE, so they can race against the code in _update_available_resource().

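To illustrate why the semaphore by itself doesn't save us: a lock only serializes the code paths that actually take it. Here is a minimal, self-contained Python sketch of that asymmetry (generic threading code, not Nova's; the names are stand-ins):

    import threading

    COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()   # stand-in for the real one
    audit_ready = threading.Event()
    completion_done = threading.Event()
    shared = {'task_state': 'migrating'}

    def audit():
        # Serialized against other holders of the semaphore...
        with COMPUTE_RESOURCE_SEMAPHORE:
            first = shared['task_state']
            # ...but the completion path below runs right here anyway,
            # because it never acquires the lock.
            audit_ready.set()
            completion_done.wait()
            second = shared['task_state']
            print(first, second)            # prints: migrating None

    def completion_path():
        audit_ready.wait()
        shared['task_state'] = None         # updated without taking the lock
        completion_done.set()

    t1 = threading.Thread(target=audit)
    t2 = threading.Thread(target=completion_path)
    t1.start(); t2.start(); t1.join(); t2.join()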

I'm wondering if we can have a situation like this:

1) migration in progress
2) We start running _update_available_resource() on the destination and call instances = objects.InstanceList.get_by_host_and_node(). This will not return the incoming instance, because it is not yet on the destination host.
3) The migration completes and we call post_live_migration_at_destination(), which sets the host/node/task_state on the instance.
4) Back in _update_available_resource() on the destination, we call migrations = objects.MigrationList.get_in_progress_by_host_and_node(). This will return the migration for the instance in question, but when we run self._update_usage_from_migrations() the uuid will not be in "instances", so we fall back to the instance loaded from the newly-queried migration. We then ignore that instance because it is no longer in a "migrating" state, as sketched below.
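
Here is a toy walk-through of that interleaving, driving the account_usage() sketch from earlier in this mail (again, invented names and numbers, purely illustrative):

    # Step 1: a live migration toward this host is in progress.
    inst = {'uuid': 'abc', 'pcpus': 4, 'pci_devs': 1,
            'host': 'src', 'node': 'src', 'task_state': 'migrating'}
    mig = {'instance_uuid': 'abc', 'status': 'running'}

    # Step 2: the audit on the destination queries its instances first.
    # The incoming instance is still recorded on the source, so it's absent.
    instances_seen_by_audit = []

    # Step 3: post_live_migration_at_destination() lands in the window and
    # flips the instance to the destination with task_state cleared.
    inst['host'] = inst['node'] = 'dest'
    inst['task_state'] = None

    # Step 4: the audit now queries migrations and sees this one, but the
    # instance it points at no longer looks like it is migrating.
    migrations_seen_by_audit = [mig]

    usage = account_usage(instances_seen_by_audit,
                          migrations_seen_by_audit,
                          lambda uuid: inst)
    print(usage)    # {'pcpus': 0, 'pci_devs': 0} -- counted by neither pass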

Am I imagining things, or is there a race here? If so, the negative effect would be that the resources of the migrating instance are "lost", allowing a newly-scheduled instance to claim the same resources (PCI devices, pinned CPUs, etc.).

Chris
