On 09/25/2015 04:03 PM, Denis V. Lunev wrote: > On 09/25/2015 04:21 AM, Wen Congyang wrote: >> On 09/24/2015 08:53 PM, Denis V. Lunev wrote: >>> From: Igor Redko <red...@virtuozzo.com> >>> >>> Release qemu global mutex before call synchronize_rcu(). >>> synchronize_rcu() waiting for all readers to finish their critical >>> sections. There is at least one critical section in which we try >>> to get QGM (critical section is in address_space_rw() and >>> prepare_mmio_access() is trying to aquire QGM). >>> >>> Both functions (migration_end() and migration_bitmap_extend()) >>> are called from main thread which is holding QGM. >>> >>> Thus there is a race condition that ends up with deadlock: >>> main thread working thread >>> Lock QGA | >>> | Call KVM_EXIT_IO handler >>> | | >>> | Open rcu reader's critical section >>> Migration cleanup bh | >>> | | >>> synchronize_rcu() is | >>> waiting for readers | >>> | prepare_mmio_access() is waiting for QGM >>> \ / >>> deadlock >>> >>> The patch just releases QGM before calling synchronize_rcu(). >>> >>> Signed-off-by: Igor Redko <red...@virtuozzo.com> >>> Reviewed-by: Anna Melekhova <an...@virtuozzo.com> >>> Signed-off-by: Denis V. Lunev <d...@openvz.org> >>> CC: Juan Quintela <quint...@redhat.com> >>> CC: Amit Shah <amit.s...@redhat.com> >>> --- >>> migration/ram.c | 6 ++++++ >>> 1 file changed, 6 insertions(+) >>> >>> diff --git a/migration/ram.c b/migration/ram.c >>> index 7f007e6..d01febc 100644 >>> --- a/migration/ram.c >>> +++ b/migration/ram.c >>> @@ -1028,12 +1028,16 @@ static void migration_end(void) >>> { >>> /* caller have hold iothread lock or is in a bh, so there is >>> * no writing race against this migration_bitmap >>> + * but rcu used not only for migration_bitmap, so we should >>> + * release QGM or we get in deadlock. >>> */ >>> unsigned long *bitmap = migration_bitmap; >>> atomic_rcu_set(&migration_bitmap, NULL); >>> if (bitmap) { >>> memory_global_dirty_log_stop(); >>> + qemu_mutex_unlock_iothread(); >>> synchronize_rcu(); >>> + qemu_mutex_lock_iothread(); >> migration_end() can called in two cases: >> 1. migration_completed >> 2. migration is cancelled >> >> In case 1, you should not unlock iothread, otherwise, the vm's state may be >> changed >> unexpectedly. > > sorry, but there is now very good choice here. We should either > unlock or not call synchronize_rcu which is also an option. > > In the other case the rework should be much more sufficient.
I don't reproduce this bug. But according to your description, the bug only exists in case 2. Is it right? > > Den > >>> g_free(bitmap); >>> } >>> @@ -1085,7 +1089,9 @@ void migration_bitmap_extend(ram_addr_t old, >>> ram_addr_t new) >>> atomic_rcu_set(&migration_bitmap, bitmap); >>> qemu_mutex_unlock(&migration_bitmap_mutex); >>> migration_dirty_pages += new - old; >>> + qemu_mutex_unlock_iothread(); >>> synchronize_rcu(); >>> + qemu_mutex_lock_iothread(); >> Hmm, I think it is OK to unlock iothread here >> >>> g_free(old_bitmap); >>> } >>> } >>> > > . >