On Wed, 18 May 2016 13:42:08 +0200 Kevin Wolf <kw...@redhat.com> wrote:
> Am 18.05.2016 um 13:13 hat Dr. David Alan Gilbert geschrieben: > > * Greg Kurz (gk...@linux.vnet.ibm.com) wrote: > > > We currently have an error path during migration that can cause > > > the source QEMU to abort: > > > > Hmm, wasn't there something similar recently, sorry I can't remember the > > details, but > > cc'ing some block people who might remember. > > Not sure, but in any case calling bdrv_invalidate_cache_all() in the > error case sounds right to me. > > I think I would have tried to write the patch a bit differently so that > it uses the normal error handling patterns, but that doesn't make it > wrong. Specifically, checking the error condition twice in two different > places seems strange; normally you check it once and then goto fail, > where the cleanup is done (which could include a call to > bdrv_invalidate_cache_all()). > I have coded it this way because it is obvious we're undoing the call to bdrv_inactivate_all() a few lines above. And also, in the postcopy case, migration_thread() doesn't call vm_start() when migration fails to complete: bdrv_invalidate_cache_all() will be called if the guest gets restarted via QMP. This being said, I agree that the resulting code looks weird. I guess it is possible to add a fail_invalidate: label to do the cleanup in the non postcopy case. I'll send a v2. > Kevin > > > Dave > > > > > > migration_thread() > > > migration_completion() > > > runstate_is_running() ----------------> true if guest is running > > > bdrv_inactivate_all() ----------------> inactivate images > > > qemu_savevm_state_complete_precopy() > > > ... qemu_fflush() > > > socket_writev_buffer() --------> error because destination > > > fails > > > qemu_fflush() -------------------> set error on migration stream > > > migration_completion() -----------------> set migrate state to FAILED > > > migration_thread() -----------------------> break migration loop > > > vm_start() -----------------------------> restart guest with inactive > > > images > > > > > > and you get: > > > > > > qemu-system-ppc64: socket_writev_buffer: Got err=104 for > > > (32768/18446744073709551615) > > > qemu-system-ppc64: > > > /home/greg/Work/qemu/qemu-master/block/io.c:1342:bdrv_co_do_pwritev: > > > Assertion `!(bs->open_flags & 0x0800)' failed. > > > Aborted (core dumped) > > > > > > If we try postcopy with a similar scenario, we also get the writev error > > > message but QEMU leaves the guest paused because entered_postcopy is true. > > > > > > We could possibly do the same with precopy and leave the guest paused. > > > But since the historical default for migration errors is to restart the > > > source, this patch adds a call to bdrv_invalidate_cache_all() instead. > > > > > > Signed-off-by: Greg Kurz <gk...@linux.vnet.ibm.com> > > > --- > > > migration/migration.c | 9 +++++++++ > > > 1 file changed, 9 insertions(+) > > > > > > diff --git a/migration/migration.c b/migration/migration.c > > > index 991313a8629a..5726959ddfd9 100644 > > > --- a/migration/migration.c > > > +++ b/migration/migration.c > > > @@ -1568,8 +1568,17 @@ static void migration_completion(MigrationState > > > *s, int current_active_state, > > > ret = bdrv_inactivate_all(); > > > } > > > if (ret >= 0) { > > > + Error *local_err = NULL; > > > + > > > qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX); > > > qemu_savevm_state_complete_precopy(s->to_dst_file, > > > false); > > > + > > > + if (qemu_file_get_error(s->to_dst_file)) { > > > + bdrv_invalidate_cache_all(&local_err); > > > + if (local_err) { > > > + error_report_err(local_err); > > > + } > > > + } > > > } > > > } > > > qemu_mutex_unlock_iothread(); > > > > > -- > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK >