Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-29 Thread Fei Li




On 10/26/2018 11:24 PM, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
> > >
> > > On 10/25/2018 08:58 PM, Peter Xu wrote:
> > > > On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
> > > >
> > > > [...]
> > > >
> > > > > @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
> > > > >   /* Return true if multifd is ready for the migration, otherwise false */
> > > > >   bool multifd_recv_new_channel(QIOChannel *ioc)
> > > > >   {
> > > > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > > >   MultiFDRecvParams *p;
> > > > >   Error *local_err = NULL;
> > > > >   int id;
> > > > >
> > > > >   id = multifd_recv_initial_packet(ioc, &local_err);
> > > > >   if (id < 0) {
> > > > > -    multifd_recv_terminate_threads(local_err);
> > > > > -    return false;
> > > > > +    error_reportf_err(local_err,
> > > > > +  "failed to receive packet via multifd channel %x: ",
> > > > > +  multifd_recv_state->count);
> > > > > +    goto fail;
> > > > >   }
> > > > >
> > > > >   p = &multifd_recv_state->params[id];
> > > > >   if (p->c != NULL) {
> > > > >   error_setg(&local_err, "multifd: received id '%d' already setup'",
> > > > >      id);
> > > > > -    multifd_recv_terminate_threads(local_err);
> > > > > -    return false;
> > > > > +    goto fail;
> > > > >   }
> > > > >   p->c = ioc;
> > > > >   object_ref(OBJECT(ioc));
> > > > > @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
> > > > >      QEMU_THREAD_JOINABLE);
> > > > >   atomic_inc(&multifd_recv_state->count);
> > > > >   return multifd_recv_state->count == migrate_multifd_channels();
> > > > > +fail:
> > > > > +    multifd_recv_terminate_threads(local_err);
> > > > > +    qemu_fclose(mis->from_src_file);
> > > > > +    mis->from_src_file = NULL;
> > > > > +    exit(EXIT_FAILURE);
> > > > >   }
> > > > Yeah I think it makes sense to at least report some details when error
> > > > happens, but I'm not sure whether it's good to explicitly exit() here.
> > > > IMHO you can add an Error** in multifd_recv_new_channel() parameter
> > > > list to do that, and even through migration_ioc_process_incoming().
> > > > What do you think?
> > > >
> > > > Regards,
> > > >
> > > You mean exit() in migration_ioc_process_incoming(), or further
> > > caller migration_channel_process_incoming()? Actually either is
> > > ok for me. :) But today I find if using postcopy and multifd together
> > > to do live migration, it seems the hang still occurs even with the
> > > above codes, so sad about that. I will keep debugging and see
> > > how to fix this.
> >
> > Maybe you can move the error_report_err() in
> > migration_channel_process_incoming() out of the TLS path so we can
> > report the error if either TLS or non-TLS case got something wrong.

Thanks for the advice. I will do the update in the next version. :)

> > And I don't even know whether multifd could work with postcopy...
>
> Nope, it's not expected to work yet.
>
> Dave

Thanks for the helpful information. :)

BTW, in the next version, I'd like to merge these three migration patches
into the "[PATCH RFC v6 ] qemu_thread_create: propagate the error to callers
to handle" series, and cc you on the patches. Please help to review.

Have a nice day, thanks again
Fei

> > Regards,
> >
> > --
> > Peter Xu
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK







Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-26 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
> > 
> > 
> > On 10/25/2018 08:58 PM, Peter Xu wrote:
> > > On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
> > > 
> > > [...]
> > > 
> > > > @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
> > > > >   /* Return true if multifd is ready for the migration, otherwise false */
> > > >   bool multifd_recv_new_channel(QIOChannel *ioc)
> > > >   {
> > > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > >   MultiFDRecvParams *p;
> > > >   Error *local_err = NULL;
> > > >   int id;
> > > > 
> > > > >   id = multifd_recv_initial_packet(ioc, &local_err);
> > > > >   if (id < 0) {
> > > > > -    multifd_recv_terminate_threads(local_err);
> > > > > -    return false;
> > > > > +    error_reportf_err(local_err,
> > > > > +  "failed to receive packet via multifd channel %x: ",
> > > > > +  multifd_recv_state->count);
> > > > > +    goto fail;
> > > > >   }
> > > > >
> > > > >   p = &multifd_recv_state->params[id];
> > > > >   if (p->c != NULL) {
> > > > >   error_setg(&local_err, "multifd: received id '%d' already setup'",
> > > > >      id);
> > > > -    multifd_recv_terminate_threads(local_err);
> > > > -    return false;
> > > > +    goto fail;
> > > >   }
> > > >   p->c = ioc;
> > > >   object_ref(OBJECT(ioc));
> > > > @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
> > > >      QEMU_THREAD_JOINABLE);
> > > > >   atomic_inc(&multifd_recv_state->count);
> > > >   return multifd_recv_state->count == migrate_multifd_channels();
> > > > +fail:
> > > > +    multifd_recv_terminate_threads(local_err);
> > > > +    qemu_fclose(mis->from_src_file);
> > > > +    mis->from_src_file = NULL;
> > > > +    exit(EXIT_FAILURE);
> > > >   }
> > > Yeah I think it makes sense to at least report some details when error
> > > happens, but I'm not sure whether it's good to explicitly exit() here.
> > > IMHO you can add an Error** in multifd_recv_new_channel() parameter
> > > list to do that, and even through migration_ioc_process_incoming().
> > > What do you think?
> > > 
> > > Regards,
> > > 
> > You mean exit() in migration_ioc_process_incoming(), or further
> > caller migration_channel_process_incoming()? Actually either is
> > ok for me. :) But today I find if using postcopy and multifd together
> > to do live migration, it seems the hang still occurs even with the
> > above codes, so sad about that. I will keep debugging and see
> > how to fix this.
> 
> Maybe you can move the error_report_err() in
> migration_channel_process_incoming() out of the TLS path so we can
> report the error if either TLS or non-TLS case got something wrong.
> 
> And I don't even know whether multifd could work with postcopy...

Nope, it's not expected to work yet.

Dave

> Regards,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-26 Thread Peter Xu
On Fri, Oct 26, 2018 at 09:10:19PM +0800, Fei Li wrote:
> 
> 
> On 10/25/2018 08:58 PM, Peter Xu wrote:
> > On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
> > 
> > [...]
> > 
> > > @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
> > >   /* Return true if multifd is ready for the migration, otherwise false */
> > >   bool multifd_recv_new_channel(QIOChannel *ioc)
> > >   {
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > >   MultiFDRecvParams *p;
> > >   Error *local_err = NULL;
> > >   int id;
> > > 
> > > >   id = multifd_recv_initial_packet(ioc, &local_err);
> > > >   if (id < 0) {
> > > > -    multifd_recv_terminate_threads(local_err);
> > > > -    return false;
> > > > +    error_reportf_err(local_err,
> > > > +  "failed to receive packet via multifd channel %x: ",
> > > > +  multifd_recv_state->count);
> > > > +    goto fail;
> > > >   }
> > > >
> > > >   p = &multifd_recv_state->params[id];
> > > >   if (p->c != NULL) {
> > > >   error_setg(&local_err, "multifd: received id '%d' already setup'",
> > > >      id);
> > > -    multifd_recv_terminate_threads(local_err);
> > > -    return false;
> > > +    goto fail;
> > >   }
> > >   p->c = ioc;
> > >   object_ref(OBJECT(ioc));
> > > @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
> > >      QEMU_THREAD_JOINABLE);
> > > >   atomic_inc(&multifd_recv_state->count);
> > >   return multifd_recv_state->count == migrate_multifd_channels();
> > > +fail:
> > > +    multifd_recv_terminate_threads(local_err);
> > > +    qemu_fclose(mis->from_src_file);
> > > +    mis->from_src_file = NULL;
> > > +    exit(EXIT_FAILURE);
> > >   }
> > Yeah I think it makes sense to at least report some details when error
> > happens, but I'm not sure whether it's good to explicitly exit() here.
> > IMHO you can add an Error** in multifd_recv_new_channel() parameter
> > list to do that, and even through migration_ioc_process_incoming().
> > What do you think?
> > 
> > Regards,
> > 
> You mean exit() in migration_ioc_process_incoming(), or further
> caller migration_channel_process_incoming()? Actually either is
> ok for me. :) But today I find if using postcopy and multifd together
> to do live migration, it seems the hang still occurs even with the
> above codes, so sad about that. I will keep debugging and see
> how to fix this.

Maybe you can move the error_report_err() in
migration_channel_process_incoming() out of the TLS path so we can
report the error if either TLS or non-TLS case got something wrong.
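
The suggestion above can be sketched in isolation: both the TLS and the non-TLS branch hand their error back to one shared reporting point. This is only a minimal sketch and not QEMU code. `tls_handshake_start()`, `ioc_process_incoming()`, `report_error()` and the `fail` flags are hypothetical stand-ins, and the error is a plain string rather than a QEMU `Error` object.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

static int reported_errors; /* how many errors have been reported so far */

/* Hypothetical stand-in for an error_report_err()-style helper. */
static void report_error(const char *msg)
{
    fprintf(stderr, "incoming migration: %s\n", msg);
    reported_errors++;
}

/* Hypothetical TLS path: sets *err and returns false on failure. */
static bool tls_handshake_start(bool fail, const char **err)
{
    if (fail) {
        *err = "TLS handshake failed";
        return false;
    }
    return true;
}

/* Hypothetical non-TLS path: sets *err and returns false on failure. */
static bool ioc_process_incoming(bool fail, const char **err)
{
    if (fail) {
        *err = "failed to set up incoming channel";
        return false;
    }
    return true;
}

/* Returns true on success; any failure is reported exactly once. */
static bool channel_process_incoming(bool use_tls, bool fail)
{
    const char *err = NULL;
    bool ok;

    if (use_tls) {
        ok = tls_handshake_start(fail, &err);
    } else {
        ok = ioc_process_incoming(fail, &err);
    }

    /* Shared by both branches, instead of living inside the TLS one. */
    if (!ok) {
        report_error(err);
    }
    return ok;
}
```

With the report placed after the branch, a failure on either path ends up in the same message stream, which is the point of the suggestion.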

And I don't even know whether multifd could work with postcopy...

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-26 Thread Fei Li




On 10/25/2018 08:58 PM, Peter Xu wrote:
> On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:
>
> [...]
>
> > @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
> >   /* Return true if multifd is ready for the migration, otherwise false */
> >   bool multifd_recv_new_channel(QIOChannel *ioc)
> >   {
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> >   MultiFDRecvParams *p;
> >   Error *local_err = NULL;
> >   int id;
> >
> >   id = multifd_recv_initial_packet(ioc, &local_err);
> >   if (id < 0) {
> > -    multifd_recv_terminate_threads(local_err);
> > -    return false;
> > +    error_reportf_err(local_err,
> > +  "failed to receive packet via multifd channel %x: ",
> > +  multifd_recv_state->count);
> > +    goto fail;
> >   }
> >
> >   p = &multifd_recv_state->params[id];
> >   if (p->c != NULL) {
> >   error_setg(&local_err, "multifd: received id '%d' already setup'",
> >      id);
> > -    multifd_recv_terminate_threads(local_err);
> > -    return false;
> > +    goto fail;
> >   }
> >   p->c = ioc;
> >   object_ref(OBJECT(ioc));
> > @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
> >      QEMU_THREAD_JOINABLE);
> >   atomic_inc(&multifd_recv_state->count);
> >   return multifd_recv_state->count == migrate_multifd_channels();
> > +fail:
> > +    multifd_recv_terminate_threads(local_err);
> > +    qemu_fclose(mis->from_src_file);
> > +    mis->from_src_file = NULL;
> > +    exit(EXIT_FAILURE);
> >   }
> Yeah I think it makes sense to at least report some details when error
> happens, but I'm not sure whether it's good to explicitly exit() here.
> IMHO you can add an Error** in multifd_recv_new_channel() parameter
> list to do that, and even through migration_ioc_process_incoming().
> What do you think?
>
> Regards,
>
You mean exit() in migration_ioc_process_incoming(), or further
caller migration_channel_process_incoming()? Actually either is
ok for me. :) But today I find if using postcopy and multifd together
to do live migration, it seems the hang still occurs even with the
above codes, so sad about that. I will keep debugging and see
how to fix this.

Have a nice day, thanks
Fei



Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-26 Thread Fei Li




On 10/25/2018 08:55 PM, Dr. David Alan Gilbert wrote:
> * Fei Li (f...@suse.com) wrote:
> > Hi,
> > these two patches are to fix live migration issues. The first is
> > about multifd, and the second is to fix some error handling.
> >
> > But I have a question about using multifd migration.
> > In our current code, when multifd is used during migration, if there
> > is an error before the destination receives all new channels (I mean
> > multifd_recv_new_channel(ioc)), the destination does not exit but
> > keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
> > the source exits.
> >
> > My question is about the state of the destination host if fails during
> > this period. I did a test, after applying [1/2] patch, if
> > multifd_new_send_channel_async() fails, the destination host hangs for
> > a while then later pops up a window saying
> > "'QEMU (...) [stopped]' is not responding.
> > You may choose to wait a short while for it to continue or force
> > the application to quit entirely."
> > But after closing the window by clicking, the qemu on the dest still
> > hangs there until I exclusively kill the qemu on the source.
>
> That sounds like the main thread is blocked for some reason?

Yes, the main thread on the dst keeps looping.

> But I don't
> normally use the window setup;  if you try with -nographic and can see
> the HMP (or a QMP) monitor, can you see if the monitor still responds?

Thanks for the `-nographic` reminder, I observed an interesting
phenomenon:

If I do the `migrate -d tcp:ip_addr:port` before the guest's graphic appears
(it's dark now), there is no hang and the guest starts up properly later.
But if I do the live migration after the guest fully starts up, I mean when
I can operate something using my mouse inside the guest, the hang
situation is there.
This is true for using `-nographic` for both src and dst,
and using `-nographic` for only src or dst.

The hang phenomenon is that the dst seems to never respond (I
waited three minutes), and the cursor just keeps flashing. After I
exclusively kill the src, the dst quits. Just as follows:
(Same result if gdb is not used in src)
src:
(qemu) ...
(qemu) q
(gdb) q
dst:
(qemu) Up to now, dst has received the 0 channel
Up to now, dst has received the 1 channel

(qemu)
(qemu)

To check the migration state in the src:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: on dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
Migration status: setup /* I added some code to set the status to "failed", but it still does not work, see the details below */
total time: 0 milliseconds

I guess maybe the source should proactively tell the dst and
disconnect from the source side, so I tried to set the above
"Migration status" to be "failed", and use qemu_fclose(s->to_dst_file)
when multifd_new_send_channel_async() fails.
(BTW: I even tried:
 if (s->vm_was_running) {   vm_start();   }   )
But the hang situation is still there.

> If it doesn't then try and get a backtrace.
>
> The monitor really shouldn't block, so it would be interesting to see.
>
> Dave

I set two breakpoints and got the following backtrace, hope it can
help. :)

Thread 1 "qemu-system-x86" hit Breakpoint 1, multifd_recv_new_channel (
    ioc=0x57995af0) at /build/gitcode/qemu-build/migration/ram.c:1368
1368    {
(gdb) c
Continuing.

Thread 1 "qemu-system-x86" hit Breakpoint 2, qio_channel_socket_readv (
    ioc=0x57995af0, iov=0x568777d0, niov=1, fds=0x0, nfds=0x0,
    errp=0x7fffdb38) at io/channel-socket.c:463
463    {
(gdb) n
464        QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
(gdb)
..
483     retry:
(gdb)
484        ret = recvmsg(sioc->fd, &msg, sflags);
(gdb) bt
#0  qio_channel_socket_readv (ioc=0x57995af0, iov=0x568777d0, niov=1,
    fds=0x0, nfds=0x0, errp=0x7fffdb38) at io/channel-socket.c:484
#1  0x55d156c5 in qio_channel_readv_full (ioc=0x57995af0,
    iov=0x568777d0, niov=1, fds=0x0, nfds=0x0, errp=0x7fffdb38)
    at io/channel.c:65
#2  0x55d15b26 in qio_channel_readv (ioc=0x57995af0,
    iov=0x568777d0, niov=1, errp=0x7fffdb38) at io/channel.c:197
#3  0x55d15853 in qio_channel_readv_all_eof (ioc=0x57995af0,
    iov=0x7fffda70, niov=1, errp=0x7fffdb38) at io/channel.c:106
#4  0x55d1595c in qio_channel_readv_all (ioc=0x57995af0,
    iov=0x7fffda70, niov=1, errp=0x7fffdb38) at io/channel.c:142
#5  0x55d15d0c in qio_channel_read_all (ioc=0x57995af0,
    buf=0x7fffdad0 "\340\"zVUU", buflen=25, errp=0x7fffdb38)
    at io/channel.c:246
#6  0x5587695c in multifd_recv_initial_packet (c=0x57995af0,
    errp=0x7fffdb38) at 

Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-25 Thread Peter Xu
On Thu, Oct 25, 2018 at 05:04:00PM +0800, Fei Li wrote:

[...]

> @@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
>  /* Return true if multifd is ready for the migration, otherwise false */
>  bool multifd_recv_new_channel(QIOChannel *ioc)
>  {
> +    MigrationIncomingState *mis = migration_incoming_get_current();
>  MultiFDRecvParams *p;
>  Error *local_err = NULL;
>  int id;
> 
>  id = multifd_recv_initial_packet(ioc, &local_err);
>  if (id < 0) {
> -    multifd_recv_terminate_threads(local_err);
> -    return false;
> +    error_reportf_err(local_err,
> +  "failed to receive packet via multifd channel %x: ",
> +  multifd_recv_state->count);
> +    goto fail;
>  }
>
>  p = &multifd_recv_state->params[id];
>  if (p->c != NULL) {
>  error_setg(&local_err, "multifd: received id '%d' already setup'",
>     id);
> -    multifd_recv_terminate_threads(local_err);
> -    return false;
> +    goto fail;
>  }
>  p->c = ioc;
>  object_ref(OBJECT(ioc));
> @@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
>     QEMU_THREAD_JOINABLE);
>  atomic_inc(&multifd_recv_state->count);
>  return multifd_recv_state->count == migrate_multifd_channels();
> +fail:
> +    multifd_recv_terminate_threads(local_err);
> +    qemu_fclose(mis->from_src_file);
> +    mis->from_src_file = NULL;
> +    exit(EXIT_FAILURE);
>  }

Yeah I think it makes sense to at least report some details when error
happens, but I'm not sure whether it's good to explicitly exit() here.
IMHO you can add an Error** in multifd_recv_new_channel() parameter
list to do that, and even through migration_ioc_process_incoming().
What do you think?
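
As a rough illustration of that suggestion, the sketch below hands the failure to the caller through an extra `Error **` out-parameter instead of calling exit() in the callee. It does not use QEMU's headers: the `Error` struct, `error_set_simple()`, `recv_new_channel()` and `process_incoming()` are simplified, hypothetical stand-ins, and the integer `id` replaces the real channel and initial-packet handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's Error object: just a message. */
typedef struct Error {
    char msg[128];
} Error;

/* Store a message in *errp if the caller asked for one (errp may be NULL). */
static void error_set_simple(Error **errp, const char *msg)
{
    Error *err;

    if (errp == NULL) {
        return;
    }
    assert(*errp == NULL); /* an earlier error must not be overwritten */
    err = malloc(sizeof(*err));
    if (err == NULL) {
        abort();
    }
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
    *errp = err;
}

#define NUM_CHANNELS 2
static bool channel_in_use[NUM_CHANNELS];
static int channels_ready;

/*
 * Illustrative callee: returns true once all channels are set up.
 * On failure it returns false and fills *errp; it never exits.
 */
static bool recv_new_channel(int id, Error **errp)
{
    if (id < 0 || id >= NUM_CHANNELS) {
        error_set_simple(errp, "failed to receive initial packet");
        return false;
    }
    if (channel_in_use[id]) {
        error_set_simple(errp, "received id already set up");
        return false;
    }
    channel_in_use[id] = true;
    channels_ready++;
    return channels_ready == NUM_CHANNELS;
}

/*
 * Illustrative caller: returns false on error and passes the Error up,
 * so the outermost caller decides whether to report, retry or exit.
 */
static bool process_incoming(int id, bool *all_ready, Error **errp)
{
    Error *local_err = NULL;

    *all_ready = recv_new_channel(id, &local_err);
    if (local_err != NULL) {
        if (errp != NULL) {
            *errp = local_err;
        } else {
            free(local_err);
        }
        return false;
    }
    return true;
}
```

The boolean return value keeps its "all channels ready" meaning, so the error has to travel separately; that is why the caller checks `local_err` rather than the return value.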

Regards,

-- 
Peter Xu



Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-25 Thread Dr. David Alan Gilbert
* Fei Li (f...@suse.com) wrote:
> Hi,
> these two patches are to fix live migration issues. The first is
> about multifd, and the second is to fix some error handling.
> 
> But I have a question about using multifd migration.
> In our current code, when multifd is used during migration, if there
> is an error before the destination receives all new channels (I mean
> multifd_recv_new_channel(ioc)), the destination does not exit but
> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
> the source exits.
> 
> My question is about the state of the destination host if fails during
> this period. I did a test, after applying [1/2] patch, if
> multifd_new_send_channel_async() fails, the destination host hangs for
> a while then later pops up a window saying
> "'QEMU (...) [stopped]' is not responding.
> You may choose to wait a short while for it to continue or force
> the application to quit entirely."
> But after closing the window by clicking, the qemu on the dest still
> hangs there until I exclusively kill the qemu on the source.

That sounds like the main thread is blocked for some reason? But I don't
normally use the window setup;  if you try with -nographic and can see
the HMP (or a QMP) monitor, can you see if the monitor still responds?
If it doesn't then try and get a backtrace.

The monitor really shouldn't block, so it would be interesting to see.
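
For readers unfamiliar with why a thread sitting in recvmsg() stalls until the peer sends data or closes the socket, here is a generic POSIX sketch of a read with a bounded wait. It is only an illustration under stated assumptions: `recv_with_timeout()` and its `-2` timeout code are hypothetical, and this is not how QEMU's QIOChannel layer is structured.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for data on fd, then read at most len bytes.
 * Returns the number of bytes read (0 means the peer closed the
 * connection), -1 with errno set on error, or -2 if nothing arrived
 * within the timeout.
 */
static ssize_t recv_with_timeout(int fd, void *buf, size_t len,
                                 int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret;

    /* Simplification: an interrupted poll() restarts the full timeout. */
    do {
        ret = poll(&pfd, 1, timeout_ms);
    } while (ret < 0 && errno == EINTR);

    if (ret < 0) {
        return -1;
    }
    if (ret == 0) {
        return -2; /* timed out: the caller can fail the migration */
    }
    return recv(fd, buf, len, 0);
}
```

A plain blocking recv() in the same place would return only in the second and third cases (data arrived or peer closed), which matches the reported behaviour of the destination waiting until the source exits.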

Dave

> The source host keeps running as expected, but I guess the hang
> phenomenon in the dest is not right.
> Would someone kindly give some suggestions on this? Thanks a lot.
> 
> 
> Fei Li (2):
>   migration: fix the multifd code
>   migration: fix some error handling
> 
>  migration/migration.c    |  5 +
>  migration/postcopy-ram.c |  3 +++
>  migration/ram.c          | 33 +++--
>  migration/ram.h          |  2 +-
>  4 files changed, 28 insertions(+), 15 deletions(-)
> 
> -- 
> 2.13.7
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-25 Thread Fei Li




On 10/25/2018 05:27 AM, Peter Xu wrote:
> On Mon, Oct 22, 2018 at 07:08:52PM +0800, Fei Li wrote:
> > Hi,
> > these two patches are to fix live migration issues. The first is
> > about multifd, and the second is to fix some error handling.
> >
> > But I have a question about using multifd migration.
> > In our current code, when multifd is used during migration, if there
> > is an error before the destination receives all new channels (I mean
> > multifd_recv_new_channel(ioc)), the destination does not exit but
> > keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
> > the source exits.
> >
> > My question is about the state of the destination host if fails during
> > this period. I did a test, after applying [1/2] patch, if
> > multifd_new_send_channel_async() fails, the destination host hangs for
> > a while then later pops up a window saying
> > "'QEMU (...) [stopped]' is not responding.
> > You may choose to wait a short while for it to continue or force
> > the application to quit entirely."
> > But after closing the window by clicking, the qemu on the dest still
> > hangs there until I exclusively kill the qemu on the source.
> >
> > The source host keeps running as expected, but I guess the hang
> > phenomenon in the dest is not right.
> > Would someone kindly give some suggestions on this? Thanks a lot.
>
> Note that it's during KVM forum so the response from anyone might be
> slow (it ends this week).

Thanks for the kind reminder. :)

> I think the thing you described seems normal since we can't guarantee
> the network is always stable, normally I'll expect that the migration
> will fail but it won't matter much since after all it's a precopy so
> we lose nothing.  So I'm curious about when the error you mentioned
> happens (e.g., total channel number is N, you only got M channels
> connected, with M < N) could you just simply kill the destination?
> Then AFAIU the source can just continue to run, right?

Yes, for the M < N situation, IMO the destination can simply be killed by
adding exit(EXIT_FAILURE) when it fails to receive a packet via some
channel. The code below has been tested, and the result is that the
source continues to run and the destination exits.
I'd like to write a separate patch to fix the hang issue if the
code/idea below is acceptable.

@@ -1325,22 +1325,24 @@ bool multifd_recv_all_channels_created(void)
 /* Return true if multifd is ready for the migration, otherwise false */
 bool multifd_recv_new_channel(QIOChannel *ioc)
 {
+    MigrationIncomingState *mis = migration_incoming_get_current();
 MultiFDRecvParams *p;
 Error *local_err = NULL;
 int id;

 id = multifd_recv_initial_packet(ioc, &local_err);
 if (id < 0) {
-    multifd_recv_terminate_threads(local_err);
-    return false;
+    error_reportf_err(local_err,
+  "failed to receive packet via multifd channel %x: ",
+  multifd_recv_state->count);
+    goto fail;
 }

 p = &multifd_recv_state->params[id];
 if (p->c != NULL) {
 error_setg(&local_err, "multifd: received id '%d' already setup'",
    id);
-    multifd_recv_terminate_threads(local_err);
-    return false;
+    goto fail;
 }
 p->c = ioc;
 object_ref(OBJECT(ioc));
@@ -1352,6 +1354,11 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
    QEMU_THREAD_JOINABLE);
 atomic_inc(&multifd_recv_state->count);
 return multifd_recv_state->count == migrate_multifd_channels();
+fail:
+    multifd_recv_terminate_threads(local_err);
+    qemu_fclose(mis->from_src_file);
+    mis->from_src_file = NULL;
+    exit(EXIT_FAILURE);
 }

Have a nice day, thanks a lot
Fei

> > Fei Li (2):
> >   migration: fix the multifd code
> >   migration: fix some error handling
> >
> >  migration/migration.c    |  5 +
> >  migration/postcopy-ram.c |  3 +++
> >  migration/ram.c          | 33 +++--
> >  migration/ram.h          |  2 +-
> >  4 files changed, 28 insertions(+), 15 deletions(-)
> >
> > --
> > 2.13.7
>
> Regards,






Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues

2018-10-24 Thread Peter Xu
On Mon, Oct 22, 2018 at 07:08:52PM +0800, Fei Li wrote:
> Hi,
> these two patches are to fix live migration issues. The first is
> about multifd, and the second is to fix some error handling.
> 
> But I have a question about using multifd migration.
> In our current code, when multifd is used during migration, if there
> is an error before the destination receives all new channels (I mean
> multifd_recv_new_channel(ioc)), the destination does not exit but
> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
> the source exits.
> 
> My question is about the state of the destination host if fails during
> this period. I did a test, after applying [1/2] patch, if
> multifd_new_send_channel_async() fails, the destination host hangs for
> a while then later pops up a window saying
> "'QEMU (...) [stopped]' is not responding.
> You may choose to wait a short while for it to continue or force
> the application to quit entirely."
> But after closing the window by clicking, the qemu on the dest still
> hangs there until I exclusively kill the qemu on the source.
> 
> The source host keeps running as expected, but I guess the hang
> phenomenon in the dest is not right.
> Would someone kindly give some suggestions on this? Thanks a lot.

Note that it's during KVM forum so the response from anyone might be
slow (it ends this week).

I think the thing you described seems normal since we can't guarantee
the network is always stable, normally I'll expect that the migration
will fail but it won't matter much since after all it's a precopy so
we lose nothing.  So I'm curious about when the error you mentioned
happens (e.g., total channel number is N, you only got M channels
connected, with M < N) could you just simply kill the destination?
Then AFAIU the source can just continue to run, right?

> 
> 
> Fei Li (2):
>   migration: fix the multifd code
>   migration: fix some error handling
> 
>  migration/migration.c    |  5 +
>  migration/postcopy-ram.c |  3 +++
>  migration/ram.c          | 33 +++--
>  migration/ram.h          |  2 +-
>  4 files changed, 28 insertions(+), 15 deletions(-)
> 
> -- 
> 2.13.7
> 

Regards,

-- 
Peter Xu