[RFC PATCH 11/17] COLO ctl: implement colo checkpoint protocol

2014-07-23 Thread Yang Hongyang
implement colo checkpoint protocol.

Checkpoint synchronizing points.

                  Primary                 Secondary
  NEW             @
                                          Suspend
  SUSPENDED                               @
                  Suspend&Save state
  SEND            @
                  Send state              Receive state
  RECEIVED                                @
                  Flush network           Load state
  LOADED                                  @
                  Resume                  Resume

                  Start Comparing
NOTE:
 1) '@' marks the side that sends the message.
 2) Every sync-point is synchronized by the two sides with only
    one handshake (single direction) for low latency.
    If stricter synchronization is required, an opposite-direction
    sync-point should be added.
 3) Since sync-points are single direction, the remote side may
    have gone forward a lot by the time this side receives the sync-point.
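
As an illustration of the sync-point table, the sender of each message can be written down as a small lookup. This is only a sketch: the enum values mirror the patch (COLO_READY = 0x46, the rest follow implicitly), but `enum colo_side`, `colo_msg_sender()` and `colo_assert_sender()` are hypothetical names that do not appear in the patch.

```c
#include <assert.h>

/* Same values as the enum in the patch: COLO_READY = 0x46, rest implicit. */
enum colo_msg {
    COLO_READY = 0x46,
    COLO_CHECKPOINT_NEW,
    COLO_CHECKPOINT_SUSPENDED,
    COLO_CHECKPOINT_SEND,
    COLO_CHECKPOINT_RECEIVED,
    COLO_CHECKPOINT_LOADED,
};

/* Hypothetical type, for illustration only. */
enum colo_side { COLO_SIDE_PRIMARY, COLO_SIDE_SECONDARY, COLO_SIDE_UNKNOWN };

/* Which side sends ('@') each checkpoint sync-point, per the table. */
static enum colo_side colo_msg_sender(enum colo_msg msg)
{
    switch (msg) {
    case COLO_CHECKPOINT_NEW:
    case COLO_CHECKPOINT_SEND:
        return COLO_SIDE_PRIMARY;
    case COLO_CHECKPOINT_SUSPENDED:
    case COLO_CHECKPOINT_RECEIVED:
    case COLO_CHECKPOINT_LOADED:
        return COLO_SIDE_SECONDARY;
    default:
        /* COLO_READY and anything else is not listed in the table. */
        return COLO_SIDE_UNKNOWN;
    }
}

/* Abort if 'msg' is claimed to come from the wrong side. */
static void colo_assert_sender(enum colo_msg msg, enum colo_side from)
{
    assert(colo_msg_sender(msg) == from);
}
```

Each entry is a one-way notification, so the receiving side learns only that the sender reached that point, not how far it has progressed since (note 3).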

Signed-off-by: Yang Hongyang 
---
 migration-colo.c | 268 +--
 1 file changed, 262 insertions(+), 6 deletions(-)

diff --git a/migration-colo.c b/migration-colo.c
index 2699e77..a708872 100644
--- a/migration-colo.c
+++ b/migration-colo.c
@@ -24,6 +24,41 @@
  */
 #define CHKPOINT_TIMER 1
 
+enum {
+    COLO_READY = 0x46,
+
+    /*
+     * Checkpoint synchronizing points.
+     *
+     *                  Primary                 Secondary
+     *  NEW             @
+     *                                          Suspend
+     *  SUSPENDED                               @
+     *                  Suspend&Save state
+     *  SEND            @
+     *                  Send state              Receive state
+     *  RECEIVED                                @
+     *                  Flush network           Load state
+     *  LOADED                                  @
+     *                  Resume                  Resume
+     *
+     *                  Start Comparing
+     * NOTE:
+     * 1) '@' who sends the message
+     * 2) Every sync-point is synchronized by two sides with only
+     *    one handshake(single direction) for low-latency.
+     *    If more strict synchronization is required, an opposite direction
+     *    sync-point should be added.
+     * 3) Since sync-points are single direction, the remote side may
+     *    go forward a lot when this side just receives the sync-point.
+     */
+    COLO_CHECKPOINT_NEW,
+    COLO_CHECKPOINT_SUSPENDED,
+    COLO_CHECKPOINT_SEND,
+    COLO_CHECKPOINT_RECEIVED,
+    COLO_CHECKPOINT_LOADED,
+};
+
+
 static QEMUBH *colo_bh;
 
 bool colo_supported(void)
@@ -185,30 +220,161 @@ static const QEMUFileOps colo_read_ops = {
     .close = colo_close,
 };
 
+/* colo checkpoint control helper */
+static bool is_master(void);
+static bool is_slave(void);
+
+static void ctl_error_handler(void *opaque, int err)
+{
+    if (is_slave()) {
+        /* TODO: determine whether we need to failover */
+        /* FIXME: we will not failover currently, just kill slave */
+        error_report("error: colo transmission failed!\n");
+        exit(1);
+    } else if (is_master()) {
+        /* Master still alive, do not failover */
+        error_report("error: colo transmission failed!\n");
+        return;
+    } else {
+        error_report("COLO: Unexpected error happened!\n");
+        exit(EXIT_FAILURE);
+    }
+}
+
+static int colo_ctl_put(QEMUFile *f, uint64_t request)
+{
+    int ret = 0;
+
+    qemu_put_be64(f, request);
+    qemu_fflush(f);
+
+    ret = qemu_file_get_error(f);
+    if (ret < 0) {
+        ctl_error_handler(f, ret);
+        return 1;
+    }
+
+    return ret;
+}
+
+static int colo_ctl_get_value(QEMUFile *f, uint64_t *value)
+{
+    int ret = 0;
+    uint64_t temp;
+
+    temp = qemu_get_be64(f);
+
+    ret = qemu_file_get_error(f);
+    if (ret < 0) {
+        ctl_error_handler(f, ret);
+        return 1;
+    }
+
+    *value = temp;
+    return 0;
+}
+
+static int colo_ctl_get(QEMUFile *f, uint64_t require)
+{
+    int ret;
+    uint64_t value;
+
+    ret = colo_ctl_get_value(f, &value);
+    if (ret) {
+        return ret;
+    }
+
+    if (value != require) {
+        error_report("unexpected state received!\n");
+        exit(1);
+    }
+
+    return ret;
+}
+
 /* save */
 
-static __attribute__((unused)) bool is_master(void)
+static bool is_master(void)
 {
     MigrationState *s = migrate_get_current();
     return (s->state == MIG_STATE_COLO);
 }
 
+static int do_colo_transaction(MigrationState *s, QEMUFile *control,
+                               QEMUFile *trans)
+{
+    int ret;
+
+    ret = colo_ctl_put(s->file, COLO_CHECKPOINT_NEW);
+    if (ret) {
+        goto out;
+    }
+
+    ret = colo_ctl_get(control, COLO_CHECKPOINT_SUSPENDED);
+    if (ret) {
+        goto out;
+    }
+
+    /* TODO: suspend and save vm state to colo buffer */
+
+    ret = colo_ctl_put(s->file, COLO_CHECKPOINT_SEND);
+    if (ret) {
+        goto out;
+    }
+
+    /* TODO: send

Re: [RFC PATCH 11/17] COLO ctl: implement colo checkpoint protocol

2014-08-01 Thread Dr. David Alan Gilbert
* Yang Hongyang (yan...@cn.fujitsu.com) wrote:
> implement colo checkpoint protocol.
> 
> Checkpoint synchronzing points.
> 
>   Primary Secondary
>   NEW @
>   Suspend
>   SUSPENDED   @
>   Suspend&Save state
>   SEND@
>   Send state  Receive state
>   RECEIVED@
>   Flush network   Load state
>   LOADED  @
>   Resume  Resume
> 
>   Start Comparing
> NOTE:
>  1) '@' who sends the message
>  2) Every sync-point is synchronized by two sides with only
> one handshake(single direction) for low-latency.
> If more strict synchronization is required, a opposite direction
> sync-point should be added.
>  3) Since sync-points are single direction, the remote side may
> go forward a lot when this side just receives the sync-point.
> 
> Signed-off-by: Yang Hongyang 
> ---
>  migration-colo.c | 268 +--
>  1 file changed, 262 insertions(+), 6 deletions(-)
> 
> diff --git a/migration-colo.c b/migration-colo.c
> index 2699e77..a708872 100644
> --- a/migration-colo.c
> +++ b/migration-colo.c
> @@ -24,6 +24,41 @@
>   */
>  #define CHKPOINT_TIMER 1
>  
> +enum {
> +COLO_READY = 0x46,
> +
> +/*
> + * Checkpoint synchronzing points.
> + *
> + *  Primary Secondary
> + *  NEW @
> + *  Suspend
> + *  SUSPENDED   @
> + *  Suspend&Save state
> + *  SEND@
> + *  Send state  Receive state
> + *  RECEIVED@
> + *  Flush network   Load state
> + *  LOADED  @
> + *  Resume  Resume
> + *
> + *  Start Comparing
> + * NOTE:
> + * 1) '@' who sends the message
> + * 2) Every sync-point is synchronized by two sides with only
> + *one handshake(single direction) for low-latency.
> + *If more strict synchronization is required, a opposite direction
> + *sync-point should be added.
> + * 3) Since sync-points are single direction, the remote side may
> + *go forward a lot when this side just receives the sync-point.
> + */
> +COLO_CHECKPOINT_NEW,
> +COLO_CHECKPOINT_SUSPENDED,
> +COLO_CHECKPOINT_SEND,
> +COLO_CHECKPOINT_RECEIVED,
> +COLO_CHECKPOINT_LOADED,
> +};
> +
>  static QEMUBH *colo_bh;
>  
>  bool colo_supported(void)
> @@ -185,30 +220,161 @@ static const QEMUFileOps colo_read_ops = {
>  .close = colo_close,
>  };
>  
> +/* colo checkpoint control helper */
> +static bool is_master(void);
> +static bool is_slave(void);
> +
> +static void ctl_error_handler(void *opaque, int err)
> +{
> +if (is_slave()) {
> +/* TODO: determine whether we need to failover */
> +/* FIXME: we will not failover currently, just kill slave */
> +error_report("error: colo transmission failed!\n");
> +exit(1);
> +} else if (is_master()) {
> +/* Master still alive, do not failover */
> +error_report("error: colo transmission failed!\n");
> +return;
> +} else {
> +error_report("COLO: Unexpected error happend!\n");
> +exit(EXIT_FAILURE);
> +}
> +}
> +
> +static int colo_ctl_put(QEMUFile *f, uint64_t request)
> +{
> +int ret = 0;
> +
> +qemu_put_be64(f, request);
> +qemu_fflush(f);
> +
> +ret = qemu_file_get_error(f);
> +if (ret < 0) {
> +ctl_error_handler(f, ret);
> +return 1;
> +}
> +
> +return ret;
> +}
> +
> +static int colo_ctl_get_value(QEMUFile *f, uint64_t *value)
> +{
> +int ret = 0;
> +uint64_t temp;
> +
> +temp = qemu_get_be64(f);
> +
> +ret = qemu_file_get_error(f);
> +if (ret < 0) {
> +ctl_error_handler(f, ret);
> +return 1;
> +}
> +
> +*value = temp;
> +return 0;
> +}
> +
> +static int colo_ctl_get(QEMUFile *f, uint64_t require)
> +{
> +int ret;
> +uint64_t value;
> +
> +ret = colo_ctl_get_value(f, &value);
> +if (ret) {
> +return ret;
> +}
> +
> +if (value != require) {
> +error_report("unexpected state received!\n");

I find it useful to print the expected/received state to
be able to figure out what went wrong.
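
A minimal sketch of such a message, assuming the control words are the 64-bit values read by colo_ctl_get_value(). The helper name `colo_format_mismatch()` is hypothetical. It formats into a caller-supplied buffer so the example is self-contained. In the patch the same format string would presumably be passed to error_report() instead.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: describe a control-word mismatch, naming both the
 * value we were waiting for and the value that actually arrived.
 * Returns the snprintf() result (length of the full message).
 */
static int colo_format_mismatch(char *buf, size_t len,
                                uint64_t expected, uint64_t received)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "unexpected state received: expected 0x%" PRIx64
                    ", got 0x%" PRIx64, expected, received);
}
```

With the enum from the patch, waiting for COLO_CHECKPOINT_SUSPENDED (0x48) and receiving COLO_CHECKPOINT_SEND (0x49) would produce "unexpected state received: expected 0x48, got 0x49".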

> +exit(1);
> +}
> +
> +return ret;
> +}
> +
>  /* save */
>  
> -static __attribute__((unused)) bool is_master(void)
> +static bool is_master(void)
>  {
>  MigrationState *s = migrate_get_current();
>  return (s->state == MIG_STATE_COLO);
>  }
>  
> +static int do_colo_transaction(MigrationState

Re: [RFC PATCH 11/17] COLO ctl: implement colo checkpoint protocol

2014-09-11 Thread Hongyang Yang



On 08/01/2014 11:03 PM, Dr. David Alan Gilbert wrote:

* Yang Hongyang (yan...@cn.fujitsu.com) wrote:

implement colo checkpoint protocol.

Checkpoint synchronzing points.

   Primary Secondary
   NEW @
   Suspend
   SUSPENDED   @
   Suspend&Save state
   SEND@
   Send state  Receive state
   RECEIVED@
   Flush network   Load state
   LOADED  @
   Resume  Resume

   Start Comparing
NOTE:
  1) '@' who sends the message
  2) Every sync-point is synchronized by two sides with only
 one handshake(single direction) for low-latency.
 If more strict synchronization is required, a opposite direction
 sync-point should be added.
  3) Since sync-points are single direction, the remote side may
 go forward a lot when this side just receives the sync-point.

Signed-off-by: Yang Hongyang 
---
  migration-colo.c | 268 +--
  1 file changed, 262 insertions(+), 6 deletions(-)

diff --git a/migration-colo.c b/migration-colo.c
index 2699e77..a708872 100644
--- a/migration-colo.c
+++ b/migration-colo.c
@@ -24,6 +24,41 @@
   */
  #define CHKPOINT_TIMER 1

+enum {
+COLO_READY = 0x46,
+
+/*
+ * Checkpoint synchronzing points.
+ *
+ *  Primary Secondary
+ *  NEW @
+ *  Suspend
+ *  SUSPENDED   @
+ *  Suspend&Save state
+ *  SEND@
+ *  Send state  Receive state
+ *  RECEIVED@
+ *  Flush network   Load state
+ *  LOADED  @
+ *  Resume  Resume
+ *
+ *  Start Comparing
+ * NOTE:
+ * 1) '@' who sends the message
+ * 2) Every sync-point is synchronized by two sides with only
+ *one handshake(single direction) for low-latency.
+ *If more strict synchronization is required, a opposite direction
+ *sync-point should be added.
+ * 3) Since sync-points are single direction, the remote side may
+ *go forward a lot when this side just receives the sync-point.
+ */
+COLO_CHECKPOINT_NEW,
+COLO_CHECKPOINT_SUSPENDED,
+COLO_CHECKPOINT_SEND,
+COLO_CHECKPOINT_RECEIVED,
+COLO_CHECKPOINT_LOADED,
+};
+
  static QEMUBH *colo_bh;

  bool colo_supported(void)
@@ -185,30 +220,161 @@ static const QEMUFileOps colo_read_ops = {
  .close = colo_close,
  };

+/* colo checkpoint control helper */
+static bool is_master(void);
+static bool is_slave(void);
+
+static void ctl_error_handler(void *opaque, int err)
+{
+if (is_slave()) {
+/* TODO: determine whether we need to failover */
+/* FIXME: we will not failover currently, just kill slave */
+error_report("error: colo transmission failed!\n");
+exit(1);
+} else if (is_master()) {
+/* Master still alive, do not failover */
+error_report("error: colo transmission failed!\n");
+return;
+} else {
+error_report("COLO: Unexpected error happend!\n");
+exit(EXIT_FAILURE);
+}
+}
+
+static int colo_ctl_put(QEMUFile *f, uint64_t request)
+{
+int ret = 0;
+
+qemu_put_be64(f, request);
+qemu_fflush(f);
+
+ret = qemu_file_get_error(f);
+if (ret < 0) {
+ctl_error_handler(f, ret);
+return 1;
+}
+
+return ret;
+}
+
+static int colo_ctl_get_value(QEMUFile *f, uint64_t *value)
+{
+int ret = 0;
+uint64_t temp;
+
+temp = qemu_get_be64(f);
+
+ret = qemu_file_get_error(f);
+if (ret < 0) {
+ctl_error_handler(f, ret);
+return 1;
+}
+
+*value = temp;
+return 0;
+}
+
+static int colo_ctl_get(QEMUFile *f, uint64_t require)
+{
+int ret;
+uint64_t value;
+
+ret = colo_ctl_get_value(f, &value);
+if (ret) {
+return ret;
+}
+
+if (value != require) {
+error_report("unexpected state received!\n");


> I find it useful to print the expected/received state to
> be able to figure out what went wrong.

Good idea!

+exit(1);
+}
+
+return ret;
+}
+
  /* save */

-static __attribute__((unused)) bool is_master(void)
+static bool is_master(void)
  {
  MigrationState *s = migrate_get_current();
  return (s->state == MIG_STATE_COLO);
  }

+static int do_colo_transaction(MigrationState *s, QEMUFile *control,
+   QEMUFile *trans)
+{
+int ret;
+
+ret = colo_ctl_put(s->file, COLO_CHECKPOINT_NEW);
+if (ret) {
+goto out;
+}
+
+ret = colo_ctl_get(control, COLO_C

Re: [RFC PATCH 11/17] COLO ctl: implement colo checkpoint protocol

2014-09-12 Thread Dr. David Alan Gilbert
* Hongyang Yang (yan...@cn.fujitsu.com) wrote:
> 
> 
> On 08/01/2014 11:03 PM, Dr. David Alan Gilbert wrote:
> >* Yang Hongyang (yan...@cn.fujitsu.com) wrote:



> >>+static int do_colo_transaction(MigrationState *s, QEMUFile *control,
> >>+   QEMUFile *trans)
> >>+{
> >>+int ret;
> >>+
> >>+ret = colo_ctl_put(s->file, COLO_CHECKPOINT_NEW);
> >>+if (ret) {
> >>+goto out;
> >>+}
> >>+
> >>+ret = colo_ctl_get(control, COLO_CHECKPOINT_SUSPENDED);
> >
> >What happens at this point if the slave just doesn't respond?
> >(i.e. the socket doesn't drop - you just don't get the byte).
> 
> If the socket return bytes that were not expected, exit. If
> socket return error, do some cleanup and quit COLO process.
> refer to: colo_ctl_get() and colo_ctl_get_value()

But what happens if the slave just doesn't respond at all; e.g.
if the slave host loses power, it'll take a while (many seconds)
before the socket will timeout.

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 11/17] COLO ctl: implement colo checkpoint protocol

2014-09-12 Thread Hongyang Yang



On 09/12/2014 07:17 PM, Dr. David Alan Gilbert wrote:
> * Hongyang Yang (yan...@cn.fujitsu.com) wrote:
>> On 08/01/2014 11:03 PM, Dr. David Alan Gilbert wrote:
>>> * Yang Hongyang (yan...@cn.fujitsu.com) wrote:
>
>>>> +static int do_colo_transaction(MigrationState *s, QEMUFile *control,
>>>> +                               QEMUFile *trans)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    ret = colo_ctl_put(s->file, COLO_CHECKPOINT_NEW);
>>>> +    if (ret) {
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    ret = colo_ctl_get(control, COLO_CHECKPOINT_SUSPENDED);
>>>
>>> What happens at this point if the slave just doesn't respond?
>>> (i.e. the socket doesn't drop - you just don't get the byte).
>>
>> If the socket return bytes that were not expected, exit. If
>> socket return error, do some cleanup and quit COLO process.
>> refer to: colo_ctl_get() and colo_ctl_get_value()
>
> But what happens if the slave just doesn't respond at all; e.g.
> if the slave host loses power, it'll take a while (many seconds)
> before the socket will timeout.

It will wait until the call returns timeout error, and then do some
cleanup and quit COLO process. There may be better way to handle
this?

> Dave
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> .



--
Thanks,
Yang.


Re: [RFC PATCH 11/17] COLO ctl: implement colo checkpoint protocol

2014-09-12 Thread Dr. David Alan Gilbert
* Hongyang Yang (yan...@cn.fujitsu.com) wrote:
> 
> 
> On 09/12/2014 07:17 PM, Dr. David Alan Gilbert wrote:
> >* Hongyang Yang (yan...@cn.fujitsu.com) wrote:
> >>
> >>
> >>On 08/01/2014 11:03 PM, Dr. David Alan Gilbert wrote:
> >>>* Yang Hongyang (yan...@cn.fujitsu.com) wrote:
> >
> >
> >
> >>>>+static int do_colo_transaction(MigrationState *s, QEMUFile *control,
> >>>>+   QEMUFile *trans)
> >>>>+{
> >>>>+int ret;
> >>>>+
> >>>>+ret = colo_ctl_put(s->file, COLO_CHECKPOINT_NEW);
> >>>>+if (ret) {
> >>>>+goto out;
> >>>>+}
> >>>>+
> >>>>+ret = colo_ctl_get(control, COLO_CHECKPOINT_SUSPENDED);
> >>>
> >>>What happens at this point if the slave just doesn't respond?
> >>>(i.e. the socket doesn't drop - you just don't get the byte).
> >>
> >>If the socket return bytes that were not expected, exit. If
> >>socket return error, do some cleanup and quit COLO process.
> >>refer to: colo_ctl_get() and colo_ctl_get_value()
> >
> >But what happens if the slave just doesn't respond at all; e.g.
> >if the slave host loses power, it'll take a while (many seconds)
> >before the socket will timeout.
> 
> It will wait until the call returns timeout error, and then do some
> cleanup and quit COLO process.

If it was to wait here for ~30seconds for the timeout what would happen
to the primary? Would it be stopped from sending any network traffic
for those 30 seconds - I think that's too long to fail over.

> There may be better way to handle this?

In postcopy I always take reads coming back from the destination
in a separate thread, because that thread can't block the main thread
going out (I originally did that using async reads but the thread
is nicer).  You could also use something like a poll() with a shorter
timeout to however long you are happy for COLO to go before it fails.
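
To make the poll() suggestion concrete, here is a rough sketch using a plain POSIX file descriptor. It is not the patch's code: the patch reads through QEMUFile, so `ctl_read_be64_timeout()` and its return convention are assumptions made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one 8-byte big-endian control word from 'fd'.
 * Each wait for more data is bounded by 'timeout_ms', so a silent peer is
 * noticed after roughly that long instead of after the TCP timeout.
 * Returns 0 on success, -ETIMEDOUT if no data arrived in time,
 * -ECONNRESET if the peer closed the connection, or -errno on other errors.
 */
static int ctl_read_be64_timeout(int fd, uint64_t *value, int timeout_ms)
{
    unsigned char buf[8];
    size_t got = 0;
    uint64_t v = 0;

    assert(value != NULL);

    while (got < sizeof(buf)) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);

        if (rc == 0) {
            return -ETIMEDOUT;      /* peer stayed silent for timeout_ms */
        }
        if (rc < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted by a signal, retry */
            }
            return -errno;
        }

        ssize_t n = read(fd, buf + got, sizeof(buf) - got);
        if (n == 0) {
            return -ECONNRESET;     /* orderly close by the peer */
        }
        if (n < 0) {
            if (errno == EINTR || errno == EAGAIN) {
                continue;
            }
            return -errno;
        }
        got += (size_t)n;
    }

    /* Assemble the big-endian value, most significant byte first. */
    for (size_t i = 0; i < sizeof(buf); i++) {
        v = (v << 8) | buf[i];
    }
    *value = v;
    return 0;
}
```

The timeout here applies to each wait rather than to the whole 8-byte read, so a peer that trickles bytes can stretch the total. A caller that needs a hard deadline would track the remaining time itself. The separate-reader-thread approach avoids blocking the outgoing path at all, at the cost of more synchronization.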

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK