On Fri, Jun 21, 2024 at 07:32:21AM -0700, Elena Ufimtseva wrote:
> During live migration, receive the current downtime from the source
> and start a downtime timer. When the destination downtime plus the
> source downtime exceeds the downtime limit by more than the
> switchover limit, abort live migration on the destination.
> 
> Signed-off-by: Elena Ufimtseva <elena.ufimts...@oracle.com>
> ---
>  migration/migration.c  |  2 ++
>  migration/migration.h  | 15 ++++++++++
>  migration/savevm.c     | 68 ++++++++++++++++++++++++++++++++++++++++++
>  migration/savevm.h     |  2 ++
>  migration/trace-events |  3 ++
>  5 files changed, 90 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 5cc304d2db..64d7290997 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -240,6 +240,8 @@ void migration_object_init(void)
>      current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>  
>      current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR;
> +    /* Downtime will start when source sends its current downtime. */
> +    current_incoming->downtime_start = 0;
>  
>      migration_object_check(current_migration, &error_fatal);
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index aa56b70795..06f4ebe214 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -230,6 +230,21 @@ struct MigrationIncomingState {
>  
>      /* Do exit on incoming migration failure */
>      bool exit_on_error;
> +
> +    /* Initial downtime on destination set by MIG_CMD_SEND_SRC_DOWNTIME */
> +    uint64_t downtime_start;
> +    /*
> +     * Current downtime on the destination, initially set equal to the
> +     * source's by MIG_CMD_SEND_SRC_DOWNTIME, then updated by the
> +     * destination itself.
> +     */
> +    uint64_t downtime_now;

Is this needed?

> +    /*
> +     * Abort live migration on the destination when the combined source and
> +     * destination downtime exceeds abort_limit. abort_limit is set by
> +     * MIG_CMD_SEND_SRC_DOWNTIME sent from the source.
> +     */
> +    uint64_t abort_limit;

If we assume both QEMUs are configured with the same caps/params, we may
not need this and can directly read the parameter on the dest.
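
Something like the below, maybe (untested sketch; it assumes the dest runs
with the same downtime_limit/switchover_limit values as the src, and that
the src would then only need to put its current downtime on the wire):

static int loadvm_handle_src_downtime(MigrationIncomingState *mis,
                                      uint16_t len)
{
    MigrationState *s = migrate_get_current();

    /* Derive the limit locally instead of trusting a value on the wire */
    mis->abort_limit = s->parameters.downtime_limit +
                       s->parameters.switchover_limit;
    mis->src_downtime = qemu_get_be64(mis->from_src_file);
    mis->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

    trace_loadvm_handle_src_downtime(mis->abort_limit, mis->src_downtime);
    return 0;
}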

> +    uint64_t src_downtime;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 031ab03915..f3b5ea98bf 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -90,6 +90,7 @@ enum qemu_vm_cmd {
>      MIG_CMD_ENABLE_COLO,       /* Enable COLO */
>      MIG_CMD_POSTCOPY_RESUME,   /* resume postcopy on dest */
>      MIG_CMD_RECV_BITMAP,       /* Request for recved bitmap on dst */
> +    MIG_CMD_SEND_SRC_DOWNTIME,    /* Send current downtime to dst */
>      MIG_CMD_MAX
>  };
>  
> @@ -109,6 +110,7 @@ static struct mig_cmd_args {
>      [MIG_CMD_POSTCOPY_RESUME]  = { .len =  0, .name = "POSTCOPY_RESUME" },
>      [MIG_CMD_PACKAGED]         = { .len =  4, .name = "PACKAGED" },
>      [MIG_CMD_RECV_BITMAP]      = { .len = -1, .name = "RECV_BITMAP" },
> +    [MIG_CMD_SEND_SRC_DOWNTIME] = { .len = -1, .name = "SEND_SRC_DOWNTIME" },
>      [MIG_CMD_MAX]              = { .len = -1, .name = "MAX" },
>  };
>  
> @@ -1218,6 +1220,18 @@ void qemu_savevm_send_recv_bitmap(QEMUFile *f, char *block_name)
>      qemu_savevm_command_send(f, MIG_CMD_RECV_BITMAP, len + 1, (uint8_t *)buf);
>  }
>  
> +void qemu_savevm_send_downtime(QEMUFile *f, int64_t abort_limit_ms,
> +                               int64_t source_downtime)
> +{
> +    uint64_t tmp[2];
> +    tmp[0] = cpu_to_be64(abort_limit_ms);
> +    tmp[1] = cpu_to_be64(source_downtime);
> +
> +    trace_qemu_savevm_send_downtime(abort_limit_ms, source_downtime);
> +    qemu_savevm_command_send(f, MIG_CMD_SEND_SRC_DOWNTIME,
> +                             16, (uint8_t *)tmp);
> +}
> +
>  bool qemu_savevm_state_blocked(Error **errp)
>  {
>      SaveStateEntry *se;
> @@ -1635,6 +1649,14 @@ int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
>          }
>      }
>  
> +    if (migrate_switchover_abort()) {
> +        MigrationState *s = migrate_get_current();
> +        uint64_t abort_limit_ms =
> +            s->parameters.downtime_limit + s->parameters.switchover_limit;
> +        qemu_savevm_send_downtime(f, abort_limit_ms,
> +                                  migration_get_current_downtime(s));
> +    }
> +
>      if (iterable_only) {
>          goto flush;
>      }
> @@ -1919,6 +1941,20 @@ static int loadvm_postcopy_handle_advise(MigrationIncomingState *mis,
>      return 0;
>  }
>  
> +static int loadvm_handle_src_downtime(MigrationIncomingState *mis,
> +                                      uint16_t len)
> +{
> +    uint64_t src_abort_limit = qemu_get_be64(mis->from_src_file);
> +    uint64_t src_current_downtime = qemu_get_be64(mis->from_src_file);
> +
> +    mis->abort_limit = src_abort_limit;
> +    mis->src_downtime = src_current_downtime;
> +    mis->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

I guess this doesn't account for the delay of delivering this packet.  Since
the whole point of this feature is "make sure downtime is less than xxx, or
cancel the migration", I think the real downtime might still end up higher
than expected.  Not sure how big a concern this is.

E.g., have you measured how much a packet can be delayed when the socket
pipeline is mostly full and this SRC_DOWNTIME message is queued after all of
that?  I think that's very possibly the case here at switchover, where the
src is trying to dump data as fast as possible.  I'm not sure whether it's
easy to measure, either..
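
If you want a rough way to look at it, one option could be to also stuff
the src's send timestamp into the command and print the delta on the dest.
Rough sketch below; the third field and the trace point are made up here,
and the delta is only meaningful when the two hosts' clocks are reasonably
in sync (e.g. via NTP):

/* Src side: carry the send timestamp as a third field */
void qemu_savevm_send_downtime(QEMUFile *f, int64_t abort_limit_ms,
                               int64_t source_downtime)
{
    uint64_t tmp[3];

    tmp[0] = cpu_to_be64(abort_limit_ms);
    tmp[1] = cpu_to_be64(source_downtime);
    /* Taken right before the command is queued behind the rest of the data */
    tmp[2] = cpu_to_be64(qemu_clock_get_ms(QEMU_CLOCK_REALTIME));

    trace_qemu_savevm_send_downtime(abort_limit_ms, source_downtime);
    qemu_savevm_command_send(f, MIG_CMD_SEND_SRC_DOWNTIME,
                             sizeof(tmp), (uint8_t *)tmp);
}

/* Dst side, in loadvm_handle_src_downtime(), after the two be64 reads: */
    uint64_t src_send_ts = qemu_get_be64(mis->from_src_file);
    uint64_t recv_ts = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

    /* Hypothetical trace point; only a cross-host estimate of the delay */
    trace_loadvm_src_downtime_wire_delay(recv_ts - src_send_ts);

That at least gives a ballpark of how long the command sat behind the rest
of the stream.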

> +
> +    trace_loadvm_handle_src_downtime(src_abort_limit, src_current_downtime);
> +    return 0;
> +}
> +
>  /* After postcopy we will be told to throw some pages away since they're
>   * dirty and will have to be demand fetched.  Must happen before CPU is
>   * started.
> @@ -2540,6 +2576,9 @@ static int loadvm_process_command(QEMUFile *f)
>  
>      case MIG_CMD_ENABLE_COLO:
>          return loadvm_process_enable_colo(mis);
> +
> +    case MIG_CMD_SEND_SRC_DOWNTIME:
> +        return loadvm_handle_src_downtime(mis, len);
>      }
>  
>      return 0;
> @@ -2659,6 +2698,18 @@ qemu_loadvm_section_start_full(QEMUFile *f, MigrationIncomingState *mis,
>          trace_vmstate_downtime_load("non-iterable", se->idstr,
>                                      se->instance_id, end_ts - start_ts);
>      }
> +    if (migrate_switchover_abort() && type == QEMU_VM_SECTION_FULL &&
> +        mis->downtime_start) {
> +        mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +        uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
> +        if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
> +            error_report("Shutdown destination migration, migration 
> abort_limit"
> +                         "(%lu ms) was reached.", mis->abort_limit);
> +            trace_qemu_loadvm_downtime_abort(mis->abort_limit, dst_downtime,
> +                                             mis->src_downtime);
> +            return -EINVAL;
> +        }
> +    }
>  
>      if (!check_section_footer(f, se)) {
>          return -EINVAL;
> @@ -2714,6 +2765,19 @@ qemu_loadvm_section_part_end(QEMUFile *f, MigrationIncomingState *mis,
>                                      se->instance_id, end_ts - start_ts);
>      }
>  
> +    if (migrate_switchover_abort() && type == QEMU_VM_SECTION_END &&
> +        mis->downtime_start) {
> +        mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +        uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
> +        if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
> +            error_report("Shutdown destination migration, migration 
> abort_limit (%lu ms)"
> +                          "was reached.", mis->abort_limit);
> +            trace_qemu_loadvm_downtime_abort(mis->abort_limit, dst_downtime,
> +                                             mis->src_downtime);
> +            return -EINVAL;
> +        }
> +    }

So this traps both the iteration and non-iteration phases.  What if the
downtime is caused after these, or, unluckily, at the last non-iterable
device state?

After trying to think about it, I figured this is not easy to do right.
Also, I'm starting to doubt whether having this is a good idea in the
first place.

Firstly, I'm wondering how we should treat this new feature
vs. downtime_limit, i.e. how the user should set the two.

If this is about "cancel the migration if the downtime is more than xxx",
then AFAICT that's exactly what downtime_limit was "designed" to be..
It's just that downtime_limit says it the other way round: "don't
switch over if the downtime will be more than xxx".

Then, if the user has the option to set both of these parameters, what
would be the right thing to do?  Shouldn't they simply always set both
parameters to the same value?  But if so, what's the point of having two?

This made me wonder whether the 2nd parameter is needed at all, or whether
we should instead simply make downtime_limit more accurate, so that it
starts to account for more things than before.  It won't be perfectly
accurate, but the same applies here: making this new feature "accurate"
can also be very hard.

Thanks,

-- 
Peter Xu