Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

Joao Martins Tue, 25 Jun 2024 04:39:52 -0700

On 24/06/2024 20:41, Peter Xu wrote:
> On Fri, Jun 21, 2024 at 07:32:21AM -0700, Elena Ufimtseva wrote:
>> @@ -2659,6 +2698,18 @@ qemu_loadvm_section_start_full(QEMUFile *f, 
>> MigrationIncomingState *mis,
 >>      if (!check_section_footer(f, se)) {
>>          return -EINVAL;
>> @@ -2714,6 +2765,19 @@ qemu_loadvm_section_part_end(QEMUFile *f, 
>> MigrationIncomingState *mis,
>>                                      se->instance_id, end_ts - start_ts);
>>      }
>>  
>> +    if (migrate_switchover_abort() && type == QEMU_VM_SECTION_END &&
>> +        mis->downtime_start) {
>> +        mis->downtime_now = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>> +        uint64_t dst_downtime = mis->downtime_now - mis->downtime_start;
>> +        if (mis->src_downtime + dst_downtime >= mis->abort_limit) {
>> +            error_report("Shutdown destination migration, migration 
>> abort_limit (%lu ms)"
>> +                          "was reached.", mis->abort_limit);
>> +            trace_qemu_loadvm_downtime_abort(mis->abort_limit, dst_downtime,
>> +                                             mis->src_downtime);
>> +            return -EINVAL;
>> +        }
>> +    }
> 
> So this traps both iteration and non-iteration phase.  What if the downtime
> was caused by after these, or unluckily at the last non-iterator device
> state?
> 
> After trying to think about it, I figured this is not easy to do right.
> Also, I start to doubt whether it's definitely a good idea on having this
> in the first place.
> 
> Firstly, I'm wondering how we should treat this new feature
> v.s. downtime_limit.  It's about how the user should set both.
> 
> If this is about "cancel migration if downtime more than xxx",
> then.. AFAICT that's exactly what downtime_limit was "designed" to be..
> It's just that downtime_limit says the other way round, as: "don't
> switchover if the downtime will be more than xxx".
> 
> Then, if the user has option to set both these parameters, what would be
> the right thing to do?  Shouldn't they simply always set both parameters to
> be the same value already?  But if so, what's the point on having two?
> 
> This caused me to think whether the 2nd parameter is needed at all, instead
> whether we should simply make downtime_limit more accurate, so that it will
> start to account more things than before.  It won't be accurate, but the
> same case here: making this new feature "accurate" can also be very hard.
>


The way I think about it is that current downtime-limit captures nicely the data
part as the only calculations it cares about is how much outstanding data it
sends to the destination (be it VF device state or RAM). This second parameter
captures what is *not* related to data, i.e. costs of hypervisor quiescing the
VM or added latencies in hypervisor *in addition* to sending outstanding data to
destination.

If we were to merge this all into a single parameter (aka downtime-limit) we are
possibility artificially increasing the downtime thanks to relaxing the
oustanding data part, as opposed to trying to capture the switchover cost (hence
the name the parameter) that can't be reflected in the former equation.

So I feel like this needs two parameters whereby this second new parameter
covers 'switchover cost' (hence the name) with current migration algorithm.

        Joao

Re: [PATCH RFC 2/2] migration: abort on destination if switchover limit exceeded

Reply via email to