On Wed, Jun 26, 2024 at 12:04:43PM +0100, Joao Martins wrote:
> Are you thinking in something specifically?

Not really.  I don't think I have any idea on how to make it better,
unfortunately, but we did do some measurements quite some time ago, and I
can share some below.

> 
> Many "variables" affect this from the point we decide switchover, and at the
> worst (likely) case it means having qemu subsystems declare empirical values on
> how long it takes to suspend/resume/transfer-state to migration expected
> downtime prediction equation. Part of the reason that having headroom within
> downtime-limit was a simple 'catch-all' (from our PoV) in terms of
> maintainability while giving user something to fallback for characterizing its
> SLA.

Yes, I think this might be a way to go, by starting with something that can
catch all.

> Personally, I think there's a tiny bit disconnect between what the user
> desires when setting downtime-limit vs what it really does. downtime-limit right
> now looks to be best viewed as 'precopy-ram-downtime-limit' :)

That's fair to say indeed.  QEMU can try to do better on this; it's just
not yet straightforward to know how.

> Unless the accuracy work you're thinking is just having a better
> migration algorithm at obtaining the best possible downtime for
> outstanding-data/RAM *even if* downtime-limit is set at a high limit,
> like giving 1) a grace period in the beginning of migration post first
> dirty sync

Can you elaborate on this one a bit?

> or 2) a measured value with continually incrementing target downtime
> limit until max downtime-limit set by user hits ... before defaulting to
> the current behaviour of migrating as soon as expected downtime is within
> the downtime-limit. As discussed in the last response, this could create
> the 'downtime headroom' for getting the enforcement/SLA better
> honored. Is this maybe your line of thinking?

Not what I was referring to, but I think such logic has existed for years;
it just was never implemented in QEMU.  I know at least OpenStack implements
exactly that: instead of QEMU keeping an internal, smaller downtime_limit
and increasing it, OpenStack keeps adjusting the downtime_limit parameter
from time to time, starting with a relatively low value.

That is also what I would suggest to most people who care about downtime,
because QEMU treats it pretty simply: if QEMU thinks it can switch over
within the downtime specified, it will just do it, even if that's not the
best it can do.
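
Just to illustrate the idea (this is not the actual OpenStack code; the
socket path, URI, step sizes and intervals below are made-up placeholders),
such a management-side ramp-up over QMP could look roughly like this:

#!/usr/bin/env python3
# Sketch: start migration with a small downtime-limit and raise it
# periodically until it completes (or we reach the user's real SLA).
import json
import socket
import time

def qmp_command(sockfile, sock, cmd, args=None):
    """Send one QMP command, return its 'return' payload, skipping events."""
    msg = {"execute": cmd}
    if args:
        msg["arguments"] = args
    sock.sendall(json.dumps(msg).encode() + b"\r\n")
    while True:
        reply = json.loads(sockfile.readline())
        if "return" in reply:
            return reply["return"]
        if "error" in reply:
            raise RuntimeError(reply["error"])
        # anything else is an async event; ignore it

def migrate_with_ramp(qmp_path, uri, start_ms=50, step_ms=50,
                      max_ms=500, interval_s=5):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(qmp_path)
    sockfile = sock.makefile("r")
    sockfile.readline()                               # consume QMP greeting
    qmp_command(sockfile, sock, "qmp_capabilities")

    limit = start_ms
    qmp_command(sockfile, sock, "migrate-set-parameters",
                {"downtime-limit": limit})            # milliseconds
    qmp_command(sockfile, sock, "migrate", {"uri": uri})

    while True:
        time.sleep(interval_s)
        status = qmp_command(sockfile, sock, "query-migrate").get("status")
        if status in ("completed", "failed", "cancelled"):
            return status
        # Not converging yet: allow a bit more downtime, up to the real SLA.
        if limit < max_ms:
            limit = min(limit + step_ms, max_ms)
            qmp_command(sockfile, sock, "migrate-set-parameters",
                        {"downtime-limit": limit})

if __name__ == "__main__":
    print(migrate_with_ramp("/tmp/qmp.sock", "tcp:dest-host:4444"))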

Do you think such an idea should be implemented in QEMU too, instead?  Note
that this would also not be about "making downtime accurate" but about
"reducing downtime"; it may depend on how we define downtime_limit in this
context, since in OpenStack's case that parameter simply isn't fed directly
with the real maximum downtime the user allows.

That wasn't my original purpose, though; what I meant was simply to look
for ways to make downtime_limit accurate, by analyzing the current downtimes
(as you mentioned, using the downtime tracepoints; and I'd say kudos to you
for suggesting that in a formal patch).

Here's something our QE team collected, for example, on a pretty loaded
system with 384 cores + 12TB of RAM:

Checkpoints analysis:

            downtime-start ->               vm-stopped:             267635.2 (us)
                vm-stopped ->           iterable-saved:            3558506.2 (us)
            iterable-saved ->       non-iterable-saved:             270352.2 (us)
        non-iterable-saved ->             downtime-end:             144264.2 (us)
                                        total downtime:            4240758.0 (us)

Iterable device analysis:

  Device SAVE of                                      ram:  0 took    3470420 (us)

Non-iterable device analysis:

  Device SAVE of                                      cpu:121 took     118090 (us)
  Device SAVE of                                     apic:167 took       6899 (us)
  Device SAVE of                                      cpu:296 took       3795 (us)
  Device SAVE of             0000:00:02.2:00.0/virtio-blk:  0 took        638 (us)
  Device SAVE of                                      cpu:213 took        630 (us)
  Device SAVE of             0000:00:02.0:00.0/virtio-net:  0 took        534 (us)
  Device SAVE of                                      cpu:374 took        517 (us)
  Device SAVE of                                      cpu: 31 took        503 (us)
  Device SAVE of                                      cpu:346 took        497 (us)
  Device SAVE of             0000:00:02.0:00.1/virtio-net:  0 took        492 (us)
  (1341 vmsd omitted)

In this case we also see SLA violations, since we specified something much
lower than 4.2s as the downtime_limit.
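
(The checkpoint breakdown above is essentially just the deltas between
consecutive downtime tracepoint timestamps; roughly something like the
sketch below, where the timestamp values are placeholders rather than the
real trace data.)

# Sketch: derive the per-phase breakdown from one timestamp (in
# microseconds) per downtime checkpoint; the values here are placeholders.
checkpoints = {
    "downtime-start":           0.0,
    "vm-stopped":          268000.0,
    "iterable-saved":     3826000.0,
    "non-iterable-saved": 4097000.0,
    "downtime-end":       4241000.0,
}

names = list(checkpoints)
for prev, cur in zip(names, names[1:]):
    delta = checkpoints[cur] - checkpoints[prev]
    print(f"{prev:>26} -> {cur:>24}: {delta:>12.1f} (us)")
total = checkpoints["downtime-end"] - checkpoints["downtime-start"]
print(f"{'total downtime':>54}: {total:>12.1f} (us)")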

This might not be a good example, as I think when we captured these traces
we still had the issue of inaccurate bandwidth estimations, which was why I
introduced the switchover-bandwidth parameter.  I had hoped the result could
get closer to downtime_limit after that, but we never tried to test again.
I'm not sure either whether that's the best way to address this.

But let's just ignore the huge iterable save() delays (which can be
explained, and hopefully will still be covered by the downtime_limit
calculation once it manages to get closer to right), and we can still see
at least a few things we didn't account for:

  - stop vm: 268ms
  - non-iterables: 270ms
  - dest load until complete: 144ms

Those three alone already add up to ~680ms of downtime that has nothing to
do with transferring the outstanding RAM.

For the last one, we did see another outlier that can only be seen from the
dest side:

Non-iterable device analysis:

  Device LOAD of                              kvm-tpr-opt:  0 took     123976 (us)  <----- this one
  Device LOAD of              0000:00:02.0/pcie-root-port:  0 took       6362 (us)
  Device LOAD of             0000:00:02.0:00.0/virtio-net:  0 took       4583 (us)
  Device LOAD of             0000:00:02.0:00.1/virtio-net:  0 took       4440 (us)
  Device LOAD of                         0000:00:01.0/vga:  0 took       3740 (us)
  Device LOAD of                         0000:00:00.0/mch:  0 took       3557 (us)
  Device LOAD of             0000:00:02.2:00.0/virtio-blk:  0 took       3530 (us)
  Device LOAD of                   0000:00:02.1:00.0/xhci:  0 took       2712 (us)
  Device LOAD of              0000:00:02.1/pcie-root-port:  0 took       2046 (us)
  Device LOAD of              0000:00:02.2/pcie-root-port:  0 took       1890 (us)

So we found either a cpu save() taking 100+ms, or a kvm-tpr-opt load()
taking 100+ms.  Neither of them sounds normal, and I didn't look into them.

Now, if a global ratio starts to reflect "how much of downtime_limit we
should account to data transfer", then we'll also need to answer how the
user should set that ratio value, and maybe there's a sane way to calculate
it from the VM setup?  I'm not sure, but those questions may need to be
answered together in the next post, so that such a parameter can be
consumable.
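
To make that a bit more concrete, here is one way such a ratio might feed
into the existing switchover check (today roughly "pending bytes /
bandwidth <= downtime_limit"); the function and numbers below are purely
illustrative, not a proposal for the actual code:

# Illustrative only: a "downtime ratio" shrinking the transfer budget used
# by the switchover decision; all names and values here are hypothetical.
def can_switchover(pending_bytes, bandwidth_bytes_per_s,
                   downtime_limit_s, transfer_ratio):
    # Only a fraction of downtime_limit is granted to data transfer; the
    # rest is headroom for stopping the VM, saving non-iterable devices
    # and the destination-side load.
    transfer_budget = downtime_limit_s * transfer_ratio
    expected_transfer = pending_bytes / bandwidth_bytes_per_s
    return expected_transfer <= transfer_budget

# With the measurements above (~680ms of non-transfer downtime), a 1s
# downtime_limit would need a ratio around 0.3 to leave enough headroom;
# how a user (or QEMU) should derive that number is the open question.
print(can_switchover(pending_bytes=256 << 20,         # 256MB outstanding
                     bandwidth_bytes_per_s=10 << 30,  # ~10GB/s link
                     downtime_limit_s=1.0,
                     transfer_ratio=0.3))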

The answer doesn't need to be accurate, but I hope it can be based on some
analysis similar to the above (which I didn't do well; I don't think I
looked into any of the issues, and maybe they're fixable).  It's just to
show what I meant.  It would also be great if, while doing such analysis,
we find the issues fixable, in which case it would be even better to fix
the issues instead.  That's the part where I mentioned "I still prefer
fixing downtime_limit itself".

Thanks,

-- 
Peter Xu

