On Wed, Jun 26, 2024 at 12:04:43PM +0100, Joao Martins wrote:
> Are you thinking in something specifically?

Not really. I don't think I have any idea on how to make it better,
unfortunately, but we did some measurement too quite some time ago and I
can share some below.

> Many "variables" affect this from the point we decide switchover, and at the
> worst (likely) case it means having qemu subsystems declare empirical values
> on how long it takes to suspend/resume/transfer-state to migration expected
> downtime prediction equation. Part of the reason that having headroom within
> downtime-limit was a simple 'catch-all' (from our PoV) in terms of
> maintainability while giving user something to fallback for characterizing its
> SLA.

Yes, I think this might be a way to go, by starting with something that can
catch all.

> Personally, I think there's a tiny bit disconnect between what the user
> desires when setting downtime-limit vs what it really does. downtime-limit
> right now looks to be best viewed as 'precopy-ram-downtime-limit' :)

That's fair to say indeed.. QEMU can try to do better on this, it's just
not yet straightforward to know how.

> Unless the accuracy work you're thinking is just having a better
> migration algorithm at obtaining the best possible downtime for
> outstanding-data/RAM *even if* downtime-limit is set at a high limit,
> like giving 1) a grace period in the beginning of migration post first
> dirty sync

Can you elaborate on this one a bit?

> or 2) a measured value with continually incrementing target downtime
> limit until max downtime-limit set by user hits ... before defaulting to
> the current behaviour of migrating as soon as expected downtime is within
> the downtime-limit. As discussed in the last response, this could create
> the 'downtime headroom' for getting the enforcement/SLA better
> honored. Is this maybe your line of thinking?

Not what I was referring, but I think such logic existed for years, it was
just not implemented in QEMU.
I know at least OpenStack implemented exactly that: instead of keeping an
internal smaller downtime_limit and increasing it, OpenStack keeps adjusting
the downtime_limit parameter from time to time, starting with a relatively
low value. That is also what I would suggest to most people who care about
downtime, because QEMU treats it pretty simply: if QEMU thinks it can
switch over within the downtime specified, QEMU will just do it, even if
it's not the best it can do.

Do you think such an idea should be implemented in QEMU instead, too? Note
that this would not be about "making downtime accurate" but about "reducing
downtime". It may depend on how we define downtime_limit in the context,
perhaps; in OpenStack's case it simply won't directly feed that parameter
with the real max downtime the user allows.

Since that wasn't my original purpose, what I meant was simply to look for
ways to make downtime_limit accurate, by analyzing the current downtimes
(as you mentioned, using the downtime tracepoints; and I'd say kudos to you
on suggesting that in a formal patch).
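To make the step-up idea concrete, here is a minimal sketch of a management
layer that starts with a low limit and raises it over time. It is only an
illustration under stated assumptions: the linear ramp, the function names
and the three callbacks (set_limit, migration_done, wait) are all
hypothetical, and this is not OpenStack's actual algorithm nor anything that
exists in QEMU. A real caller would presumably back set_limit with QMP
migrate-set-parameters (downtime-limit, in milliseconds).

```python
from typing import Callable, List


def downtime_schedule(max_downtime_ms: int, steps: int, start_ms: int) -> List[int]:
    """Return `steps` increasing limits, from start_ms up to max_downtime_ms.

    Hypothetical linear ramp; a real policy could be exponential or adaptive.
    """
    if steps < 2 or not 0 < start_ms <= max_downtime_ms:
        raise ValueError("need steps >= 2 and 0 < start_ms <= max_downtime_ms")
    span = max_downtime_ms - start_ms
    return [start_ms + span * i // (steps - 1) for i in range(steps)]


def ramp_downtime_limit(
    schedule: List[int],
    set_limit: Callable[[int], None],     # hypothetical: applies the limit to the migration
    migration_done: Callable[[], bool],   # hypothetical: polls migration status
    wait: Callable[[], None],             # hypothetical: sleeps between steps
) -> int:
    """Apply each limit in turn until the migration completes.

    Returns the limit that was in effect when the migration finished, or the
    last (user-specified) limit if it never finished during the ramp.
    """
    for limit_ms in schedule:
        set_limit(limit_ms)
        wait()
        if migration_done():
            return limit_ms
    return schedule[-1]
```

With this shape the value handed to QEMU is never the user's real maximum
until the last step, which is the point above: the parameter is used to
reduce downtime, not to make it accurate.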
Here's something we collected by our QE team, for example, on a pretty
loaded system of 384 cores + 12TB:

Checkpoints analysis:

      downtime-start ->         vm-stopped:   267635.2 (us)
          vm-stopped ->     iterable-saved:  3558506.2 (us)
      iterable-saved -> non-iterable-saved:   270352.2 (us)
  non-iterable-saved ->       downtime-end:   144264.2 (us)
                            total downtime:  4240758.0 (us)

Iterable device analysis:

  Device SAVE of ram: 0                          took 3470420 (us)

Non-iterable device analysis:

  Device SAVE of cpu:121                         took  118090 (us)
  Device SAVE of apic:167                        took    6899 (us)
  Device SAVE of cpu:296                         took    3795 (us)
  Device SAVE of 0000:00:02.2:00.0/virtio-blk: 0 took     638 (us)
  Device SAVE of cpu:213                         took     630 (us)
  Device SAVE of 0000:00:02.0:00.0/virtio-net: 0 took     534 (us)
  Device SAVE of cpu:374                         took     517 (us)
  Device SAVE of cpu: 31                         took     503 (us)
  Device SAVE of cpu:346                         took     497 (us)
  Device SAVE of 0000:00:02.0:00.1/virtio-net: 0 took     492 (us)
  (1341 vmsd omitted)

In this case we also see the SLA violations since we specified something
much lower than 4.2sec as downtime_limit.

This might not be a good example, as I think when capturing the traces we
used to still have the issue on inaccurate bw estimations, and that was why
I introduced switchover-bandwidth parameter, I wished after that the result
can be closer to downtime_limit but we never tried to test again. I am not
sure either on whether that's the best way to address this.
But let's just ignore the huge iterable save() delays (which can be
explained, and hopefully will still be covered by the downtime_limit
calculations once they get closer to right), and we can also see at least a
few things we didn't account for:

  - stop vm:                  268ms
  - non-iterables:            270ms
  - dest load until complete: 144ms

For the last one, we did see another outlier which can only be seen from
dest:

Non-iterable device analysis:

  Device LOAD of kvm-tpr-opt: 0                  took  123976 (us)  <----- this one
  Device LOAD of 0000:00:02.0/pcie-root-port: 0  took    6362 (us)
  Device LOAD of 0000:00:02.0:00.0/virtio-net: 0 took    4583 (us)
  Device LOAD of 0000:00:02.0:00.1/virtio-net: 0 took    4440 (us)
  Device LOAD of 0000:00:01.0/vga: 0             took    3740 (us)
  Device LOAD of 0000:00:00.0/mch: 0             took    3557 (us)
  Device LOAD of 0000:00:02.2:00.0/virtio-blk: 0 took    3530 (us)
  Device LOAD of 0000:00:02.1:00.0/xhci: 0       took    2712 (us)
  Device LOAD of 0000:00:02.1/pcie-root-port: 0  took    2046 (us)
  Device LOAD of 0000:00:02.2/pcie-root-port: 0  took    1890 (us)

So we found either cpu save() taking 100+ms, or kvm-tpr-opt load() taking
100+ms. Neither of them sounds normal, and I didn't look into them.

Now with a global ratio perhaps starting to reflect "how much ratio of
downtime_limit should we account into data transfer", we'll also need to
answer how the user should set that ratio value, and maybe there's a sane
way to calculate that from the VM setup? I'm not sure, but those questions
may need to be answered together in the next post, so that such a parameter
can be consumable.

The answer doesn't need to be accurate, but I hope it can be based on some
analysis similar to the above (where I didn't do it well, as I don't think I
looked into any of the issues, and maybe they're fix-able). But just to
show what I meant.

It would also be great if, while doing the analysis, we find issues that are
fix-able; then we should fix the issues instead. That's the part where I
mentioned "I still prefer fixing downtime_limit itself".
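As a rough illustration of how such a ratio could be derived from an
analysis like the one above, here is a small sketch that sums the
checkpoint phases not spent saving iterable state and turns that into a
data-transfer share of a given downtime_limit. The checkpoint numbers are
the ones from the trace above; the helper names and the 2000ms limit in the
usage note are made up for the example (the limit used in that run isn't
stated here), and the formula itself is just one possible assumption, not
what QEMU computes.

```python
from typing import Dict

# Checkpoint durations from the trace above, in microseconds.
CHECKPOINTS_US: Dict[str, float] = {
    "downtime-start -> vm-stopped": 267635.2,
    "vm-stopped -> iterable-saved": 3558506.2,
    "iterable-saved -> non-iterable-saved": 270352.2,
    "non-iterable-saved -> downtime-end": 144264.2,
}

# The phase that the current downtime estimation is meant to cover.
ITERABLE_PHASE = "vm-stopped -> iterable-saved"


def fixed_overhead_ms(checkpoints_us: Dict[str, float],
                      iterable_phase: str = ITERABLE_PHASE) -> float:
    """Sum every phase except the iterable save, converted to milliseconds."""
    other_us = sum(us for name, us in checkpoints_us.items()
                   if name != iterable_phase)
    return other_us / 1000.0


def transfer_ratio(downtime_limit_ms: float, overhead_ms: float) -> float:
    """Share of downtime_limit left for data transfer (hypothetical formula).

    Clamped at 0.0 when the measured overhead alone exceeds the limit.
    """
    if downtime_limit_ms <= 0:
        raise ValueError("downtime_limit_ms must be positive")
    return max(0.0, (downtime_limit_ms - overhead_ms) / downtime_limit_ms)
```

For the trace above fixed_overhead_ms(CHECKPOINTS_US) comes to roughly
682ms (268 + 270 + 144), so with a made-up limit of 2000ms about two thirds
of the budget would be left for data transfer, and with any limit below
682ms the ratio clamps to zero, i.e. the limit cannot be honored no matter
how well the RAM part is estimated.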
Thanks,

-- 
Peter Xu