On Thu, 2025-10-09 at 02:58 -0700, Dongli Zhang wrote:
> So far, QEMU/KVM live migration does not account all elapsed blackout
> downtimes. For example, if a guest is live-migrated to a file, left idle
> for one hour, and then restored from that file to the target host, the
> one-hour blackout period will not be reflected in the kvm-clock or guest
> TSC.
>
> Typically, the elapsed time between KVM_GET_CLOCK (on the source QEMU) and
> KVM_SET_CLOCK (on the target QEMU) is not accounted in the kvm-clock.
> Similarly, the elapsed time between reading MSR_IA32_TSC on the source QEMU
> and writing it on the target QEMU is not reflected in the guest TSC.
>
> The KVM patchset [1] introduced KVM_VCPU_TSC_CTRL, KVM_CLOCK_REALTIME, and
> KVM_CLOCK_HOST_TSC to account the elapsed time during live migration
> blackouts in the guest's system counter view.
>
> The core idea is to use the realtime clock (KVM_CLOCK_REALTIME) from both
> the source and target hosts as a reference to calculate the elapsed
> downtime in nanoseconds and adjust kvm-clock.

Nah, don't do that.

For a start, never use CLOCK_REALTIME. You should use CLOCK_TAI. Leap
seconds can still occur at least for the next few years, and
CLOCK_REALTIME isn't monotonic.

(Hopefully the BIPM will see sense and continue doing leap seconds even
after 2035, rather than kicking the can down the road and creating a
new, larger y2k/y2038-style problem for the future. But regardless of
that, they've given us the worst of all worlds by *both* pandering to
broken software *and* not actually stopping immediately, so for now you
still have to avoid having those bugs until 2035. And... your
great-grandchildren might thank you for not introducing those bugs even
if there isn't a leap second before you retire?)

Secondly, on all sane modern hardware the kvmclock should be a fixed
function of the guest's TSC, and that relationship shouldn't change for
the whole lifetime of the guest.

We should set the guest *TSC* as accurately as we can (with offsets if
it's on the same hardware after a live update, via CLOCK_TAI if it's a
live migration).

And the y=mx+c relationship of the KVM clock should be put back
*exactly* as it was, with KVM selftests which validate that the bits
seen in the PVTI by the guest literally DID NOT CHANGE. Or at least,
that they give precisely the same time result for all historical TSC
readings as they did at the time.

So:
 1. Set the TSC, from offset or CLOCK_TAI.
 2. Set the TSC→kvmclock relationship.

Let's get the kernel APIs in place to support that, and then make qemu
do it right. I'm not sure I see any value in half-measures.
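
To make the two steps concrete, here is a rough userspace sketch of the
arithmetic only. It is not kernel or QEMU code. The struct and function
names (pvti_params, kvmclock_ns, tsc_after_blackout) are made up for
illustration; the scaling in kvmclock_ns() follows the usual pvclock
shift-then-multiply convention, and unsigned __int128 assumes GCC or
Clang on a 64-bit target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the PVTI fields that define y = mx + c. */
struct pvti_params {
    uint64_t tsc_timestamp;     /* guest TSC at which system_time was sampled */
    uint64_t system_time;       /* kvmclock ns at tsc_timestamp (the "c") */
    uint32_t tsc_to_system_mul; /* 32-bit fractional multiplier (the "m") */
    int8_t   tsc_shift;         /* pre-shift applied to the TSC delta */
};

/* The TSC -> kvmclock relationship: a pure function of the guest TSC. */
static uint64_t kvmclock_ns(const struct pvti_params *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_timestamp;

    if (p->tsc_shift >= 0)
        delta <<= p->tsc_shift;
    else
        delta >>= -p->tsc_shift;

    /* 64x32 multiply, keeping the upper bits of the 32.32 product. */
    return p->system_time +
           (uint64_t)(((unsigned __int128)delta * p->tsc_to_system_mul) >> 32);
}

/*
 * Step 1: advance the guest TSC by the blackout, measured as the
 * difference of two CLOCK_TAI readings (in ns) taken on the source and
 * destination hosts. Assumes both hosts' TAI clocks are synchronised.
 */
static uint64_t tsc_after_blackout(uint64_t src_tsc, uint64_t src_tai_ns,
                                   uint64_t dst_tai_ns, uint64_t tsc_hz)
{
    assert(dst_tai_ns >= src_tai_ns);
    uint64_t elapsed_ns = dst_tai_ns - src_tai_ns;

    return src_tsc +
           (uint64_t)(((unsigned __int128)elapsed_ns * tsc_hz) / 1000000000u);
}
```

Step 2 is then "restore the same pvti_params untouched": because the
destination TSC already includes the blackout, kvmclock_ns() on the
destination advances by the same amount without any separate kvmclock
adjustment, and every historical TSC reading still maps to the same
time it did on the source.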
