On Thu, 2025-10-09 at 02:58 -0700, Dongli Zhang wrote:
> So far, QEMU/KVM live migration does not account all elapsed blackout
> downtimes. For example, if a guest is live-migrated to a file, left idle
> for one hour, and then restored from that file to the target host, the
> one-hour blackout period will not be reflected in the kvm-clock or guest
> TSC.
> 
> Typically, the elapsed time between KVM_GET_CLOCK (on the source QEMU) and
> KVM_SET_CLOCK (on the target QEMU) is not accounted in the kvm-clock.
> Similarly, the elapsed time between reading MSR_IA32_TSC on the source QEMU
> and writing it on the target QEMU is not reflected in the guest TSC.
> 
> The KVM patchset [1] introduced KVM_VCPU_TSC_CTRL, KVM_CLOCK_REALTIME, and
> KVM_CLOCK_HOST_TSC to account the elapsed time during live migration
> blackouts in the guest's system counter view.
> 
> The core idea is to use the realtime clock (KVM_CLOCK_REALTIME) from both
> the source and target hosts as a reference to calculate the elapsed
> downtime in nanoseconds and adjust kvm-clock. 

Nah, don't do that.

For a start, never use CLOCK_REALTIME. You should use CLOCK_TAI. Leap
seconds can still occur at least for the next few years, and
CLOCK_REALTIME isn't monotonic.
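
To make that concrete, here is a minimal sketch of taking the reference timestamp from CLOCK_TAI instead. The helper names (tai_now_ns, blackout_ns) are made up for illustration and are not existing QEMU or KVM functions. It also assumes both hosts have their TAI offset configured (typically by the NTP daemon); where the kernel's TAI offset is still 0, CLOCK_TAI reads the same as CLOCK_REALTIME and buys you nothing.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11 /* Linux clock id, in case the libc headers lack it */
#endif

/* Hypothetical helper: read CLOCK_TAI in nanoseconds, 0 on failure. */
static uint64_t tai_now_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_TAI, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: blackout between the timestamp taken on the
 * source and the one taken on the target. If the target appears to be
 * behind the source (unsynchronised hosts), report 0 rather than
 * letting the unsigned subtraction wrap.
 */
static uint64_t blackout_ns(uint64_t src_tai_ns, uint64_t dst_tai_ns)
{
    return dst_tai_ns > src_tai_ns ? dst_tai_ns - src_tai_ns : 0;
}
```

The source would record tai_now_ns() alongside the saved TSC, and the target would call it again at restore time. A leap second inserted in between shifts CLOCK_REALTIME by a second but leaves the CLOCK_TAI difference alone.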

(Hopefully the BIPM will see sense and continue doing leap seconds even
after 2035, rather than kicking the can down the road and creating a
new larger y2k/y2038 style problem for the future. But regardless of
that, they've given us the worst of all worlds by *both* pandering to
broken software *and* not actually stopping immediately, so for now you
still have to avoid having those bugs until 2035. And... your great-
grandchildren might thank you for not introducing those bugs even if
there isn't a leap second before you retire?)

Secondly, in all sane modern hardware the kvmclock should be a fixed
relationship from the guest's TSC which doesn't change for the whole
lifetime of the guest. We should set the guest *TSC* as accurately as
we can (with offsets, if it's on the same hardware after live update,
via CLOCK_TAI if it's a live migration). And the y=mx+c relationship of
the KVM clock should be put back *exactly* as it was. With KVM
selftests which validate that the bits seen in the PVTI by the guest
literally DID NOT CHANGE. Or at least, that they give precisely the
same time result for all historical TSC readings as they did at the
time.
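
As a rough sketch of what such a check could look like: the guest turns a TSC reading into nanoseconds using the tsc_timestamp, system_time, tsc_to_system_mul and tsc_shift fields of the PVTI, which is the y=mx+c line. The struct below is an illustrative subset of those fields, not the real ABI layout, and pvti_ns / pvti_same_results are names invented for this example. It relies on the GCC/Clang unsigned __int128 extension and only makes sense for TSC samples at or after both tsc_timestamp values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset of the PVTI fields that define the line. */
struct pvti_line {
    uint64_t tsc_timestamp;     /* TSC value the line is anchored at */
    uint64_t system_time;       /* kvmclock ns at tsc_timestamp (the "c") */
    uint32_t tsc_to_system_mul; /* 32.32 fixed-point multiplier (the "m") */
    int8_t   tsc_shift;         /* pre-shift applied to the TSC delta */
};

/* Nanoseconds the guest would compute for a given TSC reading. */
static uint64_t pvti_ns(const struct pvti_line *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_timestamp;

    if (p->tsc_shift < 0)
        delta >>= -p->tsc_shift;
    else
        delta <<= p->tsc_shift;

    return p->system_time +
           (uint64_t)(((unsigned __int128)delta * p->tsc_to_system_mul) >> 32);
}

/* True if both parameter sets give identical ns for every sample. */
static bool pvti_same_results(const struct pvti_line *before,
                              const struct pvti_line *after,
                              const uint64_t *tsc, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pvti_ns(before, tsc[i]) != pvti_ns(after, tsc[i]))
            return false;
    return true;
}
```

The strict form of the test is simply a byte comparison of the PVTI before and after. The sample-based form above is the weaker "at least" variant, and even that can fail by a nanosecond from rounding if the line is re-anchored carelessly.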

So:

1. Set the TSC, from offset or CLOCK_TAI.
2. Set the TSC→kvmclock relationship.
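
For step 1, the arithmetic could be sketched roughly as below. These are hypothetical helpers, not existing kernel or QEMU functions, and they assume the guest TSC frequency is known in kHz and that the elapsed time was measured with CLOCK_TAI on both hosts. The multiplication uses the GCC/Clang unsigned __int128 extension so that long blackouts do not overflow.

```c
#include <stdint.h>

/*
 * Same hardware (e.g. live update): the guest TSC is the host TSC plus
 * a fixed offset, so restoring the saved offset is all that is needed.
 * Unsigned wraparound is intentional here.
 */
static uint64_t guest_tsc_from_offset(uint64_t host_tsc, uint64_t tsc_offset)
{
    return host_tsc + tsc_offset;
}

/*
 * Live migration: advance the guest TSC saved on the source by the
 * number of ticks that correspond to the elapsed CLOCK_TAI time.
 * ticks = elapsed_ns * (tsc_khz * 1000) / 1e9 = elapsed_ns * tsc_khz / 1e6
 */
static uint64_t guest_tsc_after_blackout(uint64_t src_guest_tsc,
                                         uint64_t elapsed_ns,
                                         uint64_t guest_tsc_khz)
{
    unsigned __int128 ticks =
        (unsigned __int128)elapsed_ns * guest_tsc_khz / 1000000u;

    return src_guest_tsc + (uint64_t)ticks;
}
```

With a 2 GHz guest TSC (2000000 kHz) and the one-hour pause from the example above, that advances the TSC by 7.2e12 ticks. Step 2 then needs no arithmetic at all: the PVTI parameters go back unchanged.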

Let's get the kernel APIs in place to support that, and then make qemu
do it right. I'm not sure I see any value in half-measures.
