Control: reassign -1 qemu-system-x86 Control: severity -1 important Control: found -1 qemu/1:2.1+dfsg-12 Control: retitle -1 QEMU: causes vCPU steal time overflow on live migration
On 22:25 Fri 05 Jun , Dominic Hargreaves wrote: > Great to hear that you found the underlying cause[1] of this! I note > the workaround: > > -cpu qemu64,-kvm_steal_time > > which may be applicable to the Debian hosts? > > [1] <https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg01295.html> After investigating a bit more, it looks like the issue comes from an overflow in the kernel's accumulate_steal_time() (arch/x86/kvm/x86.c:2023): static void accumulate_steal_time(struct kvm_vcpu *vcpu) { u64 delta; if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED)) return; delta = current->sched_info.run_delay - vcpu->arch.st.last_steal; Using systemtap with the attached script to trace KVM execution on the receiving host kernel, we can see that shortly before marking the vCPUs as runnable on a migrated KVM instance with 2 vCPUs, the following happens (** marks lines of interest): ** 0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns 0 qemu-system-x86(18446): -> kvm_arch_vcpu_load 0 vhost-18446(18447): -> kvm_arch_vcpu_should_kick 5 vhost-18446(18447): <- kvm_arch_vcpu_should_kick 23 qemu-system-x86(18446): <- kvm_arch_vcpu_load 0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl 2 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl 0 qemu-system-x86(18446): -> kvm_arch_vcpu_put 2 qemu-system-x86(18446): -> kvm_put_guest_fpu 3 qemu-system-x86(18446): <- kvm_put_guest_fpu 4 qemu-system-x86(18446): <- kvm_arch_vcpu_put ** 0 qemu-system-x86(18446): kvm_arch_vcpu_load: run_delay=7856949 ns steal=7856949 ns 0 qemu-system-x86(18446): -> kvm_arch_vcpu_load 1 qemu-system-x86(18446): <- kvm_arch_vcpu_load 0 qemu-system-x86(18446): -> kvm_arch_vcpu_ioctl 1 qemu-system-x86(18446): <- kvm_arch_vcpu_ioctl 0 qemu-system-x86(18446): -> kvm_arch_vcpu_put 1 qemu-system-x86(18446): -> kvm_put_guest_fpu 2 qemu-system-x86(18446): <- kvm_put_guest_fpu 3 qemu-system-x86(18446): <- kvm_arch_vcpu_put ** 0 qemu-system-x86(18449): kvm_arch_vcpu_load: run_delay=40304 ns steal=7856949 ns 0 qemu-system-x86(18449): -> kvm_arch_vcpu_load ** 7 qemu-system-x86(18449): delta: 18446744073701734971 ns, steal=7856949 ns, run_delay=40304 ns 10 qemu-system-x86(18449): <- kvm_arch_vcpu_load ** 0 qemu-system-x86(18449): -> kvm_arch_vcpu_ioctl_run 4 qemu-system-x86(18449): -> kvm_arch_vcpu_runnable 6 qemu-system-x86(18449): <- kvm_arch_vcpu_runnable ... 0 qemu-system-x86(18448): kvm_arch_vcpu_load: run_delay=0 ns steal=7856949 ns 0 qemu-system-x86(18448): -> kvm_arch_vcpu_load ** 34 qemu-system-x86(18448): delta: 18446744073701694667 ns, steal=7856949 ns, run_delay=0 ns 40 qemu-system-x86(18448): <- kvm_arch_vcpu_load ** 0 qemu-system-x86(18448): -> kvm_arch_vcpu_ioctl_run 5 qemu-system-x86(18448): -> kvm_arch_vcpu_runnable Now, what's really interesting is that current->sched_info.run_delay gets reset because the tasks (threads) using the vCPUs change, and thus have a different current->sched_info: it looks like task 18446 created the two vCPUs, and then they were handed over to 18448 and 18449 respectively. This is also verified by the fact that during the overflow, both vCPUs have the old steal time of the last vcpu_load of task 18446. However, according to Documentation/virtual/kvm/api.txt: - vcpu ioctls: These query and set attributes that control the operation of a single virtual cpu. Only run vcpu ioctls from the same thread that was used to create the vcpu. So it seems qemu is doing something that it shouldn't: calling vCPU ioctls from a thread that didn't create the vCPU. Note that this probably happens on every QEMU startup, but is not visible because the guest kernel zeroes out the steal time on boot. There are at least two ways to mitigate the issue without a kernel recompilation: - The first one is to disable the steal time propagation from host to guest by invoking qemu with `-cpu qemu64,-kvm_steal_time`. This will short-circuit accumulate_steal_time() due to (vcpu->arch.st.msr_val & KVM_MSR_ENABLED) and will completely disable steal time reporting in the guest, which may not be desired if people rely on it to detect CPU congestion. - The other one is using the following systemtap script to prevent the steal time counter from overflowing by dropping the problematic samples (WARNING: systemtap guru mode required, use at your own risk): probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") { if (@defined($delta) && $delta < 0) { printk(4, "kvm: steal time delta < 0, dropping") $delta = 0 } } Note that not all *guests* handle this condition in the same way: 3.2 guests still get the overflow in /proc/stat, but their scheduler continues to work as expected. 3.16 guests OTOH go nuts once steal time overflows and stop accumulating system & user time, while entering an erratic state where steal time in /proc/stat is *decreasing* on every clock tick. Regards, Apollon
global steal global run_delay probe module("kvm").function("*@arch/x86/kvm/x86.c").call { printf ("%s -> %s\n", thread_indent(1), ppfunc()) } probe module("kvm").function("*@arch/x86/kvm/x86.c").return { printf ("%s <- %s\n", thread_indent(-1), ppfunc()) } probe module("kvm").function("kvm_arch_vcpu_load") { task = task_current() run_delay[tid()] = @cast(task, "task_struct", "kernel<linux/sched.h>")->sched_info->run_delay steal[tid()] = $vcpu->arch->st->last_steal printf ("%s kvm_arch_vcpu_load: run_delay=%lu ns steal=%lu ns\n", thread_indent(0), run_delay[tid()], steal[tid()]) } probe module("kvm").statement("*@arch/x86/kvm/x86.c:2024") { if (@defined($delta) && $delta != 0) { printf ("%s delta: %lu ns, steal=%lu ns, run_delay=%lu ns\n", thread_indent(0), $delta, steal[tid()], run_delay[tid()]) } }