Package: src:linux
Version: 4.9.30-2+deb9u3
Severity: normal
Tags: patch

Dear Maintainer,

when running Debian Stretch as a paravirtualized guest under Xen, the kernel obtains its CPU steal time counter from the virtualization host. On some hosts, the reported steal time occasionally decreases slightly. This leads to an overflow of unsigned variables in the kernel and subsequent errors in steal time accounting (such as counters running backwards). Tools like "top" or "vmstat" then break in a way that the CPU utilization cannot be determined anymore.

While this is likely a bug in the virtualization environment, the kernel running as a guest should deal with it gracefully. I attached a patch to this report which fixes the resulting errors on the guest. Kernel versions 4.7 and older, as well as 4.11 and newer, should not be affected by this issue.

Bug #785557 shows that behavior like this is caused by some broken KVM hosts. I myself experience it on a Xen host about which I unfortunately have no further information. A more detailed description of the issue is part of the patch header and of the following blog post:

https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/

I would appreciate inclusion of this patch in Debian, as this issue may affect other people running on buggy virtualization hosts and the patch should not influence other systems.

Note that the system I report this from already runs a custom-patched kernel, which may influence some of the information below.

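The overflow described above can be reproduced in user space. The following is only a minimal sketch: the helper names `steal_delta_unsigned` and `steal_delta_signed` are made up for illustration and are not kernel functions. It shows why a slightly decreasing counter turns into a huge value when the difference is kept in an unsigned 64-bit variable, and how a signed view of the same difference makes the decrease detectable.

```c
#include <stdint.h>

/*
 * Hypothetical helper: difference of two steal clock readings kept in an
 * unsigned 64-bit variable. If 'now' is smaller than 'prev', the result
 * wraps around to a value close to 2^64.
 */
static uint64_t steal_delta_unsigned(uint64_t now, uint64_t prev)
{
	return now - prev;
}

/*
 * Hypothetical helper: the same difference interpreted as signed.
 * The conversion of an out-of-range value to int64_t is
 * implementation-defined in ISO C; on the two's complement platforms
 * Linux supports, a decrease shows up as a negative number.
 */
static int64_t steal_delta_signed(uint64_t now, uint64_t prev)
{
	return (int64_t)(now - prev);
}
```

For example, with a previous reading of 1000000 ns and a new reading of 999000 ns, the unsigned difference is 18446744073709550616 (2^64 - 1000), while the signed difference is -1000.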
-- Package-specific info:
** Version:
Linux version 4.9.0-3-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u3+lass1 (2017-08-08)

** Command line:
root=/dev/xvda ro

** Not tainted

** Kernel log:
Unable to read kernel log; any relevant messages should be attached

** Model information

** Loaded modules:
ipt_REJECT
nf_reject_ipv4
binfmt_misc
xt_multiport
iptable_filter
intel_rapl
sb_edac
edac_core
evdev
kvm_intel
kvm
irqbypass
crct10dif_pclmul
crc32_pclmul
ghash_clmulni_intel
pcspkr
intel_rapl_perf
ip_tables
x_tables
autofs4
ext4
crc16
jbd2
fscrypto
ecb
mbcache
btrfs
crc32c_generic
xor
raid6_pq
crc32c_intel
xen_netfront
xen_blkfront
aesni_intel
aes_x86_64
glue_helper
lrw
gf128mul
ablk_helper
cryptd

** PCI devices:
not available

** USB devices:
not available


-- System Information:
Debian Release: 9.1
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.9.0-3-amd64 (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages linux-image-4.9.0-3-amd64 depends on:
ii  initramfs-tools [linux-initramfs-tool]  0.130
ii  kmod                                    23-2
ii  linux-base                              4.5

Versions of packages linux-image-4.9.0-3-amd64 recommends:
ii  firmware-linux-free  3.4
ii  irqbalance           1.1.0-2.3

Versions of packages linux-image-4.9.0-3-amd64 suggests:
pn  debian-kernel-handbook              <none>
pn  grub-pc | grub-efi-amd64 | extlinux <none>
pn  linux-doc-4.9                       <none>

Versions of packages linux-image-4.9.0-3-amd64 is related to:
pn  firmware-amd-graphics     <none>
pn  firmware-atheros          <none>
pn  firmware-bnx2             <none>
pn  firmware-bnx2x            <none>
pn  firmware-brcm80211        <none>
pn  firmware-cavium           <none>
pn  firmware-intel-sound      <none>
pn  firmware-intelwimax       <none>
pn  firmware-ipw2x00          <none>
pn  firmware-ivtv             <none>
pn  firmware-iwlwifi          <none>
pn  firmware-libertas         <none>
pn  firmware-linux-nonfree    <none>
pn  firmware-misc-nonfree     <none>
pn  firmware-myricom          <none>
pn  firmware-netxen           <none>
pn  firmware-qlogic           <none>
pn  firmware-realtek          <none>
pn  firmware-samsung          <none>
pn  firmware-siano            <none>
pn  firmware-ti-connectivity  <none>
pn  xen-hypervisor            <none>

-- no debconf information

From 4b66621a06a94d22629661a9262f92b8cf5b7ca9 Mon Sep 17 00:00:00 2001
From: Michael Lass <be...@bi-co.net>
Date: Sun, 6 Aug 2017 18:09:21 +0200
Subject: [PATCH] sched/cputime: handle decreasing steal clock

On some flaky Xen hosts, the steal clock returned by
paravirt_steal_clock() is not monotonically increasing but can slightly
decrease. Currently this results in an overflow of u64 steal. Before
this number is passed to account_steal_time(), it is converted into
cputime, so the target counter cpustat[CPUTIME_STEAL] does not overflow
as well but is instead increased by a large amount. Due to the
conversion to cputime and back into nanoseconds,
this_rq()->prev_steal_time does not correctly reflect the latest
reported steal clock afterwards, resulting in erratic behavior such as
cpustat[CPUTIME_STEAL] running backwards.

The following is a trace from userspace of the value for steal time
reported in /proc/stat:

  time    stolen          diff
  ----    ------          ----
  0ms     784
  100ms   1844670130367   1844670129583
  200ms   1844664564089   -5566278
  300ms   1844659554439   -5009650
  400ms   1844655101417   -4453022

This issue was probably introduced by the following commits, which
deactivate a check for (steal < 0) in the Xen pv guest codepath and
allow unlimited jumps of the cpustat counters (both introduced in v4.8):

  ecb23dc6f2eff0ce64dd60351a81f376f13b12cc
  03cbc732639ddcad15218c4b2046d255851ff1e3

As a workaround, ignore decreasing values of the steal clock. By not
updating this_rq()->prev_steal_time we make sure that steal time is
only accounted for again once the steal clock rises above the value
that was already observed and accounted for earlier.

In current kernel versions (v4.11 and higher) this issue should not
exist since the conversion between nsec and cputime has been
eliminated. Therefore all values will overflow together, i.e. the
counter decreases just as reported by the host system.
---
 kernel/sched/cputime.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5ebee3164e64..5f039f7f9294 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -262,10 +262,19 @@ static __always_inline cputime_t steal_account_process_time(cputime_t maxtime)
 #ifdef CONFIG_PARAVIRT
 	if (static_key_false(&paravirt_steal_enabled)) {
 		cputime_t steal_cputime;
-		u64 steal;
-
-		steal = paravirt_steal_clock(smp_processor_id());
-		steal -= this_rq()->prev_steal_time;
+		u64 steal_time;
+		s64 steal;
+
+		steal_time = paravirt_steal_clock(smp_processor_id());
+		steal = steal_time - this_rq()->prev_steal_time;
+
+		if (unlikely(steal < 0)) {
+			printk_ratelimited(KERN_DEBUG "cputime: steal_clock for "
+				"processor %d decreased: %llu -> %llu, "
+				"ignoring\n", smp_processor_id(),
+				this_rq()->prev_steal_time, steal_time);
+			return 0;
+		}
 
 		steal_cputime = min(nsecs_to_cputime(steal), maxtime);
 		account_steal_time(steal_cputime);
-- 
2.14.0
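
For readers who want to experiment with the workaround outside the kernel, the logic of the patch can be approximated in user space. This is a simplified sketch, not the kernel code: `struct steal_state` and `account_steal_sample` are hypothetical names, the struct stands in for this_rq()->prev_steal_time and cpustat[CPUTIME_STEAL], and the conversion between nanoseconds and cputime is deliberately left out, so all values are plain nanoseconds.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-runqueue and cpustat state. */
struct steal_state {
	uint64_t prev_steal_time;	/* steal clock value already accounted, in ns */
	uint64_t accounted;		/* total steal time charged so far, in ns */
};

/*
 * Account one steal clock reading, following the idea of the patch:
 * a reading below prev_steal_time is ignored and prev_steal_time is
 * left untouched, so accounting resumes only once the clock rises
 * above the value that was already accounted for.
 * Returns the number of nanoseconds charged for this reading.
 */
static uint64_t account_steal_sample(struct steal_state *st,
				     uint64_t steal_clock, uint64_t maxtime)
{
	/* Signed difference: negative if the host's clock went backwards. */
	int64_t steal = (int64_t)(steal_clock - st->prev_steal_time);
	uint64_t charged;

	if (steal < 0)
		return 0;

	/* Charge at most maxtime per call, as the kernel function does. */
	charged = (uint64_t)steal < maxtime ? (uint64_t)steal : maxtime;
	st->accounted += charged;
	st->prev_steal_time += charged;

	return charged;
}
```

Starting from a zeroed state and feeding the readings 1000, 900 and 1200 with a maxtime of 10000, the function charges 1000, 0 and 200 ns. The decreasing reading is skipped and the total ends up at 1200 ns, matching the last value reported by the host.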