On Wed, 5 Apr 2017 23:36:24 -0700 Wanpeng Li <kernel...@gmail.com> wrote:
> From: Wanpeng Li <wanpeng...@hotmail.com> > > This can be reproduced by setting up a CPU for nohz_full and pin two tasks > that hog the CPU 100% of the time to that CPU, top reports 70% system time > and 30% user time utilization sometimes. The trace like this: > > hog-11980 [015] 341.494491: function: enter_from_user_mode <-- > apic_timer_interrupt > <idle>-0 [000] 341.494492: function: smp_apic_timer_interrupt <-- > apic_timer_interrupt > hog-11980 [015] 341.494492: function: __context_tracking_exit <-- > enter_from_user_mode > <idle>-0 [000] 341.494492: function: irq_enter <-- > smp_apic_timer_interrupt > hog-11980 [015] 341.494492: bprint: vtime_delta: diff=0 > (now=4295008339 vtime_snap=4295008339) > hog-11980 [015] 341.494492: function: smp_apic_timer_interrupt <-- > apic_timer_interrupt > hog-11980 [015] 341.494492: function: irq_enter <-- > smp_apic_timer_interrupt > hog-11980 [015] 341.494493: function: tick_sched_timer <-- > __hrtimer_run_queues > <idle>-0 [000] 341.494493: function: tick_sched_timer <-- > __hrtimer_run_queues > <idle>-0 [000] 341.494493: function: > tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer > <idle>-0 [000] 341.494494: function: do_timer <-- > tick_do_update_jiffies64.part.14 > hog-11980 [015] 341.494494: function: irq_exit <-- > smp_apic_timer_interrupt > <idle>-0 [000] 341.494494: bprint: do_timer: updated > jiffies_64=4295008340 ticks=1 > hog-11980 [015] 341.494494: function: __context_tracking_enter <-- > prepare_exit_to_usermode > hog-11980 [015] 341.494494: function: vtime_user_enter <-- > __context_tracking_enter > hog-11980 [015] 341.494495: bprint: vtime_delta: diff=1000000 > (now=4295008340 vtime_snap=4295008339) > hog-11980 [015] 341.494495: function: __vtime_account_system <-- > vtime_user_enter > hog-11980 [015] 341.494495: bprint: get_vtime_delta: > vtime_snap=4295008339 now=4295008340 > hog-11980 [015] 341.494495: function: account_system_time <-- > __vtime_account_system > hog-11980 [015] 341.494495: bprint: account_system_time: > cputime=995488 > <idle>-0 [000] 341.494497: function: irq_exit <-- > smp_apic_timer_interrupt > > In this trace, we see the following: > > 1. On CPU15, we transition from user-space to kernel-space because > of a timer interrupt (it's the tick) > > 2. vtimer_delta() returns 0, because jiffies didn't change since the > last accounting > > 3. While CPU15 is executing in kernel-space, jiffies is updated > by CPU0 > > 4. When going back to user-space, vtime_delta() returns non-zero > and the whole time is accounted for system time (observe how > the cputime parameter in account_system_time() is less than 1ms) > > > After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN > to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to > jiffy based sampling. So the issue can happen if all CPUs fire their ticks > at the same time: > > > CPU 0 CPU 1 > ----- ----- > exit_user() // no cputime update > tick X update_jiffies > enter_user() // cputime update > > > exit_user() // no cputime update > tick X+1 update_jiffies > enter_user() // cputime update > > The nohz_full CPU receives an interrupt at the same time the timer > interrupt fires on the housekeeping CPU. > > This patch offsets the tick to avert all ticks alignment in order > that the vtime sampling does not end up "in phase" with the jiffies > incrementing. > > Reported-by: Luiz Capitulino <lcapitul...@redhat.com> > Suggested-by: Rik van Riel <r...@redhat.com> > Cc: Frederic Weisbecker <fweis...@gmail.com> > Cc: Rik van Riel <r...@redhat.com> > Cc: Mike Galbraith <efa...@gmx.de> > Cc: Luiz Capitulino <lcapitul...@redhat.com> > Cc: Thomas Gleixner <t...@linutronix.de> > Cc: Peter Zijlstra <pet...@infradead.org> > Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com> Reviewed-and-Tested-by: Luiz Capitulino <lcapitul...@redhat.com> Thanks for everyone who participated on tracking this down and fixing it! > --- > kernel/time/tick-sched.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > index 7fe53be..35d79f7 100644 > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void) > /* Get the next period (per-CPU) */ > hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update()); > > - /* Offset the tick to avert jiffies_lock contention. */ > - if (sched_skew_tick) { > + /* > + * Offset the tick to avert jiffies_lock contention, and all ticks > + * alignment in order that the vtime sampling does not end up "in > + * phase" with the jiffies incrementing. > + */ > + if (sched_skew_tick || tick_nohz_full_enabled()) { > u64 offset = ktime_to_ns(tick_period) >> 1; > do_div(offset, num_possible_cpus()); > offset *= smp_processor_id();