On Wed,  5 Apr 2017 23:36:24 -0700
Wanpeng Li <kernel...@gmail.com> wrote:

> From: Wanpeng Li <wanpeng...@hotmail.com>
> 
> This can be reproduced by setting up a CPU for nohz_full and pin two tasks 
> that hog the CPU 100% of the time to that CPU, top reports 70% system time 
> and 30% user time utilization sometimes. The trace like this:
> 
>    hog-11980 [015]   341.494491: function:    enter_from_user_mode <-- 
> apic_timer_interrupt
> <idle>-0     [000]   341.494492: function:    smp_apic_timer_interrupt <-- 
> apic_timer_interrupt
>    hog-11980 [015]   341.494492: function:    __context_tracking_exit <-- 
> enter_from_user_mode
> <idle>-0     [000]   341.494492: function:    irq_enter <-- 
> smp_apic_timer_interrupt
>    hog-11980 [015]   341.494492: bprint:      vtime_delta: diff=0 
> (now=4295008339 vtime_snap=4295008339)
>    hog-11980 [015]   341.494492: function:    smp_apic_timer_interrupt <-- 
> apic_timer_interrupt
>    hog-11980 [015]   341.494492: function:    irq_enter <-- 
> smp_apic_timer_interrupt
>    hog-11980 [015]   341.494493: function:    tick_sched_timer <-- 
> __hrtimer_run_queues
> <idle>-0     [000]   341.494493: function:    tick_sched_timer <-- 
> __hrtimer_run_queues
> <idle>-0     [000]   341.494493: function:    
> tick_do_update_jiffies64.part.14 <-- tick_sched_do_timer
> <idle>-0     [000]   341.494494: function:    do_timer <-- 
> tick_do_update_jiffies64.part.14
>    hog-11980 [015]   341.494494: function:    irq_exit <-- 
> smp_apic_timer_interrupt
> <idle>-0     [000]   341.494494: bprint:      do_timer: updated 
> jiffies_64=4295008340 ticks=1
>    hog-11980 [015]   341.494494: function:    __context_tracking_enter <-- 
> prepare_exit_to_usermode
>    hog-11980 [015]   341.494494: function:    vtime_user_enter <-- 
> __context_tracking_enter
>    hog-11980 [015]   341.494495: bprint:      vtime_delta: diff=1000000 
> (now=4295008340 vtime_snap=4295008339)
>    hog-11980 [015]   341.494495: function:    __vtime_account_system <-- 
> vtime_user_enter
>    hog-11980 [015]   341.494495: bprint:      get_vtime_delta: 
> vtime_snap=4295008339 now=4295008340
>    hog-11980 [015]   341.494495: function:    account_system_time <-- 
> __vtime_account_system
>    hog-11980 [015]   341.494495: bprint:      account_system_time: 
> cputime=995488
> <idle>-0     [000]   341.494497: function:    irq_exit <-- 
> smp_apic_timer_interrupt
> 
> In this trace, we see the following:
> 
>  1. On CPU15, we transition from user-space to kernel-space because
>     of a timer interrupt (it's the tick)
> 
>  2. vtimer_delta() returns 0, because jiffies didn't change since the
>     last accounting
> 
>  3. While CPU15 is executing in kernel-space, jiffies is updated
>     by CPU0
> 
>  4. When going back to user-space, vtime_delta() returns non-zero
>     and the whole time is accounted for system time (observe how
>     the cputime parameter in account_system_time() is less than 1ms)
> 
> 
> After commit ff9a9b4c433 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN 
> to jiffy granularity"), it reduces timing accuracy on nohz_full CPUs to 
> jiffy based sampling. So the issue can happen if all CPUs fire their ticks 
> at the same time:
> 
> 
>                CPU 0                  CPU 1
>                -----                  -----
>                                       exit_user() // no cputime update 
> tick X         update_jiffies
>                                       enter_user() // cputime update
> 
> 
>                                       exit_user() // no cputime update
> tick X+1       update_jiffies
>                                       enter_user() // cputime update
> 
> The nohz_full CPU receives an interrupt at the same time the timer 
> interrupt fires on the housekeeping CPU.
> 
> This patch offsets the tick to avert all ticks alignment in order 
> that the vtime sampling does not end up "in phase" with the jiffies 
> incrementing.
> 
> Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
> Suggested-by: Rik van Riel <r...@redhat.com>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Cc: Rik van Riel <r...@redhat.com>
> Cc: Mike Galbraith <efa...@gmx.de>
> Cc: Luiz Capitulino <lcapitul...@redhat.com>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com>

Reviewed-and-Tested-by: Luiz Capitulino <lcapitul...@redhat.com>

Thanks for everyone who participated on tracking this down
and fixing it!

> ---
>  kernel/time/tick-sched.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 7fe53be..35d79f7 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1197,8 +1197,12 @@ void tick_setup_sched_timer(void)
>       /* Get the next period (per-CPU) */
>       hrtimer_set_expires(&ts->sched_timer, tick_init_jiffy_update());
>  
> -     /* Offset the tick to avert jiffies_lock contention. */
> -     if (sched_skew_tick) {
> +     /*
> +      * Offset the tick to avert jiffies_lock contention, and all ticks
> +      * alignment in order that the vtime sampling does not end up "in
> +      * phase" with the jiffies incrementing.
> +      */
> +     if (sched_skew_tick || tick_nohz_full_enabled()) {
>               u64 offset = ktime_to_ns(tick_period) >> 1;
>               do_div(offset, num_possible_cpus());
>               offset *= smp_processor_id();

Reply via email to