On Tue, 22 Aug 2017 00:19:28 +1000
Nicholas Piggin <npig...@gmail.com> wrote:

> On Mon, 21 Aug 2017 11:18:33 +0100
> Jonathan Cameron <jonathan.came...@huawei.com> wrote:
> 
> > On Mon, 21 Aug 2017 16:06:05 +1000
> > Nicholas Piggin <npig...@gmail.com> wrote:
> >   
> > > On Mon, 21 Aug 2017 10:52:58 +1000
> > > Nicholas Piggin <npig...@gmail.com> wrote:
> > >     
> > > > On Sun, 20 Aug 2017 14:14:29 -0700
> > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > >       
> > > > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:     
> > > > >    
> > > > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:    
> > > > > >       
> > > > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > > > Nicholas Piggin <npig...@gmail.com> wrote:
> > > > > > >           
> > > > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > > > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:          
> > > > > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > > > > > 
> > > > > > > > > > Thomas, John, am I misinterpreting the timer trace event messages?
> > > > > > > > 
> > > > > > > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > > > > > > simple schedule_timeout(1) and just goes out to lunch for many seconds.
> > > > > > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > > > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > > > > > 
> > > > > > > > So this patch seems to fix it. Testing, comments welcome.       
> > > > > > > >    
> > > > > > > 
> > > > > > > > Okay this had a problem of trying to forward the timer from a timer
> > > > > > > > callback function.
> > > > > > > > 
> > > > > > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > > > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > > > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > > > > > only when needed, it probably saves a load).
> > > > > > 
> > > > > > Giving this one a whirl!          
> > > > > 
> > > > > No joy here, but then again there are other reasons to believe that I
> > > > > am seeing a different bug than Dave and Jonathan are.
> > > > > 
> > > > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > > > not statistically different from what I see without either patch.
> > > > > 
> > > > > But no statistical difference compared to without patch, and I still
> > > > > > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > > > > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > > > be hypervisor preemption, I guess.        
> > > > 
> > > > Okay it makes the warnings go away for me, but I'm just booting then
> > > > leaving the system idle. You're doing some CPU hotplug activity?      
> > > 
> > > Okay found a bug in the patch (it was not forwarding properly before
> > > adding the first timer after an idle) and a few other concerns.
> > > 
> > > There's still a problem of a timer function doing a mod timer from
> > > within expire_timers. It can't forward the base, which might currently
> > > be quite a way behind. I *think* after we close these gaps and get
> > > timely wakeups for timers on there, it should not get too far behind
> > > for standard timers.
> > > 
> > > Deferrable is a different story. Firstly it has no idle tracking so we
> > > never forward it. Even if we wanted to, we can't do it reliably because
> > > it could contain timers way behind the base. They are "deferrable", so
> > > you get what you pay for, but this still means there's a window where
> > > you can add a deferrable timer and get a far later expiry than you
> > > asked for despite the CPU never going idle after you added it.
> > > 
> > > All these problems would seem to go away if mod_timer just queued up
> > > the timer to a single list on the base then pushed them into the
> > > wheel during your wheel processing softirq... Although maybe you end
> > > up with excessive passes over big queue of timers. Anyway that
> > > wouldn't be suitable for 4.13 even if it could work.
> > > 
> > > I'll send out an updated minimal fix after some more testing...    
> > 
> > Hi All,
> > 
> > I'm back in the office with hardware access on our D05 64 core ARM64
> > boards.
> > 
> > I think we still have by far the quickest test cases for this so
> > feel free to ping me anything you want tested quickly (we were
> > looking at an average of less than 10 minutes to trigger
> > with machine idling).
> > 
> > Nick, I'm currently running your previous version and we are over an
> > hour in without any instances of the issue, so it looks like a
> > considerable improvement.  I'll see if I can line a couple of boards
> > up for an overnight run if you have your updated version out by then.
> > 
> > Be great to finally put this one to bed.  
> 
> Hi Jonathan,
> 
> Thanks, here's an updated version with a couple more bugs fixed. If
> you could try testing, that would be much appreciated.
> 
> Thanks,
> Nick

Running now on 1 board. I'll grab another in a few hours and report back
in the morning if we don't see issues before I head off.

We got to about 5 hours on the previous version without a problem, versus
under 10 minutes on the two baseline tests I ran without it, so even
with its bugs that version seems to have dealt with the issue itself.

On 15 mins so far and all good.

Jonathan

> 
> ---
>  kernel/time/timer.c | 43 +++++++++++++++++++++++++++++++++++--------
>  1 file changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 8f5d1bf18854..2b9d2cdb3fac 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -203,6 +203,7 @@ struct timer_base {
>       bool                    migration_enabled;
>       bool                    nohz_active;
>       bool                    is_idle;
> +     bool                    was_idle; /* was it idle since last run/fwded */
>       DECLARE_BITMAP(pending_map, WHEEL_SIZE);
>       struct hlist_head       vectors[WHEEL_SIZE];
>  } ____cacheline_aligned;
> @@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags)
>  
>  static inline void forward_timer_base(struct timer_base *base)
>  {
> -     unsigned long jnow = READ_ONCE(jiffies);
> +     unsigned long jnow;
>  
>       /*
> -      * We only forward the base when it's idle and we have a delta between
> -      * base clock and jiffies.
> +      * We only forward the base when we are idle or have just come out
> +      * of idle (was_idle logic), and have a delta between base clock
> +      * and jiffies. In the common case, run_timers will take care of it.
>        */
> -     if (!base->is_idle || (long) (jnow - base->clk) < 2)
> +     if (likely(!base->was_idle))
> +             return;
> +
> +     jnow = READ_ONCE(jiffies);
> +     base->was_idle = base->is_idle;
> +     if ((long)(jnow - base->clk) < 2)
>               return;
>  
>       /*
> @@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>        * same array bucket then just return:
>        */
>       if (timer_pending(timer)) {
> +             /*
> +              * The downside of this optimization is that it can result in
> +              * larger granularity than you would get from adding a new
> +              * timer with this expiry. Would a timer flag for networking
> +              * be appropriate, then we can try to keep expiry of general
> +              * timers within ~1/8th of their interval?
> +              */
>               if (timer->expires == expires)
>                       return 1;
>  
> @@ -948,6 +962,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>                * dequeue/enqueue dance.
>                */
>               base = lock_timer_base(timer, &flags);
> +             forward_timer_base(base);
>  
>               clk = base->clk;
>               idx = calc_wheel_index(expires, clk);
> @@ -964,6 +979,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>               }
>       } else {
>               base = lock_timer_base(timer, &flags);
> +             forward_timer_base(base);
>       }
>  
>       ret = detach_if_pending(timer, base, false);
> @@ -991,12 +1007,10 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>                       raw_spin_lock(&base->lock);
>                       WRITE_ONCE(timer->flags,
>                                  (timer->flags & ~TIMER_BASEMASK) | base->cpu);
> +                     forward_timer_base(base);
>               }
>       }
>  
> -     /* Try to forward a stale timer base clock */
> -     forward_timer_base(base);
> -
>       timer->expires = expires;
>       /*
>        * If 'idx' was calculated above and the base time did not advance
> @@ -1499,8 +1513,10 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
>               /*
>                * If we expect to sleep more than a tick, mark the base idle:
>                */
> -             if ((expires - basem) > TICK_NSEC)
> +             if ((expires - basem) > TICK_NSEC) {
> +                     base->was_idle = true;
>                       base->is_idle = true;
> +             }
>       }
>       raw_spin_unlock(&base->lock);
>  
> @@ -1611,6 +1627,17 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
>  {
>       struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
>  
> +     /*
> +      * was_idle must be cleared before running timers so that any timer
> +      * functions that call mod_timer will not try to forward the base.
> +      *
> +      * The deferrable base does not do idle tracking at all, so we do
> +      * not forward it. This can result in very large variations in
> +      * granularity for deferrable timers, but they can be deferred for
> +      * long periods due to idle.
> +      */
> +     base->was_idle = false;
> +
>       __run_timers(base);
>       if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
>               __run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
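
For anyone following the was_idle logic in the patch above, a rough
user-space model may help. This is only a sketch, not kernel code:
struct model_base, model_time_after() and model_forward_base() are
hypothetical names, jnow is passed in rather than read from jiffies,
and the final clk update is an assumption about the part of
forward_timer_base() that the hunk does not show.

```c
#include <stdbool.h>
#include <limits.h>

/* Hypothetical stand-in for the timer_base fields the patch touches. */
struct model_base {
	unsigned long clk;         /* base clock, in jiffies */
	unsigned long next_expiry; /* earliest pending expiry, in jiffies */
	bool is_idle;              /* base is currently marked idle */
	bool was_idle;             /* idle since last run/forward */
};

/*
 * Wraparound-safe "a is later than b". Relies on the usual
 * two's-complement behaviour of the unsigned-to-signed cast.
 */
static bool model_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Model of the patched forward logic. Returns true if clk was moved.
 * The common case (was_idle clear) returns without looking at jnow,
 * which is the fast path the patch is aiming for.
 */
static bool model_forward_base(struct model_base *base, unsigned long jnow)
{
	if (!base->was_idle)
		return false;

	/* Stay armed only while the base is still idle. */
	base->was_idle = base->is_idle;

	/* Signed difference keeps the comparison correct across wrap. */
	if ((long)(jnow - base->clk) < 2)
		return false;

	/* Assumed tail: never move clk past the next pending expiry. */
	if (model_time_after(base->next_expiry, jnow))
		base->clk = jnow;
	else
		base->clk = base->next_expiry;
	return true;
}
```

The point of the model is the ordering: once the softirq clears
was_idle (or a forward happens while the base is no longer idle),
later mod_timer calls skip the forward entirely until the next idle
entry sets the flag again.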
