Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer

Phil Auld Wed, 13 Mar 2019 11:50:38 -0700

On Wed, Mar 13, 2019 at 10:44:09AM -0700 [email protected] wrote:
> Phil Auld <[email protected]> writes:
> 
> > On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
> >> On Mon, Mar 11, 2019 at 10:44:25AM -0700 [email protected] wrote:
> >> > Letting it spin for 100ms and then only increasing by 6% seems extremely
> >> > generous. If we went this route I'd probably say "after looping N
> >> > times, set the period to time taken / N + X%" where N is like 8 or
> >> > something. I think I'd probably perfer something like this to the
> >> > previous "just abort and let it happen again next interrupt" one.
> >> 
> >> Okay. I'll try to spin something up that does this. It may be a little 
> >> trickier to keep the quota proportional to the new period. I think that's 
> >> important since we'll be changing the user's setting.
> >> 
> >> Do you mean to have it break when it hits N and recalculates the period or 
> >> reset the counter and keep going?
> >> 
> >
> > Let me know what you think of the below. It's working nicely. I like your 
> > suggestion to limit it quickly based on number of loops and use that to 
> > scale up. I think it is best to break out and let it fire again if needed. 
> > The warning fires once, very occasionally twice, and then things are quiet.
> >
> > If that looks reasonable I'll do some more testing and spin it up as a real 
> > patch submission. 
> 
> Yeah, this looks reasonable. I should probably see how unreasonable the
> other thing would be, but if your previous periods were kinda small (and
> it's just that the machine crashing isn't an ok failure mode) I suppose
> it's not a big deal.
>


I posted it a little while ago. The periods were tiny (~2000us vs a minimum 
of 1000) and with 2500 mostly unused child cgroups (I didn't narrow that 
down much but it did reproduce still with 1250 children).  That's why I was 
thinking edge case. It also requires a fairly small quota and load to make 
sure cfs_rqs get throttled.

I'm still wrapping my head around the scheduler code but I'd be happy to 
try it the other way if you can give me a bit more description of what
you have in mind. Also happy to test a patch with my repro. 


Cheers,
Phil


> >
> > Cheers,
> > Phil
> > ---
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 310d0637fe4b..54b30adfc89e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4859,19 +4859,51 @@ static enum hrtimer_restart 
> > sched_cfs_slack_timer(struct hrtimer *timer)
> >     return HRTIMER_NORESTART;
> >  }
> >  
> > +extern const u64 max_cfs_quota_period;
> > +int cfs_period_autotune_loop_limit   = 8;
> > +int cfs_period_autotune_cushion_pct  = 15; /* percentage added to period 
> > recalculation */
> > +
> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  {
> >     struct cfs_bandwidth *cfs_b =
> >             container_of(timer, struct cfs_bandwidth, period_timer);
> > +   s64 nsstart, nsnow, new_period;
> >     int overrun;
> >     int idle = 0;
> > +   int count = 0;
> >  
> >     raw_spin_lock(&cfs_b->lock);
> > +   nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
> >     for (;;) {
> >             overrun = hrtimer_forward_now(timer, cfs_b->period);
> >             if (!overrun)
> >                     break;
> >  
> > +           if (++count > cfs_period_autotune_loop_limit) {
> > +                   ktime_t old_period = ktime_to_ns(cfs_b->period);
> > +
> > +                   nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> > +                   new_period = (nsnow - 
> > nsstart)/cfs_period_autotune_loop_limit;
> > +
> > +                   /*  Make sure new period will be larger than old. */
> > +                   if (new_period < old_period) {
> > +                           new_period = old_period;
> > +                   }
> > +                   new_period += (new_period *  
> > cfs_period_autotune_cushion_pct) / 100;
> 
> This ordering means that it will always increase by at least 15%. This
> is a bit odd but probably a good thing; I'd just change the comment to
> make it clear this is deliberate.
> 
> > +
> > +                   if (new_period >  max_cfs_quota_period)
> > +                           new_period = max_cfs_quota_period;
> > +
> > +                   cfs_b->period = ns_to_ktime(new_period);
> > +                   cfs_b->quota += (cfs_b->quota * ((new_period - 
> > old_period) * 100)/old_period)/100;
> 
> In general it makes sense to do fixed point via 1024 or something that
> can be optimized into shifts (and a larger number is better in general
> for better precision).
> 
> > +                   pr_warn_ratelimited(
> > +                           "cfs_period_timer[cpu%d]: period too short, 
> > scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > +                           smp_processor_id(), 
> > cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> > +
> > +                   idle = 0;
> > +                   break;
> > +           }
> > +
> >             idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >     }
> >     if (idle)

--

Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer

Reply via email to