On Wed, Mar 13, 2019 at 10:44:09AM -0700 bseg...@google.com wrote: > Phil Auld <pa...@redhat.com> writes: > > > On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote: > >> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bseg...@google.com wrote: > >> > Letting it spin for 100ms and then only increasing by 6% seems extremely > >> > generous. If we went this route I'd probably say "after looping N > >> > times, set the period to time taken / N + X%" where N is like 8 or > >> > something. I think I'd probably perfer something like this to the > >> > previous "just abort and let it happen again next interrupt" one. > >> > >> Okay. I'll try to spin something up that does this. It may be a little > >> trickier to keep the quota proportional to the new period. I think that's > >> important since we'll be changing the user's setting. > >> > >> Do you mean to have it break when it hits N and recalculates the period or > >> reset the counter and keep going? > >> > > > > Let me know what you think of the below. It's working nicely. I like your > > suggestion to limit it quickly based on number of loops and use that to > > scale up. I think it is best to break out and let it fire again if needed. > > The warning fires once, very occasionally twice, and then things are quiet. > > > > If that looks reasonable I'll do some more testing and spin it up as a real > > patch submission. > > Yeah, this looks reasonable. I should probably see how unreasonable the > other thing would be, but if your previous periods were kinda small (and > it's just that the machine crashing isn't an ok failure mode) I suppose > it's not a big deal. >
I posted it a little while ago. The periods were tiny (~2000us vs a minimum of 1000) and with 2500 mostly unused child cgroups (I didn't narrow that down much but it did reproduce still with 1250 children). That's why I was thinking edge case. It also requires a fairly small quota and load to make sure cfs_rqs get throttled. I'm still wrapping my head around the scheduler code but I'd be happy to try it the other way if you can give me a bit more description of what you have in mind. Also happy to test a patch with my repro. Cheers, Phil > > > > Cheers, > > Phil > > --- > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 310d0637fe4b..54b30adfc89e 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -4859,19 +4859,51 @@ static enum hrtimer_restart > > sched_cfs_slack_timer(struct hrtimer *timer) > > return HRTIMER_NORESTART; > > } > > > > +extern const u64 max_cfs_quota_period; > > +int cfs_period_autotune_loop_limit = 8; > > +int cfs_period_autotune_cushion_pct = 15; /* percentage added to period > > recalculation */ > > + > > static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) > > { > > struct cfs_bandwidth *cfs_b = > > container_of(timer, struct cfs_bandwidth, period_timer); > > + s64 nsstart, nsnow, new_period; > > int overrun; > > int idle = 0; > > + int count = 0; > > > > raw_spin_lock(&cfs_b->lock); > > + nsstart = ktime_to_ns(hrtimer_cb_get_time(timer)); > > for (;;) { > > overrun = hrtimer_forward_now(timer, cfs_b->period); > > if (!overrun) > > break; > > > > + if (++count > cfs_period_autotune_loop_limit) { > > + ktime_t old_period = ktime_to_ns(cfs_b->period); > > + > > + nsnow = ktime_to_ns(hrtimer_cb_get_time(timer)); > > + new_period = (nsnow - > > nsstart)/cfs_period_autotune_loop_limit; > > + > > + /* Make sure new period will be larger than old. */ > > + if (new_period < old_period) { > > + new_period = old_period; > > + } > > + new_period += (new_period * > > cfs_period_autotune_cushion_pct) / 100; > > This ordering means that it will always increase by at least 15%. This > is a bit odd but probably a good thing; I'd just change the comment to > make it clear this is deliberate. > > > + > > + if (new_period > max_cfs_quota_period) > > + new_period = max_cfs_quota_period; > > + > > + cfs_b->period = ns_to_ktime(new_period); > > + cfs_b->quota += (cfs_b->quota * ((new_period - > > old_period) * 100)/old_period)/100; > > In general it makes sense to do fixed point via 1024 or something that > can be optimized into shifts (and a larger number is better in general > for better precision). > > > + pr_warn_ratelimited( > > + "cfs_period_timer[cpu%d]: period too short, > > scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n", > > + smp_processor_id(), > > cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC); > > + > > + idle = 0; > > + break; > > + } > > + > > idle = do_sched_cfs_period_timer(cfs_b, overrun); > > } > > if (idle) --