On Tue, Dec 25, 2018 at 06:37:03PM -0200, Martin Pieuchot wrote:
> On 24/12/18(Mon) 20:07, Scott Cheloha wrote:
> > On Tue, Dec 18, 2018 at 03:39:43PM -0600, Ian Sutton wrote:
> > > On Mon, Aug 14, 2017 at 3:07 PM Martin Pieuchot <m...@openbsd.org> wrote:
> > > >
> > > > I'd like to improve the fairness of the scheduler, with the goal of
> > > > mitigating userland starvations.  For that the kernel needs to have
> > > > a better understanding of the amount of executed time per task.
> > > >
> > > > The smallest interval currently usable on all our architectures for
> > > > such accounting is a tick.  With the current HZ value of 100, this
> > > > smallest interval is 10ms.  I'd like to bump this value to 1000.
> > > >
> > > > The diff below intentionally bumps other `hz' values to keep the
> > > > current ratios.  We certainly want to call schedclock(), or a
> > > > similar time accounting function, at a higher frequency than 16 Hz.
> > > > However this will be part of a later diff.
> > > >
> > > > I'd be really interested in test reports.  mlarkin@ raised a good
> > > > question: is your battery lifetime shorter with this diff?
> > > >
> > > [...] 
> > > I'd like to see more folks test and other devs to share their
> > > thoughts: What are the risks associated with bumping HZ globally?
> > > Drawbacks? Reasons for hesitation?
> > 
> > In general I'd like to reduce wakeup latency as well.  Raising HZ is an
> > obvious route to achieving that.  But I think there are a couple things
> > that need to be addressed before it would be reasonable.  The things that
> > come to mind for me are:
> > 
> >  - A tick is a 32-bit signed integer on all platforms.  If HZ=100, we
> >    can represent at most ~248 days in ticks.  This is plenty.  If HZ=1000,
> >    we now only have ~24.8 days.  Some may disagree, but I don't think this
> >    is enough.
> 
> Why do you think it isn't enough?
> 
> >    One possible solution is to make ticks 64-bit.  This addresses the
> >    timeout length issue at a cost to 32-bit platforms that I cannot
> >    quantify without lots of testing: what is the overhead of using 64-bit
> >    arithmetic on a 32-bit machine for all timeouts?
> > 
> >    A compromise is to make ticks a long.  kettenis mentioned this
> >    possibility in a commit [1] some time back.  This would allow 64-bit
> >    platforms to raise HZ without crippling timeout ranges.  But then you
> >    have ticks of different sizes on different platforms, which could be a
> >    headache, I imagine.
> 
> Note that we had, and certainly still have, tick-wrapping bugs in the
> kernel :)  
> 
> >    (maybe there are other solutions?)
> 
> Solution to what?
> 
> >  - How does an OpenBSD guest on vmd(8) behave when HZ=1000?  Multiple such
> >    guests on vmd(8)?  Such guests on other hypervisors?
> > 
> >  - The replies in this thread don't indicate any effect on battery life
> >    or power consumption but I find it hard to believe that raising HZ
> >    has no impact on such things.  Bumping HZ like this *must* increase
> >    CPU utilization.  What is the cost in watt-hours?
> 
> It depends on the machine.  But that's one of the reasons I dropped the
> bump.
> 
> >  - Can smaller machines even handle HZ=1000?  Linux experimented with
> >    this over a decade ago and settled on a default HZ=250 for i386 [2].
> >    I don't know how it all shook out, but my guess is that they didn't
> >    revert from 1000 -> 250 for no reason at all.  Of course, FreeBSD
> >    went ahead with 1000 on i386, so opinions differ.
> 
> Indeed, we still support architectures that can't handle an HZ of 1000.
> 
> >  - How does this affect e.g. packet throughput on smaller machines?  I
> >    think bigger boxes on amd64 would be fine, but I wonder if throughput
> >    would take a noticeable hit on a smaller router.
> 
> Some measurements indicated a drop of 10% in packet forwarding on some
> machines and no difference on others. 
> 
> > And then... can we reduce wakeup latency in general without raising HZ?
> > Other systems (e.g. DFly) have better wakeup latencies and still have
> > HZ=100.  What are they doing?  Can we borrow it?
> 
> I haven't looked at other systems like DragonFly, but since you seem
> interested in improving that area, here's my story.  I didn't look at
> wakeup latencies.  I don't know why you're after that.  Instead I
> focused on `schedhz' and schedclock().  I landed there after observing
> that with a high number of threads in "running" state (an active
> browser while doing a build), work was badly distributed amongst CPUs.
> Some per-CPU queues were growing and others stayed empty.
> 
> CPUs have runqueues that are selected based on per-thread `p_priority'.
> What this field represents today is confusing.  Many changes since the
> original scheduler design, including hardware improvements, side effects
> and developer mistakes, make it more confusing.  However bumping HZ
> improves the placement of "running" threads in per-CPU runqueues.
> 
> I spent a lot of time trying to observe and understand why.  I don't
> remember the details, but I came to the conclusion that `p_priority' was
> fresher.  In other words, the kernel had more up-to-date information to
> make choices.
> 
> However it became clear to me that the current mis-design works well
> enough by luck :)  Trying to theorise & understand it today is hard.
> For example the introduction of kernel threads and the switch to the
> 1:1 rthreads model changed the meaning of sleeping priorities.  This
> has led to multiple workarounds over the past years...
> 
> It is also hard to shrink the SCHED_LOCK() because it protects the
> accounting fields used to compute priorities.
> 
> There's also a known problem with threads moving often between CPUs.
> This is particularly bad when the distance between two CPUs is large
> (think multiple sockets).
> 
> Now I'm afraid that bumping HZ will lead to new mis-calculated values,
> which will lead to new workarounds.  Instead I'd suggest spending time
> moving to a scheduler that is understandable & understood, and not the
> result of optimistic changes :o)
> 
> At the time of the diff I discussed moving to virtual deadlines instead
> of priorities.  That should simplify math & locking because there would
> be nothing to calculate.  I did an experiment using `hz' to calculate
> virtual deadlines and that's why I needed a higher HZ.
> 
> Now having a scheduler depending on `hz' is, IMHO, a limitation.  And
> that's one of the reasons why bumping HZ is complicated.  So I came to
> the conclusion that bumping HZ wasn't the solution to *my* problem.
> That's why I dropped the diff.
> 
> I think we should use high resolution timers to calculate deadlines.
> There is plenty of prior work in that area, so it shouldn't be too hard
> to get started.
> 
> Reducing usage of `hz' in the kernel would also be a very good step
> in the tickless direction, for example by using timeout_add_msec(9)
> instead of timeout_add(9) :o)
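The difference is visible in a small fragment against the timeout(9) API.  A sketch only, assuming a hypothetical driver softc with a timeout `sc_tmo' that should fire after 100ms:

```c
/* HZ-dependent: encodes the 100ms interval as hz/10 ticks, so the
 * intended duration silently changes if HZ changes. */
timeout_add(&sc->sc_tmo, hz / 10);

/* HZ-independent: the interval is stated in milliseconds and stays
 * 100ms no matter what HZ is. */
timeout_add_msec(&sc->sc_tmo, 100);
```

Converting callers like the first form to the second is exactly the kind of mechanical cleanup that decouples kernel code from the tick rate.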
> 

All of what mpi@ said, plus I am of the firm belief that moving toward
a tickless model is the right way to go.

-ml
