On Tue, Dec 25, 2018 at 06:37:03PM -0200, Martin Pieuchot wrote:
> On 24/12/18(Mon) 20:07, Scott Cheloha wrote:
> > On Tue, Dec 18, 2018 at 03:39:43PM -0600, Ian Sutton wrote:
> > > On Mon, Aug 14, 2017 at 3:07 PM Martin Pieuchot <m...@openbsd.org> wrote:
> > > >
> > > > I'd like to improve the fairness of the scheduler, with the goal of
> > > > mitigating userland starvation.  For that the kernel needs to have
> > > > a better understanding of the amount of executed time per task.
> > > >
> > > > The smallest interval currently usable on all our architectures for
> > > > such accounting is a tick.  With the current HZ value of 100, this
> > > > smallest interval is 10ms.  I'd like to bump this value to 1000.
> > > >
> > > > The diff below intentionally bumps other `hz' values to keep the
> > > > current ratios.  We certainly want to call schedclock(), or a similar
> > > > time accounting function, at a higher frequency than 16 Hz.  However
> > > > this will be part of a later diff.
> > > >
> > > > I'd be really interested in test reports.  mlarkin@ raised a good
> > > > question: is your battery lifetime shorter with this diff?
> > > >
> > > [...]
> > > I'd like to see more folks test this and other devs share their
> > > thoughts: What are the risks associated with bumping HZ globally?
> > > Drawbacks?  Reasons for hesitation?
> >
> > In general I'd like to reduce wakeup latency as well.  Raising HZ is an
> > obvious route to achieving that.  But I think there are a couple of
> > things that need to be addressed before it would be reasonable.  The
> > things that come to mind for me are:
> >
> > - A tick is a 32-bit signed integer on all platforms.  If HZ=100, we
> >   can represent at most ~248 days in ticks.  This is plenty.  If
> >   HZ=1000, we now only have ~24.8 days.  Some may disagree, but I don't
> >   think this is enough.
>
> Why do you think it isn't enough?
>
> > One possible solution is to make ticks 64-bit.
> > This addresses the timeout length issue at a cost to 32-bit platforms
> > that I cannot quantify without lots of testing: what is the overhead
> > of using 64-bit arithmetic on a 32-bit machine for all timeouts?
> >
> > A compromise is to make ticks a long.  kettenis mentioned this
> > possibility in a commit [1] some time back.  This would allow 64-bit
> > platforms to raise HZ without crippling timeout ranges.  But then you
> > have ticks of different sizes on different platforms, which could be a
> > headache, I imagine.
>
> Note that we had, and certainly still have, tick-wrapping bugs in the
> kernel :)
>
> > (maybe there are other solutions?)
>
> Solution to what?
>
> > - How does an OpenBSD guest on vmd(8) behave when HZ=1000?  Multiple
> >   such guests on vmd(8)?  Such guests on other hypervisors?
> >
> > - The replies in this thread don't indicate any effect on battery life
> >   or power consumption, but I find it hard to believe that raising HZ
> >   has no impact on such things.  Bumping HZ like this *must* increase
> >   CPU utilization.  What is the cost in watt-hours?
>
> It depends on the machine.  But that's one of the reasons I dropped the
> bump.
>
> > - Can smaller machines even handle HZ=1000?  Linux experimented with
> >   this over a decade ago and settled on a default HZ=250 for i386 [2].
> >   I don't know how it all shook out, but my guess is that they didn't
> >   revert from 1000 -> 250 for no reason at all.  Of course, FreeBSD
> >   went ahead with 1000 on i386, so opinions differ.
>
> Indeed, we still support architectures that can't handle an HZ of 1000.
>
> > - How does this affect e.g. packet throughput on smaller machines?  I
> >   think bigger boxes on amd64 would be fine, but I wonder if throughput
> >   would take a noticeable hit on a smaller router.
>
> Some measurements indicated a drop of 10% in packet forwarding on some
> machines and no difference on others.
>
> > And then... can we reduce wakeup latency in general without raising HZ?
> > Other systems (e.g. DFly) have better wakeup latencies and still have
> > HZ=100.  What are they doing?  Can we borrow it?
>
> I haven't looked at other systems like DragonFly, but since you seem
> interested in improving that area, here's my story.  I didn't look at
> wakeup latencies.  I don't know why you're after that.  Instead I
> focused on `schedhz' and schedclock().  I landed there after observing
> that with a high number of threads in the "running" state (an active
> browser while making a build), work was badly distributed amongst CPUs.
> Some per-CPU queues were growing while others stayed empty.
>
> CPUs have runqueues that are selected based on the per-thread
> `p_priority'.  What this field represents today is confusing.  Many
> changes since the original scheduler design, including hardware
> improvements, side effects and developer mistakes, have made it more
> confusing.  However, bumping HZ improves the placement of "running"
> threads in per-CPU runqueues.
>
> I spent a lot of time trying to observe and understand why.  I don't
> remember the details, but I came to the conclusion that `p_priority'
> was fresher.  In other words, the kernel has more up-to-date
> information to make choices.
>
> However it became clear to me that the current mis-design works well
> enough by luck :)  Trying to theorise about & understand it today is
> hard.  For example the introduction of kernel threads and the switch
> to the rthreads 1:1 model changed the meaning of sleeping priorities.
> This has led to multiple workarounds over the past years...
>
> It is also hard to shrink the SCHED_LOCK() because it protects
> accounting fields used to compute priorities.
>
> There is also a known problem with threads moving often between CPUs.
> This is particularly bad when the distance between two CPUs is
> important (think multiple sockets).
>
> Now I'm afraid that bumping HZ will lead to new mis-calculated values
> which will lead to new workarounds.
> Instead I'd suggest spending time moving to a scheduler that is
> understandable & understood, and not the result of optimistic
> changes :o)
>
> At the time of the diff I discussed moving to virtual deadlines
> instead of priorities.  That should simplify the math & locking
> because there would be nothing to calculate.  I did an experiment
> using `hz' to calculate virtual deadlines, and that's why I needed a
> higher HZ.
>
> Now, having a scheduler depend on `hz' is, IMHO, a limitation.  And
> that's one of the reasons why bumping HZ is complicated.  So I came to
> the conclusion that bumping HZ wasn't the solution to *my* problem.
> That's why I dropped the diff.
>
> I think we should use high-resolution timers to calculate deadlines.
> There is plenty of prior work in that area, so it shouldn't be too
> hard to get started.
>
> Reducing usage of `hz' in the kernel would also be a very good step
> in the tickless direction.  For example by using timeout_add_msec(9)
> instead of timeout_add(9) :o)
>
All of what mpi@ said, plus I am of the firm belief that moving toward
a tickless model is the right way to go.

-ml