On Fri, 21 Jun 2019 21:54:18 -0700 Mike Larkin <mlar...@nested.page> wrote:
> On Fri, Jun 21, 2019 at 05:11:26PM -0300, Martin Pieuchot wrote: > > On 06/06/19(Thu) 15:16, Martin Pieuchot wrote: > > > On 02/06/19(Sun) 16:41, Martin Pieuchot wrote: > > > > On 01/06/19(Sat) 18:55, Martin Pieuchot wrote: > > > > > Diff below exists mainly for documentation and test purposes. If > > > > > you're not interested about how to break the scheduler internals in > > > > > pieces, don't read further and go straight to testing! > > > > > > > > > > - First change is to stop calling tsleep(9) at PUSER. That makes > > > > > it clear that all "sleeping priorities" are smaller than PUSER. > > > > > That's important to understand for the diff below. `p_priority' > > > > > is currently a placeholder for the "sleeping priority" and the > > > > > "runnqueue priority". Both fields are separated by this diff. > > > > > > > > > > - When a thread goes to sleep, the priority argument of tsleep(9) is > > > > > now recorded in `p_slpprio'. This argument can be considered as > > > > > part > > > > > of the sleep queue. Its purpose is to place the thread into a > > > > > higher > > > > > runqueue when awoken. > > > > > > > > > > - Currently, for stopped threads, `p_priority' correspond to > > > > > `p_usrpri'. > > > > > So setrunnable() has been untangled to place SSTOP and SSLEEP > > > > > threads > > > > > in the preferred queue without having to use `p_priority'. Note > > > > > that > > > > > `p_usrpri' is still recalculated *after* having called > > > > > setrunqueue(). > > > > > This is currently fine because setrunnable() is called with > > > > > SCHED_LOCK() > > > > > but it will be racy when we'll split it. > > > > > > > > > > - A new field, `p_runprio' has been introduced. It should be > > > > > considered > > > > > as part of the per-CPU runqueues. It indicates where a current > > > > > thread > > > > > is placed. > > > > > > > > > > - `spc_curpriority' is now updated at every context-switch. That > > > > > means > > > > > need_resched() won't be called after comparing an out-of-date > > > > > value. > > > > > At the same time, `p_usrpri' is initialized to the highest possible > > > > > value for idle threads. > > > > > > > > > > - resched_proc() was calling need_resched() in the following > > > > > conditions: > > > > > - If the SONPROC thread has a higher priority that the current > > > > > running thread (itself). > > > > > - Twice in setrunnable() when we know that p_priority <= p_usrpri. > > > > > - If schedcpu() considered that a thread, after updating its prio, > > > > > should preempt the one running on the CPU pointed by `p_cpu'. > > > > > > > > > > The diff below simplify all of that by calling need_resched() when: > > > > > - A thread is inserted in a CPU runqueue at a higher priority than > > > > > the one SONPROC. > > > > > - schedcpu() decides that a thread in SRUN state should preempt the > > > > > one SONPROC. > > > > > > > > > > - `p_estcpu' `p_usrpri' and `p_slptime' which represent the "priority" > > > > > of a thread are now updated while holding a per-thread mutex. As a > > > > > result schedclock() and donice() no longer takes the SCHED_LOCK(), > > > > > and schedcpu() almost never take it. > > > > > > > > > > - With this diff top(1) and ps(1) will report the "real" `p_usrpi' > > > > > value > > > > > when displaying priorities. This is helpful to understand what's > > > > > happening: > > > > > > > > > > load averages: 0.99, 0.56, 0.25 > > > > > two.lab.grenadille.net 23:42:10 > > > > > 70 threads: 68 idle, 2 on processor > > > > > up 0:09 > > > > > CPU0: 0.0% user, 0.0% nice, 51.0% sys, 2.0% spin, 0.0% intr, > > > > > 47.1% idle > > > > > CPU1: 2.0% user, 0.0% nice, 51.0% sys, 3.9% spin, 0.0% intr, > > > > > 43.1% idle > > > > > Memory: Real: 47M/1005M act/tot Free: 2937M Cache: 812M Swap: 0K/4323M > > > > > > > > > > PID TID PRI NICE SIZE RES STATE WAIT TIME CPU > > > > > COMMAND > > > > > 81000 145101 72 0 0K 1664K sleep/1 bored 1:15 36.96% > > > > > softnet > > > > > 47133 244097 73 0 2984K 4408K sleep/1 netio 1:06 35.06% > > > > > cvs > > > > > 64749 522184 66 0 176K 148K onproc/1 - 0:55 28.81% > > > > > nfsd > > > > > 21615 602473 127 0 0K 1664K sleep/0 - 7:22 0.00% > > > > > idle0 > > > > > 12413 606242 127 0 0K 1664K sleep/1 - 7:08 0.00% > > > > > idle1 > > > > > 85778 338258 50 0 4936K 7308K idle select 0:10 0.00% > > > > > ssh > > > > > 22771 575513 50 0 176K 148K sleep/0 nfsd 0:02 0.00% > > > > > nfsd > > > > > .... > > > > > > > > > > > > > > > - The removal of `p_priority' and the change that makes mi_switch() > > > > > always update `spc_curpriority' might introduce some changes in > > > > > behavior, especially with kernel threads that were not going through > > > > > tsleep(9). We currently have some situations where the priority of > > > > > the running thread isn't correctly reflected. This diff changes > > > > > that > > > > > which means we should be able to better understand where the > > > > > problems > > > > > are. > > > > > > > > > > I'd be interested in comments/tests/reviews before continuing in this > > > > > direction. Note that at least part of this diff are required to split > > > > > the accounting apart from the SCHED_LOCK() as well. > > > > > > > > > > I'll also work on exporting scheduler statistics unless somebody wants > > > > > to beat me :) > > > > > > > > Updated diff to use IPL_SCHED and rebased to apply on top of -current > > > > :) > > > > > > Updated diff that fixes a pagefault reported by sthen@. > > > > Rebased diff on top of -current. I'm still looking for tests and > > comments :) > > > > I have been running this since the first version without any issues. I also > have the "unlock more syscalls" diff in this tree as well as the "push the > kernel lock lower on read and write" diff, and can say it's noticeably faster > with these diffs in. > Hi, FWIW, this diff is quite good. 1) It solves quite serious bugs in need_resched(). 2) Now nice value is respected for sndiod, ntpd. Previously, it was not. 3) Code is much simpler and easier to understand. 4) When a proc uses CPU, its prio gets lowered, and the runprio moves from 50 and trending towards 127, which is correct, and as desired. roughly same behavior as before. 5) IDLE has lowest prio, which is logical, bonus in this diff. 6) Zerothread priority is lowered, which is roughly same behaviour as before. One comment: all P_SYSTEM proc's are now at PUSER = 50, instead of 0, but that is by design to understand runtime behaviour, and if needed can be tweaked later. A request: can you please remove the double check for SRUN in sched_bsd.c-->schedcpu()? Thanks